WO2019184517A1 - Audio fingerprint extraction method and device - Google Patents

Audio fingerprint extraction method and device

Info

Publication number
WO2019184517A1
WO2019184517A1 (PCT/CN2018/125491)
Authority
WO
WIPO (PCT)
Prior art keywords
audio fingerprint
audio
bit
extraction method
strong
Prior art date
Application number
PCT/CN2018/125491
Other languages
English (en)
French (fr)
Inventor
李根
李磊
何轶
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 filed Critical 北京字节跳动网络技术有限公司
Priority to JP2020502951A priority Critical patent/JP6908774B2/ja
Priority to SG11202008533VA priority patent/SG11202008533VA/en
Priority to US16/652,028 priority patent/US10950255B2/en
Publication of WO2019184517A1 publication Critical patent/WO2019184517A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present disclosure relates to the field of audio processing technologies, and in particular, to an audio fingerprint extraction method and apparatus.
  • Audio fingerprints (also called audio features) and audio fingerprint retrieval are widely used in today's "multimedia information society."
  • audio fingerprint retrieval was first applied to music recognition ("listen and identify the song"): a piece of audio is input, and by extracting and comparing its fingerprint features, the corresponding song can be identified.
  • audio fingerprint retrieval can also be applied to content monitoring, such as audio deduplication, retrieval-based voice advertisement monitoring, and audio copyright protection.
  • existing audio fingerprint retrieval methods suffer from poor accuracy, which is partly due to the poor accuracy of the extracted audio fingerprints.
  • existing audio fingerprint extraction methods also have problems such as poor robustness to noise and complicated processing.
  • the purpose of the present disclosure is to provide a new audio fingerprint extraction method and apparatus.
  • An audio fingerprint extraction method comprising the steps of: converting an audio signal into a spectrogram; determining feature points in the spectrogram; determining, on the spectrogram, one or more masks for the feature points, each of the masks containing a plurality of spectral regions; determining the mean energy of each of the spectral regions; determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in the mask; judging the reliability of the audio fingerprint bits to determine strong/weak weight bits; and combining the audio fingerprint bits and the strong/weak weight bits to obtain an audio fingerprint.
  • the object of the present disclosure can also be further achieved by the following technical measures.
  • the foregoing audio fingerprint extraction method, wherein converting the audio signal into a spectrogram comprises: converting the audio signal into a time-frequency two-dimensional spectrogram by a short-time Fourier transform, where the value of each point in the spectrogram represents the energy of the audio signal.
  • the foregoing method, wherein converting the audio signal into a spectrogram further comprises: performing a Mel transform on the spectrogram.
  • the foregoing method, wherein converting the audio signal into a spectrogram further comprises: performing human auditory system filtering on the spectrogram.
  • the foregoing method, wherein the feature points are points whose frequency values equal a plurality of preset frequency values.
  • the feature point is an energy maximum point in the sound spectrum map, or the feature point is an energy minimum point in the sound spectrum map.
  • a plurality of the spectral regions included in the mask are symmetrically distributed.
  • the foregoing method, wherein the plurality of spectral regions contained in a mask have the same frequency range, and/or the same time range, and/or are centrally symmetric about the feature point.
  • the foregoing method, wherein the mean energy of a spectral region is the average of the energy values of all points contained in the spectral region.
  • the foregoing method, wherein determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in the mask comprises: determining the value of one audio fingerprint bit according to the difference of the mean energies of the plurality of spectral regions contained in one mask.
  • the foregoing method, wherein judging the reliability of the audio fingerprint bits to determine strong/weak weight bits comprises: judging whether the absolute value of the difference reaches or exceeds a preset strong/weak bit threshold; if it reaches or exceeds the threshold, determining the audio fingerprint bit to be a strong bit, otherwise determining it to be a weak bit; and determining the strong/weak weight bit according to whether the audio fingerprint bit is a strong bit or a weak bit.
  • the foregoing method, wherein the strong/weak bit threshold is a fixed value, a value based on the differences, or a proportional value.
  • the foregoing method, further comprising: dividing the audio signal into multiple audio sub-signals by time; extracting the audio fingerprints of the audio sub-signals; and combining the extracted audio fingerprints of the audio sub-signals to obtain the audio fingerprint of the audio signal.
  • the object of the present disclosure is also achieved by the following technical solutions.
  • the audio fingerprint library construction method includes: extracting an audio fingerprint of an audio signal according to the audio fingerprint extraction method of any of the foregoing; storing the audio fingerprint into an audio fingerprint database.
  • An audio fingerprint extraction apparatus comprising: a spectrogram conversion module for converting an audio signal into a spectrogram; a feature point determination module for determining feature points in the spectrogram; a mask determination module for determining, on the spectrogram, one or more masks for the feature points, each of the masks containing a plurality of spectral regions; a mean energy determination module for determining the mean energy of each of the spectral regions; an audio fingerprint bit determination module for determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in the mask; a strong/weak weight bit determination module for judging the reliability of the audio fingerprint bits to determine strong/weak weight bits; and an audio fingerprint determination module for combining the audio fingerprint bits and the strong/weak weight bits to obtain an audio fingerprint.
  • the object of the present disclosure can also be further achieved by the following technical measures.
  • the aforementioned audio fingerprint extraction apparatus further includes means for performing the steps of any of the foregoing audio fingerprint extraction methods.
  • An audio fingerprint library construction apparatus comprising: an audio fingerprint extraction module, configured to extract an audio fingerprint of an audio signal according to the audio fingerprint extraction method of any of the foregoing; an audio fingerprint storage module, configured to store the audio fingerprint To the audio fingerprint library; an audio fingerprint library for storing the audio fingerprint.
  • An audio fingerprint extraction hardware device comprising: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions such that, when executed by the processor, they implement any of the foregoing audio fingerprint extraction methods.
  • a computer readable storage medium for storing non-transitory computer readable instructions which, when executed by a computer, cause the computer to perform any of the foregoing audio fingerprint extraction methods.
  • a terminal device includes any of the foregoing audio fingerprint extracting devices.
  • FIG. 1 is a schematic flow chart of an audio fingerprint extraction method according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic flow chart of a method for constructing an audio fingerprint library according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram showing the structure of an audio fingerprint extracting apparatus according to an embodiment of the present disclosure.
  • FIG. 4 is a structural block diagram of an audio fingerprint library construction apparatus according to an embodiment of the present disclosure.
  • FIG. 5 is a hardware block diagram of an audio fingerprint extraction hardware device in accordance with an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a computer readable storage medium in accordance with an embodiment of the present disclosure.
  • FIG. 7 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of an embodiment of an audio fingerprint extraction method according to the present disclosure.
  • an audio fingerprint extraction method of an example of the present disclosure mainly includes the following steps:
  • the audio signal is converted into a spectrogram.
  • the audio signal is converted into a time-frequency spectrogram by a short-time Fourier transform (STFT).
  • the spectrogram is a common two-dimensional representation of an audio signal: the horizontal axis is time t, the vertical axis is frequency f, and the value E(t, f) of each point (t, f) represents the energy of the signal.
  • the specific type of the audio signal is not limited; it may be a static file or streaming audio. Thereafter, processing proceeds to step S12.
  • the spectrogram may be pre-processed with a Mel (MEL) transform, which divides the spectrum into a number of frequency blocks (frequency bins); the number of blocks is configurable.
  • human auditory system (HAS) filtering may also be applied to the spectrogram; such non-linear transformations make the spectral distribution in the spectrogram better match human hearing.
  • the hyperparameters of the short-time Fourier transform can be adjusted to suit different practical situations.
  • for example, the hyperparameters of step S11 may be set as follows: in the short-time Fourier transform, the time window is set to 100 ms and the hop interval to 50 ms; in the Mel transform, the number of frequency blocks is set to 32 to 128.
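  • As an illustration (not part of the original disclosure), a minimal sketch of step S11 with the hyperparameters above, assuming the third-party librosa library, could look like this:

```python
# Sketch only: librosa and the 8 kHz sample rate are assumptions,
# not choices made by the patent.
import numpy as np
import librosa

def audio_to_spectrogram(path, sr=8000, n_mels=64):
    y, _ = librosa.load(path, sr=sr, mono=True)
    win = int(0.100 * sr)  # 100 ms time window
    hop = int(0.050 * sr)  # 50 ms interval
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        n_mels=n_mels)              # Mel transform: 64 frequency blocks
    # Log compression as a rough stand-in for the "human auditory
    # system" non-linearity mentioned above.
    return np.log(mel + 1e-10)      # E[f, t]: rows = bins, cols = frames
```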
  • in step S12, feature points in the spectrogram are determined.
  • the feature points may be selected as the energy maximum points in the spectrogram, or as the energy minimum points.
  • if the energy E(t, f) of a point (t, f) in the spectrogram simultaneously satisfies E(t,f) > E(t+1,f), E(t,f) > E(t-1,f), E(t,f) > E(t,f+1) and E(t,f) > E(t,f-1), then (t, f) is an energy maximum point of the spectrogram.
  • selecting energy extremum points as feature points has drawbacks: extremum points are easily disturbed by noise; their number is hard to control, so one spectrogram may contain no extremum point while another contains many, making the feature points uneven; and an extra timestamp must be stored to record each extremum's position in the spectrogram. Therefore, instead of energy extremum points, fixed points may be chosen as feature points, for example points whose frequency equals a preset frequency value (frequency-fixed points).
  • a plurality of frequency values covering low, middle and high frequencies may be preset (the specific values of low, middle and high frequency are settable).
  • selecting fixed points at low, middle and high frequencies makes the chosen feature points more uniform. Note that fixed points may also be selected by other criteria, such as points whose energy equals one or more preset values.
  • the hyperparameter of step S12 may be set such that the density of feature points is 20 to 80 per second.
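  • Continuing the sketch, step S12 under both feature-point strategies discussed above might look like this (the fixed frequency rows are illustrative; with a 50 ms hop, i.e. 20 frames per second, three rows give 60 feature points per second, within the 20 to 80 per second range):

```python
import numpy as np

def energy_maxima(E):
    """Feature points whose energy exceeds its four (t, f) neighbours."""
    c = E[1:-1, 1:-1]
    is_max = ((c > E[1:-1, :-2]) & (c > E[1:-1, 2:]) &   # t-1 and t+1
              (c > E[:-2, 1:-1]) & (c > E[2:, 1:-1]))    # f-1 and f+1
    f, t = np.nonzero(is_max)
    return list(zip(f + 1, t + 1))      # (frequency bin, time frame)

def fixed_points(E, freq_rows=(8, 24, 48)):
    """Frequency-fixed feature points at preset low/mid/high rows."""
    return [(f, t) for f in freq_rows for t in range(E.shape[1])]
```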
  • in step S13, one or more masks are determined for each feature point on the spectrogram, in the vicinity of the feature point; each mask contains (i.e., covers) several regions of the spectrogram (called spectral regions here). Thereafter, processing proceeds to step S14.
  • the plurality of spectral regions included in each mask may be symmetrically distributed:
  • symmetric about the time axis: for example, a mask containing two spectral regions R11 and R12 may be determined for the feature point, where both R11 and R12 lie to the left of the feature point, R11 lies to the left of R12, and R11 and R12 cover the same frequency blocks;
  • symmetric about the frequency axis: for example, a mask containing two spectral regions R13 and R14, where R13 lies above the feature point, R14 lies below it, and R13 and R14 have the same time range;
  • centrally symmetric about the feature point: for example, a mask containing two spectral regions R15 and R16, where R15 lies to the upper left of the feature point, R16 lies to the lower right, and R15 and R16 are symmetric to each other about the feature point.
  • a plurality of spectral regions included in one mask can simultaneously satisfy a plurality of symmetric distributions.
  • for example, a mask containing four spectral regions R21, R22, R23 and R24 may be determined for the feature point, where R21, R22, R23 and R24 lie at the upper left, upper right, lower left and lower right of the feature point respectively; R21 and R22 have the same frequency range, R23 and R24 have the same frequency range, R21 and R23 have the same time range, R22 and R24 have the same time range, and the four spectral regions are centrally symmetric about the feature point.
  • the four spectral regions of a mask need not be centrally symmetric about the feature point; for example, they may all lie to the left of the feature point, distributed on both sides of it along the frequency axis.
  • spectral regions belonging to the same mask may overlap each other.
  • different masks can also overlap each other.
  • each mask may contain an even number of spectral regions.
  • the masks may be determined according to a fixed preset standard, that is, the position and coverage of each mask in the spectrogram are preset.
  • alternatively, instead of fixing the position and extent of the masks in advance, the mask regions may be determined automatically in a data-driven way: the masks with the smallest covariance and the greatest discriminative power are selected from a large pool of candidate masks.
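  • As a sketch of step S13, the four-region mask R21..R24 described above could be realized as follows (the region size dF x dT is an assumption; feature points are assumed to lie at least dF bins and dT frames away from the spectrogram border):

```python
def four_region_mask(E, f, t, dF=4, dT=4):
    """Four dF-by-dT spectral regions around feature point (f, t)."""
    R21 = E[f + 1:f + 1 + dF, t - dT:t]          # upper left
    R22 = E[f + 1:f + 1 + dF, t + 1:t + 1 + dT]  # upper right
    R23 = E[f - dF:f,         t - dT:t]          # lower left
    R24 = E[f - dF:f,         t + 1:t + 1 + dT]  # lower right
    return R21, R22, R23, R24
```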
  • in step S14, the mean energy of each spectral region is determined. Specifically, for a spectral region containing only one point, the mean energy is the energy value of that point; when the spectral region consists of several points, its mean energy may be set to the average of their energy values. Thereafter, processing proceeds to step S15.
  • in step S15, an audio fingerprint bit is determined according to the mean energies of the plurality of spectral regions in the mask. Thereafter, processing proceeds to step S16.
  • one audio fingerprint bit may be determined according to a difference value of mean energy of a plurality of spectral regions included in one mask.
  • for a mask containing the two spectral regions R11 and R12, the difference D1 of their mean energies can be calculated according to Formula 1: D1 = E(R11) - E(R12).
  • the sign of D1 is then judged: if D1 is positive, an audio fingerprint bit of value 1 is obtained; if D1 is negative, an audio fingerprint bit of value 0 is obtained.
  • for a mask containing the four spectral regions R21, R22, R23 and R24, the difference D2 of their mean energies can be calculated according to Formula 2: D2 = (E(R21) + E(R22)) - (E(R23) + E(R24)).
  • the sign of D2 is judged in the same way: if D2 is positive, a fingerprint bit of value 1 is obtained; if D2 is negative, a fingerprint bit of value 0 is obtained. Note that the fingerprint bit of a four-region mask need not be determined by D2; other forms of difference may be used. For example, the second-order difference D3 of the four regions' mean energies can be calculated: D3 = (E(R23) - E(R24)) - (E(R21) - E(R22)).
  • the sign of D3 is then judged to determine the audio fingerprint bit.
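  • Steps S14 and S15 for such a four-region mask reduce to a few lines (a sketch; mapping the tie case D2 = 0 to bit 0 is our assumption, which the text above leaves unspecified):

```python
import numpy as np

def fingerprint_bit(R21, R22, R23, R24):
    """Mean energy per region, then one bit from the sign of
    D2 = (E(R21) + E(R22)) - (E(R23) + E(R24))."""
    m21, m22, m23, m24 = (np.mean(r) for r in (R21, R22, R23, R24))
    D2 = (m21 + m22) - (m23 + m24)
    return (1 if D2 > 0 else 0), D2   # D2 is reused below for step S16
```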
  • in step S16, the strong/weak weight bit corresponding to each audio fingerprint bit is determined; the strong/weak weight bit expresses how reliable the audio fingerprint bit is.
  • an audio fingerprint bit with high reliability is defined as a strong bit,
  • and an audio fingerprint bit with low reliability is defined as a weak bit.
  • the reliability of each audio fingerprint bit is judged, and the value of its strong/weak weight bit is determined according to whether the bit is a strong bit or a weak bit. Thereafter, processing proceeds to step S17.
  • if the audio fingerprint bit was determined from the difference of the mean energies of the spectral regions in a mask, step S16 specifically comprises: judging whether the absolute value of that difference reaches (or exceeds) a preset strong/weak bit threshold; if it reaches the threshold, the audio fingerprint bit is determined to be a strong bit, and a strong/weak weight bit of value 1 is obtained for it; if it does not reach the threshold, the bit is determined to be a weak bit, and a strong/weak weight bit of value 0 is obtained for it.
  • as a concrete example, if an audio fingerprint bit was determined from the sign of the four-region difference D2 of Formula 2, step S16 comprises comparing the absolute value of D2 with the preset strong/weak bit threshold T: if |D2| >= T, the bit is a strong bit and its strong/weak weight bit is set to 1; if |D2| < T, the bit is a weak bit and its strong/weak weight bit is set to 0.
  • the strong/weak bit threshold may be of several types: it may be a preset fixed value, for example fixed at 1; or it may be a value derived from the mean-energy differences, for example the average of the differences over the masks (or feature points) of the fingerprint (in fact any value between the largest and the smallest difference will do), in which case bits whose difference reaches the average are strong and the rest weak; or it may be a proportional value, for example 60%: among the differences of all masks (or feature points), if the absolute value of a difference lies in the top 60%, the corresponding audio fingerprint bit is determined to be a strong bit, otherwise a weak bit.
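  • The three threshold variants above can be sketched in one helper (an illustration; the `mode` names and defaults are ours, not the patent's):

```python
import numpy as np

def strong_weak_bits(diffs, mode="ratio", T=1.0, ratio=0.60):
    """One strong/weak weight bit per fingerprint bit: 1 = strong, 0 = weak."""
    a = np.abs(np.asarray(diffs, dtype=float))
    if mode == "fixed":        # fixed threshold, e.g. T = 1
        thr = T
    elif mode == "mean":       # average of |D| over all masks
        thr = a.mean()
    else:                      # "ratio": top 60% of |D| count as strong
        thr = np.quantile(a, 1.0 - ratio)
    return (a >= thr).astype(int)
```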
  • in step S17, the obtained audio fingerprint bits and strong/weak weight bits are combined to obtain the audio fingerprint.
  • neither the manner of combination nor the length of the audio fingerprint is limited.
  • for example, an audio fingerprint may consist of two parts: the first part is obtained by combining the audio fingerprint bits of all masks of one feature point into an audio fingerprint bit sequence and then arranging the bit sequences of the feature points in chronological order; the second part is the strong/weak weight bit sequence, of the same length as the bit sequence, obtained by combining the corresponding strong/weak weight bits and arranging the per-feature-point weight sequences in the same chronological order.
  • optionally, the length of each audio fingerprint bit sequence may be 32 bits.
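  • Step S17 is then a plain concatenation; the sketch below assumes the per-feature-point sequences are already in chronological order:

```python
def combine_fingerprint(per_point_bits, per_point_weights):
    """Two equal-length sequences: fingerprint bits, then weight bits."""
    bits = [b for seq in per_point_bits for b in seq]
    weights = [w for seq in per_point_weights for w in seq]
    assert len(bits) == len(weights)
    return bits, weights
```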
  • by extracting, together with each audio fingerprint bit, the strong/weak weight bit corresponding to that bit, the present disclosure can generate an audio fingerprint of high accuracy and good robustness for a piece of audio.
  • the audio fingerprint extraction method may further include adding a timestamp field to the audio fingerprint, a field representing the time difference between the start of the audio and the feature point; the field may be a hash value.
  • if the feature points are set to fixed points, this step may be omitted, that is, the timestamp need not be recorded.
  • the method may further include adding an audio signal identifier field to the audio fingerprint to record the ID of the audio signal corresponding to the fingerprint; this field may also be a hash value.
  • the method may further include: dividing the audio signal into multiple audio sub-signals by time; extracting an audio fingerprint for each sub-signal according to the steps of the foregoing method to obtain a plurality of audio fingerprints; and combining the fingerprints of the feature points of the sub-signals to obtain the audio fingerprint of the whole audio signal.
  • when the audio fingerprints extracted by the present disclosure are used for audio retrieval and audio recognition, the distance between two audio fingerprints (for example, the Hamming distance) is computed with each audio fingerprint bit weighted by its corresponding strong/weak weight bit: strong bits receive a high weight and weak bits a low weight (the weight of weak bits may even be set to zero), weakening or removing the contribution of weak bits. This makes audio retrieval more robust to noise and effectively mitigates bit errors caused by noise.
  • the Hamming distance is a metric commonly used in information theory.
  • the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ.
  • in practice, the two strings can be XORed and the number of 1s in the result counted; that count is the Hamming distance.
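  • A sketch of both distances follows; the rule that a mismatch counts fully only when the bit is strong in both fingerprints, and the default weak-bit weight of zero, are one possible reading of the weighting described above:

```python
def weighted_hamming(bits_a, bits_b, strong_a, strong_b, weak_weight=0.0):
    """Hamming distance with strong/weak weight bits applied per position."""
    d = 0.0
    for ba, bb, sa, sb in zip(bits_a, bits_b, strong_a, strong_b):
        if ba != bb:
            d += 1.0 if (sa and sb) else weak_weight
    return d

def plain_hamming(x, y):
    """Unweighted variant for bit strings packed into ints: XOR, count 1s."""
    return bin(x ^ y).count("1")
```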
  • FIG. 2 is a schematic flowchart of an embodiment of an audio fingerprint library construction method according to the present disclosure.
  • the audio fingerprint library construction method of the example of the present disclosure mainly includes the following steps:
  • in step S21, the audio fingerprint of an audio signal is extracted according to the steps of the audio fingerprint extraction method of the foregoing examples of the present disclosure. Thereafter, processing proceeds to step S22.
  • in step S22, the obtained audio fingerprint of the audio signal is stored into the audio fingerprint library.
  • the audio fingerprint library can be updated continually as time goes on.
  • FIG. 3 is a schematic structural diagram of an embodiment of an audio fingerprint extracting apparatus of the present disclosure.
  • the audio fingerprint extraction apparatus 100 of the example of the present disclosure mainly includes: a spectrogram conversion module 101, a feature point determination module 102, a mask determination module 103, a mean energy determination module 104, an audio fingerprint bit determination module 105, a strong/weak weight bit determination module 106, and an audio fingerprint determination module 107.
  • the spectrogram conversion module 101 is configured to convert an audio signal into a spectrogram; specifically, it may convert the audio signal into a time-frequency spectrogram by a short-time Fourier transform (STFT).
  • the spectrogram conversion module 101 may include a Mel transform sub-module for pre-processing the spectrogram with a Mel (MEL) transform, which divides the spectrum into a configurable number of frequency blocks (bins).
  • the spectrogram conversion module 101 may further include a human auditory system filtering sub-module for applying human auditory system filtering to the spectrogram; such non-linear transformations make the spectral distribution in the spectrogram better match human hearing.
  • the feature point determination module 102 is configured to determine feature points in the spectrogram.
  • specifically, the module 102 may determine feature points by one of several criteria; for example, feature points may be selected as energy maximum points in the spectrogram, or as energy minimum points.
  • the module 102 may also select fixed points rather than energy extremum points as feature points, for example points whose frequency equals a preset frequency value (frequency-fixed points). Further, the module may be configured with a plurality of preset frequency values covering low, middle and high frequencies (the specific values are settable).
  • the mask determination module 103 is configured to determine, on the spectrogram, one or more masks for the feature points in the vicinity of the feature points, each mask comprising a plurality of spectral regions. Specifically, in the spectrogram, a plurality of spectral regions included in each mask may be symmetrically distributed.
  • the mean energy determination module 104 is configured to determine the mean energy of each spectral region, respectively.
  • the audio fingerprint bit determining module 105 is configured to determine an audio fingerprint bit according to the average energy of the plurality of spectral regions in a mask.
  • the audio fingerprint bit determining module 105 may be specifically configured to determine an audio fingerprint bit according to a difference value of mean energy of a plurality of spectral regions included in one mask.
  • the strong and weak weight bit determining module 106 is configured to determine the degree of trust of the audio fingerprint bits to determine the strong and weak weight bits corresponding to each audio fingerprint bit.
  • if the audio fingerprint bit is determined from the difference of the mean energies of the spectral regions in a mask, the strong/weak weight bit determination module 106 is specifically configured to: judge whether the absolute value of the difference used to generate the bit reaches (or exceeds) the preset strong/weak bit threshold; if it does, determine the bit to be a strong bit and output a strong/weak weight bit of value 1; otherwise determine it to be a weak bit and output a strong/weak weight bit of value 0.
  • the audio fingerprint determining module 107 is configured to combine the obtained plurality of audio fingerprint bits and the plurality of strong and weak weight bits to obtain an audio fingerprint.
  • the audio fingerprint extraction apparatus 100 may further include a timestamp adding module (not shown) for adding a timestamp field to the audio fingerprint, a field representing the time difference between the start of the audio and the feature point.
  • the field can be a hash value.
  • if the feature points are set to fixed points, the timestamp adding module may be omitted.
  • the audio fingerprinting device 100 further includes an audio signal identifier adding module (not shown) for adding an audio signal identifier field to the audio fingerprint to record an ID identifier of the audio signal corresponding to the audio fingerprint. information.
  • the audio fingerprint extraction apparatus 100 further includes an audio segmentation module (not shown) and an audio fingerprint combination module (not shown).
  • the audio segmentation module is configured to divide an audio signal into a plurality of audio sub-signals by time.
  • the modules of the audio fingerprint extraction apparatus are used to extract an audio fingerprint from each audio sub-signal, yielding a plurality of audio fingerprints.
  • the audio fingerprint combination module is configured to combine the audio fingerprints of the extracted feature points of the audio sub-signals to obtain an audio fingerprint of the entire audio signal.
  • FIG. 4 is a schematic structural diagram of an embodiment of an audio fingerprint library construction apparatus according to the present disclosure.
  • the audio fingerprint library construction apparatus 200 of the example of the present disclosure mainly includes:
  • the audio fingerprint extraction module 201 includes the spectrogram conversion module 101, feature point determination module 102, mask determination module 103, mean energy determination module 104, audio fingerprint bit determination module 105, strong/weak weight bit determination module 106 and audio fingerprint determination module 107 of the aforementioned audio fingerprint extraction apparatus 100, and is configured to extract the audio fingerprint of an audio signal according to the steps of the audio fingerprint extraction method of the foregoing examples of the present disclosure.
  • the audio fingerprint storage module 202 is configured to store the audio fingerprint of the audio signal obtained by the audio fingerprint extraction module 201 into the audio fingerprint database 203.
  • the audio fingerprint database 203 is configured to store an audio fingerprint of each audio signal.
  • FIG. 5 is a hardware block diagram illustrating an audio fingerprint extraction hardware device in accordance with an embodiment of the present disclosure.
  • the audio fingerprint extraction hardware device 300 according to an embodiment of the present disclosure includes a memory 301 and a processor 302.
  • the components in the audio fingerprinting hardware device 300 are interconnected by a bus system and/or other form of connection mechanism (not shown).
  • the memory 301 is for storing non-transitory computer readable instructions.
  • memory 301 can include one or more computer program products, which can include various forms of computer readable storage media, such as volatile memory and/or nonvolatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache or the like.
  • the nonvolatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, or the like.
  • the processor 302 can be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and can control other components in the audio fingerprinting hardware device 300 to perform desired functions.
  • the processor 302 is configured to execute the computer readable instructions stored in the memory 301, so that the audio fingerprint extraction hardware device 300 performs all or part of the steps of the audio fingerprint extraction methods of the foregoing embodiments of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating a computer readable storage medium in accordance with an embodiment of the present disclosure.
  • the computer readable storage medium 400 has non-transitory computer readable instructions 401 stored thereon.
  • when the non-transitory computer readable instructions 401 are executed by a processor, all or part of the steps of the audio fingerprint extraction methods of the embodiments of the present disclosure are performed.
  • FIG. 7 is a schematic diagram showing a hardware structure of a terminal device according to an embodiment of the present disclosure.
  • terminal devices may be implemented in various forms; the terminal devices in the present disclosure may include, but are not limited to, mobile terminal devices such as mobile phones, smart phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), navigation devices, in-vehicle terminal devices, in-vehicle display terminals and in-vehicle electronic rearview mirrors, as well as fixed terminal devices such as digital TVs, desktop computers, and the like.
  • the terminal device 1100 may include a wireless communication unit 1110, an A/V (audio/video) input unit 1120, a user input unit 1130, a sensing unit 1140, an output unit 1150, a memory 1160, an interface unit 1170, a controller 1180, a power supply unit 1190, and so on.
  • Figure 7 shows a terminal device having various components, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the wireless communication unit 1110 allows radio communication between the terminal device 1100 and a wireless communication system or network.
  • the A/V input unit 1120 is for receiving an audio or video signal.
  • the user input unit 1130 can generate key input data according to a command input by the user to control various operations of the terminal device.
  • the sensing unit 1140 detects the current state of the terminal device 1100, the location of the terminal device 1100, the presence or absence of a user's touch input to the terminal device 1100, the orientation of the terminal device 1100, the acceleration or deceleration movement and direction of the terminal device 1100, and the like, and A command or signal for controlling the operation of the terminal device 1100 is generated.
  • the interface unit 1170 serves as an interface through which at least one external device can connect with the terminal device 1100.
  • Output unit 1150 is configured to provide an output signal in a visual, audio, and/or tactile manner.
  • the memory 1160 may store a software program or the like that performs processing and control operations performed by the controller 1180, or may temporarily store data that has been output or is to be output.
  • Memory 1160 can include at least one type of storage medium.
  • the terminal device 1100 can cooperate with a network storage device that performs a storage function of the memory 1160 through a network connection.
  • Controller 1180 typically controls the overall operation of the terminal device. Additionally, the controller 1180 can include a multimedia module for reproducing or playing back multimedia data.
  • the controller 1180 can perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as a character or an image.
  • the power supply unit 1190 receives external power or internal power under the control of the controller 1180 and provides appropriate power required to operate the various components and components.
  • the various embodiments of the audio fingerprint extraction method proposed by the present disclosure may be implemented in a computer readable medium using, for example, computer software, hardware, or any combination thereof.
  • for a hardware implementation, the various embodiments may be implemented using at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor, or an electronic unit designed to perform the functions described herein; in some cases, these embodiments may be implemented in the controller 1180.
  • for a software implementation, the various embodiments may be implemented with separate software modules, each allowing at least one function or operation to be performed.
  • the software code may be implemented as a software application (or program) written in any suitable programming language; it may be stored in the memory 1160 and executed by the controller 1180.
  • in summary, the audio fingerprint extraction method, device, hardware device, computer readable storage medium and terminal device according to the embodiments of the present disclosure extract audio fingerprint bits using masks and also extract the corresponding strong/weak weight bits, which can greatly improve the accuracy and the efficiency of audio fingerprint extraction and generate audio fingerprints of high quality and good robustness for audio signals, so that audio comparison, audio retrieval, audio deduplication and audio content monitoring based on the audio fingerprints obtained by the disclosed method achieve higher accuracy, higher efficiency and better robustness.
  • the word "exemplary" does not mean that the described example is preferred or better than other examples.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio fingerprint extraction method and device, the method comprising: converting an audio signal into a spectrogram; determining feature points in the spectrogram; determining, on the spectrogram, one or more masks for the feature points, each mask containing a plurality of spectral regions; determining the mean energy of each spectral region; determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in a mask; judging the reliability of the audio fingerprint bits to determine strong/weak weight bits; and combining the audio fingerprint bits and the strong/weak weight bits to obtain an audio fingerprint.

Description

Audio fingerprint extraction method and device
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 201810273669.6, filed on March 29, 2018, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of audio processing technologies, and in particular to an audio fingerprint extraction method and device.
BACKGROUND
Audio fingerprints (also called audio features) and audio fingerprint retrieval are widely used in today's "multimedia information society". Audio fingerprint retrieval was first applied to music recognition ("listen and identify the song"): a piece of audio is input, and by extracting and comparing its fingerprint features, the corresponding song can be identified. Audio fingerprint retrieval can also be applied to content monitoring, for example audio deduplication, retrieval-based voice advertisement monitoring, and audio copyright protection.
Existing audio fingerprint retrieval methods suffer from poor accuracy, which is partly caused by the poor accuracy of the extracted audio fingerprints. Existing audio fingerprint extraction methods also have problems such as poor robustness to noise and complicated processing.
SUMMARY
An object of the present disclosure is to provide a new audio fingerprint extraction method and device.
The object of the present disclosure is achieved by the following technical solution. An audio fingerprint extraction method proposed according to the present disclosure comprises the following steps: converting an audio signal into a spectrogram; determining feature points in the spectrogram; determining, on the spectrogram, one or more masks for the feature points, each mask containing a plurality of spectral regions; determining the mean energy of each spectral region; determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in a mask; judging the reliability of the audio fingerprint bits to determine strong/weak weight bits; and combining the audio fingerprint bits and the strong/weak weight bits to obtain an audio fingerprint.
The object of the present disclosure may be further achieved by the following technical measures.
The aforementioned audio fingerprint extraction method, wherein converting the audio signal into a spectrogram comprises: converting the audio signal into a time-frequency two-dimensional spectrogram by a short-time Fourier transform, where the value of each point in the spectrogram represents the energy of the audio signal.
The aforementioned method, wherein converting the audio signal into a spectrogram further comprises: performing a Mel transform on the spectrogram.
The aforementioned method, wherein converting the audio signal into a spectrogram further comprises: performing human auditory system filtering on the spectrogram.
The aforementioned method, wherein the feature points are fixed points in the spectrogram.
The aforementioned method, wherein the feature points are points whose frequency values equal a plurality of preset frequency values.
The aforementioned method, wherein the feature points are energy maximum points in the spectrogram, or the feature points are energy minimum points in the spectrogram.
The aforementioned method, wherein the plurality of spectral regions contained in a mask are symmetrically distributed.
The aforementioned method, wherein the plurality of spectral regions contained in a mask have the same frequency range, and/or the same time range, and/or are centrally symmetric about the feature point.
The aforementioned method, wherein the mean energy of a spectral region is the average of the energy values of all points contained in the spectral region.
The aforementioned method, wherein determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in a mask comprises: determining the value of one audio fingerprint bit according to the difference of the mean energies of the plurality of spectral regions contained in one mask.
The aforementioned method, wherein judging the reliability of the audio fingerprint bits to determine strong/weak weight bits comprises: judging whether the absolute value of the difference reaches or exceeds a preset strong/weak bit threshold; if it reaches or exceeds the threshold, determining the audio fingerprint bit to be a strong bit, otherwise determining it to be a weak bit; and determining the strong/weak weight bit according to whether the audio fingerprint bit is a strong bit or a weak bit.
The aforementioned method, wherein the strong/weak bit threshold is a fixed value, a value based on the differences, or a proportional value.
The aforementioned method, further comprising: dividing the audio signal into multiple audio sub-signals by time; extracting the audio fingerprints of the audio sub-signals; and combining the extracted audio fingerprints of the audio sub-signals to obtain the audio fingerprint of the audio signal.
The object of the present disclosure is also achieved by the following technical solution. An audio fingerprint library construction method proposed according to the present disclosure comprises: extracting the audio fingerprint of an audio signal according to any of the aforementioned audio fingerprint extraction methods; and storing the audio fingerprint into an audio fingerprint library.
The object of the present disclosure is also achieved by the following technical solution. An audio fingerprint extraction device proposed according to the present disclosure comprises: a spectrogram conversion module for converting an audio signal into a spectrogram; a feature point determination module for determining feature points in the spectrogram; a mask determination module for determining, on the spectrogram, one or more masks for the feature points, each mask containing a plurality of spectral regions; a mean energy determination module for determining the mean energy of each spectral region; an audio fingerprint bit determination module for determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in a mask; a strong/weak weight bit determination module for judging the reliability of the audio fingerprint bits to determine strong/weak weight bits; and an audio fingerprint determination module for combining the audio fingerprint bits and the strong/weak weight bits to obtain an audio fingerprint.
The object of the present disclosure may be further achieved by the following technical measures.
The aforementioned audio fingerprint extraction device further comprises modules for performing the steps of any of the aforementioned audio fingerprint extraction methods.
The object of the present disclosure is also achieved by the following technical solution. An audio fingerprint library construction device proposed according to the present disclosure comprises: an audio fingerprint extraction module for extracting the audio fingerprint of an audio signal according to any of the aforementioned audio fingerprint extraction methods; an audio fingerprint storage module for storing the audio fingerprint into an audio fingerprint library; and the audio fingerprint library for storing the audio fingerprints.
The object of the present disclosure is also achieved by the following technical solution. An audio fingerprint extraction hardware device proposed according to the present disclosure comprises: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions such that, when executed by the processor, they implement any of the aforementioned audio fingerprint extraction methods.
The object of the present disclosure is also achieved by the following technical solution. A computer readable storage medium proposed according to the present disclosure stores non-transitory computer readable instructions which, when executed by a computer, cause the computer to perform any of the aforementioned audio fingerprint extraction methods.
The object of the present disclosure is also achieved by the following technical solution. A terminal device proposed according to the present disclosure comprises any of the aforementioned audio fingerprint extraction devices.
The above description is only an overview of the technical solutions of the present disclosure. In order that the technical means of the present disclosure may be understood more clearly and implemented in accordance with this specification, and to make the above and other objects, features and advantages of the present disclosure more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a schematic flowchart of an audio fingerprint extraction method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of an audio fingerprint library construction method according to an embodiment of the present disclosure.
Fig. 3 is a structural block diagram of an audio fingerprint extraction device according to an embodiment of the present disclosure.
Fig. 4 is a structural block diagram of an audio fingerprint library construction device according to an embodiment of the present disclosure.
Fig. 5 is a hardware block diagram of an audio fingerprint extraction hardware device according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a computer readable storage medium according to an embodiment of the present disclosure.
Fig. 7 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
To further explain the technical means adopted by the present disclosure to achieve the intended objects and their effects, specific embodiments, structures, features and effects of the audio fingerprint extraction method and device proposed according to the present disclosure are described in detail below with reference to the accompanying drawings and preferred embodiments.
Fig. 1 is a schematic flowchart of an embodiment of the audio fingerprint extraction method of the present disclosure. Referring to Fig. 1, the audio fingerprint extraction method of the example of the present disclosure mainly comprises the following steps:
Step S11: convert the audio signal into a spectrogram. Specifically, the audio signal is converted into a time-frequency spectrogram by a short-time Fourier transform (STFT). The spectrogram is a common two-dimensional representation of an audio signal: the horizontal axis is time t, the vertical axis is frequency f, and the value E(t, f) of each point (t, f) represents the energy of the signal. Note that the specific type of the audio signal is not limited; it may be a static file or streaming audio. Processing then proceeds to step S12.
In an embodiment of the present disclosure, the spectrogram may be pre-processed with a Mel (MEL) transform, which divides the spectrum into a number of frequency blocks (frequency bins); the number of blocks is configurable. In addition, human auditory system filtering may be applied to the spectrogram; such non-linear transformations make the spectral distribution in the spectrogram better match human hearing.
It should be noted that the hyperparameters of the short-time Fourier transform can be adjusted to suit different practical situations. In an embodiment of the present disclosure, the hyperparameters of step S11 may be set as follows: in the short-time Fourier transform, the time window is set to 100 ms and the hop interval to 50 ms; in the Mel transform, the number of frequency blocks is set to 32 to 128.
Step S12: determine the feature points in the spectrogram.
Specifically, one of several criteria is used to determine the feature points; for example, the feature points may be selected as energy maximum points in the spectrogram, or as energy minimum points. If the energy E(t, f) of a point (t, f) in the spectrogram simultaneously satisfies E(t,f) > E(t+1,f), E(t,f) > E(t-1,f), E(t,f) > E(t,f+1) and E(t,f) > E(t,f-1), then (t, f) is an energy maximum point of the spectrogram. Similarly, if E(t, f) simultaneously satisfies E(t,f) < E(t+1,f), E(t,f) < E(t-1,f), E(t,f) < E(t,f+1) and E(t,f) < E(t,f-1), then (t, f) is an energy minimum point of the spectrogram. Processing then proceeds to step S13.
In an embodiment of the present disclosure, selecting energy extremum points as feature points has drawbacks: extremum points are easily disturbed by noise; their number is hard to control, so one spectrogram may contain no extremum point while another contains many, making the feature points uneven; and an extra timestamp must be stored to record each extremum's position in the spectrogram. Therefore, instead of energy extremum points, fixed points may be selected as feature points, for example points whose frequency equals a preset frequency value (frequency-fixed points). Further, a plurality of frequency values covering low, middle and high frequencies may be preset (the specific values of low, middle and high frequency are settable). Selecting multiple fixed points at low, middle and high frequencies as feature points makes the chosen feature points more uniform. Note that fixed points may also be selected by other criteria, such as points whose energy equals one or more preset values.
It should be noted that the number of selected feature points can be adjusted to suit different practical situations. In an embodiment of the present disclosure, the hyperparameter of step S12 may be set such that the density of feature points is 20 to 80 per second.
Step S13: on the spectrogram, in the vicinity of each feature point, determine one or more masks for the feature point; each mask contains (i.e., covers) several regions of the spectrogram (called spectral regions here). Processing then proceeds to step S14.
Specifically, in the spectrogram, the spectral regions contained in a mask may be symmetrically distributed:
symmetric about the time axis (i.e., the regions have the same frequency range): for example, in a Mel spectrogram, a mask containing two spectral regions R11 and R12 may be determined for a feature point, where both R11 and R12 lie to the left of the feature point, R11 lies to the left of R12, and R11 and R12 cover the same frequency blocks;
or symmetric about the frequency axis (i.e., the regions have the same time range): for example, in a Mel spectrogram, a mask containing two spectral regions R13 and R14 may be determined, where R13 lies above the feature point, R14 lies below it, and R13 and R14 have the same time range;
or centrally symmetric about the feature point: for example, in a Mel spectrogram, a mask containing two spectral regions R15 and R16 may be determined, where R15 lies to the upper left of the feature point, R16 lies to the lower right, and R15 and R16 are symmetric to each other about the feature point.
Of course, the spectral regions of one mask may satisfy several of these symmetries at once. For example, a mask containing four spectral regions R21, R22, R23 and R24 may be determined for a feature point, located at the upper left, upper right, lower left and lower right of the feature point respectively, where R21 and R22 have the same frequency range, R23 and R24 have the same frequency range, R21 and R23 have the same time range, R22 and R24 have the same time range, and the four regions are also centrally symmetric about the feature point. It should be noted that the four spectral regions of a mask need not be centrally symmetric about the feature point; for example, they may all lie to the left of the feature point, distributed on both sides of it along the frequency axis.
It should be noted that spectral regions belonging to the same mask may overlap each other, and different masks may also overlap each other. Optionally, each mask may contain an even number of spectral regions.
Note that the masks may be determined according to a fixed preset standard, that is, the position and coverage of each mask in the spectrogram are preset. Alternatively, instead of fixing the position and extent of the masks in advance, the mask regions may be determined automatically in a data-driven way: the masks with the smallest covariance and the greatest discriminative power are selected from a large pool of candidate masks.
Step S14: determine the mean energy of each spectral region. Specifically, for a spectral region containing only one point, the mean energy of the region is the energy value of that point; when the spectral region consists of several points, its mean energy may be set to the average of their energy values. Processing then proceeds to step S15.
Step S15: determine audio fingerprint bits according to the mean energies of the spectral regions in a mask. Processing then proceeds to step S16.
In step S15 of this embodiment of the present disclosure, one audio fingerprint bit may be determined from the difference of the mean energies of the spectral regions contained in one mask.
Specifically, if a mask contains two spectral regions, as in the aforementioned example with R11 and R12, the difference D1 of the mean energies of R11 and R12 can be calculated according to Formula 1:
D1 = E(R11) - E(R12),        (Formula 1)
and the sign of D1 is judged: if D1 is positive, an audio fingerprint bit of value 1 is obtained; if D1 is negative, an audio fingerprint bit of value 0 is obtained.
If a mask contains four spectral regions, as in the aforementioned example with R21, R22, R23 and R24, the difference D2 of their mean energies can be calculated according to Formula 2:
D2 = (E(R21) + E(R22)) - (E(R23) + E(R24)),   (Formula 2)
and the sign of D2 is judged: if D2 is positive, an audio fingerprint bit of value 1 is obtained; if D2 is negative, an audio fingerprint bit of value 0 is obtained. It should be noted that the fingerprint bit of a four-region mask need not be determined by D2; other forms of difference may also be used. For example, the second-order difference D3 of the mean energies of the four regions can be calculated:
D3 = (E(R23) - E(R24)) - (E(R21) - E(R22)),   (Formula 3)
and the sign of the difference D3 is then judged to determine the audio fingerprint bit.
It should be noted that if multiple masks are determined for a feature point, a corresponding number of audio fingerprint bits can be obtained.
Step S16: determine the strong/weak weight bit corresponding to each audio fingerprint bit; the strong/weak weight bit expresses how reliable the audio fingerprint bit is. Specifically, audio fingerprint bits of high reliability are defined as strong bits, and those of low reliability as weak bits. The reliability of each audio fingerprint bit is judged, and the value of its strong/weak weight bit is determined according to whether it is a strong or weak bit. Processing then proceeds to step S17.
In this embodiment of the present disclosure, if the audio fingerprint bit was determined from the difference of the mean energies of the spectral regions contained in a mask, step S16 specifically comprises: judging whether the absolute value of the difference used to generate the bit reaches (or exceeds) a preset strong/weak bit threshold; if it does, the audio fingerprint bit is determined to be a strong bit and a strong/weak weight bit of value 1 is obtained for it; otherwise the bit is determined to be a weak bit and a strong/weak weight bit of value 0 is obtained for it.
As a concrete example, if an audio fingerprint bit was determined from the sign of the four-region difference D2 of Formula 2, step S16 comprises comparing the absolute value of D2 with the preset strong/weak bit threshold T: if |D2| >= T, the bit is a strong bit and its strong/weak weight bit is set to 1; if |D2| < T, the bit is a weak bit and its strong/weak weight bit is set to 0. It should be noted that the strong/weak bit threshold may be of several types: a preset fixed value, for example 1; or a value derived from the mean-energy differences, for example the average of the differences over the masks (or feature points) (in fact any value between the largest and the smallest difference will do), in which case bits whose difference reaches the average are strong and the others weak; or a proportional value, for example 60%: among the differences of the masks (or feature points), if the absolute value of a difference lies in the top 60% of all differences, the corresponding audio fingerprint bit is determined to be a strong bit, otherwise a weak bit.
Step S17: combine the obtained audio fingerprint bits and strong/weak weight bits to obtain the audio fingerprint. Specifically, neither the manner of combination nor the length of the audio fingerprint is limited. For example, an audio fingerprint may consist of two parts: the first part is obtained by combining the audio fingerprint bits of all masks of one feature point into an audio fingerprint bit sequence and arranging the bit sequences of the feature points in chronological order; the second part is the strong/weak weight bit sequence, of the same length, obtained by combining the corresponding strong/weak weight bits and arranging them in the same chronological order. Optionally, the length of each audio fingerprint bit sequence may be 32 bits.
By extracting, together with each audio fingerprint bit, the strong/weak weight bit corresponding to that bit, the present disclosure can generate an audio fingerprint of high accuracy and good robustness for a piece of audio.
Optionally, the audio fingerprint extraction method further comprises: adding a timestamp field to the audio fingerprint, a field representing the time difference between the start of the audio and the feature point; the field may be a hash value. If the feature points are set to fixed points, this step may be omitted, i.e. the timestamp need not be recorded.
Optionally, the method further comprises: adding an audio signal identifier field to the audio fingerprint to record the ID of the audio signal corresponding to the fingerprint; the field may be a hash value.
Optionally, the method further comprises: dividing the audio signal into multiple audio sub-signals by time; extracting an audio fingerprint for each sub-signal according to the steps of the foregoing method to obtain a plurality of audio fingerprints; and combining the fingerprints of the feature points of the sub-signals to obtain the audio fingerprint of the whole audio signal.
As an optional example, when the audio fingerprints extracted by the present disclosure are used for audio retrieval and audio recognition, the distance between two audio fingerprints (for example, the Hamming distance) is computed with each audio fingerprint bit weighted by its corresponding strong/weak weight bit: strong bits receive a high weight and weak bits a low weight (the weight of weak bits may be set to zero), weakening or removing the contribution of weak bits. This makes audio retrieval more robust to noise and effectively mitigates bit errors caused by noise.
The Hamming distance is a metric commonly used in information theory; the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ. In practice, it can be computed by XOR-ing the two strings and counting the 1s in the result; that count is the Hamming distance.
Fig. 2 is a schematic flowchart of an embodiment of the audio fingerprint library construction method of the present disclosure. Referring to Fig. 2, the method of the example of the present disclosure mainly comprises the following steps:
Step S21: extract the audio fingerprint of an audio signal according to the steps of the audio fingerprint extraction method of the foregoing examples of the present disclosure. Processing then proceeds to step S22.
Step S22: store the obtained audio fingerprint of the audio signal into the audio fingerprint library.
It should be noted that the more audio signals are processed, the richer the information stored in the audio fingerprint library. In addition, the library can be updated continually as time goes on.
Fig. 3 is a schematic structural diagram of an embodiment of the audio fingerprint extraction device of the present disclosure. Referring to Fig. 3, the audio fingerprint extraction device 100 of the example of the present disclosure mainly includes: a spectrogram conversion module 101, a feature point determination module 102, a mask determination module 103, a mean energy determination module 104, an audio fingerprint bit determination module 105, a strong/weak weight bit determination module 106, and an audio fingerprint determination module 107.
The spectrogram conversion module 101 is configured to convert an audio signal into a spectrogram. Specifically, it may convert the audio signal into a time-frequency spectrogram by a short-time Fourier transform (STFT).
In an embodiment of the present disclosure, the spectrogram conversion module 101 may include a Mel transform sub-module for pre-processing the spectrogram with a Mel (MEL) transform, which divides the spectrum into a configurable number of frequency blocks (bins). The module 101 may further include a human auditory system filtering sub-module for applying human auditory system filtering to the spectrogram; such non-linear transformations make the spectral distribution in the spectrogram better match human hearing.
The feature point determination module 102 is configured to determine feature points in the spectrogram.
Specifically, the module 102 may determine feature points by one of several criteria, for example selecting energy maximum points in the spectrogram, or energy minimum points.
In an embodiment of the present disclosure, the module 102 may also select fixed points rather than energy extremum points as feature points, for example points whose frequency equals a preset frequency value (frequency-fixed points). Further, the module may select a plurality of frequency values covering low, middle and high frequencies (the specific values are settable).
The mask determination module 103 is configured to determine, on the spectrogram and in the vicinity of each feature point, one or more masks for the feature point, each mask containing a plurality of spectral regions. Specifically, in the spectrogram, the spectral regions contained in a mask may be symmetrically distributed.
The mean energy determination module 104 is configured to determine the mean energy of each spectral region.
The audio fingerprint bit determination module 105 is configured to determine an audio fingerprint bit according to the mean energies of the spectral regions in a mask.
In an embodiment of the present disclosure, the module 105 may determine one audio fingerprint bit from the difference of the mean energies of the spectral regions contained in one mask.
The strong/weak weight bit determination module 106 is configured to judge the reliability of the audio fingerprint bits to determine the strong/weak weight bit corresponding to each audio fingerprint bit.
In an embodiment of the present disclosure, if the audio fingerprint bit is determined from the difference of the mean energies of the spectral regions contained in a mask, the module 106 is specifically configured to: judge whether the absolute value of the difference used to generate the bit reaches (or exceeds) the preset strong/weak bit threshold; if it does, determine the bit to be a strong bit and output a strong/weak weight bit of value 1; otherwise determine it to be a weak bit and output a strong/weak weight bit of value 0.
The audio fingerprint determination module 107 is configured to combine the obtained audio fingerprint bits and strong/weak weight bits to obtain the audio fingerprint.
Optionally, the device 100 further includes a timestamp adding module (not shown) for adding a timestamp field to the audio fingerprint, a field representing the time difference between the start of the audio and the feature point; the field may be a hash value. If the feature points are set to fixed points, the timestamp adding module may be omitted.
Optionally, the device 100 further includes an audio signal identifier adding module (not shown) for adding an audio signal identifier field to the audio fingerprint to record the ID of the audio signal corresponding to the fingerprint.
Optionally, the device 100 further includes an audio segmentation module (not shown) and an audio fingerprint combination module (not shown). The audio segmentation module is configured to divide an audio signal into multiple audio sub-signals by time; the modules of the device extract an audio fingerprint from each sub-signal, yielding a plurality of audio fingerprints; and the audio fingerprint combination module is configured to combine the fingerprints of the feature points of the sub-signals to obtain the audio fingerprint of the whole audio signal.
Fig. 4 is a schematic structural diagram of an embodiment of the audio fingerprint library construction device of the present disclosure. Referring to Fig. 4, the audio fingerprint library construction device 200 of the example of the present disclosure mainly includes:
the audio fingerprint extraction module 201, which includes the spectrogram conversion module 101, feature point determination module 102, mask determination module 103, mean energy determination module 104, audio fingerprint bit determination module 105, strong/weak weight bit determination module 106 and audio fingerprint determination module 107 of the aforementioned audio fingerprint extraction device 100, and is configured to extract the audio fingerprint of an audio signal according to the steps of the audio fingerprint extraction method of the foregoing examples of the present disclosure;
the audio fingerprint storage module 202, configured to store the audio fingerprint of the audio signal obtained by the audio fingerprint extraction module 201 into the audio fingerprint database 203;
and the audio fingerprint database 203, configured to store the audio fingerprints of the audio signals.
Fig. 5 is a hardware block diagram illustrating an audio fingerprint extraction hardware device according to an embodiment of the present disclosure. As shown in Fig. 5, the audio fingerprint extraction hardware device 300 according to an embodiment of the present disclosure includes a memory 301 and a processor 302. The components of the device 300 are interconnected by a bus system and/or other forms of connection mechanism (not shown).
The memory 301 is configured to store non-transitory computer readable instructions. Specifically, the memory 301 may include one or more computer program products, which may include various forms of computer readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory; the non-volatile memory may include, for example, read-only memory (ROM), hard disks, and flash memory.
The processor 302 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control the other components of the device 300 to perform desired functions. In an embodiment of the present disclosure, the processor 302 is configured to execute the computer readable instructions stored in the memory 301 so that the device 300 performs all or part of the steps of the audio fingerprint extraction methods of the embodiments of the present disclosure.
Fig. 6 is a schematic diagram illustrating a computer readable storage medium according to an embodiment of the present disclosure. As shown in Fig. 6, the computer readable storage medium 400 has non-transitory computer readable instructions 401 stored thereon; when the instructions 401 are executed by a processor, all or part of the steps of the audio fingerprint extraction methods of the embodiments of the present disclosure are performed.
Fig. 7 is a schematic diagram of the hardware structure of a terminal device according to an embodiment of the present disclosure. Terminal devices may be implemented in various forms; the terminal devices in the present disclosure may include, but are not limited to, mobile terminal devices such as mobile phones, smart phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), navigation devices, in-vehicle terminal devices, in-vehicle display terminals and in-vehicle electronic rearview mirrors, as well as fixed terminal devices such as digital TVs and desktop computers.
As shown in Fig. 7, the terminal device 1100 may include a wireless communication unit 1110, an A/V (audio/video) input unit 1120, a user input unit 1130, a sensing unit 1140, an output unit 1150, a memory 1160, an interface unit 1170, a controller 1180, a power supply unit 1190, and so on. Fig. 7 shows a terminal device with various components, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
The wireless communication unit 1110 allows radio communication between the terminal device 1100 and a wireless communication system or network. The A/V input unit 1120 is configured to receive audio or video signals. The user input unit 1130 may generate key input data according to commands input by a user to control various operations of the terminal device. The sensing unit 1140 detects the current state of the terminal device 1100, its position, the presence or absence of a user's touch input, its orientation, its acceleration or deceleration and direction, etc., and generates commands or signals for controlling its operation. The interface unit 1170 serves as an interface through which at least one external device can connect to the terminal device 1100. The output unit 1150 is configured to provide output signals in a visual, audio and/or tactile manner. The memory 1160 may store software programs for the processing and control operations performed by the controller 1180, or may temporarily store data that has been or is to be output; it may include at least one type of storage medium, and the terminal device 1100 may cooperate with a network storage device that performs the storage function of the memory 1160 over a network connection. The controller 1180 generally controls the overall operation of the terminal device; it may include a multimedia module for reproducing or playing back multimedia data, and may perform pattern recognition processing to recognize handwriting input or picture drawing input on a touch screen as characters or images. The power supply unit 1190 receives external or internal power under the control of the controller 1180 and provides the appropriate power required to operate the elements and components.
The various embodiments of the audio fingerprint extraction method proposed by the present disclosure may be implemented in a computer readable medium using, for example, computer software, hardware, or any combination thereof. For a hardware implementation, the embodiments may be implemented using at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor, or an electronic unit designed to perform the functions described herein; in some cases, the embodiments may be implemented in the controller 1180. For a software implementation, the embodiments may be implemented with separate software modules that each allow at least one function or operation to be performed. The software code may be implemented as a software application (or program) written in any suitable programming language, stored in the memory 1160 and executed by the controller 1180.
In summary, the audio fingerprint extraction method, device, hardware device, computer readable storage medium and terminal device according to the embodiments of the present disclosure extract audio fingerprint bits using masks and also extract the corresponding strong/weak weight bits, which can greatly improve the accuracy and the efficiency of audio fingerprint extraction and generate audio fingerprints of high quality and good robustness for audio signals, so that audio comparison, audio retrieval, audio deduplication and audio content monitoring based on the audio fingerprints obtained by the disclosed method achieve higher accuracy, higher efficiency and better robustness.
The basic principles of the present disclosure have been described above in connection with specific embodiments. However, it should be pointed out that the advantages, strengths, effects and the like mentioned in the present disclosure are merely examples and not limitations, and must not be regarded as indispensable to every embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, not limitation; the present disclosure is not required to be implemented with the specific details described above.
The block diagrams of components, devices, apparatuses and systems in the present disclosure are merely illustrative examples and are not intended to require or imply that they must be connected, arranged or configured in the manner shown in the block diagrams; as those skilled in the art will recognize, these components, devices, apparatuses and systems may be connected, arranged or configured in any manner. Words such as "include", "comprise" and "have" are open-ended and mean "including but not limited to", and may be used interchangeably therewith. The words "or" and "and" as used herein mean "and/or" and may be used interchangeably therewith, unless the context clearly indicates otherwise. The term "such as" means "such as but not limited to" and may be used interchangeably therewith.
In addition, as used herein, "or" in an enumeration beginning with "at least one of" indicates a disjunctive enumeration, so that, for example, "at least one of A, B or C" means A or B or C, or AB or AC or BC, or ABC (i.e. A and B and C). Moreover, the word "exemplary" does not mean that the described example is preferred or better than other examples.
It should also be pointed out that in the systems and methods of the present disclosure, components or steps may be decomposed and/or recombined; such decompositions and/or recombinations should be regarded as equivalent solutions of the present disclosure.
Various changes, substitutions and alterations may be made to the techniques described herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of matter, means, methods and acts described above; processes, machines, manufacture, compositions of matter, means, methods or acts presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims include such processes, machines, manufacture, compositions of matter, means, methods or acts within their scope.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been given for the purposes of illustration and description. It is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions and sub-combinations thereof.

Claims (21)

  1. An audio fingerprint extraction method, the method comprising:
    converting an audio signal into a spectrogram;
    determining feature points in the spectrogram;
    determining, on the spectrogram, one or more masks for the feature points, each of the masks comprising a plurality of spectral regions;
    determining the mean energy of each of the spectral regions;
    determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in the masks;
    judging the degree of reliability of the audio fingerprint bits to determine strong-weak weight bits;
    combining the audio fingerprint bits and the strong-weak weight bits to obtain an audio fingerprint.
  2. The audio fingerprint extraction method according to claim 1, wherein the converting the audio signal into a spectrogram comprises: converting the audio signal into a two-dimensional time-frequency spectrogram by means of a short-time Fourier transform, the value of each point in the spectrogram representing the energy of the audio signal.
  3. The audio fingerprint extraction method according to claim 2, wherein the converting the audio signal into a spectrogram further comprises: performing a Mel transformation on the spectrogram.
  4. The audio fingerprint extraction method according to claim 2, wherein the converting the audio signal into a spectrogram further comprises: performing human auditory system filtering on the spectrogram.
  5. The audio fingerprint extraction method according to claim 2, wherein the feature points are fixed points in the spectrogram.
  6. The audio fingerprint extraction method according to claim 5, wherein the feature points are points whose frequency values are equal to a plurality of preset frequency setting values.
  7. The audio fingerprint extraction method according to claim 2, wherein the feature points are energy maximum points in the spectrogram, or the feature points are energy minimum points in the spectrogram.
  8. The audio fingerprint extraction method according to claim 1, wherein the plurality of spectral regions included in a mask are symmetrically distributed.
  9. The audio fingerprint extraction method according to claim 8, wherein the plurality of spectral regions included in a mask have the same frequency range, and/or have the same time range, and/or are centrosymmetrically distributed about the feature point.
  10. The audio fingerprint extraction method according to claim 1, wherein the mean energy of a spectral region is the average of the energy values of all points included in the spectral region.
  11. The audio fingerprint extraction method according to claim 1, wherein the determining audio fingerprint bits according to the mean energies of the plurality of spectral regions in the masks comprises:
    determining one audio fingerprint bit according to the difference between the mean energies of the plurality of spectral regions included in one of the masks.
  12. The audio fingerprint extraction method according to claim 11, wherein the judging the degree of reliability of the audio fingerprint bits to determine strong-weak weight bits comprises:
    judging whether the absolute value of the difference reaches or exceeds a preset strong-weak bit threshold; if it reaches or exceeds the strong-weak bit threshold, determining the audio fingerprint bit as a strong bit, otherwise determining the audio fingerprint bit as a weak bit; and determining the strong-weak weight bit according to whether the audio fingerprint bit is a strong bit or a weak bit.
  13. The audio fingerprint extraction method according to claim 12, wherein the strong-weak bit threshold is a fixed value, a value based on the difference, or a proportional value.
  14. The audio fingerprint extraction method according to claim 1, the method further comprising:
    dividing the audio signal into a plurality of audio sub-signals by time;
    extracting the audio fingerprints of the audio sub-signals;
    combining the extracted audio fingerprints of the audio sub-signals to obtain the audio fingerprint of the audio signal.
  15. An audio fingerprint library construction method, the method comprising:
    extracting the audio fingerprint of an audio signal according to the audio fingerprint extraction method of any one of claims 1 to 14;
    storing the audio fingerprint into an audio fingerprint library.
  16. An audio fingerprint extraction device, the device comprising:
    a spectrogram conversion module, configured to convert an audio signal into a spectrogram;
    a feature point determination module, configured to determine feature points in the spectrogram;
    a mask determination module, configured to determine, on the spectrogram, one or more masks for the feature points, each of the masks comprising a plurality of spectral regions;
    a mean energy determination module, configured to determine the mean energy of each of the spectral regions;
    an audio fingerprint bit determination module, configured to determine audio fingerprint bits according to the mean energies of the plurality of spectral regions in the masks;
    a strong-weak weight bit determination module, configured to judge the degree of reliability of the audio fingerprint bits to determine strong-weak weight bits;
    an audio fingerprint determination module, configured to combine the audio fingerprint bits and the strong-weak weight bits to obtain an audio fingerprint.
  17. The audio fingerprint extraction device according to claim 16, the device further comprising modules configured to perform the steps of any one of claims 2 to 14.
  18. An audio fingerprint library construction device, the device comprising:
    an audio fingerprint extraction module, configured to extract the audio fingerprint of an audio signal according to the audio fingerprint extraction method of any one of claims 1 to 14;
    an audio fingerprint storage module, configured to store the audio fingerprint into an audio fingerprint library;
    an audio fingerprint library, configured to store the audio fingerprints.
  19. An audio fingerprint extraction hardware device, comprising:
    a memory, configured to store non-transitory computer-readable instructions; and
    a processor, configured to run the computer-readable instructions such that, when executed by the processor, the audio fingerprint extraction method of any one of claims 1 to 14 is implemented.
  20. A computer-readable storage medium storing non-transitory computer-readable instructions which, when executed by a computer, cause the computer to perform the audio fingerprint extraction method of any one of claims 1 to 14.
  21. A terminal device comprising the audio fingerprint extraction device of claim 16 or 17.
PCT/CN2018/125491 2018-03-29 2018-12-29 Audio fingerprint extraction method and device WO2019184517A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020502951A JP6908774B2 (ja) 2018-03-29 2018-12-29 Audio fingerprint extraction method and device
SG11202008533VA SG11202008533VA (en) 2018-03-29 2018-12-29 Audio fingerprint extraction method and device
US16/652,028 US10950255B2 (en) 2018-03-29 2018-12-29 Audio fingerprint extraction method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810273669.6 2018-03-29
CN201810273669.6A CN110322886A (zh) 2018-03-29 Audio fingerprint extraction method and device

Publications (1)

Publication Number Publication Date
WO2019184517A1 (zh)

Family

ID=68062543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/125491 WO2019184517A1 (zh) 2018-03-29 2018-12-29 一种音频指纹提取方法及装置

Country Status (5)

Country Link
US (1) US10950255B2 (zh)
JP (1) JP6908774B2 (zh)
CN (1) CN110322886A (zh)
SG (1) SG11202008533VA (zh)
WO (1) WO2019184517A1 (zh)

Also Published As

Publication number Publication date
US10950255B2 (en) 2021-03-16
JP2020527255A (ja) 2020-09-03
US20200273483A1 (en) 2020-08-27
CN110322886A (zh) 2019-10-11
SG11202008533VA (en) 2020-10-29
JP6908774B2 (ja) 2021-07-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18911871

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020502951

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18911871

Country of ref document: EP

Kind code of ref document: A1