WO2019184518A1 - Audio retrieval and identification method and apparatus - Google Patents
- Publication number
- WO2019184518A1 (international application PCT/CN2018/125493)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- fingerprint
- retrieval
- audio fingerprint
- ranking
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/683—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/632—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/635—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/638—Presentation of query results
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
Definitions
- The present disclosure relates to the field of audio processing technologies, and in particular to an audio retrieval and identification method and apparatus.
- Audio fingerprints (also called audio features) and audio fingerprint retrieval are widely used in today's "multimedia information society."
- Audio fingerprint retrieval was first applied to song identification ("listen to recognize the song"): a piece of audio is input, and by extracting and comparing its fingerprint features, the corresponding song can be identified.
- Audio fingerprint retrieval can also be applied to content monitoring, for example audio deduplication, retrieval-based voice advertisement monitoring, and audio copyright protection.
- Existing audio retrieval and recognition methods suffer from poor accuracy and low speed, and cause heavy consumption of computing and storage resources.
- An audio retrieval and recognition method comprises the steps of: acquiring an audio fingerprint of audio to be recognized, wherein the audio fingerprint includes a first part representing a content feature of the audio to be recognized and a second part representing the degree of credibility of the first part; and identifying the audio to be recognized according to the audio fingerprint to obtain a recognition result.
- the object of the present disclosure can also be further achieved by the following technical measures.
- In the foregoing audio retrieval and recognition method, acquiring the audio fingerprint of the audio to be recognized comprises: converting the audio to be recognized into a spectrogram; determining feature points in the spectrogram; determining, on the spectrogram, one or more masks for the feature points, each mask comprising a plurality of spectral regions; determining the mean energy of each spectral region; determining an audio fingerprint bit according to the mean energies of the plurality of spectral regions in a mask; determining the degree of credibility of the audio fingerprint bit to determine a strong/weak weight bit; and determining the audio fingerprint of the audio to be recognized based on the audio fingerprint bits and the strong/weak weight bits.
- In the foregoing audio retrieval and recognition method, converting the audio to be recognized into a spectrogram comprises: converting the audio to be recognized into a time-frequency two-dimensional spectrogram by a short-time Fourier transform, where the value of each point in the spectrogram represents the energy of the audio to be recognized.
- In the foregoing audio retrieval and recognition method, converting the audio to be recognized into a spectrogram further comprises: applying a Mel transform to the spectrogram.
- A feature point is a point whose frequency value equals one of a plurality of preset frequency values.
- A feature point is an energy maximum point in the spectrogram, or an energy minimum point in the spectrogram.
- The plurality of spectral regions included in a mask are symmetrically distributed.
- The mean energy of a spectral region is the average of the energy values of all points included in the spectral region.
- Determining an audio fingerprint bit according to the mean energies of the plurality of spectral regions in a mask comprises: determining the audio fingerprint bit according to the difference between the mean energies of the plurality of spectral regions included in one mask.
- Determining the credibility of an audio fingerprint bit to determine the strong/weak weight bit comprises: judging whether the absolute value of the difference reaches or exceeds a preset strong/weak bit threshold; if it does, determining the audio fingerprint bit to be a strong bit, otherwise determining it to be a weak bit; and determining the strong/weak weight bit according to whether the audio fingerprint bit is a strong bit or a weak bit.
- The foregoing audio retrieval and identification method further includes: dividing the audio to be recognized into a plurality of sub-audios by time, extracting the audio fingerprint of each sub-audio, and combining the extracted audio fingerprints of the sub-audios to obtain the audio fingerprint of the audio to be recognized.
- The audio fingerprint of the audio to be recognized is defined as a first audio fingerprint.
- The first audio fingerprint includes a plurality of first audio fingerprint units and a first strong/weak weight unit corresponding to each first audio fingerprint unit; a first audio fingerprint unit comprises a plurality of the audio fingerprint bits of the audio to be recognized, and a first strong/weak weight unit comprises the strong/weak weight bits corresponding to those audio fingerprint bits.
- Identifying the audio to be recognized according to the audio fingerprint comprises: performing a first ranking of a plurality of known audios according to each individual first audio fingerprint unit, and taking the top k known audios as a first candidate audio set according to the result of the first ranking, where k is a positive integer; and performing a second ranking of the first candidate audio set according to the ordered sequence of first audio fingerprint units, and taking the top n first candidate audios as the recognition result according to the result of the second ranking, where n is a positive integer.
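The two-stage flow described in this claim (a per-unit first ranking to pick top-k candidates, then a sequence-level second ranking of those candidates) can be sketched as follows. The scoring callables and the data layout are illustrative assumptions, not the patent's concrete scheme:

```python
def retrieve(query_units, known_audios, coarse_score, fine_score, k=50, n=5):
    """Two-stage retrieval sketch: the first ranking scores each known audio
    against the query's fingerprint units individually and keeps the top k;
    the second ranking re-scores those candidates against the full ordered
    unit sequence and returns the top n."""
    first = sorted(known_audios,
                   key=lambda a: coarse_score(query_units, a), reverse=True)
    candidates = first[:k]
    second = sorted(candidates,
                    key=lambda a: fine_score(query_units, a), reverse=True)
    return second[:n]

# Toy run: score = number of shared fingerprint units (hypothetical scorer).
overlap = lambda q, a: len(set(q) & set(a["units"]))
db = [{"id": i, "units": set(range(i, i + 4))} for i in range(6)]
top = retrieve([2, 3, 4, 5], db, overlap, overlap, k=3, n=1)
```

In a real system the coarse scorer would be the TF-IDF index lookup and the fine scorer the similarity-matrix ranking described in the following claims.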
- The foregoing audio retrieval and identification method further includes: acquiring, in advance, the audio fingerprint of the known audio as a second audio fingerprint, the second audio fingerprint comprising a plurality of second audio fingerprint units and second strong/weak weight units corresponding to them; and indexing the second audio fingerprint in advance to obtain a fingerprint index of the known audio.
- In the foregoing audio retrieval and identification method, when performing the first ranking and/or the second ranking, the first audio fingerprint units and/or the second audio fingerprint units are weighted according to the first strong/weak weight units and/or the second strong/weak weight units.
- In the foregoing audio retrieval and identification method, performing the first ranking of the plurality of known audios according to each individual first audio fingerprint unit comprises: ranking the plurality of known audios by term frequency-inverse document frequency (TF-IDF) according to each individual first audio fingerprint unit.
- The TF-IDF first ranking of the plurality of known audios according to each individual first audio fingerprint unit comprises: matching the fingerprint index of the known audio against the first audio fingerprint units to perform the TF-IDF ranking of the known audios.
- The fingerprint index of the known audio is obtained in advance by: obtaining a forward fingerprint index and/or an inverted fingerprint index of the known audio according to the second strong/weak weight units.
- Matching the fingerprint index of the known audio with the first audio fingerprint units comprises: performing, using the first strong/weak weight units, absolute (exact) matching between the fingerprint index of the known audio and the first audio fingerprint units.
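A minimal sketch of the first ranking by TF-IDF over an inverted fingerprint index, as described above. The data layout (a dict from audio id to a list of fingerprint units) and the exact weighting formula are assumptions for illustration; the patent does not fully specify them in this passage:

```python
import math
from collections import defaultdict

def build_inverted_index(known):
    """Inverted fingerprint index: fingerprint unit -> set of audio ids
    containing it (a simplified version of the index described above)."""
    index = defaultdict(set)
    for audio_id, units in known.items():
        for u in units:
            index[u].add(audio_id)
    return index

def tfidf_rank(query_units, known, index):
    """First ranking by TF-IDF: a fingerprint unit shared with few known
    audios (high IDF) contributes more to a candidate's score."""
    n = len(known)
    scores = defaultdict(float)
    for u in set(query_units):
        postings = index.get(u, set())
        if not postings:
            continue
        idf = math.log(n / len(postings))          # rarer unit, higher weight
        for audio_id in postings:
            scores[audio_id] += query_units.count(u) * idf  # TF x IDF
    return sorted(scores, key=scores.get, reverse=True)

known = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [9, 9, 9]}
ranked = tfidf_rank([3, 4, 4], known, build_inverted_index(known))
```

Unit 4 appears only in audio "b", so its high IDF pushes "b" above "a" even though both share a unit with the query.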
- In the foregoing audio retrieval and identification method, performing the second ranking of the first candidate audio set according to the sequence of first audio fingerprint units comprises: obtaining a similarity matrix for the audios in the first candidate audio set according to the fingerprint index of the known audio and the first audio fingerprint, and ranking the audios in the first candidate audio set according to the similarity matrix.
- In the foregoing method, obtaining the similarity matrix of the audios in the first candidate audio set according to the fingerprint index of the known audio and the first audio fingerprint, and ranking the audios in the first candidate audio set according to the similarity matrix, includes: weighting by the first strong/weak weight units and/or the second strong/weak weight units to obtain a weighted similarity matrix, and ranking the audios in the first candidate audio set according to the weighted similarity matrix.
- In the foregoing audio retrieval and identification method, ranking the audios in the first candidate audio set according to the similarity matrix comprises: ranking the audios in the first candidate audio set according to straight lines in the similarity matrix.
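The "straight line in the similarity matrix" idea can be illustrated as follows: a genuine match advances through the candidate at the same rate as the query, so its matching fingerprint units form a high-similarity diagonal. The bitwise unit similarity and the slope-1 line search below are illustrative assumptions:

```python
def bit_similarity(u, v, nbits=32):
    """Fraction of equal bits between two fingerprint units stored as ints."""
    return 1.0 - bin((u ^ v) & ((1 << nbits) - 1)).count("1") / nbits

def similarity_matrix(query_units, cand_units, nbits=32):
    """Cell (i, j): similarity of query unit i to candidate unit j."""
    return [[bit_similarity(q, c, nbits) for c in cand_units]
            for q in query_units]

def best_diagonal_score(sim):
    """Score a candidate by its best straight line of slope 1 in the
    similarity matrix: a true match advances in step with the query."""
    rows, cols = len(sim), len(sim[0])
    best = 0.0
    for off in range(-(rows - 1), cols):   # every diagonal offset
        vals = [sim[i][i + off] for i in range(rows) if 0 <= i + off < cols]
        best = max(best, sum(vals) / len(vals))
    return best

sim = similarity_matrix([0b1010, 0b1100], [0b1010, 0b1100, 0b0001], nbits=4)
score = best_diagonal_score(sim)
```

Here the candidate contains the query's two units in order at offset 0, so the main diagonal averages to a perfect score.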
- Acquiring the audio fingerprint of the audio to be recognized includes acquiring a plurality of types of first audio fingerprints of the audio to be recognized; acquiring the audio fingerprint of the known audio as a second audio fingerprint includes acquiring a plurality of types of second audio fingerprints of the audios in the first candidate audio set; and obtaining the similarity matrix of the audios in the first candidate audio set according to the fingerprint index of the known audio and the first audio fingerprint includes determining the similarity matrix according to the plurality of types of first audio fingerprints and the plurality of types of second audio fingerprints.
- Each type of first audio fingerprint comprises a plurality of first audio fingerprint units, and each type of second audio fingerprint comprises a plurality of second audio fingerprint units.
- Determining the similarity matrix according to the plurality of types of first audio fingerprints and the plurality of types of second audio fingerprints includes: determining the unit similarity between second audio fingerprint units and first audio fingerprint units of the same type to obtain a plurality of unit similarities, and determining the similarity matrix based on the average or minimum of the plurality of unit similarities.
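A sketch of combining several fingerprint types into one similarity matrix by averaging (or taking the minimum of) the per-type unit similarities, as described above. The dict-of-types layout and the equality-based unit similarity are hypothetical:

```python
def combined_similarity(query_fps, cand_fps, unit_sim, use_min=False):
    """Combine several fingerprint types (hypothetical layout: dict mapping
    type name -> list of fingerprint units). Each matrix cell aggregates
    the per-type unit similarities by average or minimum."""
    types = sorted(query_fps)
    rows = len(query_fps[types[0]])
    cols = len(cand_fps[types[0]])
    agg = min if use_min else (lambda xs: sum(xs) / len(xs))
    return [[agg([unit_sim(query_fps[tp][i], cand_fps[tp][j])
                  for tp in types])
             for j in range(cols)] for i in range(rows)]

# Toy unit similarity: 1.0 when units are identical, else 0.0.
eq = lambda a, b: 1.0 if a == b else 0.0
q = {"typeA": [1, 2], "typeB": [7, 8]}
c = {"typeA": [1, 9], "typeB": [7, 8]}
m = combined_similarity(q, c, eq)
```

Taking the minimum instead of the average makes the combined matrix stricter: a cell is high only when every fingerprint type agrees.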
- The foregoing audio retrieval and identification method further includes: segmenting the audio to be recognized and the known audio in advance according to a preset time length to obtain a plurality of sub-audios to be recognized and a plurality of known sub-audios, and extracting audio fingerprints from the sub-audios to be recognized and the known sub-audios separately, to obtain a plurality of first sub-audio fingerprints and a plurality of second sub-audio fingerprints of the same length.
- The foregoing audio retrieval and identification method further includes: before performing the first ranking, slicing the obtained first audio fingerprint of the audio to be recognized and the second audio fingerprints of the known audio according to a preset length, to obtain a plurality of first sub-audio fingerprints and a plurality of second sub-audio fingerprints of the same length.
- The foregoing audio retrieval and recognition method further includes: determining, according to the similarity matrix, the repeated segments between the audio to be recognized and the audios in the recognition result.
- An audio retrieval and recognition apparatus comprises: an audio fingerprint acquisition system configured to acquire an audio fingerprint of audio to be recognized, wherein the audio fingerprint includes a first part representing content features of the audio to be recognized and a second part representing the degree of credibility of the first part; and a retrieval and recognition system configured to identify the audio to be recognized according to the audio fingerprint to obtain a recognition result.
- the object of the present disclosure can also be further achieved by the following technical measures.
- The aforementioned audio retrieval and identification apparatus further includes modules for performing the steps of any of the foregoing audio retrieval and identification methods.
- An audio retrieval and recognition hardware device comprises: a memory for storing non-transitory computer readable instructions; and a processor for executing the computer readable instructions such that, when they are executed, the processor implements any of the foregoing audio retrieval and recognition methods.
- A computer readable storage medium stores non-transitory computer readable instructions which, when executed by a computer, cause the computer to perform any of the foregoing audio retrieval and recognition methods.
- a terminal device includes any one of the foregoing audio retrieval and recognition devices.
- FIG. 1 is a block flow diagram of an audio retrieval and recognition method according to an embodiment of the present disclosure.
- FIG. 2 is a block diagram of a process for acquiring an audio fingerprint according to an embodiment of the present disclosure.
- FIG. 3 is a block diagram of a process for retrieving and recognizing audio provided by an embodiment of the present disclosure.
- FIG. 4 is a flow chart of a first ranking provided by an embodiment of the present disclosure.
- FIG. 5 is a flow chart of a second ranking provided by an embodiment of the present disclosure.
- FIG. 6 is a flow chart of determining a sequence similarity score by using a dynamic programming method according to an embodiment of the present disclosure.
- FIG. 7 is a flow chart of determining a sequence similarity score by using a uniform audio method according to an embodiment of the present disclosure.
- FIG. 8 is a flow chart of determining a similarity matrix based on a plurality of types of first audio fingerprints and second audio fingerprints according to an embodiment of the present disclosure.
- FIG. 9 is a block diagram showing the structure of an audio retrieval and recognition apparatus according to an embodiment of the present disclosure.
- FIG. 10 is a structural block diagram of an audio fingerprint acquiring system according to an embodiment of the present disclosure.
- FIG. 11 is a structural block diagram of a retrieval and recognition system according to an embodiment of the present disclosure.
- FIG. 12 is a structural block diagram of a first ranking module according to an embodiment of the present disclosure.
- FIG. 13 is a structural block diagram of a second ranking module according to an embodiment of the present disclosure.
- FIG. 14 is a structural block diagram of an audio retrieval and recognition apparatus that determines a similarity matrix based on a plurality of types of first audio fingerprints and second audio fingerprints, in accordance with an embodiment of the present disclosure.
- FIG. 15 is a hardware block diagram of an audio retrieval and recognition hardware device according to an embodiment of the present disclosure.
- FIG. 16 is a schematic diagram of a computer readable storage medium according to an embodiment of the present disclosure.
- FIG. 17 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
- FIG. 1 is a schematic flowchart of an embodiment of an audio retrieval and recognition method of the present disclosure.
- an audio retrieval and recognition method of an example of the present disclosure mainly includes the following steps:
- Step S10: acquire an audio fingerprint of the audio to be recognized (Query Audio).
- The audio fingerprint includes a first part representing a content feature of the audio to be recognized and a second part representing the degree of credibility of the first part.
- the process proceeds to step S20.
- Step S20: identify the audio to be recognized according to its audio fingerprint, and obtain a recognition result.
- By acquiring and using an audio fingerprint that includes a first part representing the audio's content features and a second part indicating the degree of credibility of the first part, the audio retrieval and recognition method of this example can improve the accuracy, robustness, and efficiency of audio retrieval and recognition.
- FIG. 2 is a schematic flowchart of acquiring an audio fingerprint according to an embodiment of the present disclosure. Since an audio fingerprint can be acquired for any audio according to the method shown in FIG. 2, the description of this embodiment does not distinguish whether the audio is the audio to be recognized. Referring to FIG. 2, in an embodiment of the present disclosure, the process of acquiring an audio fingerprint in step S10 specifically includes the following steps:
- Step S11: the audio is converted into a spectrogram.
- The audio signal is converted into a time-frequency spectrogram by a short-time Fourier transform (STFT).
- The spectrogram is a commonly used two-dimensional representation of an audio signal: the horizontal axis is time t, the vertical axis is frequency f, and the value E(t, f) at each point (t, f) represents the energy of the signal.
- The specific type of the audio signal is not limited; it may be a static file or streaming audio. Thereafter, the process proceeds to step S12.
- The spectrogram may be pre-processed with a Mel (MEL) transform, which divides the spectrum into a number of frequency bins; the number of bins is configurable.
- Human auditory system (HAS) filtering may also be applied to the spectrogram; such nonlinear transformations make the spectral distribution in the spectrogram better match human auditory perception.
- Each hyperparameter of the short-time Fourier transform can be adjusted to suit different practical situations.
- The hyperparameters in step S11 may be set as follows: in the short-time Fourier transform, the time window is set to 100 ms and the hop interval to 50 ms; in the Mel transform, the number of frequency bins is set to 32 to 128.
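Step S11's STFT with the suggested 100 ms window and 50 ms hop can be sketched as below (Mel binning is omitted for brevity; the sample rate and the Hann window are assumptions, not values from the patent):

```python
import numpy as np

def spectrogram(signal, sr=8000, win_ms=100, hop_ms=50):
    """Energy spectrogram E(t, f) via a short-time Fourier transform.
    Window (100 ms) and hop (50 ms) follow the hyperparameters in the text."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    # Rows index time frames t, columns index frequency bins f.
    return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2

# Toy example: 1 s of a 1 kHz tone at an assumed 8 kHz sample rate.
sr = 8000
t = np.arange(sr) / sr
S = spectrogram(np.sin(2 * np.pi * 1000 * t), sr)
peak_bin = int(S.mean(axis=0).argmax())   # frequency resolution: sr/win = 10 Hz
```

With a 10 Hz bin width, the 1 kHz tone concentrates its energy in bin 100.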
- Step S12: feature points in the spectrogram are determined.
- A feature point may be selected as an energy maximum point in the spectrogram, or as an energy minimum point.
- If the energy E(t,f) of a point (t,f) in the spectrogram simultaneously satisfies E(t,f) > E(t+1,f), E(t,f) > E(t-1,f), E(t,f) > E(t,f+1), and E(t,f) > E(t,f-1), then the point (t,f) is an energy maximum point of the spectrogram.
- Selecting energy extremum points as feature points has drawbacks: extremum points are susceptible to noise; the number of extremum points is hard to control, so one spectrogram may contain no extremum point while another contains many, making the feature points uneven; and additional timestamps must be stored to record the positions of the extremum points in the spectrogram. Therefore, instead of energy extremum points, fixed points may be selected as feature points, for example points whose frequency value equals a preset frequency value (frequency-fixed points).
- A plurality of frequency values covering low, middle, and high frequencies may be preset (the specific low, middle, and high values can be set as needed), so that the selected feature points are more uniform. Note that fixed points may also be selected by other criteria, for example points whose energy equals one or more preset energy values.
- The hyperparameter in step S12 may be set so that the density of feature points is 20 to 80 points per second.
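The two feature-point strategies above (energy maxima versus fixed frequency points) can be sketched as:

```python
import numpy as np

def energy_maxima(S):
    """Energy maximum points of step S12: E(t,f) strictly greater than its
    four neighbours in time and frequency (rows = t, columns = f)."""
    core = S[1:-1, 1:-1]
    mask = ((core > S[:-2, 1:-1]) & (core > S[2:, 1:-1]) &
            (core > S[1:-1, :-2]) & (core > S[1:-1, 2:]))
    t, f = np.nonzero(mask)
    return list(zip((t + 1).tolist(), (f + 1).tolist()))

def fixed_frequency_points(S, freq_bins):
    """Fixed feature points: every time frame at preset frequency bins.
    Density is exactly len(freq_bins) per frame, and no timestamps need
    to be stored, which is the advantage the text describes."""
    return [(t, f) for t in range(S.shape[0]) for f in freq_bins]

S = np.zeros((5, 5)); S[2, 2] = 1.0   # toy spectrogram with a single peak
```

On this toy input the extremum detector finds the one peak, while the fixed strategy yields a perfectly uniform grid of points.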
- Step S13: on the spectrogram, in the vicinity of each feature point, one or more masks are determined for the feature point, and each mask includes (or covers) a plurality of regions on the spectrogram (which may be called spectral regions). Thereafter, the process proceeds to step S14.
- The plurality of spectral regions included in each mask may be symmetrically distributed. For example:
- a mask containing two spectral regions R11 and R12 may be determined for a feature point, where R11 and R12 are both located to the left of the feature point, R11 is to the left of R12, and R11 and R12 cover the same frequency bins;
- a mask containing two spectral regions R13 and R14 may be determined for a feature point, where R13 is above the feature point, R14 is below it, and R13 and R14 cover the same time range;
- a mask containing two spectral regions R15 and R16 may be determined for a feature point, where R15 is at the upper left of the feature point, R16 is at the lower right, and R15 and R16 are centrally symmetric about the feature point.
- The plurality of spectral regions included in one mask may simultaneously satisfy several kinds of symmetric distribution.
- For example, a mask containing four spectral regions R21, R22, R23, and R24 may be determined for a feature point, where R21, R22, R23, and R24 are located at the upper left, upper right, lower left, and lower right of the feature point respectively; R21 and R22 cover the same frequency range, as do R23 and R24; R21 and R23 cover the same time range, as do R22 and R24; and the four spectral regions are centrally symmetric about the feature point.
- The spectral regions of a mask need not be centrally symmetric about the feature point; for example, they may all be located to the left of the feature point while being distributed on both sides of it along the frequency axis.
- The spectral regions within one mask may overlap one another, and different masks may also overlap one another.
- each mask may contain an even number of spectral regions.
- The masks may be determined according to a fixed preset standard, that is, the position and coverage of each mask in the spectrogram are preset.
- Alternatively, the mask regions may be determined automatically by a data-driven method, without fixing the position and coverage of the masks in advance: from a large number of candidate masks, those with the smallest covariance and the greatest discriminative power are selected.
- Step S14: the mean energy of each spectral region is determined. For a spectral region containing only one point, the mean energy is the energy value of that point; when the spectral region consists of a plurality of points, the mean energy may be set to the average of the energy values of those points. Thereafter, the process proceeds to step S15.
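Steps S13 and S14 can be illustrated with a hypothetical four-region mask and its region mean energies; the region sizes and the exact layout around the feature point are assumptions for illustration:

```python
import numpy as np

def region_mean(S, t0, t1, f0, f1):
    """Mean energy of the spectral region [t0, t1) x [f0, f1) (step S14);
    for a one-point region this is just that point's energy value."""
    return float(S[t0:t1, f0:f1].mean())

def four_region_mask(t, f, dt=2, df=2):
    """Hypothetical four-region mask around feature point (t, f):
    R21..R24 at the upper-left / upper-right / lower-left / lower-right
    (rows = time, columns = frequency), collectively centrally symmetric
    about the feature point, as one of the layouts in the text."""
    return {
        "R21": (t - dt, t, f, f + df),   # earlier time, higher frequency
        "R22": (t, t + dt, f, f + df),   # later time, higher frequency
        "R23": (t - dt, t, f - df, f),   # earlier time, lower frequency
        "R24": (t, t + dt, f - df, f),   # later time, lower frequency
    }

S = np.arange(100.0).reshape(10, 10)     # toy spectrogram: E(t, f) = 10t + f
regions = four_region_mask(5, 5)
means = {name: region_mean(S, *r) for name, r in regions.items()}
```

These four mean energies are exactly the E(R21)..E(R24) terms that Formula 2 below combines into a fingerprint bit.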
- Step S15: an audio fingerprint bit is determined according to the mean energies of the plurality of spectral regions in a mask. Note that the audio fingerprint bits form the first part of the audio fingerprint, which represents the content features of the audio. Thereafter, the process proceeds to step S16.
- One audio fingerprint bit may be determined according to the difference between the mean energies of the plurality of spectral regions included in one mask.
- For example, the difference D1 between the mean energies of R11 and R12 can be calculated according to Formula 1:
- D1 = E(R11) - E(R12). (Formula 1) The sign of D1 is then judged: if D1 is positive, an audio fingerprint bit with value 1 is obtained; if D1 is negative, an audio fingerprint bit with value 0 is obtained.
- As another example, the difference D2 of the mean energies of R21, R22, R23, and R24 can be calculated according to Formula 2:
- D2 = (E(R21) + E(R22)) - (E(R23) + E(R24)). (Formula 2) The sign of D2 is then judged: if D2 is positive, an audio fingerprint bit with value 1 is obtained; if D2 is negative, an audio fingerprint bit with value 0 is obtained. Note that the audio fingerprint bit of a mask containing four spectral regions need not be determined by D2; other forms of difference may also be used, for example the second-order difference D3 of the mean energies of the four spectral regions.
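Formulas 1 and 2 reduce to the same operation: the sign of the difference between two groups of region mean energies. A minimal sketch (the grouping into a positive and a negative group is an illustrative generalization of the two formulas):

```python
def fingerprint_bit(pos_region_means, neg_region_means):
    """One fingerprint bit from a mask (step S15): the sign of the
    difference between two groups of region mean energies. With one
    region per group this is Formula 1 (D1); with two per group it is
    Formula 2 (D2). The difference is also returned, since step S16
    reuses its magnitude for the strong/weak decision."""
    d = sum(pos_region_means) - sum(neg_region_means)
    return (1 if d > 0 else 0), d

bit1, d1 = fingerprint_bit([5.0], [3.0])            # Formula 1: D1 = 2.0 -> bit 1
bit2, d2 = fingerprint_bit([4.0, 2.0], [5.0, 3.0])  # Formula 2: D2 = -2.0 -> bit 0
```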
- Step S16 determining a strong and weak weight bit corresponding to the audio fingerprint bit, where the strong and weak weight bit is used to indicate the degree of trust of the audio fingerprint bit.
- the strong and weak weight bits are the second part of the aforementioned audio fingerprint for indicating the degree of trust of the first part.
- the audio fingerprint bit with high reliability is defined as a strong bit
- the audio fingerprint bit with low reliability is defined as a weak bit.
- the degree of trust of an audio fingerprint bit is determined, and the value of the strong and weak weight bits is determined according to whether the audio fingerprint bit is a strong bit or a weak bit. Thereafter, the processing proceeds to step S17.
- step S16 specifically includes: determining whether the absolute value of the difference used to generate the audio fingerprint bit reaches (or exceeds) a preset strong and weak bit threshold; if the threshold is reached, the audio fingerprint bit is determined to be a strong bit, and a strong and weak weight bit with a value of 1 corresponding to the audio fingerprint bit is obtained; if the threshold is not reached, the audio fingerprint bit is determined to be a weak bit, and a strong and weak weight bit with a value of 0 corresponding to the audio fingerprint bit is obtained.
- step S16 includes: determining the magnitude relationship between the absolute value of the difference D2 and the preset strong and weak bit threshold T.
- the strong and weak bit threshold may be of several types. The threshold may be a preset fixed value, for example fixed at 1. Alternatively, the threshold may be a value derived from the differences of the mean energy: for example, it may be set to the average of the plurality of differences corresponding to the plurality of masks (or plurality of feature points) (in fact, it is not limited to the average; any value between the largest difference and the smallest difference may be used), so that an audio fingerprint bit whose difference reaches the average is determined as a strong bit, and one whose difference does not reach the average is determined as a weak bit. Alternatively, the threshold may be a proportional value: for example, it may be set to 60%, so that among the plurality of differences corresponding to the plurality of masks (or feature points), if the absolute value of a difference falls within the top 60% of all differences, the audio fingerprint bit is determined to be a strong bit; otherwise the audio fingerprint bit is determined to be a weak bit.
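The fixed-value and proportional threshold variants described above might be sketched as follows; the names are illustrative, and the threshold of 1 and the 60% ratio are simply the examples given in the text:

```python
def strong_weak_bit_fixed(diff, threshold=1.0):
    """Strong bit (1) if |diff| reaches the fixed threshold, weak bit (0) otherwise."""
    return 1 if abs(diff) >= threshold else 0

def strong_weak_bits_proportional(diffs, top_ratio=0.6):
    """Mark as strong (1) the bits whose |diff| falls in the top `top_ratio`
    of all differences for the feature point; the rest are weak (0)."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    n_strong = int(len(diffs) * top_ratio)
    strong = [0] * len(diffs)
    for i in order[:n_strong]:
        strong[i] = 1
    return strong
```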
- Step S17 determining an audio fingerprint of the audio according to the audio fingerprint bit and the strong weak weight bit.
- the composition of the audio fingerprint and the length of the audio fingerprint are not limited, provided that the audio fingerprint includes the audio fingerprint bits corresponding to one or more feature points (forming the first part of the audio fingerprint) and the corresponding strong and weak weight bits (forming the second part of the audio fingerprint).
- the audio fingerprint includes a plurality of audio fingerprint units and a strong and weak weight unit corresponding to each audio fingerprint unit; each audio fingerprint unit includes a plurality of the audio fingerprint bits, and each strong and weak weight unit includes the plurality of strong and weak weight bits corresponding to those audio fingerprint bits.
- for example, the audio fingerprint bits corresponding to all masks of one feature point may be combined into an audio fingerprint bit sequence serving as one audio fingerprint unit, and the corresponding strong and weak weight bits may be combined into a strong and weak weight bit sequence of equal length; the audio fingerprint units and strong and weak weight units corresponding to the plurality of feature points are then arranged in chronological order of the feature points to form the audio fingerprint.
- the length of the obtained audio fingerprint unit may be 32 bits.
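As an illustration of how the bits of one feature point might be packed into a 32-bit fingerprint unit with its matching weight sequence, here is a minimal sketch; the packing scheme and names are assumptions, not prescribed by the disclosure:

```python
def make_fingerprint_unit(fp_bits, weight_bits):
    """Combine 32 per-mask fingerprint bits into a 32-bit fingerprint unit and
    the 32 strong/weak bits into a matching weight word (both as ints)."""
    assert len(fp_bits) == len(weight_bits) == 32
    unit = 0
    weights = 0
    for fb, wb in zip(fp_bits, weight_bits):
        unit = (unit << 1) | fb        # append next fingerprint bit
        weights = (weights << 1) | wb  # append next strong/weak weight bit
    return unit, weights
```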
- the present disclosure can generate an audio fingerprint with high accuracy and good robustness for a piece of audio by extracting the strong and weak weight bits corresponding to the audio fingerprint bit while extracting the audio fingerprint bit.
- step S10 of the present disclosure further includes: adding a timestamp field to the audio fingerprint, i.e., a field for indicating the time difference between the audio start position and the feature point; the field may be a hash value.
- when the feature point is set to a fixed point, this step is not necessary; that is, the timestamp need not be recorded.
- step S10 of the present disclosure further includes: adding an audio identification field to the audio fingerprint, for recording ID identification information of the audio corresponding to the audio fingerprint, and the field may be a hash value.
- the step S10 of the present disclosure further includes: dividing the original audio into multiple pieces of sub-audio according to time; extracting an audio fingerprint for each piece of sub-audio according to the steps of the foregoing method to obtain a plurality of audio fingerprints; and extracting audio of each piece of sub-audio The fingerprints are combined to obtain an audio fingerprint of the entire audio.
- the audio fingerprint of the audio to be identified may be referred to as the first audio fingerprint, the audio fingerprint units included in the first audio fingerprint may be referred to as first audio fingerprint units, and the strong and weak weight units corresponding to the first audio fingerprint units may be referred to as first strong and weak weight units.
- FIG. 3 is a schematic flow chart of retrieving and recognizing audio according to an audio fingerprint according to an embodiment of the present disclosure.
- the process of performing the retrieval and identification of the audio to be identified in step S20 includes the following steps:
- Step S21 Perform a first ranking on the plurality of known audios according to the first audio fingerprint, and take the first k known audios as the first candidate audio set according to the result of the first ranking.
- k is a positive integer
- the specific value of k is configurable.
- the first ranking is a ranking based on the matching of each individual first audio fingerprint unit with known audio.
- the first ranking may be a term frequency-inverse document frequency ranking (referred to as TF-IDF ranking) according to each first audio fingerprint unit.
- Step S22 performing a second ranking on the first candidate audio set according to the first audio fingerprint, and extracting, according to the result of the second ranking, the first n first candidate audios in the first candidate audio set as the recognition result.
- n is a positive integer
- the specific value of n can be set.
- the second ranking is a ranking of the audio in the first candidate audio set by the first audio fingerprint unit arranged in a plurality of orders.
- the plurality of sequentially arranged first audio fingerprint units may include a continuous portion of the first audio fingerprint, or the first audio fingerprint as a whole; and/or they may include a plurality of first audio fingerprint units whose sequence numbers in the first audio fingerprint are equally spaced, such as the first audio fingerprint units with sequence numbers 1, 3, 5, 7, and so on.
- the search is performed in the Meta database according to the recognition result, and the audio information of the recognition result, such as the name, author, source, and the like of the recognized audio, can be obtained.
- the recognition result includes a plurality of audios
- information of the plurality of recognized audios can be simultaneously provided.
- the first audio fingerprint units may be weighted based on the strong and weak weight units in the audio fingerprint. Since the unweighted first ranking and second ranking processes are equivalent to applying the same weight to each audio fingerprint unit at ranking time, the following description covers only the first ranking and the second ranking weighted with the strong and weak weight units.
- the media retrieval method proposed by the present disclosure can greatly improve the accuracy and efficiency of media retrieval by performing the first ranking and the second ranking to obtain the retrieval result.
- the aforementioned known audio can be audio in an audio database.
- An audio fingerprint of each known audio is stored in the audio database; the stored audio fingerprint is of the same type as the first audio fingerprint and is obtained by the same extraction method, so that
- the audio fingerprint of the known audio also includes a first portion for representing the content features of the audio and a second portion for indicating the degree of trust of the first portion.
- the audio retrieval and recognition method of the present disclosure further includes: acquiring audio fingerprints of a plurality of known audios in advance, and for convenience of description and understanding, the audio fingerprint of the known audio may be referred to as a second audio fingerprint.
- the audio fingerprint units included in the second audio fingerprint are referred to as second audio fingerprint units, and the strong and weak weight units included in the second audio fingerprint are referred to as second strong and weak weight units; the second audio fingerprint is indexed to obtain a fingerprint index of the known audio in advance; the fingerprint index is then matched with the first audio fingerprint units of the audio to be identified to perform the TF-IDF ranking on the plurality of known audios.
- obtaining the fingerprint index of the known audio in advance further includes: obtaining a forward (positive) index and an inverted index of the audio fingerprints of the known audio in advance, so as to facilitate retrieval and comparison of the audio fingerprints.
- the positive fingerprint index and the inverted fingerprint index may be pre-stored in the audio database.
- the positive fingerprint index is used to record the audio fingerprint of each known audio, that is, which audio fingerprint units each known audio contains and in what order; the inverted fingerprint index is used to record in which known audios' fingerprints each audio fingerprint unit appears.
- the positive fingerprint index and the inverted fingerprint index may be stored in the form of key-value pairs. In the positive fingerprint index, a key indicates the number of an audio (also called the audio ID), and the value corresponding to the key records which audio fingerprint units the audio contains and their order; the keys and values in the positive fingerprint index may be called the forward key and the forward value. In the inverted fingerprint index, a key represents an audio fingerprint unit, and the value corresponding to the key records the numbers of the audios containing that audio fingerprint unit; the keys and values in the inverted fingerprint index may be called the inverted key and the inverted value, respectively.
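The key-value layout of the forward (positive) and inverted indexes might be sketched with plain dictionaries; this is a minimal illustration under assumed names, not the disclosed storage format:

```python
def build_indexes(known_audios):
    """known_audios: dict mapping audio ID -> ordered list of fingerprint units.
    Returns (forward_index, inverted_index) as key-value stores."""
    forward = {}
    inverted = {}
    for audio_id, units in known_audios.items():
        forward[audio_id] = list(units)  # forward key: audio ID -> ordered units
        for unit in units:
            # inverted key: fingerprint unit -> set of audio IDs containing it
            inverted.setdefault(unit, set()).add(audio_id)
    return forward, inverted
```

In a production system these two maps would live in a database or a search engine rather than in memory, but the key/value shapes are the same.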
- the second audio fingerprint can be indexed according to the strong and weak weights to improve the robustness.
- the strong and weak weight cells corresponding to the respective audio fingerprints of the known audio may be recorded in the positive fingerprint index.
- the weak bits in the audio fingerprint unit to be indexed may be ignored, so that only the strong bits of the audio fingerprint unit to be indexed are used in matching;
- in that case the inverted fingerprint index for the audio fingerprint unit to be indexed records: the numbers of the known audios whose fingerprints contain audio fingerprint units matching the audio fingerprint unit to be indexed on its strong bits (for example, its first and third audio fingerprint bits).
- the TF-IDF ranking is a ranking technique that judges the importance of information by weighting word frequency and inverse file frequency.
- the word frequency refers to the frequency with which a word (or a piece of information) appears in an article (or file): the higher the word frequency, the more important the word is to that article. The file frequency refers to the number of articles in the article library in which the word appears, and the inverse file frequency is the reciprocal of the file frequency (in actual calculation, the logarithm may also be taken, i.e., the inverse file frequency is the logarithm of the reciprocal of the file frequency): the higher the inverse file frequency, the better the word discriminates between articles.
- the TF-IDF ranking ranks by the magnitude of the product of the word frequency and the inverse file frequency.
- an audio fingerprint of an audio can be treated as an article, and each audio fingerprint unit as a word, so that known audio can be ranked using the TF-IDF method.
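Under this article/word analogy, the TF-IDF score of one candidate audio might be sketched as follows; the function name and the use of a natural logarithm for the inverse file frequency are assumptions, not specified by the disclosure:

```python
import math

def tf_idf_score(query_units, candidate_units, inverted_index, total_audios):
    """Sum, over the distinct query fingerprint units, of the word frequency in
    the candidate times the log of the inverse file frequency over all audios."""
    score = 0.0
    for unit in set(query_units):
        tf = candidate_units.count(unit) / len(candidate_units)  # word frequency
        df = len(inverted_index.get(unit, ())) / total_audios    # file frequency
        if df > 0:
            score += tf * math.log(1.0 / df)                     # idf = log(1/df)
    return score
```

Note that a unit appearing in every known audio contributes nothing (its idf is log 1 = 0), which is exactly the discrimination property described above.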
- the known audio in the audio database can be absolutely matched.
- the absolute matching selects, as the second candidate audio set, the known audios whose fingerprints contain at least a preset number or preset proportion of the first audio fingerprint units.
- the second candidate audio set is then ranked first to select the first candidate audio set.
- FIG. 4 is a schematic flow chart of a first ranking provided by an embodiment of the present disclosure.
- the first ranking specifically includes the following steps:
- Step S31: according to the inverted fingerprint index, count in which known audios' second audio fingerprints each first audio fingerprint unit appears, so as to match, from the audio database, the known audios containing a preset number of the first audio fingerprint units as the second candidate audio set. It should be noted that, in the matching process, the strong and weak weight units corresponding to the first audio fingerprint units may be used so that only the strong bits of the first audio fingerprint units are matched against the second audio fingerprints of the known audios, while the matching of weak bits is ignored, to improve robustness. Thereafter, the processing proceeds to step S32.
- the “number” in the “preset number of first audio fingerprint units” refers to the number of distinct first audio fingerprint units.
- the preset number may be one, so that the matched second candidate audio set consists of the known audios in whose second audio fingerprints at least one of the first audio fingerprint units appears; the preset number may also be larger, say p (p being a positive integer), so that the matched second candidate audio set consists of the known audios in whose second audio fingerprints at least p first audio fingerprint units appear.
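Step S31's absolute matching against the inverted index, with a preset count p, might look like this illustrative sketch (names are assumptions):

```python
def absolute_match(query_units, inverted_index, p=1):
    """Recall the known audios whose fingerprints contain at least p distinct
    query fingerprint units (the second candidate audio set)."""
    hits = {}
    for unit in set(query_units):
        for audio_id in inverted_index.get(unit, ()):
            hits[audio_id] = hits.get(audio_id, 0) + 1  # count distinct matches
    return {aid for aid, count in hits.items() if count >= p}
```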
- Step S32 determining a word frequency of a first audio fingerprint unit in a second audio fingerprint of a second candidate audio based on the positive fingerprint index.
- the word frequency is: the proportion of a first audio fingerprint unit among all the audio fingerprint units included in a second audio fingerprint.
- the positive fingerprint index may be the aforementioned fingerprint index obtained according to the strong and weak weights.
- Step S33 determining a file frequency of a first audio fingerprint unit based on the inverted fingerprint index.
- the file frequency is: among a plurality of known audios (for example, all known audios in the audio database), the proportion of known audios whose second audio fingerprints contain the first audio fingerprint unit to the total number of known audios.
- the inverted fingerprint index may be the aforementioned fingerprint index obtained according to the strong and weak weights.
- Step S34: determining a word frequency-inverse file frequency score of the second candidate audio according to the word frequency of each first audio fingerprint unit in the second audio fingerprint of the second candidate audio and the file frequency of each first audio fingerprint unit. Thereafter, the process proceeds to step S35.
- Step S35: ranking the second candidate audio set according to the obtained word frequency-inverse file frequency score of each second candidate audio to obtain the result of the first ranking, and taking the first k second candidate audios from the first ranking result as
- the first candidate audio set. At the same time, the second audio fingerprint (positive fingerprint index) of each first candidate audio may also be returned, for further processing of the first candidate audio set based on the second audio fingerprints in the subsequent second ranking.
- an index server may be used: the first audio fingerprint units of the audio to be identified serve as an index request, and the absolute matching and the TF-IDF ranking are performed according to the foregoing positive fingerprint index and inverted fingerprint index,
- so that the first candidate audio set is recalled and the positive fingerprint index of each first candidate audio is returned at the same time.
- the above-described various steps can be performed using the open source Elasticsearch search engine to achieve the effect of fast retrieval.
- the audio retrieval and recognition method proposed by the present disclosure can perform absolute matching and first ranking based on the TF-IDF method according to an audio fingerprint containing strong and weak weights, which can greatly improve the accuracy and efficiency of audio retrieval and recognition.
- the second ranking ranks the audios in the first candidate audio set according to how sequences of first audio fingerprint units arranged in a plurality of orders appear in the audio fingerprints of the first candidate audios,
- and includes: obtaining a similarity matrix between the first audio fingerprint and the audios in the first candidate audio set according to the fingerprint index of the known audio and the first audio fingerprint, and then ranking the audios in the first candidate audio set according to the similarity matrix.
- weighting may be performed according to strong and weak weights corresponding to the first audio fingerprint and/or strong and weak weights in the fingerprint index of the known audio, and the weighted similarity matrix may be utilized.
- the audio in the first candidate audio set is ranked to improve robustness.
- FIG. 5 is a schematic flow chart of a second ranking provided by an embodiment of the present disclosure.
- the second ranking specifically includes the following steps:
- Step S41 Acquire a second audio fingerprint of a first candidate audio in the first candidate audio set (in fact, each first candidate audio is a known audio).
- the second audio fingerprint may be acquired according to a fingerprint index of the known audio (e.g., the positive fingerprint index). It may be assumed that the first audio fingerprint of the audio to be identified includes M1 first audio fingerprint units, and the second audio fingerprint of the first candidate audio includes M2 second audio fingerprint units, where M1 and M2 are positive integers.
- the first audio fingerprint includes a strong and weak weight unit corresponding to each first audio fingerprint unit (which may be referred to as a first strong and weak weight unit), and/or the second audio
- fingerprint includes a strong and weak weight unit corresponding to each second audio fingerprint unit (which may be referred to as a second strong and weak weight unit). Thereafter, the processing proceeds to step S42.
- Step S42: determining a unit similarity between each second audio fingerprint unit included in the second audio fingerprint of the first candidate audio and each first audio fingerprint unit, to obtain M1 × M2 unit similarities.
- Each unit similarity indicates the degree of similarity between a first audio fingerprint unit and a second audio fingerprint unit: the greater the unit similarity, the more similar the two units.
- each first audio fingerprint unit and second audio fingerprint unit may be weighted according to the first strong and weak weight unit and/or the second strong and weak weight unit, and the unit similarity is then determined based on the weighted first and second audio fingerprint units.
- for example, the first audio fingerprint unit and the second audio fingerprint unit may each be weighted by the second strong and weak weight unit. Thereafter, the processing proceeds to step S43.
- a distance or metric capable of measuring the degree of similarity of two audio fingerprint units may be selected as the unit similarity according to the type of the audio fingerprint. Specifically, when the first audio fingerprint unit and the second audio fingerprint unit are binary fingerprints obtained according to steps S11 to S17 of the foregoing embodiment, the Hamming distance between the first audio fingerprint unit and the second audio fingerprint
- unit may be calculated, the difference between the length of the audio fingerprint unit (its number of bits) and the Hamming distance obtained, and the ratio of that difference to the length of the audio fingerprint unit determined as the unit similarity, representing the proportion of identical bits in the two binary fingerprints.
- the Hamming distance is a commonly used metric in the field of information theory.
- the Hamming distance between two equal-length strings is the number of different characters corresponding to the positions of the two strings.
- in practice, the two strings can be XORed and the number of 1s in the result counted; this count is the Hamming distance.
- the specific method of the present disclosure for weighting a Hamming-distance-type similarity with the strong and weak weights is: first weight the corresponding audio fingerprint bits in the audio fingerprint unit with the strong and weak weight bits of the strong and weak weight unit, and then XOR the first audio fingerprint unit with the second audio fingerprint unit to obtain the unit similarity weighted by the strong and weak weights.
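One plausible realization of the strong/weak-weighted Hamming similarity is to ignore weak-bit positions via a strong-bit mask; the masking scheme and names below are assumptions used for illustration:

```python
def unit_similarity(unit_a, unit_b, strong_mask):
    """Similarity of two binary fingerprint units as the fraction of equal
    bits, counted only at positions marked strong in `strong_mask`."""
    diff = (unit_a ^ unit_b) & strong_mask   # XOR, then keep strong positions
    hamming = bin(diff).count("1")           # weighted Hamming distance
    n_strong = bin(strong_mask).count("1")   # effective fingerprint length
    if n_strong == 0:
        return 0.0
    return (n_strong - hamming) / n_strong
```

The score is 1.0 when the units agree on every strong bit and 0.0 when they disagree on all of them, matching the "proportion of identical bits" definition above.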
- any distance or metric that can determine the similarity of two audio fingerprint units can be utilized.
- Step S43 determining a similarity matrix between the first candidate audio and the to-be-identified audio according to each individual similarity.
- each point in the similarity matrix corresponds to one unit similarity, so that the similarity matrix records the unit similarity between each second audio fingerprint unit of the first candidate audio and each first audio fingerprint unit.
- the points of the similarity matrix are arranged in the horizontal direction in the order of the first audio fingerprint units in the first audio fingerprint of the audio to be recognized, and in the vertical direction in the order of the second audio fingerprint units of the first candidate audio in the second audio fingerprint.
- the point located in the i-th row and j-th column represents the unit similarity between the i-th first audio fingerprint unit of the audio to be recognized and the j-th second audio fingerprint unit of the first candidate audio; the similarity matrix is thus an M1 × M2 matrix.
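Building the M1 × M2 similarity matrix from any unit-similarity function can be sketched as follows (illustrative names; row i corresponds to the i-th first audio fingerprint unit, column j to the j-th second audio fingerprint unit, as described above):

```python
def similarity_matrix(query_units, candidate_units, sim):
    """Return the M1 x M2 matrix where cell (i, j) is
    sim(query_units[i], candidate_units[j])."""
    return [[sim(q, c) for c in candidate_units] for q in query_units]
```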
- it is not necessary to first complete the unit similarity calculation of step S42 and then determine the similarity matrix of step S43; the similarity matrix may instead be determined directly, with the corresponding unit similarity calculated while determining each point of the matrix.
- Step S44 Determine a sequence similarity score of the first candidate audio according to a similarity matrix of each first candidate audio.
- the sequence similarity score is used to represent the degree of similarity between the first candidate audio and the audio to be recognized.
- the sequence similarity score can be a score between 0 and 1, the larger the number, the more similar the two segments of audio. Thereafter, the process proceeds to step S45.
- the sequence similarity score is determined according to a straight line in the similarity matrix.
- the similarity matrix is a finite matrix, so the so-called "straight line" here is a line segment consisting of a finite number of points in the similarity matrix. The line has a slope, namely the slope of the straight line connecting the points it contains.
- the starting point and the ending point of the straight line may be any points in the similarity matrix, and are not necessarily points located at the edge.
- the straight lines in the present disclosure include the diagonal of the similarity matrix and the line segments parallel to it (lines from the upper left to the lower right of the matrix have a slope of 1), and are not limited to slope 1: a line may have a slope of approximately 1, to improve the robustness of the audio retrieval and recognition; it may have a slope of 2, 3, ... or 1/2, 1/3, ..., etc., so that speed-adjusted audio can still be retrieved and recognized; it may even have a negative slope (a line from the lower left to the upper right of the matrix), to handle retrieval and recognition of audio that has been reversed.
- the diagonal is the line segment consisting of the points at (1,1), (2,2), (3,3), ... (in fact, the line starting from the point in the upper left corner and having a slope of 1).
- each straight line in the similarity matrix is composed of a plurality of unit similarities arranged in order, so that each line represents the similarity of a plurality of sequentially arranged audio fingerprint unit pairs, and can thereby express the degree to which an audio segment in the audio to be recognized is similar to an audio segment in the known audio.
- each audio fingerprint unit pair includes one first audio fingerprint unit and one second audio fingerprint unit (that is, each line represents the degree of similarity between a plurality of sequentially arranged first audio fingerprint units and a plurality of sequentially arranged second audio fingerprint units).
- the slope of the line and the end point of the line represent the length and position of the two audio segments.
- for example, a straight line composed of (1,1), (2,3), (3,5), (4,7) contains the similarity between the first audio fingerprint unit with ordinal number 1 and the second audio fingerprint unit with ordinal number 1, the similarity between the first audio fingerprint unit with ordinal number 2 and the second audio fingerprint unit with ordinal number 3, and so on, so that the line can reflect the similarity between the audio segment formed by first audio fingerprint units 1 to 4 and the audio segment formed by second audio fingerprint units 1, 3, 5, 7.
- the similarity between a first candidate audio and the audio to be recognized can be determined according to a straight line in the similarity matrix: the average (or overall level) of the unit similarities contained in a line may be defined as the line similarity of that line; the line similarity reflects the similarity between the corresponding plurality of first audio fingerprint units and plurality of second audio fingerprint units. The line with the highest line similarity in the similarity matrix is determined (it may be called the matching line), and the line similarity of the matching line is determined as the sequence similarity score of the first candidate audio.
- the line with the highest line similarity may be determined from a plurality of preset lines; for example, the preset lines may all be lines whose slope is set to a preset fixed value such as 1.
- alternatively, a line may be fitted from the points so as to generate the line that maximizes the line similarity.
- Step S45 The first candidate audio set is ranked according to the sequence similarity score of each first candidate audio, and the result of the second ranking is obtained, and the first n first candidate audios are taken out from the second ranking result as the recognition result.
- the audio retrieval and recognition method proposed by the present disclosure can greatly improve the accuracy and efficiency of audio retrieval recognition according to the audio fingerprint including the strong and weak weights and the second ranking based on the similarity matrix.
- FIG. 6 is a schematic flow chart of performing audio retrieval and recognition by using a dynamic programming method according to an embodiment of the present disclosure. Referring to FIG. 6, in an embodiment, step S44 includes the following specific steps:
- Step S44-1a: defining the plurality of straight lines in the similarity matrix whose slope equals a preset slope setting value as candidate lines, and determining the line similarity of each candidate line according to the unit similarities it contains.
- the line similarity of a line may be set to the average of the unit similarities it contains, or to the sum of the unit similarities it contains.
- the slope setting value may be taken as 1, that is, the aforementioned alternative straight line is: a diagonal line in the similarity matrix and a straight line parallel to the diagonal line.
- step S44-1a further includes: excluding from the candidate lines those containing fewer unit similarities than a preset line length setting value. Thereafter, the processing proceeds to step S44-1b.
- in other words, a candidate line must also contain a number of unit similarities reaching the preset line length setting value.
- Step S44-1b: from the plurality of candidate lines, determining the candidate line with the highest line similarity and defining it as the first matching line. Thereafter, the processing proceeds to step S44-1c.
- step S44-1c the straight line similarity of the first matching straight line is determined as a sequence similarity score.
- the preset slope setting values in step S44-1a may be multiple; that is, a candidate line is any line whose slope equals any one of the plurality of slope setting values.
- for example, a candidate line may be a line with a slope of 1, -1, 2, 1/2, etc., and in step S44-1b the first matching line is determined from the plurality of candidate lines whose slopes come from any of the plurality of slope setting values.
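The candidate-line scan of steps S44-1a and S44-1b can be illustrated with a minimal sketch restricted to slope-1 lines (the diagonal and its parallels); function and parameter names are assumptions, not from the disclosure:

```python
def best_diagonal_score(sim_matrix, min_len=1):
    """Scan every slope-1 diagonal of the similarity matrix and return the
    highest mean unit similarity among diagonals with at least min_len points
    (min_len plays the role of the line length setting value)."""
    m1 = len(sim_matrix)
    m2 = len(sim_matrix[0]) if m1 else 0
    best = 0.0
    for offset in range(-(m1 - 1), m2):  # each slope-1 diagonal j = i + offset
        vals = [sim_matrix[i][i + offset]
                for i in range(m1) if 0 <= i + offset < m2]
        if len(vals) >= min_len:
            best = max(best, sum(vals) / len(vals))
    return best
```

Extending this to other slope setting values (2, 1/2, -1, ...) only changes how the column index is derived from the row index.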
- the audio retrieval and recognition method proposed by the present disclosure can improve the accuracy and efficiency of audio retrieval and recognition by using the dynamic programming method to determine the sequence similarity score.
- FIG. 7 is a schematic flow chart of performing audio retrieval and recognition by using a uniform media method according to an embodiment of the present disclosure. Referring to FIG. 7, in an embodiment, step S44 includes the following specific steps:
- in step S44-2a, a plurality of points with the largest unit similarities in the similarity matrix are selected as similarity extreme points.
- the specific number of similarity extreme points taken may be preset. Thereafter, the processing proceeds to step S44-2b.
- Step S44-2b based on the plurality of similarity extreme points, fitting a straight line as the second matching straight line in the similarity matrix.
- a straight line whose slope equals, or is close to, a preset slope setting value may be fitted as the second matching straight line based on the plurality of similarity extreme points; for example, a line with a slope close to 1 may be fitted.
- the fitting may use the Random Sample Consensus method (RANSAC method for short). The RANSAC method is a commonly used method that calculates the parameters of a mathematical model from a set of sample data containing abnormal data, so as to obtain the valid sample data. Thereafter, the processing proceeds to step S44-2c.
- Step S44-2c: determining the sequence similarity score according to the plurality of monomer similarities included on the second matching straight line. Specifically, the average value of the monomer similarities on the second matching straight line may be determined as the sequence similarity score.
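Steps S44-2a to S44-2c can be sketched with a small hand-rolled RANSAC loop; the iteration count, tolerance, and inlier scoring below are illustrative assumptions, not parameters given by the disclosure:

```python
import numpy as np

def ransac_matching_line(sim, n_points=10, tol=1.0, seed=0):
    """Take the n_points entries of the similarity matrix with the
    largest monomer similarity (step S44-2a), fit a straight line
    through them with a simple RANSAC loop (step S44-2b), and return
    (slope, intercept, score) where score is the mean monomer
    similarity of the inlier points (step S44-2c)."""
    rng = np.random.default_rng(seed)
    flat = np.argsort(sim, axis=None)[-n_points:]   # similarity extreme points
    rows, cols = np.unravel_index(flat, sim.shape)
    best = None
    for _ in range(50):                             # RANSAC iterations
        i, j = rng.choice(n_points, size=2, replace=False)
        if cols[i] == cols[j]:
            continue                                # vertical line, skip
        slope = (rows[i] - rows[j]) / (cols[i] - cols[j])
        intercept = rows[i] - slope * cols[i]
        # inliers: extreme points within tol of the candidate line
        resid = np.abs(rows - (slope * cols + intercept))
        inl = resid <= tol
        score = float(np.mean(sim[rows[inl], cols[inl]]))
        if best is None or int(inl.sum()) > best[3]:
            best = (slope, intercept, score, int(inl.sum()))
    return best[:3]
```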
- the audio retrieval and recognition method proposed by the present disclosure can improve the accuracy and efficiency of audio retrieval and recognition by using the uniform media method to determine the sequence similarity score.
- the similarity matrix may be obtained by comprehensive consideration of various audio similarities.
- the audio retrieval and recognition of the present disclosure further includes: acquiring a plurality of types of first audio fingerprints of the audio to be identified, acquiring a plurality of types of second audio fingerprints of the audio in the first candidate audio set, and determining the similarity matrix according to the fingerprint index obtained from the plurality of types of second audio fingerprints and the plurality of types of first audio fingerprints.
- FIG. 8 is a schematic flow chart of determining a similarity matrix based on multiple types of first audio fingerprints and second audio fingerprints for audio retrieval according to an embodiment of the present disclosure.
- an audio retrieval and recognition method of the present disclosure includes:
- Step S51: Acquire a plurality of types of first audio fingerprints of the to-be-identified audio by using a plurality of audio fingerprint extraction methods; each type of first audio fingerprint includes a plurality of first portions for indicating audio content features, which may be referred to as first audio fingerprint units.
- at least some types of first audio fingerprints comprise a second portion for indicating the degree of trust of the first portion.
- for example, one of the types may be the audio fingerprint obtained by the method of steps S11 to S17 in the foregoing embodiment, while other types of audio fingerprints are acquired simultaneously. Thereafter, the processing proceeds to step S52.
- Step S52: acquiring a plurality of types of second audio fingerprints of a known audio (specifically, the audio in the foregoing first candidate audio set); each type of second audio fingerprint contains a plurality of first portions for representing audio content features, which may be referred to as second audio fingerprint units.
- at least some types of second audio fingerprints comprise a second portion for indicating the degree of trust of the first portion.
- for example, one of the types may be an audio fingerprint obtained by the method of steps S11 to S17 in the foregoing embodiment, while other types of audio fingerprints of the known audio are acquired simultaneously. Thereafter, the processing proceeds to step S53.
- Step S53: the monomer similarity between second audio fingerprint units and first audio fingerprint units of the same type is determined by a method similar to step S42 of the foregoing embodiment. Thereafter, the processing proceeds to step S54.
- Step S54: determining an average value or a minimum value of the plurality of monomer similarities, and determining the similarity matrix of the known audio according to that average value or minimum value, using a method similar to step S43 of the foregoing embodiment.
- step S44 then determines the sequence similarity score, and the result of the second ranking, based on this similarity matrix obtained from the average or minimum value of the plurality of monomer similarities.
- the effect of determining the similarity matrix from the average or minimum of multiple monomer similarities is as follows: audio retrieval and recognition using the similarity obtained from a single type of audio fingerprint may produce mismatches; taking the average or minimum of the similarities of multiple types of audio fingerprints can reduce or eliminate the mismatching problem, thereby improving the accuracy of audio retrieval and recognition.
- the various monomer similarities should have a consistent range of values; for example, the value range of all types of monomer similarities can be set in advance to 0 to 1. The aforementioned example of the monomer similarity determined according to the Hamming distance already sets the range of the monomer similarity to 0 to 1.
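Step S54 can be sketched as follows, assuming (as the consistent-range requirement above implies) that each per-type similarity matrix already uses the same 0-to-1 scale; the function name and the stacking approach are our own:

```python
import numpy as np

def combined_similarity_matrix(per_type_sims, mode="mean"):
    """Combine similarity matrices computed from several audio
    fingerprint types (step S54).  Each matrix must use the same value
    range (e.g. 0 to 1, as with the Hamming-distance based monomer
    similarity) so that averaging or taking the minimum is meaningful."""
    stack = np.stack(per_type_sims)        # shape: (n_types, n_query, n_known)
    if mode == "mean":
        return stack.mean(axis=0)
    if mode == "min":
        return stack.min(axis=0)
    raise ValueError("mode must be 'mean' or 'min'")
```

Using the minimum is the stricter choice: a cell scores high only if every fingerprint type agrees, which is what suppresses single-type mismatches.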
- the audio retrieval and recognition method further includes: before performing the first ranking, slicing the acquired first audio fingerprint of the audio to be recognized and the second audio fingerprint of the known audio according to a preset fixed length, to obtain a plurality of first sub-audio fingerprints and second sub-audio fingerprints of the same length (including the same number of audio fingerprint units); for example, in an embodiment including the step of indexing the second audio fingerprint, the slicing is performed before the indexing. And/or, before the audio fingerprints are acquired, the audio to be recognized and the known audio are sliced according to a preset fixed time length to obtain a plurality of to-be-identified audio segments and known audio segments of the same duration; the audio fingerprints of the respective to-be-identified audio segments and known audio segments are then acquired separately, yielding the first sub-audio fingerprint of each to-be-identified audio segment and the second sub-audio fingerprint of each known audio segment.
- the recognition result of each sub-audio fingerprint is obtained, and the recognition result of the original audio to be recognized is then determined according to the recognition results of the sub-audio fingerprints.
- the effects of slicing the audio or the audio fingerprint to a fixed length are: 1. the TF-IDF ranking becomes fairer; 2. the obtained monomer similarities and sequence similarity scores are more accurate; 3. the uniform length facilitates the storage of the audio fingerprints and the fingerprint indexes.
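The fixed-length slicing described above can be sketched in a few lines; how the trailing remainder shorter than the preset length should be handled is not stated in the disclosure, so keeping it as-is here is an assumption (padding or dropping it would be equally valid):

```python
def slice_fingerprint(units, length):
    """Slice an audio fingerprint (a sequence of fingerprint units)
    into sub-fingerprints of a preset fixed length.  A trailing
    remainder shorter than `length` is kept as-is in this sketch."""
    return [units[i:i + length] for i in range((0), len(units), length)]
```

The same function applies to slicing raw audio by fixed time length if `units` is a list of frames.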
- the first audio fingerprint units in the first audio fingerprint and the second audio fingerprint units in the second audio fingerprint are arranged temporally, for example, in chronological order.
- the audio retrieval and recognition method of the present disclosure further includes determining, based on the aforementioned similarity matrix, a repeated segment of the audio to be recognized and the known audio (specifically, the audio in the foregoing recognition result). Specifically, the start and end times of the repeated segments in the two audios can be obtained from the start and end points of the straight line in the similarity matrix.
- the specific method for determining the repeated segment according to the straight line in the similarity matrix may be: determining the start time of the repeated segment in the audio to be identified according to the ordinal number of the first audio fingerprint unit corresponding to the starting point of the straight line (or the abscissa in the similarity matrix), and determining the start time of the repeated segment in the first candidate audio according to the ordinal number of the second audio fingerprint unit corresponding to the starting point (or the ordinate in the similarity matrix); similarly, the end time of the repeated segment in the audio to be recognized is determined according to the abscissa of the end point of the straight line, and the end time of the repeated segment in the first candidate audio is determined according to the ordinate of the end point.
- step S44 further includes: detecting the beginning portion and the ending portion of the obtained first matching straight line or second matching straight line, determining whether the points (monomer similarities) in the beginning portion and the ending portion of the first/second matching straight line reach a preset monomer similarity setting value, and removing from the first/second matching straight line the beginning and ending portions that do not reach the monomer similarity setting value (i.e., whose monomer similarity is not high). In this way, the accuracy of the audio retrieval and recognition can be improved, and more accurate repeated segments can be obtained.
- the specific method for removing the portions at the beginning/end of a matching straight line that do not reach the monomer similarity setting value may be: checking from the start/end point of the matching straight line toward the middle whether each point reaches the monomer similarity setting value, and after finding the first point that reaches the setting value, removing the points between that point and the start/end point.
- the monomer similarity setting value may be a specific value of the monomer similarity, in which case the check judges whether a point reaches that value; or it may be a proportional value, in which case the check judges whether a point reaches that proportion of the average or maximum value of all points included in the first/second matching straight line.
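The inward walk from each end described above can be sketched as follows (the function name is ours; the setting value is taken here as a specific monomer similarity value, the first of the two variants mentioned):

```python
def trim_matching_line(points, threshold):
    """Remove low-similarity points from the beginning and end of a
    matching straight line: walk inward from each end and cut
    everything before/after the first point whose monomer similarity
    reaches the preset setting value (threshold)."""
    start = 0
    while start < len(points) and points[start] < threshold:
        start += 1
    end = len(points)
    while end > start and points[end - 1] < threshold:
        end -= 1
    return points[start:end]
```

Low points in the interior of the line are deliberately kept; only the ends are trimmed, which is what tightens the recovered start/end times of a repeated segment.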
- FIG. 9 is a schematic structural block diagram of an embodiment of an audio retrieval and recognition apparatus 1000 of the present disclosure.
- the audio retrieval and recognition apparatus 1000 of the example of the present disclosure mainly includes:
- the audio fingerprint acquiring system 1100 is configured to acquire an audio fingerprint of the audio to be recognized (Query Audio). Wherein the audio fingerprint includes a first portion for representing a content feature of the audio to be recognized and a second portion for indicating a degree of trust of the first portion.
- the retrieval identification system 1200 is configured to identify the audio to be recognized according to the audio fingerprint of the audio to be identified, and obtain a recognition result.
- FIG. 10 is a schematic structural block diagram of an audio fingerprint acquiring system 1100 according to an embodiment of the present disclosure.
- the audio fingerprint acquiring system 1100 of the example of the present disclosure mainly includes: a spectrogram conversion module 1101, a feature point determination module 1102, a mask determination module 1103, a mean energy determination module 1104, an audio fingerprint bit determination module 1105, a strong and weak weight bit determination module 1106, and an audio fingerprint determination module 1107.
- the spectrogram conversion module 1101 is configured to convert audio into a spectrogram. Specifically, the spectrogram conversion module 1101 can be configured to convert the audio signal into a time-frequency spectrogram by using a short-time Fourier transform.
- the spectrogram conversion module 1101 may include a Mel transform sub-module for pre-processing the spectrogram by using a Mel (MEL) transform, dividing the frequency spectrum into a plurality of frequency blocks (frequency bins) by using the Mel transform, where the number of divided frequency blocks is configurable.
- the spectrogram conversion module 1101 may further include a human auditory system filtering sub-module for performing Human Auditory System filtering on the spectrogram; using a nonlinear transformation such as human auditory system filtering makes the spectral distribution in the spectrogram better suited to human ear perception.
- the feature point determination module 1102 is configured to determine feature points in the sound spectrum map.
- the feature point determining module 1102 may be specifically configured to determine the feature points by using one of several criteria; for example, a feature point may be selected as a maximum value point of energy in the spectrogram, or as a minimum value point of energy.
- alternatively, the feature point determining module 1102 may not select the extreme points of energy as feature points but instead use fixed points as feature points; for example, points whose frequency value is equal to a preset frequency setting value (fixed-frequency points) may be selected. Further, the feature point determining module 1102 can be configured to preset a plurality of frequency setting values at low, intermediate, and high frequencies according to the frequency magnitude.
- the mask determination module 1103 is configured to determine, on the spectrogram, one or more masks for the feature points in the vicinity of the feature points, each mask comprising a plurality of spectral regions. Specifically, in the spectrogram, a plurality of spectral regions included in each mask may be symmetrically distributed.
- the mean energy determination module 1104 is configured to determine an average energy of each spectral region.
- the audio fingerprint bit determining module 1105 is configured to determine an audio fingerprint bit according to the average energy of the plurality of spectral regions in the mask. It should be noted that the audio fingerprint bit is the first part of the aforementioned audio fingerprint for representing the content feature of the audio.
- the audio fingerprint bit determining module 1105 may be specifically configured to determine an audio fingerprint bit according to a difference value of mean energy of a plurality of spectral regions included in one mask.
- the strong and weak weight bit determining module 1106 is configured to determine the degree of trust of the audio fingerprint bits to determine the strong and weak weight bits corresponding to each audio fingerprint bit. It should be noted that the strong and weak weight bits are the second part of the aforementioned audio fingerprint for indicating the degree of trust of the first part.
- the strong and weak weight bit determining module 1106 is specifically configured to: determine whether the absolute value of the difference used to generate the audio fingerprint bit reaches (or exceeds) a preset strong bit threshold; if the threshold is reached, the audio fingerprint bit is determined to be a strong bit, and a strong and weak weight bit of 1 corresponding to the audio fingerprint bit is obtained; if the threshold is not reached, the audio fingerprint bit is determined to be a weak bit, and a strong and weak weight bit of 0 corresponding to the audio fingerprint bit is obtained.
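Modules 1105 and 1106 can be sketched together for a two-region mask; the bit convention below (bit 1 when the energy difference is positive, weight bit 1 when the absolute difference reaches the strong bit threshold and 0 otherwise, with weak bit → 0 as an assumption) is our reading of the description:

```python
def fingerprint_and_weight_bit(region_energies, strong_threshold):
    """Derive one audio fingerprint bit from the mean-energy
    difference of the spectral regions in a mask (module 1105) and
    the corresponding strong/weak weight bit (module 1106)."""
    e1, e2 = region_energies                  # mean energies of the two mask regions
    diff = e1 - e2
    bit = 1 if diff > 0 else 0                # first portion: content feature bit
    # second portion: trust degree - strong (1) iff |diff| reaches the threshold
    weight = 1 if abs(diff) >= strong_threshold else 0
    return bit, weight
```

A near-zero energy difference thus yields a weak bit: its fingerprint value is noise-sensitive, so the weight bit marks it as less trustworthy.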
- the audio fingerprint determining module 1107 is configured to determine an audio fingerprint of the audio according to the audio fingerprint bit and the strong weak weight bit.
- the present disclosure can generate an audio fingerprint with high accuracy and good robustness for a piece of audio by extracting the strong and weak weight bits corresponding to the audio fingerprint bit while extracting the audio fingerprint bit.
- the audio fingerprint acquiring system 1100 of the present disclosure further includes a timestamp adding module (not shown) for adding a timestamp field to the audio fingerprint, the field indicating the time difference between the audio start position and the feature point; this field can be a hash value. If the feature points are set to fixed points, this module may be omitted, i.e., the timestamp need not be recorded.
- the audio fingerprint acquiring system 1100 of the present disclosure further includes an audio identifier adding module (not shown) for adding an audio identification field to the audio fingerprint for recording the identification information of the audio signal corresponding to the audio fingerprint; this field can be a hash value.
- the audio fingerprint acquisition system 1100 of the present disclosure further includes an audio segmentation module (not shown) and an audio fingerprint combination module (not shown).
- the audio segmentation module is used to divide the original audio into multiple sub-audios by time.
- the audio fingerprint is extracted for each segment of the sub-audio by using the module included in the audio fingerprint acquiring system 1100 to obtain a plurality of audio fingerprints.
- the audio fingerprint combination module is configured to combine the extracted audio fingerprints of the pieces of sub-audio to obtain an audio fingerprint of the entire audio.
- for convenience of description, the audio fingerprint of the audio to be identified may be referred to as the first audio fingerprint, the audio fingerprint units included in the first audio fingerprint are referred to as first audio fingerprint units, and the strong and weak weight monomers corresponding to the first audio fingerprint units are referred to as first strong and weak weight monomers.
- FIG. 11 is a schematic structural block diagram of a retrieval and recognition system 1200 according to an embodiment of the present disclosure.
- the retrieval recognition system 1200 of the example of the present disclosure mainly includes:
- the first ranking module 1210 is configured to perform a first ranking of the plurality of known audios according to the first audio fingerprint, and extract the first k known audios as the first candidate audio set according to the result of the first ranking.
- k is a positive integer
- the specific value of k is configurable.
- the first ranking module 1210 is configured to perform the ranking based on the matching of each individual first audio fingerprint unit with the known audio. Further, the first ranking module 1210 can be configured to perform word frequency-reverse file frequency (TF-IDF) ranking of the known audio according to each of the first audio fingerprint units.
- the second ranking module 1220 is configured to perform a second ranking on the first candidate audio set according to the first audio fingerprint, and extract, according to the result of the second ranking, the first n first candidate audios in the first candidate audio set. As a result of the recognition.
- n is a positive integer, and the specific value of n can be set.
- the second ranking module 1220 is configured to rank the audio in the first candidate audio set according to the sequence of first audio fingerprint units arranged in order.
- the retrieval and recognition system 1200 can also be used to perform retrieval in the Meta database according to the recognition result, and can obtain audio information of the recognition result, such as the name, author, source, and the like of the recognized audio.
- when the recognition result includes a plurality of audios, information on the plurality of recognized audios can be provided simultaneously.
- in the process of the first ranking module 1210 performing the first ranking and/or the second ranking module 1220 performing the second ranking, when an audio fingerprint is utilized, the audio fingerprint units may be weighted based on the strong and weak weight monomers in the audio fingerprint.
- the aforementioned known audio can be audio in an audio database.
- Audio fingerprints of the known audios are stored in the audio database, and the stored audio fingerprints include audio fingerprints of the same type as the first audio fingerprint, obtained by the same extraction method as the first audio fingerprint; thereby, the audio fingerprint of the known audio also includes a first portion for representing the content features of the audio and a second portion for indicating the degree of trust of the first portion.
- the audio retrieval and recognition apparatus 1000 of the present disclosure further includes a fingerprint index acquisition module (not shown) for acquiring the audio fingerprints of a plurality of known audios. For convenience of description and understanding, the audio fingerprint of a known audio is referred to as a second audio fingerprint, the audio fingerprint units included in the second audio fingerprint are referred to as second audio fingerprint units, and the strong and weak weight monomers included in the second audio fingerprint are referred to as second strong and weak weight monomers. The fingerprint index acquisition module indexes the second audio fingerprints in advance to obtain the fingerprint index of the known audio.
- the first ranking module 1210 is specifically configured to match the fingerprint index with the first audio fingerprint unit of the audio to be identified to perform TF-IDF ranking on a plurality of known audios.
- the fingerprint index obtaining module may be configured to acquire a forward fingerprint index and an inverted index of an audio fingerprint of the known audio.
- the fingerprint index obtaining module may be configured to index the second audio fingerprint according to the strong and weak weights to improve the robustness.
- the first ranking module 1210 of the present disclosure may include an absolute matching sub-module 1211 for first performing an exact match against the known audio in the audio database prior to the first ranking.
- FIG. 12 is a schematic structural diagram of a first ranking module 1210 according to an embodiment of the present disclosure.
- the first ranking module 1210 specifically includes:
- the absolute matching sub-module 1211 is configured to count, according to the inverted fingerprint index, which of the first audio fingerprint units are present in the second audio fingerprints of the known audios, so as to select from the audio database, as the second candidate audio set, the known audios that match at least a preset number of first audio fingerprint units.
- the absolute matching sub-module 1211 may be specifically configured to consider, according to the strong and weak weight monomer corresponding to a first audio fingerprint unit, only the matching of the strong bits of the first audio fingerprint unit in the second audio fingerprints of the known audio, while ignoring the matching of the weak bits of the first audio fingerprint unit, to improve robustness.
- the word frequency determination sub-module 1212 is configured to determine, based on the forward fingerprint index, the word frequency of each first audio fingerprint unit in the second audio fingerprint of a second candidate audio; the forward fingerprint index may be the aforementioned fingerprint index obtained according to the strong and weak weights.
- the file frequency determining sub-module 1213 is configured to determine a file frequency of a first audio fingerprint unit based on the inverted fingerprint index.
- the inverted fingerprint index may be the aforementioned fingerprint index obtained according to the strong and weak weights.
- a word frequency-reverse file frequency scoring sub-module 1214, configured to determine the word frequency-reverse file frequency score of each second candidate audio according to the word frequency of each first audio fingerprint unit in the second audio fingerprint of the second candidate audio and the file frequency of each first audio fingerprint unit.
- the first ranking sub-module 1215 is configured to rank the second candidate audio set according to the obtained word frequency-reverse file frequency score of each second candidate audio, obtain the result of the first ranking, and extract the first k second candidate audios from the first ranking result as the first candidate audio set; the first ranking sub-module 1215 is further configured to return the second audio fingerprint (the forward fingerprint index) of each first candidate audio to the second ranking module 1220 in preparation for subsequent further processing.
- the second ranking ranks the audio in the first candidate audio set according to how the sequence of first audio fingerprint units, arranged in order, appears in the audio fingerprint of each first candidate audio.
- the second ranking module 1220 is configured to: obtain a similarity matrix for each audio in the first candidate audio set according to the fingerprint index of the known audio and the first audio fingerprint, and rank the audio in the first candidate audio set according to the similarity matrix.
- the second ranking module 1220 may be specifically configured to perform weighting, in determining the similarity matrix, according to the strong and weak weights corresponding to the first audio fingerprint and/or the strong and weak weights in the fingerprint index of the known audio, and to rank the audio in the first candidate audio set using the weighted similarity matrix, to improve robustness.
- FIG. 13 is a schematic structural diagram of a second ranking module 1220 according to an embodiment of the present disclosure.
- the second ranking module 1220 specifically includes:
- the second audio fingerprint acquisition sub-module 1221 is configured to acquire a second audio fingerprint of a first candidate audio (actually each first candidate audio is a known audio) in the first candidate audio set.
- the second audio fingerprint may be acquired according to the fingerprint index of the known audio (e.g., the forward fingerprint index).
- the first audio fingerprint includes a strong and weak weight monomer corresponding to each first audio fingerprint unit (which may be referred to as a first strong and weak weight monomer), and/or the second audio fingerprint includes a strong and weak weight monomer corresponding to each second audio fingerprint unit (which may be referred to as a second strong and weak weight monomer).
- the unit similarity first determining sub-module 1222 is configured to determine the monomer similarity between each second audio fingerprint unit included in the second audio fingerprint of the first candidate audio and each first audio fingerprint unit. It should be noted that the unit similarity first determining sub-module 1222 may be specifically configured to: in determining the monomer similarity, weight each first audio fingerprint unit and second audio fingerprint unit according to the first strong and weak weight monomer and/or the second strong and weak weight monomer, and then determine the monomer similarity according to the weighted first and second audio fingerprint units. In an example of the present disclosure, since the accuracy of the data information in the audio database is higher, the first audio fingerprint unit and the second audio fingerprint unit may each be weighted by the second strong and weak weight monomer.
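The weighted monomer similarity computed by sub-module 1222 can be sketched as follows; treating the weight monomer as a per-bit mask that simply ignores weak bits, and normalizing to the 0-to-1 range mentioned earlier, is one reading of the weighting, not the only possible one:

```python
def weighted_monomer_similarity(bits1, bits2, weights):
    """Monomer similarity between a first and a second audio
    fingerprint unit, weighting bit positions by a strong/weak weight
    monomer (here the second strong and weak weight monomer): weak
    bits (weight 0) are ignored, and the similarity is the fraction
    of strong bit positions on which the two units agree."""
    strong = [a == b for a, b, w in zip(bits1, bits2, weights) if w]
    if not strong:
        return 0.0          # no strong bits: no evidence of similarity
    return sum(strong) / len(strong)
```

This is the Hamming-style similarity from the earlier embodiment, restricted to the trusted (strong) bit positions.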
- the similarity matrix first determining sub-module 1223 is configured to determine a similarity matrix between the first candidate audio and the to-be-identified audio according to each individual similarity.
- the sequence similarity score determining sub-module 1224 is configured to determine a sequence similarity score of the first candidate audio according to a similarity matrix of the first candidate audio. Specifically, the sequence similarity score determination sub-module 1224 is specifically configured to determine the sequence similarity score according to a straight line in the similarity matrix.
- the second ranking sub-module 1225 is configured to rank the first candidate audio set according to the sequence similarity score of each first candidate audio, obtain the result of the second ranking, and extract the first n first candidate audios from the second ranking result as the recognition result.
- the sequence similarity score determination sub-module 1224 is specifically configured to determine the sequence similarity score by using the specific steps of the foregoing uniform media method.
- the sequence similarity score determination sub-module 1224 is specifically configured to determine the sequence similarity score by using the specific steps of the foregoing dynamic programming method.
- FIG. 14 is a schematic structural block diagram of an audio retrieval and recognition apparatus 1000 for determining a similarity matrix based on a plurality of types of first audio fingerprints and second audio fingerprints according to an embodiment of the present disclosure.
- the audio retrieval and recognition apparatus 1000 of the present disclosure includes:
- the multi-type first audio fingerprint obtaining module 1300 is configured to acquire a plurality of types of first audio fingerprints of the to-be-identified audio by using a plurality of audio fingerprint acquiring methods, where each type of first audio fingerprint includes a plurality of first portions for representing audio content features, which may be referred to as first audio fingerprint units.
- at least some types of first audio fingerprints comprise a second portion for indicating the degree of trust of the first portion.
- the multi-type second audio fingerprint obtaining module 1400 is configured to acquire a plurality of types of second audio fingerprints of a known audio (specifically, the audio in the foregoing first candidate audio set), where each type of second audio fingerprint contains a plurality of first portions for representing audio content features, which may be referred to as second audio fingerprint units.
- at least some types of second audio fingerprints comprise a second portion for indicating the degree of trust of the first portion.
- the unit similarity second determining sub-module 1500 is configured to respectively determine the monomer similarities between second audio fingerprint units and first audio fingerprint units of the same type. Thus, corresponding to the multiple types of audio fingerprints, a variety of monomer similarities for a known audio can be obtained.
- the similarity matrix second determining sub-module 1600 is configured to determine an average value or a minimum value of the plurality of monomer similarities, and determine the similarity matrix of the known audio according to the average value or the minimum value of the plurality of monomer similarities.
- the sequence similarity score determination sub-module 1224 is configured to determine the sequence similarity score according to the similarity matrix based on the average or minimum value of the plurality of monomer similarities.
- the audio retrieval recognition apparatus 1000 further includes an audio slicing module (not shown).
- the audio slicing module is configured to slice, before the first ranking is performed, the acquired first audio fingerprint of the audio to be recognized and the second audio fingerprint of the known audio according to a preset fixed length, to obtain first sub-audio fingerprints and second sub-audio fingerprints of the same length (including the same number of audio fingerprint units); and/or the audio slicing module is configured to slice, before the audio fingerprints are acquired, the to-be-recognized audio and the known audio according to a preset fixed time length, to obtain a plurality of to-be-identified audio segments and known audio segments of the same duration, and then acquire the audio fingerprints of the respective to-be-identified audio segments and known audio segments separately, obtaining the first sub-audio fingerprint of each to-be-recognized audio segment and the second sub-audio fingerprint of each known audio segment.
- the foregoing first ranking module 1210 and the second ranking module 1220 are configured to perform the foregoing first ranking and second ranking according to each of the first sub-audio fingerprint and the second sub-audio fingerprint, to obtain identification of each sub-audio fingerprint. As a result, the original recognition result of the to-be-identified audio is then determined based on the recognition result of each sub-audio fingerprint.
- the first audio fingerprint unit of the first audio fingerprint and the second audio fingerprint unit of the second audio fingerprint are temporally arranged.
- the audio retrieval recognition apparatus 1000 of the present disclosure further includes a repeated audio segment determining module (not shown) for determining the repeated segments of the audio to be recognized and the known audio according to the aforementioned similarity matrix.
- the repeated audio segment determining module is specifically configured to obtain the start and end times of the repeated segments in the two audios from the start and end points of the straight line in the similarity matrix.
- FIG. 15 is a hardware block diagram illustrating an audio retrieval recognition hardware device in accordance with an embodiment of the present disclosure.
- an audio retrieval recognition hardware device 2000 according to an embodiment of the present disclosure includes a memory 2001 and a processor 2002.
- the components of the audio retrieval recognition hardware device 2000 are interconnected by a bus system and/or other form of connection mechanism (not shown).
- the memory 2001 is for storing non-transitory computer readable instructions.
- memory 2001 can include one or more computer program products, which can include various forms of computer readable storage media, such as volatile memory and/or nonvolatile memory.
- the volatile memory may include, for example, random access memory (RAM) and/or cache or the like.
- the nonvolatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, or the like.
- the processor 2002 can be a central processing unit (CPU) or other form of processing unit with data processing capabilities and/or instruction execution capabilities, and can control other components in the audio retrieval recognition hardware device 2000 to perform the desired functions.
- the processor 2002 is configured to execute the computer readable instructions stored in the memory 2001, so that the audio retrieval recognition hardware device 2000 performs all or part of the steps of the aforementioned audio retrieval recognition method of the various embodiments of the present disclosure.
- FIG. 16 is a schematic diagram illustrating a computer readable storage medium in accordance with an embodiment of the present disclosure.
- a computer readable storage medium 3000 according to an embodiment of the present disclosure stores thereon non-transitory computer readable instructions 3001.
- when the non-transitory computer readable instructions 3001 are executed by a processor, all or part of the steps of the audio retrieval recognition method of the various embodiments of the present disclosure described above are performed.
- FIG. 17 is a diagram showing a hardware configuration of a terminal device according to an embodiment of the present disclosure.
- the terminal device may be implemented in various forms; terminal devices in the present disclosure may include, but are not limited to, mobile terminal devices such as mobile phones, smart phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), navigation devices, in-vehicle terminal devices, in-vehicle display terminals and in-vehicle electronic rearview mirrors, as well as fixed terminal devices such as digital TVs and desktop computers.
- the terminal device 4100 may include a wireless communication unit 4110, an A/V (audio/video) input unit 4120, a user input unit 4130, a sensing unit 4140, an output unit 4150, a memory 4160, an interface unit 4170, a controller 4180, and a power supply unit 4190.
- Figure 17 shows a terminal device having various components, but it should be understood that not all illustrated components are required to be implemented. More or fewer components can be implemented instead.
- the wireless communication unit 4110 allows radio communication between the terminal device 4100 and a wireless communication system or network.
- the A/V input unit 4120 is for receiving an audio or video signal.
- the user input unit 4130 can generate key input data according to a command input by the user to control various operations of the terminal device.
- the sensing unit 4140 detects the current state of the terminal device 4100, the location of the terminal device 4100, the presence or absence of a user's touch input to the terminal device 4100, the orientation of the terminal device 4100, the acceleration or deceleration and direction of movement of the terminal device 4100, and the like, and generates a command or signal for controlling the operation of the terminal device 4100.
- the interface unit 4170 serves as an interface through which at least one external device can connect with the terminal device 4100.
- Output unit 4150 is configured to provide an output signal in a visual, audio, and/or tactile manner.
- the memory 4160 may store a software program or the like that performs processing and control operations performed by the controller 4180, or may temporarily store data that has been output or is to be output.
- the memory 4160 can include at least one type of storage medium.
- the terminal device 4100 can cooperate with a network storage device that performs a storage function of the memory 4160 through a network connection.
- Controller 4180 typically controls the overall operation of the terminal device. Additionally, the controller 4180 can include a multimedia module for reproducing or playing back multimedia data.
- the controller 4180 can perform a pattern recognition process to recognize a handwriting input or a picture drawing input performed on the touch screen as a character or an image.
- the power supply unit 4190 receives external power or internal power under the control of the controller 4180 and provides appropriate power required to operate the various components and components.
- Various embodiments of the audio retrieval identification method proposed by the present disclosure may be implemented in a computer readable medium using, for example, computer software, hardware, or any combination thereof.
- for a hardware implementation, various embodiments of the audio retrieval recognition method proposed by the present disclosure may be implemented using at least one of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a processor, a controller, a microcontroller, a microprocessor, or an electronic unit designed to perform the functions described herein; in some cases, such embodiments may be implemented in the controller 4180.
- various implementations of the audio retrieval identification method proposed by the present disclosure can be implemented with separate software modules that allow for the execution of at least one function or operation.
- the software code can be implemented by a software application (or program) written in any suitable programming language, which can be stored in memory 4160 and executed by controller 4180.
- the audio retrieval recognition method, apparatus, hardware device, computer readable storage medium and terminal device proposed herein perform audio retrieval and recognition by acquiring and using an audio fingerprint feature of the audio object that includes a first portion representing the audio content features and a second portion representing the credibility of the first portion, which can greatly improve the accuracy, robustness and efficiency of audio retrieval.
- the word "exemplary" does not mean that the described example is preferred or better than other examples.
Abstract
An audio retrieval recognition method and apparatus. The method includes: acquiring an audio fingerprint of audio to be recognized, the audio fingerprint including a first portion representing content features of the audio to be recognized and a second portion representing the credibility of the first portion; and recognizing the audio to be recognized according to the audio fingerprint to obtain a recognition result.
Description
Cross-reference to related applications
This application claims priority to Chinese patent application No. 201810273699.7, filed on March 29, 2018, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of audio processing technology, and in particular to an audio retrieval recognition method and apparatus.
Audio fingerprints (also called audio features) and audio fingerprint retrieval are widely used in today's "multimedia information society". Audio fingerprint retrieval was first applied to song recognition: given a piece of audio, the corresponding song can be identified by extracting and comparing the fingerprint features of that audio. Audio fingerprint retrieval can also be applied to content monitoring, such as audio de-duplication, retrieval-based voice advertisement monitoring, and audio copyright protection.
Existing audio retrieval recognition methods suffer from poor accuracy and slow speed, which consumes enormous computing and storage resources.
Summary of the Invention
An object of the present disclosure is to provide a new audio retrieval recognition method and apparatus.
This object is achieved by the following technical solution. The audio retrieval recognition method proposed by the present disclosure includes the steps of: acquiring an audio fingerprint of audio to be recognized, the audio fingerprint including a first portion representing content features of the audio to be recognized and a second portion representing the credibility of the first portion; and recognizing the audio to be recognized according to the audio fingerprint to obtain a recognition result.
The object of the present disclosure may be further achieved by the following technical measures.
In the foregoing method, acquiring the audio fingerprint of the audio to be recognized includes: converting the audio to be recognized into a spectrogram; determining feature points in the spectrogram; determining, on the spectrogram, one or more masks for each feature point, each mask containing multiple spectral regions; determining the mean energy of each spectral region; determining audio fingerprint bits from the mean energies of the multiple spectral regions in a mask; judging the credibility of the audio fingerprint bits to determine strong/weak weight bits; and determining the audio fingerprint of the audio to be recognized from the audio fingerprint bits and the strong/weak weight bits.
In the foregoing method, converting the audio to be recognized into a spectrogram includes: converting the audio to be recognized into a two-dimensional time-frequency spectrogram by short-time Fourier transform, where the value of each point in the spectrogram represents the energy of the audio.
In the foregoing method, converting the audio to be recognized into a spectrogram further includes: applying a Mel transform to the spectrogram.
In the foregoing method, the feature points are fixed points in the spectrogram.
In the foregoing method, the feature points are points whose frequency values equal preset frequency values.
In the foregoing method, the feature points are energy maximum points in the spectrogram, or the feature points are energy minimum points in the spectrogram.
In the foregoing method, the multiple spectral regions contained in a mask are symmetrically distributed.
In the foregoing method, the multiple spectral regions contained in a mask have the same frequency range, and/or the same time range, and/or are centrally symmetric about the feature point.
In the foregoing method, the mean energy of a spectral region is the average of the energy values of all points contained in the region.
In the foregoing method, determining audio fingerprint bits from the mean energies of the multiple spectral regions in a mask includes: determining one audio fingerprint bit from the difference of the mean energies of the multiple spectral regions contained in one mask.
In the foregoing method, judging the credibility of the audio fingerprint bits to determine strong/weak weight bits includes: judging whether the absolute value of the difference reaches or exceeds a preset strong/weak bit threshold; if it does, determining the audio fingerprint bit as a strong bit, otherwise determining it as a weak bit; and determining the strong/weak weight bit according to whether the audio fingerprint bit is a strong bit or a weak bit.
The foregoing method further includes: dividing the audio to be recognized into multiple sub-audios by time; extracting the audio fingerprint of each sub-audio; and combining the extracted audio fingerprints of the sub-audios to obtain the audio fingerprint of the audio to be recognized.
In the foregoing method, the audio fingerprint of the audio to be recognized is defined as a first audio fingerprint, which contains multiple first audio fingerprint units and, corresponding to each first audio fingerprint unit, a first strong/weak weight unit; a first audio fingerprint unit contains multiple audio fingerprint bits of the audio to be recognized, and a first strong/weak weight unit contains the strong/weak weight bits corresponding to those audio fingerprint bits.
In the foregoing method, recognizing the audio to be recognized according to the audio fingerprint includes: performing a first ranking of multiple known audios according to each individual first audio fingerprint unit and, from the result of the first ranking, taking the top k known audios as a first candidate audio set, where k is a positive integer; and performing a second ranking of the first candidate audio set according to multiple sequentially arranged first audio fingerprint units and, from the result of the second ranking, taking the top n first candidate audios as the recognition result, where n is a positive integer.
The foregoing method further includes: acquiring in advance the audio fingerprints of the known audios as second audio fingerprints, each second audio fingerprint containing multiple second audio fingerprint units and corresponding second strong/weak weight units; and indexing the second audio fingerprints to obtain fingerprint indexes of the known audios in advance.
In the foregoing method, during the first ranking and/or the second ranking, the first audio fingerprint units and/or the second audio fingerprint units are weighted according to the first strong/weak weight units and/or the second strong/weak weight units.
In the foregoing method, performing the first ranking of multiple known audios according to each individual first audio fingerprint unit includes: performing a term frequency-inverse document frequency (TF-IDF) ranking of the multiple known audios according to each individual first audio fingerprint unit.
In the foregoing method, the TF-IDF first ranking includes: matching the fingerprint indexes of the known audios against the first audio fingerprint units to perform the TF-IDF ranking of the known audios.
In the foregoing method, obtaining the fingerprint indexes of the known audios in advance includes: obtaining forward fingerprint indexes and/or inverted fingerprint indexes of the known audios in advance according to the second strong/weak weight units.
In the foregoing method, matching the fingerprint indexes of the known audios against the first audio fingerprint units includes: performing exact matching of the fingerprint indexes against the first audio fingerprint units according to the first strong/weak weight units.
In the foregoing method, performing the second ranking of the first candidate audio set according to multiple sequentially arranged first audio fingerprint units includes: obtaining a similarity matrix of the audios in the first candidate audio set from the fingerprint indexes of the known audios and the first audio fingerprint, and ranking the audios in the first candidate audio set according to the similarity matrix.
In the foregoing method, obtaining the similarity matrix and ranking by it includes: weighting with the first strong/weak weight units and/or the second strong/weak weight units to obtain a weighted similarity matrix, and ranking the audios in the first candidate audio set according to the weighted similarity matrix.
In the foregoing method, ranking the audios in the first candidate audio set according to the similarity matrix includes: ranking the audios in the first candidate audio set according to straight lines in the similarity matrix.
In the foregoing method: acquiring the audio fingerprint of the audio to be recognized includes acquiring multiple types of first audio fingerprints of the audio to be recognized; acquiring in advance the audio fingerprints of the known audios as second audio fingerprints includes acquiring multiple types of second audio fingerprints of the audios in the first candidate audio set; and obtaining the similarity matrix includes determining the similarity matrix from the multiple types of first audio fingerprints and the multiple types of second audio fingerprints.
In the foregoing method, each type of first audio fingerprint contains multiple first audio fingerprint units and each type of second audio fingerprint contains multiple second audio fingerprint units; determining the similarity matrix includes: determining, for each type, the unit similarities between second and first audio fingerprint units of that same type, obtaining multiple unit similarities; and determining the similarity matrix from the average or minimum of the multiple unit similarities.
The foregoing method further includes: slicing the audio to be recognized and the known audios in advance into segments of a preset duration, obtaining multiple to-be-recognized sub-audios and multiple known sub-audios, and extracting audio fingerprints from them respectively, to obtain multiple first sub-audio fingerprints and multiple second sub-audio fingerprints of the same length.
The foregoing method further includes: before the first ranking, slicing the obtained first audio fingerprint of the audio to be recognized and the second audio fingerprints of the known audios into pieces of a preset length, to obtain multiple first sub-audio fingerprints and multiple second sub-audio fingerprints of the same length.
In the foregoing method, the multiple first audio fingerprint units are arranged chronologically in the first audio fingerprint, and the multiple second audio fingerprint units are arranged chronologically in the second audio fingerprint.
The foregoing method further includes: determining, according to the similarity matrix, the repeated segments of the audio to be recognized and of the audios in the recognition result.
The object of the present disclosure is also achieved by the following technical solution. The audio retrieval recognition apparatus proposed by the present disclosure includes: an audio fingerprint acquisition system for acquiring the audio fingerprint of audio to be recognized, the audio fingerprint including a first portion representing content features of the audio to be recognized and a second portion representing the credibility of the first portion; and a retrieval recognition system for recognizing the audio to be recognized according to the audio fingerprint to obtain a recognition result.
The object of the present disclosure may be further achieved by the following technical measures.
The foregoing apparatus further includes modules for performing the steps of any of the foregoing audio retrieval recognition methods.
The object of the present disclosure is also achieved by the following technical solution. The audio retrieval recognition hardware device proposed by the present disclosure includes: a memory for storing non-transitory computer readable instructions; and a processor for running the computer readable instructions such that, when the instructions are executed, the processor implements any of the foregoing audio retrieval recognition methods.
The object of the present disclosure is also achieved by the following technical solution. The computer readable storage medium proposed by the present disclosure stores non-transitory computer readable instructions which, when executed by a computer, cause the computer to perform any of the foregoing audio retrieval recognition methods.
The object of the present disclosure is also achieved by the following technical solution. The terminal device proposed by the present disclosure includes any of the foregoing audio retrieval recognition apparatuses.
The above description is only an overview of the technical solutions of the present disclosure. In order that the technical means of the present disclosure may be understood more clearly and implemented according to this specification, and that the above and other objects, features and advantages of the present disclosure may be more readily apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
FIG. 1 is a flow diagram of an audio retrieval recognition method according to an embodiment of the present disclosure.
FIG. 2 is a flow diagram of acquiring an audio fingerprint according to an embodiment of the present disclosure.
FIG. 3 is a flow diagram of retrieving and recognizing audio according to an embodiment of the present disclosure.
FIG. 4 is a flow diagram of the first ranking according to an embodiment of the present disclosure.
FIG. 5 is a flow diagram of the second ranking according to an embodiment of the present disclosure.
FIG. 6 is a flow diagram of determining the sequence similarity score by dynamic programming according to an embodiment of the present disclosure.
FIG. 7 is a flow diagram of determining the sequence similarity score by the constant-speed audio method according to an embodiment of the present disclosure.
FIG. 8 is a flow diagram of determining the similarity matrix from multiple types of first and second audio fingerprints according to an embodiment of the present disclosure.
FIG. 9 is a structural block diagram of an audio retrieval recognition apparatus according to an embodiment of the present disclosure.
FIG. 10 is a structural block diagram of an audio fingerprint acquisition system according to an embodiment of the present disclosure.
FIG. 11 is a structural block diagram of a retrieval recognition system according to an embodiment of the present disclosure.
FIG. 12 is a structural block diagram of a first ranking module according to an embodiment of the present disclosure.
FIG. 13 is a structural block diagram of a second ranking module according to an embodiment of the present disclosure.
FIG. 14 is a structural block diagram of an audio retrieval recognition apparatus that determines the similarity matrix from multiple types of first and second audio fingerprints, according to an embodiment of the present disclosure.
FIG. 15 is a hardware block diagram of an audio retrieval recognition hardware device according to an embodiment of the present disclosure.
FIG. 16 is a schematic diagram of a computer readable storage medium according to an embodiment of the present disclosure.
FIG. 17 is a structural block diagram of a terminal device according to an embodiment of the present disclosure.
To further explain the technical means adopted by the present disclosure to achieve the intended objects and their effects, specific embodiments, structures and features of the audio retrieval recognition method and apparatus proposed by the present disclosure are described in detail below with reference to the accompanying drawings and preferred embodiments.
FIG. 1 is a schematic flow diagram of an embodiment of the audio retrieval recognition method of the present disclosure. Referring to FIG. 1, the exemplary method mainly includes the following steps:
Step S10: acquire the audio fingerprint of the audio to be recognized (the query audio). The audio fingerprint includes a first portion representing content features of the audio to be recognized and a second portion representing the credibility of the first portion. Processing then proceeds to step S20.
Step S20: recognize the audio to be recognized according to its audio fingerprint, obtaining a recognition result.
By acquiring and using an audio fingerprint feature of the audio object that includes a first portion representing the audio content features and a second portion representing the credibility of the first portion, the exemplary method can improve the accuracy, robustness and efficiency of audio retrieval recognition.
Each of the above steps is described in detail below.
1. Step S10.
FIG. 2 is a schematic flow diagram of acquiring an audio fingerprint according to an embodiment of the present disclosure. Since the audio fingerprint of any audio can be acquired by the method shown in FIG. 2, this embodiment does not distinguish whether the audio is the audio to be recognized. Referring to FIG. 2, in one embodiment the process of acquiring an audio fingerprint in step S10 specifically includes the following steps:
Step S11: convert the audio into a spectrogram. Specifically, the audio signal is converted into a time-frequency spectrogram by short-time Fourier transform. The spectrogram is a common two-dimensional spectral representation of an audio signal: the horizontal axis is time t, the vertical axis is frequency f, and the value E(t,f) of each point (t,f) represents the energy of the signal. The specific type of audio signal is not limited; it may be a static file or streaming audio. Processing then proceeds to step S12.
In embodiments of the present disclosure, the spectrogram may be preprocessed with a Mel (MEL) transform, which divides the spectrum into multiple frequency bins, the number of which is configurable. Human Auditory System filtering or similar nonlinear transformations may also be applied to the spectrogram, making the spectral distribution better match human auditory perception.
It should be noted that the hyperparameters of the short-time Fourier transform can be adjusted to suit different situations. In embodiments of the present disclosure, the hyperparameters of step S11 may be set as follows: in the short-time Fourier transform, the time window is set to 100 ms and the hop to 50 ms; in the Mel transform, the number of frequency bins is set to 32 to 128.
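As an illustrative, non-limiting sketch of the spectrogram conversion described above (plain numpy, no Mel filtering; the 100 ms window and 50 ms hop are the example hyperparameters given here, and the Hann window is an assumption not specified in the disclosure):

```python
import numpy as np

def spectrogram(x, sr, win_s=0.1, hop_s=0.05):
    """Energy spectrogram of signal `x` (sample rate `sr`) via the
    short-time Fourier transform: returns an array indexed [frame, freq],
    each value representing the signal energy E(t, f)."""
    win = int(sr * win_s)
    hop = int(sr * hop_s)
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)) ** 2
```

For a 1 s, 100 Hz sine at 1 kHz sampling, the energy peaks in the frequency bin corresponding to 100 Hz.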
Step S12: determine the feature points in the spectrogram.
Specifically, one of several criteria may be used to determine the feature points: for example, the feature points may be chosen as the energy maximum points of the spectrogram, or as the energy minimum points. A point (t,f) is an energy maximum point if its energy E(t,f) simultaneously satisfies E(t,f)>E(t+1,f), E(t,f)>E(t-1,f), E(t,f)>E(t,f+1) and E(t,f)>E(t,f-1); similarly, it is an energy minimum point if E(t,f)<E(t+1,f), E(t,f)<E(t-1,f), E(t,f)<E(t,f+1) and E(t,f)<E(t,f-1). Processing then proceeds to step S13.
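The energy-maximum criterion above can be sketched as follows (a minimal illustration, assuming the spectrogram is a 2-D numpy array indexed [t, f]; border points are skipped for simplicity):

```python
import numpy as np

def energy_maxima(spec):
    """Return the (t, f) points whose energy strictly exceeds that of
    all four axis neighbours, i.e. the energy maximum points."""
    pts = []
    T, F = spec.shape
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            e = spec[t, f]
            if (e > spec[t + 1, f] and e > spec[t - 1, f]
                    and e > spec[t, f + 1] and e > spec[t, f - 1]):
                pts.append((t, f))
    return pts
```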
In embodiments of the present disclosure, choosing energy extremum points as feature points has drawbacks: extrema are easily affected by noise; their number is hard to control, so one spectrogram may contain no extremum while another contains many, making the feature points uneven; and an extra timestamp must be stored to record each extremum's position in the spectrogram. Therefore, instead of energy extrema, fixed points may be chosen as feature points, for example points whose frequency equals a preset frequency value (frequency-fixed points). Further, multiple low-, mid- and high-frequency values may be preset (the specific values are configurable); choosing multiple fixed points at low, mid and high frequencies makes the selected feature points more uniform. Fixed points may also be selected by other criteria, such as points equal to one or more preset energy values.
It should be noted that the number of selected feature points can be adjusted to suit different situations. In embodiments of the present disclosure, the hyperparameter of step S12 may be set as follows: the density of feature points is set to 20 to 80 per second.
Step S13: on the spectrogram, near each feature point, determine one or more masks for the feature point, each mask containing (i.e. covering) multiple regions of the spectrogram (called spectral regions). Processing then proceeds to step S14.
Specifically, in the spectrogram, the multiple spectral regions contained in a mask may be symmetrically distributed:
symmetric about the time axis (i.e. the regions have the same frequency range): for example, in a Mel spectrogram, a mask containing two spectral regions R11 and R12 may be determined for a feature point, where R11 and R12 both lie to the left of the feature point, R11 lies to the left of R12, and R11 and R12 cover the same frequency bins;
or symmetric about the frequency axis (i.e. the regions have the same time range): for example, in a Mel spectrogram, a mask containing two spectral regions R13 and R14 may be determined for a feature point, where R13 lies above the feature point, R14 lies below it, and R13 and R14 have the same time range;
or centrally symmetric about the feature point: for example, in a Mel spectrogram, a mask containing two spectral regions R15 and R16 may be determined for a feature point, where R15 lies to the upper left and R16 to the lower right of the feature point, and R15 and R16 are mutually symmetric about the feature point.
Of course, the multiple spectral regions of one mask may satisfy several of these symmetries at once. For example, a mask containing four spectral regions R21, R22, R23 and R24 may be determined for a feature point, lying at its upper left, upper right, lower left and lower right respectively, where R21 and R22 have the same frequency range, R23 and R24 have the same frequency range, R21 and R23 have the same time range, R22 and R24 have the same time range, and the four regions are centrally symmetric about the feature point. It should be noted that the four regions of a mask need not be centrally symmetric about the feature point; for example, they may all lie to the left of the feature point, distributed on both sides of it along the frequency axis.
It should be noted that the multiple spectral regions belonging to one mask may overlap one another, and different masks may also overlap one another. Optionally, each mask may contain an even number of spectral regions.
Note that the masks may be determined by a fixed preset standard, i.e. the position and coverage of each mask in the spectrogram are preset. Alternatively, instead of fixing the masks' positions and extents in advance, the mask regions may be determined automatically in a data-driven manner: from a large set of masks, selecting those with the smallest covariance and the greatest discriminative power.
Step S14: determine the mean energy of each spectral region. Specifically, for a region containing a single point, the mean energy is that point's energy value; for a region composed of multiple points, the mean energy may be set to the average of those points' energy values. Processing then proceeds to step S15.
Step S15: determine audio fingerprint bits from the mean energies of the multiple spectral regions in each mask. Note that these audio fingerprint bits form the aforementioned first portion of the audio fingerprint, representing the audio content features. Processing then proceeds to step S16.
In step S15 of embodiments of the present disclosure, one audio fingerprint bit may be determined from the difference of the mean energies of the multiple spectral regions contained in one mask.
Specifically, if a mask contains two spectral regions, such as the foregoing example with R11 and R12, the difference D1 of the mean energies of R11 and R12 may be computed by Equation 1:
D1 = E(R11) - E(R12),   (Equation 1)
and the sign of D1 is then examined: if D1 is positive, an audio fingerprint bit of value 1 is obtained; if D1 is negative, an audio fingerprint bit of value 0 is obtained.
If a mask contains four spectral regions, such as the foregoing example with R21, R22, R23 and R24, the difference D2 of their mean energies may be computed by Equation 2:
D2 = (E(R21) + E(R22)) - (E(R23) + E(R24)),   (Equation 2)
and the sign of D2 is then examined: if D2 is positive, an audio fingerprint bit of value 1 is obtained; if D2 is negative, an audio fingerprint bit of value 0 is obtained. It should be noted that the fingerprint bit of a four-region mask need not be determined from D2; differences of other forms may also be used. For example, the second-order difference D3 of the four regions' mean energies may be computed:
D3 = (E(R23) - E(R24)) - (E(R21) - E(R22)),   (Equation 3)
and the sign of D3 then determines the audio fingerprint bit.
It should be noted that if multiple masks are determined for a feature point, multiple audio fingerprint bits are obtained correspondingly.
Step S16: determine the strong/weak weight bit corresponding to each audio fingerprint bit; the strong/weak weight bit represents the credibility of that fingerprint bit. Note that these weight bits form the aforementioned second portion of the audio fingerprint, representing the credibility of the first portion. Specifically, a fingerprint bit of high credibility is defined as a strong bit and one of low credibility as a weak bit; the credibility of each audio fingerprint bit is judged, and the value of the weight bit is determined according to whether the fingerprint bit is strong or weak. Processing then proceeds to step S17.
In embodiments of the present disclosure, if the fingerprint bit was determined from the difference of the mean energies of a mask's spectral regions, step S16 specifically includes: judging whether the absolute value of the difference used to generate the fingerprint bit reaches (or exceeds) a preset strong/weak bit threshold; if it does, the fingerprint bit is determined to be a strong bit and a corresponding weight bit of value 1 is obtained; if it does not, the fingerprint bit is determined to be a weak bit and a corresponding weight bit of value 0 is obtained.
As a concrete example, if a fingerprint bit was determined from the sign of the four-region mean energy difference D2 of Equation 2, step S16 includes comparing |D2| with a preset strong/weak bit threshold T: if |D2| ≥ T, the bit is strong and its weight bit is set to 1; if |D2| < T, the bit is weak and its weight bit is set to 0. The threshold may be of various types: a preset fixed value, e.g. fixed at 1; a value derived from the mean energy differences, e.g. the average of the differences over multiple masks (or multiple feature points) — in fact not limited to the average, but any value between the largest and smallest difference — with bits whose difference reaches that value determined strong and the rest weak; or a proportion, e.g. 60%, so that among the differences of multiple masks (or feature points), a bit is strong if the absolute value of its difference ranks in the top 60% of all differences, and weak otherwise.
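The four-region bit of Equation 2 together with its strong/weak weight bit can be sketched as follows (an illustrative example; the default `threshold` stands in for the configurable strong/weak bit threshold T):

```python
def fingerprint_bit(e21, e22, e23, e24, threshold=1.0):
    """One audio fingerprint bit plus its strong/weak weight bit from
    the four mean energies of a mask's regions: bit = sign of
    D2 = (E(R21)+E(R22)) - (E(R23)+E(R24)); strong iff |D2| >= T."""
    d2 = (e21 + e22) - (e23 + e24)
    bit = 1 if d2 > 0 else 0
    strong = 1 if abs(d2) >= threshold else 0
    return bit, strong
```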
Step S17: determine the audio fingerprint of the audio from the audio fingerprint bits and the strong/weak weight bits. Specifically, the way the fingerprint is assembled and its length are not limited; the fingerprint only needs to include the audio fingerprint bits of one or more feature points (forming the first portion of the fingerprint) and the corresponding weight bits (forming the second portion). In some embodiments of the present disclosure, the audio fingerprint contains multiple audio fingerprint units and, for each unit, a corresponding strong/weak weight unit; a fingerprint unit contains multiple audio fingerprint bits of the audio, and a weight unit contains the weight bits corresponding to those fingerprint bits. For example, the fingerprint bits of all masks of one feature point may be combined into a bit sequence forming an audio fingerprint unit, the corresponding weight bits combined into a weight-bit sequence of equal length forming a strong/weak weight unit, and the fingerprint units and weight units of multiple feature points arranged in the feature points' time order to form the audio fingerprint. Optionally, each obtained fingerprint unit may be 32 bits long.
By extracting, together with each audio fingerprint bit, its corresponding strong/weak weight bit, the present disclosure can generate a highly accurate and robust audio fingerprint for a piece of audio.
Optionally, step S10 of the present disclosure further includes: adding a timestamp field to the audio fingerprint, representing the time offset of the feature point from the start of the audio; the field may be a hash value. If the feature points are set to fixed points, this step may be omitted, i.e. the timestamp need not be recorded.
Optionally, step S10 further includes: adding an audio identifier field to the audio fingerprint, recording the ID of the audio corresponding to the fingerprint; the field may be a hash value.
Optionally, step S10 further includes: dividing the original audio into multiple sub-audios by time; extracting the audio fingerprint of each sub-audio by the foregoing steps, obtaining multiple audio fingerprints; and combining the extracted fingerprints of the sub-audios to obtain the fingerprint of the whole audio.
2. Step S20.
For ease of description, the audio fingerprint of the audio to be recognized is called the first audio fingerprint, its fingerprint units are called first audio fingerprint units, and their corresponding weight units are called first strong/weak weight units.
FIG. 3 is a schematic flow diagram of retrieving and recognizing audio according to the audio fingerprint, in an embodiment of the present disclosure. Referring to FIG. 3, in one embodiment the retrieval recognition process of step S20 specifically includes the following steps:
Step S21: perform a first ranking of multiple known audios according to the first audio fingerprint, and take, from the result of the first ranking, the top k known audios as a first candidate audio set, where k is a positive integer whose value is configurable. Specifically, the first ranking ranks the known audios by how each individual first fingerprint unit matches them. Further, the first ranking may be a term frequency-inverse document frequency (TF-IDF) ranking of the known audios according to the individual first fingerprint units. Processing then proceeds to step S22.
Step S22: perform a second ranking of the first candidate audio set according to the first audio fingerprint, and take, from the result of the second ranking, the top n first candidate audios as the recognition result, where n is a positive integer whose value is configurable. Specifically, the second ranking ranks the audios in the first candidate set according to multiple sequentially arranged first fingerprint units, for example a consecutive part of the first audio fingerprint, the whole first audio fingerprint, and/or multiple first fingerprint units taken at equal index intervals, such as units of index 1, 3, 5, 7, ....
The recognition result can then be looked up in a Meta database to obtain its audio information, such as the name, author and source of the recognized audio. When the recognition result includes multiple audios, the information of all of them may be provided at once.
In this embodiment, during the first ranking of step S21 and/or the second ranking of step S22, wherever the audio fingerprint is used, the fingerprint units may be weighted according to the strong/weak weight units in the fingerprint. Since an unweighted first or second ranking amounts to giving every fingerprint unit the same weight during ranking, only the first and second rankings that weight the fingerprint by the strong/weak weights are described in detail below.
By performing a first ranking and a second ranking to obtain the retrieval result, the retrieval method proposed by the present disclosure can greatly improve retrieval accuracy and efficiency.
Regarding the aforementioned step S21.
The known audios may be audios in an audio database. The database stores the known audios' fingerprints, which include fingerprints of the same type as the first audio fingerprint, obtained by the same extraction method; thus the known audios' fingerprints also include a first portion representing the audio content features and a second portion representing the credibility of the first portion.
In some embodiments, the audio retrieval recognition method of the present disclosure further includes: acquiring the audio fingerprints of multiple known audios in advance — for ease of description these are called second audio fingerprints, their fingerprint units second audio fingerprint units, and their weight units second strong/weak weight units; indexing the second audio fingerprints to obtain fingerprint indexes of the known audios in advance; and matching those indexes against the first fingerprint units of the audio to be recognized to perform the TF-IDF ranking of the multiple known audios.
Specifically, obtaining the fingerprint indexes in advance further includes obtaining, in advance, a forward index and an inverted index of the known audios' fingerprints, to ease fingerprint retrieval and comparison; both may be stored in the audio database in advance. The forward index records the fingerprint of each known audio, i.e. which fingerprint units each known audio's fingerprint contains and their order; the inverted index records, for each fingerprint unit, in which known audios' fingerprints it appears. Specifically, both may be stored as key-value pairs: in the forward index, a key represents an audio's number (audio ID) and its value records which fingerprint units the audio contains and their order (call these the forward key and forward value); in the inverted index, a key represents a fingerprint unit and its value records the numbers of the audios containing that unit (call these the inverted key and inverted value).
Notably, the second audio fingerprints may be indexed according to the strong/weak weights, to improve robustness. Specifically, when building the forward index, the weight unit corresponding to each fingerprint unit of a known audio may be recorded in it. When building the inverted index, in deciding whether a unit to be indexed appears in a known audio, the weak bits of that unit may be ignored and only whether all its strong bits agree with the corresponding bits of some fingerprint unit of the known audio is judged; for example, if the first and third fingerprint bits of a unit to be indexed are strong and the rest weak, its inverted index records the numbers of the known audios containing a fingerprint unit whose first and third bits match those of the unit to be indexed.
TF-IDF ranking is a class of techniques that weight information by term frequency and inverse document frequency to judge its importance and rank it. Term frequency is how often a word (a piece of information) occurs in an article (a document): the higher the term frequency, the more important the word is to the article. Document frequency is in how many articles of the corpus a word occurs; inverse document frequency is its reciprocal (in actual computation, the logarithm of the inverse document frequency may also be taken, or it may be defined directly as the logarithm of the reciprocal of the document frequency): the higher the inverse document frequency, the more discriminative the word. TF-IDF ranking therefore ranks by the magnitude of the product of term frequency and inverse document frequency. In effect, an audio's fingerprint can be treated as an article and each fingerprint unit as a word, so the known audios can be ranked in a TF-IDF manner.
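The fingerprint-as-article analogy above can be sketched as follows (a hypothetical illustration, not the disclosed implementation: `index` maps a fingerprint unit to the set of audio IDs containing it, `total_audios` is the corpus size, and the log-of-reciprocal form of inverse document frequency is assumed):

```python
import math
from collections import Counter

def tfidf_score(query_units, candidate_units, index, total_audios):
    """TF-IDF score of one candidate audio: each fingerprint unit plays
    the role of a word, the candidate's unit sequence that of a document."""
    counts = Counter(candidate_units)
    score = 0.0
    for u in set(query_units):
        tf = counts[u] / len(candidate_units)          # term frequency
        df = len(index.get(u, ())) / total_audios      # document frequency
        if df > 0:
            score += tf * math.log(1.0 / df)           # tf * idf
    return score
```

Candidates would then be ranked by descending score.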
Moreover, performing the first ranking over all known audios in the database could hurt retrieval efficiency, so before the first ranking an exact match may first be performed on the known audios in the database. The exact match selects, as a second candidate audio set, the known audios containing at least a preset number or preset proportion of the first fingerprint units; the first ranking is then performed on the second candidate set to select the first candidate set.
FIG. 4 is a schematic flow diagram of the first ranking according to an embodiment of the present disclosure. Referring to FIG. 4, in one embodiment the first ranking specifically includes the following steps:
Step S31: using the inverted index, count in which known audios' second fingerprints each first fingerprint unit appears, so as to match out of the audio database, as the second candidate audio set, the known audios containing at least a preset number of first fingerprint units. Notably, during matching, according to a first fingerprint unit's weight unit, only the matching of the unit's strong bits in the known audios' second fingerprints may be judged, ignoring the matching of its weak bits, to improve robustness. Processing then proceeds to step S32.
Note that the "number" in "at least a preset number of first fingerprint units" refers to the number of distinct first fingerprint units. Specifically, the preset number may be one, so the matched second candidate set consists of known audios whose second fingerprint contains at least one kind of first fingerprint unit; or it may be larger, say p (a positive integer), so the matched second candidate set consists of known audios whose second fingerprint contains at least p kinds of first fingerprint units.
Step S32: using the forward index, determine the term frequency of a first fingerprint unit in a second candidate's second fingerprint, namely the proportion of that first fingerprint unit among all fingerprint units contained in the second fingerprint. Notably, the forward index may be the weight-aware index described above. Processing then proceeds to step S33.
Step S33: using the inverted index, determine the document frequency of a first fingerprint unit, namely, among multiple known audios (e.g. all known audios in the database), the proportion of known audios whose second fingerprint contains that first fingerprint unit. Notably, the inverted index may be the weight-aware index described above. Processing then proceeds to step S34.
Step S34: determine the TF-IDF score of a second candidate audio from the term frequencies of the first fingerprint units in its second fingerprint and the document frequencies of those first fingerprint units. Processing then proceeds to step S35.
Step S35: rank the second candidate set by the obtained TF-IDF scores of the second candidates, obtaining the first ranking result, and take its top k second candidates as the first candidate set. The second fingerprints (forward indexes) of the first candidates may also be returned at the same time, so that in the subsequent second ranking the first candidate set can be processed further based on those second fingerprints.
In this embodiment, an index server may be used: the set of first fingerprint units of the audio to be recognized serves as an index request, and exact matching and TF-IDF ranking are performed according to the aforementioned forward and inverted indexes, recalling the first candidate set while returning the forward indexes of the obtained first candidates. Specifically, the open-source Elasticsearch search engine may be used for these steps to achieve fast retrieval.
Notably, exact matching and the first ranking focus on which known audios each first fingerprint unit appears in and on the retrieval of the individual first fingerprint units themselves; they do not consider the effect on retrieval of the units' order within the first fingerprint, i.e. they do not consider the retrieval of the fingerprint as a whole or of consecutive runs of fingerprint units.
By performing exact matching and a TF-IDF-based first ranking according to fingerprints containing strong/weak weights, the audio retrieval recognition method proposed by the present disclosure can greatly improve the accuracy and efficiency of audio retrieval recognition.
Regarding the aforementioned step S22.
In some embodiments, the second ranking ranks the audios in the first candidate set by how ordered sequences formed of multiple sequentially arranged first fingerprint units appear in a first candidate's fingerprint. Specifically, the second ranking includes: obtaining a similarity matrix of the audios in the first candidate set from the known audios' fingerprint indexes and the first audio fingerprint, and ranking the audios in the first candidate set according to the similarity matrix. Notably, while determining the similarity matrix, weighting may be applied according to the strong/weak weights of the first fingerprint and/or the strong/weak weights in the known audios' indexes, and the weighted matrix then used to rank the audios in the first candidate set, to improve robustness.
FIG. 5 is a schematic flow diagram of the second ranking according to an embodiment of the present disclosure. Referring to FIG. 5, in one embodiment the second ranking specifically includes the following steps:
Step S41: acquire the second fingerprint of one first candidate audio in the first candidate set (in fact each first candidate is a known audio). Specifically, the second fingerprint may be acquired from the known audios' fingerprint index (e.g. the forward index). Assume the first fingerprint of the audio to be recognized contains M1 first fingerprint units and the candidate's second fingerprint contains M2 second fingerprint units, where M1 and M2 are positive integers. In some examples of the present disclosure, the first fingerprint contains a weight unit corresponding to each first fingerprint unit (call it a first strong/weak weight unit), and/or the second fingerprint contains a weight unit corresponding to each second fingerprint unit (call it a second strong/weak weight unit). Processing then proceeds to step S42.
Step S42: determine the unit similarity between each second fingerprint unit contained in the candidate's second fingerprint and each first fingerprint unit, obtaining M1×M2 unit similarities. Each unit similarity expresses the degree of similarity between one first fingerprint unit and one second fingerprint unit; specifically, a larger unit similarity may indicate greater similarity. Notably, while determining a unit similarity, the first and second fingerprint units may be weighted according to the first and/or second weight units, and the unit similarity then determined from the weighted units. In one example of the present disclosure, since the data in the audio database is more accurate, the second weight unit may be used to weight both the first and the second fingerprint unit. Processing then proceeds to step S43.
In embodiments of the present disclosure, a distance or metric capable of judging the similarity of two fingerprint units may be chosen, according to the fingerprint type, as the unit similarity. Specifically, when the first and second fingerprint units are both binary fingerprints obtained by the method of steps S11 to S17 of the foregoing embodiment, the Hamming distance between them is computed, the difference between the unit length (number of bits) and the Hamming distance is computed, and the ratio of that difference to the unit length is taken as the unit similarity, expressing the proportion of identical bits in the two binary fingerprints. The Hamming distance is a metric commonly used in information theory: the Hamming distance between two equal-length strings is the number of positions at which their characters differ. In actual computation, the two strings can be XOR-ed and the ones in the result counted; that count is the Hamming distance. It should be noted that fingerprint units extracted by the same method have the same length. The specific way the present disclosure weights this Hamming-type unit similarity by the strong/weak weights is: first weight the corresponding fingerprint bits of each unit with the weight bits of the weight unit, then XOR the first and second fingerprint units, obtaining a unit similarity weighted by the strong/weak weights. It should also be noted that the unit similarity is not limited to the Hamming distance; any distance or metric that can judge the degree of similarity of two fingerprint units may be used.
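A minimal sketch of the weighted matching-bit similarity described above, under the assumption that weighting by strong/weak bits amounts to counting only the strong bit positions (fingerprints and weights are lists of 0/1; the fallback value when no bit is strong is an assumption of this illustration):

```python
def unit_similarity(fp1, fp2, weight):
    """Proportion of matching bits between two equal-length binary
    fingerprint units, counted only over positions marked strong in
    `weight`; weak bits are ignored, as in the weighted Hamming scheme."""
    strong = [i for i, w in enumerate(weight) if w == 1]
    if not strong:
        return 1.0  # no strong bits: nothing contradicts the match
    same = sum(1 for i in strong if fp1[i] == fp2[i])
    return same / len(strong)
```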
Step S43: determine the similarity matrix between the first candidate audio and the audio to be recognized from the unit similarities.
Specifically, each point of the similarity matrix corresponds to one unit similarity, so the matrix records the unit similarities between each second fingerprint unit of one first candidate and each first fingerprint unit. The points of the matrix are arranged horizontally in the order of the first fingerprint units within the first fingerprint of the audio to be recognized, and vertically in the order of the second fingerprint units within the candidate's second fingerprint; thus the point at row i, column j represents the unit similarity between the i-th first fingerprint unit of the audio to be recognized and the j-th second fingerprint unit of the first candidate, and the similarity matrix is an M1×M2 matrix. Processing then proceeds to step S44.
It should be noted that in actual operation it is not necessary to first compute all unit similarities in step S42 and then build the matrix in step S43; the matrix may be built directly, computing the corresponding unit similarity while determining each of its points.
Step S44: from each first candidate's similarity matrix, determine that candidate's sequence similarity score. The sequence similarity score expresses the degree of similarity between the first candidate and the audio to be recognized. The score may be a number between 0 and 1, a larger number meaning the two audios are more similar. Processing then proceeds to step S45.
Specifically, the sequence similarity score is determined from the straight lines in the similarity matrix.
Note that since an audio fingerprint generally contains finitely many fingerprint units, the similarity matrix is a finite matrix, so the so-called "straight line" is actually a finite-length segment formed of multiple points of the matrix. The line has a slope, namely the slope of the line through the multiple points it includes. Moreover, the start and end points of the line may be any points of the matrix; they need not lie on the edge.
The straight lines referred to in the present disclosure include the diagonal of the similarity matrix and the segments parallel to it — the slope-1 lines running from upper left to lower right — and also lines whose slope is not 1. For example, they may be lines of slope approximately 1, to improve the robustness of retrieval; lines of slope 2, 3, ... or 1/2, 1/3, ..., to handle the retrieval of speed-changed audio; or even lines of negative slope (running from lower left to upper right in the matrix), to handle the retrieval of reverse-played audio. The diagonal is the segment formed by the points at (1,1), (2,2), (3,3), ... (in fact, the slope-1 line starting at the upper-left point).
In fact, every line of the similarity matrix consists of multiple sequentially arranged unit similarities; since each line expresses the similarity of multiple ordered pairs of fingerprint units, it can express the degree of similarity between a segment of the audio to be recognized and a segment of the known audio. Each fingerprint unit pair consists of one first fingerprint unit and one second fingerprint unit (that is, each line expresses the similarity between multiple ordered first units and multiple ordered second units). The line's slope and its start and end points express the lengths and positions of the two audio segments. For example, the line through (1,1), (2,3), (3,5), (4,7) expresses the similarity between the first unit of the first fingerprint and the first unit of the second fingerprint, between the second unit of the first fingerprint and the third unit of the second fingerprint, and so on; it can therefore reflect the similarity between the segment of the audio to be recognized covered by first units 1, 2, 3, 4 and the segment of the known audio covered by second units 1, 3, 5, 7.
The similarity between a first candidate and the audio to be recognized can therefore be determined from the lines of the similarity matrix: define the line similarity of a line as the average (or aggregate) of the unit similarities it contains, which reflects the similarity between the corresponding multiple first and second fingerprint units; find in the matrix a line of highest line similarity, called the matching line; and take the matching line's line similarity as the first candidate's sequence similarity score.
Note that the matching line may be found by picking, from multiple preset lines, the one of highest line similarity, e.g. where the preset lines are all lines whose slope equals a preset slope value (such as slope 1); or by first selecting from the matrix the points whose unit similarities rank highest and then fitting a line through them, generating a line whose line similarity is relatively the highest.
Step S45: rank the first candidate set by the sequence similarity scores of the first candidates, obtaining the second ranking result, and take its top n first candidates as the recognition result.
By performing the second ranking based on the similarity matrix according to fingerprints containing strong/weak weights, the audio retrieval recognition method proposed by the present disclosure can greatly improve the accuracy and efficiency of audio retrieval recognition.
In a specific embodiment of the present disclosure, the sequence similarity score may be determined from the similarity matrix by dynamic programming. FIG. 6 is a schematic flow diagram of audio retrieval recognition using dynamic programming according to an embodiment of the present disclosure. Referring to FIG. 6, in one embodiment step S44 includes the following specific steps:
Step S44-1a: define the multiple lines of the similarity matrix whose slope equals a preset slope value as candidate lines, and determine each candidate line's line similarity from the unit similarities it contains. Specifically, a line's line similarity may be set to the average of the unit similarities it contains, or to their sum. In one specific example, the slope value may be taken as 1, i.e. the candidate lines are the matrix diagonal and the lines parallel to it. Processing then proceeds to step S44-1b.
Note that in one embodiment of the present disclosure, step S44-1a further includes: first excluding from the candidate lines those containing fewer unit similarities than a preset line length value, and only then proceeding to step S44-1b. In other words, in this embodiment a candidate line must also satisfy: the number of unit similarities it contains reaches the preset line length value. Excluding lines with too few unit similarities avoids the problem of such lines degrading the accuracy of the final sequence similarity score.
Step S44-1b: from the multiple candidate lines, determine the one maximizing the line similarity, defined as the first matching line. Processing then proceeds to step S44-1c.
Step S44-1c: take the first matching line's line similarity as the sequence similarity score.
Note that in some embodiments of the present disclosure, step S44-1a may use multiple preset slope values, i.e. a candidate line is any line whose slope equals one of the multiple slope values — for example lines of slope 1, -1, 2, 1/2 and so on — and in step S44-1b a first matching line is determined among all candidate lines of any of those slopes.
By using dynamic programming to determine the sequence similarity score, the audio retrieval recognition method proposed by the present disclosure can improve the accuracy and efficiency of audio retrieval recognition.
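The slope-1 candidate-line search above can be sketched as follows (an illustrative example assuming the average is used as the line similarity and a minimum line length cut-off; `sim` is the similarity matrix as a 2-D numpy array):

```python
import numpy as np

def best_diagonal_score(sim, min_len=3):
    """Sequence similarity score as the highest mean over all slope-1
    diagonals of the similarity matrix; diagonals shorter than
    `min_len` unit similarities are excluded."""
    m, n = sim.shape
    best = 0.0
    for off in range(-(m - 1), n):          # every diagonal offset
        diag = np.diagonal(sim, offset=off)
        if diag.size >= min_len:
            best = max(best, float(diag.mean()))
    return best
```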
In another specific embodiment of the present disclosure, the sequence similarity score may be determined from the similarity matrix by a constant-speed media method. FIG. 7 is a schematic flow diagram of audio retrieval recognition using the constant-speed media method according to an embodiment of the present disclosure. Referring to FIG. 7, in one embodiment step S44 includes the following specific steps:
Step S44-2a: select from the similarity matrix the multiple points of largest unit similarity as similarity extremum points; the specific number of extremum points taken may be preset. Processing then proceeds to step S44-2b.
Step S44-2b: based on the multiple similarity extremum points, fit a line in the similarity matrix as the second matching line. In some specific examples, a line having, or close to, a preset slope value is fitted through the extremum points as the second matching line, e.g. a line of slope close to 1. Specifically, the random sample consensus (RANSAC) method may be used to fit, in the similarity matrix, a line of slope close to the slope value; RANSAC is a commonly used method that estimates the parameters of a mathematical model of the data from a sample set containing outliers, so as to obtain valid sample data. Processing then proceeds to step S44-2c.
Step S44-2c: determine the sequence similarity score from the multiple unit similarities contained in the second matching line. Specifically, the average of the unit similarities on the second matching line may be taken as the sequence similarity score.
By using the constant-speed media method to determine the sequence similarity score, the audio retrieval recognition method proposed by the present disclosure can improve the accuracy and efficiency of audio retrieval recognition.
Further, the similarity matrix may be obtained by jointly considering multiple kinds of audio similarity. Specifically, the audio retrieval recognition of the present disclosure further includes: acquiring multiple types of first audio fingerprints of the audio to be recognized; acquiring multiple types of second audio fingerprints of the audios in the first candidate set; and determining the similarity matrix from the fingerprint indexes built on the multiple types of second fingerprints together with the multiple types of first fingerprints.
FIG. 8 is a schematic flow diagram of determining the similarity matrix from multiple types of first and second audio fingerprints for audio retrieval, according to an embodiment of the present disclosure. Referring to FIG. 8, in one embodiment the audio retrieval recognition method of the present disclosure includes:
Step S51: using multiple fingerprint extraction methods, acquire multiple types of first audio fingerprints of the audio to be recognized; each type of first fingerprint contains multiple first portions representing the audio content features, called first audio fingerprint units. Optionally, at least some types of first fingerprints also contain a second portion representing the credibility of the first portion. For example, the fingerprint of the audio to be recognized obtained by the method of steps S11 to S17 of the foregoing embodiment may be acquired together with fingerprints of other types. Processing then proceeds to step S52.
Step S52: acquire multiple types of second audio fingerprints of a known audio (specifically, an audio in the aforementioned first candidate set); each type of second fingerprint contains multiple first portions representing the audio content features, called second audio fingerprint units. Optionally, at least some types of second fingerprints also contain a second portion representing the credibility of the first portion. For example, the known audio's fingerprint obtained by the method of steps S11 to S17 may be acquired together with fingerprints of other types. Processing then proceeds to step S53.
Step S53: by a method similar to step S42 of the foregoing embodiment, determine the unit similarities between second and first fingerprint units of the same type; thus, corresponding to the multiple fingerprint types, multiple unit similarities are obtained for one known audio. Processing then proceeds to step S54.
Step S54: determine the average or minimum of the multiple unit similarities, and determine the known audio's similarity matrix from that average or minimum by a method similar to step S43 of the foregoing embodiment.
Processing then proceeds to step S44 of the foregoing example, where, from this similarity matrix based on the average or minimum of multiple unit similarities, the sequence similarity score is determined, the second ranking result is determined, and so on.
The effect of determining the similarity matrix from the average or minimum of multiple similarities is that retrieval based on the similarity of a single fingerprint type may mismatch; taking the average or the minimum of the similarities of multiple fingerprint types can reduce or eliminate such mismatches, improving the accuracy of audio retrieval recognition.
It should be noted that before taking the average or minimum of the multiple unit similarities, it must be ensured that the various unit similarities have consistent value ranges; for example, the value ranges of all types of unit similarity may be set in advance to 0 to 1. In fact, the Hamming-distance-based unit similarity described above already has a value range of 0 to 1.
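The fusion of per-type unit similarities described above can be sketched as follows (an illustrative example assuming all input similarities already share the 0-to-1 range):

```python
def fused_similarity(per_type_sims):
    """Fuse the unit similarities of one unit pair across fingerprint
    types, returning both fusion rules described above: the average
    and the minimum."""
    avg = sum(per_type_sims) / len(per_type_sims)
    return avg, min(per_type_sims)
```

Either returned value may then populate the corresponding point of the similarity matrix.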
In some embodiments of the present disclosure, the audio retrieval recognition method further includes: before the first ranking, slicing the acquired first fingerprint of the audio to be recognized and the second fingerprints of the known audios into pieces of a preset fixed length, obtaining multiple first and second sub-audio fingerprints of the same length (containing the same number of fingerprint units) — in embodiments that include the step of indexing the second fingerprints, the slicing is done before indexing; and/or, before acquiring the fingerprints, slicing the audio to be recognized and the known audios in advance into segments of a preset fixed duration, obtaining multiple equal-duration to-be-recognized audio segments and known audio segments, and then acquiring each segment's fingerprint, obtaining a first sub-audio fingerprint for each to-be-recognized segment and a second sub-audio fingerprint for each known segment. The aforementioned first and second rankings are then performed for each first and second sub-audio fingerprint, giving a recognition result per sub-fingerprint, from which the recognition result of the original audio to be recognized is determined.
Slicing the audio or the audio fingerprints to a fixed length has the effects that: 1. the TF-IDF ranking becomes fairer; 2. the computed unit similarities and sequence similarity scores become more accurate; 3. a uniform length eases the storage of the audio fingerprints and fingerprint indexes.
In some embodiments of the present disclosure, the first fingerprint units within the first fingerprint and the second fingerprint units within the second fingerprint are arranged temporally, e.g. in chronological order. The audio retrieval recognition method of the present disclosure then further includes: determining, from the aforementioned similarity matrix, the repeated segments of the audio to be recognized and the known audio (specifically, an audio in the aforementioned recognition result). Specifically, the start and end times of the repeated segments in the two audios can be obtained from the start and end points of the line in the similarity matrix.
The specific way to determine the repeated segments from a line of the similarity matrix (e.g. the matching line) may be: the start time of the repeated segment in the audio to be recognized is determined from the index of the first fingerprint unit corresponding to the line's start point (i.e. its horizontal coordinate in the matrix), and the start time of the repeated segment in the first candidate from the index of the second fingerprint unit corresponding to that start point (its vertical coordinate); similarly, the end time of the repeated segment in the audio to be recognized is determined from the horizontal coordinate of the line's end point, and the end time in the first candidate from its vertical coordinate.
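The endpoint-to-time conversion above can be sketched as follows (an illustrative example; the per-unit time step is an assumption for the illustration, chosen to match the 50 ms hop mentioned earlier, and would depend on the actual fingerprint layout):

```python
def repeat_segment_times(start_pt, end_pt, hop_seconds=0.05):
    """Convert a matching line's endpoints (unit indices in the
    similarity matrix, as (query_index, known_index) pairs) into the
    start/end times of the repeated segment in each audio."""
    (i0, j0), (i1, j1) = start_pt, end_pt
    return ((i0 * hop_seconds, i1 * hop_seconds),   # query audio span
            (j0 * hop_seconds, j1 * hop_seconds))   # known audio span
```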
In some embodiments of the present disclosure (e.g. those of FIGS. 6 and 7 above), step S44 further includes: inspecting the beginning and end of the obtained first or second matching line, judging whether the points (unit similarities) at the beginning and end of the first/second matching line reach a preset unit similarity value, removing the head and tail portions of the line that do not reach that value (i.e. whose unit similarity is not high), and keeping the middle section of the line, defined as the third matching line; the sequence similarity score is then determined from the third matching line's line similarity, and/or the start and end times of the repeated segments of the known audio and the audio to be recognized are determined from the third matching line's start and end points. By removing the low-similarity head and tail of the matching line, keeping a higher-similarity middle section, and only then determining the similarity of the known audio and the audio to be recognized, the accuracy of audio retrieval recognition can be improved and more accurate repeated segments obtained.
The specific way to remove the head/tail portions of the matching line that fail the unit similarity value may be: check point by point from the line's start/end toward the middle whether the value is reached; once the first point reaching the value is found, remove the points between it and the start/end.
Note that the unit similarity value may be a concrete unit-similarity number, against which each inspected point is checked; or it may be a proportion, in which case each inspected point is checked against the average or maximum of all points contained in the first/second matching line to see whether that proportion is reached.
FIG. 9 is a schematic structural block diagram of an embodiment of the audio retrieval recognition apparatus 1000 of the present disclosure. Referring to FIG. 9, the exemplary apparatus 1000 mainly includes:
an audio fingerprint acquisition system 1100 for acquiring the audio fingerprint of the audio to be recognized (the query audio), the fingerprint including a first portion representing the content features of the audio to be recognized and a second portion representing the credibility of the first portion; and
a retrieval recognition system 1200 for recognizing the audio to be recognized according to its audio fingerprint, obtaining a recognition result.
FIG. 10 is a schematic structural block diagram of the audio fingerprint acquisition system 1100 according to an embodiment of the present disclosure. Referring to FIG. 10, the exemplary system 1100 mainly includes: a spectrogram conversion module 1101, a feature point determination module 1102, a mask determination module 1103, a mean energy determination module 1104, an audio fingerprint bit determination module 1105, a strong/weak weight bit determination module 1106 and an audio fingerprint determination module 1107.
The spectrogram conversion module 1101 is used to convert the audio into a spectrogram. Specifically, module 1101 may convert the audio signal into a time-frequency spectrogram by short-time Fourier transform.
In embodiments of the present disclosure, module 1101 may include a Mel transform sub-module for preprocessing the spectrogram with a Mel (MEL) transform, which divides the spectrum into multiple frequency bins, the number of which is configurable. Module 1101 may also include a Human Auditory System filtering sub-module for applying Human Auditory System filtering or similar nonlinear transformations to the spectrogram, making the spectral distribution better match human auditory perception.
The feature point determination module 1102 is used to determine the feature points in the spectrogram.
Specifically, module 1102 may use one of several criteria to determine the feature points, e.g. selecting the energy maximum points of the spectrogram, or the energy minimum points, as feature points.
In embodiments of the present disclosure, module 1102 may instead select fixed points, rather than energy extremum points, as feature points, e.g. points whose frequency equals a preset frequency value (frequency-fixed points). Further, module 1102 may preset multiple low-, mid- and high-frequency values.
The mask determination module 1103 is used to determine, on the spectrogram near each feature point, one or more masks for the feature point, each mask containing multiple spectral regions. Specifically, the multiple spectral regions of a mask may be symmetrically distributed in the spectrogram.
The mean energy determination module 1104 is used to determine the mean energy of each spectral region.
The audio fingerprint bit determination module 1105 is used to determine the audio fingerprint bits from the mean energies of the multiple spectral regions in each mask. Note that these fingerprint bits form the aforementioned first portion of the audio fingerprint, representing the audio content features.
In embodiments of the present disclosure, module 1105 may determine one audio fingerprint bit from the difference of the mean energies of the multiple spectral regions contained in one mask.
该强弱权重比特确定模块1106,用于判断音频指纹比特的可信程度,以确定每个音频指纹比特对应的强弱权重比特。需要注意的是,该强弱权重比特即为前述的音频指纹中的用于表示第一部分的可信程度的第二部分。
在本公开的实施例中,如果音频指纹比特是根据一个掩模所包含的多个谱区域均值能量的差值确定的,则该强弱权重比特确定模块1106具体用于:判断生成该音频指纹比特所使用的该差值的绝对值是否达到(或超过)预设的强弱比特阈值;如果达到强弱比特阈值,则将该音频指纹比特确定为强比特,并得到一个与该音频指纹比特对应的取值为1的强弱权重比特;如果未达到强弱比特阈值,则将该音频指纹比特确定为弱比特,并得到一个与该音频指纹比特对应的取值为0的强弱权重比特。
该音频指纹确定模块1107,用于根据该音频指纹比特和该强弱权重比特确定音频的音频指纹。
本公开通过在提取音频指纹比特的同时,提取该音频指纹比特对应的强弱权重比特,能够为一段音频生成一个准确性高、鲁棒性好的音频指纹。
可选地,本公开的音频指纹获取系统1100还包括时间戳添加模块(图中未示出),用于为音频指纹添加一个时间戳字段,该字段用于表示音频起始位置与该特征点的时间差,可以是一个hash值。而如果将特征点设为固定点,则可以不必包含本模块,即不必记录该时间戳。
可选地,本公开的音频指纹获取系统1100还包括音频标识添加模块(图中未示出),用于为音频指纹添加一个音频标识字段,用于记录该音频指纹所对应的音频信号的ID标识信息,该字段可以是一个hash值。
可选地,本公开的音频指纹获取系统1100还包括音频分割模块(图中未示出)和音频指纹组合模块(图中未示出)。该音频分割模块用于将原始音频按时间分成多段子音频。利用前述的音频指纹获取系统1100所包含的模块,对各段子音频提取音频指纹,得到多个音频指纹。而音频指纹组合模块用于将提取的各段子音频的音频指纹组合在一起,得到该整段音频的音频指纹。
为了便于叙述和理解,不妨将待识别音频的音频指纹称为第一音频指纹,第一音频指纹所包含的音频指纹单体称为第一音频指纹单体,第一音频指纹单体对应的强弱权重单体称为第一强弱权重单体。
图11为本公开一个实施例提供的检索识别系统1200的示意性结构框图。请参阅图11,本公开示例的检索识别系统1200主要包括:
第一排名模块1210,用于根据该第一音频指纹,对多个已知音频进行第一排名,根据该第一排名的结果,取出前k个已知音频作为第一候选音频集合。其中的k为正整数,而k的具体取值是可以设置的。具体地,该第一排名模块1210用于根据每个单独的第一音频指纹单体与已知音频的匹配情况进行排名。进一步地,该第一排名模块1210可以用于根据各个第一音频指纹单体对已知音频进行词频-逆向文件频率TF-IDF排名。
第二排名模块1220,用于根据该第一音频指纹,对该第一候选音频集合进行第二排名,根据该第二排名的结果,取出第一候选音频集合中的前n个第一候选音频作为识别结果。其中的n为正整数,而n的具体取值是可以设置的。具体地,该第二排名模块1220用于根据多个顺序排列的第一音频指纹单体,对该第一候选音频集合中的音频进行排名。
另外,检索识别系统1200还可用于根据该识别结果在Meta数据库中进行检索,能够得到该识别结果的音频信息,例如识别出的音频的名称、作者、出处等等。当识别结果包括多个音频时,可以同时提供多个识别出的音频的信息。
在本实施例中,第一排名模块1210在进行第一排名和/或第二排名模块1220在进行第二排名的过程中,在利用到音频指纹时,可以根据音频指纹中的强弱权重单体对音频指纹单体进行加权。
前述的已知音频可以是一个音频数据库中的音频。在该音频数据库中存储有已知音频的音频指纹,并且所存储的已知音频的音频指纹中包含有利用与第一音频指纹相同的提取方法得到的、与第一音频指纹类型相同的音频指纹,从而已知音频的音频指纹中也包括用于表示音频的内容特征的第一部分以及用于表示该第一部分的可信程度的第二部分。
在本公开的一些实施例中,本公开的音频检索识别装置1000还包括指纹索引获取模块(图中未示出),用于获取多个已知音频的音频指纹,为了便于叙述和理解,不妨将已知音频的音频指纹称为第二音频指纹,第二音频指纹所包含的音频指纹单体称为第二音频指纹单体,第二音频指纹所包含的强弱权重单体称为第二强弱权重单体;对该第二音频指纹进行索引,以预先得到已知音频的指纹索引。而第一排名模块1210具体用于将该指纹索引与待识别音频的第一音频指纹单体进行匹配,以对多个已知音频进行TF-IDF排名。
进一步地,该指纹索引获取模块可以用于获取已知音频的音频指纹的正排指纹索引(forward index)和倒排指纹索引(inverted index)。
值得注意的是,该指纹索引获取模块可以用于根据强弱权重对第二音频指纹进行索引,以提高鲁棒性。
另外,如果对音频数据库中的所有已知音频都进行第一排名,可能会影响检索识别的效率,因此本公开的第一排名模块1210可以包括绝对匹配子模块1211,用于在第一排名之前,先对音频数据库中的已知音频进行绝对匹配(exact match)。
图12为本公开一个实施例提供的第一排名模块1210的示意性结构图。请参阅图12,在本公开的一个实施例中,该第一排名模块1210具体包括:
绝对匹配子模块1211,用于根据倒排指纹索引,统计各个第一音频指纹单体在哪些已知音频的第二音频指纹中出现,以从音频数据库中匹配出包含预设数量以上第一音频指纹单体的已知音频作为第二候选音频集合。值得注意的是,该绝对匹配子模块1211可以具体用于根据一个第一音频指纹单体对应的强弱权重单体,仅判断该第一音频指纹单体中的强比特在已知音频的第二音频指纹中的匹配情况,而忽略该第一音频指纹单体中的弱比特的匹配情况,以提高鲁棒性。
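其中"仅判断强比特的匹配情况、忽略弱比特"的做法,可用如下示意代码理解(假设性实现,三个参数均为等长的 0/1 比特列表,query_weights 中取值为 1 的位置即强比特):

```python
def strong_bit_match(query_unit, query_weights, known_unit):
    """仅判断第一音频指纹单体中的强比特在已知指纹单体中的匹配情况(示意)。

    弱比特(权重为0的位置)不参与比较,以提高鲁棒性。
    """
    return all(q == k
               for q, w, k in zip(query_unit, query_weights, known_unit)
               if w == 1)
```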
词频确定子模块1212,用于基于正排指纹索引,确定一个第一音频指纹单体在一个第二候选音频的第二音频指纹中的词频。值得注意的是,该正排指纹索引可以是前述的根据强弱权重得到的指纹索引。
文件频率确定子模块1213,用于基于倒排指纹索引,确定一个第一音频指纹单体的文件频率。值得注意的是,该倒排指纹索引可以是前述的根据强弱权重得到的指纹索引。
词频-逆向文件频率评分子模块1214,用于根据各个第一音频指纹单体在一个第二候选音频的第二音频指纹中的词频以及各个第一音频指纹单体的文件频率,确定该第二候选音频的词频-逆向文件频率评分。
第一排名子模块1215,用于根据得到的各个第二候选音频的词频-逆向文件频率评分对第二候选音频集合进行排名,得到第一排名的结果,从该第一排名结果中取出前k个第二候选音频作为第一候选音频集合;该第一排名子模块1215还可用于将各个第一候选音频的第二音频指纹(正排指纹索引)返回给第二排名模块1220,以备后续进一步处理。
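上述词频-逆向文件频率评分的计算可用如下草图示意(假设性实现:IDF 取 log(N/df),实际可采用多种变体;term_freq 对应正排指纹索引统计,doc_freq 对应倒排指纹索引统计):

```python
import math

def tfidf_score(query_units, term_freq, doc_freq, n_docs):
    """计算一个第二候选音频的词频-逆向文件频率(TF-IDF)评分(示意)。

    query_units: 各个第一音频指纹单体;
    term_freq[u]: 单体 u 在该候选音频的第二音频指纹中出现的次数(词频);
    doc_freq[u]: 包含单体 u 的已知音频数目(文件频率);
    n_docs: 已知音频总数。
    """
    score = 0.0
    for u in query_units:
        tf = term_freq.get(u, 0)
        df = doc_freq.get(u, 0)
        if tf and df:
            # 出现于越少已知音频的单体区分度越高,IDF 越大
            score += tf * math.log(n_docs / df)
    return score
```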
在本公开的一些实施例中,该第二排名为根据多个顺序排列的第一音频指纹单体所组成的具有先后顺序的序列在第一候选音频的音频指纹中出现的情况,对该第一候选音频集合中的音频进行的排名。具体地,该第二排名模块1220用于:根据已知音频的指纹索引与第一音频指纹得到该第一候选音频集合中的音频的相似度矩阵,根据该相似度矩阵对该第一候选音频集合中的音频进行排名。值得注意的是,该第二排名模块1220可以具体用于:在确定相似度矩阵的过程中,根据第一音频指纹对应的强弱权重和/或已知音频的指纹索引中的强弱权重进行加权,并利用加权后的相似度矩阵对第一候选音频集合中的音频进行排名,以提高鲁棒性。
图13为本公开一个实施例提供的第二排名模块1220的示意性结构图。请参阅图13,在本公开的一个实施例中,该第二排名模块1220具体包括:
第二音频指纹获取子模块1221,用于获取第一候选音频集合中的一个第一候选音频(事实上每个第一候选音频都是已知音频)的第二音频指纹。具体地,可以根据已知音频的指纹索引(例如,正排指纹索引)获取该第二音频指纹。在本公开的一些示例中,第一音频指纹中包含有与各个第一音频指纹单体对应的强弱权重单体(不妨称之为第一强弱权重单体),和/或第二音频指纹中包含有与各个第二音频指纹单体对应的强弱权重单体(不妨称之为第二强弱权重单体)。
单体相似度第一确定子模块1222,用于确定该第一候选音频的第二音频指纹所包含的各个第二音频指纹单体与各个第一音频指纹单体之间的单体相似度。值得注意的是,单体相似度第一确定子模块1222可以具体用于:在确定该单体相似度的过程中,根据第一强弱权重单体和/或第二强弱权重单体,对各个第一音频指纹单体、第二音频指纹单体进行加权,然后根据加权后的第一、第二音频指纹单体确定该单体相似度。在本公开的一种示例中,由于音频数据库中的数据信息的准确性更高,可以利用第二强弱权重单体分别对第一音频指纹单体、第二音频指纹单体进行加权。
相似度矩阵第一确定子模块1223,用于根据各个单体相似度,确定该第一候选音频与待识别音频之间的相似度矩阵。
序列相似度评分确定子模块1224,用于根据一个第一候选音频的相似度矩阵,确定该第一候选音频的序列相似度评分。具体地,该序列相似度评分确定子模块1224具体用于根据相似度矩阵中的直线来确定该序列相似度评分。
第二排名子模块1225,用于根据各个第一候选音频的该序列相似度评分对第一候选音频集合进行排名,得到第二排名的结果,从该第二排名结果中取出前n个第一候选音频作为识别结果。
在本公开的一个实施例中,该序列相似度评分确定子模块1224具体用于利用前述的匀速音频法的各个具体步骤来确定该序列相似度评分。
在本公开的一个实施例中,该序列相似度评分确定子模块1224具体用于利用前述的动态规划法的各个具体步骤来确定该序列相似度评分。
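以匀速音频的情形为例,"根据相似度矩阵中的直线确定序列相似度评分"可简化示意如下(假设性实现:仅考察斜率为 1 的对角直线,对每条对角线求单体相似度之和并取最大值;动态规划法等更一般的做法在此从略):

```python
def sequence_similarity(query_fp, known_fp, unit_sim):
    """根据相似度矩阵中的对角直线确定序列相似度评分(匀速情形的示意)。

    query_fp / known_fp: 顺序排列的第一/第二音频指纹单体序列;
    unit_sim(q, k): 两个指纹单体之间的单体相似度。
    """
    n, m = len(query_fp), len(known_fp)
    matrix = [[unit_sim(q, k) for k in known_fp] for q in query_fp]
    best = 0.0
    # 每个 offset 对应相似度矩阵中一条斜率为 1 的直线
    for offset in range(-(n - 1), m):
        total = sum(matrix[i][i + offset]
                    for i in range(n) if 0 <= i + offset < m)
        best = max(best, total)
    return best
```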
进一步地,其中相似度矩阵可以是由多种音频相似度综合考量得到的。图14为本公开一个实施例的基于多种类型的第一音频指纹和第二音频指纹确定相似度矩阵的音频检索识别装置1000的示意性结构框图。请参阅图14,在本公开的一个实施例中,本公开的音频检索识别装置1000包括:
多类型第一音频指纹获取模块1300,用于利用多种音频指纹获取方法,获取待识别音频的多种类型的第一音频指纹,每种类型的第一音频指纹包含多个用于表示音频内容特征的第一部分,不妨称为第一音频指纹单体。可选地,至少一些类型的第一音频指纹包含用于表示第一部分的可信程度的第二部分。
多类型第二音频指纹获取模块1400,用于获取一个已知音频(具体地,可以是前述的第一候选音频集合中的音频)的多种类型的第二音频指纹,每种类型的第二音频指纹包含多个用于表示音频内容特征的第一部分,不妨称为第二音频指纹单体。可选地,至少一些类型的第二音频指纹包含用于表示第一部分的可信程度的第二部分。
单体相似度第二确定子模块1500,用于分别确定同种类型的该第二音频指纹单体与该第一音频指纹单体之间的单体相似度。从而对应于多种类型的音频指纹,能够得到一个已知音频的多种单体相似度。
相似度矩阵第二确定子模块1600,用于确定多种单体相似度的平均值或最小值,并根据多种单体相似度的该平均值或该最小值确定该已知音频的相似度矩阵。
进而前述的序列相似度评分确定子模块1224用于根据该基于多种单体相似度的平均值或最小值的相似度矩阵来确定序列相似度评分。
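将多种单体相似度按平均值或最小值综合为一个相似度矩阵的过程,可示意如下(假设性实现,matrices 为多种类型指纹分别得到的多个同尺寸相似度矩阵):

```python
def combine_similarity_matrices(matrices, mode="min"):
    """将多个相似度矩阵逐点取最小值或平均值,得到综合的相似度矩阵(示意)。"""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    combined = []
    for i in range(rows):
        row = []
        for j in range(cols):
            vals = [m[i][j] for m in matrices]
            # mode="min" 取多种单体相似度的最小值,否则取平均值
            row.append(min(vals) if mode == "min" else sum(vals) / len(vals))
        combined.append(row)
    return combined
```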
在本公开的一些实施例中,该音频检索识别装置1000还包括音频切片模块(图中未示出)。该音频切片模块用于在进行第一排名之前,对获取的待识别音频的第一音频指纹以及已知音频的第二音频指纹按照预设的固定长度切片,得到多个长度相同(包含相同数量的音频指纹单体)的第一子音频指纹和第二子音频指纹;和/或,该音频切片模块用于在获取音频指纹之前,预先对待识别音频以及已知音频按照预设的固定时间长度切片,得到多段时间长度相同的待识别音频片段和已知音频片段,然后分别获取各个待识别音频片段和已知音频片段的音频指纹,得到各个待识别音频片段的第一子音频指纹、各个已知音频片段的第二子音频指纹。而前述的第一排名模块1210和第二排名模块1220用于根据每个第一子音频指纹、第二子音频指纹进行前述的第一排名和第二排名的步骤,得到各个子音频指纹的识别结果,然后根据各个子音频指纹的识别结果确定原始的待识别音频的识别结果。
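其中按固定长度对音频指纹切片的操作可示意如下(假设性实现:不足一片的尾部直接丢弃,实际实现也可保留或补齐):

```python
def slice_fingerprint(fingerprint, piece_len):
    """将音频指纹按预设固定长度切片,得到多个包含相同数量指纹单体的子指纹(示意)。"""
    n = len(fingerprint) // piece_len
    return [fingerprint[i * piece_len : (i + 1) * piece_len] for i in range(n)]
```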
在本公开的一些实施例中,第一音频指纹中的第一音频指纹单体以及第二音频指纹中的第二音频指纹单体在排列上具有时间性。这时,本公开的音频检索识别装置1000还包括重复音频片段确定模块(图中未示出),用于根据前述的相似度矩阵确定待识别音频与已知音频的重复片段。具体地,该重复音频片段确定模块具体用于根据相似度矩阵中的直线的起点和终点得到两个音频中的重复片段的起止时间。
图15是图示根据本公开的实施例的音频检索识别硬件装置的硬件框图。如图15所示,根据本公开实施例的音频检索识别硬件装置2000包括存储器2001和处理器2002。音频检索识别硬件装置2000中的各组件通过总线系统和/或其它形式的连接机构(未示出)互连。
该存储器2001用于存储非暂时性计算机可读指令。具体地,存储器2001可以包括一个或多个计算机程序产品,该计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。该易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。该非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。
该处理器2002可以是中央处理单元(CPU)或者具有数据处理能力和/或指令执行能力的其它形式的处理单元,并且可以控制音频检索识别硬件装置2000中的其它组件以执行期望的功能。在本公开的一个实施例中,该处理器2002用于运行该存储器2001中存储的该计算机可读指令,使得该音频检索识别硬件装置2000执行前述的本公开各实施例的音频检索识别方法的全部或部分步骤。
图16是图示根据本公开的实施例的计算机可读存储介质的示意图。如图16所示,根据本公开实施例的计算机可读存储介质3000,其上存储有非暂时性计算机可读指令3001。当该非暂时性计算机可读指令3001由处理器运行时,执行前述的本公开各实施例的音频检索识别方法的全部或部分步骤。
图17是图示根据本公开实施例的终端设备的硬件结构示意图。终端设备可以以各种形式来实施,本公开中的终端设备可以包括但不限于诸如移动电话、智能电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、导航装置、车载终端设备、车载显示终端、车载电子后视镜等等的移动终端设备以及诸如数字TV、台式计算机等等的固定终端设备。
如图17所示,终端设备4100可以包括无线通信单元4110、A/V(音频/视频)输入单元4120、用户输入单元4130、感测单元4140、输出单元4150、存储器4160、接口单元4170、控制器4180和电源单元4190等等。图17示出了具有各种组件的终端设备,但是应理解的是,并不要求实施所有示出的组件。可以替代地实施更多或更少的组件。
其中,无线通信单元4110允许终端设备4100与无线通信系统或网络之间的无线电通信。A/V输入单元4120用于接收音频或视频信号。用户输入单元4130可以根据用户输入的命令生成键输入数据以控制终端设备的各种操作。感测单元4140检测终端设备4100的当前状态、终端设备4100的位置、用户对于终端设备4100的触摸输入的有无、终端设备4100的取向、终端设备4100的加速或减速移动和方向等等,并且生成用于控制终端设备4100的操作的命令或信号。接口单元4170用作至少一个外部装置与终端设备4100连接可以通过的接口。输出单元4150被构造为以视觉、音频和/或触觉方式提供输出信号。存储器4160可以存储由控制器4180执行的处理和控制操作的软件程序等等,或者可以暂时地存储已经输出或将要输出的数据。存储器4160可以包括至少一种类型的存储介质。而且,终端设备4100可以与通过网络连接执行存储器4160的存储功能的网络存储装置协作。控制器4180通常控制终端设备的总体操作。另外,控制器4180可以包括用于再现或回放多媒体数据的多媒体模块。控制器4180可以执行模式识别处理,以将在触摸屏上执行的手写输入或者图片绘制输入识别为字符或图像。电源单元4190在控制器4180的控制下接收外部电力或内部电力并且提供操作各元件和组件所需的适当的电力。
本公开提出的音频检索识别方法的各种实施方式可以以使用例如计算机软件、硬件或其任何组合的计算机可读介质来实施。对于硬件实施,本公开提出的音频检索识别方法的各种实施方式可以通过使用特定用途集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理装置(DSPD)、可编程逻辑装置(PLD)、现场可编程门阵列(FPGA)、处理器、控制器、微控制器、微处理器、被设计为执行这里描述的功能的电子单元中的至少一种来实施,在一些情况下,本公开提出的音频检索识别方法的各种实施方式可以在控制器4180中实施。对于软件实施,本公开提出的音频检索识别方法的各种实施方式可以与允许执行至少一种功能或操作的单独的软件模块来实施。软件代码可以由以任何适当的编程语言编写的软件应用程序(或程序)来实施,软件代码可以存储在存储器4160中并且由控制器4180执行。
以上,根据本公开实施例的音频检索识别方法、装置、硬件装置、计算机可读存储介质以及终端设备,通过获取并利用音频对象的包括用于表示音频内容特征的第一部分和用于表示第一部分的可信程度的第二部分的音频指纹特征来进行音频检索识别,能够大大提高音频检索识别的准确性、鲁棒性和效率。
以上结合具体实施例描述了本公开的基本原理,但是,需要指出的是,在本公开中提及的优点、优势、效果等仅是示例而非限制,不能认为这些优点、优势、效果等是本公开的各个实施例必须具备的。另外,上述公开的具体细节仅是为了示例的作用和便于理解的作用,而非限制,上述细节并不限制本公开为必须采用上述具体的细节来实现。
本公开中涉及的器件、装置、设备、系统的方框图仅作为例示性的例子并且不意图要求或暗示必须按照方框图示出的方式进行连接、布置、配置。如本领域技术人员将认识到的,可以按任意方式连接、布置、配置这些器件、装置、设备、系统。诸如“包括”、“包含”、“具有”等等的词语是开放性词汇,指“包括但不限于”,且可与其互换使用。这里所使用的词汇“或”和“和”指词汇“和/或”,且可与其互换使用,除非上下文明确指示不是如此。这里所使用的词汇“诸如”指词组“诸如但不限于”,且可与其互换使用。
另外,如在此使用的,在以“至少一个”开始的项的列举中使用的“或”指示分离的列举,以便例如“A、B或C的至少一个”的列举意味着A或B或C,或AB或AC或BC,或ABC(即A和B和C)。此外,措辞“示例的”不意味着描述的例子是优选的或者比其他例子更好。
还需要指出的是,在本公开的系统和方法中,各部件或各步骤是可以分解和/或重新组合的。这些分解和/或重新组合应视为本公开的等效方案。
可以不脱离由所附权利要求定义的教导的技术而进行对在此所述的技术的各种改变、替换和更改。此外,本公开的权利要求的范围不限于以上所述的处理、机器、制造、事件的组成、手段、方法和动作的具体方面。可以利用与在此所述的相应方面进行基本相同的功能或者实现基本相同的结果的当前存在的或者稍后要开发的处理、机器、制造、事件的组成、手段、方法或动作。因而,所附权利要求包括在其范围内的这样的处理、机器、制造、事件的组成、手段、方法或动作。
提供所公开的方面的以上描述以使本领域的任何技术人员能够做出或者使用本公开。对这些方面的各种修改对于本领域技术人员而言是非常显而易见的,并且在此定义的一般原理可以应用于其他方面而不脱离本公开的范围。因此,本公开不意图被限制到在此示出的方面,而是按照与在此公开的原理和新颖的特征一致的最宽范围。
为了例示和描述的目的已经给出了以上描述。此外,此描述不意图将本公开的实施例限制到在此公开的形式。尽管以上已经讨论了多个示例方面和实施例,但是本领域技术人员将认识到其某些变型、修改、改变、添加和子组合。
Claims (35)
- 一种音频检索识别方法,所述方法包括:获取待识别音频的音频指纹,其中,所述音频指纹包括用于表示所述待识别音频的内容特征的第一部分以及用于表示所述第一部分的可信程度的第二部分;根据所述音频指纹对所述待识别音频进行识别,得到识别结果。
- 根据权利要求1所述的音频检索识别方法,其中,所述获取待识别音频的音频指纹包括:将所述待识别音频转换成声谱图;确定所述声谱图中的特征点;在所述声谱图上,为所述特征点确定一个或多个掩模,每个所述掩模包含多个谱区域;确定每个所述谱区域的均值能量;根据所述掩模中的所述多个谱区域的均值能量确定音频指纹比特;判断所述音频指纹比特的可信程度以确定强弱权重比特;根据所述音频指纹比特和所述强弱权重比特确定所述待识别音频的音频指纹。
- 根据权利要求2所述的音频检索识别方法,其中,所述将所述待识别音频转换成声谱图包括:通过短时傅里叶变换将所述待识别音频转换成时间-频率的二维声谱图,所述声谱图中每个点的取值代表所述待识别音频的能量。
- 根据权利要求3所述的音频检索识别方法,其中,所述将所述待识别音频转换成声谱图还包括:对所述声谱图进行梅尔变换。
- 根据权利要求3所述的音频检索识别方法,其中,所述特征点为所述声谱图中的固定点。
- 根据权利要求5所述的音频检索识别方法,其中,所述特征点为频率值与预设的多个频率设定值相等的点。
- 根据权利要求3所述的音频检索识别方法,其中,所述特征点为所述声谱图中的能量极大值点,或者,所述特征点为所述声谱图中的能量极小值点。
- 根据权利要求2所述的音频检索识别方法,其中,所述掩模所包含的多个所述谱区域是对称分布的。
- 根据权利要求8所述的音频检索识别方法,其中,所述掩模所包含的多个所述谱区域具有相同的频率范围、和/或具有相同的时间范围、和/或以所述特征点为中心而中心对称分布。
- 根据权利要求2所述的音频检索识别方法,其中,所述谱区域均值能量为所述谱区域所包含的所有点的能量值的平均值。
- 根据权利要求2所述的音频检索识别方法,其中,所述的根据所述掩模中的所述多个谱区域的均值能量确定音频指纹比特包括:根据一个所述掩模所包含的多个所述谱区域的均值能量的差值确定一个音频指纹比特。
- 根据权利要求11所述的音频检索识别方法,其中,所述的判断所述音频指纹比特的可信程度以确定强弱权重比特包括:判断所述差值的绝对值是否达到或超过预设的强弱比特阈值,如果达到或超过所述强弱比特阈值,则将所述音频指纹比特确定为强比特,否则将所述音频指纹比特确定为弱比特;根据所述音频指纹比特是强比特还是弱比特来确定所述强弱权重比特。
- 根据权利要求2所述的音频检索识别方法,所述方法还包括:将待识别音频按时间分成多段子音频;提取每段所述子音频的所述音频指纹;将提取得到的各个所述子音频的所述音频指纹进行组合,得到所述待识别音频的音频指纹。
- 根据权利要求2所述的音频检索识别方法,其中,将所述待识别音频的音频指纹定义为第一音频指纹,所述第一音频指纹包含多个第一音频指纹单体以及与各个所述第一音频指纹单体相对应的第一强弱权重单体,所述第一音频指纹单体包含所述待识别音频的多个所述音频指纹比特,所述第一强弱权重单体包含与所述多个音频指纹比特相对应的多个所述强弱权重比特。
- 根据权利要求14所述的音频检索识别方法,其中,所述根据所述音频指纹对所述待识别音频进行识别包括:根据每个单独的所述第一音频指纹单体对多个已知音频进行第一排名,根据所述第一排名的结果,取出前k个所述已知音频作为第一候选音频集合,其中k为正整数;根据多个顺序排列的所述第一音频指纹单体对所述第一候选音频集合进行第二排名,根据所述第二排名的结果,取出前n个所述第一候选音频作为识别结果,其中n为正整数。
- 根据权利要求15所述的音频检索识别方法,还包括:预先获取所述已知音频的音频指纹作为第二音频指纹,所述第二音频指纹包含多个第二音频指纹单体以及与所述第二音频指纹单体相对应的第二强弱权重单体;对所述第二音频指纹进行索引,以预先得到所述已知音频的指纹索引。
- 根据权利要求16所述的音频检索识别方法,其中,在进行所述第一排名和/或进行所述第二排名的过程中,根据所述第一强弱权重单体和/或第二强弱权重单体,对所述第一音频指纹单体和/或所述第二音频指纹单体进行加权。
- 根据权利要求16所述的音频检索识别方法,其中,所述根据每个单独的所述第一音频指纹单体对多个已知音频进行第一排名包括:根据每个单独的所述第一音频指纹单体对多个已知音频进行词频-逆向文件频率TF-IDF排名。
- 根据权利要求17所述的音频检索识别方法,其中,所述根据每个单独的所述第一音频指纹单体对多个已知音频进行词频-逆向文件频率TF-IDF方式的第一排名包括:将所述已知音频的指纹索引与所述第一音频指纹单体进行匹配,以对所述已知音频进行所述TF-IDF排名。
- 根据权利要求19所述的音频检索识别方法,其中,所述预先得到所述已知音频的指纹索引包括:根据所述第二强弱权重单体,预先得到所述已知音频的正排指纹索引和/或倒排指纹索引。
- 根据权利要求19所述的音频检索识别方法,其中,所述将所述已知音频的指纹索引与所述第一音频指纹单体进行匹配包括:根据所述第一强弱权重单体,将所述音频的指纹索引与所述第一音频指纹单体进行绝对匹配。
- 根据权利要求16所述的音频检索识别方法,其中,所述根据多个顺序排列的所述第一音频指纹单体对所述第一候选音频集合进行第二排名包括:根据所述已知音频的指纹索引与所述第一音频指纹得到所述第一候选音频集合中的音频的相似度矩阵,根据所述相似度矩阵对所述第一候选音频集合中的音频进行排名。
- 根据权利要求22所述的音频检索识别方法,其中,所述的根据所述已知音频的指纹索引与所述第一音频指纹得到所述第一候选音频集合中的音频的相似度矩阵,根据所述相似度矩阵对所述第一候选音频集合中的音频进行排名包括:利用所述第一强弱权重单体和/或所述第二强弱权重单体进行加权,得到加权的所述相似度矩阵,根据所述加权的相似度矩阵对所述第一候选音频集合中的音频进行排名。
- 根据权利要求22所述的音频检索识别方法,其中,所述根据所述相似度矩阵对所述第一候选音频集合中的音频进行排名包括:根据所述相似度矩阵中的直线对所述第一候选音频集合中的音频进行排名。
- 根据权利要求22所述的音频检索识别方法,其中:所述获取待识别音频的音频指纹包括,获取所述待识别音频的多种类型的第一音频指纹;所述预先获取所述已知音频的音频指纹作为第二音频指纹包括,获取所述第一候选音频集合中的音频的多种类型的第二音频指纹;所述的根据所述已知音频的指纹索引与所述第一音频指纹得到所述第一候选音频集合中的音频的相似度矩阵包括,根据所述多种类型的第一音频指纹和所述多种类型的第二音频指纹确定所述相似度矩阵。
- 根据权利要求25所述的音频检索识别方法,其中,每种类型的所述第一音频指纹包含多个第一音频指纹单体,每种类型的所述第二音频指纹包含多个第二音频指纹单体;所述的根据所述多种类型的第一音频指纹和所述多种类型的第二音频指纹确定所述相似度矩阵包括:分别确定同种类型的所述第二音频指纹单体与所述第一音频指纹单体之间的单体相似度,以得到多种所述单体相似度;根据所述多种单体相似度的平均值或最小值确定所述相似度矩阵。
- 根据权利要求16所述的音频检索识别方法,还包括:预先对待识别音频和已知音频按照预设的时间长度切片,得到多段待识别子音频和多段已知子音频,对所述多段待识别子音频和所述多段已知子音频分别提取音频指纹,以得到长度相同的多个第一子音频指纹和多个第二子音频指纹。
- 根据权利要求16所述的音频检索识别方法,还包括:在进行所述第一排名之前,对获得的待识别音频的所述第一音频指纹和已知音频的所述第二音频指纹按照预设的长度切片,以得到长度相同的多个第一子音频指纹和多个第二子音频指纹。
- 根据权利要求22所述的音频检索识别方法,其中,所述多个第一音频指纹单体在所述第一音频指纹中按时间顺序排列,所述多个第二音频指纹单体在所述第二音频指纹中按时间顺序排列。
- 根据权利要求29所述的音频检索识别方法,还包括:根据所述相似度矩阵确定所述待识别音频与所述识别结果中的音频的重复片段。
- 一种音频检索识别装置,所述装置包括:音频指纹获取系统,用于获取待识别音频的音频指纹,其中,所述音频指纹包括用于表示所述待识别音频的内容特征的第一部分以及用于表示所述第一部分的可信程度的第二部分;检索识别系统,用于根据所述音频指纹对所述待识别音频进行识别,得到识别结果。
- 根据权利要求31所述的音频检索识别装置,所述装置还包括执行权利要求2到30中任一权利要求所述步骤的模块。
- 一种音频检索识别硬件装置,包括:存储器,用于存储非暂时性计算机可读指令;以及处理器,用于运行所述计算机可读指令,使得所述计算机可读指令被所述处理器执行时实现根据权利要求1到30中任意一项所述的音频检索识别方法。
- 一种计算机可读存储介质,用于存储非暂时性计算机可读指令,当所述非暂时性计算机可读指令由计算机执行时,使得所述计算机执行权利要求1到30中任意一项所述的音频检索识别方法。
- 一种终端设备,包括权利要求31或32所述的一种音频检索识别装置。
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SG11202008548VA SG11202008548VA (en) | 2018-03-29 | 2018-12-29 | Audio Retrieval And Recognition Method And Device |
US16/636,579 US11182426B2 (en) | 2018-03-29 | 2018-12-29 | Audio retrieval and identification method and device |
JP2019572761A JP6906641B2 (ja) | 2018-03-29 | 2018-12-29 | 音声検索・認識方法及び装置 |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810273699.7A CN110322897B (zh) | 2018-03-29 | 2018-03-29 | 一种音频检索识别方法及装置 |
CN201810273699.7 | 2018-03-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019184518A1 true WO2019184518A1 (zh) | 2019-10-03 |
Family
ID=68062454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/125493 WO2019184518A1 (zh) | 2018-03-29 | 2018-12-29 | 一种音频检索识别方法及装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US11182426B2 (zh) |
JP (1) | JP6906641B2 (zh) |
CN (1) | CN110322897B (zh) |
SG (1) | SG11202008548VA (zh) |
WO (1) | WO2019184518A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111489757A (zh) * | 2020-03-26 | 2020-08-04 | 北京达佳互联信息技术有限公司 | 音频处理方法、装置、电子设备及可读存储介质 |
WO2020098816A3 (en) * | 2019-11-29 | 2020-10-15 | Alipay (Hangzhou) Information Technology Co., Ltd. | Methods and devices for storing and managing audio data on blockchain |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569373B (zh) * | 2018-03-29 | 2022-05-13 | 北京字节跳动网络技术有限公司 | 一种媒体特征的比对方法及装置 |
CN111986698B (zh) * | 2019-05-24 | 2023-06-30 | 腾讯科技(深圳)有限公司 | 音频片段的匹配方法、装置、计算机可读介质及电子设备 |
KR20210009596A (ko) * | 2019-07-17 | 2021-01-27 | 엘지전자 주식회사 | 지능적 음성 인식 방법, 음성 인식 장치 및 지능형 컴퓨팅 디바이스 |
CN111460215B (zh) * | 2020-03-30 | 2021-08-24 | 腾讯科技(深圳)有限公司 | 音频数据处理方法、装置、计算机设备以及存储介质 |
KR102380540B1 (ko) * | 2020-09-14 | 2022-04-01 | 네이버 주식회사 | 음원을 검출하기 위한 전자 장치 및 그의 동작 방법 |
CN114020958B (zh) * | 2021-09-26 | 2022-12-06 | 天翼爱音乐文化科技有限公司 | 一种音乐分享方法、设备及存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106940996A (zh) * | 2017-04-24 | 2017-07-11 | 维沃移动通信有限公司 | 一种视频中背景音乐的识别方法和移动终端 |
CN107293307A (zh) * | 2016-03-31 | 2017-10-24 | 阿里巴巴集团控股有限公司 | 音频检测方法及装置 |
US20170309298A1 (en) * | 2016-04-20 | 2017-10-26 | Gracenote, Inc. | Digital fingerprint indexing |
CN107577773A (zh) * | 2017-09-08 | 2018-01-12 | 科大讯飞股份有限公司 | 一种音频匹配方法与装置、电子设备 |
CN107622773A (zh) * | 2017-09-08 | 2018-01-23 | 科大讯飞股份有限公司 | 一种音频特征提取方法与装置、电子设备 |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6990453B2 (en) | 2000-07-31 | 2006-01-24 | Landmark Digital Services Llc | System and methods for recognizing sound and music signals in high noise and distortion |
DE60228202D1 (de) | 2001-02-12 | 2008-09-25 | Gracenote Inc | Verfahren zum erzeugen einer identifikations hash vom inhalt einer multimedia datei |
US8428301B2 (en) * | 2008-08-22 | 2013-04-23 | Dolby Laboratories Licensing Corporation | Content identification and quality monitoring |
US20150254342A1 (en) * | 2011-05-30 | 2015-09-10 | Lei Yu | Video dna (vdna) method and system for multi-dimensional content matching |
EP2751804A1 (en) * | 2011-08-29 | 2014-07-09 | Telefónica, S.A. | A method to generate audio fingerprints |
US9009149B2 (en) * | 2011-12-06 | 2015-04-14 | The Trustees Of Columbia University In The City Of New York | Systems and methods for mobile search using Bag of Hash Bits and boundary reranking |
US8681950B2 (en) * | 2012-03-28 | 2014-03-25 | Interactive Intelligence, Inc. | System and method for fingerprinting datasets |
CN103971689B (zh) * | 2013-02-04 | 2016-01-27 | 腾讯科技(深圳)有限公司 | 一种音频识别方法及装置 |
NL2012567B1 (en) * | 2014-04-04 | 2016-03-08 | Teletrax B V | Method and device for generating improved fingerprints. |
US11289077B2 (en) * | 2014-07-15 | 2022-03-29 | Avaya Inc. | Systems and methods for speech analytics and phrase spotting using phoneme sequences |
CN104142984B (zh) * | 2014-07-18 | 2017-04-05 | 电子科技大学 | 一种基于粗细粒度的视频指纹检索方法 |
US9837101B2 (en) * | 2014-11-25 | 2017-12-05 | Facebook, Inc. | Indexing based on time-variant transforms of an audio signal's spectrogram |
US9740775B2 (en) * | 2015-03-13 | 2017-08-22 | TCL Research America Inc. | Video retrieval based on optimized selected fingerprints |
CN104778276A (zh) * | 2015-04-29 | 2015-07-15 | 北京航空航天大学 | 一种基于改进tf-idf的多索引合并排序算法 |
US20170097992A1 (en) * | 2015-10-02 | 2017-04-06 | Evergig Music S.A.S.U. | Systems and methods for searching, comparing and/or matching digital audio files |
US10236005B2 (en) * | 2017-06-08 | 2019-03-19 | The Nielsen Company (Us), Llc | Methods and apparatus for audio signature generation and matching |
CN107402965B (zh) * | 2017-06-22 | 2020-04-28 | 中国农业大学 | 一种音频检索方法 |
CN107633078B (zh) * | 2017-09-25 | 2019-02-22 | 北京达佳互联信息技术有限公司 | 音频指纹提取方法、音视频检测方法、装置及终端 |
- 2018-03-29 CN CN201810273699.7A patent/CN110322897B/zh active Active
- 2018-12-29 SG SG11202008548VA patent/SG11202008548VA/en unknown
- 2018-12-29 US US16/636,579 patent/US11182426B2/en active Active
- 2018-12-29 WO PCT/CN2018/125493 patent/WO2019184518A1/zh active Application Filing
- 2018-12-29 JP JP2019572761A patent/JP6906641B2/ja active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107293307A (zh) * | 2016-03-31 | 2017-10-24 | 阿里巴巴集团控股有限公司 | 音频检测方法及装置 |
US20170309298A1 (en) * | 2016-04-20 | 2017-10-26 | Gracenote, Inc. | Digital fingerprint indexing |
CN106940996A (zh) * | 2017-04-24 | 2017-07-11 | 维沃移动通信有限公司 | 一种视频中背景音乐的识别方法和移动终端 |
CN107577773A (zh) * | 2017-09-08 | 2018-01-12 | 科大讯飞股份有限公司 | 一种音频匹配方法与装置、电子设备 |
CN107622773A (zh) * | 2017-09-08 | 2018-01-23 | 科大讯飞股份有限公司 | 一种音频特征提取方法与装置、电子设备 |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020098816A3 (en) * | 2019-11-29 | 2020-10-15 | Alipay (Hangzhou) Information Technology Co., Ltd. | Methods and devices for storing and managing audio data on blockchain |
US11120075B2 (en) | 2019-11-29 | 2021-09-14 | Alipay (Hangzhou) Information Technology Co., Ltd. | Methods and devices for storing and managing audio data on blockchain |
US11392638B2 (en) | 2019-11-29 | 2022-07-19 | Alipay (Hangzhou) Information Technology Co., Ltd. | Methods and devices for storing and managing audio data on blockchain |
CN111489757A (zh) * | 2020-03-26 | 2020-08-04 | 北京达佳互联信息技术有限公司 | 音频处理方法、装置、电子设备及可读存储介质 |
CN111489757B (zh) * | 2020-03-26 | 2023-08-18 | 北京达佳互联信息技术有限公司 | 音频处理方法、装置、电子设备及可读存储介质 |
Also Published As
Publication number | Publication date |
---|---|
US11182426B2 (en) | 2021-11-23 |
JP6906641B2 (ja) | 2021-07-21 |
CN110322897A (zh) | 2019-10-11 |
JP2020525856A (ja) | 2020-08-27 |
CN110322897B (zh) | 2021-09-03 |
SG11202008548VA (en) | 2020-10-29 |
US20210165827A1 (en) | 2021-06-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18912503 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2019572761 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18912503 Country of ref document: EP Kind code of ref document: A1 |