Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a long-term audio fingerprint extraction and matching method, which solves the problems of inadequate feature representation and of large computation and storage requirements that arise when fingerprints are matched over audio fragments or whole audio files.
The technical scheme adopted to solve these technical problems is as follows: an audio long-term fingerprint extraction method, the extraction method comprising the following steps:
S1: input an audio signal (PCM) and resample it;
S2: frame, window and DFT (discrete Fourier transform) the resampled audio signal to obtain a frame spectrum;
S3: perform inter-frame smoothing on the frame spectrum to obtain an updated frame spectrum;
S4: extract frame-level short-time features from the updated frame spectrum;
S5: process the frame-level short-time features and extract frame-group long-time features.
Preferably, in step S1, the specific operation of resampling is to take the frequency range 110 Hz to 7 kHz as the analysis band and, according to the Nyquist sampling theorem, set the resampling frequency of the input signal to 16 kHz so as to avoid sampling distortion.
Preferably, in step S2, the specific operations of framing, windowing and DFT transformation are to divide the resampled signal into frames of 4096 samples (256 ms) with 50% overlap; after framing, a Hamming window is applied frame by frame and the DFT is performed to obtain a frame spectrum, i.e. the relation between frequency and energy.
Preferably, in step S3, the specific operation of inter-frame smoothing is to take a weighted average of the spectra of 5 adjacent frames with a sliding window, so as to increase the stability of the spectrum and obtain the updated frame spectrum:

$X'_i(f) = \sum_{j=-2}^{2} w_j X_{i+j}(f)$, with $\sum_{j=-2}^{2} w_j = 1$;

wherein the sliding window is stepped one frame at a time.
Preferably, in step S4, the specific operation steps of frame-level short-time feature extraction are as follows:
A1: divide the frame spectrum into logarithmic frequency-domain sub-bands;
A2: calculate the average spectral energy of each sub-band;
A3: apply L2 normalization to the sub-band spectral energies to obtain the frame-level short-time feature.
Preferably, in A1, since human perception of sound is logarithmic, the frame spectrum is divided into logarithmic frequency-domain sub-bands, i.e. each frequency $f$ in the frame spectrum is converted into the logarithmic frequency $f_{\log} = \log f$; in the logarithmic frequency domain, the target frequency range $[110\,\mathrm{Hz}, 7\,\mathrm{kHz}]$ is divided into 16 sub-bands of equal width.
Preferably, in A2, the sub-band average spectral energy is calculated, i.e. for each audio frame the average spectral energy is computed over the 16 frequency sub-bands, forming a 16-dimensional vector.
Preferably, in A3, L2 normalization of the sub-band spectral energies yields the frame-level short-time feature, i.e. the 16-dimensional vector is L2-normalized; the resulting short-time feature of the audio frame is denoted V.
Preferably, in step S5, the specific operation of frame-group long-time feature extraction is to group a fixed number of consecutive audio frames into a frame group, apply the DFT again to the frame-level short-time features along the time axis, and retain the low-frequency stable components to form the frame-group long-time feature.
An audio long-term fingerprint matching method, the matching method comprising the following steps:
B1: extract frame-group long-time features from the two audio files or fragments to be matched;
B2: perform frame-group-level matching on the two frame-group long-time features and determine the matching relationship.
The technical effects and advantages of the invention are as follows:
1. In the audio long-term fingerprint extraction and matching method provided by the invention, fingerprint extraction proceeds as follows: after resampling, framing, windowing and the DFT, the resulting frame spectrum undergoes inter-frame smoothing; frame-level short-time features are then extracted, capturing the spectral sub-band characteristics of each audio frame; finally, the variation along the time axis is computed to form long-time features. The audio fingerprint is thus extracted rapidly, subsequent matching between two or more sets of audio fingerprints is facilitated, and the shortness and instability of traditional audio fingerprints are overcome.
2. In the audio long-term fingerprint extraction and matching method provided by the invention, fingerprints are extracted from different audio signals; after the frame-group long-time features of the audio are rapidly extracted, the similarity between two or more sets of audio signals is calculated to determine whether they match, and the best similarity and offset are obtained simultaneously during matching.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives and effects of the invention are easy to understand.
As shown in figs. 1 to 5, the audio long-term fingerprint extraction method according to the present invention comprises the following steps:
S1: input an audio signal (PCM) and resample it;
S2: frame, window and DFT (discrete Fourier transform) the resampled audio signal to obtain a frame spectrum;
S3: perform inter-frame smoothing on the frame spectrum to obtain an updated frame spectrum;
S4: extract frame-level short-time features from the updated frame spectrum;
S5: process the frame-level short-time features and extract frame-group long-time features.
As an embodiment of the invention, in step S1, the specific operation of resampling is to take the frequency range 110 Hz to 7 kHz as the analysis band and, according to the Nyquist sampling theorem, set the resampling frequency of the input signal to 16 kHz so as to avoid sampling distortion.
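For illustration only (not part of the claimed method), the resampling of step S1 could be sketched in Python as follows; the helper name resample_to_16k and the use of scipy.signal.resample_poly are assumptions of this sketch:

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(pcm: np.ndarray, orig_rate: int) -> np.ndarray:
    """Sketch of step S1: resample a PCM signal to 16 kHz.

    16 kHz satisfies the Nyquist criterion for the 110 Hz to 7 kHz
    analysis band, avoiding sampling distortion.
    """
    target_rate = 16000
    g = np.gcd(orig_rate, target_rate)
    # Polyphase resampling with integer up/down factors.
    return resample_poly(pcm, target_rate // g, orig_rate // g)
```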
As an embodiment of the present invention, in step S2, the specific operations of framing, windowing and DFT transformation are to divide the resampled signal into frames of 4096 samples (256 ms) with 50% overlap; after framing, a Hamming window is applied frame by frame and the DFT is performed to obtain a frame spectrum, i.e. the relation between frequency and energy.
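A minimal Python sketch of step S2 follows; using the magnitude spectrum as the energy measure is an assumption, since the text does not specify magnitude versus power:

```python
import numpy as np

def frame_spectra(signal: np.ndarray,
                  frame_len: int = 4096, hop: int = 2048) -> np.ndarray:
    """Sketch of step S2: 4096-sample frames, 50% overlap, Hamming
    window, DFT; returns one spectrum per frame."""
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((num_frames, frame_len // 2 + 1))
    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))  # frequency vs. energy
    return spectra
```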
As an embodiment of the present invention, in step S3, the specific operation of inter-frame smoothing is to take a weighted average of the spectra of 5 adjacent frames with a sliding window, which increases the stability of the spectrum, and obtain the updated frame spectrum:

$X'_i(f) = \sum_{j=-2}^{2} w_j X_{i+j}(f)$, with $\sum_{j=-2}^{2} w_j = 1$;

wherein the sliding window is stepped one frame at a time.
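The smoothing of step S3 might look as follows in Python; the symmetric weights are a placeholder assumption, since the source does not give their values:

```python
import numpy as np

def smooth_spectra(spectra: np.ndarray,
                   weights=(0.1, 0.2, 0.4, 0.2, 0.1)) -> np.ndarray:
    """Sketch of step S3: weighted average over 5 adjacent frame
    spectra with a sliding window stepped one frame at a time."""
    w = np.asarray(weights)          # assumed weights, summing to 1
    padded = np.pad(spectra, ((2, 2), (0, 0)), mode="edge")
    smoothed = np.zeros_like(spectra)
    for j in range(5):
        smoothed += w[j] * padded[j : j + len(spectra)]
    return smoothed
```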
In step S4, the specific operation steps of frame-level short-time feature extraction are as follows:
A1: divide the frame spectrum into logarithmic frequency-domain sub-bands;
A2: calculate the average spectral energy of each sub-band;
A3: apply L2 normalization to the sub-band spectral energies to obtain the frame-level short-time feature.
In an embodiment of the present invention, in A1, since human perception of sound is logarithmic, the frame spectrum is divided into logarithmic frequency-domain sub-bands, i.e. each frequency $f$ in the frame spectrum is converted into the logarithmic frequency $f_{\log} = \log f$; in the logarithmic frequency domain, the target frequency range $[110\,\mathrm{Hz}, 7\,\mathrm{kHz}]$ is divided into 16 sub-bands of equal width.
As an embodiment of the present invention, in A2, the sub-band average spectral energy is calculated, i.e. for each audio frame the average spectral energy is computed over the 16 frequency sub-bands, thereby forming a 16-dimensional vector.
In an embodiment of the present invention, in A3, L2 normalization of the sub-band spectral energies yields the frame-level short-time feature, i.e. the 16-dimensional vector is L2-normalized; the resulting short-time feature of the audio frame is denoted V.
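Steps A1 to A3 might be sketched together in Python as follows; the logarithmically spaced band edges follow the equal-width-in-log-domain division described above, while the function and parameter names are illustrative:

```python
import numpy as np

def frame_short_time_features(spectra: np.ndarray,
                              sample_rate: int = 16000,
                              f_lo: float = 110.0, f_hi: float = 7000.0,
                              num_bands: int = 16) -> np.ndarray:
    """Sketch of step S4 (A1-A3): per-frame 16-dim features V."""
    freqs = np.linspace(0, sample_rate / 2, spectra.shape[1])
    # A1: band edges equally spaced in the logarithmic frequency domain.
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), num_bands + 1)
    feats = np.zeros((len(spectra), num_bands))
    for b in range(num_bands):
        band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        # A2: average spectral energy within each sub-band.
        feats[:, b] = spectra[:, band].mean(axis=1)
    # A3: L2-normalize each frame's 16-dimensional vector.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)
```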
In step S5, the specific operation of frame-group long-time feature extraction is to group a fixed number of consecutive audio frames into a frame group, apply the DFT again to the frame-level short-time features along the time axis, and retain the low-frequency stable components to form the frame-group long-time feature.
Specifically, the flow of frame-group long-time feature extraction is as follows:
C1: with consecutive T audio frames (e.g., t=32, i.e., frame group duration 4096 ms) forming a frame group, the frame group is characterized as: v 0, V1, …, VT-1, where V is the frame-level short-term feature (16-dimensional vector).
C2: the DFT is applied to each dimension $[V_{0,d}, V_{1,d}, \ldots, V_{T-1,d}]$, $d \in [0, 15]$, of the frame-group feature.
C3: the first m levels of coefficients after the DFT are retained (e.g. m = 12): $A_0, A_1^c, A_1^s, \ldots, A_m^c, A_m^s$; wherein $A_0$ is the first (direct-current) coefficient, $A_1^c, \ldots, A_m^c$ are the cosine-term coefficients, and $A_1^s, \ldots, A_m^s$ are the sine-term coefficients.
C4: the 16-dimensional m-level DFT coefficients are assembled into a new feature $[A_0, A_1^c, A_1^s, \ldots, A_m^c, A_m^s]$; the feature size is then $16 \times (2m + 1)$.
C5: each of the 2m+1 DFT coefficient vectors is L2-normalized.
C6: each of the 2m+1 DFT coefficient vectors is multiplied by a weight, where the weight is computed from a Bessel function B and the hyperbolic sine function sinh; the result of this flow is the frame-group long-time feature $[\tilde{A}_0, \tilde{A}_1^c, \tilde{A}_1^s, \ldots, \tilde{A}_m^c, \tilde{A}_m^s]$.
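A Python sketch of steps C1 to C6 follows. Because the exact Bessel/sinh weight formula is not recoverable from the text, the per-level weights are taken as a parameter with a uniform placeholder default; the rest follows the flow as described:

```python
import numpy as np

def frame_group_feature(short_feats: np.ndarray, m: int = 12,
                        weights: np.ndarray = None) -> np.ndarray:
    """Sketch of steps C1-C6: long-time feature of one frame group.

    short_feats: (T, 16) frame-level short-time features V_0..V_{T-1}.
    Returns a (2m + 1, 16) array [A0, A1c, A1s, ..., Amc, Ams].
    """
    # C2: DFT along the time axis, per feature dimension d.
    coeffs = np.fft.rfft(short_feats, axis=0)   # complex, (T//2+1, 16)
    # C3/C4: keep the DC level plus m cosine/sine coefficient pairs.
    rows = [coeffs[0].real]
    for k in range(1, m + 1):
        rows.append(coeffs[k].real)    # cosine-term coefficient A_k^c
        rows.append(-coeffs[k].imag)   # sine-term coefficient A_k^s
    feat = np.stack(rows)              # (2m + 1, 16)
    # C5: L2-normalize each of the 2m+1 coefficient vectors.
    norms = np.linalg.norm(feat, axis=1, keepdims=True)
    feat = feat / np.maximum(norms, 1e-12)
    # C6: per-level weights (uniform placeholder; the source derives
    # them from a Bessel function and sinh, formula not given here).
    if weights is None:
        weights = np.ones(2 * m + 1)
    return feat * weights[:, None]
```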
An audio long-term fingerprint matching method, the matching method comprising the following steps:
B1: extract frame-group long-time features from the two audio files or fragments to be matched;
B2: perform frame-group-level matching on the two frame-group long-time features and determine the matching relationship.
Specifically, the flow of frame-group-level matching is as follows:
d1: two frame group long time features are respectively marked as [ [ ,/>,/>,…,/>,/>And [/>),,/>,…,/>,/>The possible time offset (i.e., the number of audio frames) between two frame groups is denoted t.
D2: the similarity s of two frame groups at time offset t is calculated according to the following formula:
;
Wherein: t is the number of audio frames in the frame group; /(I) Representing vectors/>Sum vector/>Is a product of the inner product of (a).
D3: according to step D2, for all possible offsets T, T E [ - (T-1), (T-1) ], calculating the corresponding similarity s, and counting the largest value s best of all s, the obtained s best is the best similarity in the two frame groups, and the corresponding T best is the best similarity offset.
D4: and setting a similarity threshold s thrd according to the application requirement, and if the optimal similarity obtained in the step D3 is greater than or equal to s thrd, considering that the two frame groups are matched, otherwise, considering that the two frame groups are not matched.
The invention overcomes the shortness and instability of traditional audio fingerprints: it extracts the spectral sub-band features of each audio frame, then computes their variation along the time axis to form long-time features, and obtains the best similarity and offset simultaneously during matching. These points distinguish the method from prior art of the same kind.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.