Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a long-term audio fingerprint extraction and matching method, which solves the problems of inadequate feature representation and of large computation and storage requirements that arise when fingerprints are matched over audio fragments or whole audio files.
The technical scheme adopted to solve these technical problems is as follows: an audio long-term fingerprint extraction method, the extraction method comprising the following steps:
S1: input an audio signal (PCM) and resample it;
S2: frame, window and DFT (discrete Fourier transform) the resampled audio signal to obtain a frame spectrum;
S3: perform inter-frame smoothing on the frame spectrum to obtain an updated frame spectrum;
S4: extract frame-level short-time features from the updated frame spectrum;
S5: process the frame-level short-time features and extract frame-group long-time features.
Preferably, in step S1, the specific operation of resampling is to take the frequency range 110 Hz to 7 kHz as the analysis band and, according to the Nyquist sampling theorem, set the resampling frequency of the input signal to 16 kHz so as to avoid sampling distortion.
Preferably, in step S2, the specific operations of framing, windowing and DFT transformation are to divide the resampled signal into frames of 4096 samples (256 ms) with 50% overlap; after framing, a Hamming window is applied frame by frame and the DFT is performed to obtain a frame spectrum, i.e. the relation between frequency and energy.
Preferably, in step S3, the specific operation of inter-frame smoothing is to take a weighted average of the spectra of 5 adjacent frames with a sliding window, so as to increase the stability of the spectrum and obtain the updated frame spectrum:

$X'_i(f) = \sum_{j=-2}^{2} w_j X_{i+j}(f)$, with $\sum_{j=-2}^{2} w_j = 1$;

wherein the sliding window is stepped one frame at a time.
Preferably, in step S4, the specific operation steps of frame-level short-time feature extraction are as follows:
A1: divide the frame spectrum into logarithmic frequency-domain sub-bands;
A2: calculate the average spectral energy of each sub-band;
A3: apply L2 normalization to the sub-band spectral energies to obtain the frame-level short-time feature.
Preferably, in A1, since human perception of sound is logarithmic, the frame spectrum is divided into logarithmic frequency-domain sub-bands, i.e. each frequency $f$ in the frame spectrum is converted into the logarithmic frequency $f_{\log} = \log f$; in the logarithmic frequency domain, the target frequency range $[110\,\mathrm{Hz}, 7\,\mathrm{kHz}]$ is divided into 16 sub-bands of equal width.
Preferably, in A2, the sub-band average spectral energy is calculated, i.e. for each audio frame the average spectral energy is computed over the 16 frequency sub-bands, forming a 16-dimensional vector.
Preferably, in A3, L2 normalization of the sub-band spectral energies yields the frame-level short-time feature, i.e. the 16-dimensional vector is L2-normalized; the resulting short-time feature of the audio frame is denoted V.
Preferably, in step S5, the specific operation of frame-group long-time feature extraction is to group a fixed number of consecutive audio frames into a frame group, apply the DFT again to the frame-level short-time features along the time axis, and retain the low-frequency stable components to form the frame-group long-time feature.
An audio long-term fingerprint matching method, the matching method comprising the following steps:
B1: extract frame-group long-time features from the two audio files or fragments to be matched;
B2: perform frame-group-level matching on the two frame-group long-time features and determine the matching relationship.
The technical effects and advantages of the invention are as follows:
1. In the audio long-term fingerprint extraction and matching method provided by the invention, fingerprint extraction proceeds as follows: after resampling, framing, windowing and the DFT, the resulting frame spectrum undergoes inter-frame smoothing; frame-level short-time features are then extracted, capturing the spectral sub-band characteristics of each audio frame; finally, the variation along the time axis is computed to form long-time features. The audio fingerprint is thus extracted rapidly, subsequent matching between two or more sets of audio fingerprints is facilitated, and the shortness and instability of traditional audio fingerprints are overcome.
2. In the audio long-term fingerprint extraction and matching method provided by the invention, fingerprints are extracted from different audio signals; after the frame-group long-time features of the audio are rapidly extracted, the similarity between two or more sets of audio signals is calculated to determine whether they match, and the best similarity and offset are obtained simultaneously during matching.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives and effects of the invention are easy to understand.
As shown in figs. 1 to 5, the audio long-term fingerprint extraction method according to the present invention comprises the following steps:
S1: input an audio signal (PCM) and resample it;
S2: frame, window and DFT (discrete Fourier transform) the resampled audio signal to obtain a frame spectrum;
S3: perform inter-frame smoothing on the frame spectrum to obtain an updated frame spectrum;
S4: extract frame-level short-time features from the updated frame spectrum;
S5: process the frame-level short-time features and extract frame-group long-time features.
As an embodiment of the invention, in step S1, the specific operation of resampling is to take the frequency range 110 Hz to 7 kHz as the analysis band and, according to the Nyquist sampling theorem, set the resampling frequency of the input signal to 16 kHz so as to avoid sampling distortion.
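For illustration only (not part of the claimed method), the resampling of step S1 could be sketched in Python as follows; the helper name resample_to_16k and the use of scipy.signal.resample_poly are assumptions of this sketch:

```python
import numpy as np
from scipy.signal import resample_poly

def resample_to_16k(pcm: np.ndarray, orig_rate: int) -> np.ndarray:
    """Sketch of step S1: resample a PCM signal to 16 kHz.

    16 kHz satisfies the Nyquist criterion for the 110 Hz to 7 kHz
    analysis band, avoiding sampling distortion.
    """
    target_rate = 16000
    g = np.gcd(orig_rate, target_rate)
    # Polyphase resampling with integer up/down factors.
    return resample_poly(pcm, target_rate // g, orig_rate // g)
```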
As an embodiment of the present invention, in step S2, the specific operations of framing, windowing and DFT transformation are to divide the resampled signal into frames of 4096 samples (256 ms) with 50% overlap; after framing, a Hamming window is applied frame by frame and the DFT is performed to obtain a frame spectrum, i.e. the relation between frequency and energy.
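A minimal Python sketch of step S2 follows; using the magnitude spectrum as the energy measure is an assumption, since the text does not specify magnitude versus power:

```python
import numpy as np

def frame_spectra(signal: np.ndarray,
                  frame_len: int = 4096, hop: int = 2048) -> np.ndarray:
    """Sketch of step S2: 4096-sample frames, 50% overlap, Hamming
    window, DFT; returns one spectrum per frame."""
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((num_frames, frame_len // 2 + 1))
    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))  # frequency vs. energy
    return spectra
```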
As an embodiment of the present invention, in step S3, the specific operation of inter-frame smoothing is to take a weighted average of the spectra of 5 adjacent frames with a sliding window, which increases the stability of the spectrum, and obtain the updated frame spectrum:

$X'_i(f) = \sum_{j=-2}^{2} w_j X_{i+j}(f)$, with $\sum_{j=-2}^{2} w_j = 1$;

wherein the sliding window is stepped one frame at a time.
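The smoothing of step S3 might look as follows in Python; the symmetric weights are a placeholder assumption, since the source does not give their values:

```python
import numpy as np

def smooth_spectra(spectra: np.ndarray,
                   weights=(0.1, 0.2, 0.4, 0.2, 0.1)) -> np.ndarray:
    """Sketch of step S3: weighted average over 5 adjacent frame
    spectra with a sliding window stepped one frame at a time."""
    w = np.asarray(weights)          # assumed weights, summing to 1
    padded = np.pad(spectra, ((2, 2), (0, 0)), mode="edge")
    smoothed = np.zeros_like(spectra)
    for j in range(5):
        smoothed += w[j] * padded[j : j + len(spectra)]
    return smoothed
```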
In step S4, the specific operation steps of frame-level short-time feature extraction are as follows:
A1: divide the frame spectrum into logarithmic frequency-domain sub-bands;
A2: calculate the average spectral energy of each sub-band;
A3: apply L2 normalization to the sub-band spectral energies to obtain the frame-level short-time feature.
In an embodiment of the present invention, in A1, since human perception of sound is logarithmic, the frame spectrum is divided into logarithmic frequency-domain sub-bands, i.e. each frequency $f$ in the frame spectrum is converted into the logarithmic frequency $f_{\log} = \log f$; in the logarithmic frequency domain, the target frequency range $[110\,\mathrm{Hz}, 7\,\mathrm{kHz}]$ is divided into 16 sub-bands of equal width.
As an embodiment of the present invention, in A2, the sub-band average spectral energy is calculated, i.e. for each audio frame the average spectral energy is computed over the 16 frequency sub-bands, thereby forming a 16-dimensional vector.
In an embodiment of the present invention, in A3, L2 normalization of the sub-band spectral energies yields the frame-level short-time feature, i.e. the 16-dimensional vector is L2-normalized; the resulting short-time feature of the audio frame is denoted V.
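Steps A1 to A3 might be sketched together in Python as follows; the logarithmically spaced band edges follow the equal-width-in-log-domain division described above, while the function and parameter names are illustrative:

```python
import numpy as np

def frame_short_time_features(spectra: np.ndarray,
                              sample_rate: int = 16000,
                              f_lo: float = 110.0, f_hi: float = 7000.0,
                              num_bands: int = 16) -> np.ndarray:
    """Sketch of step S4 (A1-A3): per-frame 16-dim features V."""
    freqs = np.linspace(0, sample_rate / 2, spectra.shape[1])
    # A1: band edges equally spaced in the logarithmic frequency domain.
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), num_bands + 1)
    feats = np.zeros((len(spectra), num_bands))
    for b in range(num_bands):
        band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        # A2: average spectral energy within each sub-band.
        feats[:, b] = spectra[:, band].mean(axis=1)
    # A3: L2-normalize each frame's 16-dimensional vector.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)
```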
In step S5, the specific operation of frame-group long-time feature extraction is to group a fixed number of consecutive audio frames into a frame group, apply the DFT again to the frame-level short-time features along the time axis, and retain the low-frequency stable components to form the frame-group long-time feature.
Specifically, the flow of frame-group long-time feature extraction is as follows:
C1: with consecutive T audio frames (e.g., t=32, i.e., frame group duration 4096 ms) forming a frame group, the frame group is characterized as: v 0, V1, …, VT-1, where V is the frame-level short-term feature (16-dimensional vector).
C2: the DFT is applied to each dimension $[V_{0,d}, V_{1,d}, \ldots, V_{T-1,d}]$, $d \in [0, 15]$, of the frame-group feature.
C3: the first m levels of coefficients after the DFT are retained (e.g. m = 12): $A_0, A_1^c, A_1^s, \ldots, A_m^c, A_m^s$; wherein $A_0$ is the first (direct-current) coefficient, $A_1^c, \ldots, A_m^c$ are the cosine-term coefficients, and $A_1^s, \ldots, A_m^s$ are the sine-term coefficients.
C4: the 16-dimensional m-level DFT coefficients are assembled into a new feature $[A_0, A_1^c, A_1^s, \ldots, A_m^c, A_m^s]$; the feature size is then $16 \times (2m + 1)$.
C5: each of the 2m+1 DFT coefficient vectors is L2-normalized.
C6: each of the 2m+1 DFT coefficient vectors is multiplied by a weight, where the weight is computed from a Bessel function B and the hyperbolic sine function sinh; the result of this flow is the frame-group long-time feature $[\tilde{A}_0, \tilde{A}_1^c, \tilde{A}_1^s, \ldots, \tilde{A}_m^c, \tilde{A}_m^s]$.
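A Python sketch of steps C1 to C6 follows. Because the exact Bessel/sinh weight formula is not recoverable from the text, the per-level weights are taken as a parameter with a uniform placeholder default; the rest follows the flow as described:

```python
import numpy as np

def frame_group_feature(short_feats: np.ndarray, m: int = 12,
                        weights: np.ndarray = None) -> np.ndarray:
    """Sketch of steps C1-C6: long-time feature of one frame group.

    short_feats: (T, 16) frame-level short-time features V_0..V_{T-1}.
    Returns a (2m + 1, 16) array [A0, A1c, A1s, ..., Amc, Ams].
    """
    # C2: DFT along the time axis, per feature dimension d.
    coeffs = np.fft.rfft(short_feats, axis=0)   # complex, (T//2+1, 16)
    # C3/C4: keep the DC level plus m cosine/sine coefficient pairs.
    rows = [coeffs[0].real]
    for k in range(1, m + 1):
        rows.append(coeffs[k].real)    # cosine-term coefficient A_k^c
        rows.append(-coeffs[k].imag)   # sine-term coefficient A_k^s
    feat = np.stack(rows)              # (2m + 1, 16)
    # C5: L2-normalize each of the 2m+1 coefficient vectors.
    norms = np.linalg.norm(feat, axis=1, keepdims=True)
    feat = feat / np.maximum(norms, 1e-12)
    # C6: per-level weights (uniform placeholder; the source derives
    # them from a Bessel function and sinh, formula not given here).
    if weights is None:
        weights = np.ones(2 * m + 1)
    return feat * weights[:, None]
```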
An audio long-term fingerprint matching method, the matching method comprising the following steps:
B1: extract frame-group long-time features from the two audio files or fragments to be matched;
B2: perform frame-group-level matching on the two frame-group long-time features and determine the matching relationship.
Specifically, the flow of frame-group-level matching is as follows:
d1: two frame group long time features are respectively marked as [ [ ,/>,/>,…,/>,/>And [/>),,/>,…,/>,/>The possible time offset (i.e., the number of audio frames) between two frame groups is denoted t.
D2: the similarity s of two frame groups at time offset t is calculated according to the following formula:
;
Wherein: t is the number of audio frames in the frame group; /(I) Representing vectors/>Sum vector/>Is a product of the inner product of (a).
D3: according to step D2, for all possible offsets T, T E [ - (T-1), (T-1) ], calculating the corresponding similarity s, and counting the largest value s best of all s, the obtained s best is the best similarity in the two frame groups, and the corresponding T best is the best similarity offset.
D4: and setting a similarity threshold s thrd according to the application requirement, and if the optimal similarity obtained in the step D3 is greater than or equal to s thrd, considering that the two frame groups are matched, otherwise, considering that the two frame groups are not matched.
The invention overcomes the shortness and instability of traditional audio fingerprints: it extracts the spectral sub-band features of each audio frame, then computes their variation along the time axis to form long-time features, and obtains the best similarity and offset simultaneously during matching. These points distinguish the method from prior art of the same kind.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate its principles, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.