CN113157967A

CN113157967A - Audio retrieval method and device

Info

Publication number: CN113157967A
Application number: CN202110420849.4A
Authority: CN
Inventors: 张鹏远; 陈树丽; 张学帅; 颜永红
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-07-23

Abstract

The application provides an audio retrieval method and device. The method comprises the following steps: according to the fingerprint sequence of the audio signal to be retrieved, obtaining similar fingerprint sequences from an audio fingerprint library by using a hash index; then dividing the fingerprint sequence and each similar fingerprint sequence into the same number of fingerprint segments in the frequency domain dimension; and comparing each fingerprint segment of each similar fingerprint sequence with the corresponding fingerprint segment of the fingerprint sequence, so as to determine whether the audio signal is successfully retrieved from the audio fingerprint library. The scheme solves the problem that an audio signal is mistakenly reported as successfully retrieved when background music interference is present, and improves the accuracy of detecting the audio signal.

Description

Audio retrieval method and device

Technical Field

The present application relates to the field of audio retrieval technologies, and in particular, to an audio retrieval method and apparatus.

Background

With the rapid development of computer technology and related network technology, the volume of multimedia data is growing rapidly. The internet contains rich and varied data such as text, sound, pictures and video, but it also carries a large amount of bad audio information, and how to discover and filter such information quickly has attracted the attention of researchers at home and abroad. Many audio sample retrieval methods are implemented with audio fingerprints. The audio fingerprint used in audio retrieval differs from the audio fingerprint used in digital watermarking: the latter is additional information that is artificially designed and produced by applying communication theory, and is mainly used for identifying and tracking pirated copies.

The audio fingerprints in audio retrieval are representations of the important acoustic features of the original audio data. An audio fingerprint is a digital digest extracted from an audio signal by a specific algorithm; it serves as a compact digital signature of the important acoustic features of a section of audio. Rather than directly comparing the very large audio data itself, audio fingerprint retrieval compares the corresponding, much smaller digital fingerprints. Using the audio fingerprint as the index of the audio metadata therefore greatly reduces the search volume of audio retrieval and remarkably improves the efficiency of audio matching.

However, in practical application environments, the template of the bad information often lacks a few words or is added with background music interference, so that the pieces of the bad information cannot be detected in real time.

Disclosure of Invention

The embodiment of the application provides an audio retrieval method in which a fingerprint sequence is divided into a plurality of segments in the frequency domain before fingerprint comparison is carried out, so that the problem in the prior art that audio with background interference is mistakenly reported as successfully retrieved can be solved.

In a first aspect, an embodiment of the present application provides an audio retrieval method, where the method includes: acquiring a fingerprint sequence of an audio signal to be retrieved; according to the fingerprint sequence, a similar fingerprint sequence of the audio signal is obtained in an audio fingerprint database by adopting a Hash index; dividing the fingerprint sequence and each similar fingerprint sequence into a plurality of sections of fingerprints with the same quantity in a frequency domain dimension; and determining that the audio signal is successfully retrieved according to the difference between each section of fingerprint in each similar fingerprint sequence and each section of fingerprint in the fingerprint sequence.

According to the scheme, segmentation processing is carried out in the frequency domain dimension before the difference comparison between the fingerprint sequence and the similar fingerprint sequence, and the differences between corresponding segments are then compared to determine whether the audio signal is successfully retrieved, so that the problem that the audio signal is mistakenly reported as successfully retrieved because the overall difference is low is solved.

In a possible implementation, the fingerprint sequence includes a plurality of fingerprint vectors, and the obtaining, according to the fingerprint sequence, a similar fingerprint sequence of the audio signal in an audio fingerprint library using hash index includes:

respectively converting the plurality of fingerprint vectors into hash values;

and according to the hash value corresponding to each fingerprint vector, performing index query in a hash table corresponding to the audio fingerprint library to obtain a similar fingerprint sequence of the audio signal.

In a possible implementation manner, the determining that the audio signal is successfully retrieved according to the differences between the fingerprints of the segments in the similar fingerprint sequences and the fingerprints of the segments in the fingerprint sequences includes:

calculating the difference bit number of each segment of fingerprint in each similar fingerprint sequence according to each segment of fingerprint in the fingerprint sequence;

determining the error rate of each segment of fingerprint in each similar fingerprint sequence according to the ratio of the difference bit number of each segment of fingerprint in each similar fingerprint sequence to the total number of bits of each segment of fingerprint in each similar fingerprint sequence;

and determining that the audio signal is successfully retrieved according to the error rate of each section of fingerprint in each similar fingerprint sequence.

In a possible implementation manner, the determining that the audio signal is successfully retrieved according to the error rate of each fingerprint in the similar fingerprint sequences includes:

and determining that the audio signal sample corresponding to each similar fingerprint sequence contains the audio signal according to the error rate of each fingerprint in each similar fingerprint sequence and a first error rate threshold, wherein the audio signal is successfully retrieved.

In a possible implementation manner, the determining that the audio signal is successfully retrieved according to the error rate of each fingerprint in the similar fingerprint sequences further includes:

obtaining the total error rate of each similar fingerprint sequence according to the error rate of each section of fingerprint in each similar fingerprint sequence;

and determining that the audio signal sample corresponding to each similar fingerprint sequence contains the audio signal according to the total error rate and a second error rate threshold of each similar fingerprint sequence, wherein the audio signal is successfully retrieved.

In a possible implementation, the obtaining of the fingerprint sequence of the audio signal to be retrieved includes:

acquiring a frequency domain signal corresponding to each frame signal in the audio signal;

sub-band division is carried out on the frequency domain signals corresponding to the frame signals, and sub-band energy difference vectors corresponding to the frequency domain signals are obtained; the energy difference vector comprises: the energy difference between adjacent subbands;

performing binary quantization on the sub-band energy difference vector corresponding to each frame signal to obtain a fingerprint vector of each frame signal;

and arranging the fingerprint vectors of the frame signals according to the time sequence to obtain the fingerprint sequence of the audio signal.

In a possible implementation manner, before the obtaining of the frequency domain signal corresponding to each frame signal in the audio signal, the method further includes:

and performing framing processing on the audio signal by adopting a Hanning window, and performing pre-emphasis processing on each frame signal after the framing processing.

In a possible implementation manner, before performing binary quantization on the subband energy difference vector corresponding to each frame signal, the method further includes:

filtering the sub-band energy difference vector corresponding to the current frame signal by adopting the sub-band energy difference vector corresponding to the previous frame signal; and when the current frame signal is a first frame signal, filtering the sub-band energy difference vector corresponding to the current frame signal by adopting a preset standard sub-band energy difference vector.

In a second aspect, an embodiment of the present application provides an audio retrieval apparatus, including: the acquisition module is used for acquiring a fingerprint sequence of the audio signal to be retrieved; the matching module is used for obtaining a similar fingerprint sequence of the audio signal in an audio fingerprint database by adopting a Hash index according to the fingerprint sequence; the segmentation module is used for dividing the fingerprint sequence and each similar fingerprint sequence into a plurality of sections of fingerprints with the same quantity in a frequency domain dimension; and the comparison module is used for determining that the audio signal is successfully retrieved according to the difference between each section of fingerprint in each similar fingerprint sequence and each section of fingerprint in the fingerprint sequence.

In a possible implementation, the fingerprint sequence includes a plurality of fingerprint vectors, and the matching module is specifically configured to:

respectively converting the plurality of fingerprint vectors into hash values;

In a possible implementation manner, the comparison module is specifically configured to:

In a possible implementation manner, the acquisition module is specifically configured to:

In a possible implementation manner, before the obtaining the frequency domain signal corresponding to each frame signal in the audio signal, the acquisition module is further configured to:

In a possible implementation manner, before performing binary quantization on the sub-band energy difference vector corresponding to each frame signal, the acquisition module is further configured to:

In a third aspect, the present application further provides a computing device comprising: a memory and a processor; the memory stores computer instructions that, when executed by the processor, implement the method of any of claims 1-8.

Drawings

FIG. 1 is a schematic diagram of fingerprint blocks of two audio files in an application scenario provided by the present application;

FIG. 2 is a flowchart of an audio retrieval method provided in an embodiment of the present application;

FIG. 3 is a flowchart of acquiring a fingerprint sequence of an audio signal according to an embodiment of the present application;

FIG. 4 is a schematic diagram of frequency domain segmentation of a fingerprint sequence according to an embodiment of the present application;

FIG. 5 is a schematic diagram of comparing a fingerprint sequence with a similar fingerprint sequence to determine whether the audio signal is detected, according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an audio retrieval apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.

In the description of the embodiments of the present application, the words "exemplary," "for example," or "for instance" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary," "for example," or "for instance" is not to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the words "exemplary," "for example," or "for instance" is intended to present relevant concepts in a concrete fashion.

In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, B exists alone, or A and B exist at the same time. In addition, the term "plurality" means two or more unless otherwise specified. For example, a plurality of systems refers to two or more systems, and a plurality of screen terminals refers to two or more screen terminals.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implicitly indicating the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

FIG. 1 shows an application scenario provided in the present application. In FIG. 1, (a) and (b) represent fingerprint blocks of two audio files having the same background music, and (c) is a fingerprint difference diagram of (a) and (b). Here (a) is the audio sample and (b) is the audio to be retrieved; the content of (a) is a recitation of "meditation at night" and the content of (b) is a recitation of "spring dawn". Obviously, in this scenario the fingerprint block shown in (b) should not be retrieved by (a). However, since the two files have the same background music, the overall error rate may still satisfy the threshold condition, so that (b) is mistakenly reported as successfully retrieved.

To this end, the embodiment of the present application provides an audio retrieval method. As shown in the flowchart of FIG. 2, the method includes steps S1 to S4.

In step S1, a fingerprint sequence of the audio signal to be retrieved is acquired.

In this embodiment, since the human ear is most sensitive to sound at frequencies of 3000 to 4000Hz, in order to match the characteristics of the human ear and save computing resources, the audio signal to be retrieved is down-sampled to 5000Hz when it is acquired. This step is specifically illustrated as step S101 to step S107 in fig. 3.

In step S101, a frame division process is performed on an audio signal to be retrieved, and a plurality of frame signals are obtained.

In this embodiment, a Hamming window is used to perform framing processing on the original audio signal; the frame length is set to 0.4 s (seconds), the frame overlap is set to 0.0256 s, and the Hamming window weighting is shown in formula (1).

In formula (1), w(n) is the weight applied to the value of the nth sampling point in the audio signal; N is the length of the Hanning window, that is, the number of sampling points in one frame of the audio signal.
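The framing of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's own code: formula (1) is not reproduced in this text and the description names both a Hamming and a Hanning window, so the standard Hamming form w(n) = 0.54 - 0.46 cos(2πn/(N-1)) is assumed here; the function names `hamming_window` and `split_into_frames` and the `hop` parameter are hypothetical.

```python
import math


def hamming_window(length):
    """Standard Hamming weights (assumed form, formula (1) is not shown)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def split_into_frames(samples, frame_len, hop):
    """Cut `samples` into frames of `frame_len` points, advancing by `hop`
    points each time, and weight every frame with the window."""
    window = hamming_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([x * w for x, w in zip(frame, window)])
    return frames


# Tiny demonstration: 10 samples, 4-point frames, 2-point hop.
frames = split_into_frames([1.0] * 10, frame_len=4, hop=2)
print(len(frames), [round(v, 2) for v in frames[0]])
```

At the 5000Hz sampling rate of this embodiment, a 0.4 s frame corresponds to 2000 sampling points and 0.0256 s corresponds to 128 sampling points; whether the 0.0256 s figure is passed as the hop is a reading of the text, not something the description states.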

In step S102, each frame of signal is pre-emphasized to obtain a pre-emphasized output signal, and then a discrete fourier transform is performed to obtain a frequency domain signal corresponding to each frame of signal.

In this embodiment, first, a first-order FIR high-pass filter is used to implement pre-emphasis, and a specific difference equation is shown in formula (2); then fourier transform is performed using equation (3).

y_i[n] = x_i(n) - a·x_i(n-1)    (2)

In formula (2), x_i(n) is the signal of the ith frame before pre-emphasis, a is the pre-emphasis coefficient, and y_i[n] is the signal of the ith frame after pre-emphasis.

In formula (3), Y_i(k) is the frequency domain signal corresponding to y_i[n], and L is the number of points of the Fourier transform.
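Step S102 can be sketched as follows, as a minimal illustration only. The pre-emphasis follows formula (2); the coefficient value 0.97 is an assumed default, since the description does not give a value for a, and the sample before the start of a frame is assumed to be zero. Formula (3) is not reproduced in this text, so the textbook L-point discrete Fourier transform is assumed. The function names are hypothetical.

```python
import cmath


def pre_emphasize(frame, a=0.97):
    """First-order FIR high-pass: y[n] = x[n] - a * x[n-1].
    x[-1] is taken as 0 (assumption); a = 0.97 is an assumed default."""
    return [frame[n] - a * (frame[n - 1] if n > 0 else 0.0)
            for n in range(len(frame))]


def dft(signal):
    """Naive O(L^2) discrete Fourier transform of an L-point signal."""
    L = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / L)
                for n in range(L))
            for k in range(L)]


emphasized = pre_emphasize([1.0, 1.0, 1.0], a=0.5)
spectrum = dft([1.0, 1.0, 1.0, 1.0])
print(emphasized, [round(abs(v), 6) for v in spectrum])
```

A production implementation would normally use a fast Fourier transform; the naive form is used here only to keep the sketch self-contained.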

In step S103, sub-band division is performed on the frequency domain signal corresponding to each frame of signal, and an energy difference between adjacent sub-bands in each frequency domain signal is calculated, so as to obtain a sub-band energy difference vector corresponding to each frame of signal.

In this embodiment, the sub-bands are divided between 200Hz and 3000Hz of the frequency domain signal of each frame signal, and the total number of sub-bands is 33. The frame index is denoted by i, 1 ≤ i ≤ K, where K denotes the maximum frame number; the sub-band index is denoted by m, 1 ≤ m ≤ M, where M denotes the total number of sub-bands, and M is 33 here. The starting frequency of the mth sub-band, i.e., the ending frequency of the (m-1)th sub-band, is calculated as shown in formula (4).

In formula (4), F_min is the lower limit of the sub-band frequency, which is 200Hz in this embodiment; F_max is the upper limit of the sub-band frequency, which is 3000Hz in this embodiment.

After each sub-band is obtained, the energy of each sub-band is calculated first to obtain the sub-band energy vector of each frequency domain signal, and then the energy difference between adjacent sub-bands is calculated to obtain the sub-band energy difference vector of each frequency domain signal, that is, the sub-band energy difference vector corresponding to each frame signal.

Wherein, the sub-band energy vector corresponding to the ith frame signal can be represented as E_i = (e_{i,1}, …, e_{i,m}, …, e_{i,33}), and the sub-band energy difference vector corresponding to the ith frame signal can be represented as E'_i = (e_{i,1} - e_{i,2}, …, e_{i,m} - e_{i,m+1}, …, e_{i,32} - e_{i,33}), where e_{i,m} is the energy of the mth sub-band of the ith frame signal.
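The computation of step S103 can be sketched as follows, under stated assumptions. Formula (4) is not reproduced in this text, so the logarithmic spacing of the band edges in `band_edges` is an assumption borrowed from common sub-band fingerprinting practice, not the embodiment's formula. Sub-band energy is assumed to be the sum of squared DFT magnitudes in the band. All function names are hypothetical.

```python
def band_edges(f_min=200.0, f_max=3000.0, bands=33):
    """Return bands + 1 edge frequencies between f_min and f_max.
    Logarithmic spacing is an assumption (formula (4) is not shown)."""
    ratio = f_max / f_min
    return [f_min * ratio ** (m / bands) for m in range(bands + 1)]


def subband_energies(spectrum, sample_rate, edges):
    """Sum |Y(k)|^2 over the DFT bins whose frequency lies in each band."""
    L = len(spectrum)
    energies = [0.0] * (len(edges) - 1)
    for k in range(L // 2 + 1):          # non-negative frequencies only
        freq = k * sample_rate / L
        for m in range(len(edges) - 1):
            if edges[m] <= freq < edges[m + 1]:
                energies[m] += abs(spectrum[k]) ** 2
                break
    return energies


def energy_differences(energies):
    """E'_i: difference between each sub-band and the next one."""
    return [energies[m] - energies[m + 1] for m in range(len(energies) - 1)]


edges = band_edges()
print(len(edges), energy_differences([5.0, 3.0, 4.0]))
```

With 33 sub-bands the difference vector has 32 entries, which matches the 32-element vectors used in the later steps.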

In step S104, the subband energy difference vector corresponding to the previous frame signal is used to filter the subband energy difference vector corresponding to the current frame signal.

In this embodiment, the filtering is performed according to formula (5), and the filtered vector can be represented as S_i = (s_{i,1}, …, s_{i,m}, …, s_{i,32}). When the current frame is the first frame, either no filtering is carried out, or a standard sub-band energy difference vector, which may be a zero vector, is used for the filtering.

In formula (5), s_{i,m} is the filtered value of the mth sub-band energy difference of the ith frame.

In step S105, binary quantization is performed on the filtered vector, and a fingerprint vector of each frame signal is obtained.

In this embodiment, formula (6) may be used to perform binary quantization to obtain a fingerprint vector T_i = (t_{i,1}, …, t_{i,m}, …, t_{i,32}), where T_i is the fingerprint vector of the ith frame signal and t_{i,m} is the fingerprint bit obtained from the filtered mth sub-band energy difference of the ith frame. In addition, in other embodiments, the vector before filtering may be directly subjected to binary quantization to obtain the fingerprint vector. The fingerprint vectors of the frame signals then constitute the fingerprint sequence (or fingerprint block) of the audio signal to be retrieved.
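Steps S104 and S105 can be sketched as follows. Formulas (5) and (6) are not reproduced in this text, so two assumptions are made and should be read as such: the filter is taken to be the element-wise difference between the current and the previous frame's energy difference vectors, and the quantizer is taken to output 1 for a positive filtered value and 0 otherwise. The first frame is filtered against a zero vector, which the description allows. The function names are hypothetical.

```python
def filter_differences(current, previous=None):
    """Assumed form of formula (5): s_{i,m} = e'_{i,m} - e'_{i-1,m}.
    For the first frame a zero vector stands in for the previous frame."""
    if previous is None:
        previous = [0.0] * len(current)
    return [c - p for c, p in zip(current, previous)]


def quantize(filtered):
    """Assumed form of formula (6): bit is 1 if the value is positive."""
    return [1 if s > 0 else 0 for s in filtered]


def fingerprint_sequence(diff_vectors):
    """Turn time-ordered energy difference vectors into fingerprint vectors."""
    sequence, previous = [], None
    for current in diff_vectors:
        sequence.append(quantize(filter_differences(current, previous)))
        previous = current
    return sequence


print(fingerprint_sequence([[1.0, -2.0], [0.5, -1.0]]))
```

Each 32-element energy difference vector yields one 32-bit fingerprint vector, and the vectors are kept in time order to form the fingerprint sequence.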

With continued reference to fig. 2, in step S2, a similar fingerprint sequence is obtained by searching in the hash table of the audio fingerprint library according to the fingerprint sequence.

In this embodiment, each fingerprint vector in the fingerprint sequence of the audio signal is specifically converted into a hash value, a search query is performed in a hash table of the audio fingerprint library, a hit position is used as a candidate position, and a fingerprint sequence corresponding to the candidate position is used as a similar fingerprint sequence. The audio fingerprint library is constructed according to a plurality of audio signal samples, and a hash table of the audio fingerprint library is created according to a fingerprint sequence of the audio signal samples, so that retrieval and query are facilitated. In this embodiment, the fingerprint sequences of the plurality of audio signal samples are also obtained by the method described in step S1.
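The hash lookup of step S2 can be sketched as follows, as one possible reading of the description. The packing of a fingerprint vector into an integer key, the in-memory dictionary used as the hash table, and the conversion of each hit into a candidate start position by subtracting the query frame index are all assumptions; the description only states that hit positions are used as candidate positions. The names `vector_to_hash`, `build_hash_table` and `candidate_positions` are hypothetical.

```python
from collections import defaultdict


def vector_to_hash(bits):
    """Pack a list of 0/1 fingerprint bits into a single integer key."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def build_hash_table(library):
    """library maps sample_id -> fingerprint sequence.
    The table maps hash value -> list of (sample_id, frame_index)."""
    table = defaultdict(list)
    for sample_id, sequence in library.items():
        for pos, vector in enumerate(sequence):
            table[vector_to_hash(vector)].append((sample_id, pos))
    return table


def candidate_positions(query_sequence, table):
    """Collect (sample_id, start_frame) pairs at which the query could be
    aligned with a library sample (alignment rule is an assumption)."""
    candidates = set()
    for q_pos, vector in enumerate(query_sequence):
        for sample_id, pos in table.get(vector_to_hash(vector), []):
            start = pos - q_pos
            if start >= 0:
                candidates.add((sample_id, start))
    return candidates


table = build_hash_table({"a": [[0, 1], [1, 1], [1, 0]]})
print(candidate_positions([[1, 1], [1, 0]], table))
```

The fingerprint sequence of the library sample starting at each candidate position would then be taken as a similar fingerprint sequence for the subsequent segment-wise comparison.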

In step S3, the fingerprint sequence and each similar fingerprint sequence are divided into the same number of pieces of fingerprints in the frequency domain dimension, respectively.

In this embodiment, as shown in fig. 4, taking a similar fingerprint sequence as an example, the similar fingerprint sequence is divided into H segments A_1 to A_H in the frequency domain of 200Hz to 3000Hz. The vertical direction in fig. 4 indicates the number of frames, which is set to 256 frames in this embodiment.
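The frequency domain segmentation of step S3 can be sketched as follows. The description does not state the value of H or whether the segments have equal width, so equal-width contiguous segments are assumed here; the function name `split_frequency` is hypothetical. A fingerprint sequence is represented as a list of frames, each frame being a list of bits ordered by sub-band.

```python
def split_frequency(sequence, segments):
    """Split every frame's bit vector into `segments` contiguous groups of
    sub-band bits and return one sub-block per segment (all frames kept).
    Equal segment widths are an assumption."""
    width = len(sequence[0])
    if width % segments != 0:
        raise ValueError("bit width must be divisible by the segment count")
    step = width // segments
    return [[frame[h * step:(h + 1) * step] for frame in sequence]
            for h in range(segments)]


block = [[0, 1, 1, 0],
         [1, 1, 0, 0]]
print(split_frequency(block, 2))
```

For example, under this assumption a block of 256 frames with 32 bits per frame and H = 4 would give four segments of 256 frames by 8 bits each. The same function is applied to the fingerprint sequence and to every similar fingerprint sequence so that corresponding segments cover the same sub-bands.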

In step S4, each segment of fingerprint in each similar fingerprint sequence is compared with the corresponding segment of fingerprint in the fingerprint sequence, and the differences are used to obtain the retrieval result of the audio signal in the audio fingerprint library.

In this embodiment, each similar fingerprint sequence and the fingerprint sequence each include H segments of fingerprints. When the differences are compared, the hth segment of fingerprint in each similar fingerprint sequence is compared in turn with the hth segment of fingerprint in the fingerprint sequence to obtain the number of differing bits, where h is greater than or equal to 1 and less than or equal to H. After the difference bit number of each segment of fingerprint of each similar fingerprint sequence is obtained, the error rate of each segment of fingerprint in the similar fingerprint sequence is calculated one by one.

Specifically, as shown in fig. 5, if the bit error rate d of a fingerprint exceeds the first bit error rate threshold r, it indicates that the audio signal sample corresponding to the similar fingerprint sequence does not contain the audio signal to be retrieved, and the audio signal is not detected. Otherwise, the audio signal sample corresponding to the similar fingerprint sequence is indicated to contain the audio signal to be retrieved, and the audio signal is successfully retrieved. The first error rate threshold is the threshold of a single fingerprint segment and is used for judging the error rate of each fingerprint segment.

If the error rate d of each fingerprint segment of the similar fingerprint sequence does not exceed the first error rate threshold r, the error rates are summed to obtain an overall error rate D. If the overall error rate D exceeds the second error rate threshold R, the audio signal sample corresponding to the similar fingerprint sequence does not contain the audio signal to be retrieved, and the audio signal is not detected. Otherwise, the audio signal sample corresponding to the similar fingerprint sequence contains the audio signal to be retrieved, and the audio signal is successfully retrieved. The second error rate threshold is an overall error rate threshold used to judge the sum of the error rates.
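The two-stage decision of step S4 can be sketched as follows. The logic follows the description: a candidate is rejected if any segment error rate d exceeds the first threshold r, and otherwise the segment error rates are summed into D and compared with the second threshold R. The threshold values used in the demonstration are illustrative assumptions only, since the description does not give values, and the function names are hypothetical.

```python
def segment_error_rate(query_segment, candidate_segment):
    """Differing bits divided by the total number of bits in the segment."""
    total = sum(len(row) for row in query_segment)
    differing = sum(q != c
                    for q_row, c_row in zip(query_segment, candidate_segment)
                    for q, c in zip(q_row, c_row))
    return differing / total


def is_match(query_segments, candidate_segments, r, R):
    """Return True only if every segment error rate d <= r and the summed
    error rate D <= R."""
    rates = [segment_error_rate(q, c)
             for q, c in zip(query_segments, candidate_segments)]
    if any(d > r for d in rates):
        return False
    return sum(rates) <= R


# Two segments of 2 frames x 2 bits: the first agrees, the second differs
# in every bit (for instance shared background music in one band only).
query = [[[0, 0], [0, 0]], [[1, 1], [1, 1]]]
candidate = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
print(is_match(query, candidate, r=0.35, R=0.5))
```

In this demonstration half of all bits agree, so a single error rate computed over the whole block would be 0.5, while the per-segment check sees a segment with error rate 1.0 and rejects the candidate. This is the situation of fig. 1 that the segmentation is intended to catch.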

According to the method and the device, when the retrieval matching is carried out based on the fingerprint sequence, the fingerprint sequence is divided into a plurality of sections in the frequency domain, the error rate of each section of fingerprint is calculated firstly, then the overall error rate is calculated, and the judgment is carried out from the two aspects of the error rate of the section of fingerprint and the overall error rate, so that the error of signal retrieval can be reduced, and the accuracy of retrieval is improved.

Based on the foregoing method embodiment, the present application further provides an audio retrieval apparatus, as shown in fig. 6, the apparatus includes: the acquisition module is used for acquiring a fingerprint sequence of the audio signal to be retrieved; the matching module is used for obtaining a similar fingerprint sequence of the audio signal in an audio fingerprint database by adopting a Hash index according to the fingerprint sequence; the segmentation module is used for dividing the fingerprint sequence and each similar fingerprint sequence into a plurality of sections of fingerprints with the same quantity in a frequency domain dimension; and the comparison module is used for determining that the audio signal is successfully retrieved according to the difference between each section of fingerprint in each similar fingerprint sequence and each section of fingerprint in the fingerprint sequence.

respectively converting the plurality of fingerprint vectors into hash values;

In a possible implementation manner, the comparison module is specifically configured to:

An embodiment of the present application further provides a computing device, where the computing device includes: a memory and a processor; the memory stores computer instructions which, when executed by the processor, implement aspects of the foregoing method embodiments.

It is understood that the processor in the embodiments of the present application may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general purpose processor may be a microprocessor, but may be any conventional processor.

The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in Random Access Memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application.

Claims

1. An audio retrieval method, the method comprising:

acquiring a fingerprint sequence of an audio signal to be retrieved;

according to the fingerprint sequence, a similar fingerprint sequence of the audio signal is obtained in an audio fingerprint database by adopting a Hash index;

dividing the fingerprint sequence and each similar fingerprint sequence into a plurality of sections of fingerprints with the same quantity in a frequency domain dimension;

and determining that the audio signal is successfully retrieved according to the difference between each section of fingerprint in each similar fingerprint sequence and each section of fingerprint in the fingerprint sequence.

2. The method of claim 1, wherein the fingerprint sequence comprises a plurality of fingerprint vectors, and wherein obtaining a similar fingerprint sequence of the audio signal in an audio fingerprint library using hash indexing according to the fingerprint sequence comprises:

respectively converting the plurality of fingerprint vectors into hash values;

3. The method according to claim 1, wherein the determining that the audio signal was successfully retrieved according to the difference between each segment of the fingerprint in the similar fingerprint sequences and each segment of the fingerprint in the fingerprint sequences comprises:

4. The method of claim 3, wherein the determining that the audio signal was successfully retrieved according to the error rate of each fingerprint in the similar fingerprint sequences comprises:

5. The method of claim 3, wherein determining that the audio signal was successfully retrieved based on the error rate of each fingerprint in the similar fingerprint sequences further comprises:

6. The method according to claim 1, wherein the obtaining of the fingerprint sequence of the audio signal to be retrieved comprises:

7. The method according to claim 6, wherein before the obtaining the frequency domain signal corresponding to each frame signal in the audio signal, the method further comprises:

8. The method according to claim 6, wherein before performing binary quantization on the subband energy difference vector corresponding to each frame signal, the method further comprises:

9. An audio retrieval apparatus, the apparatus comprising:

the acquisition module is used for acquiring a fingerprint sequence of the audio signal to be retrieved;

the matching module is used for obtaining a similar fingerprint sequence of the audio signal in an audio fingerprint database by adopting a Hash index according to the fingerprint sequence;

the segmentation module is used for dividing the fingerprint sequence and each similar fingerprint sequence into a plurality of sections of fingerprints with the same quantity in a frequency domain dimension;

and the comparison module is used for determining that the audio signal is successfully retrieved according to the difference between each section of fingerprint in each similar fingerprint sequence and each section of fingerprint in the fingerprint sequence.

10. A computing device, wherein the computing device comprises: a memory and a processor; the memory stores computer instructions that, when executed by the processor, implement the method of any of claims 1-8.