CN107293307B

CN107293307B - Audio detection method and device

Info

Publication number: CN107293307B
Application number: CN201610201533.5A
Authority: CN
Inventors: 张�荣
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2021-07-16
Anticipated expiration: 2036-03-31
Also published as: CN107293307A

Abstract

The application discloses an audio detection method and device. Wherein, the method comprises the following steps: acquiring an audio fingerprint of an audio file to be tested; for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier, and a location where a sample fingerprint occurs in a sample audio file, the sample audio file being an audio file indicated by the sample audio file identifier; and acquiring the similarity between the audio file to be tested and the similar audio file, and determining whether the audio file to be tested is the audio of the specified type according to the similarity.

Description

Audio detection method and device

Technical Field

The invention relates to the field of audio detection, in particular to an audio detection method and device.

Background

Currently, for the detection of illegal audio, an audio retrieval algorithm such as MD5 value comparison and audio digital watermarking is generally adopted, wherein,

MD5 value comparison: the MD5 value of any audio file is a text string of length 32. The MD5 values of identical audio files are necessarily the same, whereas two files that differ by even one bit have different MD5 values. That is, even for the same song, the file MD5 values differ at different sampling rates. The question of judging whether two audio files are the same can thus be turned into the question of comparing whether their MD5 values are consistent. The MD5 value comparison method has the advantages of simple calculation and 100% accuracy, but its missed-detection rate is high, since only audio files that are completely identical to those in the index can be detected.
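The sensitivity of MD5 comparison to a single-bit change can be illustrated with a minimal sketch. The byte strings below are hypothetical stand-ins for the contents of audio files, and the helper name `md5_hex` is introduced only for this illustration; the sketch uses Python's standard `hashlib` module.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-in for the bytes of an audio file.
original = bytes(range(256)) * 4
# The same content with a single bit flipped in the first byte.
modified = bytes([original[0] ^ 0x01]) + original[1:]

digest_original = md5_hex(original)
digest_modified = md5_hex(modified)

print(len(digest_original))                   # length of the text string
print(digest_original == md5_hex(original))   # identical content
print(digest_original == digest_modified)     # one-bit difference
```

Because the comparison is an exact match on the digest, any re-encoding or resampling of the same song produces a different value, which is why this approach misses modified copies.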

Audio digital watermarking: in digital watermarking technology, specially designed watermark information for tracing the source of piracy is provided. A different watermark is embedded into each copy of an audio file, so that when a pirated copy is found, the source of its distribution can be identified from the watermark information. However, in certain business scenarios, the offending audio often carries no embedded watermark that would help the supervisor trace its source.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

According to an aspect of an embodiment of the present application, there is provided an audio detection method, including: acquiring an audio fingerprint of the audio file to be detected; for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier, and a location where a sample fingerprint occurs in a sample audio file, the sample audio file being an audio file indicated by the sample audio file identifier; and acquiring the similarity between the audio file to be tested and the similar audio file, and determining whether the audio file to be tested is the audio of the specified type according to the similarity.

Optionally, for each audio fingerprint of the audio file to be tested, looking up a time set in which the audio fingerprint of the audio file to be tested appears from the inverted list corresponding to the audio fingerprint of the audio file to be tested, where the time set includes: time indicated by the positions of the audio fingerprints of the audio file to be tested appearing in all the sample audio files;

for each audio fingerprint of the audio file to be detected, taking the time of the audio fingerprint of the audio file to be detected appearing in the audio file to be detected as reference time, carrying out difference operation on the time and the elements in the time set, and generating an intermediate result according to the obtained time difference; wherein the intermediate result is composed of the sample audio file identifier and the time difference corresponding to the sample audio file identifier;

and for each sample audio file identifier in the inverted list, counting the number of all the same time differences in the sample audio files indicated by the sample audio file identifier according to the intermediate result, sorting the sample audio files in the inverted list in descending order of the number to obtain the first M sample audio files, and taking the first M sample audio files as the similar audio files, wherein M is a natural number.

Optionally, obtaining the similarity between the audio file to be tested and the similar audio file includes: acquiring the similarity between the audio file to be tested and each of the M audio files according to the number N1 of the fingerprints in the audio file to be tested, the number N2 of the fingerprints in each of the M audio files and the number N of the characteristic point pairs, wherein N1, N2 and N are natural numbers; the characteristic point pairs are obtained through the following method: combining the anchor point corresponding to the local maximum value of each frame of each audio file in the M audio files on the spectrogram and the anchor point in the preset matrix target area in pairs, and taking each combination as one characteristic point pair; and the anchor points are sampling points corresponding to local maximum values in each frame spectrogram.

Optionally, the range of sampling points corresponding to each frame of the audio file is determined as follows: [2048k, 2048k + 4095], wherein k is a natural number.
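The frame range above implies frames of 4096 sampling points that start every 2048 points, so consecutive frames overlap by half. The following minimal sketch makes this explicit; the function name `frame_range` and the assumed sampling rate of 44100 Hz (the rate used in the worked example later in this description) are choices made for illustration only.

```python
from typing import Tuple

def frame_range(k: int) -> Tuple[int, int]:
    """Inclusive range [2048k, 2048k + 4095] of sampling points in frame k."""
    start = 2048 * k
    return start, start + 4095

SAMPLE_RATE_HZ = 44100  # assumed sampling rate, for illustration only

# Duration of one frame and spacing between frame starts, in milliseconds.
frame_ms = 4096 / SAMPLE_RATE_HZ * 1000
hop_ms = 2048 / SAMPLE_RATE_HZ * 1000

for k in range(3):
    print(k, frame_range(k))
print(round(frame_ms, 1), round(hop_ms, 1))
```

Under that assumed sampling rate, a frame spans roughly 92.9 ms and frames start roughly 46.4 ms apart.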

According to another aspect of the embodiments of the present application, there is also provided an audio detection apparatus, including: an acquisition module, used for acquiring the audio fingerprint of the audio file to be tested; a query module, used for searching, for each audio fingerprint of the audio file to be tested, a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier, and a location where a sample fingerprint occurs in a sample audio file, the sample audio file being an audio file indicated by the sample audio file identifier; and an identification module, used for acquiring the similarity between the audio file to be tested and the similar audio file and determining whether the audio file to be tested is the audio of the specified type according to the similarity.

In the embodiments of the application, the audio file to be tested is detected by searching for similar audio files of the audio file to be tested according to its audio fingerprints, and determining whether the audio file to be tested is audio of the specified type according to the similarity between the audio file to be tested and the similar audio files. Because detection is based on audio fingerprints, the audio file to be tested can be detected accurately, and the missed-detection rate of audio detection is reduced; moreover, because similar audio files are searched from a newly designed inverted list (also called an inverted document), the scheme provided by the embodiments of the application is suitable for searching a large-scale sample set and has a wide application range; in addition, because the audio fingerprints of the whole audio file to be tested are examined, more clues can be provided for tracing the source of illegal audio, which gives strong support for such tracing and solves the technical problems that the missed-detection rate is high and the source of illegal audio cannot be effectively traced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware structure of a computer terminal for executing an audio detection method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart diagram of an alternative audio detection method according to an embodiment of the present application;

FIG. 3 is a diagram illustrating an alternative audio fingerprint inverted document structure according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an alternative audio fingerprint forward document according to an embodiment of the application;

FIG. 5 is a diagram illustrating an alternative audio fingerprint according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a local maximum in an alternative spectrogram according to an embodiment of the present application;

FIG. 7 is a schematic diagram of an alternative anchor point and target window according to an embodiment of the present application;

FIG. 8 is a schematic diagram illustrating an alternative illegal audio detection flow according to an embodiment of the present application;

FIG. 9 is a schematic diagram illustrating an alternative exemplary creation process of an illegal audio index according to an embodiment of the present application;

FIG. 10 is a schematic diagram of an intermediate result of an alternative retrieval process according to an embodiment of the present application;

FIG. 11 is a block diagram of an alternative audio detection device according to an embodiment of the present application;

FIG. 12 is a block diagram of an alternative computer terminal according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

A primary object of the embodiments of the present application is to establish an efficient mechanism for comparing the perceptual auditory similarity of two pieces of audio data. It is noted that, in the embodiments of the present application, rather than directly comparing the audio data itself, which is typically large, the corresponding digital fingerprints, which are typically small, are compared. Fingerprints of a large amount of audio data are stored in a database together with their corresponding metadata, such as song title, lyricist and composer, lyrics, etc., and the fingerprints are used as indices of the corresponding metadata.

Retrieval of an audio fingerprint typically involves two parts: a fingerprint extraction algorithm that computes perceptually significant features, and a comparison algorithm that performs an efficient search in a fingerprint database. When unknown audio needs to be retrieved, its audio fingerprint is calculated according to the fingerprint extraction algorithm and then compared with the large number of audio fingerprints stored in the database to search for similar audio. An efficient fingerprint extraction algorithm and fingerprint comparison algorithm can correctly identify, in the database, the original version of unknown audio that may have been subjected to various signal processing distortions.

The audio fingerprint detection algorithm has the following measurement indexes:

a) Accuracy: comprises the correct recognition rate, the false negative rate and the false positive rate.

b) Robustness: unknown audio can still be identified after signal processing of a certain intensity. Such processing includes lossy compression, desynchronization due to cropping or misalignment, transposition, equalization, noise, D/A-A/D conversion, and the like. In order to improve robustness, fingerprints must be extracted from audio features of the perceptually important content, so that invariance to signal processing is achieved to some extent.

c) Distinctiveness is that fingerprints between different content audio should have large differences, while fingerprints between different versions of the same content audio should have small differences.

d) Reliability: the probability that a song is correctly recognized, usually measured by the false positive rate. The false positive rate is the most important parameter in an audio retrieval system, because a false positive reports fingerprints that in fact have no similarity to anything in the database as similar, which can seriously affect confidence in the retrieval system.

e) Fingerprint size: to increase the speed of retrieval, fingerprints are typically stored in memory, with the size expressed in bits per second. The size of the fingerprint determines to a large extent the memory capacity required by the fingerprint database.

f) Granularity is a parameter that depends on the specific application scenario, i.e. how many seconds of an unknown audio piece are needed to identify the entire audio.

g) Retrieval speed: this is a key parameter for practical commercial audio fingerprinting systems. It is often required to search a fingerprint database of 100,000 songs within a time on the order of seconds using limited computing resources, such as an ordinary PC.

h) Versatility-the ability to identify different audio formats and to use the same database for different applications.

The above factors may be strongly correlated with each other. For example, if a smaller granularity is used, a larger fingerprint must be extracted within one granularity unit to achieve the same search reliability, because a smaller granularity decreases reliability while a larger fingerprint size increases it. For another example, when a more robust fingerprint is used, the search speed increases: fingerprint search is a kind of approximate search, and the more robust the fingerprint, the smaller the distance between the unknown fingerprint and the original fingerprint under the same signal processing, which speeds up the search.

The following detailed description is given with reference to specific examples.

Example 1

According to an embodiment of the present application, an embodiment of an audio detection method is provided. It should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that shown here.

The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking an example of the method running on a computer terminal, fig. 1 is a block diagram of a hardware structure of a computer terminal for executing an audio detection method according to an embodiment of the present application. As shown in fig. 1, the computer terminal 10 may include one or more (only one shown) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the audio detection method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, so as to implement the audio detection method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

Under the above operating environment, the present application provides an audio detection method as shown in fig. 2. Fig. 2 is a schematic flow chart of an alternative audio detection method according to an embodiment of the present application. As shown in fig. 2, the method comprises steps S202-S206, wherein:

step S202, acquiring an audio fingerprint of an audio file to be tested;

It should be noted that, before step S202, the audio file to be tested may be determined. For example, various audio files are downloaded from the internet, and after the content of an audio file has been listened to manually or identified by an application on the terminal, the audio identified as illegal may be taken as the audio file to be tested; the illegal audio may thus be identified either manually or by the application.

In addition, in order to meet the requirements on data throughput and missed-detection rate, and depending on actual conditions, before the audio fingerprint of the audio file to be tested is acquired, the audio format of the audio file to be tested is converted into a specified format, and the audio sampling rate of the audio file to be tested is normalized. In this way, audio fingerprints can be collected from multiple angles, improving audio detection accuracy. For example, format conversion can be performed according to the audio formats in the sample audio library, so that detection can be performed against sample audio libraries with different audio formats, further reducing the missed-detection rate.

For example, the most common audio formats at present are Advanced Audio Coding (AAC) and Moving Picture Experts Group Audio Layer III (MP3).

MP3 is used to greatly reduce the data size of audio files: audio can be compressed at a ratio of 1:10 or even 1:12, and the quality of the compressed audio is not significantly reduced compared with uncompressed audio. AAC is an MPEG-2 based audio coding technique. At the same bit rate, AAC quality is better than MP3. However, because the MP3 format currently holds an absolute advantage in its share of audio files, it is preferable to unify the audio format to MP3. The audio sampling rate refers to the number of times a recording device samples a sound signal in one second; the higher the sampling frequency, the truer and more natural the sound. On current mainstream capture cards, the sampling frequency is generally divided into three grades, 22050 Hz, 44100 Hz and 48000 Hz; 44100 Hz is the theoretical limit of CD sound quality and is also the mainstream audio sampling rate. The sampling rate and the audio format to convert to can therefore be selected flexibly according to the actual situation.

Step S204, for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from an inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identification, and a location in the sample audio file where the sample fingerprint occurs, the sample audio file being the audio file indicated by the sample audio file identification. Optionally, the inverted list may be generated as follows: an audio fingerprint collection is allocated directly to store all sample fingerprints (for example, the collection may take the form of an array, but is not limited to this), and each sample fingerprint then points to a list of all audio files in which that sample fingerprint occurs, the list storing the audio file identification (id) and the location in the sample audio file where the sample fingerprint occurs.

The structure of the inverted document can be seen in FIG. 3. As shown in FIG. 3, the sample fingerprints include fingerprint 1, fingerprint 2, fingerprint 3, ……, fingerprint 2²⁴-1, wherein (id1, pos1) represents the audio file identifier (id1) corresponding to fingerprint 1 and the position (pos1) of fingerprint 1 in the audio file identified by id1; the information in the remaining lists has a similar meaning and is not described again here.
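The record layout described for the inverted document can be sketched as a mapping from each sample fingerprint to a list of (id, pos) pairs. This is only an illustrative data structure: the fingerprint strings, identifiers and positions below are made-up values, and the use of a Python dictionary in place of the array mentioned above is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Forward view: sample audio id -> list of (fingerprint, position).
# All values here are hypothetical and serve only as an illustration.
forward_docs: Dict[str, List[Tuple[str, int]]] = {
    "id1": [("fp_a", 100), ("fp_b", 100)],
    "id2": [("fp_a", 200)],
}

def build_inverted_list(
    docs: Dict[str, List[Tuple[str, int]]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each fingerprint to the records (audio id, position) where it occurs."""
    inverted = defaultdict(list)
    for audio_id, fingerprints in docs.items():
        for fingerprint, position in fingerprints:
            inverted[fingerprint].append((audio_id, position))
    return dict(inverted)

inverted_list = build_inverted_list(forward_docs)
for fingerprint, records in inverted_list.items():
    print(fingerprint, records)
```

With this layout, looking up one fingerprint of the audio under test returns every sample audio file, and every position, at which that fingerprint occurs.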

The sample audio file is the reference audio for the audio file to be tested, that is, the reference audio file of a specified type (for example, illegal audio) against which it is determined whether the audio file to be tested is of that type. A sample fingerprint in the set of audio fingerprints is an audio fingerprint of such a reference audio file (i.e., sample audio file).

Optionally, the query process for the audio to be tested may be implemented by the following processes, but is not limited to this:

1) for each audio fingerprint of the audio file to be tested, searching a time set of the audio fingerprint of the audio file to be tested from the inverted list corresponding to the audio fingerprint of the audio file to be tested, wherein the time set comprises the following components: the time indicated by the positions of the audio fingerprints of the audio file to be tested appearing in all the sample audio files;

2) for each audio fingerprint of the audio file to be detected, taking the time of the audio fingerprint of the audio file to be detected appearing in the audio file to be detected as reference time, carrying out difference operation on the time and the elements in the time set, and generating an intermediate result according to the obtained time difference; each intermediate result is composed of a sample audio file identifier and the time difference corresponding to the audio file identifier;

3) for each sample audio file identifier in the inverted list, counting, according to the intermediate results, the number of all the same time differences in the sample audio file indicated by the audio file identifier, sorting the sample audio files in the inverted list in descending order of that number to obtain the top M sample audio files, and taking the top M sample audio files as the similar audio files, wherein M is a natural number.
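The three query steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `find_similar`, the dictionary form of the inverted list, and the reading of "the number of all the same time differences" as the largest count of any single time difference per sample file are all assumptions made here.

```python
from collections import Counter
from typing import Dict, List, Tuple

def find_similar(
    query: List[Tuple[str, int]],
    inverted_list: Dict[str, List[Tuple[str, int]]],
    m: int,
) -> List[Tuple[str, int]]:
    """Return the top-M (sample id, count) pairs, ranked by count descending."""
    # Steps 1 and 2: for every query fingerprint, look up the times at which
    # it occurs in the samples and record (sample id, time difference).
    intermediate = Counter()
    for fingerprint, query_time in query:
        for sample_id, sample_time in inverted_list.get(fingerprint, []):
            intermediate[(sample_id, query_time - sample_time)] += 1
    # Step 3: per sample id, keep the largest count of one identical
    # time difference (assumed interpretation), then sort descending.
    best: Dict[str, int] = {}
    for (sample_id, _diff), count in intermediate.items():
        best[sample_id] = max(best.get(sample_id, 0), count)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]

# Hypothetical data: two sample files and a query with two fingerprints.
samples = {"fp_a": [("s1", 100), ("s2", 200)], "fp_b": [("s1", 100)]}
query_fingerprints = [("fp_a", 150), ("fp_b", 150)]
print(find_similar(query_fingerprints, samples, 1))
```

Counting identical time differences rather than raw fingerprint hits rewards samples whose matches line up at a consistent offset against the audio under test.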

The manner in which the above-described similar audio files are determined is described in detail below with reference to a specific example for ease of understanding.

First, the sources of the constituent elements of the forward list and the inverted list (also called inverted document or inverted index) are described. Take three audio files (music, for example): r1.aac, r2.mp3 and r3.wav. The musical composition to be retrieved (i.e., the audio file to be tested) is t1.mp3.

The process of establishing the forward document and the inverted document is as follows:

Step one:

The 3 tracks are converted to mp3 format, normalized to a sampling rate of 44100 Hz, and renamed r1.mp3, r2.mp3 and r3.mp3.

Step two:

A Fourier transform is performed on each piece of music.

Specifically, every 46 ms, a 92 ms time window of the audio signal is taken for the Fourier transform. If r1.mp3 is 1 second long, this yields a total of (1000-92)/46+1 = 20 windows, with the quotient rounded down.
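The window count in this step can be checked with a short calculation. The function name `count_windows` and its parameter names are introduced only for this sketch; the window length and step are the 92 ms and 46 ms values stated above.

```python
def count_windows(duration_ms: int, window_ms: int = 92, hop_ms: int = 46) -> int:
    """Number of complete windows of window_ms taken every hop_ms."""
    if duration_ms < window_ms:
        return 0
    # Integer division rounds the quotient down before adding the first window.
    return (duration_ms - window_ms) // hop_ms + 1

def window_starts(duration_ms: int, window_ms: int = 92, hop_ms: int = 46) -> list:
    """Start times in ms of every complete window."""
    return [i * hop_ms for i in range(count_windows(duration_ms, window_ms, hop_ms))]

print(count_windows(1000))
print(window_starts(1000)[-1])
```

For a 1000 ms file the last complete window starts at 874 ms and ends at 966 ms; a further window starting at 920 ms would run past the end of the file.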

Step three:

An audio fingerprint is computed. Each window generates 10 local maxima (also referred to as anchor points); the manner in which local maxima are generated is described in more detail below and is not elaborated here. Each anchor point has a matrix target zone on its right side, and each matrix target zone contains 1 or more (no more than 10) anchor points. Suppose a certain anchor point is denoted P1 and its corresponding target zone contains 2 anchor points, denoted P21 and P22 respectively.

Then 2 audio fingerprints are generated from P1: one consisting of P1 and P21, the other consisting of P1 and P22. Suppose the time t1 at which P1 appears in music r1 is 100 ms and its frequency is 500 Hz, and that P21 appears in music r1 at 120 ms with a frequency of 600 Hz (for reference, the frequency range of the human voice is about 250 Hz to 5000 Hz); the above times are then grouped into different time sets. The audio fingerprint formed by P1 and P21 is then hash([f1:f21:delta_t]):t1 = hash([500:600:120-100]):100. Suppose further that P22 appears in music r1 at 125 ms with a frequency of 700 Hz; the times are again grouped into different time sets. The audio fingerprint formed by P1 and P22 is then hash([f1:f22:delta_t]):t1 = hash([500:700:125-100]):100.
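The pairing of an anchor point with a point in its target zone can be sketched as below. The description does not specify the hash function, so MD5 truncated to 24 hexadecimal characters is used here purely as an assumed stand-in; the digests it produces are not the values shown in this description. The name `make_fingerprint` and the (time, frequency) tuple layout are also assumptions.

```python
import hashlib
from typing import Tuple

Point = Tuple[int, int]  # (time in ms, frequency in Hz)

def make_fingerprint(anchor: Point, target: Point) -> Tuple[str, int]:
    """Return (hash of [f1:f2:delta_t], t1) for an anchor/target pair."""
    t1, f1 = anchor
    t2, f2 = target
    key = f"{f1}:{f2}:{t2 - t1}"
    # Assumed stand-in hash: MD5 truncated to 24 hexadecimal characters.
    digest = hashlib.md5(key.encode("ascii")).hexdigest()[:24]
    return digest, t1

p1, p21, p22 = (100, 500), (120, 600), (125, 700)
fp_p1_p21 = make_fingerprint(p1, p21)
fp_p1_p22 = make_fingerprint(p1, p22)

# A pair with the same two frequencies and the same time difference,
# but occurring 100 ms later, hashes to the same value.
fp_shifted = make_fingerprint((200, 500), (220, 600))

print(fp_p1_p21, fp_p1_p22, fp_shifted)
```

Because only the two frequencies and the time difference enter the hash, the hash is independent of where the pair occurs in the file; the absolute time t1 is carried alongside it as the position.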

Step four:

An index is established. For the forward document, for example, the forward document of r1.mp3 is:

r1: fingerprint 1, fingerprint 2, ……, fingerprint n.

If the examples in step three are the 10th and 11th fingerprints, then the forward document of r1.mp3 is:

For the inverted index (also known as inverted document or inverted list): suppose there is an anchor point in r2.mp3, denoted P3, which appears in r2.mp3 at 200 ms with a frequency of 500 Hz. The target zone corresponding to P3 contains 3 anchor points, denoted P41, P42 and P43 respectively. If the frequency of P41 is 600 Hz and it appears in r2.mp3 at 220 ms, then the audio fingerprint composed of P3 and P41 is hash([f1:f21:delta_t]):t1 = hash([500:600:220-200]):200.

Then the inverted document must contain the following records:

……

Fingerprint (6781d3bdbfcb20b73e34e0e5): (r1.mp3,100) | (r2.mp3,200) | ……

……

Fingerprint (715ff0eaccbd75decb48aa80): (r1.mp3,100) | ……

……

If r1.mp3 has 1000 fingerprints, r2.mp3 has 2000 fingerprints, and r3.mp3 has 3000 fingerprints, then there are 1000 + 2000 + 3000 = 6000 records in total in the inverted document.

This completes the ingestion process, that is, the establishment of the forward document and the inverted document.

Next, a process of retrieving similar audio files will be described.

Assume that similar audio files of r4.mp3 (hereinafter referred to as r4) are retrieved as follows:

Step one: an audio fingerprint of r4 is obtained.

Let any anchor point in r4.mp3 be denoted P5, and suppose its corresponding target zone contains 2 anchor points, denoted P61 and P62 respectively.

Then 2 audio fingerprints are generated from P5: one consisting of P5 and P61, the other consisting of P5 and P62. Suppose the time t5 at which P5 appears in music r4 is 150 ms and its frequency f5 is 500 Hz, and that P61 appears in music r4 at 170 ms with a frequency of 600 Hz. Then the audio fingerprint formed by P5 and P61 is hash([f5:f61:delta_t]):t5 = hash([500:600:170-150]):150. Suppose that P62 appears in music r4 at 195 ms with a frequency of 700 Hz. Then the audio fingerprint formed by P5 and P62 is hash([f5:f62:delta_t]):t5 = hash([500:700:195-150]):150.

r4.mp3 may have other fingerprints as well; assuming a total of 1200 fingerprints, the list of fingerprints is:

Step two:

The inverted index is looked up.

For each fingerprint of r4.mp3, the previously generated inverted index is looked up. For example, for the m-th fingerprint of r4.mp3, 6781d3bdbfcb20b73e34e0e5:150, take the 24-digit hash value 6781d3bdbfcb20b73e34e0e5; the corresponding line found in the inverted index is:

Fingerprint (6781d3bdbfcb20b73e34e0e5): (r1.mp3,100) | (r2.mp3,200) | ……

This involves two pieces of audio, r1.mp3 and r2.mp3, with time differences of 150-100 = 50 and 150-200 = -50 respectively. The intermediate results table therefore records 50 for r1.mp3 and -50 for r2.mp3.

Similarly, for the fingerprint 715ff0eaccbd75decb48aa80:150 of r4.mp3, take the 24-digit hash value 715ff0eaccbd75decb48aa80; the corresponding line found in the inverted index is:

Fingerprint (715ff0eaccbd75decb48aa80): (r1.mp3,100) | ……

This involves one piece of audio, r1.mp3, with a time difference of 150-100 = 50. One more record of 50 is added for r1.mp3 in the intermediate results table.

At this point there are two records of 50 for r1.mp3 and one record of -50 for r2.mp3.

Step three: the similarity is calculated.

Now assume that no other audio fingerprint of r4.mp3 matches any fingerprint in the library. Then the number of fingerprints of r4.mp3 matching r1.mp3 is 2, the number matching r2.mp3 is 1, and the number matching r3.mp3 is 0; that is, the process of step two finds that the number of identical time differences (50) for r1.mp3 is 2, and that there is only one time difference of -50 for r2.mp3.

Then the similarity between r1.mp3 and r4.mp3 is similarity = N × N/N1/N2 = 2 × 2/1000/1200.

The similarity between r2.mp3 and r4.mp3 is similarity = N × N/N1/N2 = 1 × 1/2000/1200.

The similarity between r3.mp3 and r4.mp3 is similarity = N × N/N1/N2 = 0 × 0/3000/1200.
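The three similarity values can be evaluated numerically with a small sketch. The function name `similarity` and the dictionary of results are introduced only for illustration; the formula N × N/N1/N2 and the counts are those of the example above.

```python
def similarity(n: int, n1: int, n2: int) -> float:
    """Similarity N x N / N1 / N2 from the matching count and fingerprint counts."""
    return n * n / n1 / n2

# (matching count N, fingerprints in the sample file, fingerprints in r4.mp3)
scores = {
    "r1.mp3": similarity(2, 1000, 1200),
    "r2.mp3": similarity(1, 2000, 1200),
    "r3.mp3": similarity(0, 3000, 1200),
}

for name, score in scores.items():
    print(name, score)
```

Squaring N while dividing by both fingerprint counts means the score grows quickly with the number of aligned matches and shrinks for long files that match only by chance.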

In this case, sorted in descending order of the number of identical time differences, the files are: r1, r2, r3. The first audio file in the sorted list (i.e., the top-ranked r1) is taken as the similar audio file of the audio file under test (r4.mp3). Of course, the rule for selecting similar audio files is not limited to this; for example, when the sample audio files are r1, r2, r3, r5, r6 and r7, and their order by number of identical time differences is r1, r2, r3, r5, r6, r7, the first three audio files (r1, r2, r3) can be taken as similar audio files.

It should be noted that, in an optional embodiment of the present application, for the calculation of the similarity, only the similarity of the selected similar audio files may be calculated, or the similarities of all the audio files compared with the audio file to be tested may be calculated, which is specifically determined according to the actual situation.

As can also be seen from the above description, because time differences are taken into account when searching for similar audio, similar audio files can still be identified even when the audio to be tested has been modified (including by cutting or adding content).

Step S206, obtaining the similarity between the audio file to be tested and the similar audio file, and determining whether the audio file to be tested is audio of the specified type according to the similarity. Optionally, there are various ways to make this determination. For example, it may be made directly from a pre-established correspondence between value ranges of the similarity and audio types: the value range into which the similarity falls is determined first, and the audio type corresponding to that range is then taken as the type of the audio file to be tested. In an optional embodiment of the present application, the determination may also be made by comparing the similarity with a preset threshold, specifically: judging whether the similarity is greater than the preset threshold, and determining that the audio file to be tested is audio of the specified type when it is. Optionally, the specified type of audio may be illegal audio present in the network.

For the above similarity, the similarity may be determined by using a technique in the related art, for example, may be implemented by using related calculation software of the similarity, and in an alternative embodiment of the present application, the similarity may be determined by the following manner, but is not limited to this:

acquiring the similarity between the audio file to be tested and each of the M audio files according to the number N1 of the fingerprints in the audio file to be tested, the number N2 of the fingerprints in each of the M audio files and the number N of the characteristic point pairs, wherein N1, N2 and N are natural numbers; the characteristic point pairs are obtained through the following method: combining the anchor point corresponding to the local maximum value of each frame of each audio file in the M audio files on the spectrogram and the anchor point in the preset matrix target area in pairs, and taking each combination as one characteristic point pair; the anchor points are sampling points corresponding to local maximum values in each frame spectrogram. It should be noted that, the values of M, N, N1 and N2 can be predetermined according to experimental data or experience, but are not limited thereto.

For the spectrogram, the time-frequency conversion can be performed by a Fast Fourier Transform (FFT). The Fourier transform is a tool for transforming a signal from the time domain (audio signal) to the frequency domain (spectrogram); in essence it decomposes the original signal into a superposition of sinusoids of different frequencies. The Fourier transform cannot characterize the local characteristics of a time-domain signal, and it handles non-stationary signals poorly. To overcome this limitation, the signal is divided into many short time intervals using a window function, and the signal is assumed to be stationary within each interval. For example, the window may span about 92 ms, during which the audio signal can be considered stationary. 4096 sampling points are taken as one frame of data and transformed by FFT for time-frequency calculation; the window then slides forward by 2048 points for the next calculation, and so on. That is, the frame used for spectrogram calculation is [2048k, 2048k+4095], where k = 0, 1, …, n. At a sampling rate of 44100 Hz, this corresponds to computing, about every 46 ms, the spectrogram of a piece of audio about 92 ms long. It can also be seen that the sampling points corresponding to each frame of the audio file can be determined as [2048k, 2048k+4095], where k is a natural number.
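The framing arithmetic above can be verified with a small sketch. The constants come from the description; the helper names `frame_bounds` and `frame_count`, and the choice to drop a trailing partial frame, are assumptions made for illustration.

```python
FRAME_SIZE = 4096     # samples per frame, as stated in the description
HOP_SIZE = 2048       # samples the window slides between frames
SAMPLE_RATE = 44100   # Hz, the normalized sampling rate


def frame_bounds(k):
    """Inclusive sample range [2048k, 2048k + 4095] of frame k."""
    start = HOP_SIZE * k
    return start, start + FRAME_SIZE - 1


def frame_count(num_samples):
    """Number of complete frames in a signal (partial last frame dropped,
    which is an assumption; the description does not cover this case)."""
    if num_samples < FRAME_SIZE:
        return 0
    return (num_samples - FRAME_SIZE) // HOP_SIZE + 1


frame_ms = 1000 * FRAME_SIZE / SAMPLE_RATE  # duration of one frame
hop_ms = 1000 * HOP_SIZE / SAMPLE_RATE      # spacing between frame starts
print(frame_bounds(0), frame_bounds(1))     # (0, 4095) (2048, 6143)
print(round(frame_ms, 1), round(hop_ms, 1)) # 92.9 46.4
```

Consecutive frames overlap by half, which is why each 92 ms window is evaluated roughly every 46 ms.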

The fingerprint number N2 of each of the M audio files can be obtained from the forward document, whose structure is shown in fig. 4: each record in the forward document includes an audio file id and the list of fingerprints of that audio file. For example, the record corresponding to id1 includes the identifier of the audio file (id1) and all the fingerprints in that audio file (fingerprint 1, fingerprint 2, fingerprint 3, ……). Based on the forward document, the number of fingerprints of a sample audio file can be determined rapidly, which improves detection efficiency.

Wherein, the local maximum is obtained by the following method:

dividing the spectrogram of each frame into a plurality of regions at intervals of A sampling points on the X axis and B sampling points on the Y axis of a two-dimensional coordinate system, and taking the sampling point with the maximum energy value in each region as a local maximum candidate point; and arranging the local maximum candidate points of all the divided regions in descending order, and taking a specified number of top-ranked local maximum candidate points as the sampling points corresponding to the local maxima, wherein A and B are natural numbers.
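The selection of local maxima can be sketched as below. This is an illustrative reading of the method, assuming the spectrogram is already available as a list of (x, y, energy) points; the function name `pick_anchors` and the sample values are hypothetical.

```python
def pick_anchors(points, a, b, top_n):
    """points: iterable of (x, y, energy) spectrogram samples.
    Keep the strongest point in each A-by-B region, then return the
    top_n strongest candidates in descending order of energy."""
    best_in_region = {}
    for x, y, energy in points:
        region = (x // a, y // b)
        current = best_in_region.get(region)
        if current is None or energy > current[2]:
            best_in_region[region] = (x, y, energy)
    candidates = sorted(best_in_region.values(), key=lambda p: p[2], reverse=True)
    return candidates[:top_n]


# Hypothetical sample points: (x, y, energy).
points = [(3, 120, 0.9), (10, 400, 0.4), (70, 1500, 0.7), (90, 1800, 0.2), (130, 300, 0.5)]
anchors = pick_anchors(points, a=64, b=1000, top_n=2)
print(anchors)  # [(3, 120, 0.9), (70, 1500, 0.7)]
```

Taking one candidate per region before ranking spreads the selected points across the spectrogram instead of letting a single loud region supply all of them.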

As shown in fig. 6, the number of local maxima on each frame spectrogram is first set to 10. The spectrogram is divided into cells every 64 sampling points on the horizontal axis and every 1000 Hz on the vertical axis, and the point with the largest energy in each cell is taken as a local maximum candidate point. All the candidate points are arranged in descending order of energy, and the top 10 points are taken as the local maxima of the spectrogram, which are also called anchor points (anchors). As shown in fig. 7, for each anchor point there is a rectangular target zone on its right side (i.e., the preset target area mentioned above), 128 sampling points wide on the horizontal axis and 1000 Hz high on the vertical axis, whose left edge is 32 sampling points away from the anchor point. The anchor point is then paired with every anchor point in the target zone, and one fingerprint is generated from each pair. As shown in fig. 5, each fingerprint is composed of three parts: the frequencies of the two anchor points, their time difference, and the time of appearance in the audio (which can be represented by a number of sampling points). Of course, these parts can also be hashed and the result used as the fingerprint, that is: hash([f1:f2:delta_t]):t1, where f1 is the frequency of the anchor point, f2 is the frequency of the local maximum paired with the anchor point, delta_t (i.e., Δt) is the time difference between the two sampling points, t1 is the time of the anchor point, and hash([f1:f2:delta_t]) has a length of 24 bits.
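A possible sketch of the pairing and hashing step follows. Several points are assumptions, since the description does not specify them: the hash function (MD5 truncated to 24 hexadecimal characters, chosen to resemble the strings in the worked example), the target zone being centered vertically on the anchor frequency, and the names `make_fingerprint` and `fingerprints_for`.

```python
import hashlib


def make_fingerprint(f1, f2, delta_t, t1):
    """Return 'hash([f1:f2:delta_t]):t1'.
    Assumption: truncated MD5 stands in for the unspecified hash."""
    key = f"{f1}:{f2}:{delta_t}".encode("ascii")
    digest = hashlib.md5(key).hexdigest()[:24]
    return f"{digest}:{t1}"


def fingerprints_for(anchors, gap=32, width=128, height_hz=1000):
    """anchors: list of (time_in_samples, frequency_hz).
    Pair each anchor with every anchor inside its target zone, which starts
    `gap` samples to the right and spans `width` samples and `height_hz` Hz
    (assumed to be centered on the anchor frequency)."""
    prints = []
    for t1, f1 in anchors:
        for t2, f2 in anchors:
            delta_t = t2 - t1
            in_time = gap <= delta_t < gap + width
            in_freq = abs(f2 - f1) <= height_hz / 2
            if in_time and in_freq:
                prints.append(make_fingerprint(f1, f2, delta_t, t1))
    return prints


# Hypothetical anchors: only (0, 1000) -> (40, 1200) falls inside a target zone.
anchors = [(0, 1000), (40, 1200), (200, 1000), (50, 3000)]
prints = fingerprints_for(anchors)
print(len(prints))
```

Because the hash covers only the two frequencies and the time difference, the same pair of peaks produces the same hash wherever it occurs in a file; the absolute time t1 is carried alongside it for the later time-difference comparison.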

Optionally, the above similarity may be determined by the following formula, but is not limited thereto: P = N × N/N1/N2, where P represents the above similarity.

In order to better understand the above-mentioned detection process of audio files of the specified type, the following detailed description is given with reference to specific embodiments. Fig. 8 is a schematic diagram illustrating an alternative illegal audio detection flow according to an embodiment of the application. As shown in fig. 8, the detection process includes:

Step S802, inputting audio and building an audio index in advance (including forward documents and inverted documents, whose specific structure is described in the above embodiment); that is, collecting the audio defined as illegal by the service, converting its encoding to mp3, and normalizing the sampling rate to 44100 Hz.

The process of creating the index is shown in fig. 9:

Step S902, unifying the audio format and sampling rate of the illegal audio. Specifically, 4096 sampling points are taken as one frame and transformed by FFT for time-frequency calculation; the window then slides forward by 2048 points for the next calculation, and so on. That is, the frame used for spectrogram calculation is [2048k, 2048k+4095], where k = 0, 1, …, n. This corresponds to computing, about every 46 ms, the spectrogram of a piece of audio about 92 ms long.

In the spectrogram, a pre-selected number of local maxima is calculated for each frame spectrogram, as shown in fig. 6.

Step S904, extracting an audio fingerprint of the audio file;

Step S906, creating an index (i.e., a forward document and an inverted document) using the audio fingerprints. Fingerprint-based audio retrieval works much like a search engine and likewise builds an inverted index structure. Each fingerprint is the combination of a 24-bit-long string and a timestamp. A large array is allocated to store all fingerprints, and each fingerprint then points to the inverted list of audio files corresponding to that fingerprint, as shown in fig. 3.

The forward document has a simple structure, and each record has the format: audio file id - list of fingerprints of this audio, as shown in fig. 4.
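The two index structures can be sketched together as follows. This is a minimal in-memory illustration under assumed names (`build_index`, `library`); dictionaries stand in for the array and lists mentioned in the description, and the short hashes are placeholders.

```python
from collections import defaultdict


def build_index(library):
    """library: {audio id: [(fingerprint hash, time), ...]}.
    Returns (inverted, forward):
      inverted: hash -> [(audio id, time at which the hash occurs)]
      forward:  audio id -> list of that audio's fingerprints"""
    inverted = defaultdict(list)
    forward = {}
    for audio_id, prints in library.items():
        forward[audio_id] = list(prints)
        for fp_hash, t in prints:
            inverted[fp_hash].append((audio_id, t))
    return dict(inverted), forward


# Hypothetical library with placeholder hashes "aa" and "bb".
library = {
    "r1.mp3": [("aa", 100), ("bb", 100)],
    "r2.mp3": [("aa", 200)],
}
inverted, forward = build_index(library)
print(inverted["aa"])          # [('r1.mp3', 100), ('r2.mp3', 200)]
print(len(forward["r1.mp3"]))  # 2, the fingerprint count N2 for r1.mp3
```

The inverted structure answers "which files contain this fingerprint, and where", while the forward structure gives the per-file fingerprint count needed by the similarity formula without scanning the inverted lists.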

Step S804, extracting the audio fingerprint of the audio file to be detected;

Step S806, searching for similar audio files in the index according to the audio fingerprints, and calculating the similarity. Specifically, the method comprises the following steps:

searching the inverted index table for each extracted fingerprint to obtain the inverted list corresponding to the fingerprint, and writing the audio appearing in the inverted list into the intermediate result table shown in fig. 10;

subtracting the time corresponding to each audio in the inverted list (i.e. the time when the audio fingerprint of the audio file to be tested appears in the audio file in the inverted list) from the time corresponding to the extracted fingerprint (i.e. the time when the audio fingerprint of the audio file to be tested appears in the audio file to be tested), and storing the time difference in the intermediate result table shown in fig. 10;

sorting the time differences of each audio in the intermediate result table shown in fig. 10;

counting the number of identical time differences in each audio, and returning the 10 top-ranked audios;

inputting the fingerprint numbers N1 and N2 of the two audios and the number N of matched feature point pairs into the similarity measurement formula P = N × N/N1/N2 to obtain the similarity (i.e., the similarity probability) of the two audio files.
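The counting and ranking sub-steps can be sketched as below. One interpretation is assumed here: the score of each audio is the count of its most frequent time difference. The function name `rank_candidates` and the sample table are hypothetical.

```python
from collections import Counter


def rank_candidates(intermediate, top_k=10):
    """intermediate: {audio id: [time differences]}.
    Score each audio by how often its most common time difference occurs
    (an assumed reading of 'the number of identical time differences'),
    then return the top_k (audio id, score) pairs in descending order."""
    scores = {}
    for audio_id, diffs in intermediate.items():
        if diffs:
            scores[audio_id] = Counter(diffs).most_common(1)[0][1]
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical intermediate result table.
table = {"r1.mp3": [50, 50, 7], "r2.mp3": [-50], "r3.mp3": []}
print(rank_candidates(table, top_k=2))  # [('r1.mp3', 2), ('r2.mp3', 1)]
```

A stray match such as the difference 7 above does not raise the score of r1.mp3, since only matches that agree on the same offset count toward it.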

Step S808, determining whether the similarity exceeds a preset threshold (for example, 0.08, which can be flexibly set according to actual conditions), if so, turning to step S810, otherwise, turning to step S812;

Step S810, manually judging whether the audio is illegal; if so, going to step S814, otherwise going to step S812;

Step S812, determining that the audio file to be tested is non-illegal audio;

Step S814, determining that the audio file to be tested is illegal audio, and prompting the user to delete the illegal audio.
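The decision flow of steps S808 to S814 can be sketched as a small function. The name `classify`, the string labels, and modeling the manual check as a callback are assumptions made for illustration; the default threshold 0.08 is the example value given in step S808.

```python
def classify(similarity, manual_review, threshold=0.08):
    """Follow steps S808 to S814.
    manual_review: callable returning True when a human reviewer confirms
    that the audio is illegal (hypothetical stand-in for step S810)."""
    if similarity <= threshold:   # S808: threshold not exceeded -> S812
        return "non-illegal"
    if manual_review():           # S810: reviewer confirms -> S814
        return "illegal"
    return "non-illegal"          # S810: reviewer rejects -> S812


print(classify(0.05, lambda: True))   # non-illegal
print(classify(0.20, lambda: True))   # illegal
print(classify(0.20, lambda: False))  # non-illegal
```

Passing the review as a callable means the manual check is only invoked for files that exceed the threshold, mirroring the order of the steps.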

In summary, in the embodiment of the present application, audio files similar to the audio file to be tested are searched for according to its audio fingerprints, and when the similarity between the audio file to be tested and a similar audio file is greater than a preset threshold, the audio file to be tested is determined to be an audio file of the specified type. The audio file to be tested can thus be detected accurately, and the miss rate of audio detection is reduced. Moreover, because similar audio files are searched for in the newly designed inverted documents, the scheme provided by the embodiment of the application is suitable for searching a large-scale sample set and has a wide application range. In addition, because the audio fingerprints of the whole audio file to be tested are examined, more clues are available for tracing the source of illegal audio, which solves the technical problems that the miss rate is high and the source of illegal audio cannot be traced effectively.

It should be noted that the audio detection method in the embodiment of the present application may be applicable to detection of an illegal audio at the user side, for example, detecting whether a local audio at the user side belongs to the illegal audio. If so, prompting the user to delete the audio, thereby purifying the internet environment and avoiding illegal audio transmission.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

According to an embodiment of the present application, there is also provided an apparatus for implementing the audio detection method, as shown in fig. 11, the apparatus includes:

an obtaining module 110, configured to acquire an audio fingerprint of an audio file to be tested;

optionally, the obtaining module 110 is further configured to obtain a similarity between the audio file to be tested and each of the M audio files according to the number N1 of fingerprints in the audio file to be tested, the number N2 of fingerprints in each of the M audio files, and the number N of pairs of feature points, where N1, N2, and N are all natural numbers; the characteristic point pairs are obtained through the following method: combining the anchor point corresponding to the local maximum value of each frame of each audio file in the M audio files on the spectrogram and the anchor point in the preset matrix target area in pairs, and taking each combination as one characteristic point pair; the anchor points are sampling points corresponding to local maximum values in each frame spectrogram.

The local maximum may be obtained by, but is not limited to, the following method: dividing the spectrogram of each frame into a plurality of regions at intervals of A sampling points on the X axis and B sampling points on the Y axis of a two-dimensional coordinate system, and taking the sampling point with the maximum energy value in each region as a local maximum candidate point; and arranging the local maximum candidate points of all the divided regions in descending order, and taking a specified number of top-ranked local maximum candidate points as the sampling points corresponding to the local maxima, wherein A and B are natural numbers.

A query module 112, connected to the obtaining module 110, configured to search, for each audio fingerprint of the audio file to be tested, a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier and a position where a sample fingerprint appears in a sample audio file, wherein the sample audio file is an audio file indicated by the sample audio file identifier. Optionally, the inverted list may be generated as follows: an audio fingerprint collection is allocated to store all sample fingerprints (e.g., the collection may take the form of an array, but is not limited to this), and each sample fingerprint then points to a list of all audio files in which the sample fingerprint occurs, the list storing the audio file identifier (id) and the position in the sample audio file where the sample fingerprint occurs.

Optionally, the query module 112 is further configured to, for each audio fingerprint of the audio file to be tested, find, from the inverted list corresponding to that audio fingerprint, a time set in which the audio fingerprint appears, where the time set includes: the times indicated by the positions at which the audio fingerprint of the audio file to be tested appears in all the sample audio files; for each audio fingerprint of the audio file to be tested, take the time at which the audio fingerprint appears in the audio file to be tested as a reference time, compute the difference between the reference time and each time in the time set, and generate intermediate results from the obtained time differences, where each intermediate result consists of an audio file identifier and the time difference corresponding to that audio file identifier; and, for each sample audio file in the inverted list, count, according to the intermediate results, the number of identical time differences in the sample audio file indicated by the audio file identifier, sort the sample audio files in the inverted list in descending order of that number to obtain the top M sample audio files, and take the top M sample audio files as the similar audio files, wherein M is a natural number.

The identifying module 114 is connected to the querying module 112, and configured to obtain a similarity between the audio file to be tested and the similar audio file, and determine that the audio file to be tested is an audio of a specified type when the similarity is greater than a preset threshold.

Optionally, based on N, N1 and N2 obtained by the obtaining module 110, the similarity may be determined by the following formula: P = N × N/N1/N2, where P represents the above similarity.

In an alternative embodiment, the audio fingerprint is composed of the following three parts: the time difference and frequency of the two anchor points in the above-mentioned pairs of characteristic points, and the time of occurrence in the audio file.

It should be noted that the above modules may be implemented in software or hardware. For the latter, the implementation may be, but is not limited to, the following: the obtaining module 110, the query module 112, and the identifying module 114 are located in the same processor; or the obtaining module 110, the query module 112, and the identifying module 114 are located in a first processor, a second processor, and a third processor, respectively; or the obtaining module 110, the query module 112, and the identifying module 114 are located in different processors in any other combination.

Example 3

The embodiment of the application can provide a computer terminal, and the computer terminal can be any one computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.

Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.

In this embodiment, the computer terminal may execute program code of the following steps in the audio detection method: acquiring an audio fingerprint of the audio file to be tested; for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list according to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier and a position at which a sample fingerprint in the audio fingerprint set appears in the sample audio file corresponding to the sample audio file identifier, and each sample fingerprint in the audio fingerprint set points to a record in the inverted list; and acquiring the similarity between the audio file to be tested and the similar audio file, and determining that the audio to be tested is audio of the specified type when the similarity is greater than a preset threshold.

Optionally, fig. 12 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 12, the computer terminal a may include: one or more processors 121 (only one shown), a memory 123, and a transmission device 125 connected to the web server.

The memory 123 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the audio detection method and apparatus in the embodiment of the present application. The processor 121 executes various functional applications and data processing by running the software programs and modules stored in the memory 123, that is, implements the above-mentioned audio detection method. The memory 123 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 123 may further include memory located remotely from the processor 121, which may be connected to terminal A via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 125 is used for receiving or transmitting data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 125 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices to communicate with the internet or a local area Network. In one example, the transmission device 125 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

Specifically, the memory 123 is used for storing preset action conditions, information of a preset authorized user, and an application program.

The processor 121 may call the information and application stored in the memory 123 through the transmission device to perform the following steps: acquiring an audio fingerprint of the audio file to be tested; for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list according to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier and a position where a sample fingerprint appears in a sample audio file, wherein the sample audio file is an audio file indicated by the sample audio file identifier; and acquiring the similarity between the audio file to be tested and the similar audio file, and determining that the audio to be tested is audio of the specified type when the similarity is greater than a preset threshold.

Optionally, the processor 121 may further execute program code of the following steps: for each audio fingerprint of the audio file to be tested, finding, from the inverted list corresponding to that audio fingerprint, a time set in which the audio fingerprint appears, where the time set includes: the times indicated by the positions at which the audio fingerprint of the audio file to be tested appears in all the sample audio files; for each audio fingerprint of the audio file to be tested, taking the time at which the audio fingerprint appears in the audio file to be tested as a reference time, computing the difference between the reference time and each time in the time set, and generating intermediate results from the obtained time differences, where each intermediate result consists of an audio file identifier and the time difference corresponding to that audio file identifier; and, for each sample audio file in the inverted list, counting, according to the intermediate results, the number of identical time differences in the sample audio file indicated by the audio file identifier, sorting the sample audio files in the inverted list in descending order of that number to obtain the top M sample audio files, and taking the top M sample audio files as the similar audio files, wherein M is a natural number.

Optionally, the processor 121 may further execute program codes of the following steps: acquiring the similarity between the audio file to be tested and each of the M audio files according to the number N1 of the fingerprints in the audio file to be tested, the number N2 of the fingerprints in each of the M audio files and the number N of the characteristic point pairs, wherein N1, N2 and N are natural numbers; the characteristic point pairs are obtained through the following method: combining the anchor point corresponding to the local maximum value of each frame of each audio file in the M audio files on the spectrogram and the anchor point in the preset matrix target area in pairs, and taking each combination as one characteristic point pair; the anchor points are sampling points corresponding to local maximum values in each frame spectrogram.

Optionally, the processor 121 may further execute program codes of the following steps: and converting the audio format of the audio file to be tested into a specified format and normalizing the audio sampling rate of the audio file to be tested.

It can be understood by those skilled in the art that the structure shown in fig. 12 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 12 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 12, or have a different configuration from that shown in fig. 12.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

Example 4

Embodiments of the present application also provide a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the audio detection method provided in the first embodiment.

Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring an audio fingerprint of the audio file to be tested; for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier and a position where a sample fingerprint appears in a sample audio file, wherein the sample audio file is an audio file indicated by the sample audio file identifier; and acquiring the similarity between the audio file to be tested and the similar audio file, and determining that the audio to be tested is the audio of the designated type under the condition that the similarity is greater than a preset threshold value.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: for each audio fingerprint of the audio file to be tested, finding, from the inverted list corresponding to that audio fingerprint, a time set in which the audio fingerprint appears, where the time set includes: the times indicated by the positions at which the audio fingerprint of the audio file to be tested appears in all the sample audio files; for each audio fingerprint of the audio file to be tested, taking the time at which the audio fingerprint appears in the audio file to be tested as a reference time, computing the difference between the reference time and each time in the time set, and generating intermediate results from the obtained time differences, where each intermediate result consists of an audio file identifier and the time difference corresponding to that audio file identifier; and, for each sample audio file in the inverted list, counting, according to the intermediate results, the number of identical time differences in the sample audio file indicated by the audio file identifier, sorting the sample audio files in the inverted list in descending order of that number to obtain the first M sample audio files, and taking the first M sample audio files as the similar audio files, wherein M is a natural number.

Here, any one of the computer terminal groups may establish a communication relationship with the web server.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims

1. An audio detection method, comprising:

acquiring an audio fingerprint of an audio file to be tested;

for each audio fingerprint of the audio file to be tested, searching a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier, and a location where a sample fingerprint occurs in a sample audio file, the sample audio file being an audio file indicated by the sample audio file identifier;

and determining the similarity between the audio file to be tested and the similar audio file according to the number of the fingerprints and the number of the characteristic point pairs of each of the audio file to be tested and the similar audio file, and determining whether the audio file to be tested is the audio of the specified type according to the similarity.

2. The method of claim 1, wherein searching a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint comprises:

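for each audio fingerprint of the audio file to be tested, looking up, in the inverted list corresponding to that audio fingerprint, a time set of the audio fingerprint, wherein the time set comprises: the times indicated by the positions at which the audio fingerprint appears in all the sample audio files;

for each audio fingerprint of the audio file to be tested, taking the time at which the audio fingerprint appears in the audio file to be tested as a reference time, computing the difference between the reference time and each element in the time set, and generating an intermediate result from the obtained time difference; wherein the intermediate result consists of the sample audio file identifier and the time difference corresponding to the sample audio file identifier;

and for each sample audio file identifier in the inverted list, counting, according to the intermediate result, the number of identical time differences in the sample audio file indicated by the sample audio file identifier, sorting the sample audio files in the inverted list in descending order of that number to obtain the first M sample audio files, and taking the first M sample audio files as the similar audio files, wherein M is a natural number.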

3. The method of claim 2, wherein obtaining the similarity between the audio file to be tested and the similar audio file comprises:

acquiring the similarity between the audio file to be tested and each audio file in the first M sample audio files according to the number N1 of the fingerprints in the audio file to be tested, the number N2 of the fingerprints in each audio file in the first M sample audio files and the number N of the characteristic point pairs, wherein N1, N2 and N are all natural numbers;

the characteristic point pairs are obtained as follows: pairing the anchor point corresponding to the local maximum on the spectrogram of each frame of each audio file in the first M sample audio files with each anchor point in a preset matrix target area, and taking each combination as one characteristic point pair; wherein the anchor points are the sampling points corresponding to the local maxima in the spectrogram of each frame.

4. The method of claim 3, wherein the local maxima are obtained by:

dividing the spectrogram of each frame, in a two-dimensional coordinate system, into regions of A sampling points along the X axis and B sampling points along the Y axis, and taking the sampling point with the maximum energy value in each region as a local maximum candidate point; and arranging the local maximum candidate points of all the divided regions in descending order, and taking a specified number of the top-ranked local maximum candidate points as the sampling points corresponding to the local maxima, wherein A and B are natural numbers.

5. The method of claim 3, wherein the similarity is determined by the following formula:

P = N × N / (N1 × N2), where P represents the similarity.

6. The method of claim 3, wherein the audio fingerprint is comprised of three parts: the time difference and frequency of the two anchor points in the feature point pair, and the time of occurrence in the audio file.

7. The method according to any one of claims 1 to 6, wherein before acquiring the audio fingerprint of the audio file under test, the method further comprises:

and converting the audio format of the audio file to be tested into a specified format and carrying out normalization processing on the audio sampling rate of the audio file to be tested.

8. The method according to any one of claims 1 to 6, wherein determining whether the audio file to be tested is a specified type of audio according to the similarity comprises:

and when the similarity is greater than a preset threshold value, determining the audio file to be tested as the audio of the specified type.

9. An audio detection apparatus, comprising:

an acquisition module, configured to acquire an audio fingerprint of an audio file to be tested;

a query module, configured to search, for each audio fingerprint of the audio file to be tested, a similar audio file of the audio file to be tested from the inverted list corresponding to the audio fingerprint; wherein each record in the inverted list comprises: a sample audio file identifier, and a location where a sample fingerprint occurs in a sample audio file, the sample audio file being an audio file indicated by the sample audio file identifier;

and an identification module, configured to determine the similarity between the audio file to be tested and the similar audio file according to the number of fingerprints of each of the audio file to be tested and the similar audio file and the number of characteristic point pairs, and to determine whether the audio file to be tested is audio of the specified type according to the similarity.

10. The apparatus according to claim 9, wherein the query module is further configured to: for each audio fingerprint of the audio file to be tested, look up, in the inverted list corresponding to that audio fingerprint, a time set of the audio fingerprint, wherein the time set comprises: the times indicated by the positions at which the audio fingerprint appears in all the sample audio files; for each audio fingerprint of the audio file to be tested, take the time at which the audio fingerprint appears in the audio file to be tested as a reference time, compute the difference between the reference time and each element in the time set, and generate an intermediate result from the obtained time difference; wherein the intermediate result consists of the sample audio file identifier and the time difference corresponding to the sample audio file identifier; and for each sample audio file identifier in the inverted list, count, according to the intermediate result, the number of identical time differences in the sample audio file indicated by the sample audio file identifier, sort the sample audio files in the inverted list in descending order of that number to obtain the first M sample audio files, and take the first M sample audio files as the similar audio files, wherein M is a natural number.

11. The apparatus according to claim 10, wherein the acquisition module is further configured to obtain a similarity between the audio file to be tested and each of the first M sample audio files according to a number N1 of fingerprints in the audio file to be tested, a number N2 of fingerprints in each of the first M sample audio files, and a number N of characteristic point pairs, where N1, N2, and N are all natural numbers; the characteristic point pairs are obtained as follows: pairing the anchor point corresponding to the local maximum on the spectrogram of each frame of each audio file in the first M sample audio files with each anchor point in a preset matrix target area, and taking each combination as one characteristic point pair; wherein the anchor points are the sampling points corresponding to the local maxima in the spectrogram of each frame.

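12. The apparatus of claim 11, wherein the local maxima are obtained by:

dividing the spectrogram of each frame, in a two-dimensional coordinate system, into regions of A sampling points along the X axis and B sampling points along the Y axis, and taking the sampling point with the maximum energy value in each region as a local maximum candidate point; and arranging the local maximum candidate points of all the divided regions in descending order, and taking a specified number of the top-ranked local maximum candidate points as the sampling points corresponding to the local maxima, wherein A and B are natural numbers.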

13. The apparatus of claim 11, wherein the similarity is determined by the following equation:

P = N × N / (N1 × N2), where P represents the similarity.

14. The apparatus of claim 11, wherein the audio fingerprint is comprised of three parts: the time difference and frequency of the two anchor points in the feature point pair, and the time of occurrence in the audio file.

15. The apparatus according to any one of claims 9 to 14, wherein the identification module is further configured to determine that the audio file to be tested is audio of the specified type when the similarity is greater than a preset threshold.
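
By way of illustration only, and without limiting the claims, the region-based selection of local maxima, the similarity formula and the threshold comparison recited above can be sketched in Python as follows. The function names `pick_local_maxima`, `similarity` and `is_specified_type`, the layout of the spectrogram as a list of rows indexed `spectrogram[y][x]`, and the ordering of candidates by energy value are assumptions made for this sketch; the claims do not prescribe any of them.

```python
def pick_local_maxima(spectrogram, a, b, keep):
    """Pick the strongest region maxima from a 2-D energy grid.

    spectrogram: list of rows, spectrogram[y][x] is an energy value (assumed layout).
    a, b: region size in sampling points along the X and Y axes.
    keep: the specified number of top-ranked candidates to retain.
    """
    height = len(spectrogram)
    width = len(spectrogram[0]) if height else 0
    candidates = []
    for y0 in range(0, height, b):
        for x0 in range(0, width, a):
            # Candidate point: the sample with maximum energy inside this region.
            candidates.append(max(
                (spectrogram[y][x], x, y)
                for y in range(y0, min(y0 + b, height))
                for x in range(x0, min(x0 + a, width))
            ))
    # Descending order (assumed to be by energy), then keep the top candidates.
    candidates.sort(reverse=True)
    return [(x, y) for _, x, y in candidates[:keep]]


def similarity(n_pairs, n1, n2):
    """P = N * N / (N1 * N2), with N characteristic point pairs."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("fingerprint counts N1 and N2 must be positive")
    return n_pairs * n_pairs / n1 / n2


def is_specified_type(n_pairs, n1, n2, threshold):
    """True when the similarity is greater than the preset threshold."""
    return similarity(n_pairs, n1, n2) > threshold


# Hypothetical values: 50 matched pairs, 100 fingerprints in each file.
print(similarity(50, 100, 100))
print(is_specified_type(50, 100, 100, threshold=0.2))
```

Because N appears squared in the numerator and both fingerprint counts appear in the denominator, the score in this sketch is normalised against the size of both files, so a long sample audio file does not score highly merely because it contains many fingerprints.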