WO2022161291A1

WO2022161291A1 - Audio search method and apparatus, computer device, and storage medium

Info

Publication number: WO2022161291A1
Application number: PCT/CN2022/073291
Authority: WO
Inventors: 吕镇光
Original assignee: 百果园技术(新加坡)有限公司
Priority date: 2021-01-28
Filing date: 2022-01-21
Publication date: 2022-08-04
Also published as: CN112784098A

Abstract

Embodiments of the present application provide an audio search method and apparatus, a computer device, and a storage medium. The method comprises: determining first audio data and a plurality of pieces of second audio data; calculating a first hash feature for the first audio data and calculating second hash features for the plurality of pieces of second audio data, respectively; determining a sequence of arrangement of the plurality of pieces of second audio data according to densities of the plurality of second hash features; and comparing the first hash feature with the plurality of second hash features according to the sequence to search for second audio data the same as or similar to the first audio data.

Description

Audio search method and apparatus, computer device, and storage medium

This application claims the priority of the Chinese Patent Application No. 202110119351.4 filed with the China Patent Office on January 28, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The embodiments of the present application relate to the technical field of audio processing, for example, to an audio search method, apparatus, computer device, and storage medium.

Background

With the rapid development of the Internet, and especially the widespread adoption of mobile terminals, users can easily create multimedia data, for example by making short videos, humming songs, or making recordings. As a result, multimedia data on the Internet is growing rapidly, and audio data is growing rapidly along with it.

In business scenarios such as song search and voice content review, the audio data is compared to determine whether the audio data is the same or similar.

Due to the large amount of audio data, the audio data is usually sorted by a queuing system, and then the audio data is compared in order.

A queuing system usually uses the baseline method, that is, the audio data is sorted without any specific reference standard and is compared one by one. Although the accuracy is high, this occupies many resources and takes a long time, resulting in low overall efficiency.

Summary

The embodiments of the present application propose an audio search method and apparatus, a computer device, and a storage medium, so as to improve the efficiency of comparing audio data while maintaining the accuracy of the comparison.

In a first aspect, an embodiment of the present application provides an audio search method, including:

determining first audio data and a plurality of pieces of second audio data;

calculating a first hash feature for the first audio data, and calculating second hash features for the plurality of pieces of second audio data, respectively;

determining an order of arrangement of the plurality of pieces of second audio data according to densities of the plurality of second hash features;

comparing the first hash feature with the plurality of second hash features in the order, to find second audio data that is the same as or similar to the first audio data.

In a second aspect, the embodiment of the present application also provides an audio search method, including:

receiving first audio data uploaded by a client, and calculating a first hash feature for the first audio data;

searching for a currently configured blacklist, where a plurality of pieces of second audio data are recorded in the blacklist, and second hash features have been configured for the plurality of pieces of second audio data;

comparing the first hash feature with the plurality of second hash features in the order, to determine whether the plurality of pieces of second audio data include second audio data that is the same as or similar to the first audio data;

determining that the first audio data is illegal in response to the plurality of pieces of second audio data including second audio data that is the same as or similar to the first audio data.

In a third aspect, an embodiment of the present application also provides an audio search device, including:

an audio data determination module, configured to determine first audio data and a plurality of pieces of second audio data;

a hash feature calculation module, configured to calculate a first hash feature for the first audio data and to calculate second hash features for the plurality of pieces of second audio data, respectively;

an order determination module, configured to determine an order of arrangement of the plurality of pieces of second audio data according to densities of the plurality of second hash features;

a hash feature comparison module, configured to compare the first hash feature with the plurality of second hash features in the order, to find second audio data that is the same as or similar to the first audio data.

In a fourth aspect, an embodiment of the present application also provides an audio search device, including:

an audio data receiving module, configured to receive first audio data uploaded by a client, and to calculate a first hash feature for the first audio data;

a blacklist search module, configured to search for a currently configured blacklist, where a plurality of pieces of second audio data are recorded in the blacklist, and second hash features have been configured for the plurality of pieces of second audio data;

a hash feature comparison module, configured to compare the first hash feature with the plurality of second hash features in the order, to determine whether the plurality of pieces of second audio data include second audio data that is the same as or similar to the first audio data;

an illegal audio determination module, configured to determine that the first audio data is illegal in response to the plurality of pieces of second audio data including second audio data that is the same as or similar to the first audio data.

In a fifth aspect, an embodiment of the present application further provides a computer device, the computer device comprising:

at least one processor; and

a memory, configured to store at least one program;

wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the audio search method according to the first aspect or the second aspect.

In a sixth aspect, embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, where, when the computer program is executed by a processor, the audio search method according to the first aspect or the second aspect is implemented.

Brief Description of the Drawings

FIG. 1 is a flowchart of an audio search method provided in Embodiment 1 of the present application;

FIG. 2 is an example diagram of calculating the density of the second hash feature according to Embodiment 1 of the present application;

FIG. 3A is an example diagram of a short audio search provided in Embodiment 1 of the present application;

FIG. 3B is an example diagram of a long audio search provided in Embodiment 1 of the present application;

FIG. 4 is a flowchart of an audio search method provided in Embodiment 2 of the present application;

FIG. 5 is a schematic structural diagram of an audio search apparatus according to Embodiment 3 of the present application;

FIG. 6 is a schematic structural diagram of an audio search apparatus according to Embodiment 4 of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings and embodiments.

Embodiment 1

FIG. 1 is a flowchart of an audio search method provided in Embodiment 1 of the present application. This embodiment is applicable to sorting and comparing audio data according to the density of the hash features of the audio data. The method may be performed by an audio search apparatus, which may be implemented in software and/or hardware and may be configured in a computer device such as a server, workstation, or personal computer. The method includes the following steps:

Step 101: Determine first audio data and a plurality of second audio data.

In this embodiment, the first audio data and the plurality of pieces of second audio data are both audio data. The audio data may take the form of songs released by singers, audio separated from video data such as short videos, movies, and TV dramas, voice signals recorded by users on mobile terminals, and so on. The format of the audio data may include MP3, WMA, and AAC, which is not limited in this embodiment.

Exemplarily, the plurality of pieces of second audio data are collected in advance in various ways, for example, uploaded by users, purchased from copyright owners, recorded by technicians, or crawled from the network by a crawler client. The plurality of pieces of second audio data can form an audio library that provides a search service externally. The first audio data is the audio data to be searched for, that is, the audio library is searched for second audio data that is the same as or similar to the first audio data.

Because of compression, clipping, background noise, and the like, "the same or similar" in this embodiment may mean that the first audio data and the second audio data are the same or similar in whole or in part.

Step 102: Calculate a first hash feature for the first audio data and calculate a second hash feature for a plurality of second audio data, respectively.

For the first audio data, a hash feature (also known as a hash or fingerprint) can be calculated and used as the feature of the first audio data. For ease of distinction, this hash feature is denoted the first hash feature.

For the second audio data, a hash feature (also known as a hash or fingerprint) can be calculated and used as the feature of the second audio data. For ease of distinction, this hash feature is denoted the second hash feature.

In general, the first hash feature and the second hash feature are calculated in the same way, that is, the first hash feature is calculated for the first audio data and the second hash features are calculated for the plurality of pieces of second audio data based on the same method.

In an embodiment of the present application, step 102 may include the following steps:

Step 1021: Convert the first audio data into a first spectrogram.

In this embodiment, the first audio data may be converted into a spectrogram by means of a Fourier transform (for example, the discrete Fourier transform, DFT), a short-time Fourier transform (STFT), or the like. The horizontal axis of the spectrogram is time and the vertical axis is frequency, so the first audio data is converted from a time-domain signal to a frequency-domain signal. For ease of distinction, this spectrogram is denoted the first spectrogram.

Converting a time domain signal into a frequency domain signal will lose time information. Therefore, a data block (also known as a window) method can be used to divide a large segment of the first audio data in the time domain into multiple first data blocks. The plurality of first data blocks are respectively converted into frequency domain signals, so that time information is preserved to a certain extent.
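Exemplarily, the block-wise conversion described above can be sketched as follows. This is only a minimal sketch under stated assumptions, not the implementation of this embodiment: the function names `split_into_blocks`, `dft_magnitudes`, and `spectrogram` are hypothetical, a naive DFT stands in for whatever transform an implementation would actually use, and the data blocks are assumed not to overlap.

```python
import cmath
import math


def split_into_blocks(samples, block_size):
    """Split samples into consecutive full blocks; a trailing partial block is dropped."""
    return [samples[i:i + block_size]
            for i in range(0, len(samples) - block_size + 1, block_size)]


def dft_magnitudes(block):
    """Magnitudes of the first half of the DFT bins of one block (naive O(n^2) DFT)."""
    n = len(block)
    return [abs(sum(block[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]


def spectrogram(samples, block_size):
    """One row of bin magnitudes per data block: time runs over rows, frequency over columns."""
    return [dft_magnitudes(block) for block in split_into_blocks(samples, block_size)]


# Usage: a pure tone that completes 4 cycles per 32-sample block peaks in bin 4.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(64)]
frames = spectrogram(tone, 32)
print(len(frames))                       # 2
print(frames[0].index(max(frames[0])))   # 4
```

Because each row is computed from one data block only, the row index still indicates roughly when a frequency occurred, which is the time information the block method is meant to preserve.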

For example, suppose the first audio data is two-channel, with 16-bit precision and 44100 Hz sampling. One second of data then occupies 44100 samples × 2 bytes × 2 channels ≈ 176 kB. If 4 kB is chosen as the data block size, a Fourier transform is performed on about 44 data blocks per second, and this segmentation density is sufficient.
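The arithmetic of this example can be checked with a few lines. The constant names are hypothetical, and "4 kB" is assumed here to mean 4000 bytes, which is the reading that reproduces the figure of 44 blocks per second.

```python
# Figures from the example above; names are illustrative only.
SAMPLE_RATE_HZ = 44100     # samples per second per channel
BYTES_PER_SAMPLE = 2       # 16-bit precision
CHANNELS = 2               # two-channel audio
BLOCK_SIZE_BYTES = 4000    # "4 kB", assumed to mean 4000 bytes

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
blocks_per_second = bytes_per_second // BLOCK_SIZE_BYTES

print(bytes_per_second)    # 176400
print(blocks_per_second)   # 44
```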

Step 1022: Search for a first key point on multiple spectral bands of the first spectrogram according to the energy.

The frequencies at which the first audio data has a large amplitude may span a very wide range, from low C (32.70 Hz) to high C (4186.01 Hz). To avoid analyzing the entire first spectrogram and to reduce the amount of calculation, the first spectrogram may be divided into a plurality of spectral bands (also called sub-bands).

Key points, namely frequency peaks, are selected from each sub-band. For example, the following sub-bands may be selected: 30 Hz-40 Hz, 40 Hz-80 Hz, and 80 Hz-120 Hz as the bass sub-bands (the fundamental frequencies of instruments such as the bass guitar fall in the bass sub-bands), and 120 Hz-180 Hz and 180 Hz-300 Hz as the midrange and treble sub-bands, respectively (the fundamental frequencies of vocals and most other instruments fall in these two sub-bands).

Since a point with larger energy (i.e., a larger amplitude on the first spectrogram) is more resistant to noise, a key point can be selected in each sub-band according to the energy. For ease of distinction, it is denoted the first key point.

Normally, the frequency point with the highest amplitude (i.e., the highest energy) in each sub-band can be selected as the first key point.
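Exemplarily, the per-sub-band selection can be sketched as follows. The sketch assumes that one data block has already been converted to a list of magnitudes, one per frequency bin, with a known bin width; the function name `key_points` and the parameter `bin_width_hz` are hypothetical, and the band edges are those of the sub-band example above.

```python
# Sub-band edges in Hz, taken from the example above.
BAND_EDGES_HZ = [30, 40, 80, 120, 180, 300]


def key_points(magnitudes, bin_width_hz, band_edges_hz=BAND_EDGES_HZ):
    """Return, for each sub-band, the frequency whose magnitude is largest.

    A sub-band that contains no frequency bin yields None.
    """
    points = []
    for low, high in zip(band_edges_hz, band_edges_hz[1:]):
        best_freq, best_mag = None, -1.0
        for index, magnitude in enumerate(magnitudes):
            freq = index * bin_width_hz
            if low <= freq < high and magnitude > best_mag:
                best_freq, best_mag = freq, magnitude
        points.append(best_freq)
    return points


# Usage with made-up magnitudes: one clear peak per sub-band, 1 Hz per bin.
magnitudes = [0.0] * 301
for freq, magnitude in [(33, 5.0), (56, 7.0), (92, 3.0), (151, 9.0), (185, 4.0)]:
    magnitudes[freq] = magnitude
print(key_points(magnitudes, 1))   # [33, 56, 92, 151, 185]
```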

Step 1023: Generate a first hash feature of the first audio data based on the first key point.

The first key points of each data block constitute the signature of that frame of audio data, and the signatures of the different data blocks together constitute the first hash feature of the entire first audio data.

The first hash feature of the first audio data may be cached in the memory, waiting to be compared with the second hash feature of the second audio data.
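Exemplarily, assembling signatures into a hash feature can be sketched as follows. The representation is an assumption made for illustration: each signature is the space-separated list of key-point frequencies of one data block, and the hash feature is a list of (time, signature) pairs. The names `block_signature` and `hash_feature` are hypothetical.

```python
def block_signature(points):
    """Join the key-point frequencies of one data block into a signature string."""
    return " ".join(str(point) for point in points)


def hash_feature(key_points_per_block, block_duration_s):
    """Return one (start time in seconds, signature) pair per data block."""
    return [(index * block_duration_s, block_signature(points))
            for index, points in enumerate(key_points_per_block)]


# Usage with two made-up data blocks of 0.5 s each.
feature = hash_feature([[30, 51, 99, 121, 195], [33, 56, 92, 151, 185]], 0.5)
for time_s, signature in feature:
    print(time_s, signature)
```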

Step 1024: Convert the second audio data into a second spectrogram.

In this embodiment, the second audio data can be converted into a spectrogram by means of a Fourier transform, a short-time Fourier transform, or the like. The horizontal axis of the spectrogram is time and the vertical axis is frequency, so the second audio data is converted from a time-domain signal to a frequency-domain signal. For ease of distinction, this spectrogram is denoted the second spectrogram.

Converting a time-domain signal into a frequency-domain signal loses time information. Therefore, a data block (also known as a window) method can be used to divide a large segment of the second audio data in the time domain into multiple data blocks. The data blocks are converted to frequency-domain signals separately, which preserves time information to a certain extent.

Step 1025: Search for a second key point on multiple spectral bands of the second spectrogram according to the energy.

The frequencies at which the second audio data has a large amplitude may span a very wide range, from low C (32.70 Hz) to high C (4186.01 Hz). To avoid analyzing the entire second spectrogram and to reduce the amount of calculation, the second spectrogram may be divided into a plurality of spectral bands (also called sub-bands).

Since a point with larger energy (i.e., a larger amplitude on the second spectrogram) is more resistant to noise, a key point can be selected in each sub-band according to the energy. For ease of distinction, it is denoted the second key point.

Usually, the frequency point with the highest amplitude (i.e., the highest energy) in each sub-band can be selected as the second key point.

Step 1026: Generate a second hash feature of the second audio data based on the second key point.

The second key points of each data block constitute the signature of that frame of audio data, and the signatures of the different data blocks together constitute the second hash feature of the entire second audio data.

The second hash feature of the second audio data can be stored in a hash table for retrieval. For ease of lookup, the second hash feature is usually used as the key of the hash table, and the value that the key points to includes the time at which the second hash feature appears in the second audio data and the ID of the second audio data.

Second hash feature (Hash Tag)	Time (in seconds)	Second audio data (Song)
30 51 99 121 195	53.52	Song A
33 56 92 151 185	12.32	Song B
39 26 89 141 251	15.34	Song C
32 67 100 128 270	78.43	Song D
30 51 99 121 195	10.89	Song E
34 57 95 111 200	54.52	Song A
34 41 93 161 202	11.89	Song E
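Exemplarily, such a hash table can be sketched with a dictionary whose keys are the second hash features and whose values list every (time, song) occurrence. The rows are those of the table above; the names `ROWS` and `build_index` are hypothetical, and storing a list per key is an assumption made so that a hash feature occurring in several songs keeps all of its occurrences.

```python
from collections import defaultdict

# Rows of the table above: (second hash feature, time in seconds, song).
ROWS = [
    ("30 51 99 121 195", 53.52, "Song A"),
    ("33 56 92 151 185", 12.32, "Song B"),
    ("39 26 89 141 251", 15.34, "Song C"),
    ("32 67 100 128 270", 78.43, "Song D"),
    ("30 51 99 121 195", 10.89, "Song E"),
    ("34 57 95 111 200", 54.52, "Song A"),
    ("34 41 93 161 202", 11.89, "Song E"),
]


def build_index(rows):
    """Map each hash feature to the list of (time, song) pairs where it occurs."""
    index = defaultdict(list)
    for hash_tag, time_s, song in rows:
        index[hash_tag].append((time_s, song))
    return index


index = build_index(ROWS)
print(index["30 51 99 121 195"])   # [(53.52, 'Song A'), (10.89, 'Song E')]
```

The lookup shows that the same hash feature can point to more than one song, which is why a single matching hash feature is not enough to identify the second audio data.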

Of course, the above method for calculating the first hash feature and the second hash feature is only an example. When implementing the embodiments of the present application, those skilled in the art may adopt other methods for calculating the first hash feature and the second hash feature according to the actual situation and actual needs, which is not limited in the embodiments of the present application.

Step 103: Determine the order of arrangement among the plurality of second audio data according to the density of the plurality of second hash features.

When the hash features are dense, the comparison accuracy of the hash features is higher. When the hash features are sparse, the comparison accuracy is lower, and different or dissimilar audio data is easily mistaken for the same or similar audio data.

In this embodiment, the density of the second hash features of the second audio data may be calculated statistically to represent how dense the second hash features are. In the queuing system, this density is used as the sorting criterion: the plurality of pieces of second audio data are sorted according to the densities of their second hash features, which determines the order among them.

In an embodiment of the present application, the density of the second hash feature of the second audio data is a local density, then in this embodiment, step 103 includes the following steps:

Step 1031: Count the number of overlapping second hash features in multiple local regions.

In this embodiment, the second audio data may be divided into a plurality of local regions of the same size, and the number of second hash features that overlap within each local region may be counted separately. With the local region taken as the unit area, this count can be regarded as the local density.

Exemplarily, a second spectrogram of the second audio data may be obtained, where the second spectrogram is the spectrogram obtained by converting the second audio data from time-domain information to frequency-domain information, and the second hash features may be marked on the second spectrogram.

Multiple windows of the same size are added to the second spectrogram to represent the ranges of the multiple local regions. The number of second hash features is counted in each window and taken as the number of second hash features in the corresponding local region.

Given second audio data A, a window of size k is added at time t. The count in that local region (that is, the local density), denoted D(A, t), is the number i of overlapping second hash features within the window (that is, from t to t+k).

Exemplarily, for the entire second spectrogram, a preset window may be looked up and added to the second spectrogram at preset time intervals, thereby dividing the second spectrogram into multiple local regions.

There are two possible relationships between the window and the preset time interval:

In one relationship, the width of the window is equal to the preset time interval, that is, two adjacent windows do not overlap, which reduces the amount of calculation for the second hash features.

In the other relationship, the width of the window is greater than the preset time interval, that is, two adjacent windows partially overlap, which can improve the accuracy of the second hash features.
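Exemplarily, counting the second hash features per window can be sketched as follows. The sketch assumes that each second hash feature is represented only by the time at which it occurs and that windows are half-open intervals; the names `window_counts`, `times`, `window`, `step`, and `duration` are hypothetical, with `step` standing for the preset time interval.

```python
def window_counts(times, window, step, duration):
    """Count hash-feature times falling in [start, start + window) for each window.

    Windows start at 0, step, 2 * step, ... while the start lies before `duration`.
    """
    counts = []
    i = 0
    while i * step < duration:
        start = i * step
        counts.append(sum(1 for t in times if start <= t < start + window))
        i += 1
    return counts


# Made-up occurrence times (seconds) of second hash features in a 6 s recording.
times = [0.5, 1.2, 1.8, 2.1, 2.4, 2.9, 5.0]

print(window_counts(times, window=2, step=2, duration=6))   # [3, 3, 1]
print(window_counts(times, window=2, step=1, duration=6))   # [3, 5, 3, 0, 1, 1]
```

In the first call adjacent windows do not overlap, so fewer windows are evaluated. In the second call adjacent windows overlap, and the dense cluster around 1 s to 3 s is captured by a single window with a count of 5, which the non-overlapping layout splits in two.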

Step 1032: Generate the density of the second hash feature in the second audio data to which it belongs based on the number of overlaps in the multiple local regions.

Once the number of overlapping second hash features has been counted in the multiple local regions, these counts may be used as a reference to generate the density of the second hash features in the second audio data.

In one example, the overlap counts of the local regions are compared, and the largest count is taken as the density of the second hash features in the second audio data to which they belong.

Given second audio data A, with a window (local region) added at time t and the count within that window denoted D(A, t), the density D(A) of the second hash features in the second audio data is:

$$D(A) = \max_{t} D(A, t)$$

where max is the function that takes the maximum value.

In an example, as shown in FIG. 2, windows 201, 202, 203, 204, 205, 206, and 207 are added to the second spectrogram of a piece of second audio data. The number of overlapping second hash features is highest in window 203, so that number can be selected as the density of the second hash features in the second audio data.

Of course, the above method for calculating the density of the second hash features is only an example. When implementing the embodiments of the present application, other methods may be set according to actual conditions. For example, the overlap counts may be sorted in descending order, and the average of the counts in the top j (j is a positive integer) local regions may be used as the density of the second hash features in the second audio data. Those skilled in the art may also adopt other methods for calculating the density of the second hash features according to actual needs, which is not limited in the embodiments of the present application.
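Exemplarily, the two ways of turning per-window counts into a single density can be sketched as follows. The function names `density_max` and `density_top_j_mean` are hypothetical, and the counts are made-up values used only for illustration.

```python
def density_max(counts):
    """Density as the largest per-window count."""
    return max(counts)


def density_top_j_mean(counts, j):
    """Density as the mean of the j largest per-window counts."""
    top = sorted(counts, reverse=True)[:j]
    return sum(top) / len(top)


counts = [3, 5, 3, 0, 1, 1]            # made-up per-window counts
print(density_max(counts))             # 5
print(density_top_j_mean(counts, 2))   # 4.0
```

Averaging the top j counts is less sensitive to a single unusually dense window than taking the maximum alone, at the cost of one extra parameter to choose.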

Step 1033: Sort the plurality of second audio data in descending order according to the density to obtain the order of the plurality of second audio data.

If the density of the second hash features has been calculated for each piece of second audio data, the plurality of pieces of second audio data may be sorted in descending order of density to determine the position of each piece: the higher the density of its second hash features, the nearer the front a piece of second audio data is placed; conversely, the lower the density, the nearer the back it is placed.
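Exemplarily, the descending sort can be sketched as follows. The function name `order_by_density` is hypothetical, and the density values are the ones this embodiment reports for song A and song B in its short and long audio search scenarios.

```python
def order_by_density(densities):
    """Return the audio IDs sorted from highest to lowest density."""
    return sorted(densities, key=densities.get, reverse=True)


print(order_by_density({"Song A": 0.266, "Song B": 0.067}))   # ['Song A', 'Song B']
print(order_by_density({"Song A": 0.127, "Song B": 0.182}))   # ['Song B', 'Song A']
```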

Step 104: Compare the first hash feature with a plurality of second hash features in order to find second audio data that is the same as or similar to the first audio data.

In this embodiment, the second hash features of the second audio data may be compared with the first hash feature of the first audio data one after another, following the order in which the second audio data is arranged, so as to determine whether the first audio data and the second audio data are the same or similar.

For the current second audio data, if the difference between its second hash feature and the first hash feature of the first audio data is large, the similarity between the second audio data and the first audio data can be considered low. The first hash feature does not match the second hash feature, and the search continues with the next second audio data.

For the current second audio data, if the difference between its second hash feature and the first hash feature of the first audio data is small, the similarity between the second audio data and the first audio data can be considered high. The first hash feature matches the second hash feature, which confirms that second audio data that is the same as or similar to the first audio data has been found. At this point, the search can be stopped.

Exemplarily, a target position may be determined, where the target position represents the quantity of second audio data to be compared and is generally much smaller than the total quantity of second audio data.

The first hash feature is compared, in order, with the second hash features located before the target position.

If the first hash feature matches the second hash feature, it is determined that the first audio data and the second audio data to which the second hash feature belongs are identical or similar.
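Exemplarily, the ordered comparison with a target position can be sketched as follows. This embodiment does not fix how the difference between two hash features is measured, so the sketch assumes one for illustration: the share of the first hash feature's signatures that also occur in the second hash feature, with a match declared at or above a threshold. The names `similarity`, `search`, and `threshold` are hypothetical.

```python
def similarity(first_feature, second_feature):
    """Assumed measure: share of distinct first signatures also present in the second."""
    first = set(first_feature)
    if not first:
        return 0.0
    return len(first & set(second_feature)) / len(first)


def search(first_feature, ordered_library, target_position, threshold=0.5):
    """Compare in order, stop at the first match, give up after target_position items.

    ordered_library is a list of (audio_id, second_feature) pairs that has
    already been sorted by density in descending order.
    """
    for audio_id, second_feature in ordered_library[:target_position]:
        if similarity(first_feature, second_feature) >= threshold:
            return audio_id
    return None


# Made-up signatures; the library is assumed to be sorted already.
library = [("B", ["x", "y"]), ("A", ["a", "b", "c"]), ("C", ["a", "b"])]
query = ["a", "b", "d"]

print(search(query, library, target_position=2))   # A
print(search(query, library, target_position=1))   # None
```

The second call shows the cost of a small target position: the search stops before reaching the matching entry, so the quality of the ordering determines how small the target position can safely be.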

In this embodiment, first audio data and a plurality of pieces of second audio data are determined; a first hash feature is calculated for the first audio data and second hash features are calculated for the plurality of pieces of second audio data, respectively; the order of arrangement of the plurality of pieces of second audio data is determined according to the densities of the plurality of second hash features; and the first hash feature is compared with the plurality of second hash features in that order to find second audio data that is the same as or similar to the first audio data. Denser hash features improve the accuracy of comparison. Adjusting the ordering of the audio data by the density of the hash features raises the probability of finding the same or similar audio data among the comparisons made first, so the accuracy of the audio search is improved while the number of comparisons is reduced.

Assume that the number of pieces of second audio data is N (N is a positive integer). In the queuing system (Queuing System):

For the baseline method, there is no specific reference standard for the order of the second audio data. The first audio data is compared with the second audio data one by one, and finding matching second audio data is a matter of chance. Searching for the second audio data that matches the first audio data consumes a lot of time, with time complexity O(N).

Therefore, the following improvements may be made to the queuing system:

1. Queue System A:

Queue system A arranges the second audio data according to the absolute number (Absolute Matches) of second hash features.

The second audio data is placed in a queue, where the second audio data at the front of the queue are most likely to be the best match and those at the back of the queue are less likely to be the correct match.

Therefore, queue system A can provide a stop criterion: if the first m pieces of second audio data in the queue have been compared and no second audio data matching the first audio data has been found, the search can be stopped, and the search result is that there is no second audio data matching the first audio data.

Wherein, m is a positive integer, and m<<N (m is much smaller than N).

Therefore, the time complexity of the queue system A is O(m), and O(m)<<O(N).

2. Queue System B:

Although queue system A saves time, it is effective only when the plurality of pieces of second audio data have the same duration. When the durations of the plurality of pieces of second audio data deviate widely, the accuracy decreases.

For example, suppose the duration of second audio data A is 2 minutes and the duration of second audio data B is 30 minutes. Even if second audio data A is the correct match for the first audio data, second audio data B may have more second hash features than second audio data A simply because it is so long, so that second audio data B is placed at the front of the queue and second audio data A at the back.

When m pieces of longer second audio data exhibit this phenomenon (that is, long audio collides frequently), the match for second audio data A is lost from the queue.

In this regard, queue system B normalizes by duration (Normalised by Duration): the number of second hash features of each piece of second audio data is divided by its duration, and the second audio data is queued by the result.

However, simply dividing by the duration of the second audio data leads to over-normalization: the longer second audio data is now pushed back in the queue, so the correct second audio data can still be lost from the queue.
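Exemplarily, the difference between queue system A and queue system B can be sketched as follows. The durations are those of the example above (2 minutes and 30 minutes); the numbers of second hash features are hypothetical values chosen only to show how the ranking flips, and the function names are likewise hypothetical.

```python
def rank_by_count(library):
    """Queue system A: order by the absolute number of second hash features."""
    return sorted(library, key=lambda audio_id: library[audio_id][0], reverse=True)


def rank_by_count_per_second(library):
    """Queue system B: order by the number of second hash features divided by duration."""
    return sorted(library,
                  key=lambda audio_id: library[audio_id][0] / library[audio_id][1],
                  reverse=True)


# audio ID -> (hypothetical number of second hash features, duration in seconds)
library = {"A": (240, 120.0), "B": (900, 1800.0)}

print(rank_by_count(library))              # ['B', 'A']
print(rank_by_count_per_second(library))   # ['A', 'B']
```

Dividing by the whole duration always penalizes long recordings, including a long recording that contains a short, dense, correctly matching passage, which is the over-normalization problem described above.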

3. Queue System C:

This embodiment provides queue system C, which normalizes and sorts according to the density of the second hash features, thereby striking a trade-off between the absolute number of second hash features and over-normalization by duration.

In order for those skilled in the art to better understand the embodiments of the present application, the following compares the queuing system A, the queuing system B, and the queuing system C through specific scenarios:

Scenario 1: short audio search

The second audio data are song A (Song A) and song B (Song B) respectively, the duration of song A is less than the duration of song B, and it is assumed that the given second audio data matching the first audio data is song A.

As shown in FIG. 3A , the second hash feature is marked on the second spectrogram of song A and the second spectrogram of song B, respectively, and the following data are counted on them:

Using queue system A, the absolute number of second hash features in song A (727) is less than the absolute number of second hash features in song B (913), so song A ranks after song B.

Using queue system B, song A's number of second hash features normalized by duration (0.198) is greater than song B's (0.033), so song A ranks ahead of song B.

Using queuing system C, the density of the second hash feature in song A (0.266) is greater than the density of the second hash feature in song B (0.067), so song A ranks ahead of song B.

Scenario 2: long audio search

The second audio data are song A (Song A) and song B (Song B) respectively, the duration of song A is shorter than the duration of song B, and it is assumed that the given second audio data matching the first audio data is song B.

As shown in FIG. 3B , the second hash feature is marked on the second spectrogram of song A and the second spectrogram of song B respectively, and the following data are counted on them:

Using queue system A, the absolute number of second hash features in song A (347) is less than the absolute number of second hash features in song B (2481), so song A ranks after song B.

Using queue system B, song A's number of second hash features normalized by duration (0.094) is greater than song B's (0.090), so song A ranks ahead of song B.

Using queuing system C, the density of the second hash feature in song A (0.127) is less than the density of the second hash feature in song B (0.182), so song A ranks after song B.

It can be seen that, in Scenario 2, the matching song B has a higher-density region, a longer duration, and a greater absolute number of second hash features than song A, yet queue system B overcompensates for the duration and ranks song A first. Queue system B is therefore valid for Scenario 1 (short audio search) but not for Scenario 2 (long audio search), whereas queue system C is valid for both Scenario 1 (short audio search) and Scenario 2 (long audio search).
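The comparison of the two scenarios can be checked with a short sketch. The figures are the ones quoted above for FIG. 3A and FIG. 3B; the dictionary layout and the names `SCENARIOS`, `ranks_match_first`, and `evaluate` are illustrative assumptions and are not part of the application.

```python
# Figures quoted in Scenario 1 and Scenario 2; the data layout is illustrative only.
SCENARIOS = {
    "short audio search": {
        "match": "A",                            # song assumed to match the query
        "absolute": {"A": 727, "B": 913},        # sort key of queue system A
        "normalized": {"A": 0.198, "B": 0.033},  # sort key of queue system B
        "density": {"A": 0.266, "B": 0.067},     # sort key of queue system C
    },
    "long audio search": {
        "match": "B",
        "absolute": {"A": 347, "B": 2481},
        "normalized": {"A": 0.094, "B": 0.090},
        "density": {"A": 0.127, "B": 0.182},
    },
}


def ranks_match_first(scores, match):
    """Return True if sorting by descending score puts the matching song first."""
    order = sorted(scores, key=scores.get, reverse=True)
    return order[0] == match


def evaluate():
    """For each scenario, report whether each sort key ranks the match first."""
    result = {}
    for name, data in SCENARIOS.items():
        result[name] = {
            key: ranks_match_first(data[key], data["match"])
            for key in ("absolute", "normalized", "density")
        }
    return result


if __name__ == "__main__":
    for name, verdicts in evaluate().items():
        print(name, verdicts)
```

With these figures, only the density key ranks the matching song first in both scenarios; the absolute-number key fails in the short audio scenario and the duration-normalized key fails in the long audio scenario.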

Embodiment 2

FIG. 4 is a flowchart of an audio search method provided in Embodiment 2 of the present application. This embodiment is applicable to the case where audio data is sorted and compared according to the density of the hash feature of the audio data, so as to perform content review. The method may be performed by an audio search apparatus, which may be implemented in software and/or hardware, and may be configured in computer equipment, such as a server, workstation, personal computer, etc., including the following steps:

Step 401: Receive first audio data uploaded by a client, and calculate a first hash feature for the first audio data.

In this embodiment, the computer device acts as a multimedia platform. On the one hand, it provides users with audio-based services, such as live programs, short videos, voice conversations, and video conversations; on the other hand, it receives audio-carrying files uploaded by users, such as live broadcast data, short videos, and session information.

Different multimedia platforms can formulate video content review standards based on business, legal, and other factors. Before an audio-carrying file is published, its content is reviewed against these standards: audio-carrying files that do not meet the standards, such as files containing pornographic, vulgar, or violent content, are filtered out, and audio-carrying files that meet the standards are released.

If the real-time requirement is high, a streaming real-time system can be set up in the multimedia platform. The user uploads the audio-carrying file to the streaming real-time system in real time through the client, and the streaming real-time system transmits the audio-carrying file to the computer equipment used for content review.

If the real-time requirement is low, a database, such as a distributed database, can be set up in the multimedia platform. The user uploads the audio-carrying file to the database through the client, and the computer equipment used for content review can read the audio-carrying file from the database.

In this embodiment, the first audio data may be separated from the audio-carrying file for content review, and a hash feature may be calculated for the first audio data as the first hash feature.

In one method of calculating the first hash feature, the first audio data is converted into a first spectrogram, first key points are searched for on a plurality of spectral bands of the first spectrogram according to energy, and the first hash feature of the first audio data is generated based on the first key points.
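The key-point and hashing steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the application: the spectrogram is taken as already computed (a list of frames, each a list of energies per frequency bin), a key point is assumed to be the highest-energy bin of each band in each frame, and the pairing of nearby key points into a hash of (frequency, frequency, time difference) is a commonly used landmark-style scheme swapped in for illustration. The names `find_key_points`, `hash_key_points`, `n_bands`, and `fan_out` are hypothetical.

```python
import hashlib


def find_key_points(spectrogram, n_bands):
    """Pick, per frame and per band, the frequency bin with the highest energy.

    Assumption: a "key point" is the energy maximum of a band within one frame.
    Returns a list of (frame_index, bin_index) tuples.
    """
    if n_bands <= 0:
        raise ValueError("n_bands must be positive")
    points = []
    for t, frame in enumerate(spectrogram):
        if len(frame) < n_bands:
            raise ValueError("each frame needs at least n_bands bins")
        band_size = len(frame) // n_bands
        for b in range(n_bands):
            lo = b * band_size
            # The last band absorbs any leftover bins.
            hi = len(frame) if b == n_bands - 1 else lo + band_size
            f = max(range(lo, hi), key=lambda i: frame[i])
            if frame[f] > 0:  # ignore silent bands
                points.append((t, f))
    return points


def hash_key_points(points, fan_out=3):
    """Pair each key point with the next few and hash (f1, f2, time difference).

    Assumption: this landmark-style pairing stands in for the unspecified
    hash construction. Returns a list of (hash_value, frame_index) tuples.
    """
    points = sorted(points)
    features = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1 : i + 1 + fan_out]:
            digest = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:10]
            features.append((digest, t1))
    return features
```

The same two functions would apply unchanged to the second audio data, so that first and second hash features are comparable.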

Step 402: Look up the currently configured blacklist.

In this embodiment, some audio data containing sensitive content such as pornography, vulgarity, violence, etc. may be recorded in the blacklist as second audio data. The second audio data can be continuously expanded.

When the second audio data is collected and recorded in the blacklist, a hash feature may be calculated for the second audio data as the second hash feature.

In one method of calculating the second hash feature, the second audio data is converted into a second spectrogram, second key points are searched for on a plurality of spectral bands of the second spectrogram according to energy, and the second hash feature of the second audio data is generated based on the second key points.

Therefore, a plurality of second audio data are recorded in the blacklist, and each second audio data has been configured with a second hash feature, and the second hash feature may be loaded during content review.

Step 403: Determine the order of arrangement among the plurality of second audio data according to the density of the plurality of second hash features.

For a multimedia platform, the number of first audio data uploaded by clients every day can reach tens of millions or even hundreds of millions. Among this large volume, only about several thousand first audio data belong to the blacklist, which makes the matching rate of the blacklist low.

Taking the 80 million first audio data of a certain multimedia platform as an example, the matching rate of the blacklist is about 0.005%.

Therefore, the multimedia platform needs a queue system with low time consumption and high precision to capture the first audio data belonging to the blacklist as much as possible.

The baseline method compares the first audio data with all the second audio data in the blacklist. Although its accuracy is high, its time complexity is O(N) and it is time-consuming. Because 99.995% of the first audio data does not match any second audio data, most of these comparisons are unnecessary, and this is an inefficient search method.
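As a quick consistency check of the figures above, the following sketch converts the quoted matching rate into an expected number of matches. The variable names are illustrative; the two input figures come from the text.

```python
first_audio_count = 80_000_000  # first audio data in the example above

# 0.005% expressed as an exact ratio (5 per 100,000) to avoid floating-point noise.
rate_numerator, rate_denominator = 5, 100_000

expected_matches = first_audio_count * rate_numerator // rate_denominator
non_matching = first_audio_count - expected_matches

print(expected_matches)  # 4000
print(non_matching)      # 79996000
```

The result, 4000, agrees with the statement that the first audio data belonging to the blacklist number about several thousand; the remaining 79,996,000 comparisons against the full blacklist find nothing under the baseline method.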

Other queue systems improve efficiency by preferentially recommending the second audio data that is more likely to match. Examples are Queue System A, which arranges the second audio data according to the absolute number of second hash features (Absolute Matches), and Queue System B, which arranges the second audio data after normalizing by the duration of the second audio data (Normalized by Duration).

However, these queuing systems are less accurate due to the inconsistent duration of the second audio data.

This embodiment proposes a queue system C that allows pruning: it uses the density of the second hash features to select the second audio data in the pruned queue more accurately while maintaining efficiency.

In an embodiment of the present application, step 403 includes the following steps:

Step 4031: Count the number of overlapping second hash features in multiple local regions.

Exemplarily, a second spectrogram of the second audio data can be obtained; multiple windows are added on the second spectrogram; and the number of second hash features is counted in each of the multiple windows, as the number of second hash features in the multiple local regions.

When adding multiple windows, a preset window can be looked up, and the window is added on the second spectrogram at every preset time interval.

Wherein, the width of the window is less than or equal to the length of the preset time interval.
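The window counting of Step 4031 can be sketched as follows. This is a minimal sketch, assuming each second hash feature carries a time position on the second spectrogram; the function name `count_in_windows` and its parameters are hypothetical. Because the window width does not exceed the interval, the windows do not overlap one another.

```python
def count_in_windows(feature_times, duration, width, interval):
    """Count hash features falling inside windows placed every `interval`.

    feature_times: time positions of the second hash features (assumed input).
    duration:      length of the second audio data, in the same unit.
    width:         preset window width; must not exceed `interval`.
    interval:      preset time between the starts of consecutive windows.
    """
    if width <= 0 or interval <= 0:
        raise ValueError("width and interval must be positive")
    if width > interval:
        raise ValueError("window width must not exceed the preset interval")
    counts = []
    index = 0
    while index * interval < duration:
        start = index * interval
        end = start + width
        # A feature belongs to the window if start <= t < end.
        counts.append(sum(1 for t in feature_times if start <= t < end))
        index += 1
    return counts


if __name__ == "__main__":
    times = [0.5, 1.2, 1.8, 4.1, 4.3, 4.4, 4.9]
    print(count_in_windows(times, duration=6, width=2, interval=2))  # [3, 0, 4]
```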

Step 4032: Generate the density of the second hash feature in the second audio data to which it belongs based on the number of overlaps in the multiple local regions.

In one method of generating the density, the numbers of overlaps in the multiple local regions are compared, and the largest number of overlaps among the local regions is determined as the density of the second hash feature in the second audio data to which it belongs.

Step 4033: Sort the plurality of second audio data in descending order according to the density to obtain the order of the plurality of second audio data.
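Steps 4032 and 4033 can be sketched together. This is a minimal sketch, assuming the per-window counts of each second audio data are already available; the names `density` and `order_by_density` and the dictionary layout are hypothetical.

```python
def density(window_counts):
    """Step 4032 (sketch): the density is the largest per-window count."""
    return max(window_counts, default=0)


def order_by_density(blacklist_counts):
    """Step 4033 (sketch): sort second audio data by descending density.

    blacklist_counts: dict mapping an audio identifier to its per-window counts.
    Returns the identifiers in the order in which they would be compared.
    """
    return sorted(
        blacklist_counts,
        key=lambda name: density(blacklist_counts[name]),
        reverse=True,
    )


if __name__ == "__main__":
    counts = {"a": [3, 0, 4], "b": [9, 1], "c": [2, 2, 2]}
    print(order_by_density(counts))  # ['b', 'a', 'c']
```

Since the second hash features of the blacklist are configured in advance, this ordering could be computed once when the blacklist changes rather than for every query.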

Step 404: Compare the first hash feature with a plurality of second hash features in order to determine whether there is second audio data in the plurality of second audio data that is identical or similar to the first audio data.

Exemplarily, the target position may be determined; the first hash feature is compared with the second hash feature located before the target position in order.

If the first hash feature matches the second hash feature, it is determined that the first audio data is the same or similar to the second audio data to which the second hash feature belongs.
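The pruned comparison of Step 404 can be sketched as follows. This is a minimal sketch under assumptions: hash features are represented as sets of hash values, and a "match" is assumed to mean that at least `min_shared` hash values are common to both sets, since the matching criterion is not specified here. The names `search`, `target_position`, and `min_shared` are hypothetical.

```python
def search(first_hashes, ordered_blacklist, target_position, min_shared=5):
    """Compare the query only with entries located before the target position.

    first_hashes:      hash values of the first audio data.
    ordered_blacklist: list of (identifier, hash values) already sorted by
                       descending density.
    target_position:   number of leading entries to compare against.
    min_shared:        assumed matching threshold (not from the source).
    Returns the identifier of the first matching entry, or None.
    """
    query = set(first_hashes)
    for name, second_hashes in ordered_blacklist[:target_position]:
        if len(query & set(second_hashes)) >= min_shared:
            return name
    return None


if __name__ == "__main__":
    blacklist = [
        ("x", {"h1", "h2", "h3"}),
        ("y", {"h4", "h5", "h6"}),
        ("z", {"h7", "h8", "h9"}),
    ]
    print(search(["h4", "h5", "h0"], blacklist, target_position=2, min_shared=2))  # y
```

Entries at or after the target position are never compared, which is where the saving in time comes from; a matching entry that is ranked too low is missed, which is why the quality of the ordering matters.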

In an experiment for this embodiment, the baseline method, queue system A, queue system B, and queue system C are tested. The experiment uses a test set consisting of 130 blacklisted second audio data and 1000 first audio data, of which 800 first audio data do not belong to the blacklist and 200 first audio data belong to the blacklist.

The experiment measures the time taken and the precision of every queue system when the stopping criterion is to compare with only the first m second audio data, as well as those of a random search without a stopping criterion. The experimental results are as follows:

Queue system	Time Taken	Push Rate	Precision
Baseline method	53.68	20.00%	100.00%
Queue system A	3.90	86.50%	94.22%
Queue system B	4.11	65.00%	96.15%
Queue system C	4.74	95.50%	97.91%

The baseline method applies no stopping criterion and tests against all second audio data. Because the whole database is rigorously tested, all pushes are positive; therefore, the push rate reaches 20.00% and the precision reaches 100.00%.

For queue system A, when the stopping criterion is set, the time consumption is reduced by about 92% compared with the baseline method, and the push rate and precision are relatively good.

Queuing system B can improve accuracy relative to queuing system A, but at the expense of lowering the push rate.

Queue system C can provide high push rate and precision at the same time, and the time consumption is very small.
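The time savings can be recomputed from the Time Taken column of the table. The sketch below only restates the tabulated figures; the names `TIME_TAKEN` and `reduction_percent` are illustrative.

```python
# Time Taken values from the table above (unit as reported in the experiment).
TIME_TAKEN = {"baseline": 53.68, "A": 3.90, "B": 4.11, "C": 4.74}


def reduction_percent(system):
    """Relative reduction in time taken compared with the baseline method."""
    base = TIME_TAKEN["baseline"]
    return (base - TIME_TAKEN[system]) / base * 100


if __name__ == "__main__":
    for system in ("A", "B", "C"):
        print(f"queue system {system}: {reduction_percent(system):.1f}% less time")
```

All three queue systems cut the time taken by roughly 91% to 93%, so the differences among them lie mainly in push rate and precision.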

Step 405: If there is second audio data among the plurality of second audio data that is the same as or similar to the first audio data, determine that the first audio data is illegal.

If the first audio data is not the same as or similar to any second audio data in the blacklist, it can be determined that the first audio data is legal and passes this content review; other content reviews may then be performed according to business requirements, or the first audio data may be released to the public.

If the first audio data is the same as or similar to a certain second audio data in the blacklist, it can be determined that the first audio data is illegal, cannot pass the content review, and cannot be released to the public, and corresponding prompt information is generated and sent to the client. At the same time, the user logged in to the client may be banned or have the account frozen.

In this embodiment, calculating the first hash feature for the first audio data, calculating the second hash features for the second audio data, sorting the second audio data based on the density of the second hash features, and comparing the first hash feature with the second hash features are basically similar to their application in Embodiment 1, so the description here is relatively brief; for the relevant parts, refer to the description of Embodiment 1, which is not repeated in this embodiment.

In this embodiment, the first audio data uploaded by the client is received, and the first hash feature is calculated for the first audio data; the currently configured blacklist is looked up, where a plurality of second audio data are recorded in the blacklist and second hash features have been configured for the plurality of second audio data; the order of arrangement among the plurality of second audio data is determined according to the density of the plurality of second hash features; the first hash feature is compared with the plurality of second hash features in that order to determine whether there is second audio data among the plurality of second audio data that is the same as or similar to the first audio data; and if so, the first audio data is determined to be illegal. Denser hash features can improve the accuracy of comparison, and adjusting the order of the audio data by the density of the hash features raises the probability of finding the same or similar audio data early, which reduces the number of comparisons and improves both the push rate and the precision of the audio search.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should know that the embodiments of the present application are not limited by the described action sequence, because, according to the embodiments of the present application, certain steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions involved are not necessarily required by the embodiments of the present application.

Embodiment 3

FIG. 5 is a structural block diagram of an audio search apparatus provided in Embodiment 3 of the present application, including the following modules:

The audio data determination module 501 is configured to determine the first audio data and a plurality of second audio data;

The hash feature calculation module 502 is configured to calculate a first hash feature for the first audio data and a second hash feature for a plurality of the second audio data respectively;

The order determination module 503 is configured to determine the order of arrangement among the plurality of the second audio data according to the density of the plurality of the second hash features;

The hash feature comparison module 504 is configured to compare the first hash feature with a plurality of the second hash features in the order to find second audio data that is the same as or similar to the first audio data.

In an embodiment of the present application, the audio data determination module 501 includes:

a first spectrogram conversion module, configured to convert the first audio data into a first spectrogram;

a first key point search module, configured to search for a first key point on a plurality of frequency spectrum bands of the first spectrogram according to energy;

a first hash feature generation module, configured to generate a first hash feature of the first audio data based on the first key point;

A second spectrogram conversion module, configured to convert each second audio data into a second spectrogram;

A second key point searching module, configured to search for a second key point on a plurality of spectral bands of the second spectrogram according to energy;

A second hash feature generation module configured to generate a second hash feature of each of the second audio data based on the second key point.

In an embodiment of the present application, the ranking determining module 503 includes:

A local quantity statistics module, configured to count the number of overlaps of each second hash feature in multiple local regions;

a local density generation module, configured to generate the density of each second hash feature in the second audio data to which it belongs based on the number of overlaps in a plurality of the local regions;

The audio sequence determination module is configured to sort the plurality of second audio data in descending order according to the density to obtain the sequence of the plurality of second audio data.

In an embodiment of the present application, the local quantity statistics module includes:

a spectrogram acquisition module, configured to acquire a second spectrogram of the second audio data to which each second hash feature belongs;

a window adding module, configured to add multiple windows on the second spectrogram;

The window number statistics module is configured to count the number of each second hash feature in a plurality of the windows respectively, as the number of each second hash feature in a plurality of local areas.

In an embodiment of the present application, the window adding module includes:

a window search module, configured to look up a preset window;

A time adding module, configured to add the window on the second spectrogram every preset time interval.

In an embodiment of the present application, the width of the window is less than or equal to the length of the preset time.

In an embodiment of the present application, the local density generation module includes:

a quantity comparison module, configured to compare the overlapping quantities in a plurality of the local regions;

a quantity value module, configured to, if the number of overlaps in a certain local region is the largest, determine the number of overlaps in the local region with the largest number of overlaps as the density of each second hash feature in the second audio data to which it belongs.

In an embodiment of the present application, the hash feature comparison module 504 includes:

a target position determination module, configured to determine the target position;

a partial feature comparison module, configured to compare the first hash feature with the second hash feature located before the target position in the order;

A search and determination module, configured to determine that the first audio data and the second audio data to which the second hash feature belongs are identical or similar if the first hash feature matches the second hash feature.

The audio search apparatus provided by the embodiment of the present application can execute the audio search method provided by any embodiment of the present application, and has functional modules corresponding to the execution method.

Embodiment 4

FIG. 6 is a structural block diagram of an audio search apparatus provided in Embodiment 4 of the present application, including the following modules:

The audio data receiving module 601 is configured to receive the first audio data uploaded by the client, and calculate the first hash feature for the first audio data;

The blacklist search module 602 is configured to search for a currently configured blacklist, where a plurality of second audio data are recorded in the blacklist, and a second hash feature has been configured for the plurality of second audio data;

an order determination module 603, configured to determine the order of arrangement among a plurality of the second audio data according to the density of the plurality of second hash features;

a hash feature comparison module 604, configured to compare the first hash feature with a plurality of the second hash features in the order to determine whether there is second audio data in the plurality of the second audio data that is the same as or similar to the first audio data;

The illegal audio determination module 605 is configured to determine that the first audio data is illegal if there is second audio data in the plurality of second audio data that is the same as or similar to the first audio data.

In an embodiment of the present application, the audio data receiving module 601 includes:

A first hash feature generation module configured to generate a first hash feature of the first audio data based on the first key point.

In an embodiment of the present application, it also includes:

In an embodiment of the present application, the ranking determining module 603 includes:

A local quantity statistics module, configured to count the number of overlaps of each second hash feature in multiple local regions;

In an embodiment of the present application, the window adding module includes:

a window search module, configured to look up a preset window;

In an embodiment of the present application, the hash feature comparison module 604 includes:

Embodiment 5

The fifth embodiment of the present application provides a computer device, in which the audio search apparatus provided by any one of the embodiments of the present application can be integrated.

FIG. 7 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application. The computer device includes at least one processor 701 and a memory 702, and the memory 702 is configured to store at least one program. When the at least one program is executed by the at least one processor 701, the at least one processor 701 implements the audio search method described in any embodiment of the present application.

Embodiment 6

Embodiment 6 of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, each process of the above audio search method is implemented, which is not repeated here.

Claims

An audio search method comprising:

determining first audio data and a plurality of second audio data;

respectively calculating a first hash feature for the first audio data, and calculating a second hash feature for a plurality of the second audio data;

Determine the order of arrangement among the plurality of the second audio data according to the density of the plurality of the second hash features;

The first hash feature is compared with a plurality of the second hash features in the order to find the second audio data that is the same as or similar to the first audio data.
The method according to claim 1, wherein the calculating a first hash feature for the first audio data and calculating a second hash feature for a plurality of the second audio data respectively comprises:

converting the first audio data into a first spectrogram;

searching for a first key point on a plurality of spectral bands of the first spectrogram according to the energy;

generating a first hash feature of the first audio data based on the first key point;

converting each second audio data to a second spectrogram;

searching for a second key point on a plurality of spectral bands of the second spectrogram according to the energy;

A second hash feature of each of the second audio data is generated based on the second keypoint.
The method according to claim 1, wherein the determining the order of arrangement among the plurality of the second audio data according to the density of the plurality of the second hash features comprises:

Count the number of overlaps of each second hash feature in multiple local regions;

generating a density of each of the second hash features in the associated second audio data based on the number of overlaps in a plurality of the local regions;

Sort the plurality of second audio data in descending order according to the density to obtain an order of the plurality of second audio data.
The method of claim 3, wherein the counting the number of overlaps of each second hash feature in a plurality of local regions comprises:

obtaining a second spectrogram of the second audio data to which each second hash feature belongs;

adding a plurality of windows on the second spectrogram;

The number of each second hash feature is counted in a plurality of the windows, as the number of each second hash feature in a plurality of local regions.
The method of claim 4, wherein the adding a plurality of windows on the second spectrogram comprises:

looking up a preset window;

The window is added on the second spectrogram at preset time intervals.
The method according to claim 5, wherein the width of the window is less than or equal to the length of the preset time.
The method according to claim 3, wherein generating the density of each of the second hash features in the second audio data to which they belong based on the number of overlaps in a plurality of the local regions comprises:

comparing the number of overlaps in a plurality of said local regions;

In response to the largest number of overlaps in a certain partial region, the number of overlaps in the partial region with the largest number of overlaps is determined as the density of each of the second hash features in the associated second audio data.
The method according to any one of claims 1-7, wherein the comparing the first hash feature with a plurality of the second hash features in the order to find the second audio data that is the same as or similar to the first audio data comprises:

determine the target location;

comparing the first hash feature with the second hash feature prior to the target location in the order;

In response to the first hash feature matching the second hash feature, it is determined that the first audio data is identical or similar to the second audio data to which the second hash feature belongs.
An audio search method comprising:

receiving the first audio data uploaded by the client, and calculating a first hash feature for the first audio data;

Find the blacklist of the current configuration, the blacklist is recorded with a plurality of second audio data, and a plurality of the second audio data have been configured with the second hash feature;

Determine the order of arrangement among the plurality of the second audio data according to the density of the plurality of the second hash features;

comparing the first hash feature with a plurality of the second hash features in the order to determine whether there is second audio data in the plurality of second audio data that is the same as or similar to the first audio data;

determining that the first audio data is illegal in response to there being second audio data in the plurality of second audio data that is the same as or similar to the first audio data.
An audio search device, comprising:

an audio data determination module, configured to determine the first audio data and a plurality of second audio data;

A hash feature calculation module, configured to calculate a first hash feature for the first audio data and a second hash feature for a plurality of the second audio data respectively;

an order determination module, configured to determine an order of arrangement among a plurality of the second audio data according to the density of the plurality of second hash features;

A hash feature comparison module, configured to compare the first hash feature with a plurality of the second hash features in the order to find second audio data that is the same as or similar to the first audio data.
An audio search device, comprising:

an audio data receiving module, configured to receive the first audio data uploaded by the client, and calculate a first hash feature for the first audio data;

The blacklist search module is configured to search for a currently configured blacklist, where a plurality of second audio data are recorded in the blacklist, and a second hash feature has been configured for the plurality of second audio data;

an order determination module, configured to determine an order of arrangement among a plurality of the second audio data according to the density of the plurality of second hash features;

A hash feature comparison module, configured to compare the first hash feature with a plurality of the second hash features in the order to determine whether there is second audio data in the plurality of second audio data that is the same as or similar to the first audio data;

The illegal audio determination module is configured to determine that the first audio data is illegal in response to the presence of second audio data in the plurality of second audio data that is identical to or similar to the first audio data.
A computer device comprising:

at least one processor;

memory, arranged to store at least one program,

When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the audio search method according to any one of claims 1-9.
A computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the audio search method according to any one of claims 1-9.