CN107705805B - Audio duplicate checking method and device

Info

Publication number: CN107705805B
Application number: CN201711009825.XA
Authority: CN
Inventors: 黄君实; 林敏�; 李东亮; 陈强
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2017-10-25
Filing date: 2017-10-25
Publication date: 2021-01-29
Anticipated expiration: 2037-10-25
Also published as: CN107705805A

Abstract

An embodiment of the invention provides an audio duplicate checking method, applied in the technical field of multimedia, comprising: extracting audio frames from the audio to be checked at preset time intervals; determining a spectrogram corresponding to each audio frame; inputting the spectrogram corresponding to each audio frame into a preset feature extraction model to obtain a depth feature corresponding to each audio frame; performing feature pooling on the depth features corresponding to the audio frames to obtain pooled depth features; integrating and encoding the pooled depth features to obtain feature information of the audio to be checked; and performing audio duplicate checking according to the feature information of the audio to be checked. The audio duplicate checking method and device provided by the embodiment of the invention are used for performing duplicate checking on a plurality of pieces of audio information.

Description

Audio duplicate checking method and device

Technical Field

The invention relates to the technical field of multimedia, in particular to an audio duplicate checking method and device.

Background

With the development of information technology and multimedia technology, various types of audio websites have emerged, and users or website managers frequently upload audio to these websites for other users to download and listen to.

As a result, a website may receive a large amount of uploaded audio, much of which is duplicate or highly similar. When the website ranks audio by play count in order to make recommendations, the presence of many duplicate or highly similar audios lowers the accuracy of both the ranking and the recommendations. It also makes it harder for users to search for and listen to audio, resulting in a poor user experience.

Disclosure of Invention

In order to overcome the above technical problems or at least partially solve the above technical problems, the following technical solutions are proposed:

according to one aspect, an embodiment of the present invention provides an audio duplicate checking method, including:

extracting audio frames from the audio to be checked at preset time intervals;

determining a spectrogram corresponding to each audio frame;

inputting the spectrogram corresponding to each audio frame into a preset feature extraction model to obtain the depth feature corresponding to each audio frame;

performing feature pooling on the depth features respectively corresponding to the audio frames to obtain the depth features respectively corresponding to the audio frames after the feature pooling is performed;

integrating and coding depth features respectively corresponding to the audio frames subjected to pooling processing to obtain feature information of the audio to be checked;

and carrying out audio duplicate checking according to the characteristic information of the audio to be checked.

Specifically, the preset feature extraction model is obtained by training a deep convolutional neural network.

Before the step of inputting the spectrogram corresponding to each audio frame into the preset feature extraction model to obtain the depth features corresponding to each audio frame, the method further comprises:

performing audio preprocessing on the spectrogram corresponding to each audio frame, wherein the audio preprocessing comprises at least one of the following: size normalization processing; audio denoising processing;

wherein the step of inputting the spectrogram corresponding to each audio frame into the preset feature extraction model to obtain the depth feature corresponding to each audio frame comprises:

inputting the spectrogram corresponding to each audio frame after audio preprocessing into the preset feature extraction model to obtain the depth feature corresponding to each audio frame.

Further, after the step of obtaining the feature information of the audio to be checked by integrating and encoding the depth features corresponding to the audio frames after the pooling processing, the method further comprises the following steps:

post-processing the feature information of the audio to be checked by at least one of the following processing modes to obtain processed feature information of the audio to be checked, wherein the processing modes comprise: feature dimension reduction processing; decorrelation processing.

Specifically, the step of performing audio duplicate checking according to the feature information of the audio to be checked comprises:

determining an audio feature index of the audio to be checked through product quantization according to the processed feature information of the audio to be checked;

and performing audio duplicate checking according to the audio feature index of the audio to be checked.

Specifically, the audio duplicate checking method includes:

judging whether the audio feature indexes respectively corresponding to the audios are the same;

and if identical audio feature indexes exist, determining that the audios corresponding to the identical audio feature index are duplicates of one another.

Further, an audio to be deleted is determined from among the duplicate audios, and the audio to be deleted is deleted.

According to another aspect, an embodiment of the present invention further provides an apparatus for audio duplicate checking, including:

the extraction module is used for extracting audio frames from the audio to be checked at preset time intervals;

the determining module is used for determining the spectrogram corresponding to each audio frame extracted by the extracting module;

the input module is used for inputting the spectrogram corresponding to each audio frame determined by the determination module into a preset feature extraction model to obtain the depth feature corresponding to each audio frame;

the feature pooling processing module is used for performing feature pooling on the depth features respectively corresponding to the audio frames to obtain the depth features respectively corresponding to the audio frames after the feature pooling is performed;

the integration coding module is used for integrating and encoding the depth features respectively corresponding to the audio frames pooled by the feature pooling processing module to obtain the feature information of the audio to be checked;

and the audio duplicate checking module is used for performing audio duplicate checking according to the feature information of the audio to be checked obtained by the integration coding module.

Further, the apparatus further comprises: an audio preprocessing module;

the audio preprocessing module is used for performing audio preprocessing on the spectrogram corresponding to each audio frame determined by the determining module, and the audio preprocessing comprises at least one of the following: size normalization processing; audio denoising processing;

and the input module is specifically used for inputting the spectrogram corresponding to each audio frame subjected to the audio preprocessing by the audio preprocessing module into the preset feature extraction model so as to obtain the depth features corresponding to each audio frame.

Further, the apparatus further comprises: a post-processing module;

the post-processing module is used for post-processing the feature information of the audio to be checked by at least one of the following processing modes to obtain processed feature information of the audio to be checked, and the processing modes comprise: feature dimension reduction processing; decorrelation processing.

Specifically, the audio duplicate checking module comprises: a determining unit and an audio duplicate checking unit;

the determining unit is used for determining the audio feature index of the audio to be checked through product quantization according to the processed feature information of the audio to be checked;

and the audio duplicate checking unit is used for performing audio duplicate checking according to the audio feature index of the audio to be checked determined by the determining unit.

Specifically, the audio duplicate checking module is configured to determine whether the audio feature indexes respectively corresponding to the audios are the same;

the audio duplicate checking module is further configured to determine, when identical audio feature indexes exist, that the audios corresponding to the identical audio feature index are duplicates of one another.

Further, the apparatus further comprises: a deletion module;

the determining module is further used for determining the audio to be deleted from the repeated audios;

and the deleting module is used for deleting the audio to be deleted determined by the determining module.

According to yet another aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned audio duplicate checking method.

According to yet another aspect, an embodiment of the present invention further provides a computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the audio duplicate checking method.

The invention provides an audio duplicate checking method and device. Audio frames are extracted from the audio to be checked at preset time intervals; a spectrogram corresponding to each audio frame is determined; the spectrogram corresponding to each audio frame is input into a preset feature extraction model to obtain the depth feature corresponding to each audio frame; feature pooling is performed on the depth features to obtain pooled depth features; the pooled depth features are integrated and encoded to obtain the feature information of the audio to be checked; and audio duplicate checking is performed according to that feature information. By performing duplicate checking on audio information, for example uploaded audio information, the method and device can identify duplicate or highly similar audio among the uploaded audio information, so that the accuracy with which the website ranks audio information can be improved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of an audio duplicate checking method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of three pooling modes of an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an audio duplicate checking apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another audio duplicate checking apparatus according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

As used herein, the singular forms "a", "an", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be appreciated by those skilled in the art, a "terminal" or "terminal device" as used herein includes both devices having only a wireless signal receiver without transmit capability, and devices having receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with a single-line display or a multi-line display, or without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "terminal" or "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. As used herein, a "terminal device" may also be a communication terminal, a web terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playing function, or a smart TV, a set-top box, etc.

Example one

The embodiment of the invention provides an audio duplicate checking method, as shown in FIG. 1, comprising:

Step 101, extracting audio frames from the audio to be checked at preset time intervals.

For the embodiment of the invention, the audio to be checked is cut into a plurality of audio frames: one audio frame is extracted every preset time interval, and each audio frame has a preset duration.

For example, one audio frame is extracted from the audio to be checked every 5 milliseconds (ms).

For example, the duration of each audio frame is 5 seconds (s), 8 s, or 10 s. The embodiment of the invention is not limited to these values.
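As a rough illustration of step 101, the following sketch cuts a sampled signal into fixed-length frames that start at a fixed interval. The function name `extract_frames`, its parameters, and the toy sample rate are assumptions made for this example and do not come from the patent text.

```python
import numpy as np

def extract_frames(signal, sample_rate, interval_s, frame_len_s):
    """Cut a 1-D signal into frames of frame_len_s seconds,
    one frame starting every interval_s seconds (hypothetical helper)."""
    hop = int(round(interval_s * sample_rate))        # samples between frame starts
    length = int(round(frame_len_s * sample_rate))    # samples per frame
    if hop <= 0 or length <= 0:
        raise ValueError("interval and frame length must be positive")
    starts = range(0, len(signal) - length + 1, hop)
    if len(starts) == 0:
        return np.empty((0, length))
    return np.stack([signal[s:s + length] for s in starts])

# Toy usage: a 20 s signal sampled at 100 Hz, one 10 s frame every 5 s.
frames = extract_frames(np.arange(2000.0), sample_rate=100,
                        interval_s=5.0, frame_len_s=10.0)
# frames.shape is (3, 1000)
```

Frames overlap whenever the interval is shorter than the frame duration, which is the case for the example values given above (a 5 ms interval with frames lasting several seconds).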

Step 102, determining the spectrogram corresponding to each audio frame.

The audio frame to be checked is converted into a spectrogram as follows. In order to reflect how the spectrum of the signal changes over time, short-time Fourier transform processing is adopted. The short-time Fourier transform, also called the sliding-window Fourier transform, slides a short window function along the signal and performs a Fourier transform on the data within each interval intercepted by the window function:

$$X(\omega, \tau) = \sum_{k} x(k)\, w(k, \tau)\, e^{-j\omega k}$$

where $x(k)$ is the sampled sound signal, $w(k, \tau)$ is a window function of length $N$, and $X(\omega, \tau)$ is a two-dimensional function representing the Fourier transform of the windowed sound whose center point is located at $\tau$. Through the above formula, the acoustic signal is transformed into points whose gray levels characterize the magnitude of the transform; the resulting image is the spectrogram used in signal processing.
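A minimal sketch of the sliding-window Fourier transform described above, using NumPy. The Hann window, the window length of 256 samples, the hop of 128 samples, and the decibel scaling are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def spectrogram(frame, window_len=256, hop=128):
    """Log-magnitude short-time Fourier transform of a 1-D frame.
    Rows are time positions of the window, columns are frequency bins.
    Assumes len(frame) >= window_len."""
    window = np.hanning(window_len)                    # assumed window function
    starts = range(0, len(frame) - window_len + 1, hop)
    segments = np.stack([frame[s:s + window_len] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(segments, axis=1))  # one FFT per window position
    return 20.0 * np.log10(magnitude + 1e-10)          # gray level in dB

# Toy usage: a 1 kHz tone sampled at 8 kHz for one second.
tone = np.sin(2 * np.pi * 1000 * np.arange(8000) / 8000)
spec = spectrogram(tone)
```

Each row of `spec` can be rendered as one column of gray levels; stacking the rows over time gives the spectrogram image that would be fed to the feature extraction model.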

Step 103, inputting the spectrogram corresponding to each audio frame into a preset feature extraction model to obtain the depth feature corresponding to each audio frame.

The preset feature extraction model is obtained by training a deep convolutional neural network.

For example, the deep convolutional neural network is trained on 2 million sample audios in 21,000 categories to obtain the preset feature extraction model. The sample audios can be classified according to different timbres, different tempos and/or different contents.

For the embodiment of the invention, the spectrogram of each audio frame is input into the trained deep convolutional neural network, which outputs either the probability that the audio frame belongs to each of the 21,000 categories, or a representation of a preset dimension corresponding to the audio frame, where the representation can be used to characterize the application scene corresponding to the audio frame.

Step 104, performing feature pooling on the depth features respectively corresponding to the audio frames to obtain the pooled depth features respectively corresponding to the audio frames.

For the embodiment of the invention, pooling aggregates (for example, averages) the convolutional features after they have been extracted, progressively reducing the dimension of the convolutional features corresponding to the hidden nodes.

For embodiments of the present invention, a feature that is useful in one region is very likely to be equally applicable in another region. Thus, to describe a longer audio, a natural idea is to aggregate statistics of the features at different locations; for example, one can calculate the average (or maximum) value of a particular feature over a region. These summary statistics not only have much lower dimensionality than the full set of extracted features, but also improve the results (they are less prone to overfitting). This aggregation operation is called pooling.

For embodiments of the invention, pooling may comprise: 1) mean-pooling, which averages the feature points in a neighborhood and preserves the background better; 2) max-pooling, which takes the maximum of the feature points in a neighborhood and extracts texture better; 3) stochastic-pooling, which lies between the two: it assigns each point a probability according to its numerical value and then sub-samples according to that probability.

The error of feature extraction mainly comes from two sources: (1) the limited size of the neighborhood increases the variance of the estimated value; (2) convolutional layer parameter errors shift the estimated mean. In general, mean-pooling reduces the first error and preserves more background information, while max-pooling reduces the second error and preserves more texture information. Stochastic-pooling is similar to mean-pooling in the average sense and obeys the max-pooling criterion in the local sense. The three types of pooling are shown in FIG. 2.

For example, M audio frames are extracted, with depth features $x_1, x_2, \dots, x_M$. When pooling is performed by the min pooling algorithm, $x_{\min} = \min[x_1, x_2, \dots, x_M]$, where $\min[\cdot]$ takes the minimum value of each dimension over the M audio frames; when pooling is performed by the max pooling algorithm, $x_{\max} = \max[x_1, x_2, \dots, x_M]$, where $\max[\cdot]$ takes the maximum value of each dimension over the M audio frames; when pooling is performed by the average pooling algorithm, $x_{\mathrm{ave}} = \operatorname{ave}[x_1, x_2, \dots, x_M]$, where $\operatorname{ave}[\cdot]$ takes the average of each dimension over the M audio frames.
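The per-dimension min, max, and average pooling over M frame features can be sketched as follows. The function name `pool_features` and the tiny two-dimensional features are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def pool_features(features, mode="max"):
    """Pool an (M, D) array of per-frame depth features into one D-vector
    by taking the min, max, or mean of each dimension across the M frames."""
    features = np.asarray(features, dtype=float)
    if mode == "min":
        return features.min(axis=0)
    if mode == "max":
        return features.max(axis=0)
    if mode == "mean":
        return features.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# Three frames (M = 3) with two-dimensional features.
x = [[1.0, 5.0],
     [3.0, 2.0],
     [2.0, 8.0]]
pooled_min = pool_features(x, "min")    # [1.0, 2.0]
pooled_max = pool_features(x, "max")    # [3.0, 8.0]
pooled_mean = pool_features(x, "mean")  # [2.0, 5.0]
```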

Step 105, integrating and encoding the pooled depth features respectively corresponding to the audio frames to obtain the feature information of the audio to be checked.

For example, if three audio frames, namely audio frame 1, audio frame 2, and audio frame 3, are extracted from the audio to be checked in step 101, the feature information of the audio to be checked is determined according to the feature information corresponding to audio frame 1, audio frame 2, and audio frame 3.

Step 106, performing audio duplicate checking according to the feature information of the audio to be checked.

For the embodiment of the invention, the feature information of the audio to be checked is used to determine whether the online audio contains any audio whose feature information is highly correlated with it, thereby performing duplicate checking on the audio.

The embodiment of the invention provides an audio duplicate checking method. Audio frames are extracted from the audio to be checked at preset time intervals; a spectrogram corresponding to each audio frame is determined; the spectrogram corresponding to each audio frame is input into a preset feature extraction model to obtain the depth feature corresponding to each audio frame; feature pooling is performed on the depth features to obtain pooled depth features; the pooled depth features are integrated and encoded to obtain the feature information of the audio to be checked; and audio duplicate checking is performed according to that feature information. By performing duplicate checking on audio information, for example uploaded audio information, the embodiment of the invention can identify duplicate or highly similar audio among the uploaded audio information, thereby improving the accuracy with which the website ranks audio information.

Example two

Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the operation shown in the first embodiment, the operation shown in the second embodiment, wherein,

Before step 103, the method further comprises: performing audio preprocessing on the spectrogram corresponding to each audio frame.

Wherein the audio preprocessing comprises at least one of the following: size normalization processing; audio denoising processing.

For the embodiment of the present invention, size normalization resizes the spectrogram of the audio to a regular size by resampling.

The audio denoising processing performed on each audio frame comprises the following steps: calculating the Mel frequency domain parameters of all frames in the audio data; calculating the amplitude and phase angle of all frequency domain frames; setting the current frame to be judged for effective audio data as the T-th frame, and setting the current denoising frame as the first frame; detecting effective audio data from the Mel frequency parameters to obtain the start frame and the end frame of the effective audio data; calculating a signal-to-noise ratio parameter; performing audio denoising to obtain a corrected amplitude value for the denoised frame; and performing an inverse fast Fourier transform using the corrected amplitude value and the phase angle.

For the embodiment of the present invention, calculating the Mel frequency domain parameters of all frames in the audio data includes: performing a fast Fourier transform on the t-th audio frame $x_t(n)$ to obtain the frequency domain frame $X_t(k)$; filtering the frequency domain frame $X_t(k)$ with a set of triangular filters; calculating the output logarithmic energy of each filter; and obtaining the Mel frequency domain parameter $\mathrm{MFCC}_t$.

Here $1 \le n \le N$, where $N$ is the frame length, and $1 \le k \le N$; $x_t(n)$ denotes the n-th component of the t-th audio frame; $X_t(k)$ denotes the k-th component of the frequency domain frame corresponding to the t-th audio frame.
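The Mel parameter computation can be sketched as below: an FFT of the frame, a bank of triangular filters spaced evenly on the Mel scale, and the logarithmic energy of each filter output. The final discrete cosine transform is the conventional last step of MFCC extraction and is an assumption here, since the text does not mention it; the filter count, coefficient count, and Mel-scale formula are likewise common defaults rather than values from the patent.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(x_t, sample_rate, n_filters=26, n_coeffs=13):
    """Mel parameters of one frame x_t(n): FFT, triangular filtering,
    log energy per filter, then a DCT-II (assumed final step)."""
    power = np.abs(np.fft.rfft(x_t)) ** 2                 # |X_t(k)|^2
    bank = mel_filterbank(n_filters, len(x_t), sample_rate)
    log_energy = np.log(bank @ power + 1e-10)             # log energy of each filter
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                    np.arange(n_filters) + 0.5) / n_filters)
    return basis @ log_energy

# Toy usage: one 512-sample frame of a 440 Hz tone sampled at 16 kHz.
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
coeffs = mfcc_frame(frame, sample_rate=16000)
```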

Step 103 specifically comprises: and inputting the spectrogram corresponding to each audio frame after audio preprocessing into a preset feature extraction model to obtain the depth feature corresponding to each audio frame.

Example three

Another possible implementation manner of the embodiment of the present invention is, on the basis of the first embodiment or the second embodiment, further including the operation shown in the third embodiment, wherein,

Step 105 is followed by: post-processing the feature information of the audio to be checked by at least one of the following processing modes to obtain the processed feature information of the audio to be checked.

Wherein the processing modes comprise: feature dimension reduction processing; decorrelation processing.

For the embodiment of the invention, feature dimension reduction is performed on the feature information of the audio to be checked through a preset dimension reduction algorithm. The dimension reduction algorithms include: Principal Component Analysis (PCA), Factor Analysis, and Independent Component Analysis (ICA). In the embodiment of the invention, PCA is taken as an example for reducing the dimension of the feature information of the audio to be checked. The idea of PCA is to map n-dimensional features onto k dimensions (k < n); the k-dimensional features, called principal components, are linear combinations of the original features chosen to maximize the sample variance while keeping the k new features as uncorrelated as possible.

For example, a feature information matrix of 10,000 dimensions may be reduced to 400 dimensions through feature dimension reduction.

For the embodiment of the invention, the feature information space $R^N$ of the audio to be checked is mapped to a feature space $F$ to realize dimension reduction. After the feature information space $R^N$ of the audio to be checked is mapped to the feature space $F$, the covariance matrix is:

$$C = \frac{1}{M} \sum_{j=1}^{M} \phi(x_j)\, \phi(x_j)^{T}$$

where $M$ represents the dimension of the feature space, $\phi(x_j)$ denotes the $j$-th feature mapping, and $T$ denotes the transposition operator.

The eigenvalues and eigenvectors of $C$ satisfy $\lambda\,(\phi(x_k) \cdot V) = (\phi(x_k) \cdot C V)$ for $1 \le k \le M$, where $\lambda$ denotes an eigenvalue and $V$ denotes an eigenvector. The projection of an input feature onto the mapping space vector is:

$$(V^{k} \cdot \phi(x)) = \sum_{i=1}^{M} \alpha_i^{k}\, (\phi(x_i) \cdot \phi(x))$$

where $V^{k}$ denotes the eigenvector, $\alpha_i^{k}$ denotes the normalized coefficient, and $\phi(x)$ denotes the mapped value of the input feature.

For the embodiment of the invention, correlation exists between audio features of adjacent dimensions; when this correlation is not needed, decorrelation processing is performed on the feature information of the audio to be checked. In the embodiment of the invention, feature dimension reduction and decorrelation are applied to the feature information of the audio to be checked, so that the resulting feature information has lower dimensionality and less interference.
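The following sketch shows dimension reduction followed by decorrelation using ordinary linear PCA with whitening. It is a simplification: it does not apply the feature-space mapping used in the formulas above, and the function name `pca_reduce`, the toy data, and the target dimension are assumptions for illustration.

```python
import numpy as np

def pca_reduce(features, k, whiten=True):
    """Project (n_samples, n_dims) features onto the k directions of
    largest variance; with whiten=True the outputs are also rescaled so
    that they are uncorrelated with unit variance."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / (len(features) - 1)   # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]               # indices of the top-k
    projected = centered @ eigvecs[:, order]
    if whiten:
        projected = projected / np.sqrt(eigvals[order] + 1e-12)
    return projected

# Toy usage: 200 correlated 10-dimensional feature vectors reduced to 3.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
reduced = pca_reduce(data, k=3)
```

After whitening, the covariance matrix of `reduced` is approximately the identity, which is one concrete sense in which the retained dimensions are decorrelated.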

Example four

Another possible implementation manner of the embodiment of the present invention further includes, on the basis of the operation shown in the third embodiment, the operation shown in the fourth embodiment, wherein,

Step 106 comprises: determining the audio feature index of the audio to be checked through product quantization according to the processed feature information of the audio to be checked; and performing audio duplicate checking according to the audio feature index of the audio to be checked.

The audio duplicate checking comprises: judging whether the audio feature indexes respectively corresponding to the audios are the same; and if identical audio feature indexes exist, determining that the audios corresponding to the identical audio feature index are duplicates of one another.

For the embodiment of the invention, product quantization comprises two processes: a grouped quantization of the features and a Cartesian product of the categories. Given a data set and a number of classes K, k-means takes as its objective function the sum of the distances from all samples to their class centers, and iteratively minimizes this objective to obtain K class centers and the class to which each sample belongs. With the objective function unchanged, the method of product quantization is as follows:

(1) The data set has K categories; each sample is represented as a vector of dimension d, and the components of each vector are divided into m groups.

(2) Taking one group of components of all vectors as a data set, the k-means algorithm is applied to obtain $K^{1/m}$ class centers. The k-means algorithm is run m times, once per group, so that each group has $K^{1/m}$ class centers; these $K^{1/m}$ class centers are recorded as a set.

(3) The Cartesian product of the m sets obtained above gives the class centers of the whole data set.
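The three steps above can be sketched with a small NumPy implementation. The plain Lloyd-style `kmeans`, the helper names `train_pq` and `encode_pq`, and the toy sizes (d = 8, m = 4, 4 centers per group, hence 4^4 = 256 implicit class centers) are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def kmeans(data, k, n_iter=20, seed=0):
    """Basic Lloyd iterations; returns k class centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):                 # keep the old center if a class is empty
                centers[j] = members.mean(axis=0)
    return centers

def train_pq(data, m, k_sub):
    """Split the d dimensions into m groups and run k-means once per group.
    Requires d to be divisible by m."""
    data = np.asarray(data, dtype=float)
    return [kmeans(group, k_sub) for group in np.split(data, m, axis=1)]

def encode_pq(codebooks, vector):
    """Index of a vector: the nearest sub-center in each group. The tuple
    names one element of the Cartesian product of the m center sets."""
    parts = np.split(np.asarray(vector, dtype=float), len(codebooks))
    return tuple(int(((cb - p) ** 2).sum(axis=1).argmin())
                 for cb, p in zip(codebooks, parts))

# Toy usage: 300 vectors of dimension 8, 4 groups, 4 centers per group.
rng = np.random.default_rng(1)
data = rng.normal(size=(300, 8))
codebooks = train_pq(data, m=4, k_sub=4)
code = encode_pq(codebooks, data[0])
```

Only m small codebooks are stored, yet the tuple returned by `encode_pq` addresses one of `k_sub ** m` class centers of the whole data set.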

For the embodiment of the invention, the audio characteristic index of the to-be-checked stress audio is obtained by multiplying the processed characteristic information of the to-be-checked stress audio by Product Quantization, wherein the audio characteristic index of the to-be-checked stress audio is the corresponding relation between the to-be-checked stress audio and the characteristic index.

For example, the audio to be checked includes audio 1, audio 2, and audio 3, the index values of which are 001, 002, and 003, respectively, and the index values of the audio features corresponding to audio 1, audio 2, and audio 3, respectively, are 1, 2, and 1.

For the embodiment of the invention, if the audio feature indexes respectively corresponding to the two audios are the same, the two audios are represented as the repeated audios.

For example, the audio to be checked includes audio 1, audio 2, and audio 3, whose index values are 001, 002, and 003, respectively, and whose audio feature index values are 1, 2, and 1, respectively. Since the audio feature index values corresponding to audio 1 and audio 3 are both 1 (that is, two different audios have the same audio feature index value), audio 1 and audio 3 are repeated audios.
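The comparison of audio feature indexes can be sketched as a simple grouping step. This is an illustrative sketch only: the function name `find_duplicate_groups` and the use of a dictionary keyed by the audio index values are assumptions, and the sample values are taken from the example above (feature index values 1, 2, and 1 for audios 001, 002, and 003).

```python
from collections import defaultdict


def find_duplicate_groups(feature_index_by_audio):
    """Group audio IDs by feature index; keep groups with more than one audio."""
    groups = defaultdict(list)
    for audio_id, feature_index in feature_index_by_audio.items():
        groups[feature_index].append(audio_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Audio index value -> audio feature index value, as in the example.
example = {"001": 1, "002": 2, "003": 1}
duplicates = find_duplicate_groups(example)
```

With these values, the only group containing more than one audio is the one holding 001 and 003, which both map to feature index 1.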

For the embodiment of the invention, if a plurality of repeated audios exist in the online audios, the audio to be deleted is selected from the repeated audios and deleted.

For the embodiment of the present invention, the audio to be deleted is determined from the repeated audio according to a preset principle, wherein the preset principle includes at least one of the following items: the definition of the audio, the release time of the audio, the click amount of the audio, and the download amount of the audio.

For example, the online audios include two repeated audios, audio 1 and audio 3, where the download amount of audio 1 is 100 and the download amount of audio 3 is 1200; the audio to be deleted is then audio 1.
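A minimal sketch of this selection, assuming the preset principle is the download amount alone and that the audio with the highest download amount is retained. The function name `select_audio_to_delete` and the dictionary layout are hypothetical; other principles named above (definition, release time, click amount) could be substituted for the key.

```python
def select_audio_to_delete(duplicate_group, download_amount):
    """Keep the most-downloaded audio in a group; return the rest for deletion."""
    keep = max(duplicate_group, key=lambda audio_id: download_amount[audio_id])
    return [audio_id for audio_id in duplicate_group if audio_id != keep]


to_delete = select_audio_to_delete(
    ["audio 1", "audio 3"],
    {"audio 1": 100, "audio 3": 1200},
)
```

Here `to_delete` contains only audio 1, matching the example in which the less-downloaded copy is removed.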

For the embodiment of the invention, the audio to be deleted is determined from the repeated audio, and the audio to be deleted is deleted, so that when a user downloads the corresponding audio from the online audio, the audio to be downloaded can be accurately determined and downloaded, the repetition rate of the audio in the online audio can be reduced, the accuracy of searching the audio to be downloaded can be improved, and the experience of the user can be improved.

For the embodiment of the invention, when audio duplicate checking is not performed and a user searches for a required audio by keyword, the website may recommend repeated or highly similar audios to the user, or may rank audios that have not been deduplicated and recommend them to the user. For example, when the user searches for a song by the song name XXX, the website may recommend all recordings of the song from different occasions (including a concert, an album, and a commercial performance), so the recommendations are repetitive. As a result, the accuracy of the website's audio recommendations and audio ranking is low, and the user experience is poor. In the embodiment of the invention, the audio to be deleted is determined from the repeated audios and deleted, so that when the user searches for a required audio by keyword, the required audio, or a ranking of the relevant audios, can be recommended to the user more accurately. For example, if only the album recording of the song is retained, only that recording is recommended to the user, which improves the user experience.

For the embodiment of the invention, the audio to be deleted is determined from the repeated audios, and the audio to be deleted is deleted, namely the repeated audio is deleted from the database, so that the storage capacity in the database can be reduced, and the storage space can be saved.

An embodiment of the present invention provides an audio duplicate checking device. As shown in fig. 3, the device includes: an extraction module 31, a determining module 32, an input module 33, a feature pooling processing module 34, an integration coding module 35, and an audio duplicate checking module 36, wherein:

The extraction module 31 is configured to extract audio frames from the audio to be checked at preset time intervals.

The determining module 32 is configured to determine the spectrogram corresponding to each audio frame extracted by the extraction module 31.

The input module 33 is configured to input the spectrogram corresponding to each audio frame determined by the determining module 32 into a preset feature extraction model, so as to obtain the depth features corresponding to each audio frame.

The feature pooling processing module 34 is configured to perform feature pooling on the depth features corresponding to each audio frame, so as to obtain the pooled depth features corresponding to each audio frame.

The integration coding module 35 is configured to integrate and encode the depth features corresponding to each audio frame after pooling by the feature pooling processing module 34, so as to obtain the feature information of the audio to be checked.

The audio duplicate checking module 36 is configured to perform audio duplicate checking according to the feature information of the audio to be checked obtained by the integration coding module 35.
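The first two modules can be illustrated with a short sketch, under the assumption that the audio is available as a list of samples and that a spectrogram is a sequence of magnitude spectra over short windows of a frame. The function names, the naive DFT, and all parameters (`interval_s`, `frame_s`, `window_size`, `hop`) are illustrative choices and are not specified by the patent; the feature extraction model, pooling, and encoding stages are not shown.

```python
import cmath
import math


def extract_frames(samples, sample_rate, interval_s, frame_s):
    """Cut one frame of frame_s seconds every interval_s seconds."""
    step = int(interval_s * sample_rate)
    size = int(frame_s * sample_rate)
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, step)]


def magnitude_spectrum(window):
    """Naive DFT magnitudes for the non-negative frequency bins."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(frame, window_size, hop):
    """One magnitude spectrum per window; rows are time, columns are frequency."""
    return [magnitude_spectrum(frame[start:start + window_size])
            for start in range(0, len(frame) - window_size + 1, hop)]
```

A practical implementation would normally use an FFT routine and a window function; the naive DFT is used here only to keep the sketch free of external dependencies.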

Further, as shown in fig. 4, the apparatus further includes: an audio pre-processing module 41.

The audio preprocessing module 41 is configured to perform audio preprocessing on the spectrogram corresponding to each audio frame determined by the determining module 32.

The input module 33 is specifically configured to input the spectrogram corresponding to each audio frame subjected to the audio preprocessing by the audio preprocessing module 41 into the preset feature extraction model, so as to obtain the depth feature corresponding to each audio frame.

Further, as shown in fig. 4, the apparatus further includes: a post-processing module 42.

The post-processing module 42 is configured to perform post-processing on the feature information of the audio to be checked through at least one of the following processing manners to obtain the processed feature information of the audio to be checked, where the processing manners include: feature dimension reduction processing; and decorrelation processing.

Further, as shown in fig. 4, the audio duplication checking module 36 includes: a determining unit 361 and an audio duplication checking unit 362.

A determining unit 361, configured to determine, according to the processed feature information of the to-be-checked audio, an audio feature index of the to-be-checked audio by Product Quantization;

an audio duplicate checking unit 362, configured to perform audio duplicate checking according to the audio feature index of the audio to be checked determined by the determining unit 361.

Specifically, the audio duplication checking module 36 is specifically configured to determine whether the audio feature indexes respectively corresponding to the audios are the same.

The audio duplication checking module 36 is further specifically configured to determine, when the same audio feature index exists, that the audios corresponding to that same audio feature index are repeated audios.

Further, as shown in fig. 4, the apparatus further includes: a deleting module 43.

The determining module 32 is further configured to determine, from the repeated audios, an audio to be deleted.

The deleting module 43 is configured to delete the audio to be deleted determined by the determining module 32.

The embodiment of the invention provides an audio duplicate checking device, which extracts audio frames from audio to be checked at intervals of preset time, then determines spectrogram corresponding to each audio frame, inputs the spectrogram corresponding to each audio frame into a preset feature extraction model to obtain depth features corresponding to each audio frame, then performs feature pooling on the depth features corresponding to each audio frame to obtain depth features corresponding to each pooled audio frame, then performs integration and coding on the depth features corresponding to each pooled audio frame to obtain feature information of the audio to be checked, and then performs audio duplicate checking according to the feature information of the audio to be checked. The embodiment of the invention can determine the repeated audio information or the audio information with high similarity in the uploaded audio information by checking the duplicate of the audio information, for example, the uploaded audio information, thereby improving the ranking accuracy of the website on the audio information.

The audio duplicate checking device provided by the embodiment of the present invention can implement the method embodiment provided above, and for specific function implementation, reference is made to the description in the method embodiment, which is not repeated herein.

The embodiment of the invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the audio duplicate checking method described above is implemented.

The embodiment of the invention provides a computer readable storage medium, wherein audio frames are extracted from audio to be checked at intervals of preset time, spectrogram corresponding to each audio frame is determined, spectrogram corresponding to each audio frame is input into a preset feature extraction model to obtain depth features corresponding to each audio frame, feature pooling is carried out on the depth features corresponding to each audio frame to obtain depth features corresponding to each pooled audio frame, feature information of the audio to be checked is obtained by integrating and encoding the depth features corresponding to each pooled audio frame, and audio checking is carried out according to the feature information of the audio to be checked. The embodiment of the invention can determine the repeated audio information or the audio information with high similarity in the uploaded audio information by checking the duplicate of the audio information, for example, the uploaded audio information, thereby improving the ranking accuracy of the website on the audio information.

The computer-readable storage medium provided in the embodiments of the present invention can implement the method embodiments provided above, and for specific function implementation, reference is made to the description in the method embodiments, which is not repeated herein.

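An embodiment of the present invention provides a computing device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the audio duplicate checking method described above.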

The embodiment of the invention provides a computing device, wherein audio frames are extracted from the audio to be checked at preset time intervals; the spectrogram corresponding to each audio frame is then determined and input into a preset feature extraction model to obtain the depth features corresponding to each audio frame; feature pooling is performed on the depth features corresponding to each audio frame to obtain the pooled depth features corresponding to each audio frame; the pooled depth features are integrated and encoded to obtain the feature information of the audio to be checked; and audio duplicate checking is performed according to the feature information of the audio to be checked. By performing duplicate checking on audio information, for example uploaded audio information, the embodiment of the invention can determine the repeated or highly similar audio information among the uploaded audio information, thereby improving the accuracy with which the website ranks the audio information.

The computing device provided in the embodiment of the present invention may implement the method embodiment provided above, and for specific function implementation, reference is made to the description in the method embodiment, which is not described herein again.

Those skilled in the art will appreciate that the present invention includes apparatus for performing one or more of the operations described in the present application. These devices may be specially designed and manufactured for the required purposes, or they may comprise known devices in general-purpose computers. These devices have stored therein computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device-readable (e.g., computer-readable) medium, including, but not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROMs (Read-Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read-Only Memories), EEPROMs (Electrically Erasable Programmable Read-Only Memories), flash memories, magnetic cards, or optical cards, or in any type of medium suitable for storing electronic instructions, each coupled to a bus. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).

It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. Those skilled in the art will appreciate that the computer program instructions may be implemented by a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the features specified in the block or blocks of the block diagrams and/or flowchart illustrations of the present disclosure.

Those of skill in the art will appreciate that various operations, methods, steps in the processes, acts, or solutions discussed in the present application may be alternated, modified, combined, or deleted. Further, various operations, methods, steps in the flows, which have been discussed in the present application, may be interchanged, modified, rearranged, decomposed, combined, or eliminated. Further, steps, measures, schemes in the various operations, methods, procedures disclosed in the prior art and the present invention can also be alternated, changed, rearranged, decomposed, combined, or deleted.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for audio duplicate checking, comprising:

determining a spectrogram corresponding to each audio frame;

integrating and coding the depth features respectively corresponding to the audio frames after the pooling processing to obtain the feature information of the audio to be checked;

2. The method of claim 1, wherein the predetermined feature extraction model is obtained by training a deep convolutional neural network.

3. The method according to claim 1 or 2, wherein before the step of inputting the spectrogram corresponding to each audio frame into the preset feature extraction model to obtain the depth feature corresponding to each audio frame, the method further comprises:

the step of inputting the spectrogram corresponding to each audio frame into a preset feature extraction model to obtain the depth feature corresponding to each audio frame includes:

4. The method according to claim 1 or 2, wherein after the step of obtaining the feature information of the audio to be reviewed by integrating and encoding the depth features corresponding to the respective audio frames after the pooling, the method further comprises:

post-processing the feature information of the audio to be checked by at least one of the following processing manners to obtain the processed feature information of the audio to be checked, wherein the processing manners comprise: feature dimension reduction processing; and decorrelation processing.

5. The method according to claim 4, wherein the step of performing audio duplicate checking according to the feature information of the audio to be checked comprises:

determining, according to the processed feature information of the audio to be checked and through Product Quantization, an audio feature index of the audio to be checked;

6. The method of claim 5, wherein the audio duplication checking comprises:

7. The method of claim 6, further comprising:

and determining the audio to be deleted from the repeated audio, and deleting the audio to be deleted.

8. An apparatus for audio duplication checking, comprising:

the integration coding module is used for integrating and coding the depth features respectively corresponding to the audio frames subjected to pooling processing by the feature pooling processing module to obtain feature information of the audio to be checked;

and the audio duplicate checking module is used for carrying out audio duplicate checking according to the characteristic information of the audio to be checked, which is obtained by the integration of the integration coding module.

9. The apparatus of claim 8, wherein the predetermined feature extraction model is obtained by training a deep convolutional neural network.

10. The apparatus of claim 8 or 9, further comprising: an audio preprocessing module;

the audio preprocessing module is configured to perform audio preprocessing on the spectrogram corresponding to each audio frame determined by the determining module, where the audio preprocessing includes at least one of the following: size normalization processing; and audio denoising processing;

the input module is specifically configured to input a spectrogram corresponding to each audio frame subjected to audio preprocessing by the audio preprocessing module into a preset feature extraction model, so as to obtain depth features corresponding to each audio frame.

11. The apparatus of claim 8 or 9, further comprising: a post-processing module;

the post-processing module is configured to perform post-processing on the feature information of the audio to be checked through at least one of the following processing manners to obtain the processed feature information of the audio to be checked, where the processing manners include: feature dimension reduction processing; and decorrelation processing.

12. The apparatus of claim 11, wherein the audio duplicate checking module comprises: a determining unit and an audio duplicate checking unit;

a determining unit, configured to determine, according to the processed feature information of the to-be-checked audio, an audio feature index of the to-be-checked audio by Product Quantization;

13. The apparatus of claim 12,

the audio duplicate checking module is specifically configured to determine whether audio feature indexes respectively corresponding to the audios are the same;

the audio duplication checking module is further specifically configured to determine, when the same audio feature index exists, that the audios corresponding to that same audio feature index are repeated audios.

14. The apparatus of claim 13, further comprising: a deletion module;

the determining module is further configured to determine, from the repeated audios, an audio to be deleted;

15. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 7.

16. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the audio duplicate checking method according to any one of claims 1-7.