CN111445922A - Audio matching method and device, computer equipment and storage medium - Google Patents
- Publication number: CN111445922A (Application number CN202010202378.5A)
- Authority: CN (China)
- Prior art keywords: audio, sequence, frequency, vector, correlation
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
- G10L25/51—Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/121—Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
- G10H2240/131—Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
- G10H2240/141—Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an audio matching method and apparatus, a computer device, and a storage medium, and relates to the technical field of audio. The method includes: acquiring a first feature sequence of a first audio and a second feature sequence of a second audio; calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence and output a cross-correlation vector sequence; calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector; and calling a classification layer to perform prediction processing on the prediction vector and output the similarity probability of the first audio and the second audio. Because the similarity of the two audios is computed with a neural-network-based matching approach, the similarity between different songs can be calculated, so that a higher-precision similarity result is obtained between different songs.
Description
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to an audio matching method and apparatus, a computer device, and a storage medium.
Background
Audio matching is a technique for measuring the similarity between two pieces of audio. By matching type, audio matching includes audio segment matching and full audio matching. Audio segment matching means that, given an audio segment P, it is judged whether the segment P belongs to a part of an audio D. Full audio matching means that, given an audio A, the similarity between audio A and an audio B is calculated.
The related art provides audio fingerprinting, which selects the more salient time-frequency points in an audio signal, encodes them into a digital sequence by hash coding, and uses the digital sequence as the audio fingerprint. Audio fingerprinting thus converts the audio matching problem into a retrieval problem between different digital sequences.
Because audio segment matching mainly aims at matching an audio segment against the full audio of the same song, signal-processing-based audio fingerprinting works well in the audio segment matching scenario. In a full-audio matching scenario, however, what is more often computed is the similarity between two different songs; here the applicability of audio fingerprinting is limited and a good matching effect cannot be obtained.
Disclosure of Invention
The embodiment of the application provides an audio matching method, an audio matching device, computer equipment and a storage medium, and provides a matching scheme suitable for a full audio matching scene. The technical scheme is as follows:
according to an aspect of the present application, there is provided an audio matching method, comprising:
acquiring a first characteristic sequence of a first audio and a second characteristic sequence of a second audio;
calling a sequence cross-correlation layer to perform cross-correlation processing on the first characteristic sequence and the second characteristic sequence, and outputting a cross-correlation vector sequence;
calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector;
and calling a classification layer to perform prediction processing on the prediction vector, and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided an audio matching apparatus, comprising:
the acquisition module is used for acquiring a first characteristic sequence of a first audio and a second characteristic sequence of a second audio;
the sequence cross-correlation module is used for performing cross-correlation processing on the first characteristic sequence and the second characteristic sequence and outputting a cross-correlation vector sequence;
the characteristic extraction module is used for carrying out characteristic extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for performing prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided a terminal, including: a processor and a memory storing at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded by the processor and performs the audio matching method as described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement an audio matching method as described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the similarity of the two audios is calculated through an audio full-matching model comprising a sequence cross-correlation layer, a feature extraction layer and a classification layer, and the potential features and deep features of the audios can be mined out through the audio matching model adopting a neural network architecture, so that the similarity between different songs can be calculated, and a similarity calculation result with high precision is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram of an audio matching system provided by an exemplary embodiment of the present application;
FIG. 2 is a block diagram of an audio matching model provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a time domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of frequency domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 9 is a flow chart of offline matching provided by an exemplary embodiment of the present application;
FIG. 10 illustrates a schematic diagram of a song recommendation scenario provided by an exemplary embodiment of the present application;
FIG. 11 illustrates a schematic diagram of a song scoring scenario provided by an exemplary embodiment of the present application;
FIG. 12 is a flow chart of a model training method provided by an exemplary embodiment of the present application;
fig. 13 is a block diagram illustrating an exemplary embodiment of an audio matching apparatus according to the present application;
fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance.
FIG. 1 shows a block diagram of a computer system provided in an exemplary embodiment of the present application. The computer system 100 includes: a terminal 120 and a server 140.
The terminal 120 runs a platform supporting audio playback, and the platform may be any one of an audio playing program or applet (a program that runs depending on a host program), an audio playing web page, a video playing program or applet, and a video playing web page.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
The server 140 includes at least one of a single server, a plurality of servers, a cloud computing platform and a virtualization center. Illustratively, the server includes a processor 144 and a memory 142, the memory 142 in turn including a sequence cross-correlation layer 1421, a feature extraction layer 1422, and a classification layer 1423. In some embodiments, the server 140 obtains the audio signal of the audio to be matched from the terminal 120 or from the memory 142.
The terminal 120 generally refers to one of a plurality of terminals; for example, there may be only one terminal, or tens or hundreds of terminals, or more. In this embodiment, the terminal 120 is taken as an example only. The types of the terminal include at least one of a smartphone, a tablet, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer. The number and the types of the terminals are not limited in the embodiments of the present application.
Fig. 2 shows a block diagram of an audio matching model 200 provided in an exemplary embodiment of the present application. The audio matching model 200 includes: sequence cross-correlation layer 220, feature extraction layer 240, and classification layer 260.
The output of the sequence cross-correlation layer 220 is connected to the input of the feature extraction layer 240, and the output of the feature extraction layer 240 is connected to the input of the classification layer 260.
the sequence cross-correlation layer 220 is configured to perform cross-correlation processing on the first feature sequence of the first audio and the second feature sequence of the second audio, and output a cross-correlation vector sequence.
The feature extraction layer 240 is configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector. Illustratively, the feature extraction layer 240 includes: time-domain convolutional layers 242 and frequency-domain convolutional layers 244, the time-domain convolutional layers 242 being used to perform time-domain convolution operations, and the frequency-domain convolutional layers 244 being used to perform frequency-domain convolution operations. Optionally, the feature extraction layer 240 further includes: a time domain pooling layer 246 and a frequency domain pooling layer 248, the time domain pooling layer 246 being used to perform time domain pooling operations and the frequency domain pooling layer 248 being used to perform frequency domain pooling operations. In one possible design, time domain convolutional layer 242 and time domain pooling layer 246 are not provided, but frequency domain convolutional layer 244 and frequency domain pooling layer 248 are provided. In one possible design, time domain convolutional layer 242, time domain pooling layer 246, frequency domain convolutional layer 244, and frequency domain pooling layer 248 are provided simultaneously.
The classification layer 260 is configured to perform prediction processing on the prediction vector and output a similarity probability between the first audio and the second audio.
Fig. 3 shows a flowchart of an audio matching method provided by an exemplary embodiment of the present application. This embodiment is illustrated by applying the method to the server shown in fig. 1. The method comprises the following steps:
The feature sequence of an audio includes N frequency domain vectors arranged in time order. Each frequency domain vector is M-dimensional; each dimension represents the magnitude of the audio at a frequency F_M, and the frequency difference between adjacent dimensions is the same. N and M are integers greater than 1. Optionally, the feature sequence is obtained as follows:
The audio is sampled in the time dimension at a preset sampling interval (e.g., every 0.1 second) to obtain a discrete time sequence T1~Tn, where each value T represents the magnitude of the audio at that sampling point.
The time sequence is grouped by a fixed time period (e.g., every 3 seconds) to obtain a plurality of time series groups G1~GN. Each time series group Gi contains a plurality of sampling points (for example, 3 seconds / 0.1 second = 30 sampling points), where i is an integer not greater than N.
The sampling points belonging to the same time series group Gi are transformed into one frequency domain vector, so as to obtain N frequency domain vectors arranged in time order. That is, each time series group is transformed from the time domain to the frequency domain to obtain the frequency domain sequence corresponding to that group Gi. The time-frequency transformation method includes, but is not limited to, FFT (Fast Fourier Transform), DFT (Discrete Fourier Transform) and MFCC (Mel-scale Frequency Cepstral Coefficients). Each frequency domain sequence represents the distribution of the different frequencies contained in the same time series group Gi. The N frequency domain sequences are then sampled at different sampling frequencies to obtain the N frequency domain vectors, where the different sampling frequencies are obtained by equally dividing the range between the upper and lower frequency limits of the audio into a plurality of frequency points.
The N frequency domain vectors arranged in time order form a two-dimensional M x N matrix. The axis of the matrix corresponding to N represents the time domain direction, and the axis corresponding to M represents the frequency domain direction. M is the quotient of the frequency range (between the upper and lower frequency limits) and the frequency sampling interval.
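A minimal NumPy sketch of this feature sequence extraction might look as follows; the FFT-based transform, sampling rate, group length and frequency grid here are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np

def feature_sequence(samples, sample_rate=8000, group_seconds=3.0,
                     f_low=110.0, f_high=3520.0, num_bins=32):
    """samples: 1-D array of audio sample values at `sample_rate` Hz.
    Returns an (N, M) array: N time-ordered frequency domain vectors of M = num_bins dimensions."""
    per_group = int(group_seconds * sample_rate)            # samples per time series group G_i
    num_groups = len(samples) // per_group                  # N
    target_freqs = np.linspace(f_low, f_high, num_bins)     # M equally spaced frequency points
    vectors = []
    for i in range(num_groups):
        group = samples[i * per_group:(i + 1) * per_group]  # time series group G_i
        spectrum = np.abs(np.fft.rfft(group))               # time domain -> frequency domain
        freqs = np.fft.rfftfreq(per_group, d=1.0 / sample_rate)
        # Sample the spectrum at the M target frequencies to form one frequency domain vector.
        vectors.append(np.interp(target_freqs, freqs, spectrum))
    return np.stack(vectors)                                # shape (N, M)
```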
The server obtains a first feature sequence of the first audio and a second feature sequence of the second audio. The first signature sequence comprises n first frequency-domain vectors and the second signature sequence comprises q second frequency-domain vectors. The ordering order and physical meaning of the first frequency domain vector and the second frequency domain vector are the same, for example, the first frequency domain vector and the second frequency domain vector are arranged according to the time domain order, and the first frequency domain vector and the second frequency domain vector are m-dimensional vectors.
The cross-correlation processing is a processing operation for measuring the correlation between the first feature sequence and the second feature sequence.
Illustratively, the server calculates a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
the feature extraction processing includes: at least one of a convolution operation and a pooling operation. Wherein the convolution operation may be a multi-scale convolution operation.
Divided by dimension, the feature extraction processing includes: at least one of time domain feature extraction processing and frequency domain feature extraction processing. The time domain feature extraction processing includes at least one of a time-domain convolution operation and a time-domain pooling operation. The frequency domain feature extraction processing includes at least one of a frequency-domain convolution operation and a frequency-domain pooling operation.
In one possible design, a feature extraction layer is invoked to perform time domain feature extraction processing and frequency domain feature extraction processing on the cross-correlation vector sequence. In another possible design, a feature extraction layer is called to perform frequency domain feature extraction processing on the cross-correlation vector sequence.
In step 308, a classification layer is called to perform prediction processing on the prediction vector, and the similarity probability of the first audio and the second audio is output.
Optionally, the classification layer is a softmax function, the input is a prediction vector for the first audio and the second audio, and the output is a probability of similarity for the first audio and the second audio. The server performs at least one of audio recommendation, audio scoring, audio classification, and audio matching according to the similarity probability of the two audios.
In the personalized recommendation scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with higher similarity to the first audio, and recommending the second audio to the client.
In the audio scoring scene, the server is configured to obtain a first feature vector of a first audio provided by the client and a second feature vector of a second audio in the audio library, calculate the similarity between the first audio and the second audio using the audio matching model, and feed back a score based on the similarity to the client.
In the audio matching scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with extremely high similarity to the first audio, and recommending audio information (information such as song title, singer, style, year, record company and the like) of the second audio to the client.
In the audio classification scene, the server is used for calculating the similarity between every two songs in the audio library, and classifying the songs with the similarity higher than a threshold value into the same class cluster so as to divide the songs into the same class.
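As an illustration of the audio classification scene, the following sketch groups songs whose pairwise similarity exceeds a threshold into the same class cluster; the similarity function stands in for the audio matching model's output, and the threshold value is an assumption.

```python
def cluster_by_similarity(songs, similarity, threshold=0.8):
    """songs: iterable of audio identifiers; similarity(a, b): assumed callable returning
    the similarity probability predicted by the audio matching model."""
    clusters = []                                   # each cluster is a list of songs
    for song in songs:
        placed = False
        for cluster in clusters:
            # Join the first cluster that already contains a sufficiently similar song.
            if any(similarity(song, member) > threshold for member in cluster):
                cluster.append(song)
                placed = True
                break
        if not placed:
            clusters.append([song])                 # start a new class cluster
    return clusters
```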
In summary, in the audio matching method provided in this embodiment, the similarity between two audios is calculated through the audio matching model including the sequence cross-correlation layer, the feature extraction layer, and the classification layer, and as the potential features and deep features of the audios can be found out by using the audio matching model of the neural network architecture, the similarity between different songs can be calculated, so as to obtain a similarity calculation result with higher accuracy.
Fig. 4 shows a flowchart of an audio matching method provided by another exemplary embodiment of the present application. This embodiment is illustrated by applying the method to the server shown in fig. 1. The method comprises the following steps:
the first sequence of features of the first audio comprises: n first frequency domain vectors arranged in time sequence. Each first frequency domain vector is M-dimensional, and each dimension represents the frequency F of the audioMThe frequency difference between adjacent dimensions is the same. Wherein N and M are integers greater than 1.
The second feature sequence of the second audio comprises: q second frequency domain vectors arranged in time sequence. Each second frequency domain vector is M-dimensional, each dimension representing the frequency of the audio at a frequency FMThe frequency difference between adjacent dimensions is the same. Wherein Q and M are integers greater than 1.
Illustratively, as shown in FIG. 5, the audio signal is first sampled in the time dimension, for example once every 0.1 s, to obtain a discrete time sequence T1~Tn, where each value represents the magnitude of the audio at that sampling point. The values are then grouped by a fixed time period (e.g., 3 s); for example, with a 3 s period and a 0.1 s sampling interval, each group contains 30 values, so that T1~T30 form one group, called G1, T31~T60 form G2, and so on. A frequency domain transform (including but not limited to FFT, MFCC, DFT, etc.) is then performed on each group of the time sequence to obtain a frequency domain signal, which represents the distribution of the different frequencies contained in that group; this frequency signal is also sampled, for example every 10 Hz, to obtain a discrete frequency sequence. Assuming the upper and lower frequency limits are 0~f, each frequency sequence has f/10 values, and each Gi can be represented as such a frequency sequence; the only difference is that the same frequency takes different values in different Gi. For music, some parts are heavy in bass, so the low-frequency values of those Gi are large, while other parts are high-pitched, so the high-frequency values of those Gi are large. Therefore Gi can be represented either as a time sequence T1~T30 or as a frequency sequence, and together these frequency sequences form a spectrogram. The spectrogram illustrated in FIG. 5 is obtained from the decomposition of real audio: the horizontal axis represents time, with a time slice of about 1.75 s, that is, a slice is cut every 1.75 s; the vertical axis represents the frequencies corresponding to each time slice, with upper and lower frequency limits of 110 Hz~3520 Hz, and the gray level represents the value at the different frequencies.
The above processing is performed on both the audio signal of the first audio and the audio signal of the second audio, so that the first feature sequence of the first audio and the second feature sequence of the second audio are obtained.
Suppose the first feature sequence of the first audio includes n first frequency domain vectors {G1, G2, ..., Gn} arranged in time order, where each Gi is a frequency domain vector. To measure the correlation between the ith first frequency domain vector Gi and the q second frequency domain vectors H1~Hq, the following correlation calculation formula is introduced for the ith first frequency domain vector:
score(Gi) = (H1*Gi + H2*Gi + ... + Hq*Gi) / (H1^2 + H2^2 + ... + Hq^2);
That is, the server calculates the sum of the products of the ith first frequency domain vector Gi with the q second frequency domain vectors H1~Hq, and the sum of squares of the q second frequency domain vectors H1~Hq, and determines the quotient of the product sum and the square sum as the ith correlation score of the ith first frequency domain vector relative to the q second frequency domain vectors H1~Hq.
In this way, a score(Gi) is calculated for each Gi, so that after correlation fusion the first cross-correlation vector sequence includes {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Each score(Gi) can be regarded as a correlation weight of the original Gi, so the output is the first cross-correlation vector sequence after applying these weights, denoted as {G'1, ..., G'n}.
Suppose the second feature sequence of the second audio includes q second frequency domain vectors {H1, H2, ..., Hq} arranged in time order, where each Hj is a frequency domain vector. To measure the correlation between the jth second frequency domain vector Hj and the n first frequency domain vectors G1~Gn, the following correlation calculation formula is introduced for the jth second frequency domain vector:
score(Hj) = (G1*Hj + G2*Hj + ... + Gn*Hj) / (G1^2 + G2^2 + ... + Gn^2);
That is, the server calculates the sum of the products of the jth second frequency domain vector Hj with the n first frequency domain vectors G1~Gn, and the sum of squares of the n first frequency domain vectors G1~Gn, and determines the quotient of the product sum and the square sum as the jth correlation score of the jth second frequency domain vector Hj relative to the n first frequency domain vectors G1~Gn.
In this way, a score(Hj) is calculated for each Hj, so that after correlation fusion the second cross-correlation vector sequence includes {H1*score(H1), ..., Hj*score(Hj), ..., Hq*score(Hq)}. Each score(Hj) can be regarded as a correlation weight of the original Hj, so the output is the second cross-correlation vector sequence after applying these weights, denoted as {H'1, ..., H'q}.
Illustratively, the first cross-correlation vector sequence {G'1, ..., G'n} and the second cross-correlation vector sequence {H'1, ..., H'q} are spliced to obtain the cross-correlation vector sequence {G'1, ..., G'n, H'1, ..., H'q}. The cross-correlation vector sequence includes n + q cross-correlation vectors, that is, a vector sequence formed by splicing the n first cross-correlation vectors and the q second cross-correlation vectors.
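A NumPy sketch of the sequence cross-correlation layer, following the two formulas above, could look like this. It assumes that "*" in the formulas denotes the dot product between frequency domain vectors; G and H hold the two feature sequences row by row.

```python
import numpy as np

def sequence_cross_correlation(G, H):
    """G: (n, m) first feature sequence; H: (q, m) second feature sequence.
    Returns the (n + q, m) cross-correlation vector sequence {G'1..G'n, H'1..H'q}."""
    # score(Gi) = (H1*Gi + ... + Hq*Gi) / (H1^2 + ... + Hq^2), one scalar per Gi
    score_G = (H @ G.T).sum(axis=0) / np.square(H).sum()
    # score(Hj) = (G1*Hj + ... + Gn*Hj) / (G1^2 + ... + Gn^2), one scalar per Hj
    score_H = (G @ H.T).sum(axis=0) / np.square(G).sum()
    G_weighted = G * score_G[:, None]              # {G'1, ..., G'n}
    H_weighted = H * score_H[:, None]              # {H'1, ..., H'q}
    return np.concatenate([G_weighted, H_weighted], axis=0)
```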
The frequency domain direction means performing the frequency domain convolution processing on the cross-correlation vector sequence along the direction of increasing (or decreasing) sampling frequency to obtain frequency domain convolution vectors.
Optionally, the cross-correlation vector sequence can be viewed as a matrix of M rows by (N + Q) columns, each row being an (N + Q)-dimensional time domain vector. Assume that the size of the frequency domain convolution kernel is P x (N + Q), where P is smaller than M. The frequency domain direction means convolving P adjacent time domain vectors along the 0~M direction.
As shown in FIG. 5, assume that the size of the frequency domain convolution kernel is 3 x (N + Q). In the first convolution along the frequency domain direction, the time domain vectors f1, f2 and f3 are convolved to obtain f'1; in the second convolution, the time domain vectors f2, f3 and f4 are convolved to obtain f'2; in the third convolution, the time domain vectors f3, f4 and f5 are convolved to obtain f'3; and so on, until finally M - 3 + 1 frequency domain convolution vectors f'i are obtained.
Each f'i is a new time domain vector obtained by compressing P time domain vectors after convolution, and is used to represent the correlation between the P time domain vectors before convolution.
Optionally, the server directly outputs the sequence of M - 3 + 1 frequency domain convolution vectors as the prediction vector.
Optionally, the server performs pooling processing on the frequency domain convolution vector sequence along the frequency domain direction, and determines one frequency domain pooling vector obtained by pooling as the prediction vector.
As shown in FIG. 6, the frequency domain pooling operation is also performed along the frequency domain direction, and the pooling dimension coincides with the vector dimension. After the frequency domain pooling operation, the above M - P + 1 frequency domain convolution vectors f'1, f'2, ..., f'(M-P+1) are compressed into one pooled frequency domain convolution vector f''. That is, the pooling yields a single vector, so the physical meaning of the pooled frequency domain convolution vector f'' is preserved: it can still be regarded as a new vector compressed from the frequency domain dimension. The frequency domain pooling vector f'' is used to represent the condensed nature of the plurality of frequency domain convolution vectors.
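A sketch of the frequency domain convolution and pooling described above, assuming the cross-correlation sequence is arranged as an M x (N+Q) matrix whose rows run along the frequency direction; treating the kernel as per-row weights summed over P adjacent rows, and using max pooling, are interpretations made for illustration.

```python
import numpy as np

def freq_conv(corr_matrix, kernel):
    """corr_matrix: (M, N+Q) cross-correlation matrix, rows along the frequency direction;
    kernel: (P, N+Q) frequency domain convolution kernel. Returns (M - P + 1, N+Q)."""
    M, _ = corr_matrix.shape
    P = kernel.shape[0]
    outputs = []
    for i in range(M - P + 1):
        window = corr_matrix[i:i + P, :]               # P adjacent time domain vectors
        outputs.append((window * kernel).sum(axis=0))  # compress them into one vector f'_i
    return np.stack(outputs)

def freq_pool(conv_vectors):
    """Pool the frequency domain convolution vectors along the frequency direction into
    a single pooled vector f''."""
    return conv_vectors.max(axis=0)
```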
In step 308, a classification layer is called to perform prediction processing on the prediction vector, and the similarity probability of the first audio and the second audio is output.
Optionally, the classification layer is a softmax function, the input is a prediction vector for the first audio and the second audio, and the output is a probability of similarity for the first audio and the second audio.
In the above embodiment, only frequency domain feature extraction is performed in the feature extraction process; in other embodiments, time domain feature extraction may also be performed, which is not limited herein. Time domain feature extraction differs from frequency domain feature extraction only in the extraction direction; the extraction method is the same.
In an alternative embodiment based on FIG. 4, there are K frequency domain convolution kernels, where K is an integer greater than 1. Step 3061 is alternatively implemented as step 306a, and step 3062 is alternatively implemented as step 306b, as shown in FIG. 7 below:
In step 306a, K different frequency domain convolution kernels are respectively called to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction, so as to obtain K frequency domain convolution vector sequences of different scales. The number of frequency domain convolution vectors at each scale can be more than one, for example M - P + 1.
And step 306b, pooling the frequency domain convolution vector sequences of K different scales along the frequency domain direction, and determining K frequency domain pooling vectors obtained by pooling as prediction vectors.
Optionally, pooling is performed on the frequency domain convolution vector sequence under each scale, so as to obtain a pooled frequency domain pooling vector respectively. Pooling is performed on the frequency domain convolution vector sequences under the K different scales, and finally K frequency domain pooling vectors are obtained.
The K frequency domain pooling vectors are spliced in order of scale, from small to large or from large to small, to obtain the prediction vector {f'1, f'2, ..., f'k} or {f'k, f'(k-1), ..., f'1}.
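Building on the previous sketch (freq_conv and freq_pool), the multi-scale variant with K kernels might be composed as follows; the kernel heights in the usage example are illustrative assumptions.

```python
import numpy as np

def multi_scale_prediction_vector(corr_matrix, kernels):
    """kernels: list of K frequency domain convolution kernels, the k-th of shape (P_k, N+Q).
    Returns the prediction vector obtained by splicing the K pooled vectors in kernel order."""
    pooled = []
    for kernel in kernels:                               # one scale per kernel
        conv_vectors = freq_conv(corr_matrix, kernel)    # (M - P_k + 1, N+Q) at this scale
        pooled.append(freq_pool(conv_vectors))           # one frequency domain pooling vector
    return np.concatenate(pooled)

# Example usage with three scales (P = 2, 3, 4):
# kernels = [np.ones((p, corr_matrix.shape[1])) / p for p in (2, 3, 4)]
# prediction_vector = multi_scale_prediction_vector(corr_matrix, kernels)
```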
In summary, because the multi-scale vector sequence represents the potential features and deep features of the audio by using the frequency domain vectors under multiple scales, the similarity between two audios is calculated by using the multi-scale vector sequence of the two audios as input and using a matching method based on a neural network, so that the similarity between different songs can be calculated, and thus a similarity calculation result with higher precision is obtained.
It should be noted that, in an alternative embodiment, the above-mentioned "convolution + pooling + multiscale" may be implemented in combination, as in the embodiment shown in fig. 8:
In general, the spectrogram of the first audio and the spectrogram of the second audio are first cross-correlated in the sequence cross-correlation layer 220, and the resulting cross-correlation vector sequence is output to the multi-scale frequency domain convolution layer 242 for multi-scale frequency domain convolution, so as to obtain a multi-scale frequency domain representation. The multi-scale frequency domain representation is then input to the multi-scale pooling layer 244 for multi-scale pooling, and finally output to the classification layer; the output similarity probability represents whether the two pieces of audio are similar. The operational details of each module are set forth below.
Sequence cross-correlation layer 220:
The present application represents spectrogram A as {G1, G2, ..., Gn}, where each Gi is a frequency distribution and can be regarded as a vector, and represents spectrogram B as {H1, ..., Hq}, where Hj has the same dimension and the same physical meaning as Gi, and each value of the vector represents the magnitude of a frequency component. To measure the cross-correlation of the two pieces of audio from the time perspective, the following correlation calculation formula is introduced in this application:
score(Gi)=(H1*Gi+H2*Gi...+Hq*Gi)/(H1^2+H2^2+...+Hq^2)
In this way, the present application obtains a score(Gi) for each Gi, so that after correlation fusion, the output of the entire time series correlation module is {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Each score(Gi) can be regarded as a correlation weight of the original Gi, so the output is the spectrum sequence after applying the weights, denoted as {G'1, ..., G'n}.
Similarly, a score(Hj) is obtained for each Hj, as follows:
score(Hj)=(G1*Hj+G2*Hj...+Gn*Hj)/(G1^2+G2^2+...+Gn^2)
By the same method, the present application obtains {H'1, ..., H'q}. The two sequences are then spliced together, i.e. {G'1, ..., G'n, H'1, ..., H'q}, and input to the multi-scale frequency domain convolution layer 242.
Multi-scale frequency domain convolutional layer 242
The cross-correlation vector sequence is operated on from the frequency domain through convolution kernels of multiple scales, so as to fully extract the frequency domain features of the audio.
Since the sequence cross-correlation layer 220 has already performed cross-correlation processing in time, there is no need to perform a time domain convolution operation; only the frequency domain convolution operation is needed. Moreover, the "listening feel" of human ears for music is affected by frequency.
Assume that the multi-scale frequency domain convolution layer 242 has frequency domain convolution representations at three scales, f1, f2 and f3. The present application pools these two-dimensional frequency domain convolution representations separately. As shown in FIG. 6, f'1 to f'4 are the results of frequency domain convolution performed by the frequency domain convolution kernel at the same scale; that is, a certain fi is composed of the 4 frequency domain convolution vectors f'1 to f'4, and the pooling operation at that scale is then performed.
The resulting f'' can be viewed as a "compression" of the original 4 frequency domain convolution vectors in the time dimension (since f'1 to f'4 each represent a time sequence, 4 time sequences become 1 time sequence, hence the "compression" in the time dimension).
In the present application, a pooling operation is performed on the frequency domain convolution representation fi at each scale to obtain a frequency domain pooling vector f''i, and then all (e.g., three) frequency domain pooling vectors f''i are spliced together into one large vector or vector sequence, which is input to the classification layer 260.
Classification layer 260
The classification layer 260 may be a softmax function, and the output Y is the similarity probability of two pieces of audio, representing the degree of matching of the two pieces of audio.
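A minimal sketch of such a classification layer is a linear map followed by a softmax over two classes; the weight matrix and bias below stand in for trained parameters and are assumptions for illustration.

```python
import numpy as np

def classify(prediction_vector, weights, bias):
    """weights: (len(prediction_vector), 2) trained weights (assumed); bias: (2,).
    Returns the probability Y that the two pieces of audio are similar."""
    logits = prediction_vector @ weights + bias
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    probs = exp / exp.sum()
    return probs[1]                          # index 1 taken as the "similar" class
```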
As shown in FIG. 9, when the amount of music in the music library is at the million level or above, it is suitable to use the audio matching model in an offline matching scenario to predict the similarity probability between two pieces of full audio; when the amount of music in the music library is between the tens and the thousands, an audio matching model suited to an online matching scenario predicts the similarity probability between two pieces of full audio; and when the amount of music in the music library is between the thousands and the millions, an audio matching model in a near-line matching scenario is suitable for predicting the similarity probability between two pieces of full audio. The audio matching model provided by the embodiments of this application (comprising the sequence cross-correlation layer, the feature extraction layer and the classification layer) is better suited to the offline matching scenario.
In one illustrative example, the feature vectors of the audio are used for training and prediction of an audio matching model. The audio matching model is the audio matching model in the above embodiment, and after the feature vector of the audio provided by the embodiment of the application is adopted for training, the audio matching model can be used for predicting the similarity between two audios.
Audio recommendation scenario:
Referring to the example shown in FIG. 10, when the terminal 180 used by the user runs an audio playing application and the user plays, favorites, or likes a first audio (song A) in the application, the server 160 may compare the first multi-scale vector sequence of the first audio (song A) with the second multi-scale vector sequences of a plurality of second audios (such as song B) to determine the similarity probability between the first audio and each second audio. In descending order of similarity probability, the songs B, C, D and E that are similar to song A are taken as recommended songs and sent to the audio playing application on the terminal 180, so that the user can hear more songs that match the user's preference.
Singing scoring scene:
Referring to the example shown in FIG. 11, the terminal 180 used by the user runs a singing application, and the user sings a song. The server 160 may compare a first multi-scale vector sequence of the first audio (the song sung by the user) with a second multi-scale vector sequence of a second audio (the original song, a star's version, or the top-scoring version) to determine the similarity probability between the first audio and the second audio. A singing score is given to the user according to the similarity probability and fed back to the singing application for display, which helps the user improve his or her singing.
FIG. 12 shows a flowchart of a model training method provided by an exemplary embodiment of the present application. The model training method can be used for training the audio matching model in the above embodiments. The method comprises the following steps:
The audio library stores a large amount of audio; the audio may include songs, pure music, symphonies, piano pieces, or other musical works, and the embodiment of the present application does not limit the types of audio in the audio library. Optionally, the audio library is the music library of an audio playing application.
Optionally, the audio has respective audio attribute features, the audio attribute features may be attribute features of the audio itself, or attribute features given by human, and the same piece of audio may include attribute features of a plurality of different dimensions.
In one possible embodiment, the audio attribute features of the audio include at least one of: text features, audio features, emotion features, and scene features. Alternatively, the text features may include text features of the audio itself (such as lyrics, composer, lyricist, genre, etc.) or artificially assigned text features (such as comments); the audio features are used to characterize the audio properties of the audio, such as melody, rhythm and duration; the emotion features are used to characterize the emotion expressed by the audio; and the scene features are used to characterize the playing scenes in which the audio is used. Of course, besides the above audio attribute features, the audio may also include attribute features of other dimensions, which is not limited in this embodiment.
In the embodiment of the application, the process of clustering the audio based on the audio attribute features may be referred to as primary screening, and is used for primarily screening out the audio with similar audio attribute features. In order to improve the quality of primary screening, the computer equipment clusters according to the attribute characteristics of at least two different dimensions, and clustering deviation caused by clustering based on the attribute characteristics of a single dimension is avoided.
After clustering, the computer device obtains a plurality of audio clusters, and the audio in the same audio cluster has similar audio attribute characteristics (compared with the audio in other audio clusters). The number of audio clusters can be preset in a clustering stage (based on an empirical value), so that clustering is prevented from being too generalized or too detailed.
Because the audios in the same audio class cluster have similar audio attribute features, while the audios in different audio class clusters differ greatly in audio attribute features, the server may preliminarily generate audio samples based on the audio class clusters, where each audio sample is a candidate audio pair composed of two pieces of audio.
Since the audio library contains a large amount of audio, the number of candidate audio pairs generated based on the audio class cluster is also huge, for example, for the audio library containing y pieces of audio, the number of generated candidate audio pairs is C (y, 2). However, while a large number of candidate audio pairs can be generated based on audio class clusters, not all of the candidate audio pairs can be used for subsequent model training. For example, when the audio in the candidate audio pair is the same song (e.g., the same song sung by a different singer), or the audio in the candidate audio pair is completely different (e.g., an english ballad, a suona song), it is too simple to train the candidate audio pair as a model training sample, and a high-quality model cannot be obtained.
In order to improve the quality of the audio sample, in the embodiment of the application, the computer device further screens out a high-quality audio pair from the candidate audio pair as the audio sample through fine screening.
In step 403, the server determines an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play records of the audios in the audio library, where the audios in the audio positive sample pair belong to the same audio cluster, and the audios in the audio negative sample pair belong to different audio clusters.
Analysis shows that a user's audio playing behavior is closely related to the similarity between audios; for example, a user often plays audios with high similarity in succession, although they are not identical. Therefore, in this embodiment, the computer device finely screens the generated candidate audio pairs based on the historical play records of the audio to obtain audio sample pairs. The audio sample pairs obtained by fine screening include audio positive sample pairs composed of similar audios (screened from candidate audio pairs composed of audios in the same audio class cluster) and audio negative sample pairs composed of dissimilar audios (screened from candidate audio pairs composed of audios in different audio class clusters).
Optionally, the history playing record is an audio playing record under each user account, and may be an audio playing list formed according to the playing sequence. For example, the history play records may be song play records of the respective users collected by the audio play application server.
In some embodiments, the audio positive sample pairs and audio negative sample pairs screened out based on the historical play records are difficult to distinguish from one another (i.e., they are hard samples), which helps improve the quality of the model trained on these audio sample pairs.
In step 404, the server performs machine learning training on the audio matching model according to the audio positive sample pair and the audio negative sample pair.
The sample is an object for training and testing the model, and the object includes label information, where the label information is a reference value (or called true value or supervised value) of the output result of the model, where a sample with label information of 1 is a positive sample, and a sample with label information of 0 is a negative sample. The samples in the embodiment of the present application refer to audio samples used for training a similarity model, and the audio samples are in the form of sample pairs, that is, the audio samples include two pieces of audio. Optionally, when the label information of the audio sample (pair) is 1, it indicates that two pieces of audio in the audio sample pair are similar audio, that is, an audio positive sample pair; when the label information of the audio sample (pair) is 0, it indicates that the two pieces of audio in the audio sample pair are not similar audio, i.e., an audio negative sample pair.
Wherein, the similarity probability of two audios in the same audio positive sample pair can be regarded as 1, or the clustering distance between two audios is quantized to the similarity probability. The similarity probability of two audios in the same audio negative sample pair can be regarded as 0, or the cluster-like distance or the vector distance between two audios is quantized to the similarity probability, for example, the inverse of the cluster-like distance or the inverse of the vector distance is quantized to the similarity probability of two audios in the same audio negative sample pair.
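One way the sample-pair generation described above could be sketched is shown below: candidate pairs come from the audio class clusters and are fine-screened with play histories. The co-play rule used here (audios played within a few positions of each other) is an assumed, illustrative screening criterion, not the exact rule of this application.

```python
from itertools import combinations

def build_sample_pairs(clusters, play_histories, window=5):
    """clusters: list of lists of audio IDs (the audio class clusters);
    play_histories: list of play lists, one per user account.
    Returns (positive_pairs, negative_pairs) labelled 1 and 0 respectively."""
    cluster_of = {a: idx for idx, cluster in enumerate(clusters) for a in cluster}
    # Collect audio pairs that some user played close together (within `window` positions).
    co_played = set()
    for history in play_histories:
        for i, a in enumerate(history):
            for b in history[i + 1:i + 1 + window]:
                if a != b:
                    co_played.add(frozenset((a, b)))
    positives, negatives = [], []
    for a, b in combinations(cluster_of, 2):
        if frozenset((a, b)) not in co_played:
            continue                                   # keep only co-played candidate pairs
        if cluster_of[a] == cluster_of[b]:
            positives.append((a, b, 1))                # label 1: audio positive sample pair
        else:
            negatives.append((a, b, 0))                # label 0: audio negative sample pair
    return positives, negatives
```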
Illustratively, the "audio matching model" in the above embodiment includes: a sequence cross-correlation layer, a feature extraction layer and a classification layer.
In summary, in the embodiments of this application, audios with similar features in the audio library are first clustered according to audio attribute features of different dimensions to obtain audio class clusters; audios belonging to the same or different audio class clusters are then combined to obtain a plurality of candidate audio pairs; and, based on the historical play records of the audios, audio positive sample pairs and audio negative sample pairs are screened out from the candidate audio pairs for subsequent model training. Clustering fuses the multi-dimensional attribute features of the audio, and the positive and negative sample pairs are screened based on users' audio play records, so the generated audio sample pairs can reflect the similarity between audios from multiple angles (both the attributes of the audio and users' listening habits). This realizes automatic generation of audio sample pairs while improving their quality, and further improves the quality of subsequent model training based on these audio samples.
Fig. 13 is a block diagram of an audio matching apparatus according to an exemplary embodiment of the present application. The device includes:
an obtaining module 1320, configured to obtain a first feature sequence of a first audio and a second feature sequence of a second audio;
a sequence cross-correlation module 1340, configured to perform cross-correlation processing on the first feature sequence and the second feature sequence, and output a cross-correlation vector sequence;
a feature extraction module 1360, configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector;
a classification module 1380, configured to perform prediction processing on the prediction vector and output a similarity probability between the first audio and the second audio.
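Illustratively, the classification module 1380 can be viewed as a lightweight classifier on top of the prediction vector. The following minimal sketch assumes a single linear layer followed by a sigmoid; the embodiments only require that a similarity probability between the first audio and the second audio be output.

```python
import numpy as np

def classify(prediction_vector, weights, bias=0.0):
    """Map the prediction vector to a similarity probability in [0, 1]."""
    # A single linear layer followed by a sigmoid; any classifier that outputs
    # a probability would fulfil the classification module's role.
    logit = float(np.dot(weights, prediction_vector)) + bias
    return 1.0 / (1.0 + np.exp(-logit))
```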
In an exemplary embodiment, the first feature sequence comprises n first frequency-domain vectors, the second feature sequence comprises q second frequency-domain vectors, and n and q are positive integers;
the sequence cross-correlation module 1340 is configured to calculate a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate an ith correlation score of an ith first frequency-domain vector of the n first frequency-domain vectors relative to the q second frequency-domain vectors, i being a positive integer not greater than n; take the ith correlation score as a correlation weight of the ith first frequency-domain vector, and calculate a weighted sequence of the n first frequency-domain vectors to obtain the first cross-correlation vector sequence; calculate a jth correlation score of a jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors, j being a positive integer not greater than q; and take the jth correlation score as a correlation weight of the jth second frequency-domain vector, and calculate a weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate a product sum of the ith first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors, and determine a quotient of the product sum and the square sum as the ith correlation score of the ith first frequency-domain vector relative to the q second frequency-domain vectors;
the sequence cross-correlation module 1340 is configured to calculate a product sum of the jth second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors, and determine a quotient of the product sum and the square sum as the jth correlation score of the jth second frequency-domain vector relative to the n first frequency-domain vectors.
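Illustratively, the cross-correlation processing of the sequence cross-correlation module 1340 can be sketched in Python as follows. This is a minimal illustration under assumptions: the "product sum" is read as a sum of dot products between vectors, and the splice is a plain concatenation; the embodiments fix only the quotient of the product sum and the square sum.

```python
import numpy as np

def correlation_scores(vecs_a, vecs_b):
    """Correlation score of each vector in vecs_a relative to all vectors in vecs_b.

    Implements the quotient described above: (sum of products of the i-th vector
    of A with every vector of B) / (sum of squares of the vectors of B).
    Both inputs have shape (num_frames, dim).
    """
    product_sum = vecs_a @ vecs_b.sum(axis=0)   # shape (len_a,): sum_j <a_i, b_j>
    square_sum = np.sum(vecs_b * vecs_b)        # scalar: sum_j ||b_j||^2
    return product_sum / square_sum

def cross_correlation_sequence(first_seq, second_seq):
    """Weight each sequence by its correlation scores and splice the results."""
    first_seq = np.asarray(first_seq, dtype=float)
    second_seq = np.asarray(second_seq, dtype=float)
    scores_a = correlation_scores(first_seq, second_seq)      # weights for the first sequence
    scores_b = correlation_scores(second_seq, first_seq)      # weights for the second sequence
    weighted_a = scores_a[:, None] * first_seq                # first cross-correlation vector sequence, (n, d)
    weighted_b = scores_b[:, None] * second_seq               # second cross-correlation vector sequence, (q, d)
    return np.concatenate([weighted_a, weighted_b], axis=0)   # spliced sequence, (n + q, d)
```

In this sketch the correlation score acts as a per-frame weight, so frames of one feature sequence that align well with the other audio contribute more strongly to the spliced cross-correlation vector sequence.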
In one illustrative embodiment, the feature extraction module comprises: a frequency domain convolution kernel;
the feature extraction module 1360 is configured to call the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along a frequency domain direction, so as to obtain a frequency domain convolution vector sequence; and outputting the prediction vector according to the frequency domain convolution vector sequence.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the sequence of frequency-domain convolution vectors along a frequency-domain direction, and determine a frequency-domain pooled vector obtained by pooling as the prediction vector.
In one illustrative embodiment, the frequency domain convolution kernels include K frequency domain convolution kernels of different scales, K being an integer greater than 1;
the feature extraction module 1360 is configured to respectively call the K frequency domain convolution kernels of different scales to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction, so as to obtain K frequency domain convolution vector sequences of different scales.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the frequency-domain convolution vector sequences of K different scales along the frequency-domain direction, and determine K frequency-domain pooled vectors obtained by pooling as the prediction vector.
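Illustratively, the multi-scale frequency-domain convolution and pooling performed by the feature extraction module 1360 can be sketched as follows. The kernel sizes, the averaging kernels, and mean pooling are assumptions used only to show the data flow; the embodiments specify only K convolution kernels of different scales and a pooling step along the frequency-domain direction.

```python
import numpy as np

def frequency_domain_convolution(cross_corr_seq, kernel):
    """Convolve every cross-correlation vector along the frequency-domain axis."""
    return np.stack([np.convolve(row, kernel, mode="valid") for row in cross_corr_seq])

def prediction_vector(cross_corr_seq, kernel_sizes=(3, 5, 7)):
    """Multi-scale frequency-domain convolution followed by pooling (sketch)."""
    cross_corr_seq = np.asarray(cross_corr_seq, dtype=float)
    pooled = []
    for k in kernel_sizes:
        kernel = np.full(k, 1.0 / k)                        # assumed averaging kernel for this scale
        conv_seq = frequency_domain_convolution(cross_corr_seq, kernel)
        pooled.append(conv_seq.mean(axis=1))                # pool along the frequency-domain direction
    return np.concatenate(pooled)                           # K pooled vectors joined into one prediction vector
```

In this sketch each of the K scales contributes one pooled vector, and the concatenation of the K pooled vectors serves as the prediction vector passed to the classification layer.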
It should be noted that: the audio matching device provided in the above embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio matching apparatus and the audio matching method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Specifically, the computer device 1400 includes a Central Processing Unit (CPU) 1401, a system memory 1404 including a random access memory 1402 and a read-only memory 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401. The computer device 1400 also includes a basic Input/Output system (I/O system) 1406 that facilitates transfer of information between devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse or a keyboard, for the user to input information. The display 1408 and the input device 1409 are both connected to the central processing unit 1401 via an input-output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include the input-output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1410 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes Random Access Memory (RAM), Read Only Memory (ROM), flash Memory or other solid state Memory technology, Compact disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1404 and mass storage device 1407 described above may collectively be referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1401, the one or more programs containing instructions for implementing the methods described above, and the central processing unit 1401 executes the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1400 may also operate by connecting, through a network such as the Internet, to a remote computer on that network. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1414 coupled to the system bus 1405, or the network interface unit 1414 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.
The present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio matching method described in any of the above embodiments.
The present application also provides a computer program product, which when run on a computer, causes the computer to execute the audio matching method provided by the above-mentioned method embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, which may be the computer-readable storage medium contained in the memory of the above embodiments, or a separate computer-readable storage medium that is not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio matching method of any of the above method embodiments.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (11)
1. A method of audio matching, the method comprising:
acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence, and outputting a cross-correlation vector sequence;
calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector;
and calling a classification layer to perform prediction processing on the prediction vector, and outputting the similarity probability of the first audio and the second audio.
2. The method of claim 1, wherein the first feature sequence comprises n first frequency-domain vectors, the second feature sequence comprises q second frequency-domain vectors, and n and q are positive integers;
the calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence, and outputting a cross-correlation vector sequence comprises:
calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence;
and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
3. The method of claim 2, wherein
the calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence comprises:
calculating an ith correlation score of an ith first frequency-domain vector of the n first frequency-domain vectors relative to the q second frequency-domain vectors, i being a positive integer not greater than n; and taking the ith correlation score as a correlation weight of the ith first frequency-domain vector, and calculating a weighted sequence of the n first frequency-domain vectors to obtain the first cross-correlation vector sequence;
the calculating a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence comprises:
calculating a jth correlation score of a jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors, j being a positive integer not greater than q; and taking the jth correlation score as a correlation weight of the jth second frequency-domain vector, and calculating a weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
4. The method of claim 3, wherein
said calculating an ith correlation score for an ith of said n first frequency-domain vectors relative to said q second frequency-domain vectors, comprising:
calculating a product sum of the i-th first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors; determining a quotient of the product sum and the square sum as an ith correlation score of the ith first frequency-domain vector relative to the q second frequency-domain vectors;
the calculating a jth correlation score of the jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors comprises:
calculating a product sum of the jth second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors; determining a quotient of the product sum and the square sum as a jth correlation score of the jth second frequency-domain vector relative to the n first frequency-domain vectors.
5. The method of any of claims 1 to 4, wherein the feature extraction layer comprises: a frequency domain convolution kernel;
the calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector comprises:
calling the frequency domain convolution kernel to carry out frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence;
and outputting the prediction vector according to the frequency domain convolution vector sequence.
6. The method of claim 5, wherein said outputting the prediction vector according to the sequence of frequency-domain convolution vectors comprises:
pooling the frequency domain convolution vector sequence along a frequency domain direction, and determining a frequency domain pooling vector obtained by pooling as the prediction vector.
7. The method of claim 5, wherein the frequency-domain convolution kernel comprises K frequency-domain convolution kernels at different scales, K being an integer greater than 1;
the calling the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence comprises:
and respectively calling the K frequency domain convolution kernels of different scales to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain K frequency domain convolution vector sequences of different scales.
8. The method of claim 5, wherein said outputting the prediction vector according to the sequence of frequency-domain convolution vectors comprises:
and pooling the K frequency domain convolution vector sequences of different scales along the frequency domain direction, and determining the K frequency domain pooling vectors obtained by pooling as the prediction vector.
9. An audio matching apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
the sequence cross-correlation module is used for performing cross-correlation processing on the first feature sequence and the second feature sequence and outputting a cross-correlation vector sequence;
the feature extraction module is used for performing feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for performing prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
10. A terminal, characterized in that the terminal comprises: a processor and a memory storing at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by the processor to implement the audio matching method of any of claims 1 to 8.
11. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the audio matching method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010202378.5A CN111445922B (en) | 2020-03-20 | 2020-03-20 | Audio matching method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111445922A true CN111445922A (en) | 2020-07-24 |
CN111445922B CN111445922B (en) | 2023-10-03 |
Family
ID=71654307
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010202378.5A Active CN111445922B (en) | 2020-03-20 | 2020-03-20 | Audio matching method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111445922B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080281590A1 (en) * | 2005-10-17 | 2008-11-13 | Koninklijke Philips Electronics, N.V. | Method of Deriving a Set of Features for an Audio Input Signal |
JP2008026836A (en) * | 2006-07-25 | 2008-02-07 | Yamaha Corp | Method, device, and program for evaluating similarity of voice |
US20110035373A1 (en) * | 2009-08-10 | 2011-02-10 | Pixel Forensics, Inc. | Robust video retrieval utilizing audio and video data |
US20140205103A1 (en) * | 2011-08-19 | 2014-07-24 | Dolby Laboratories Licensing Corporation | Measuring content coherence and measuring similarity |
US20180349350A1 (en) * | 2017-06-01 | 2018-12-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Artificial intelligence based method and apparatus for checking text |
CN109859772A (en) * | 2019-03-22 | 2019-06-07 | 平安科技(深圳)有限公司 | Emotion identification method, apparatus and computer readable storage medium |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110675893A (en) * | 2019-09-19 | 2020-01-10 | 腾讯音乐娱乐科技(深圳)有限公司 | Song identification method and device, storage medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
GRAHAM PERCIVAL et al.: "Streamlined Tempo Estimation Based on Autocorrelation and Cross-correlation With Pulses", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, pages 1765 - 1771 *
YANG FAN et al.: "Cover Song Identification Based on Cross Recurrence Plot and Local Matching", Journal of East China University of Science and Technology, pages 247 - 253 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487236A (en) * | 2020-12-01 | 2021-03-12 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and storage medium for determining associated song list |
CN113763927A (en) * | 2021-05-13 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and readable storage medium |
CN113763927B (en) * | 2021-05-13 | 2024-03-08 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and readable storage medium |
CN115273892A (en) * | 2022-07-27 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Audio processing method, device, equipment, storage medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN111445922B (en) | 2023-10-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111309965B (en) | Audio matching method, device, computer equipment and storage medium | |
CN111400543B (en) | Audio fragment matching method, device, equipment and storage medium | |
CN111444967B (en) | Training method, generating method, device, equipment and medium for generating countermeasure network | |
Chen et al. | The AMG1608 dataset for music emotion recognition | |
CN111445922B (en) | Audio matching method, device, computer equipment and storage medium | |
US10296959B1 (en) | Automated recommendations of audio narrations | |
CN111428074B (en) | Audio sample generation method, device, computer equipment and storage medium | |
CN111309966B (en) | Audio matching method, device, equipment and storage medium | |
CN111444379B (en) | Audio feature vector generation method and audio fragment representation model training method | |
KR20080030922A (en) | Information processing apparatus, method, program and recording medium | |
CN111462761A (en) | Voiceprint data generation method and device, computer device and storage medium | |
CN111445921B (en) | Audio feature extraction method and device, computer equipment and storage medium | |
WO2016102738A1 (en) | Similarity determination and selection of music | |
CN111460215B (en) | Audio data processing method and device, computer equipment and storage medium | |
EP3096242A1 (en) | Media content selection | |
Mirza et al. | Residual LSTM neural network for time dependent consecutive pitch string recognition from spectrograms: a study on Turkish classical music makams | |
Lu et al. | Predicting likability of speakers with Gaussian processes | |
Leleuly et al. | Analysis of feature correlation for music genre classification | |
Blume et al. | Huge music archives on mobile devices | |
Shirali-Shahreza et al. | Fast and scalable system for automatic artist identification | |
Pei et al. | Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering | |
Cai et al. | Feature selection approaches for optimising music emotion recognition methods | |
Kher | Music Composer Recognition from MIDI Representation using Deep Learning and N-gram Based Methods | |
Chen et al. | Hierarchical representation based on Bayesian nonparametric tree-structured mixture model for playing technique classification | |
Hemgren | Fuzzy Content-Based Audio Retrieval Using Visualization Tools |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40025589; Country of ref document: HK |
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |