CN111445922B - Audio matching method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111445922B
Authority
CN
China
Prior art keywords
audio
sequence
vector
frequency domain
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010202378.5A
Other languages
Chinese (zh)
Other versions
CN111445922A (en
Inventor
缪畅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010202378.5A priority Critical patent/CN111445922B/en
Publication of CN111445922A publication Critical patent/CN111445922A/en
Application granted granted Critical
Publication of CN111445922B publication Critical patent/CN111445922B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination, for retrieval
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00: Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/121: Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
    • G10H2240/131: Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
    • G10H2240/141: Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio matching method, an audio matching device, computer equipment and a storage medium, relating to the technical field of audio. The method comprises the following steps: acquiring a first feature sequence of a first audio and a second feature sequence of a second audio; calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence and output a cross-correlation vector sequence; calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector; and calling a classification layer to perform prediction processing on the prediction vector and output the similarity probability of the first audio and the second audio. Because the similarity of the two audios is calculated with a neural-network-based matching model, the similarity between different songs can be computed, so that a higher-precision similarity calculation result is obtained for different songs.

Description

Audio matching method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to an audio matching method, an audio matching device, a computer device, and a storage medium.
Background
Audio matching is a technique for measuring the similarity of two pieces of audio. By type of matching, it can be divided into audio segment matching and full audio matching. Audio segment matching determines, given an audio segment P, whether P is part of an audio D. Full audio matching computes, given an audio A and an audio B, the similarity of A and B.
In the related art, audio fingerprinting selects relatively significant time-frequency points in the audio signal and encodes them into a digital sequence using hash coding; the digital sequence serves as the audio fingerprint. Audio fingerprinting thereby converts the audio matching problem into a retrieval problem between different digital sequences.
Because audio segment matching is mainly performed between a segment and the full audio of the same song, the signal-processing-based audio fingerprint technique achieves a good matching effect in the audio segment matching scenario. In the full audio matching scenario, however, the similarity is mostly computed between two different songs; here the audio fingerprint technique is of limited applicability and cannot achieve a good matching effect.
Disclosure of Invention
The embodiment of the application provides an audio matching method, an audio matching device, computer equipment and a storage medium, and provides a matching scheme suitable for a full-audio matching scene. The technical scheme is as follows:
according to an aspect of the present application, there is provided an audio matching method, characterized in that the method comprises:
acquiring a first characteristic sequence of the first audio and a second characteristic sequence of the second audio;
Calling a sequence cross-correlation layer to carry out cross-correlation processing on the first characteristic sequence and the second characteristic sequence, and outputting a cross-correlation vector sequence;
calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided an audio matching apparatus, characterized in that the apparatus includes:
the acquisition module is used for acquiring a first characteristic sequence of the first audio and a second characteristic sequence of the second audio;
the sequence cross-correlation module is used for carrying out cross-correlation processing on the first characteristic sequence and the second characteristic sequence and outputting a cross-correlation vector sequence;
the feature extraction module is used for carrying out feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for carrying out prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided a terminal including: a processor and a memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded by the processor and performing the audio matching method as described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, which is loaded and executed by a processor to implement an audio matching method as described above.
The technical solutions provided by the embodiments of the present application have at least the following beneficial effects:
the similarity calculation of two audios is performed by an audio matching model comprising a sequence cross-correlation layer, a feature extraction layer and a classification layer; an audio matching model with a neural network architecture can mine the latent features and deep features of the audio, so that the similarity between different songs can be calculated and a higher-precision similarity calculation result is obtained.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an audio matching system provided by an exemplary embodiment of the present application;
FIG. 2 is a block diagram of an audio matching model provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of time domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of frequency domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of offline matching provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a song recommendation scenario provided by an exemplary embodiment of the present application;
FIG. 11 illustrates a schematic diagram of a song scoring scene provided by an exemplary embodiment of the present application;
FIG. 12 is a flowchart of a model training method provided by an exemplary embodiment of the present application;
fig. 13 is a block diagram of an audio matching apparatus according to an exemplary embodiment of the present application;
Fig. 14 is a schematic structural view of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer simulates or implements human learning behaviour to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
FIG. 1 illustrates a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application. The computer system 100 includes: a terminal 120 and a server 140.
The terminal 120 runs a platform that supports playing audio; the platform may be any one of an audio playing program or applet (a program that runs depending on a host program), an audio playing web page, a video playing program or applet, and a video playing web page.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
The server 140 includes at least one of a single server, a plurality of servers, a cloud computing platform, and a virtualization center. Illustratively, the server includes a processor 144 and a memory 142, and the memory 142 in turn includes a sequence cross-correlation layer 1421, a feature extraction layer 1422, and a classification layer 1423. In some embodiments, the server 140 obtains the audio signals of the audio to be matched from the terminal 120 or from the memory 142.
The terminal 120 generally refers to one of a plurality of terminals; there may be only one terminal, or tens or hundreds of terminals, or more. The embodiment of the present application is illustrated with the terminal 120 only as an example. The terminal types include: at least one of a smart phone, a tablet computer, an electronic book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer. The embodiment of the present application does not limit the number and types of terminals.
Fig. 2 shows a block diagram of an audio matching model 200 provided by an exemplary embodiment of the present application. The audio matching model 200 includes: a sequence cross correlation layer 220, a feature extraction layer 240, and a classification layer 260.
The output of the sequence cross-correlation layer 220 is coupled to the input of the feature extraction layer 240, and the output of the feature extraction layer 240 is coupled to the input of the classification layer 260.
The sequence cross-correlation layer 220 is configured to perform cross-correlation processing on the first feature sequence of the first audio and the second feature sequence of the second audio, and output a cross-correlation vector sequence.
The feature extraction layer 240 is configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector. Illustratively, the feature extraction layer 240 includes: a time domain convolution layer 242 and a frequency domain convolution layer 244, the time domain convolution layer 242 being configured to perform a time domain convolution operation and the frequency domain convolution layer 244 being configured to perform a frequency domain convolution operation. Optionally, the feature extraction layer 240 further includes: a time domain pooling layer 246 and a frequency domain pooling layer 248, the time domain pooling layer 246 for performing time domain pooling operations and the frequency domain pooling layer 248 for performing frequency domain pooling operations. In one possible design, the time domain convolution layer 242 and the time domain pooling layer 246 are not provided, but the frequency domain convolution layer 244 and the frequency domain pooling layer 248 are provided. In one possible design, the time domain convolution layer 242, the time domain pooling layer 246, the frequency domain convolution layer 244, and the frequency domain pooling layer 248 are provided simultaneously.
The classification layer 260 is configured to perform prediction processing on the prediction vector, and output a similarity probability of the first audio and the second audio.
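For orientation, the following is a minimal PyTorch-style sketch of how the three layers described above could be wired together. It is not the patent's reference implementation: the kernel sizes, the use of Conv2d for the frequency-direction convolution, max pooling, and the fixed sequence length are illustrative assumptions, and the cross_correlate helper is sketched separately under step 304 below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioMatchingModel(nn.Module):
    """Sketch of the three stages: sequence cross-correlation (computed outside,
    see the cross_correlate sketch below), multi-scale frequency-domain
    convolution plus pooling, and a softmax classification layer."""

    def __init__(self, total_len, kernel_sizes=(2, 3, 5)):
        super().__init__()
        # One kernel per scale; each spans `k` adjacent frequency rows of the
        # M x (n+q) cross-correlation matrix while keeping the time axis intact.
        self.freq_convs = nn.ModuleList(
            [nn.Conv2d(1, 1, kernel_size=(k, 1)) for k in kernel_sizes]
        )
        # total_len = n + q, assumed fixed here purely for illustration.
        self.classifier = nn.Linear(len(kernel_sizes) * total_len, 2)

    def forward(self, corr_matrix):
        # corr_matrix: (M, n+q) output of the sequence cross-correlation layer.
        x = corr_matrix.unsqueeze(0).unsqueeze(0)      # (1, 1, M, n+q)
        pooled = []
        for conv in self.freq_convs:
            h = conv(x)                                # (1, 1, M-k+1, n+q)
            pooled.append(h.amax(dim=2).flatten())     # pool along frequency
        pred_vec = torch.cat(pooled)                   # prediction vector
        logits = self.classifier(pred_vec)
        return F.softmax(logits, dim=-1)               # [p_dissimilar, p_similar]
```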
Fig. 3 shows a flowchart of an audio matching method according to an exemplary embodiment of the present application. The present embodiment is exemplified by applying the method to the server shown in fig. 1. The method comprises the following steps:
Step 302, acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
The feature sequence of an audio includes: N frequency domain vectors arranged in time order. Each frequency domain vector is M-dimensional, each dimension representing the distribution of the audio at one frequency, and the frequency spacing between adjacent dimensions is the same. N and M are integers greater than 1. Optionally, the process of obtaining the feature sequence is as follows:
Sample the audio in the time dimension at a preset sampling interval (for example, every 0.1 seconds) to obtain a discrete time series T1 to Tn, where each value T represents the amplitude of the audio at that sampling point.
Group the time series by a fixed time period (for example, every 3 seconds) to obtain a plurality of time-series groups G1 to GN, where each time-series group Gi contains a plurality of sampling points, for example 3 seconds / 0.1 seconds = 30 sampling points, and i is an integer not greater than N.
Transform the sampling points belonging to the same time-series group Gi into a frequency domain vector, obtaining N frequency domain vectors arranged in time order. That is, a time-domain-to-frequency-domain transform is applied to each time-series group to obtain the frequency domain sequence corresponding to that group Gi. The time-frequency transform includes, but is not limited to, FFT (Fast Fourier Transform), DFT (Discrete Fourier Transform) and MFCC (Mel-scale Frequency Cepstral Coefficients). Each frequency domain sequence represents the distribution of the different frequencies contained in the same time-series group Gi. The N frequency domain sequences are then each sampled at a set of different sampling frequencies to obtain N frequency domain vectors, where the different sampling frequencies refer to the frequency points obtained by dividing the range between the upper and lower frequency limits of the audio at equal intervals.
The N frequency domain vectors arranged in time order form a two-dimensional matrix of M x N. On this matrix, the axis corresponding to N represents the time domain direction and the axis corresponding to M represents the frequency domain direction. M is the quotient of the range between the upper and lower frequency limits and the frequency sampling interval.
The server obtains the first feature sequence of the first audio and the second feature sequence of the second audio. The first feature sequence comprises n first frequency domain vectors and the second feature sequence comprises q second frequency domain vectors. The first frequency domain vectors and the second frequency domain vectors have the same ordering and physical meaning; for example, both are arranged in time order and both are M-dimensional vectors.
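As a rough illustration of the framing and transform steps above, the following numpy sketch builds a feature sequence from an already-sampled signal. The 0.1 s sampling interval and 3 s grouping follow the example values in the text, while the FFT magnitude spectrum and the number of frequency points m_bins are assumptions made only for this sketch.

```python
import numpy as np

def feature_sequence(signal, samples_per_second=10, frame_seconds=3, m_bins=16):
    """Turn a discrete time series into N frequency domain vectors of dimension M.
    signal: 1-D array of amplitudes sampled every 0.1 s (samples_per_second=10)."""
    frame_len = frame_seconds * samples_per_second    # e.g. 30 samples per group Gi
    n_frames = len(signal) // frame_len
    vectors = []
    for i in range(n_frames):
        group = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(group))          # time domain -> frequency domain
        # Resample the spectrum onto M equally spaced frequency points.
        vec = np.interp(np.linspace(0, len(spectrum) - 1, m_bins),
                        np.arange(len(spectrum)), spectrum)
        vectors.append(vec)
    return np.stack(vectors)                           # shape (N, M)
```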
Step 304, calling a sequence cross-correlation layer to carry out cross-correlation processing on the first characteristic sequence and the second characteristic sequence, and outputting a cross-correlation vector sequence;
the cross-correlation process is a process operation for measuring the correlation between the first feature sequence and the second feature sequence.
Illustratively, the server calculates a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence, and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the first cross-correlation vector sequence and the second cross-correlation vector sequence as cross-correlation vector sequences.
Step 306, calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
the feature extraction process includes: at least one of a convolution operation and a pooling operation. Wherein the convolution operation may be a multi-scale convolution operation.
Divided by dimension, the feature extraction processing includes: at least one of time domain feature extraction processing and frequency domain feature extraction processing. The time domain feature extraction processing includes: at least one of a time domain convolution operation and a time domain pooling operation. The frequency domain feature extraction processing includes: at least one of a frequency domain convolution operation and a frequency domain pooling operation.
In one possible design, the feature extraction layer is invoked to perform a time domain feature extraction process and a frequency domain feature extraction process on the cross-correlation vector sequence. In another possible design, the feature extraction layer is called to perform frequency domain feature extraction processing on the cross-correlation vector sequence.
Step 308, call the classification layer to predict the prediction vector, and output the similarity probability of the first audio and the second audio.
Optionally, the classification layer is a softmax function, the input is a prediction vector of the first audio and the second audio, and the output is a similarity probability of the first audio and the second audio. The server performs at least one of audio recommendation, audio scoring, audio classification, and audio matching based on the similarity probabilities of the two audio.
In the personalized recommendation scene, the server is used for acquiring a second feature vector of a second audio in the audio library after obtaining a first feature vector of a first audio provided by the client, searching out the second audio with higher similarity with the first audio by using the audio matching model, and recommending the second audio to the client.
In the audio scoring scene, the server is used for acquiring a second feature vector of a second audio in the audio library after obtaining a first feature vector of a first audio provided by the client, calculating the similarity between the first audio and the second audio by using an audio matching model, and recommending the second audio with higher similarity score to the client.
In the audio matching scenario, the server is configured to obtain a first feature vector of a first audio provided by the client, obtain a second feature vector of a second audio in the audio library, find out the second audio with extremely high similarity to the first audio by using the audio matching model, and recommend audio information (such as song name, singer, style, year, record company, etc.) of the second audio to the client.
In the audio classification scene, the server is used for calculating similarity between every two songs in the audio library, and songs with similarity higher than a threshold value are classified into the same class cluster, so that the songs are classified into the same class.
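As an illustration of how the similarity probability could drive the recommendation and matching scenarios above, the following sketch ranks library tracks against a query track. The helper names, the library format and the top-k cutoff are assumptions; it reuses the AudioMatchingModel sketched after FIG. 2 and the cross_correlate helper sketched under step 304 below.

```python
def recommend(query_feats, library, model, top_k=10):
    """Rank library audio by similarity probability to the query audio.
    library: iterable of (track_id, feature_sequence) pairs, where each feature
    sequence is a (length, M) torch tensor; model: a trained AudioMatchingModel."""
    scored = []
    for track_id, feats in library:
        corr = cross_correlate(query_feats, feats)   # sequence cross-correlation layer
        p_similar = model(corr)[1].item()            # probability that the two match
        scored.append((p_similar, track_id))
    scored.sort(reverse=True)
    return [track_id for _, track_id in scored[:top_k]]
```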
In summary, according to the audio matching method provided by the embodiment, the similarity calculation of two audios is performed by the audio matching model including the sequence cross-correlation layer, the feature extraction layer and the classification layer, and the potential features and the deep features of the audios can be mined by the audio matching model adopting the neural network architecture, so that the similarity between different songs can be calculated, and a similarity calculation result with higher precision is obtained.
Fig. 4 shows a flowchart of an audio matching method according to another exemplary embodiment of the present application. The present embodiment is exemplified by the application of the method to the server shown in fig. 1. The method comprises the following steps:
Step 302, acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
The first feature sequence of the first audio comprises: N first frequency domain vectors arranged in time order. Each first frequency domain vector is M-dimensional, each dimension representing the distribution of the audio at one frequency, and the frequency spacing between adjacent dimensions is the same. N and M are integers greater than 1.
The second feature sequence of the second audio comprises: Q second frequency domain vectors arranged in time order. Each second frequency domain vector is M-dimensional, each dimension representing the distribution of the audio at one frequency, and the frequency spacing between adjacent dimensions is the same. Q and M are integers greater than 1.
Illustratively, as shown in FIG. 5, the audio signal is sampled in the time dimension, for example every 0.1 s, to obtain a discrete time series T1 to Tn, where each value represents the amplitude of the audio at that sampling point. The values are then combined over a fixed time period (for example 3 s); with a 3 s period and a 0.1 s sampling interval, each group of the sequence contains 3 s / 0.1 s = 30 values. For example, T1 to T30 form a group called G1, T31 to T60 form G2, and so on. Each group of the time series is then subjected to a frequency domain transform (including but not limited to FFT, MFCC, DFT, etc.) to obtain a frequency domain signal representing the distribution of the different frequencies contained in that group, and this frequency signal is also sampled, for example at 10 Hz, to obtain a discrete frequency sequence. Assuming the lower and upper frequency limits are 0 to f, the number of points in each frequency sequence is f/10, and each Gi can be expressed as such a frequency sequence; different Gi differ only in the magnitudes at the same frequencies. In terms of music, some parts of a piece are very low-pitched, so the low-frequency values of those Gi are large, while other parts are high-pitched, so the high-frequency values of those Gi are large. Thus Gi can be expressed either as a time series T1 to T30 or as a frequency sequence, and together these form a spectrogram. The spectrogram illustrated in FIG. 5 is obtained by decomposing real audio; the horizontal axis is time with a period of about 1.75 s, i.e. a time slice is cut every 1.75 s; the vertical axis is the frequency corresponding to each time slice, with lower and upper frequency limits of 110 Hz and 3520 Hz, and the gray scale represents the magnitude of the value at each frequency.
Performing the above processing on the audio signal of the first audio and the audio signal of the second audio yields the first feature sequence of the first audio and the second feature sequence of the second audio.
Step 3041, calculating an i-th correlation score of the i-th first frequency domain vector among the n first frequency domain vectors with respect to the q second frequency domain vectors, where i is an integer not greater than n;
The first feature sequence of the first audio comprises n first frequency domain vectors {G1, G2, ..., Gn} arranged in time order, and each Gi is a frequency domain vector. To measure the correlation between the i-th first frequency domain vector Gi and the q second frequency domain vectors H1 to Hq, the following correlation calculation formula is introduced for the i-th first frequency domain vector:
score(Gi) = (H1*Gi + H2*Gi + ... + Hq*Gi) / (H1^2 + H2^2 + ... + Hq^2);
That is, the server calculates the sum of the products of the i-th first frequency domain vector Gi with each of the q second frequency domain vectors H1 to Hq, and the sum of the squares of the q second frequency domain vectors H1 to Hq; the quotient of the sum of products and the sum of squares is determined as the i-th correlation score of the i-th first frequency domain vector with respect to the q second frequency domain vectors H1 to Hq.
Step 3042, calculating the weighted sequence of the n first frequency domain vectors by taking the i-th correlation score as the correlation weight of the i-th first frequency domain vector, so as to obtain the first cross-correlation vector sequence;
In this way, the score(Gi) corresponding to each Gi is calculated, so that after correlation fusion the first cross-correlation vector sequence comprises: {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Since score(Gi) can be regarded as the correlation weight of the original Gi, the output is the weighted first cross-correlation vector sequence, denoted {G'1, ..., G'n}.
Step 3043, calculating a j-th correlation score of the j-th second frequency domain vector among the q second frequency domain vectors with respect to the n first frequency domain vectors, where j is an integer not greater than q;
The second feature sequence of the second audio comprises q second frequency domain vectors {H1, H2, ..., Hq} arranged in time order, and each Hj is a frequency domain vector. To measure the correlation between the j-th second frequency domain vector Hj and the n first frequency domain vectors G1 to Gn, the following correlation calculation formula is introduced for the j-th second frequency domain vector:
score(Hj) = (G1*Hj + G2*Hj + ... + Gn*Hj) / (G1^2 + G2^2 + ... + Gn^2);
That is, the server calculates the sum of the products of the j-th second frequency domain vector Hj with each of the n first frequency domain vectors G1 to Gn, and the sum of the squares of the n first frequency domain vectors G1 to Gn; the quotient of the sum of products and the sum of squares is determined as the j-th correlation score of the j-th second frequency domain vector Hj with respect to the n first frequency domain vectors G1 to Gn.
Step 3044, calculating the weighted sequence of the q second frequency domain vectors by taking the j-th correlation score as the correlation weight of the j-th second frequency domain vector, so as to obtain the second cross-correlation vector sequence;
In this way, the score(Hj) corresponding to each Hj is calculated, so that after correlation fusion the second cross-correlation vector sequence comprises: {H1*score(H1), ..., Hj*score(Hj), ..., Hq*score(Hq)}. Since score(Hj) can be regarded as the correlation weight of the original Hj, the output is the weighted second cross-correlation vector sequence, denoted {H'1, ..., H'q}.
Step 3045, splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the result as the cross-correlation vector sequence;
Illustratively, the first cross-correlation vector sequence {G'1, ..., G'n} and the second cross-correlation vector sequence {H'1, ..., H'q} are concatenated to obtain the cross-correlation vector sequence {G'1, ..., G'n, H'1, ..., H'q}. The cross-correlation vector sequence includes n+q cross-correlation vectors, i.e. the n first cross-correlation vectors followed by the q second cross-correlation vectors.
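Under the assumption that the feature sequences are given as torch tensors, a literal sketch of steps 3041 to 3045 looks as follows; variable names are illustrative, and the transpose at the end simply arranges the result as the M x (n+q) matrix used by the frequency-domain convolution described next.

```python
import torch

def cross_correlate(feats_a, feats_b):
    """Sequence cross-correlation layer following steps 3041-3045.
    feats_a: (n, M) first feature sequence; feats_b: (q, M) second feature sequence.
    Returns the concatenated weighted sequence as an (M, n+q) matrix."""
    feats_a = torch.as_tensor(feats_a, dtype=torch.float32)
    feats_b = torch.as_tensor(feats_b, dtype=torch.float32)

    # score(Gi) = (H1*Gi + ... + Hq*Gi) / (H1^2 + ... + Hq^2)
    scores_a = feats_a @ feats_b.sum(dim=0) / (feats_b * feats_b).sum()
    weighted_a = feats_a * scores_a.unsqueeze(1)       # {G'1, ..., G'n}

    # score(Hj) = (G1*Hj + ... + Gn*Hj) / (G1^2 + ... + Gn^2)
    scores_b = feats_b @ feats_a.sum(dim=0) / (feats_a * feats_a).sum()
    weighted_b = feats_b * scores_b.unsqueeze(1)       # {H'1, ..., H'q}

    # Concatenate and transpose so rows are frequencies, columns are time steps.
    return torch.cat([weighted_a, weighted_b], dim=0).t()
```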
Step 3061, calling a frequency domain convolution kernel to carry out frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence;
The frequency domain direction means that the frequency domain convolution processing is performed on the cross-correlation vector sequence along the direction of increasing (or decreasing) sampling frequency, so as to obtain the frequency domain convolution vectors.
Alternatively, the cross-correlation feature sequence may be regarded as a matrix of M rows by (n+q) columns, each row being a time domain vector of dimension (n+q). Assume the size of the frequency domain convolution kernel is P x (n+q), where P is less than M. The frequency domain direction means that the convolution processing is performed on P adjacent time domain vectors along the 0-to-M direction.
As shown in fig. 5, assuming the size of the frequency domain convolution kernel is 3 x (n+q), when the first convolution is performed in the frequency domain direction, the time domain vector f1, the time domain vector f2 and the time domain vector f3 are convolved to obtain f'1; when the second convolution is performed in the frequency domain direction, the time domain vector f2, the time domain vector f3 and the time domain vector f4 are convolved to obtain f'2; when the third convolution is performed in the frequency domain direction, the time domain vector f3, the time domain vector f4 and the time domain vector f5 are convolved to obtain f'3, and so on, until finally M-3+1 frequency domain convolution vectors f'i are obtained.
Each f'i is a new time domain vector compressed from the P adjacent time domain vectors by the convolution, and is used to represent the correlation between those P time domain vectors prior to the convolution.
Step 3062, outputting a prediction vector according to the frequency domain convolution vector sequence;
Optionally, the server directly outputs the M-3+1 frequency domain convolution vectors obtained above as the prediction vector.
Optionally, the server performs pooling processing on the frequency domain convolution vector sequence along the frequency domain direction, and determines a frequency domain pooled vector obtained by pooling as the prediction vector.
As shown in fig. 6, when the frequency domain pooling operation is performed, the pooling is carried out along the frequency domain direction, and the pooling dimension is consistent with the vector dimension. After the frequency domain pooling operation, the frequency domain convolution vectors f'1, f'2, ..., f'(M-P+1) obtained above are compressed into one pooled frequency domain convolution vector f''. The pooled frequency domain convolution vector therefore still preserves the physical meaning of the frequency domain convolution vectors and can be regarded as a new vector compressed along the frequency domain dimension. The frequency domain pooling vector f'' is used to represent the condensed nature of the plurality of frequency domain convolution vectors.
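The sliding-window convolution and the pooling step can also be written out explicitly; the sketch below uses a plain loop and max pooling, both of which are assumptions (the patent does not fix the pooling type), purely to make the shapes of f'i and f'' concrete.

```python
import numpy as np

def freq_conv_and_pool(corr_matrix, kernel):
    """corr_matrix: (M, n+q) cross-correlation matrix, rows indexed by frequency.
    kernel: (P, n+q) frequency domain convolution kernel with P < M.
    Returns the M-P+1 frequency domain convolution vectors f'_i and the pooled
    frequency domain vector f''."""
    M, P = corr_matrix.shape[0], kernel.shape[0]
    conv_vectors = np.stack([
        (corr_matrix[i:i + P] * kernel).sum(axis=0)    # f'_i, length n+q
        for i in range(M - P + 1)
    ])
    pooled = conv_vectors.max(axis=0)                  # f'', pooled along frequency
    return conv_vectors, pooled
```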
Step 308, call the classification layer to predict the prediction vector, and output the similarity probability of the first audio and the second audio.
Optionally, the classification layer is a softmax function, the input is a prediction vector of the first audio and the second audio, and the output is a similarity probability of the first audio and the second audio.
In the above embodiment, only frequency domain feature extraction is performed on the cross-correlation vector sequence during the feature extraction process; in other embodiments, time domain feature extraction may also be performed, which is not limited here. Time domain feature extraction differs from frequency domain feature extraction only in the extraction direction; the extraction manner is the same.
In an alternative embodiment based on fig. 4, the above-mentioned frequency domain convolution kernels are K, K being an integer greater than 1. Step 3061 alternative implementation becomes step 306a and step 3062 alternative implementation becomes step 306b, as shown in fig. 7 below:
step 306a, respectively calling K different frequency domain convolution kernels to carry out frequency domain convolution processing on the cross-correlation vector sequences along the frequency domain direction to obtain K frequency domain convolution vector sequences with different scales;
and respectively calling K different frequency domain convolution kernels to carry out frequency domain convolution processing on the autocorrelation vector sequences along the frequency domain direction, so as to obtain K frequency domain convolution vector sequences with different scales. The sequence of frequency domain convolution vectors at each scale may be multiple, such as N-P + 1.
Step 306b, pooling the K frequency domain convolution vector sequences at different scales along the frequency domain direction respectively, and determining the K frequency domain pooling vectors obtained by pooling as the prediction vector.
Optionally, pooling is performed on the frequency domain convolution vector sequence under each scale to obtain a pooled frequency domain pooled vector. And carrying out pooling treatment on the frequency domain convolution vector sequences under K different scales to finally obtain K frequency domain pooling vectors.
The prediction vector is obtained by concatenating the K frequency domain pooling vectors in order of scale from small to large, i.e. {f''1, f''2, ..., f''K}, or from large to small, i.e. {f''K, f''(K-1), ..., f''1}.
In summary, since the multi-scale vector sequence uses the frequency domain vectors under multiple scales to represent the potential features and deep features of the audio, the multi-scale vector sequence of the two audio is used as input, and the similarity of the two audio is calculated by adopting a matching mode based on a neural network, so that the similarity between different songs can be calculated, and a similarity calculation result with higher precision is obtained.
It should be noted that, in an alternative embodiment, the above "convolution + pooling + multi-scale" operations may be implemented in combination, as in the embodiment shown in fig. 8:
Overall, the spectrograms of the first audio and the second audio undergo a cross-correlation operation in the sequence cross-correlation layer 220, and the resulting cross-correlation vector sequence is output to the multi-scale frequency domain convolution layer 242 for multi-scale frequency domain convolution, yielding a multi-scale frequency domain representation. The multi-scale frequency domain representation is then input to the multi-scale pooling layer 244 for multi-scale pooling, and is finally output to the classification layer, where the similarity probability indicates whether the two pieces of audio are similar. The operation of each module is described in detail below.
Sequence cross-correlation layer 220:
The present application represents spectrogram A as {G1, G2, ..., Gn}, where each Gi is a frequency distribution and can be regarded as a vector, and represents spectrogram B as {H1, ..., Hq}, where Hj has the same physical meaning as Gi and each value of the vector represents the magnitude of a frequency component. In order to measure the cross-correlation of the two pieces of audio from the time perspective, the application introduces the following correlation calculation formula:
score(Gi) = (H1*Gi + H2*Gi + ... + Hq*Gi) / (H1^2 + H2^2 + ... + Hq^2)
In this way, the present application obtains a score(Gi) for each Gi, so that after correlation fusion the output of the whole time-series correlation module is: {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Since score(Gi) is regarded as the correlation weight of the original Gi, the output is the spectrum sequence after the weights have been applied, which the application denotes {G'1, ..., G'n}.
Similarly, for Hj the application also computes a score(Hj), as follows:
score(Hj) = (G1*Hj + G2*Hj + ... + Gn*Hj) / (G1^2 + G2^2 + ... + Gn^2)
In the same way the application obtains {H'1, ..., H'q}. Next, the application splices the two sequences together, i.e. {G'1, ..., G'n, H'1, ..., H'q}, and inputs the result to the multi-scale frequency domain convolution layer 242.
Multi-scale frequency domain convolution layer 242
The cross-correlation vector sequences are operated on from the frequency domain by convolution kernels of multiple scales to fully extract the audio frequency domain features.
Because the sequence cross-correlation layer 220 has already performed cross-correlation processing in time, no additional time domain convolution operation is required; only a frequency domain convolution operation is needed, since the human ear's "hearing" of music is affected by frequency.
Multiscale pooling layer 244
Assume that the multi-scale frequency domain convolution layer 242 produces frequency domain convolution representations f1, f2, f3 at three scales. The present application pools each of these two-dimensional frequency domain convolution representations separately. As shown in fig. 6, f'1 to f'4 are the results of frequency domain convolution with the frequency domain convolution kernel at one scale, that is, a certain fi is formed by the 4 frequency domain convolution vectors f'1 to f'4, and the pooling operation at that scale is then performed.
The resulting f'' can be regarded as a "compression" of the original 4 frequency domain convolution vectors from the time dimension (because f'1 to f'4 each represent a time sequence, and 4 time sequences become 1 time sequence, hence "compression in the time dimension").
The application performs the pooling operation on the f'i of each of the multiple scales to obtain frequency domain pooled vectors f''i at multiple scales, and then splices all (for example, three) frequency domain pooled vectors f''i together to form one large vector or vector sequence, which is input to the classification layer 260.
Classification layer 260
The classification layer 260 may be a softmax function, with the output Y being the similarity probability of the two pieces of audio, representing the degree of matching of the two pieces of audio.
As shown in fig. 9, when the amount of music in the music library is on the order of millions to tens of millions, it is suitable to use the audio matching model to predict the similarity probability between two pieces of full audio in an offline matching scenario; when the amount of music in the music library is on the order of tens to thousands, it is suitable to use the audio matching model in an online matching scenario; and when the amount of music in the music library is on the order of thousands to millions, it is suitable to use the audio matching model in a near-line matching scenario. The audio matching model provided by the embodiment of the application (comprising the sequence cross-correlation layer, the feature extraction layer and the classification layer) is better suited to the offline matching scenario.
In one illustrative example, the above-described feature vectors of audio are used for training and prediction of an audio matching model. The audio matching model is the audio matching model in the embodiment, and can be used for predicting the similarity between two audios after training the feature vector of the audios provided by the embodiment of the application.
Audio recommendation scenario:
Referring to the example shown in fig. 10, the user uses a terminal 180 on which an audio playing application runs. When the user plays, favorites or likes a first audio (song A) in the audio playing application, the server 160 may compare the first multi-scale vector sequence of the first audio (song A) with the second multi-scale vector sequences of a plurality of second audios (song B and others) to determine the similarity probability of the first audio and each second audio. In descending order of similarity probability, songs B, C, D and E, which are similar to song A, are sent as recommended songs to the audio playing application on the terminal 180, so that the user can hear more songs that match the user's preferences.
Singing scoring scene:
Referring to the example shown in fig. 11, a singing application runs on the terminal 180 used by the user. After the user sings a song, the server 160 may compare the first multi-scale vector sequence of the first audio (the song sung by the user) with the second multi-scale vector sequence of the second audio (the original song, a star's performance or a high-scoring performance) to determine the similarity probability of the first audio and the second audio. The user's singing score is then given according to the similarity probability and fed back to the singing application for display, helping the user improve his or her singing level.
FIG. 12 illustrates a flow chart of a model training method provided by an exemplary embodiment of the present application. The model training method can be used for training the audio matching model in the embodiment. The method comprises the following steps:
step 401, the server clusters the audio in the audio library according to the audio attribute features to obtain an audio class cluster, wherein the audio attribute features comprise at least two attribute features with different dimensions, and the feature similarity of the audio in the different audio class clusters is lower than that of the audio in the same audio class cluster.
Wherein, a great deal of audio is stored in the audio library, and the audio may include songs, pure music, symphonies, piano songs or other playing music, etc., and the embodiment of the present application does not limit the type of audio in the audio library. Optionally, the audio library is a music library of an audio playing application.
Optionally, the audio has respective audio attribute features, the audio attribute features may be attribute features of the audio itself or attribute features artificially given, and the same audio may include attribute features of a plurality of different dimensions.
In one possible implementation, the audio attribute features of the audio include at least one of: text features, audio features, emotion features, and scene features. Alternatively, the text features may include text features of the audio itself (such as lyrics, composer, word maker, genre, etc.), and may also include artificially imparted text features (such as comments); the audio features are used for representing audio characteristics such as melody, rhythm, duration and the like of the audio itself; the emotion characteristics are used for representing emotion expressed by the audio; scene features are used to characterize the playback scene used by the audio. Of course, in addition to the above-described audio attribute features, the audio may also include attribute features of other dimensions, which are not limited in this embodiment.
In the embodiment of the application, the process of performing audio clustering based on the audio attribute features can be called as preliminary screening, and is used for preliminarily screening out the audio with similar audio attribute features. In order to improve the primary screening quality, the computer equipment clusters according to at least two attribute features with different dimensions, and clustering deviation caused by clustering based on attribute features with single dimension is avoided.
After clustering, the computer device obtains a plurality of audio class clusters, and the audio in the same audio class cluster has similar audio attribute characteristics (compared with the audio in other audio class clusters). The number of the audio class clusters can be preset in a clustering stage (can be based on experience values), so that the clusters are prevented from being excessively generalized or excessively refined.
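As a sketch of the preliminary screening in step 401, the snippet below clusters library tracks on a vector that concatenates attribute features of several dimensions. Using K-means, scikit-learn and a preset cluster count of 100 are all assumptions for illustration; the patent only requires clustering on at least two attribute feature dimensions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio(attribute_vectors, n_clusters=100):
    """attribute_vectors: one vector per track, e.g. text, audio, emotion and
    scene features concatenated and normalised beforehand (assumed encoding).
    Returns one cluster id per track; tracks sharing an id form an audio class
    cluster with similar attribute features."""
    features = np.stack(attribute_vectors)             # (num_tracks, feat_dim)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```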
Step 402, generating a candidate audio pair according to the audio in the audio class cluster, wherein the candidate audio pair comprises two sections of audio, and the two sections of audio belong to the same audio class cluster or different audio class clusters.
Because the audio in the same audio class cluster has similar audio attribute characteristics, and the audio in different audio class clusters has larger difference in the audio attribute characteristics, the server can initially generate audio samples based on the audio class clusters, wherein each audio sample is a candidate audio pair consisting of two pieces of audio.
Because of the large number of audio tracks contained in the audio library, the number of candidate audio pairs generated based on the audio class clusters is also quite large; for example, for an audio library containing y pieces of audio, the number of candidate audio pairs generated is C(y, 2). However, while massive numbers of candidate audio pairs can be generated based on the audio class clusters, not all candidate audio pairs can be used for subsequent model training. For example, when the candidate audio pair is the same song (such as the same song sung by different singers), or the audio in the candidate audio pair is completely different (such as a UK ballad and a suona piece), the candidate audio pair is too simple to serve as a training sample from which a high-quality model can be obtained.
In order to improve the quality of the audio samples, in the embodiment of the application, the computer equipment further screens out high-quality audio pairs from the candidate audio pairs as the audio samples through fine screening.
Step 403, the server determines, according to the historical play record of the audio in the audio library, an audio positive sample pair and an audio negative sample pair in the candidate audio pair, where the audio in the audio positive sample pair belongs to the same audio cluster, and the audio in the audio negative sample pair belongs to different audio clusters.
Analysis shows that a user's audio playing behavior is closely related to the similarity between audios; for example, a user often plays, one after another, audios that have high similarity but are not the same audio. Therefore, in the embodiment of the present application, the computer device performs fine screening on the generated candidate audio pairs based on the history play records of the audio to obtain the audio sample pairs. The audio sample pairs obtained by fine screening comprise audio positive sample pairs composed of similar audio (screened from candidate audio pairs composed of audio in the same audio class cluster) and audio negative sample pairs composed of dissimilar audio (screened from candidate audio pairs composed of audio in different audio class clusters).
Optionally, the historical play record is an audio play record under each user account, which may be an audio play list formed according to a play sequence. For example, the history play record may be a song play record of each user collected by the audio play application server.
In some embodiments, the audio positive sample pairs and audio negative sample pairs screened based on the history play records have a low degree of distinction (they are harder samples to tell apart), which helps improve the quality of the model subsequently trained on these audio sample pairs.
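A possible reading of the fine screening in step 403 is sketched below: same-cluster candidate pairs that are frequently co-played become positive samples, and cross-cluster pairs that are never co-played become negative samples. The session definition and the co-play threshold are assumptions; the patent only states that history play records are used for the screening.

```python
from collections import Counter
from itertools import combinations

def screen_pairs(candidate_pairs, cluster_of, play_sessions, min_coplays=3):
    """candidate_pairs: iterable of (track_a, track_b); cluster_of: dict mapping
    track id to its audio class cluster; play_sessions: lists of track ids played
    consecutively by one user. Returns (positive_pairs, negative_pairs)."""
    coplays = Counter()
    for session in play_sessions:
        for a, b in combinations(set(session), 2):
            coplays[frozenset((a, b))] += 1

    positives, negatives = [], []
    for a, b in candidate_pairs:
        count = coplays[frozenset((a, b))]
        if cluster_of[a] == cluster_of[b] and count >= min_coplays:
            positives.append((a, b))                   # similar audio pair
        elif cluster_of[a] != cluster_of[b] and count == 0:
            negatives.append((a, b))                   # dissimilar audio pair
    return positives, negatives
```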
Step 404, the server performs machine learning training on the audio matching model according to the audio positive sample pair and the audio negative sample pair.
The sample is an object for model training and testing, and the object contains labeling information, wherein the labeling information is a reference value (or referred to as a true value or a supervision value) of a model output result, the sample with the labeling information of 1 is a positive sample, and the sample with the labeling information of 0 is a negative sample. The samples in the embodiment of the application refer to audio samples for training a similarity model, and the audio samples are in the form of sample pairs, namely, the audio samples comprise two sections of audio. Optionally, when the labeling information of the audio sample (pair) is 1, it indicates that two pieces of audio in the audio sample pair are similar audio, namely an audio positive sample pair; when the labeling information of the audio sample (pair) is 0, it indicates that the two pieces of audio in the audio sample pair are not similar audio, i.e., the audio negative sample pair.
The similarity probability of the two audios in an audio positive sample pair can be regarded as 1, or the cluster distance between the two audios can be quantized into the similarity probability. The similarity probability of the two audios in an audio negative sample pair can be regarded as 0, or the cluster distance or vector distance between the two audios can be quantized into the similarity probability, for example by taking the inverse of the cluster distance or the inverse of the vector distance as the similarity probability of the two audios in the audio negative sample pair.
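For completeness, a minimal training sketch for step 404 is given below, using the 1/0 labels described above together with the cross_correlate helper and the AudioMatchingModel sketched earlier. The optimizer, learning rate, epoch count and per-pair updates are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

def train(model, sample_pairs, labels, feature_of, epochs=10, lr=1e-3):
    """sample_pairs: list of (track_a, track_b); labels: 1 for positive pairs,
    0 for negative pairs; feature_of: dict mapping track id to its (length, M)
    feature sequence tensor."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()                     # model outputs softmax probabilities
    for _ in range(epochs):
        for (a, b), y in zip(sample_pairs, labels):
            corr = cross_correlate(feature_of[a], feature_of[b])
            probs = model(corr)                # [p_dissimilar, p_similar]
            loss = loss_fn(torch.log(probs).unsqueeze(0), torch.tensor([y]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```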
Illustratively, the "audio matching model" in the above embodiment includes: a sequence cross-correlation layer, a feature extraction layer and a classification layer.
In summary, in the embodiment of the present application, audio with similar features in the audio library is first clustered according to audio attribute features of different dimensions to obtain audio clusters; audios belonging to the same audio cluster or to different audio clusters are then combined to obtain a plurality of candidate audio pairs; and audio positive sample pairs and audio negative sample pairs are further screened from the candidate audio pairs based on the historical play records of the audio, for subsequent model training. Because clustering integrates multi-dimensional audio attribute features and the positive and negative sample pairs are screened based on users' audio play records, the generated audio sample pairs can reflect the similarity between audios from multiple angles (including the attributes of the audio itself and the listening habits of users). This realizes automatic generation of audio sample pairs while improving their quality, which in turn improves the quality of the model subsequently trained on these audio samples.
Fig. 13 is a block diagram of an audio matching apparatus provided in an exemplary embodiment of the present application. The device comprises:
an obtaining module 1320, configured to obtain a first feature sequence of the first audio and a second feature sequence of the second audio;
A sequence cross-correlation module 1340, configured to perform cross-correlation processing on the first feature sequence and the second feature sequence, and output a cross-correlation vector sequence;
the feature extraction module 1360 is configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector;
a classification module 1380, configured to perform prediction processing on the prediction vector, and output a similarity probability of the first audio and the second audio.
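Illustratively, the cooperation of the above modules may be sketched as a single forward pass; cross_correlate and extract_features stand in for the sequence cross-correlation module and the feature extraction module (sketched after the corresponding embodiments below), and the logistic weights w and b are an assumed form of the classification module.

    import numpy as np

    def audio_match(first_seq, second_seq, cross_correlate, extract_features, w, b):
        # first_seq: (n, d) first feature sequence; second_seq: (q, d) second feature sequence.
        corr_seq = cross_correlate(first_seq, second_seq)   # sequence cross-correlation module
        pred_vec = extract_features(corr_seq)               # feature extraction module
        logit = float(pred_vec @ w + b)                     # classification module (logistic form assumed)
        return 1.0 / (1.0 + np.exp(-logit))                 # similarity probability of the two audios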
In an exemplary embodiment, the first feature sequence includes n first frequency-domain vectors, the second feature sequence includes q second frequency-domain vectors, and n and q are positive integers;
the sequence cross-correlation module 1340 is configured to calculate a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence, and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the first cross-correlation vector sequence and the second cross-correlation vector sequence as the cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate, for an i-th first frequency-domain vector of the n first frequency-domain vectors, an i-th correlation score relative to the q second frequency-domain vectors, where i is an integer not greater than n; take the i-th correlation score as a correlation weight of the i-th first frequency-domain vector, and calculate a weighted sequence of the n first frequency-domain vectors to obtain the first cross-correlation vector sequence; calculate, for a j-th second frequency-domain vector of the q second frequency-domain vectors, a j-th correlation score relative to the n first frequency-domain vectors, where j is an integer not greater than q; and take the j-th correlation score as a correlation weight of the j-th second frequency-domain vector, and calculate a weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate a product sum of the i-th first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors; and determine a quotient of the product sum and the square sum as the i-th correlation score of the i-th first frequency-domain vector relative to the q second frequency-domain vectors;
the sequence cross-correlation module 1340 is configured to calculate a product sum of the j-th second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors; and determine a quotient of the product sum and the square sum as the j-th correlation score of the j-th second frequency-domain vector relative to the n first frequency-domain vectors.
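Illustratively, under the reading that the i-th correlation score is the sum of products of the i-th first frequency-domain vector with all q second frequency-domain vectors, divided by the sum of squares of those second frequency-domain vectors (and symmetrically for the j-th score), the sequence cross-correlation module may be sketched with NumPy as follows; the array shapes and names are assumptions.

    import numpy as np

    def cross_correlate(first_seq, second_seq):
        # first_seq: (n, d) first frequency-domain vectors; second_seq: (q, d) second frequency-domain vectors.
        first_scores = (first_seq @ second_seq.T).sum(axis=1) / np.sum(second_seq ** 2)
        second_scores = (second_seq @ first_seq.T).sum(axis=1) / np.sum(first_seq ** 2)

        first_corr = first_scores[:, None] * first_seq      # weight each first vector by its score
        second_corr = second_scores[:, None] * second_seq   # weight each second vector by its score

        # splice the two weighted sequences into one (n + q, d) cross-correlation vector sequence
        return np.concatenate([first_corr, second_corr], axis=0)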
In one exemplary embodiment, the feature extraction module includes: a frequency domain convolution kernel;
the feature extraction module 1360 is configured to invoke the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along a frequency domain direction to obtain a frequency domain convolution vector sequence; and outputting the prediction vector according to the frequency domain convolution vector sequence.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the frequency domain convolution vector sequence along a frequency domain direction, and determine a frequency domain pooled vector obtained by pooling as the prediction vector.
In an exemplary embodiment, the frequency domain convolution kernels comprise K frequency domain convolution kernels of different scales, K being an integer greater than 1;
the feature extraction module 1360 is configured to call the K different frequency domain convolution kernels to perform frequency domain convolution processing on the cross-correlation vector sequence along a frequency domain direction, so as to obtain K frequency domain convolution vector sequences with different scales.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the K frequency domain convolution vector sequences with different scales along a frequency domain direction, and determine K frequency domain pooled vectors obtained by pooling as the prediction vector.
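Illustratively, the multi-scale frequency-domain convolution and pooling may be sketched as follows; the kernel sizes (3, 5, 7), the random stand-in kernels, and max pooling are assumptions for illustration rather than the trained parameters of the feature extraction module.

    import numpy as np

    def extract_features(corr_seq, kernel_sizes=(3, 5, 7), seed=0):
        # corr_seq: (m, d) cross-correlation vector sequence, m = n + q.
        rng = np.random.default_rng(seed)
        pooled = []
        for k in kernel_sizes:                               # K frequency-domain kernels of different scales
            kernel = rng.standard_normal(k)                  # stand-in for a learned 1-D kernel
            conv = np.stack([np.convolve(row, kernel, mode='valid') for row in corr_seq])
            pooled.append(conv.max(axis=1))                  # pool along the frequency-domain direction
        return np.concatenate(pooled)                        # K pooled vectors spliced as the prediction vector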
It should be noted that: the audio matching device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above. In addition, the audio matching device and the audio matching method provided in the above embodiments belong to the same concept, and detailed implementation processes of the audio matching device and the audio matching method are detailed in the method embodiments, which are not repeated here.
Fig. 14 is a schematic structural view of a computer device according to an exemplary embodiment of the present application. Specifically, the computer device 1400 includes a central processing unit (Central Processing Unit, CPU) 1401, a system memory 1404 including a random access memory 1402 and a read-only memory 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401. The computer device 1400 also includes a basic input/output (I/O) system 1406 that facilitates the transfer of information between the various devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1408 and the input device 1409 are connected to the central processing unit 1401 via an input output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include an input/output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1410 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or drive.
The computer-readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include random access memory (RAM), read-only memory (ROM), flash memory or other solid-state memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the above. The system memory 1404 and the mass storage device 1407 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1401, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1401 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the application, the computer device 1400 may also be operated through a remote computer connected via a network, such as the Internet. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1414 connected to the system bus 1405, or the network interface unit 1414 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also stores one or more programs; the one or more programs are configured to be executed by the computer device and include instructions for performing the steps of the methods provided by the embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium, in which at least one instruction, at least one section of program, a code set or an instruction set is stored, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by a processor to implement the audio matching method according to any one of the embodiments.
The application also provides a computer program product which, when run on a computer, causes the computer to perform the audio matching method provided by the above-mentioned method embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may be the computer-readable storage medium included in the memory of the above embodiments, or may be a standalone computer-readable storage medium that is not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio matching method according to any of the method embodiments described above.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following three cases: A exists alone, both A and B exist, and B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely illustrative of the preferred embodiments of the present application and is not intended to limit the application; any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall fall within its protection scope.

Claims (10)

1. An audio matching method, the method comprising:
acquiring a first characteristic sequence of the first audio and a second characteristic sequence of the second audio; the first characteristic sequence comprises n first frequency domain vectors, the second characteristic sequence comprises q second frequency domain vectors, and n and q are positive integers;
calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence, and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence;
splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the first cross-correlation vector sequence and the second cross-correlation vector sequence as cross-correlation vector sequences;
Calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
2. The method of claim 1, wherein
the computing a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence, comprising:
calculating, for an ith first frequency domain vector of the n first frequency domain vectors, an ith correlation score relative to the q second frequency domain vectors, wherein i is an integer not greater than n; taking the ith correlation score as a correlation weight of the ith first frequency domain vector, and calculating a weighted sequence of the n first frequency domain vectors to obtain the first cross-correlation vector sequence;
the computing a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence, comprising:
calculating, for a j-th second frequency-domain vector of the q second frequency-domain vectors, a j-th correlation score relative to the n first frequency-domain vectors, wherein j is an integer not greater than q; and taking the j-th correlation score as the correlation weight of the j-th second frequency-domain vector, and calculating the weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
3. The method of claim 2, wherein
the computing an i-th first frequency-domain vector of the n first frequency-domain vectors, relative to an i-th correlation score of the q second frequency-domain vectors, comprising:
calculating a product sum of the ith first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors; and determining a quotient of the product sum and the square sum as the i-th correlation score of the i-th first frequency-domain vector relative to the q second frequency-domain vectors;
said computing a j-th second frequency-domain vector of said q second frequency-domain vectors, a j-th correlation score with respect to said n first frequency-domain vectors, comprising:
calculating a product sum of the j-th second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors; and determining a quotient of the product sum and the square sum as the j-th correlation score of the j-th second frequency-domain vector relative to the n first frequency-domain vectors.
4. A method according to any one of claims 1 to 3, wherein the feature extraction layer comprises: a frequency domain convolution kernel;
The calling feature extraction layer performs feature extraction processing on the cross-correlation vector sequence and outputs a prediction vector, and the method comprises the following steps:
invoking the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence;
and outputting the prediction vector according to the frequency domain convolution vector sequence.
5. The method of claim 4, wherein said outputting said prediction vector from said sequence of frequency domain convolution vectors comprises:
and carrying out pooling treatment on the frequency domain convolution vector sequence along the frequency domain direction, and determining one frequency domain pooling vector obtained by pooling as the prediction vector.
6. The method of claim 4, wherein the frequency domain convolution kernel comprises K different frequency domain convolution kernels, K being an integer greater than 1;
the step of calling the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence comprises the following steps:
and respectively calling the K different frequency domain convolution kernels to carry out frequency domain convolution processing on the cross-correlation vector sequences along the frequency domain direction to obtain K frequency domain convolution vector sequences with different scales.
7. The method of claim 6, wherein said outputting said prediction vector from said sequence of frequency domain convolution vectors comprises:
and carrying out pooling treatment on the K frequency domain convolution vector sequences with different scales along the frequency domain direction respectively, and determining K frequency domain pooling vectors obtained by pooling as the prediction vector.
8. An audio matching device, the device comprising:
the acquisition module is used for acquiring a first characteristic sequence of the first audio and a second characteristic sequence of the second audio;
a sequence cross-correlation module for calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence, and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting them as a cross-correlation vector sequence;
a feature extraction module for performing feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for carrying out prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
9. A terminal, the terminal comprising: a processor and a memory storing at least one instruction, at least one program, code set, or instruction set that is loaded and executed by the processor to implement the audio matching method of any one of claims 1 to 7.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the audio matching method of any one of claims 1 to 7.
CN202010202378.5A 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium Active CN111445922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010202378.5A CN111445922B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111445922A CN111445922A (en) 2020-07-24
CN111445922B true CN111445922B (en) 2023-10-03

Family

ID=71654307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010202378.5A Active CN111445922B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111445922B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763927B (en) * 2021-05-13 2024-03-08 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and readable storage medium
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101292280B (en) * 2005-10-17 2015-04-22 皇家飞利浦电子股份有限公司 Method of deriving a set of features for an audio input signal
US8295611B2 (en) * 2009-08-10 2012-10-23 Pixel Forensics, Inc. Robust video retrieval utilizing audio and video data
CN105355214A (en) * 2011-08-19 2016-02-24 杜比实验室特许公司 Method and equipment for measuring similarity
CN107133202A (en) * 2017-06-01 2017-09-05 北京百度网讯科技有限公司 Text method of calibration and device based on artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008026836A (en) * 2006-07-25 2008-02-07 Yamaha Corp Method, device, and program for evaluating similarity of voice
CN109859772A (en) * 2019-03-22 2019-06-07 平安科技(深圳)有限公司 Emotion identification method, apparatus and computer readable storage medium
CN110503976A (en) * 2019-08-15 2019-11-26 广州华多网络科技有限公司 Audio separation method, device, electronic equipment and storage medium
CN110675893A (en) * 2019-09-19 2020-01-10 腾讯音乐娱乐科技(深圳)有限公司 Song identification method and device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Streamlined Tempo Estimation Based on Autocorrelation and Cross-correlation With Pulses; Graham Percival et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; pp. 1765-1771 *
Cover song identification based on cross recurrence plot and local matching; Yang Fan et al.; Journal of East China University of Science and Technology; pp. 247-253 *

Also Published As

Publication number Publication date
CN111445922A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
CN111444967B (en) Training method, generating method, device, equipment and medium for generating countermeasure network
CN111400543B (en) Audio fragment matching method, device, equipment and storage medium
Yang et al. Revisiting the problem of audio-based hit song prediction using convolutional neural networks
EP2159717A2 (en) Hybrid audio-visual categorization system and method
WO2012154470A1 (en) Generating a playlist
CN111428074B (en) Audio sample generation method, device, computer equipment and storage medium
CN111444379B (en) Audio feature vector generation method and audio fragment representation model training method
CN111309966B (en) Audio matching method, device, equipment and storage medium
CN111445922B (en) Audio matching method, device, computer equipment and storage medium
CN112153460A (en) Video dubbing method and device, electronic equipment and storage medium
Hernandez-Olivan et al. Music boundary detection using convolutional neural networks: A comparative analysis of combined input features
Yang Research on music content recognition and recommendation technology based on deep learning
CN111445921B (en) Audio feature extraction method and device, computer equipment and storage medium
WO2016102738A1 (en) Similarity determination and selection of music
US20180173400A1 (en) Media Content Selection
CN111460215B (en) Audio data processing method and device, computer equipment and storage medium
WO2024001548A1 (en) Song list generation method and apparatus, and electronic device and storage medium
JP2006323008A (en) Musical piece search system and musical piece search method
Blume et al. Huge music archives on mobile devices
Mirza et al. Residual LSTM neural network for time dependent consecutive pitch string recognition from spectrograms: a study on Turkish classical music makams
Chen et al. Hierarchical representation based on Bayesian nonparametric tree-structured mixture model for playing technique classification
Tulisalmi-Eskola Automatic Music Genre Classification-Supervised Learning Approach
Miller et al. Geoshuffle: Location-Aware, Content-based Music Browsing Using Self-organizing Tag Clouds.
O’Brien Musical Structure Segmentation with Convolutional Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025589

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant