CN111309965B - Audio matching method, device, computer equipment and storage medium

Info

Publication number: CN111309965B (grant); CN111309965A (application)
Application number: CN202010201517.2A
Authority: CN (China)
Prior art keywords: audio, vector, scale, sequence, vectors
Legal status: Active
Inventor: 缪畅宇
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Other languages: Chinese (zh)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010201517.2A


Classifications

    • G06F16/683 Information retrieval of audio data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/65 Information retrieval of audio data; clustering; classification
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Abstract

The application discloses an audio matching method, an audio matching device, computer equipment and a storage medium, and relates to the technical field of audio. The method comprises the following steps: acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio; matching frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales; splicing the plurality of matched frequency domain vectors under different scales to obtain a prediction vector; and calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio. The similarity of the two audios is calculated in a neural-network-based matching manner, so the similarity between different songs can be calculated and a similarity result with higher precision is obtained for different songs.

Description

Audio matching method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of audio, in particular to an audio matching method, an audio matching device, computer equipment and a storage medium.
Background
Audio matching is a technique for measuring the similarity of two audios. By the type of matching, audio matching falls into two categories: audio segment matching and full audio matching. Audio segment matching refers to determining, given an audio segment P, whether the segment P is part of an audio D. Full audio matching refers to computing, given an audio A and an audio B, the similarity between audio A and audio B.
In the related art, an audio fingerprint technology is provided: relatively significant time-frequency points in an audio file are selected and encoded into a digital sequence by means of hash encoding, and the digital sequence is used as the audio fingerprint. The audio fingerprint technology thus converts the audio matching problem into a retrieval problem between different digital sequences.
Because audio segment matching is mainly performed between an audio segment and the full audio of the same song, the audio fingerprint technology based on signal processing has a good matching effect in the audio segment matching scenario. However, in the full audio matching scenario, what is more often calculated is the similarity between two different songs; here the application of the audio fingerprint technology is limited and a good matching effect cannot be obtained.
Disclosure of Invention
The embodiment of the application provides an audio matching method, an audio matching device, computer equipment and a storage medium, and provides a matching scheme suitable for a full audio matching scene. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an audio matching method, where the method includes:
acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
matching frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
splicing the plurality of matching frequency domain vectors under different scales to obtain a prediction vector;
and calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
In another aspect, an embodiment of the present application provides an audio matching apparatus, including:
the acquisition module is used for acquiring a first multi-scale vector sequence of the first audio and a second multi-scale vector sequence of the second audio;
the matching module is used for matching the frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
The splicing module is used for splicing the plurality of matching frequency domain vectors under different scales to obtain a prediction vector;
and the prediction module is used for calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
In another aspect, embodiments of the present application provide a computer device, where the computer device includes a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored, where the at least one instruction, the at least one program, the code set, or the set of instructions are loaded and executed by the processor to implement the audio matching method as described in the above aspect.
In another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions loaded and executed by a processor to implement the audio matching method as described in the above aspect is provided.
In another aspect, a computer program product is provided which, when run on a computer, causes the computer to perform the audio matching method as described in the above aspect.
The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:
the multi-scale vector sequence uses feature vectors at a plurality of scales to represent the latent features and deep features of the audio. Therefore, by taking the multi-scale vector sequences of the two audios as input and calculating the similarity of the two audios in a neural-network-based matching manner, the similarity between different songs can be calculated, and a similarity result with higher precision is obtained for different songs.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of an audio matching system provided in one exemplary embodiment of the present application;
FIG. 2 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of an audio feature extraction method provided by an exemplary embodiment of the present application;
FIG. 5 is a spectral diagram of audio provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 7 is a flowchart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 8 is a flowchart of an audio feature extraction method provided by another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of time domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of frequency domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic view of stitching feature vectors provided in an exemplary embodiment of the present application;
FIG. 12 is a flowchart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 13 is a flow chart of online matching provided by an exemplary embodiment of the present application;
FIG. 14 illustrates a schematic diagram of a song recommendation scenario provided by an exemplary embodiment of the present application;
FIG. 15 illustrates a schematic diagram of a song scoring scene provided by one exemplary embodiment of the present application;
FIG. 16 is a flowchart of a model training method provided in one exemplary embodiment of the present application;
FIG. 17 is a block diagram of an audio matching device provided in an exemplary embodiment of the present application;
fig. 18 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes directions such as computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Fig. 1 is a block diagram of an audio matching system 100 provided in an exemplary embodiment of the present application. The audio matching system 100 includes: computer device 120, repository 140, server 160, and terminal 180.
The computer device 120 is a computer or server used by a developer. The computer device 120 can calculate the multi-scale vector sequences of all the audios in the audio library offline. The computer device 120 stores the multi-scale vector sequences of all the audios in the repository 140.
The computer device 120 and the repository 140 are connected by a wired network or a wireless network.
The repository 140 stores audio IDs and multi-scale vector sequences for a plurality of audios. The correspondence between the audio ID and the sequence of multi-scale audio vectors may be considered an "audio library". Of course, the audio library may also include audio files of audio, singers, genres, albums, sources, and other information.
The server 160 may be implemented as one server or as a server cluster formed by a group of servers, which may be a physical server or may be implemented as a cloud server. In one possible implementation, server 160 is a background server for an application or applet or web page program in terminal 180.
The server 160 and the repository 140 are connected by a wired network, a wireless network or a data line.
Terminal 180 is an electronic device used by a user. The terminal 180 may be a mobile terminal such as a tablet computer or a laptop portable notebook computer, or may be a terminal such as a desktop computer or a projection computer, which is not limited in this embodiment.
In one application scenario, terminal 180 provides two pieces of audio to server 160: the first audio and the second audio, the request server 160 calculates a similarity between the first audio and the second audio. The server 160 feeds back the similarity between the first audio and the second audio to the terminal 180.
In another application scenario, the terminal 180 provides the first audio to the server 160, the server 160 determines other audio in the audio library as the second audio, calculates the similarity between the first audio and the second audio, and feeds back the second audio with the highest similarity to the terminal 180.
In the above illustrative example, the entire audio matching flow is divided into two parts: an "offline storage stage" and a "search matching stage". The "offline storage stage" extracts a multi-scale vector sequence for each audio in the audio library and stores it in the repository; the "search matching stage" queries the corresponding multi-scale vector sequences according to the audio ID of the first audio and the audio ID of the second audio, and performs multi-scale matching and classification on the multi-scale vector sequences of the two audios.
First, a description is given of "search matching stage":
fig. 2 is a flowchart of an audio matching method provided in an exemplary embodiment of the present application. This embodiment is exemplified by the application of this method to the server 160. The method comprises the following steps:
step 202, obtaining a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio;
The first multi-scale vector sequence comprises K first feature vectors of different scales. Each first feature vector is used for representing the frequency distribution of the audio at a certain scale. The scale relates to the vector dimension and the number of the first feature vectors: under different scales, the vector dimension and/or the number of the first feature vectors are different. Here, the scale refers to the size of the convolution kernel used when extracting the first feature vectors.
The second multi-scale vector sequence comprises K second feature vectors of different scales. Each second feature vector is used for representing the frequency distribution of the audio at a certain scale. The scale relates to the vector dimension and the number of the second feature vectors: under different scales, the vector dimension and/or the number of the second feature vectors are different. Here, the scale refers to the size of the convolution kernel used when extracting the second feature vectors.
The vector dimensions, the number of vectors and the physical meaning of the vectors of the first multi-scale vector sequence and the second multi-scale vector sequence are the same, and the first multi-scale vector sequence and the second multi-scale vector sequence are extracted only according to audio files of two different audios.
The server may calculate the first multi-scale vector sequence and/or the second multi-scale vector sequence in real time, or may read the first multi-scale vector sequence that has been calculated offline from the storage according to the audio ID of the first audio, and read the second multi-scale vector sequence that has been calculated offline from the storage according to the audio ID of the second audio, which is not limited in this embodiment.
Step 204, matching feature vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched feature vectors under different scales;
since the first multi-scale vector sequence and the second multi-scale vector sequence each include feature vectors at K different scales, there are K groups of feature vectors belonging to the same scale.
For each group of two feature vectors at the same scale, the server performs matching calculation on the two feature vectors to obtain a matching feature vector. The K groups of feature vectors are calculated respectively to obtain K matching feature vectors at different scales.
Step 206, splicing a plurality of matching feature vectors under different scales to obtain a prediction vector;
splicing the matching feature vectors under K different scales according to the order of the scales from large to small to obtain a prediction vector; or, splicing the matching feature vectors under K different scales according to the order of the scales from small to large to obtain a prediction vector.
And step 208, calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
Optionally, the classification layer is a softmax function; its input is the prediction vector of the first audio and the second audio, and its output is the similarity probability of the first audio and the second audio. The server performs at least one of audio recommendation, audio scoring, audio classification and audio matching based on the similarity probability of the two audios.
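For illustration, the following minimal Python sketch shows one possible form of the classification layer in step 208: a linear layer followed by softmax over the prediction vector. The function name, weight shapes and the two-class output are assumptions made for illustration, not the patent's trained parameters.

```python
import numpy as np

def classification_layer(prediction_vector, weights, bias):
    """Softmax classification: maps the prediction vector to a similarity probability.

    `weights` (shape (2, len(prediction_vector))) and `bias` (shape (2,)) would come
    from training; here they are placeholders for illustration.
    """
    logits = weights @ prediction_vector + bias
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    probs = exp / exp.sum()                  # [P(dissimilar), P(similar)]
    return probs[1]                          # similarity probability of the two audios
```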
In the personalized recommendation scene, the server is used for acquiring a second feature vector of a second audio in the audio library after obtaining a first feature vector of a first audio provided by the client, searching out the second audio with higher similarity with the first audio by using the audio matching model, and recommending the second audio to the client.
In the audio scoring scene, the server is used for acquiring a second feature vector of a second audio in the audio library after obtaining a first feature vector of a first audio provided by the client, calculating the similarity between the first audio and the second audio by using an audio matching model, and recommending the second audio with higher similarity score to the client.
In the audio matching scenario, the server is configured to obtain a first feature vector of a first audio provided by the client, obtain a second feature vector of a second audio in the audio library, find out the second audio with extremely high similarity to the first audio by using the audio matching model, and recommend audio information (such as song name, singer, style, year, record company, etc.) of the second audio to the client.
In the audio classification scene, the server is used for calculating similarity between every two songs in the audio library, and songs with similarity higher than a threshold value are classified into the same class cluster, so that the songs are classified into the same class.
In summary, since the multi-scale vector sequence uses the frequency domain vectors under multiple scales to represent the potential features and deep features of the audio, the multi-scale vector sequence of the two audio is used as input, and the similarity of the two audio is calculated by adopting a matching mode based on a neural network, so that the similarity between different songs can be calculated, and a similarity calculation result with higher precision is obtained.
In an alternative embodiment based on fig. 2, step 204 includes the following steps 2041 to 2044, as shown in fig. 3:
step 2041, multiplying the first feature vector and the second feature vector of the same scale to obtain a first vector;
Suppose that the first multi-scale vector sequence comprises K first feature vectors { hA1, hA2, …, hAk }, each feature vector having a different scale, and the second multi-scale vector sequence comprises K second feature vectors { hB1, hB2, …, hBk }, each of which also has a different scale.
Feature vectors located at the same position in the two multi-scale vector sequences belong to the same scale. For example, the first feature vector hA1 and the second feature vector hB1 belong to the same scale; the first feature vector hA2 and the second feature vector hB2 belong to the same scale, …, and the first feature vector hAk and the second feature vector hBk belong to the same scale.
The first feature vector and the second feature vector belonging to the same scale are multiplied to obtain a first vector. For example, hA1*hB1, hA2*hB2, …, hAk*hBk.
Step 2042, subtracting the first feature vector and the second feature vector of the same scale to obtain a second vector;
for example, hA1-hB1, hA2-hB2, …, hAk-hBk.
Step 2043, subtracting the second feature vector and the first feature vector of the same scale to obtain a third vector;
for example, hB1-hA1, hB2-hA2, …, hBk-hAk.
And 2044, splicing the first vector, the second vector and the third vector under the ith scale to obtain a matching feature vector under the ith scale, wherein i is an integer not greater than K.
The first vector, the second vector and the third vector at the i-th scale are spliced into hABi = [hAi*hBi, hAi-hBi, hBi-hAi].
For example, the first vector, the second vector and the third vector under the 1st scale are spliced to obtain the matching feature vector {hA1*hB1, hA1-hB1, hB1-hA1} under the 1st scale; the first vector, the second vector and the third vector under the 2nd scale are spliced to obtain the matching feature vector {hA2*hB2, hA2-hB2, hB2-hA2} under the 2nd scale; …; the first vector, the second vector and the third vector under the Kth scale are spliced to obtain the matching feature vector {hAk*hBk, hAk-hBk, hBk-hAk} under the Kth scale.
Since the first vector, the second vector and the third vector at each scale are spliced, K matching feature vectors can be obtained.
Since h1 to hk represent feature vectors of different scales, the above process can be referred to as multi-scale matching.
After K matched feature vectors are obtained through calculation, the matched feature vectors under K different scales are spliced according to the order of the scales from large to small to obtain a prediction vector; or, splicing the matching feature vectors under K different scales according to the order of the scales from small to large to obtain a prediction vector. Then, the prediction vector is input into the classification layer for prediction, and the probability output by the classification layer is the similarity degree of the two audios.
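For illustration, a minimal NumPy sketch of steps 2041 to 2044 and the subsequent splicing is given below; it assumes the two multi-scale vector sequences are lists of equal-length arrays, one feature vector per scale, and the function name is illustrative rather than from the patent.

```python
import numpy as np

def multi_scale_match(seq_a, seq_b):
    """seq_a = [hA1, ..., hAk], seq_b = [hB1, ..., hBk]; returns the prediction vector."""
    matched = []
    for h_a, h_b in zip(seq_a, seq_b):       # feature vectors belonging to the same scale
        h_ab = np.concatenate([h_a * h_b,    # first vector: element-wise product
                               h_a - h_b,    # second vector: A minus B
                               h_b - h_a])   # third vector: B minus A
        matched.append(h_ab)                 # matching feature vector at this scale
    return np.concatenate(matched)           # splice the K matching vectors into the prediction vector
```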
In summary, according to the method provided by this embodiment, the multi-scale vector sequences of the two audios are matched, so that the two audios can be compared at different feature levels, the accuracy of matching the two audios is improved, and an accurate probability can be output as the degree of similarity of the two audios.
Next, an "offline storage phase" is described:
fig. 4 is a flowchart of a method for extracting a multi-scale vector sequence according to an exemplary embodiment of the present application. The present embodiment is exemplified with the method applied in the computer device or server shown in fig. 1.
The method comprises the following steps:
step 402, obtaining a characteristic sequence of audio;
The feature sequence of the audio includes N frequency domain vectors arranged in time order. Each frequency domain vector is M-dimensional, each dimension representing the frequency distribution of the audio at one sampling frequency, and the frequency spacing between adjacent dimensions is the same. N and M are integers greater than 1. Optionally, the process of obtaining the feature sequence is as follows:
Sample the audio in the time dimension at a preset sampling interval (e.g., every 0.1 seconds) to obtain a discrete time sequence T1~Tn, where each T value represents the amplitude of the audio at that sampling point.
Group the time sequence by a fixed time period (e.g., every 3 seconds) to obtain a plurality of time sequence groups G1~GN. Each time sequence group Gi includes a plurality of sampling points, e.g., 3 seconds / 0.1 seconds = 30 sampling points, where i is an integer not greater than N.
Transform the plurality of sampling points belonging to the same time sequence group Gi into a frequency domain vector, obtaining N frequency domain vectors arranged in time order. That is, a time-domain-to-frequency-domain transform is performed on each time sequence group to obtain the frequency domain sequence corresponding to each group Gi. The time-frequency transform includes, but is not limited to, FFT (Fast Fourier Transform), DFT (Discrete Fourier Transform) and MFCC (Mel-scale Frequency Cepstral Coefficients). Each frequency domain sequence represents the distribution of the different frequencies contained in the same time sequence group Gi. The N frequency domain sequences are then sampled at different sampling frequencies to obtain the N frequency domain vectors, where the different sampling frequencies refer to: the upper and lower frequency limits of the audio are divided into a plurality of frequency points, and these frequency points are the different sampling frequencies.
The N frequency domain vectors arranged in time order form a two-dimensional M*N matrix. On this matrix, the axis corresponding to N represents the time domain direction and the axis corresponding to M represents the frequency domain direction. M is the quotient of the difference between the upper and lower frequency limits and the frequency sampling interval.
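For illustration, a minimal NumPy sketch of this feature-sequence extraction is given below, using the example parameters above (0.1 s sampling interval, 3 s groups, FFT as the time-frequency transform). The function and parameter names, the choice of FFT and the interpolation used to resample the spectrum to M frequency points are assumptions made for illustration, not requirements of the patent.

```python
import numpy as np

def audio_feature_sequence(samples, sample_interval=0.1, group_seconds=3.0, m_dims=64):
    """Build the M*N feature sequence: N time groups, each mapped to an M-dimensional frequency vector."""
    points_per_group = int(group_seconds / sample_interval)       # e.g. 3 s / 0.1 s = 30 points
    n_groups = len(samples) // points_per_group
    feature_seq = []
    for i in range(n_groups):
        group = samples[i * points_per_group:(i + 1) * points_per_group]   # time sequence group Gi
        spectrum = np.abs(np.fft.rfft(group))                     # time domain -> frequency domain
        # resample the frequency sequence to a fixed number of frequency points (M dimensions)
        freq_vector = np.interp(np.linspace(0, len(spectrum) - 1, m_dims),
                                np.arange(len(spectrum)), spectrum)
        feature_seq.append(freq_vector)
    return np.stack(feature_seq, axis=1)                          # shape (M, N)
```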
Step 404, invoking a time sequence correlation layer to perform time domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence;
the feature sequence of the audio includes N frequency domain vectors arranged in time order. For the ith frequency-domain vector of the N frequency-domain vectors, the time-domain autocorrelation process is a process operation of measuring the correlation of other frequency-domain vectors to the ith frequency-domain vector.
And calling a time sequence correlation layer to perform time domain autocorrelation processing on the N frequency domain vectors which are arranged according to the time sequence, so as to obtain an autocorrelation vector sequence. The autocorrelation vector sequence includes N autocorrelation feature vectors.
Wherein, N autocorrelation eigenvectors arranged in time sequence form a two-dimensional matrix of M x N. The axis corresponding to N on the two-dimensional matrix represents the time domain direction and the axis corresponding to M represents the frequency domain direction. M is the quotient between the upper and lower frequency distribution limits and the frequency sampling interval.
Step 406, calling a multi-scale time-frequency domain convolution layer to extract multi-scale features of the autocorrelation vector sequence to obtain an audio multi-scale vector sequence;
the multi-scale feature extraction includes: at least one of a time domain multi-scale feature extraction process and a frequency domain multi-scale feature extraction process.
The time domain multi-scale feature extraction processing refers to multi-scale feature extraction processing along the time direction, and the frequency multi-scale feature extraction processing refers to multi-scale feature extraction processing along the frequency direction. The time domain multi-scale feature extraction process and the frequency domain multi-scale feature extraction process are parallel and different multi-scale feature extraction processes.
Optionally, the computer equipment invokes the time domain convolution kernels under different scales to extract time domain features of the autocorrelation vector sequence along the time domain direction, so as to obtain time domain vectors under different scales; invoking frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along the frequency domain direction, so as to obtain frequency domain vectors under different scales; splicing the time domain vector and the frequency domain vector under the same scale to obtain a feature vector of the audio under the same scale; and determining a sequence formed by the feature vectors of the audio under different scales as a multi-scale vector sequence of the audio.
Step 408, storing the multi-scale vector sequence of each audio to a memory bank;
Optionally, the storage form is <ID, {h1, ..., hk}>. ID refers to the audio ID of the audio, {h1, ..., hk} refers to the multi-scale vector sequence of the audio, and k refers to the number of different scales.
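For illustration, a minimal sketch of such a repository is given below, keyed by audio ID and persisted with pickle; the storage backend and function names are assumptions, and a production system could equally use a database or key-value store.

```python
import pickle

def store_vectors(repository: dict, audio_id: str, multi_scale_vectors: list) -> None:
    """Store one <ID, {h1, ..., hk}> record; `repository` is a plain dict here."""
    repository[audio_id] = multi_scale_vectors

def save_repository(repository: dict, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(repository, f)

def load_vectors(path: str, audio_id: str) -> list:
    """Feature query by audio ID, as used later in the search matching stage."""
    with open(path, "rb") as f:
        return pickle.load(f)[audio_id]
```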
In summary, according to the method provided by the embodiment, the time-domain autocorrelation processing is performed on the feature sequence by calling the time-sequence correlation layer to obtain the autocorrelation vector sequence, and the time-frequency domain processing module is called to perform at least one of the time-domain feature extraction processing and the frequency-domain feature extraction processing on the autocorrelation vector sequence to obtain the feature vector of the audio, so that the characteristics of the audio in the time domain and the frequency domain are comprehensively considered, the substantial characteristics of the audio in the time domain and the frequency domain are simultaneously extracted, and the extraction effectiveness of the feature vector of the audio is improved.
In an alternative embodiment based on fig. 4, the feature sequence of the audio mentioned in step 402 is shown in fig. 5. Exemplarily, the audio file of the audio is sampled in the time dimension, such as every 0.1 s, to obtain a discrete time sequence T1~Tn, each value representing the amplitude of the audio at that sampling point. The sequence is then grouped by a fixed time period (e.g., 3 s): with a 3 s period and a 0.1 s sampling interval, each group contains 3 s / 0.1 s = 30 values, e.g., T1~T30 is one group, called G1, T31~T60 is G2, and so on. A frequency domain signal is then obtained by frequency-domain transforming each group of the time sequence (including but not limited to FFT, MFCC, DFT, etc.), representing the distribution of the different frequencies contained within that group; the frequency signal is also sampled, e.g., every 10 Hz, to obtain a discrete frequency sequence. Assuming the upper and lower limits of the frequencies are 0-f, the length of each frequency sequence is f/10, and each Gi can be expressed as such a frequency sequence; different Gi only differ in the magnitudes of the values at the same frequencies. For music, some parts are very low-pitched, and the low-frequency values of those Gi are large; some parts are high-pitched, and the high-frequency values of those Gi are large. So Gi can be expressed both as a time sequence T1~T30 and as a frequency sequence, and together these form a spectrogram. The spectrogram illustrated in fig. 5 is the spectrogram of a real piece of audio after decomposition: the horizontal axis is time, with a period of about 1.75 s, that is, a time slice is cut every 1.75 s; the vertical axis is the frequency corresponding to each time segment, with upper and lower frequency limits of 110 Hz to 3520 Hz, and the gray scale represents the magnitude of the values at different frequencies.
In an alternative embodiment based on fig. 4, step 404 may alternatively be implemented as the following steps 404a and 404b, as shown in fig. 6:
step 404a, calculating an ith correlation score between the ith frequency domain vector and other frequency domain vectors except the ith frequency domain vector, wherein i is an integer not greater than N;
the feature sequence of the audio frequency comprises the following steps: n frequency domain vectors { G ] arranged in time order 1 ,G 2 ,...,G n }. Each G i Are all a frequency domain vector. In order to measure the correlation between other frequency-domain vectors in the feature sequence and the ith frequency-domain vector, the following correlation calculation formula is introduced for the ith frequency-domain vector.
score(G i )=(G 1 *G i +G 2 *G i ...+G n *G i –G i *G i )/(G 1 ^2+G 2 ^2+...+G n ^2–G i ^2)
That is, the computer device calculates a product sum of the i-th frequency-domain vector and other frequency-domain vectors other than the i-th frequency-domain vector; calculating the square sum of other frequency domain vectors except the ith frequency domain vector; the quotient of the sum of products and the sum of squares is determined as an ith correlation score between the ith frequency domain vector and other frequency domain vectors than the ith frequency domain vector.
It should be noted that Gi*Gi (or Gi^2) needs to be subtracted from both the numerator and the denominator, because what is measured is the influence of the other frequency domain vectors on the ith frequency domain vector Gi. However, it is not excluded that in some embodiments Gi*Gi (or Gi^2) is retained in the numerator and the denominator of the above formula.
In step 404b, the i-th correlation score is used as the correlation weight of the i-th frequency domain vector, and the weighted sequences of the N frequency domain vectors are calculated to obtain the autocorrelation vector sequence.
After the calculation, the score(Gi) corresponding to each Gi is obtained. The ith correlation score is used as the correlation weight of the ith frequency domain vector, and the product of the ith frequency domain vector and the ith correlation score is used as the ith autocorrelation vector ti. A similar calculation is performed on all N frequency domain vectors to obtain the autocorrelation vector sequence {t1, ..., tn}, using the following calculation formula.
{t1, ..., tn} = {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}
Optionally, the weighted sequence of N frequency domain vectors refers to: the sequence formed by the weighted products between the ith correlation score and the ith frequency domain vector are arranged in time order.
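For illustration, a minimal NumPy sketch of this time-domain autocorrelation is given below. It treats the products between frequency domain vectors as dot products, which is one possible reading of the formulas above; that interpretation and the function name are assumptions made for illustration.

```python
import numpy as np

def sequence_autocorrelation(feature_seq):
    """Compute ti = Gi * score(Gi) for the feature sequence of shape (M, N)."""
    M, N = feature_seq.shape
    squares = np.sum(feature_seq ** 2, axis=0)            # G1^2, ..., Gn^2 (as dot products)
    scores = np.empty(N)
    for i in range(N):
        g_i = feature_seq[:, i]
        products = feature_seq.T @ g_i                    # G1*Gi, ..., Gn*Gi (as dot products)
        scores[i] = (products.sum() - g_i @ g_i) / (squares.sum() - g_i @ g_i)
    autocorr_seq = feature_seq * scores                   # weight each column Gi by score(Gi)
    return autocorr_seq, scores                           # autocorrelation vector sequence and scores
```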
In summary, according to the method provided by the embodiment, the time-domain autocorrelation processing is performed on the feature sequence by the time-sequence correlation layer, so that the autocorrelation characteristics of different frequency domain vectors in the time domain dimension can be extracted, and the feature extraction effectiveness of the audio in the time domain dimension is improved.
In an alternative embodiment based on fig. 4, step 404 is followed by step 405a and step 405b, as shown in fig. 7:
Step 405a, sampling S autocorrelation vectors from the N autocorrelation vectors according to the order of the correlation scores corresponding to the N autocorrelation vectors from high to low, where S is an integer smaller than N;
The autocorrelation vector sequence includes N autocorrelation vectors. For the purpose of reducing the amount of calculation, a part of the autocorrelation vectors is screened out to participate in the subsequent calculation, in descending order of the correlation scores corresponding to the N autocorrelation vectors.
The value of S is an empirical value, such as 20%-50% of N. Taking N = 100 and S = 20 as an example, the autocorrelation vectors ranked in the top 20 in descending order of score(Gi), such as {t8, t11, t12, …}, are screened out.
Step 405b, determining S autocorrelation vectors as a sampled autocorrelation vector sequence.
Optionally, the S autocorrelation vectors are combined in order of the correlation score from high to low, and determined as a sampled autocorrelation vector sequence. Or, the S autocorrelation vectors are combined in time domain order to determine a sampled autocorrelation vector sequence.
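For illustration, a minimal NumPy sketch of this importance sampling is given below; the sampling ratio and the choice of recombining in time order are assumptions taken from the optional variants described above.

```python
import numpy as np

def importance_sampling(autocorr_seq, scores, ratio=0.2, keep_time_order=True):
    """Keep the S highest-scoring autocorrelation vectors (S is an empirical fraction of N)."""
    n = autocorr_seq.shape[1]
    s = max(1, int(n * ratio))                   # e.g. S = 20% of N
    top_idx = np.argsort(scores)[::-1][:s]       # indices of the S highest correlation scores
    if keep_time_order:
        top_idx = np.sort(top_idx)               # combine in time domain order
    return autocorr_seq[:, top_idx]              # sampled autocorrelation vector sequence, shape (M, S)
```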
In summary, according to the method provided by the embodiment, the autocorrelation vector sequence is sampled according to the importance degree, and a part of important autocorrelation vectors are sampled to form the sampled autocorrelation vector sequence, so that the subsequent calculation workload can be reduced, and the instantaneity of the technical scheme in on-line audio matching is improved.
In an alternative embodiment based on fig. 4, step 406 includes steps 4061 to 4064, as shown in fig. 8:
step 4061, invoking the time domain convolution kernels under different scales to extract time domain features of the autocorrelation vector sequence along the time domain direction, so as to obtain time domain extraction vectors under different scales;
The time domain feature extraction includes at least one of time-direction convolution and time-direction pooling. In various embodiments, the order of the convolution and pooling operations can be combined in many ways: for example, convolution first and then pooling; or pooling first and then convolution; or a fully connected layer first, then convolution, then another fully connected layer, then pooling; multiple iterations (e.g., ResNet-style stacking of many convolution and pooling layers) are also possible.
For a time domain convolution kernel at a certain scale M*P:
time domain direction convolution:
the time domain direction refers to the time domain convolution processing of the autocorrelation characteristic vector sequence along the direction from the early to the late (or from the late to the early) to obtain a time domain convolution vector.
Alternatively, the autocorrelation vector sequence may be regarded as a matrix of M rows by N columns (the sampled autocorrelation vector sequence may be regarded as a matrix of M rows by S columns), each column being an M-dimensional frequency domain vector. Assume that the scale of the time domain convolution kernel is M*P, where P is smaller than N (or S). The time domain direction means that the convolution processing is performed on P adjacent frequency domain vectors along the 0-N direction.
As shown in fig. 9, assuming that the size of the time domain convolution kernel is M*3, when performing the first convolution along the time domain direction, the frequency domain vector t1, the frequency domain vector t2 and the frequency domain vector t3 are convolved to obtain t'1; when performing the second convolution along the time domain direction, the frequency domain vector t2, the frequency domain vector t3 and the frequency domain vector t4 are convolved to obtain t'2; when performing the third convolution along the time domain direction, the frequency domain vector t3, the frequency domain vector t4 and the frequency domain vector t5 are convolved to obtain t'3, and so on; finally N-3+1 time domain convolution vectors t'i are obtained by convolution, where i is not greater than N-P+1 (or S-P+1).
Wherein, the physical meaning of each t' i is a new frequency domain vector obtained by compressing after convolution of P frequency domain vectors. Each t' i is used to represent the correlation between the P frequency domain vectors prior to convolution.
Time domain direction pooling:
optionally, the plurality of time domain convolution vectors are subjected to pooling along the time domain direction, so as to obtain a pooled time domain extraction vector.
When a plurality of time domain convolution vectors under the same scale are subjected to the time domain pooling operation, pooling is also performed along the time direction, and the pooling dimension is consistent with the vector dimension. As shown in fig. 9, after the time domain pooling operation, the above N-P+1 time domain convolution vectors t'1, t'2, …, t'(N-P+1) are compressed into one pooled time domain extraction vector t''. That is, the pooled time domain extraction vector still contains an element at each position, so that the physical meaning of the pooled time domain extraction vector t'' is preserved and it can still be regarded as a new vector compressed in the frequency domain dimension. The time domain extraction vector t'' is used to represent the condensed nature of the plurality of time domain convolution vectors.
It should be noted that, in this embodiment of the present application, the time domain convolution kernel may have K scales, where K is an integer greater than 1, and the value of P in the time domain convolution kernel of each scale is different. The above operation is performed for each time domain convolution kernel, finally obtaining K pooled time domain extraction vectors.
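For illustration, a minimal NumPy sketch of the time-domain multi-scale convolution and pooling is given below. The kernel widths, the untrained random weights, the use of max pooling and the sharing of one kernel across all frequency rows are simplifying assumptions; a real implementation would use learned convolution kernels of size M*P.

```python
import numpy as np

def time_domain_extract(autocorr_seq, widths=(2, 3, 5)):
    """For each kernel width P, convolve P adjacent columns along time, then pool into one t''."""
    M, N = autocorr_seq.shape
    extracted = []
    for P in widths:                                  # one time domain convolution kernel per scale
        kernel = np.random.randn(P)                   # untrained weights, illustrative only
        convs = [autocorr_seq[:, i:i + P] @ kernel    # t'_i from P adjacent frequency domain vectors
                 for i in range(N - P + 1)]
        t_pooled = np.max(np.stack(convs, axis=0), axis=0)   # pool along the time direction
        extracted.append(t_pooled)                    # pooled time domain extraction vector t''
    return extracted                                  # K pooled time domain extraction vectors
```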
Step 4062, invoking frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along the frequency domain direction, so as to obtain frequency domain vectors under different scales;
The frequency domain feature extraction includes at least one of frequency-direction convolution and frequency-direction pooling. In various embodiments, the order of the convolution and pooling operations can be combined in many ways: for example, convolution first and then pooling; or pooling first and then convolution; or a fully connected layer first, then convolution, then another fully connected layer, then pooling; multiple iterations (e.g., ResNet-style stacking of many convolution and pooling layers) are also possible.
For a frequency domain convolution kernel at a certain scale P*N:
frequency domain direction convolution:
the frequency domain direction refers to that the frequency domain convolution processing is performed on the autocorrelation vector sequence along the direction from small to large (or from large to small) of the sampling frequency, so as to obtain a frequency domain convolution vector.
Alternatively, the autocorrelation vector sequence may be regarded as a matrix of M rows by N columns (the sampled autocorrelation vector sequence may be regarded as a matrix of M rows by S columns), each row being an N-dimensional time domain vector. Assume that the size of the frequency domain convolution kernel is P*N, where P is smaller than M. The frequency domain direction means that P adjacent time domain vectors are convolved along the 0-M direction.
As shown in fig. 10, assuming that the size of the frequency domain convolution kernel is 3*N, when performing the first convolution along the frequency domain direction, the time domain vector f1, the time domain vector f2 and the time domain vector f3 are convolved to obtain f'1; when performing the second convolution along the frequency domain direction, the time domain vector f2, the time domain vector f3 and the time domain vector f4 are convolved to obtain f'2; when performing the third convolution along the frequency domain direction, the time domain vector f3, the time domain vector f4 and the time domain vector f5 are convolved to obtain f'3, and so on; finally M-3+1 frequency domain convolution vectors f'i are obtained by convolution, where i is not greater than M-P+1.
Wherein, the physical meaning of each f' i is a new time domain vector obtained by compressing after convolution of P time domain vectors. Each f' i is used to represent the correlation between the P time domain vectors prior to convolution.
Frequency domain direction pooling:
When a plurality of frequency domain convolution vectors under the same scale are subjected to the frequency domain pooling operation, pooling is performed along the frequency direction, and the pooling dimension is consistent with the vector dimension. As shown in fig. 10, after the frequency domain pooling operation, the above M-P+1 frequency domain convolution vectors f'1, f'2, …, f'(M-P+1) are compressed into one pooled frequency domain extraction vector f''. That is, the pooled frequency domain extraction vector still contains an element at each position, so that the physical meaning of the pooled frequency domain extraction vector f'' is preserved and it can still be regarded as a new vector compressed in the time dimension. The pooled frequency domain extraction vector f'' is used to represent the condensed nature of the plurality of frequency domain convolution vectors.
In this embodiment of the present application, the frequency domain convolution kernel may be K scales, where K is an integer greater than 1, where P in the frequency domain convolution kernel of each scale has a different value, and the foregoing operation may be performed on each frequency domain convolution kernel, to finally obtain K pooled frequency domain extraction vectors.
Step 4063, splicing the time domain extraction vector and the frequency domain extraction vector under the same scale to obtain a feature vector of the audio under the same scale;
As shown in fig. 11, the time domain extraction vector t'' and the frequency domain extraction vector f'' under the same scale are spliced to obtain the feature vector {t'', f''} of the audio under that scale.
Step 4064, a sequence of feature vectors of the audio at different scales is determined as a multi-scale vector sequence of the audio.
Optionally, for each scale j, the time domain extraction vector t''j and the frequency domain extraction vector f''j are spliced to obtain the feature vector {t''j, f''j} of the audio at scale j. Then, in order of scale from small to large or from large to small, the multi-scale feature vector sequence of the audio is obtained by splicing: {t''1, f''1, t''2, f''2, …, t''k, f''k}, or {t''1, t''2, …, t''k, f''1, f''2, …, f''k}.
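Continuing the sketch above, the frequency-domain branch can be obtained by applying the same operation to the transposed matrix, and the two branches are then spliced per scale into the multi-scale vector sequence. This reuses the illustrative time_domain_extract function and is likewise an assumption-based sketch rather than the patent's exact implementation.

```python
import numpy as np

def frequency_domain_extract(autocorr_seq, heights=(2, 3, 5)):
    """Frequency-direction counterpart of time_domain_extract (rows treated as the sliding axis)."""
    return time_domain_extract(autocorr_seq.T, widths=heights)

def multi_scale_vector_sequence(autocorr_seq):
    """Splice t''j and f''j at each scale j into the audio's multi-scale vector sequence {h1, ..., hk}."""
    t_vecs = time_domain_extract(autocorr_seq)        # K pooled time domain extraction vectors
    f_vecs = frequency_domain_extract(autocorr_seq)   # K pooled frequency domain extraction vectors
    return [np.concatenate([t, f]) for t, f in zip(t_vecs, f_vecs)]   # h_j = {t''j, f''j}
```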
In summary, according to the method provided by the embodiment, the time-domain autocorrelation processing is performed on the feature sequence by calling the time-sequence correlation layer to obtain the autocorrelation vector sequence, and the time-frequency domain processing module is called to perform at least one of the time-domain feature extraction processing and the frequency-domain feature extraction processing on the autocorrelation vector sequence to obtain the feature vector of the audio, so that the characteristics of the audio in the time domain and the frequency domain are comprehensively considered, the substantial characteristics of the audio in the time domain and the frequency domain are simultaneously extracted, and the extraction effectiveness of the feature vector of the audio is improved.
Fig. 12 is a flow chart of an audio matching method of an exemplary embodiment. The whole flow is divided into two parts:
The left part is called the offline storage stage: features are extracted from each piece of music in the music library and stored in the repository 1260. The right part is called the search matching stage: the respective features of two pieces of music are queried from the repository 1260, the two pieces of music are matched, and whether they are similar is output.
Offline storage stage:
this stage uses a sequence autocorrelation module 1220 and a multi-scale time-frequency domain convolution module 1240 for feature extraction.
The present application inputs the spectrogram of a piece of audio into the sequence autocorrelation module 1220, outputs the processed autocorrelation vector sequence, and then performs sequence importance sampling. The purpose of this step is to sample, from the audio sequence, the autocorrelation vectors with higher importance for subsequent processing, so as to reduce the computation pressure. The strategy adopted in the application is to sort by the score(Gi) obtained in the last step of the sequence autocorrelation module 1220 and take the first k autocorrelation vectors as output, where k is set empirically, generally to 20%-50% of the total sequence number. For example, for the sequence {G1, G2, ..., G100} and k = 20, the sequence after sorting by score(Gi) is {G2, G8, G9, ...}, 20 sequences in total.
After importance sampling is finished, the application performs the multi-scale time-frequency domain convolution: pooling operations are performed on the two-dimensional matrix representations h1 and h2 at different scales, where k is the number of convolution kernels of different scales, and the k corresponding vector representations are obtained and then input into the repository. Thus, for a piece of audio, the application finally represents it with k feature vectors. The dimensions and physical meanings of the k feature vectors are consistent, and each feature vector is formed by splicing a time domain vector and a frequency domain vector, so that important information of the audio in the time dimension and the frequency dimension is reflected.
The present application processes each piece of music in the library in this way, so that a final repository is obtained for storing the feature representation of each piece of music, i.e. k vectors, in the form <ID, {h1, ..., hk}>. Because these vector dimensions and physical meanings are consistent, they are comparable.
The offline storage phase is accomplished offline, ultimately resulting in a repository 1260 serving search matches on-line.
Search matching stage:
for two pieces of music A and B to be queried on line, the application obtains respective k multi-scale feature vectors according to the audio IDs, namely feature query 1282 in the block diagram.
Assuming that the k feature vectors corresponding to A are {hA1, ..., hAk} and those corresponding to B are {hB1, ..., hBk}, the present application next matches the two sets of k feature vectors pairwise; for example, for hAi and hBi, the present application obtains a prediction vector:
hABi=[hAi*hBi,hAi-hBi,hBi-hAi]
The multiplication and subtraction signs indicate that the elements at the same positions of the two feature vectors are operated on element by element, and the results are finally spliced together to obtain hABi. Since h1 to hk represent the results of convolution kernels of different scales, this step is called multi-scale matching.
In this way the application obtains {hAB1, ..., hABk}. The application concatenates the k vectors together and inputs them to the classification layer 1286; the classification layer 1286 is a softmax function, and the output Y is a similarity probability representing the degree to which the two audios are similar.
Since the multi-scale vector sequences in this application are computed offline and stored in the repository 1260, and are subjected to sequence importance sampling during the offline computation, only the multi-scale matching with a small calculation amount and the classification layer prediction are needed during online matching.
As shown in fig. 13, when the order of magnitude of the music in the music library is between millions and tens of millions, it is suitable to use the audio matching model to predict the similarity probability between two full audios in an offline matching scenario; when the order of magnitude of the music in the music library is between ten and a thousand, it is suitable to use the audio matching model to predict the similarity probability between two full audios in an online matching scenario; when the order of magnitude of the music in the music library is between thousands and millions, it is suitable to use the audio matching model to predict the similarity probability between two full audios in a near-line matching scenario. The audio matching model provided by the embodiment of the application (multi-scale matching + classification layer, or time sequence autocorrelation layer + multi-scale time-frequency domain convolution layer + multi-scale matching + classification layer) is suitable for the online matching scenario between the ten and thousand orders of magnitude.
In one illustrative example, the above-described feature vectors of audio are used for training and prediction of an audio matching model. The audio matching model is the audio matching model in the above embodiment, and after training by adopting the feature vector of the audio provided by the embodiment of the application, the audio matching model can be used for predicting the similarity between two audios.
Audio recommendation scenarios:
referring to the example shown in fig. 14, where the user uses the terminal 180 with an audio playing application, the user plays, favorites or likes a first audio (a song) on the audio playing application, and the server 160 may compare a first multi-scale vector sequence of the first audio (a song) with a second multi-scale vector sequence of a plurality of second audio (B song) to determine a likelihood of similarity of the first audio and the second audio. According to the order of the similarity probability from high to low, the B song, the C song, the D song and the E song which are similar to the a song are sent to the audio playing application program on the terminal 180 as recommended songs, so that the user can hear more songs which accord with the preference of the user.
Singing scoring scene:
referring to the example shown in fig. 15, a singing application runs on the terminal 180 used by a user. When the user sings a song, the server 160 may compare a first multi-scale vector sequence of a first audio (the user's singing) with a second multi-scale vector sequence of a second audio (the original singer's version, a star's version, or a high-scoring version of the song) to determine the similarity probability of the first audio and the second audio. A singing score is given to the user according to the similarity probability and fed back to the singing application for display, helping the user improve his or her singing.
FIG. 16 is a flowchart of a model training method provided in an exemplary embodiment of the present application. The model training method may be used to train the classification layer in the above embodiments. The embodiment is exemplified by the application of the method to the server shown in fig. 1. The method comprises the following steps:
step 501, clustering the audio in the audio library according to the audio attribute features to obtain an audio class cluster, wherein the audio attribute features comprise at least two attribute features with different dimensions, and the feature similarity of the audio in the different audio class clusters is lower than that of the audio in the same audio class cluster.
The audio library stores a large amount of audio, which may include songs, pure music, symphonies, piano pieces, or other instrumental music; the embodiment of the present application does not limit the type of audio in the audio library. Optionally, the audio library is the music library of an audio playing application.
Optionally, each audio has its own audio attribute features; the audio attribute features may be attribute features of the audio itself or manually assigned attribute features, and the same audio may have attribute features of a plurality of different dimensions.
In one possible implementation, the audio attribute features of the audio include at least one of: text features, audio features, emotion features, and scene features. Optionally, the text features may include text features of the audio itself (such as lyrics, composer, lyricist, genre, etc.) and may also include manually assigned text features (such as comments); the audio features are used to characterize acoustic properties of the audio itself such as melody, rhythm and duration; the emotion features are used to characterize the emotion expressed by the audio; and the scene features are used to characterize the playback scene in which the audio is used. Of course, in addition to the above audio attribute features, the audio may also include attribute features of other dimensions, which is not limited in this embodiment.
In the embodiment of the present application, the process of performing audio clustering based on the audio attribute features may be referred to as preliminary screening, and is used for preliminarily screening the audio with similar audio attribute features. In order to improve the primary screening quality, the computer equipment clusters according to at least two attribute features with different dimensions, and clustering deviation caused by clustering based on attribute features with single dimension is avoided.
After clustering, the computer device obtains a plurality of audio class clusters, and the audio in the same audio class cluster has similar audio attribute features (compared with the audio in other audio class clusters). The number of audio class clusters can be preset in the clustering stage (for example, based on empirical values), so that the clusters are neither too coarse nor too fine.
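As an illustrative sketch of this preliminary screening step, assuming each audio has already been encoded as a fixed-length attribute feature vector combining several dimensions, and using scikit-learn's KMeans as one possible clustering algorithm (the embodiment does not prescribe a specific algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_audio_library(attribute_features: np.ndarray, num_clusters: int) -> np.ndarray:
    """attribute_features: (num_audio, feature_dim) matrix mixing several attribute dimensions.

    Returns one cluster id per audio; num_clusters is the preset (empirical) cluster count.
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(attribute_features)
```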
Step 502, generating a candidate audio pair according to the audio in the audio class cluster, wherein the candidate audio pair comprises two sections of audio, and the two sections of audio belong to the same audio class cluster or different audio class clusters.
Because the audio in the same audio class cluster has similar audio attribute characteristics, and the audio in different audio class clusters has larger difference in the audio attribute characteristics, the server can initially generate audio samples based on the audio class clusters, wherein each audio sample is a candidate audio pair consisting of two pieces of audio.
Because the audio library contains a large amount of audio, the number of candidate audio pairs generated based on the audio class clusters is also very large; for example, for an audio library containing y pieces of audio, the number of candidate audio pairs generated is C(y, 2). However, while a massive number of candidate audio pairs can be generated based on the audio class clusters, not all candidate audio pairs are usable for subsequent model training. For example, when the candidate audio pair is the same song (such as the same song sung by different singers), or the two audios in the candidate audio pair are completely different (such as a British ballad and a suona piece), the candidate audio pair is too simple to serve as a model training sample for training a high-quality model.
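A hypothetical sketch of candidate-pair generation from the cluster assignment above, enumerating the C(y, 2) unordered pairs and recording whether the two audios share a cluster:

```python
from itertools import combinations

def generate_candidate_pairs(audio_ids, cluster_ids):
    """Enumerate the C(y, 2) candidate audio pairs and flag same-cluster pairs."""
    labelled = list(zip(audio_ids, cluster_ids))
    return [(a, b, cluster_a == cluster_b)
            for (a, cluster_a), (b, cluster_b) in combinations(labelled, 2)]
```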
In order to improve the quality of the audio samples, in the embodiment of the application, the server further screens out high-quality audio pairs from the candidate audio pairs as the audio samples through fine screening.
Step 503, determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play record of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio class cluster, and the audio in the audio negative sample pair belongs to different audio class clusters.
Analysis shows that users' audio playing behavior is closely related to the similarity between audios; for example, users tend to consecutively play audios that are highly similar to one another but are not the same audio. Therefore, in the embodiment of the present application, the computer device finely screens the generated candidate audio pairs based on the historical play records of the audio to obtain audio sample pairs. The audio sample pairs obtained by fine screening include audio positive sample pairs composed of similar audio (screened from candidate audio pairs composed of audio in the same audio class cluster) and audio negative sample pairs composed of dissimilar audio (screened from candidate audio pairs composed of audio in different audio class clusters).
Optionally, the historical play record is the audio play record under each user account, which may be an audio play list formed according to the play sequence. For example, the historical play record may be the song play records of each user collected by the server of the audio playing application.
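One plausible reading of this fine-screening rule is sketched below: count how often two audios are played consecutively in users' play lists, keep frequently co-played same-cluster pairs as positive samples and rarely co-played cross-cluster pairs as negative samples. The co-play thresholds are assumptions for the example, not values fixed by the embodiment.

```python
from collections import Counter

def screen_sample_pairs(candidate_pairs, play_lists, pos_threshold=3, neg_threshold=0):
    """Split candidate pairs into positive/negative samples using co-play counts.

    candidate_pairs: (audio_a, audio_b, same_cluster) tuples from the coarse screening.
    play_lists: one time-ordered list of played audio ids per user account.
    """
    co_play = Counter()
    for plays in play_lists:
        for x, y in zip(plays, plays[1:]):        # consecutively played audios
            co_play[frozenset((x, y))] += 1

    positive_pairs, negative_pairs = [], []
    for a, b, same_cluster in candidate_pairs:
        count = co_play[frozenset((a, b))]
        if same_cluster and count >= pos_threshold:        # similar audio, often co-played
            positive_pairs.append((a, b))
        elif not same_cluster and count <= neg_threshold:  # dissimilar audio, rarely co-played
            negative_pairs.append((a, b))
    return positive_pairs, negative_pairs
```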
In some embodiments, the audio positive sample pairs and audio negative sample pairs screened based on the historical play records have a low degree of mutual distinction (that is, they are hard samples), which improves the quality of the model obtained by subsequent training based on these audio sample pairs.
Step 504, machine learning training is performed on the classification layer according to the audio positive sample pair and the audio negative sample pair.
The sample is an object for model training and testing, and the object contains labeling information, wherein the labeling information is a reference value (or referred to as a true value or a supervision value) of a model output result, the sample with the labeling information of 1 is a positive sample, and the sample with the labeling information of 0 is a negative sample. The samples in the embodiments of the present application refer to audio samples for training a similarity model, and the audio samples are in the form of sample pairs, that is, the audio samples include two pieces of audio. Optionally, when the labeling information of the audio sample (pair) is 1, it indicates that two pieces of audio in the audio sample pair are similar audio, namely an audio positive sample pair; when the labeling information of the audio sample (pair) is 0, it indicates that the two pieces of audio in the audio sample pair are not similar audio, i.e., the audio negative sample pair.
The similarity probability of the two audios in an audio positive sample pair can be regarded as 1, or the clustering distance between the two audios can be quantized into the similarity probability. The similarity probability of the two audios in an audio negative sample pair can be regarded as 0, or the cluster distance or vector distance between the two audios can be quantized into the similarity probability, for example by taking the inverse of the cluster distance or the inverse of the vector distance as the similarity probability of the two audios in the audio negative sample pair.
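The labelling options mentioned above can be sketched as follows; vector_distance is an assumed helper, and the inverse-distance quantisation is only one possible choice rather than the embodiment's prescribed formula.

```python
def label_sample_pairs(positive_pairs, negative_pairs, vector_distance=None):
    """Attach a supervision value to each sample pair: 1 for positives, 0 or a
    quantised inverse distance for negatives."""
    samples = [(pair, 1.0) for pair in positive_pairs]
    for pair in negative_pairs:
        if vector_distance is None:
            samples.append((pair, 0.0))
        else:
            # one possible quantisation: the larger the distance, the smaller the probability
            samples.append((pair, 1.0 / (1.0 + vector_distance(*pair))))
    return samples
```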
To sum up, in the embodiment of the present application, audio with similar features in the audio library is first clustered according to audio attribute features of different dimensions to obtain audio class clusters; audio belonging to the same or different audio class clusters is then combined to obtain a plurality of candidate audio pairs; and audio positive sample pairs and audio negative sample pairs are further screened from the candidate audio pairs based on the historical play records of the audio for subsequent model training. Because the clustering integrates attribute features of multiple dimensions and the positive and negative sample pairs are screened based on users' audio play records, the generated audio sample pairs can reflect the similarity between audios from multiple angles (including the attributes of the audio itself and users' listening habits). The audio sample pairs are thus generated automatically while their quality is improved, which in turn improves the quality of the model subsequently trained on these audio samples.
Fig. 17 is a block diagram of an audio matching apparatus provided in an exemplary embodiment of the present application. The audio matching apparatus includes:
an acquisition module 1720 for acquiring a first multi-scale vector sequence of the first audio and a second multi-scale vector sequence of the second audio;
The matching module 1740 is configured to match frequency domain vectors belonging to the same scale in the first multi-scale vector sequence and the second multi-scale vector sequence to obtain a plurality of matched frequency domain vectors under different scales;
the splicing module 1760 is configured to splice the plurality of matching frequency domain vectors under different scales to obtain a prediction vector;
and the classification module 1780 is used for calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
In an alternative embodiment, the first multi-scale vector sequence includes K first feature vectors of different scales, the second multi-scale vector sequence includes K second feature vectors of different scales, and K is an integer greater than 1;
the matching module 1740 is configured to multiply the first feature vector and the second feature vector of the same scale to obtain a first vector; subtract the first feature vector and the second feature vector of the same scale to obtain a second vector; subtract the second feature vector and the first feature vector of the same scale to obtain a third vector; and splice the first vector, the second vector and the third vector under the i-th scale to obtain a matching feature vector under the i-th scale, where i is an integer not greater than K.
In an optional embodiment, the splicing module 1760 is configured to perform second splicing on the matching feature vectors under the K different scales in order from large scale to small scale to obtain the prediction vector; or to perform second splicing on the matching feature vectors under the K different scales in order from small scale to large scale to obtain the prediction vector.
In an alternative embodiment, the acquiring module 1720 is configured to acquire the first multi-scale vector sequence of the first audio in a repository and the second multi-scale vector sequence of the second audio in the repository.
In an alternative embodiment, the apparatus further comprises: a feature extraction module 1710;
the feature extraction module 1710 is configured to obtain a feature sequence of audio, where the audio includes the first audio and the second audio; invoking a time sequence correlation layer to perform time domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence; invoking a multi-scale time-frequency domain convolution layer to extract multi-scale features of the autocorrelation vector sequence to obtain the multi-scale vector sequence of the audio; storing the multi-scale vector sequence of the audio to the repository.
In an optional embodiment, the feature sequence includes N frequency domain vectors ordered according to time, and the feature extracting module 1710 is configured to calculate an i-th correlation score between an i-th frequency domain vector and other frequency domain vectors except the i-th frequency domain vector, where i is an integer not greater than N; and calculating the weighted sequences of the N frequency domain vectors by taking the ith correlation score as the correlation weight of the ith frequency domain vector to obtain the autocorrelation vector sequence.
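A minimal NumPy sketch of this time-domain autocorrelation processing, assuming the feature sequence is an (N, D) array of time-ordered frequency-domain vectors and using the average dot product with the other vectors as the i-th correlation score; the concrete scoring function and normalisation are assumptions, not fixed by this description.

```python
import numpy as np

def time_domain_autocorrelation(feature_sequence: np.ndarray):
    """feature_sequence: (N, D) array of N time-ordered frequency-domain vectors.

    Returns the autocorrelation vector sequence and the per-vector correlation scores.
    """
    n = feature_sequence.shape[0]
    dots = feature_sequence @ feature_sequence.T      # pairwise dot products, shape (N, N)
    np.fill_diagonal(dots, 0.0)                       # exclude the i-th vector itself
    scores = dots.sum(axis=1) / (n - 1)               # i-th correlation score
    weights = scores / (np.abs(scores).sum() + 1e-8)  # normalise scores into correlation weights
    autocorrelation_vectors = weights[:, None] * feature_sequence
    return autocorrelation_vectors, scores
```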
In an optional embodiment, the autocorrelation vector sequence includes N autocorrelation vectors, and the feature extraction module 1710 is configured to sample S autocorrelation vectors from the N autocorrelation vectors in order of high correlation scores corresponding to the N autocorrelation vectors, where S is an integer smaller than N;
and determining the S autocorrelation vectors as the sampled autocorrelation vector sequence.
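The sequence importance sampling described above can be sketched as follows, keeping the S autocorrelation vectors with the highest correlation scores; restoring the original time order afterwards is a design choice assumed here, not mandated by the description.

```python
import numpy as np

def sample_autocorrelation_vectors(autocorrelation_vectors, scores, s):
    """Keep the S autocorrelation vectors with the highest correlation scores."""
    top = np.argsort(scores)[::-1][:s]   # indices of the S highest scores
    top = np.sort(top)                   # optionally restore the original time order
    return autocorrelation_vectors[top]
```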
In an optional embodiment, the feature extraction module 1712 is configured to invoke time-domain convolution kernels at different scales to perform time-domain feature extraction on the autocorrelation vector sequence along the time-domain direction to obtain time-domain vectors at different scales; invoke frequency-domain convolution kernels at different scales to perform frequency-domain feature extraction on the autocorrelation vector sequence along the frequency-domain direction to obtain frequency-domain vectors at different scales; splice the time-domain vector and the frequency-domain vector at the same scale to obtain a feature vector of the audio at that scale; and determine the sequence formed by the feature vectors of the audio at different scales as the multi-scale vector sequence of the audio. The apparatus further includes a storage module 1714 for storing the multi-scale vector sequence of the audio.
In an alternative embodiment, the time domain feature extraction includes: at least one of time domain direction convolution and time domain direction pooling; the frequency domain feature extraction includes: at least one of frequency domain direction convolution and frequency domain direction pooling.
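A hedged PyTorch sketch of one possible multi-scale time-frequency domain convolution layer: 1-D convolutions are run along the time direction and along the frequency direction at several kernel scales, each direction is pooled, and the two directions are spliced per scale. The channel sizes, kernel scales and the use of max pooling are assumptions for the example.

```python
import torch
import torch.nn as nn

class MultiScaleTimeFreqConv(nn.Module):
    """Per-scale time-direction and frequency-direction 1-D convolutions with pooling."""

    def __init__(self, time_dim, freq_dim, channels=32, scales=(3, 5, 7)):
        super().__init__()
        # time-direction kernels: the frequency bins act as input channels
        self.time_convs = nn.ModuleList(
            [nn.Conv1d(freq_dim, channels, k, padding=k // 2) for k in scales])
        # frequency-direction kernels: the time steps act as input channels
        self.freq_convs = nn.ModuleList(
            [nn.Conv1d(time_dim, channels, k, padding=k // 2) for k in scales])

    def forward(self, x):   # x: (batch, time_dim, freq_dim) autocorrelation vector sequence
        per_scale_vectors = []
        for t_conv, f_conv in zip(self.time_convs, self.freq_convs):
            t = t_conv(x.transpose(1, 2)).amax(dim=2)           # convolve along time, pool over time
            f = f_conv(x).amax(dim=2)                           # convolve along frequency, pool over frequency
            per_scale_vectors.append(torch.cat([t, f], dim=1))  # splice the two directions
        return per_scale_vectors   # list of K feature vectors, one per scale
```

In this sketch the number of time steps must be fixed (for example, the S vectors kept by the importance sampling above), because the time dimension acts as the channel count of the frequency-direction convolution.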
In an alternative embodiment, the apparatus further comprises: a training module 1790;
the training module 1790 is configured to cluster the audio in the audio library according to the audio attribute features to obtain an audio class cluster, where the audio attribute features include at least two attribute features with different dimensions, and feature similarity of the audio in the different audio class clusters is lower than that of the audio in the same audio class cluster; generating a candidate audio pair according to the audio in the audio class cluster, wherein the candidate audio pair comprises two sections of audio, and the two sections of audio belong to the same audio class cluster or different audio class clusters; determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play record of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio class cluster, and the audio in the audio negative sample pair belongs to different audio class clusters; and performing machine learning training on the classification layer according to the audio positive sample pair and the audio negative sample pair.
It should be noted that: the audio matching device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above. In addition, the audio matching device and the audio matching method provided in the above embodiments belong to the same concept, and detailed implementation processes of the audio matching device and the audio matching method are detailed in the method embodiments, which are not repeated here.
Fig. 18 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. The computer device 1800 includes a central processing unit (Central Processing Unit, CPU) 1801, a system memory 1804 including a random access memory 1802 and a read-only memory 1803, and a system bus 1805 connecting the system memory 1804 and the central processing unit 1801. The computer device 1800 also includes a basic input/output system (I/O system) 1806, which helps to transfer information between the various components within the computer, and a mass storage device 1807 for storing an operating system 1813, application programs 1814, and other program modules 1815.
The basic input/output system 1806 includes a display 1808 for displaying information and an input device 1809, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1808 and the input device 1809 are coupled to the central processing unit 1801 via an input output controller 1810 coupled to the system bus 1805. The basic input/output system 1806 can also include an input/output controller 1810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1807 is connected to the central processing unit 1801 through a mass storage controller (not shown) connected to the system bus 1805. The mass storage device 1807 and its associated computer-readable media provide non-volatile storage for the computer device 1800. That is, the mass storage device 1807 may include a computer-readable medium (not shown), such as a hard disk or drive.
The computer-readable medium may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include Random Access Memory (RAM), Read-Only Memory (ROM), flash memory or other solid-state memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1804 and the mass storage device 1807 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1801, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1801 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1800 may also run by being connected through a network, such as the Internet, to a remote computer on the network. That is, the computer device 1800 may connect to the network 1812 through a network interface unit 1811 connected to the system bus 1805, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1811.
The memory further includes one or more programs, which are stored in the memory and include instructions for performing the steps, to be performed by the computer device, of the methods provided by the embodiments of the present application.
The embodiment of the application further provides a computer readable storage medium, where at least one instruction, at least one section of program, a code set, or an instruction set is stored, where at least one instruction, at least one section of program, a code set, or an instruction set is loaded and executed by a processor to implement the audio matching method described in any of the foregoing embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the audio matching method provided by the above-mentioned respective method embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium. The computer-readable storage medium may be the computer-readable storage medium included in the memory of the above embodiments, or may be a standalone computer-readable storage medium that is not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio matching method according to any of the method embodiments described above.
Alternatively, the computer-readable storage medium may include: a ROM, a RAM, a solid state drive (SSD, Solid State Drive), an optical disc, or the like. The RAM may include a resistive random access memory (ReRAM, Resistance Random Access Memory) and a dynamic random access memory (DRAM, Dynamic Random Access Memory). The above embodiment numbers of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is merely a description of preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (12)

1. An audio matching method, the method comprising:
acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio, wherein the first multi-scale vector sequence comprises K first feature vectors with different scales, the second multi-scale vector sequence comprises K second feature vectors with different scales, and K is an integer larger than 1;
multiplying the first characteristic vector and the second characteristic vector with the same scale to obtain a first vector;
subtracting the first characteristic vector and the second characteristic vector of the same scale to obtain a second vector;
Subtracting the second characteristic vector and the first characteristic vector of the same scale to obtain a third vector;
splicing the first vector, the second vector and the third vector under the ith scale to obtain a matching feature vector under the ith scale, wherein i is an integer not more than K;
splicing the matching feature vectors under K different scales to obtain a prediction vector;
and calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
2. The method according to claim 1, wherein the stitching the matching feature vectors under K different scales to obtain a prediction vector includes:
performing second splicing on the matching feature vectors under the K different scales according to the order from large scale to small scale to obtain the prediction vector;
or,
and performing second splicing on the matching feature vectors under the K different scales according to the order from small scale to large scale to obtain the prediction vector.
3. The method according to claim 1 or 2, wherein the obtaining a first sequence of multi-scale vectors for a first audio and a second sequence of multi-scale vectors for a second audio comprises:
The first multi-scale vector sequence of the first audio in a memory bank and the second multi-scale vector sequence of the second audio in the memory bank are obtained.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a characteristic sequence of audio, wherein the audio comprises the first audio and the second audio;
invoking a time sequence correlation layer to perform time domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence;
invoking a multi-scale time-frequency domain convolution layer to extract multi-scale features of the autocorrelation vector sequence to obtain the multi-scale vector sequence of the audio;
storing the multi-scale vector sequence of the audio to the repository.
5. The method of claim 4, wherein the feature sequence includes N frequency domain vectors ordered in time, and wherein the invoking the time sequence correlation layer to perform time domain autocorrelation processing on the feature sequence to obtain an autocorrelation vector sequence comprises:
calculating an ith correlation score between an ith frequency domain vector and other frequency domain vectors except the ith frequency domain vector, wherein i is an integer not more than N;
and calculating the weighted sequences of the N frequency domain vectors by taking the ith correlation score as the correlation weight of the ith frequency domain vector to obtain the autocorrelation vector sequence.
6. The method of claim 5, wherein the sequence of autocorrelation vectors comprises N autocorrelation vectors, the method further comprising:
sampling S autocorrelation vectors from the N autocorrelation vectors according to the sequence of the correlation scores corresponding to the N autocorrelation vectors from high to low, wherein S is an integer smaller than N;
and determining the S autocorrelation vectors as the sampled autocorrelation vector sequence.
7. The method of claim 4, wherein invoking the multi-scale time-frequency domain convolution layer to perform multi-scale feature extraction on the autocorrelation vector sequence to obtain the multi-scale vector sequence of the audio comprises:
invoking time domain convolution kernels under different scales to extract time domain features of the autocorrelation vector sequence along the time domain direction to obtain time domain vectors under different scales;
invoking frequency domain convolution kernels under different scales to extract frequency domain features of the autocorrelation vector sequence along the frequency domain direction to obtain frequency domain vectors under different scales;
splicing the time domain vector and the frequency domain vector under the same scale to obtain a feature vector of the audio under the same scale;
and determining a sequence formed by the feature vectors of the audio under different scales as a multi-scale vector sequence of the audio.
8. The method according to claim 7, wherein
the time domain feature extraction includes: at least one of time domain direction convolution and time domain direction pooling;
the frequency domain feature extraction includes: at least one of frequency domain direction convolution and frequency domain direction pooling.
9. The method according to claim 1 or 2, characterized in that the method further comprises:
clustering the audio in the audio library according to the audio attribute characteristics to obtain audio class clusters, wherein the audio attribute characteristics comprise at least two attribute characteristics with different dimensions, and the feature similarity of the audio in the different audio class clusters is lower than that of the audio in the same audio class cluster;
generating a candidate audio pair according to the audio in the audio class cluster, wherein the candidate audio pair comprises two sections of audio, and the two sections of audio contained in the candidate audio pair belong to the same audio class cluster or different audio class clusters;
determining an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play record of the audio in the audio library, wherein the audio in the audio positive sample pair belongs to the same audio class cluster, and the audio in the audio negative sample pair belongs to different audio class clusters;
And performing machine learning training on the classification layer according to the audio positive sample pair and the audio negative sample pair.
10. An audio matching device, the device comprising:
the system comprises an acquisition module, a first audio processing module and a second audio processing module, wherein the acquisition module is used for acquiring a first multi-scale vector sequence of a first audio and a second multi-scale vector sequence of a second audio, the first multi-scale vector sequence comprises K first characteristic vectors with different scales, the second multi-scale vector sequence comprises K second characteristic vectors with different scales, and K is an integer larger than 1;
the matching module is used for multiplying the first characteristic vector and the second characteristic vector of the same scale to obtain a first vector;
the matching module is also used for subtracting the first characteristic vector and the second characteristic vector of the same scale to obtain a second vector;
the matching module is also used for subtracting the second characteristic vector and the first characteristic vector of the same scale to obtain a third vector;
the matching module is further used for splicing the first vector, the second vector and the third vector under the ith scale to obtain a matching feature vector under the ith scale, wherein i is an integer not greater than K;
The splicing module is used for splicing the matching feature vectors under K different scales to obtain a prediction vector;
and the classification module is used for calling a classification layer to predict the prediction vector and outputting the similarity probability of the first audio and the second audio.
11. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the audio matching method of any of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one program loaded and executed by a processor to implement the audio matching method of any one of claims 1 to 9.
CN202010201517.2A 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium Active CN111309965B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010201517.2A CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010201517.2A CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111309965A CN111309965A (en) 2020-06-19
CN111309965B true CN111309965B (en) 2024-02-13

Family

ID=71160651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010201517.2A Active CN111309965B (en) 2020-03-20 2020-03-20 Audio matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111309965B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883131B (en) * 2020-08-20 2023-10-27 腾讯科技(深圳)有限公司 Voice data processing method and device
CN112035700B (en) * 2020-08-31 2022-09-13 兰州理工大学 Voice deep hash learning method and system based on CNN
CN113051425B (en) * 2021-03-19 2024-01-05 腾讯音乐娱乐科技(深圳)有限公司 Method for acquiring audio characterization extraction model and method for recommending audio
CN113763931B (en) * 2021-05-07 2023-06-16 腾讯科技(深圳)有限公司 Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN114486254A (en) * 2022-02-09 2022-05-13 青岛迈金智能科技股份有限公司 Bicycle bearing detection method based on time/frequency double-domain analysis
CN114898241A (en) * 2022-02-21 2022-08-12 上海科技大学 Video repetitive motion counting system based on computer vision
CN115273892A (en) * 2022-07-27 2022-11-01 腾讯科技(深圳)有限公司 Audio processing method, device, equipment, storage medium and computer program product
CN115083435B (en) * 2022-07-28 2022-11-04 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294331B (en) * 2015-05-11 2020-01-21 阿里巴巴集团控股有限公司 Audio information retrieval method and device
KR102462076B1 (en) * 2018-01-08 2022-11-03 한국전자통신연구원 Apparatus and method for searching music

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528766A (en) * 2016-11-04 2017-03-22 北京云知声信息技术有限公司 Similar song recommendation method and device
CN106407960A (en) * 2016-11-09 2017-02-15 浙江师范大学 Multi-feature-based classification method and system for music genres
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110399522A (en) * 2019-07-03 2019-11-01 中国传媒大学 A kind of music singing search method and device based on LSTM and layering and matching
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Huihui Han et al.; "Music Recommendation Based on Feature Similarity"; 2018 IEEE International Conference of Safety Produce Informatization (IICSPI); pp. 650-654 *
He Qianhua; Zhang Xueyuan; Yang Jichen; Lin Pei; "Audio feature extraction method based on a perceptual subspace decomposition model"; Journal of Huazhong University of Science and Technology (Natural Science Edition), No. 03; pp. 83-88 *

Also Published As

Publication number Publication date
CN111309965A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
CN111444967B (en) Training method, generating method, device, equipment and medium for generating countermeasure network
CN111400543B (en) Audio fragment matching method, device, equipment and storage medium
Turnbull et al. Towards musical query-by-semantic-description using the cal500 data set
US8112418B2 (en) Generating audio annotations for search and retrieval
WO2017030661A1 (en) Media feature determination for internet-based media streaming
KR20080030922A (en) Information processing apparatus, method, program and recording medium
CN111428074B (en) Audio sample generation method, device, computer equipment and storage medium
WO2016102737A1 (en) Tagging audio data
CN111445922B (en) Audio matching method, device, computer equipment and storage medium
WO2016102738A1 (en) Similarity determination and selection of music
CN111309966B (en) Audio matching method, device, equipment and storage medium
CN111445921B (en) Audio feature extraction method and device, computer equipment and storage medium
US20180173400A1 (en) Media Content Selection
CN111444379B (en) Audio feature vector generation method and audio fragment representation model training method
US20220238087A1 (en) Methods and systems for determining compact semantic representations of digital audio signals
Yan Audience evaluation and analysis of symphony performance effects based on the genetic neural network algorithm for the multilayer perceptron (ga-mlp-nn)
Alexandridis et al. Music genre classification using radial basis function networks and particle swarm optimization
Dhall et al. Music genre classification with convolutional neural networks and comparison with f, q, and mel spectrogram-based images
Falola et al. Music genre classification using 1D convolution neural network
Özseven et al. A Content Analysis of the Research Approaches in Music Genre Recognition
Leleuly et al. Analysis of feature correlation for music genre classification
Mirza et al. Residual LSTM neural network for time dependent consecutive pitch string recognition from spectrograms: a study on Turkish classical music makams
O’Brien Musical Structure Segmentation with Convolutional Neural Networks
Chen et al. Hierarchical representation based on Bayesian nonparametric tree-structured mixture model for playing technique classification

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40023627

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant