CN111445922A - Audio matching method and device, computer equipment and storage medium - Google Patents
- Publication number: CN111445922A (Application number CN202010202378.5A)
- Authority: CN (China)
- Prior art keywords: audio, sequence, frequency, vector, correlation
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
- G10L25/51—Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G10L25/54—Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/121—Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
- G10H2240/131—Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
- G10H2240/141—Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an audio matching method and apparatus, a computer device, and a storage medium, and relates to the technical field of audio. The method includes: acquiring a first feature sequence of a first audio and a second feature sequence of a second audio; calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence and output a cross-correlation vector sequence; calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector; and calling a classification layer to perform prediction processing on the prediction vector and output the similarity probability of the first audio and the second audio. Because the similarity of the two audios is computed with a neural-network-based matching approach, the similarity between different songs can be calculated, so that a higher-precision similarity result is obtained between different songs.
Description
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to an audio matching method and apparatus, a computer device, and a storage medium.
Background
Audio matching is a technique for measuring the similarity between two pieces of audio. By matching type, audio matching includes audio segment matching and full audio matching. Audio segment matching means that, given an audio segment P, it is judged whether the segment P belongs to a part of an audio D. Full audio matching means that, given an audio A, the similarity between audio A and an audio B is calculated.
The related art provides audio fingerprinting, which selects the more salient time-frequency points in an audio signal, encodes them into a digital sequence by hash coding, and uses the digital sequence as the audio fingerprint. Audio fingerprinting thus converts the audio matching problem into a retrieval problem between different digital sequences.
Because audio segment matching mainly aims at matching an audio segment against the full audio of the same song, signal-processing-based audio fingerprinting works well in the audio segment matching scenario. In a full-audio matching scenario, however, what is more often computed is the similarity between two different songs; here the applicability of audio fingerprinting is limited and a good matching effect cannot be obtained.
Disclosure of Invention
The embodiment of the application provides an audio matching method, an audio matching device, computer equipment and a storage medium, and provides a matching scheme suitable for a full audio matching scene. The technical scheme is as follows:
according to an aspect of the present application, there is provided an audio matching method, comprising:
acquiring a first characteristic sequence of a first audio and a second characteristic sequence of a second audio;
calling a sequence cross-correlation layer to perform cross-correlation processing on the first characteristic sequence and the second characteristic sequence, and outputting a cross-correlation vector sequence;
calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector;
and calling a classification layer to perform prediction processing on the prediction vector, and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided an audio matching apparatus, comprising:
the acquisition module is used for acquiring a first characteristic sequence of a first audio and a second characteristic sequence of a second audio;
the sequence cross-correlation module is used for performing cross-correlation processing on the first characteristic sequence and the second characteristic sequence and outputting a cross-correlation vector sequence;
the characteristic extraction module is used for carrying out characteristic extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for performing prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
According to another aspect of the present application, there is provided a terminal, including: a processor and a memory storing at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded by the processor and performs the audio matching method as described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement an audio matching method as described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the similarity of the two audios is calculated through an audio full-matching model comprising a sequence cross-correlation layer, a feature extraction layer and a classification layer, and the potential features and deep features of the audios can be mined out through the audio matching model adopting a neural network architecture, so that the similarity between different songs can be calculated, and a similarity calculation result with high precision is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a block diagram of an audio matching system provided by an exemplary embodiment of the present application;
FIG. 2 is a block diagram of an audio matching model provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a time domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of frequency domain feature extraction provided by an exemplary embodiment of the present application;
FIG. 7 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of an audio matching method provided by an exemplary embodiment of the present application;
FIG. 9 is a flow chart of offline matching provided by an exemplary embodiment of the present application;
FIG. 10 illustrates a schematic diagram of a song recommendation scenario provided by an exemplary embodiment of the present application;
FIG. 11 illustrates a schematic diagram of a song scoring scenario provided by an exemplary embodiment of the present application;
FIG. 12 is a flow chart of a model training method provided by an exemplary embodiment of the present application;
fig. 13 is a block diagram illustrating an exemplary embodiment of an audio matching apparatus according to the present application;
fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance.
FIG. 1 shows a block diagram of a computer system provided in an exemplary embodiment of the present application. The computer system 100 includes: a terminal 120 and a server 140.
The terminal 120 runs a platform supporting audio playback, and the platform may be any one of an audio playing program or applet (a program that runs depending on a host program), an audio playing web page, a video playing program or applet, and a video playing web page.
The terminal 120 is connected to the server 140 through a wireless network or a wired network.
The server 140 includes at least one of a single server, a plurality of servers, a cloud computing platform and a virtualization center. Illustratively, the server includes a processor 144 and a memory 142, the memory 142 in turn including a sequence cross-correlation layer 1421, a feature extraction layer 1422, and a classification layer 1423. In some embodiments, the server 140 obtains the audio signal of the audio to be matched from the terminal 120 or from the memory 142.
The terminal 120 generally refers to one of a plurality of terminals; for example, there may be only one terminal, or tens or hundreds of terminals, or more. In this embodiment, the terminal 120 is taken as an example only. The types of the terminal include at least one of a smartphone, a tablet, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer. The number and the types of the terminals are not limited in the embodiments of the present application.
Fig. 2 shows a block diagram of an audio matching model 200 provided in an exemplary embodiment of the present application. The audio matching model 200 includes: sequence cross-correlation layer 220, feature extraction layer 240, and classification layer 260.
The output of the sequence cross-correlation layer 220 is connected to the input of the feature extraction layer 240, and the output of the feature extraction layer 240 is connected to the input of the classification layer 260.
the sequence cross-correlation layer 220 is configured to perform cross-correlation processing on the first feature sequence of the first audio and the second feature sequence of the second audio, and output a cross-correlation vector sequence.
The feature extraction layer 240 is configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector. Illustratively, the feature extraction layer 240 includes: time-domain convolutional layers 242 and frequency-domain convolutional layers 244, the time-domain convolutional layers 242 being used to perform time-domain convolution operations, and the frequency-domain convolutional layers 244 being used to perform frequency-domain convolution operations. Optionally, the feature extraction layer 240 further includes: a time domain pooling layer 246 and a frequency domain pooling layer 248, the time domain pooling layer 246 being used to perform time domain pooling operations and the frequency domain pooling layer 248 being used to perform frequency domain pooling operations. In one possible design, time domain convolutional layer 242 and time domain pooling layer 246 are not provided, but frequency domain convolutional layer 244 and frequency domain pooling layer 248 are provided. In one possible design, time domain convolutional layer 242, time domain pooling layer 246, frequency domain convolutional layer 244, and frequency domain pooling layer 248 are provided simultaneously.
The classification layer 260 is configured to perform prediction processing on the prediction vector and output a similarity probability between the first audio and the second audio.
Fig. 3 shows a flowchart of an audio matching method provided by an exemplary embodiment of the present application. This embodiment is illustrated by applying the method to the server shown in fig. 1. The method comprises the following steps:
The feature sequence of an audio includes N frequency domain vectors arranged in time order. Each frequency domain vector is M-dimensional; each dimension represents the magnitude of the audio at a frequency F_M, and the frequency difference between adjacent dimensions is the same. N and M are integers greater than 1. Optionally, the feature sequence is obtained as follows:
The audio is sampled in the time dimension at a preset sampling interval (e.g., every 0.1 second) to obtain a discrete time sequence T1~Tn, where each value T represents the magnitude of the audio at that sampling point.
The time sequence is grouped by a fixed time period (e.g., every 3 seconds) to obtain a plurality of time series groups G1~GN. Each time series group Gi contains a plurality of sampling points (for example, 3 seconds / 0.1 second = 30 sampling points), where i is an integer not greater than N.
The sampling points belonging to the same time series group Gi are transformed into one frequency domain vector, so as to obtain N frequency domain vectors arranged in time order. That is, each time series group is transformed from the time domain to the frequency domain to obtain the frequency domain sequence corresponding to that group Gi. The time-frequency transformation method includes, but is not limited to, FFT (Fast Fourier Transform), DFT (Discrete Fourier Transform) and MFCC (Mel-scale Frequency Cepstral Coefficients). Each frequency domain sequence represents the distribution of the different frequencies contained in the same time series group Gi. The N frequency domain sequences are then sampled at different sampling frequencies to obtain the N frequency domain vectors, where the different sampling frequencies are obtained by equally dividing the range between the upper and lower frequency limits of the audio into a plurality of frequency points.
The N frequency domain vectors arranged in time order form a two-dimensional M x N matrix. The axis of the matrix corresponding to N represents the time domain direction, and the axis corresponding to M represents the frequency domain direction. M is the quotient of the frequency range (between the upper and lower frequency limits) and the frequency sampling interval.
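A minimal NumPy sketch of this feature sequence extraction might look as follows; the FFT-based transform, sampling rate, group length and frequency grid here are illustrative assumptions rather than values fixed by this application.

```python
import numpy as np

def feature_sequence(samples, sample_rate=8000, group_seconds=3.0,
                     f_low=110.0, f_high=3520.0, num_bins=32):
    """samples: 1-D array of audio sample values at `sample_rate` Hz.
    Returns an (N, M) array: N time-ordered frequency domain vectors of M = num_bins dimensions."""
    per_group = int(group_seconds * sample_rate)            # samples per time series group G_i
    num_groups = len(samples) // per_group                  # N
    target_freqs = np.linspace(f_low, f_high, num_bins)     # M equally spaced frequency points
    vectors = []
    for i in range(num_groups):
        group = samples[i * per_group:(i + 1) * per_group]  # time series group G_i
        spectrum = np.abs(np.fft.rfft(group))               # time domain -> frequency domain
        freqs = np.fft.rfftfreq(per_group, d=1.0 / sample_rate)
        # Sample the spectrum at the M target frequencies to form one frequency domain vector.
        vectors.append(np.interp(target_freqs, freqs, spectrum))
    return np.stack(vectors)                                # shape (N, M)
```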
The server obtains a first feature sequence of the first audio and a second feature sequence of the second audio. The first signature sequence comprises n first frequency-domain vectors and the second signature sequence comprises q second frequency-domain vectors. The ordering order and physical meaning of the first frequency domain vector and the second frequency domain vector are the same, for example, the first frequency domain vector and the second frequency domain vector are arranged according to the time domain order, and the first frequency domain vector and the second frequency domain vector are m-dimensional vectors.
The cross-correlation processing is a processing operation for measuring the correlation between the first feature sequence and the second feature sequence.
Illustratively, the server calculates a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
the feature extraction processing includes: at least one of a convolution operation and a pooling operation. Wherein the convolution operation may be a multi-scale convolution operation.
Divided by dimension, the feature extraction processing includes: at least one of time domain feature extraction processing and frequency domain feature extraction processing. The time domain feature extraction processing includes at least one of a time-domain convolution operation and a time-domain pooling operation. The frequency domain feature extraction processing includes at least one of a frequency-domain convolution operation and a frequency-domain pooling operation.
In one possible design, a feature extraction layer is invoked to perform time domain feature extraction processing and frequency domain feature extraction processing on the cross-correlation vector sequence. In another possible design, a feature extraction layer is called to perform frequency domain feature extraction processing on the cross-correlation vector sequence.
In step 308, a classification layer is called to perform prediction processing on the prediction vector, and the similarity probability of the first audio and the second audio is output.
Optionally, the classification layer is a softmax function, the input is a prediction vector for the first audio and the second audio, and the output is a probability of similarity for the first audio and the second audio. The server performs at least one of audio recommendation, audio scoring, audio classification, and audio matching according to the similarity probability of the two audios.
In the personalized recommendation scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with higher similarity to the first audio, and recommending the second audio to the client.
In the audio scoring scene, the server is configured to obtain a first feature vector of a first audio provided by the client and a second feature vector of a second audio in the audio library, calculate the similarity between the first audio and the second audio using the audio matching model, and feed back a score based on the similarity to the client.
In the audio matching scene, the server is used for obtaining a first feature vector of a first audio provided by the client, then obtaining a second feature vector of a second audio in the audio library, using the audio matching model to find out the second audio with extremely high similarity to the first audio, and recommending audio information (information such as song title, singer, style, year, record company and the like) of the second audio to the client.
In the audio classification scene, the server is used for calculating the similarity between every two songs in the audio library, and classifying the songs with the similarity higher than a threshold value into the same class cluster so as to divide the songs into the same class.
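As an illustration of the audio classification scene, the following sketch groups songs whose pairwise similarity exceeds a threshold into the same class cluster; the similarity function stands in for the audio matching model's output, and the threshold value is an assumption.

```python
def cluster_by_similarity(songs, similarity, threshold=0.8):
    """songs: iterable of audio identifiers; similarity(a, b): assumed callable returning
    the similarity probability predicted by the audio matching model."""
    clusters = []                                   # each cluster is a list of songs
    for song in songs:
        placed = False
        for cluster in clusters:
            # Join the first cluster that already contains a sufficiently similar song.
            if any(similarity(song, member) > threshold for member in cluster):
                cluster.append(song)
                placed = True
                break
        if not placed:
            clusters.append([song])                 # start a new class cluster
    return clusters
```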
In summary, in the audio matching method provided in this embodiment, the similarity between two audios is calculated through the audio matching model including the sequence cross-correlation layer, the feature extraction layer, and the classification layer, and as the potential features and deep features of the audios can be found out by using the audio matching model of the neural network architecture, the similarity between different songs can be calculated, so as to obtain a similarity calculation result with higher accuracy.
Fig. 4 shows a flowchart of an audio matching method provided by another exemplary embodiment of the present application. This embodiment is illustrated by applying the method to the server shown in fig. 1. The method comprises the following steps:
the first sequence of features of the first audio comprises: n first frequency domain vectors arranged in time sequence. Each first frequency domain vector is M-dimensional, and each dimension represents the frequency F of the audioMThe frequency difference between adjacent dimensions is the same. Wherein N and M are integers greater than 1.
The second feature sequence of the second audio comprises: q second frequency domain vectors arranged in time sequence. Each second frequency domain vector is M-dimensional, each dimension representing the frequency of the audio at a frequency FMThe frequency difference between adjacent dimensions is the same. Wherein Q and M are integers greater than 1.
Illustratively, as shown in FIG. 5, the audio signal is first sampled in the time dimension, for example once every 0.1 s, to obtain a discrete time sequence T1~Tn, where each value represents the magnitude of the audio at that sampling point. The values are then grouped by a fixed time period (e.g., 3 s); for example, with a 3 s period and a 0.1 s sampling interval, each group contains 30 values, so that T1~T30 form one group, called G1, T31~T60 form G2, and so on. A frequency domain transform (including but not limited to FFT, MFCC, DFT, etc.) is then performed on each group of the time sequence to obtain a frequency domain signal, which represents the distribution of the different frequencies contained in that group; this frequency signal is also sampled, for example every 10 Hz, to obtain a discrete frequency sequence. Assuming the upper and lower frequency limits are 0~f, each frequency sequence has f/10 values, and each Gi can be represented as such a frequency sequence; the only difference is that the same frequency takes different values in different Gi. For music, some parts are heavy in bass, so the low-frequency values of those Gi are large, while other parts are high-pitched, so the high-frequency values of those Gi are large. Therefore Gi can be represented either as a time sequence T1~T30 or as a frequency sequence, and together these frequency sequences form a spectrogram. The spectrogram illustrated in FIG. 5 is obtained from the decomposition of real audio: the horizontal axis represents time, with a time slice of about 1.75 s, that is, a slice is cut every 1.75 s; the vertical axis represents the frequencies corresponding to each time slice, with upper and lower frequency limits of 110 Hz~3520 Hz, and the gray level represents the value at the different frequencies.
The above processing is performed on both the audio signal of the first audio and the audio signal of the second audio, so that the first feature sequence of the first audio and the second feature sequence of the second audio are obtained.
Suppose the first feature sequence of the first audio includes n first frequency domain vectors {G1, G2, ..., Gn} arranged in time order, where each Gi is a frequency domain vector. To measure the correlation between the ith first frequency domain vector Gi and the q second frequency domain vectors H1~Hq, the following correlation calculation formula is introduced for the ith first frequency domain vector:
score(Gi) = (H1*Gi + H2*Gi + ... + Hq*Gi) / (H1^2 + H2^2 + ... + Hq^2);
That is, the server calculates the sum of the products of the ith first frequency domain vector Gi with the q second frequency domain vectors H1~Hq, and the sum of squares of the q second frequency domain vectors H1~Hq, and determines the quotient of the product sum and the square sum as the ith correlation score of the ith first frequency domain vector relative to the q second frequency domain vectors H1~Hq.
In this way, a score(Gi) is calculated for each Gi, so that after correlation fusion the first cross-correlation vector sequence includes {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Each score(Gi) can be regarded as a correlation weight of the original Gi, so the output is the first cross-correlation vector sequence after applying these weights, denoted as {G'1, ..., G'n}.
Suppose the second feature sequence of the second audio includes q second frequency domain vectors {H1, H2, ..., Hq} arranged in time order, where each Hj is a frequency domain vector. To measure the correlation between the jth second frequency domain vector Hj and the n first frequency domain vectors G1~Gn, the following correlation calculation formula is introduced for the jth second frequency domain vector:
score(Hj) = (G1*Hj + G2*Hj + ... + Gn*Hj) / (G1^2 + G2^2 + ... + Gn^2);
That is, the server calculates the sum of the products of the jth second frequency domain vector Hj with the n first frequency domain vectors G1~Gn, and the sum of squares of the n first frequency domain vectors G1~Gn, and determines the quotient of the product sum and the square sum as the jth correlation score of the jth second frequency domain vector Hj relative to the n first frequency domain vectors G1~Gn.
In this way, a score(Hj) is calculated for each Hj, so that after correlation fusion the second cross-correlation vector sequence includes {H1*score(H1), ..., Hj*score(Hj), ..., Hq*score(Hq)}. Each score(Hj) can be regarded as a correlation weight of the original Hj, so the output is the second cross-correlation vector sequence after applying these weights, denoted as {H'1, ..., H'q}.
Illustratively, the first cross-correlation vector sequence {G'1, ..., G'n} and the second cross-correlation vector sequence {H'1, ..., H'q} are spliced to obtain the cross-correlation vector sequence {G'1, ..., G'n, H'1, ..., H'q}. The cross-correlation vector sequence includes n + q cross-correlation vectors, that is, a vector sequence formed by splicing the n first cross-correlation vectors and the q second cross-correlation vectors.
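A NumPy sketch of the sequence cross-correlation layer, following the two formulas above, could look like this. It assumes that "*" in the formulas denotes the dot product between frequency domain vectors; G and H hold the two feature sequences row by row.

```python
import numpy as np

def sequence_cross_correlation(G, H):
    """G: (n, m) first feature sequence; H: (q, m) second feature sequence.
    Returns the (n + q, m) cross-correlation vector sequence {G'1..G'n, H'1..H'q}."""
    # score(Gi) = (H1*Gi + ... + Hq*Gi) / (H1^2 + ... + Hq^2), one scalar per Gi
    score_G = (H @ G.T).sum(axis=0) / np.square(H).sum()
    # score(Hj) = (G1*Hj + ... + Gn*Hj) / (G1^2 + ... + Gn^2), one scalar per Hj
    score_H = (G @ H.T).sum(axis=0) / np.square(G).sum()
    G_weighted = G * score_G[:, None]              # {G'1, ..., G'n}
    H_weighted = H * score_H[:, None]              # {H'1, ..., H'q}
    return np.concatenate([G_weighted, H_weighted], axis=0)
```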
The frequency domain direction means performing the frequency domain convolution processing on the cross-correlation vector sequence along the direction of increasing (or decreasing) sampling frequency to obtain frequency domain convolution vectors.
Optionally, the cross-correlation vector sequence can be viewed as a matrix of M rows by (N + Q) columns, each row being an (N + Q)-dimensional time domain vector. Assume that the size of the frequency domain convolution kernel is P x (N + Q), where P is smaller than M. The frequency domain direction means convolving P adjacent time domain vectors along the 0~M direction.
As shown in FIG. 5, assume that the size of the frequency domain convolution kernel is 3 x (N + Q). In the first convolution along the frequency domain direction, the time domain vectors f1, f2 and f3 are convolved to obtain f'1; in the second convolution, the time domain vectors f2, f3 and f4 are convolved to obtain f'2; in the third convolution, the time domain vectors f3, f4 and f5 are convolved to obtain f'3; and so on, until finally M - 3 + 1 frequency domain convolution vectors f'i are obtained.
Each f'i is a new time domain vector obtained by compressing P time domain vectors after convolution, and is used to represent the correlation between the P time domain vectors before convolution.
Optionally, the server directly outputs the sequence of M - 3 + 1 frequency domain convolution vectors as the prediction vector.
Optionally, the server performs pooling processing on the frequency domain convolution vector sequence along the frequency domain direction, and determines one frequency domain pooling vector obtained by pooling as the prediction vector.
As shown in FIG. 6, the frequency domain pooling operation is also performed along the frequency domain direction, and the pooling dimension coincides with the vector dimension. After the frequency domain pooling operation, the above M - P + 1 frequency domain convolution vectors f'1, f'2, ..., f'(M-P+1) are compressed into one pooled frequency domain convolution vector f''. That is, the pooling yields a single vector, so the physical meaning of the pooled frequency domain convolution vector f'' is preserved: it can still be regarded as a new vector compressed from the frequency domain dimension. The frequency domain pooling vector f'' is used to represent the condensed nature of the plurality of frequency domain convolution vectors.
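A sketch of the frequency domain convolution and pooling described above, assuming the cross-correlation sequence is arranged as an M x (N+Q) matrix whose rows run along the frequency direction; treating the kernel as per-row weights summed over P adjacent rows, and using max pooling, are interpretations made for illustration.

```python
import numpy as np

def freq_conv(corr_matrix, kernel):
    """corr_matrix: (M, N+Q) cross-correlation matrix, rows along the frequency direction;
    kernel: (P, N+Q) frequency domain convolution kernel. Returns (M - P + 1, N+Q)."""
    M, _ = corr_matrix.shape
    P = kernel.shape[0]
    outputs = []
    for i in range(M - P + 1):
        window = corr_matrix[i:i + P, :]               # P adjacent time domain vectors
        outputs.append((window * kernel).sum(axis=0))  # compress them into one vector f'_i
    return np.stack(outputs)

def freq_pool(conv_vectors):
    """Pool the frequency domain convolution vectors along the frequency direction into
    a single pooled vector f''."""
    return conv_vectors.max(axis=0)
```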
In step 308, a classification layer is called to perform prediction processing on the prediction vector, and the similarity probability of the first audio and the second audio is output.
Optionally, the classification layer is a softmax function, the input is a prediction vector for the first audio and the second audio, and the output is a probability of similarity for the first audio and the second audio.
In the above embodiment, only frequency domain feature extraction is performed in the feature extraction process; in other embodiments, time domain feature extraction may also be performed, which is not limited herein. Time domain feature extraction differs from frequency domain feature extraction only in the extraction direction; the extraction method is the same.
In an alternative embodiment based on FIG. 4, there are K frequency domain convolution kernels, where K is an integer greater than 1. Step 3061 is alternatively implemented as step 306a, and step 3062 is alternatively implemented as step 306b, as shown in FIG. 7 below:
In step 306a, K different frequency domain convolution kernels are respectively called to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction, so as to obtain K frequency domain convolution vector sequences of different scales. The number of frequency domain convolution vectors at each scale can be more than one, for example M - P + 1.
And step 306b, pooling the frequency domain convolution vector sequences of K different scales along the frequency domain direction, and determining K frequency domain pooling vectors obtained by pooling as prediction vectors.
Optionally, pooling is performed on the frequency domain convolution vector sequence under each scale, so as to obtain a pooled frequency domain pooling vector respectively. Pooling is performed on the frequency domain convolution vector sequences under the K different scales, and finally K frequency domain pooling vectors are obtained.
The K frequency domain pooling vectors are spliced in order of scale, from small to large or from large to small, to obtain the prediction vector {f'1, f'2, ..., f'k} or {f'k, f'(k-1), ..., f'1}.
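Building on the previous sketch (freq_conv and freq_pool), the multi-scale variant with K kernels might be composed as follows; the kernel heights in the usage example are illustrative assumptions.

```python
import numpy as np

def multi_scale_prediction_vector(corr_matrix, kernels):
    """kernels: list of K frequency domain convolution kernels, the k-th of shape (P_k, N+Q).
    Returns the prediction vector obtained by splicing the K pooled vectors in kernel order."""
    pooled = []
    for kernel in kernels:                               # one scale per kernel
        conv_vectors = freq_conv(corr_matrix, kernel)    # (M - P_k + 1, N+Q) at this scale
        pooled.append(freq_pool(conv_vectors))           # one frequency domain pooling vector
    return np.concatenate(pooled)

# Example usage with three scales (P = 2, 3, 4):
# kernels = [np.ones((p, corr_matrix.shape[1])) / p for p in (2, 3, 4)]
# prediction_vector = multi_scale_prediction_vector(corr_matrix, kernels)
```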
In summary, because the multi-scale vector sequence represents the potential features and deep features of the audio by using the frequency domain vectors under multiple scales, the similarity between two audios is calculated by using the multi-scale vector sequence of the two audios as input and using a matching method based on a neural network, so that the similarity between different songs can be calculated, and thus a similarity calculation result with higher precision is obtained.
It should be noted that, in an alternative embodiment, the above-mentioned "convolution + pooling + multiscale" may be implemented in combination, as in the embodiment shown in fig. 8:
In general, the spectrogram of the first audio and the spectrogram of the second audio are first cross-correlated in the sequence cross-correlation layer 220, and the resulting cross-correlation vector sequence is output to the multi-scale frequency domain convolution layer 242 for multi-scale frequency domain convolution, so as to obtain a multi-scale frequency domain representation. The multi-scale frequency domain representation is then input to the multi-scale pooling layer 244 for multi-scale pooling, and finally output to the classification layer; the output similarity probability represents whether the two pieces of audio are similar. The operational details of each module are set forth below.
Sequence cross-correlation layer 220:
The present application represents spectrogram A as {G1, G2, ..., Gn}, where each Gi is a frequency distribution and can be regarded as a vector, and represents spectrogram B as {H1, ..., Hq}, where Hj has the same dimension and the same physical meaning as Gi, and each value of the vector represents the magnitude of a frequency component. To measure the cross-correlation of the two pieces of audio from the time perspective, the following correlation calculation formula is introduced in this application:
score(Gi)=(H1*Gi+H2*Gi...+Hq*Gi)/(H1^2+H2^2+...+Hq^2)
In this way, the present application obtains a score(Gi) for each Gi, so that after correlation fusion, the output of the entire time series correlation module is {G1*score(G1), ..., Gi*score(Gi), ..., Gn*score(Gn)}. Each score(Gi) can be regarded as a correlation weight of the original Gi, so the output is the spectrum sequence after applying the weights, denoted as {G'1, ..., G'n}.
Similarly, a score(Hj) is obtained for each Hj, as follows:
score(Hj)=(G1*Hj+G2*Hj...+Gn*Hj)/(G1^2+G2^2+...+Gn^2)
By the same method, the present application obtains {H'1, ..., H'q}. The two sequences are then spliced together, i.e. {G'1, ..., G'n, H'1, ..., H'q}, and input to the multi-scale frequency domain convolution layer 242.
Multi-scale frequency domain convolutional layer 242
The cross-correlation vector sequence is operated on from the frequency domain through convolution kernels of multiple scales, so as to fully extract the frequency domain features of the audio.
Since the sequence cross-correlation layer 220 has already performed cross-correlation processing in time, there is no need to perform a time domain convolution operation; only the frequency domain convolution operation is needed. Moreover, the "listening feel" of human ears for music is affected by frequency.
Assume that the multi-scale frequency domain convolution layer 242 has frequency domain convolution representations at three scales, f1, f2 and f3. The present application pools these two-dimensional frequency domain convolution representations separately. As shown in FIG. 6, f'1 to f'4 are the results of frequency domain convolution performed by the frequency domain convolution kernel at the same scale; that is, a certain fi is composed of the 4 frequency domain convolution vectors f'1 to f'4, and the pooling operation at that scale is then performed.
The resulting f'' can be viewed as a "compression" of the original 4 frequency domain convolution vectors in the time dimension (since f'1 to f'4 each represent a time sequence, 4 time sequences become 1 time sequence, hence the "compression" in the time dimension).
In the present application, a pooling operation is performed on the frequency domain convolution representation fi at each scale to obtain a frequency domain pooling vector f''i, and then all (e.g., three) frequency domain pooling vectors f''i are spliced together into one large vector or vector sequence, which is input to the classification layer 260.
Classification layer 260
The classification layer 260 may be a softmax function, and the output Y is the similarity probability of two pieces of audio, representing the degree of matching of the two pieces of audio.
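A minimal sketch of such a classification layer is a linear map followed by a softmax over two classes; the weight matrix and bias below stand in for trained parameters and are assumptions for illustration.

```python
import numpy as np

def classify(prediction_vector, weights, bias):
    """weights: (len(prediction_vector), 2) trained weights (assumed); bias: (2,).
    Returns the probability Y that the two pieces of audio are similar."""
    logits = prediction_vector @ weights + bias
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    probs = exp / exp.sum()
    return probs[1]                          # index 1 taken as the "similar" class
```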
As shown in FIG. 9, when the amount of music in the music library is at the million level or above, it is suitable to use the audio matching model in an offline matching scenario to predict the similarity probability between two pieces of full audio; when the amount of music in the music library is between the tens and the thousands, an audio matching model suited to an online matching scenario predicts the similarity probability between two pieces of full audio; and when the amount of music in the music library is between the thousands and the millions, an audio matching model in a near-line matching scenario is suitable for predicting the similarity probability between two pieces of full audio. The audio matching model provided by the embodiments of this application (comprising the sequence cross-correlation layer, the feature extraction layer and the classification layer) is better suited to the offline matching scenario.
In one illustrative example, the feature vectors of the audio are used for training and prediction of an audio matching model. The audio matching model is the audio matching model in the above embodiment, and after the feature vector of the audio provided by the embodiment of the application is adopted for training, the audio matching model can be used for predicting the similarity between two audios.
Audio recommendation scenario:
Referring to the example shown in FIG. 10, when the terminal 180 used by the user runs an audio playing application and the user plays, favorites, or likes a first audio (song A) in the application, the server 160 may compare the first multi-scale vector sequence of the first audio (song A) with the second multi-scale vector sequences of a plurality of second audios (such as song B) to determine the similarity probability between the first audio and each second audio. In descending order of similarity probability, the songs B, C, D and E that are similar to song A are taken as recommended songs and sent to the audio playing application on the terminal 180, so that the user can hear more songs that match the user's preference.
Singing scoring scene:
Referring to the example shown in FIG. 11, the terminal 180 used by the user runs a singing application, and the user sings a song. The server 160 may compare a first multi-scale vector sequence of the first audio (the song sung by the user) with a second multi-scale vector sequence of a second audio (the original song, a star's version, or the top-scoring version) to determine the similarity probability between the first audio and the second audio. A singing score is given to the user according to the similarity probability and fed back to the singing application for display, which helps the user improve his or her singing.
FIG. 12 shows a flowchart of a model training method provided by an exemplary embodiment of the present application. The model training method can be used for training the audio matching model in the above embodiments. The method comprises the following steps:
The audio library stores a large amount of audio; the audio may include songs, pure music, symphonies, piano pieces, or other musical works, and the embodiment of the present application does not limit the types of audio in the audio library. Optionally, the audio library is the music library of an audio playing application.
Optionally, the audio has respective audio attribute features, the audio attribute features may be attribute features of the audio itself, or attribute features given by human, and the same piece of audio may include attribute features of a plurality of different dimensions.
In one possible embodiment, the audio attribute features of the audio include at least one of: text features, audio features, emotion features, and scene features. Alternatively, the text features may include text features of the audio itself (such as lyrics, composer, lyricist, genre, etc.) or artificially assigned text features (such as comments); the audio features are used to characterize the audio properties of the audio, such as melody, rhythm and duration; the emotion features are used to characterize the emotion expressed by the audio; and the scene features are used to characterize the playing scenes in which the audio is used. Of course, besides the above audio attribute features, the audio may also include attribute features of other dimensions, which is not limited in this embodiment.
In the embodiment of the application, the process of clustering the audio based on the audio attribute features may be referred to as primary screening, and is used for primarily screening out the audio with similar audio attribute features. In order to improve the quality of primary screening, the computer equipment clusters according to the attribute characteristics of at least two different dimensions, and clustering deviation caused by clustering based on the attribute characteristics of a single dimension is avoided.
After clustering, the computer device obtains a plurality of audio clusters, and the audio in the same audio cluster has similar audio attribute characteristics (compared with the audio in other audio clusters). The number of audio clusters can be preset in a clustering stage (based on an empirical value), so that clustering is prevented from being too generalized or too detailed.
Because the audios in the same audio class cluster have similar audio attribute features, while the audios in different audio class clusters differ greatly in audio attribute features, the server may preliminarily generate audio samples based on the audio class clusters, where each audio sample is a candidate audio pair composed of two pieces of audio.
Since the audio library contains a large amount of audio, the number of candidate audio pairs generated based on the audio class cluster is also huge, for example, for the audio library containing y pieces of audio, the number of generated candidate audio pairs is C (y, 2). However, while a large number of candidate audio pairs can be generated based on audio class clusters, not all of the candidate audio pairs can be used for subsequent model training. For example, when the audio in the candidate audio pair is the same song (e.g., the same song sung by a different singer), or the audio in the candidate audio pair is completely different (e.g., an english ballad, a suona song), it is too simple to train the candidate audio pair as a model training sample, and a high-quality model cannot be obtained.
In order to improve the quality of the audio sample, in the embodiment of the application, the computer device further screens out a high-quality audio pair from the candidate audio pair as the audio sample through fine screening.
In step 403, the server determines an audio positive sample pair and an audio negative sample pair in the candidate audio pairs according to the historical play records of the audios in the audio library, where the audios in the audio positive sample pair belong to the same audio cluster, and the audios in the audio negative sample pair belong to different audio clusters.
Analysis shows that a user's audio playing behavior is closely related to the similarity between audios; for example, a user often plays audios with high similarity in succession, although they are not identical. Therefore, in this embodiment, the computer device finely screens the generated candidate audio pairs based on the historical play records of the audio to obtain audio sample pairs. The audio sample pairs obtained by fine screening include audio positive sample pairs composed of similar audios (screened from candidate audio pairs composed of audios in the same audio class cluster) and audio negative sample pairs composed of dissimilar audios (screened from candidate audio pairs composed of audios in different audio class clusters).
Optionally, the history playing record is an audio playing record under each user account, and may be an audio playing list formed according to the playing sequence. For example, the history play records may be song play records of the respective users collected by the audio play application server.
In some embodiments, the audio positive sample pairs and audio negative sample pairs screened out based on the historical play records are difficult to distinguish from one another (i.e., they are hard samples), which helps improve the quality of the model trained on these audio sample pairs.
In step 404, the server performs machine learning training on the audio matching model according to the audio positive sample pair and the audio negative sample pair.
The sample is an object for training and testing the model, and the object includes label information, where the label information is a reference value (or called true value or supervised value) of the output result of the model, where a sample with label information of 1 is a positive sample, and a sample with label information of 0 is a negative sample. The samples in the embodiment of the present application refer to audio samples used for training a similarity model, and the audio samples are in the form of sample pairs, that is, the audio samples include two pieces of audio. Optionally, when the label information of the audio sample (pair) is 1, it indicates that two pieces of audio in the audio sample pair are similar audio, that is, an audio positive sample pair; when the label information of the audio sample (pair) is 0, it indicates that the two pieces of audio in the audio sample pair are not similar audio, i.e., an audio negative sample pair.
Wherein, the similarity probability of two audios in the same audio positive sample pair can be regarded as 1, or the clustering distance between two audios is quantized to the similarity probability. The similarity probability of two audios in the same audio negative sample pair can be regarded as 0, or the cluster-like distance or the vector distance between two audios is quantized to the similarity probability, for example, the inverse of the cluster-like distance or the inverse of the vector distance is quantized to the similarity probability of two audios in the same audio negative sample pair.
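One way the sample-pair generation described above could be sketched is shown below: candidate pairs come from the audio class clusters and are fine-screened with play histories. The co-play rule used here (audios played within a few positions of each other) is an assumed, illustrative screening criterion, not the exact rule of this application.

```python
from itertools import combinations

def build_sample_pairs(clusters, play_histories, window=5):
    """clusters: list of lists of audio IDs (the audio class clusters);
    play_histories: list of play lists, one per user account.
    Returns (positive_pairs, negative_pairs) labelled 1 and 0 respectively."""
    cluster_of = {a: idx for idx, cluster in enumerate(clusters) for a in cluster}
    # Collect audio pairs that some user played close together (within `window` positions).
    co_played = set()
    for history in play_histories:
        for i, a in enumerate(history):
            for b in history[i + 1:i + 1 + window]:
                if a != b:
                    co_played.add(frozenset((a, b)))
    positives, negatives = [], []
    for a, b in combinations(cluster_of, 2):
        if frozenset((a, b)) not in co_played:
            continue                                   # keep only co-played candidate pairs
        if cluster_of[a] == cluster_of[b]:
            positives.append((a, b, 1))                # label 1: audio positive sample pair
        else:
            negatives.append((a, b, 0))                # label 0: audio negative sample pair
    return positives, negatives
```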
Illustratively, the "audio matching model" in the above embodiment includes: a sequence cross-correlation layer, a feature extraction layer and a classification layer.
In summary, in the embodiments of this application, audios with similar features in the audio library are first clustered according to audio attribute features of different dimensions to obtain audio class clusters; audios belonging to the same or different audio class clusters are then combined to obtain a plurality of candidate audio pairs; and, based on the historical play records of the audios, audio positive sample pairs and audio negative sample pairs are screened out from the candidate audio pairs for subsequent model training. Clustering fuses the multi-dimensional attribute features of the audio, and the positive and negative sample pairs are screened based on users' audio play records, so the generated audio sample pairs can reflect the similarity between audios from multiple angles (both the attributes of the audio and users' listening habits). This realizes automatic generation of audio sample pairs while improving their quality, and further improves the quality of subsequent model training based on these audio samples.
Fig. 13 is a block diagram of an audio matching apparatus according to an exemplary embodiment of the present application. The device includes:
an obtaining module 1320, configured to obtain a first feature sequence of a first audio and a second feature sequence of a second audio;
a sequence cross-correlation module 1340, configured to perform cross-correlation processing on the first feature sequence and the second feature sequence, and output a cross-correlation vector sequence;
a feature extraction module 1360, configured to perform feature extraction processing on the cross-correlation vector sequence, and output a prediction vector;
a classification module 1380, configured to perform prediction processing on the prediction vector and output a similarity probability between the first audio and the second audio.
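Illustratively, the classification module 1380 can be viewed as a lightweight classifier on top of the prediction vector. The following minimal sketch assumes a single linear layer followed by a sigmoid; the embodiments only require that a similarity probability between the first audio and the second audio be output.

```python
import numpy as np

def classify(prediction_vector, weights, bias=0.0):
    """Map the prediction vector to a similarity probability in [0, 1]."""
    # A single linear layer followed by a sigmoid; any classifier that outputs
    # a probability would fulfil the classification module's role.
    logit = float(np.dot(weights, prediction_vector)) + bias
    return 1.0 / (1.0 + np.exp(-logit))
```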
In an exemplary embodiment, the first feature sequence comprises n first frequency-domain vectors, the second feature sequence comprises q second frequency-domain vectors, and n and q are positive integers;
the sequence cross-correlation module 1340 is configured to calculate a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence; and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate an ith correlation score of an ith first frequency-domain vector of the n first frequency-domain vectors relative to the q second frequency-domain vectors, i being a positive integer not greater than n; take the ith correlation score as a correlation weight of the ith first frequency-domain vector, and calculate a weighted sequence of the n first frequency-domain vectors to obtain the first cross-correlation vector sequence; calculate a jth correlation score of a jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors, j being a positive integer not greater than q; and take the jth correlation score as a correlation weight of the jth second frequency-domain vector, and calculate a weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
In an exemplary embodiment, the sequence cross-correlation module 1340 is configured to calculate a product sum of the ith first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors, and determine a quotient of the product sum and the square sum as the ith correlation score of the ith first frequency-domain vector relative to the q second frequency-domain vectors;
the sequence cross-correlation module 1340 is configured to calculate a product sum of the jth second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors, and determine a quotient of the product sum and the square sum as the jth correlation score of the jth second frequency-domain vector relative to the n first frequency-domain vectors.
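Illustratively, the cross-correlation processing of the sequence cross-correlation module 1340 can be sketched in Python as follows. This is a minimal illustration under assumptions: the "product sum" is read as a sum of dot products between vectors, and the splice is a plain concatenation; the embodiments fix only the quotient of the product sum and the square sum.

```python
import numpy as np

def correlation_scores(vecs_a, vecs_b):
    """Correlation score of each vector in vecs_a relative to all vectors in vecs_b.

    Implements the quotient described above: (sum of products of the i-th vector
    of A with every vector of B) / (sum of squares of the vectors of B).
    Both inputs have shape (num_frames, dim).
    """
    product_sum = vecs_a @ vecs_b.sum(axis=0)   # shape (len_a,): sum_j <a_i, b_j>
    square_sum = np.sum(vecs_b * vecs_b)        # scalar: sum_j ||b_j||^2
    return product_sum / square_sum

def cross_correlation_sequence(first_seq, second_seq):
    """Weight each sequence by its correlation scores and splice the results."""
    first_seq = np.asarray(first_seq, dtype=float)
    second_seq = np.asarray(second_seq, dtype=float)
    scores_a = correlation_scores(first_seq, second_seq)      # weights for the first sequence
    scores_b = correlation_scores(second_seq, first_seq)      # weights for the second sequence
    weighted_a = scores_a[:, None] * first_seq                # first cross-correlation vector sequence, (n, d)
    weighted_b = scores_b[:, None] * second_seq               # second cross-correlation vector sequence, (q, d)
    return np.concatenate([weighted_a, weighted_b], axis=0)   # spliced sequence, (n + q, d)
```

In this sketch the correlation score acts as a per-frame weight, so frames of one feature sequence that align well with the other audio contribute more strongly to the spliced cross-correlation vector sequence.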
In one illustrative embodiment, the feature extraction module comprises: a frequency domain convolution kernel;
the feature extraction module 1360 is configured to call the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along a frequency domain direction, so as to obtain a frequency domain convolution vector sequence; and outputting the prediction vector according to the frequency domain convolution vector sequence.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the sequence of frequency-domain convolution vectors along a frequency-domain direction, and determine a frequency-domain pooled vector obtained by pooling as the prediction vector.
In one illustrative embodiment, the frequency domain convolution kernels include K frequency domain convolution kernels of different scales, K being an integer greater than 1;
the feature extraction module 1360 is configured to respectively call the K frequency domain convolution kernels of different scales to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction, so as to obtain K frequency domain convolution vector sequences of different scales.
In an exemplary embodiment, the feature extraction module 1360 is configured to pool the frequency-domain convolution vector sequences of K different scales along the frequency-domain direction, and determine K frequency-domain pooled vectors obtained by pooling as the prediction vector.
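Illustratively, the multi-scale frequency-domain convolution and pooling performed by the feature extraction module 1360 can be sketched as follows. The kernel sizes, the averaging kernels, and mean pooling are assumptions used only to show the data flow; the embodiments specify only K convolution kernels of different scales and a pooling step along the frequency-domain direction.

```python
import numpy as np

def frequency_domain_convolution(cross_corr_seq, kernel):
    """Convolve every cross-correlation vector along the frequency-domain axis."""
    return np.stack([np.convolve(row, kernel, mode="valid") for row in cross_corr_seq])

def prediction_vector(cross_corr_seq, kernel_sizes=(3, 5, 7)):
    """Multi-scale frequency-domain convolution followed by pooling (sketch)."""
    cross_corr_seq = np.asarray(cross_corr_seq, dtype=float)
    pooled = []
    for k in kernel_sizes:
        kernel = np.full(k, 1.0 / k)                        # assumed averaging kernel for this scale
        conv_seq = frequency_domain_convolution(cross_corr_seq, kernel)
        pooled.append(conv_seq.mean(axis=1))                # pool along the frequency-domain direction
    return np.concatenate(pooled)                           # K pooled vectors joined into one prediction vector
```

In this sketch each of the K scales contributes one pooled vector, and the concatenation of the K pooled vectors serves as the prediction vector passed to the classification layer.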
It should be noted that: the audio matching device provided in the above embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio matching apparatus and the audio matching method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 14 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Specifically, the computer device 1400 includes a Central Processing Unit (CPU) 1401, a system memory 1404 including a random access memory 1402 and a read-only memory 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401. The computer device 1400 also includes a basic Input/Output system (I/O system) 1406 that facilitates transfer of information between devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse or a keyboard, for the user to input information. The display 1408 and the input device 1409 are both connected to the central processing unit 1401 via an input-output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include the input-output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1410 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes Random Access Memory (RAM), Read Only Memory (ROM), flash Memory or other solid state Memory technology, Compact disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1404 and mass storage device 1407 described above may collectively be referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1401, the one or more programs containing instructions for implementing the methods described above, and the central processing unit 1401 executes the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1400 may also operate by connecting, through a network such as the Internet, to a remote computer on that network. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1414 coupled to the system bus 1405, or the network interface unit 1414 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.
The present application further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio matching method described in any of the above embodiments.
The present application also provides a computer program product, which when run on a computer, causes the computer to execute the audio matching method provided by the above-mentioned method embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be completed by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, which may be the computer-readable storage medium contained in the memory of the above embodiments, or a separate computer-readable storage medium that is not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the audio matching method of any of the above method embodiments.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (11)
1. A method of audio matching, the method comprising:
acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence, and outputting a cross-correlation vector sequence;
calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector;
and calling a classification layer to perform prediction processing on the prediction vector, and outputting the similarity probability of the first audio and the second audio.
2. The method of claim 1, wherein the first feature sequence comprises n first frequency-domain vectors, the second feature sequence comprises q second frequency-domain vectors, and n and q are positive integers;
the calling a sequence cross-correlation layer to perform cross-correlation processing on the first feature sequence and the second feature sequence, and outputting a cross-correlation vector sequence comprises:
calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence and a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence;
and splicing the first cross-correlation vector sequence and the second cross-correlation vector sequence, and outputting the cross-correlation vector sequence.
3. The method of claim 2, wherein
the calculating a first cross-correlation vector sequence of the first feature sequence relative to the second feature sequence comprises:
calculating an ith correlation score of an ith first frequency-domain vector of the n first frequency-domain vectors relative to the q second frequency-domain vectors, i being a positive integer not greater than n; and taking the ith correlation score as a correlation weight of the ith first frequency-domain vector, and calculating a weighted sequence of the n first frequency-domain vectors to obtain the first cross-correlation vector sequence;
the calculating a second cross-correlation vector sequence of the second feature sequence relative to the first feature sequence comprises:
calculating a jth correlation score of a jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors, j being a positive integer not greater than q; and taking the jth correlation score as a correlation weight of the jth second frequency-domain vector, and calculating a weighted sequence of the q second frequency-domain vectors to obtain the second cross-correlation vector sequence.
4. The method of claim 3, wherein
said calculating an ith correlation score for an ith of said n first frequency-domain vectors relative to said q second frequency-domain vectors, comprising:
calculating a product sum of the i-th first frequency-domain vector and the q second frequency-domain vectors, and a square sum of the q second frequency-domain vectors; determining a quotient of the product sum and the square sum as an ith correlation score of the ith first frequency-domain vector relative to the q second frequency-domain vectors;
the calculating a jth correlation score of the jth second frequency-domain vector of the q second frequency-domain vectors relative to the n first frequency-domain vectors comprises:
calculating a product sum of the jth second frequency-domain vector and the n first frequency-domain vectors, and a square sum of the n first frequency-domain vectors; determining a quotient of the product sum and the square sum as a jth correlation score of the jth second frequency-domain vector relative to the n first frequency-domain vectors.
5. The method of any of claims 1 to 4, wherein the feature extraction layer comprises: a frequency domain convolution kernel;
the calling a feature extraction layer to perform feature extraction processing on the cross-correlation vector sequence and output a prediction vector comprises:
calling the frequency domain convolution kernel to carry out frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence;
and outputting the prediction vector according to the frequency domain convolution vector sequence.
6. The method of claim 5, wherein said outputting the prediction vector according to the sequence of frequency-domain convolution vectors comprises:
pooling the frequency domain convolution vector sequence along a frequency domain direction, and determining a frequency domain pooling vector obtained by pooling as the prediction vector.
7. The method of claim 5, wherein the frequency-domain convolution kernel comprises K frequency-domain convolution kernels at different scales, K being an integer greater than 1;
the calling the frequency domain convolution kernel to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain a frequency domain convolution vector sequence comprises:
and respectively calling the K frequency domain convolution kernels of different scales to perform frequency domain convolution processing on the cross-correlation vector sequence along the frequency domain direction to obtain K frequency domain convolution vector sequences of different scales.
8. The method of claim 5, wherein said outputting the prediction vector according to the sequence of frequency-domain convolution vectors comprises:
and pooling the K frequency domain convolution vector sequences of different scales along the frequency domain direction, and determining the K frequency domain pooling vectors obtained by pooling as the prediction vector.
9. An audio matching apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a first feature sequence of a first audio and a second feature sequence of a second audio;
the sequence cross-correlation module is used for performing cross-correlation processing on the first feature sequence and the second feature sequence and outputting a cross-correlation vector sequence;
the feature extraction module is used for performing feature extraction processing on the cross-correlation vector sequence and outputting a prediction vector;
and the classification module is used for performing prediction processing on the prediction vector and outputting the similarity probability of the first audio and the second audio.
10. A terminal, characterized in that the terminal comprises: a processor and a memory storing at least one instruction, at least one program, a set of codes, or a set of instructions that is loaded and executed by the processor to implement the audio matching method of any of claims 1 to 8.
11. A computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the audio matching method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010202378.5A CN111445922B (en) | 2020-03-20 | 2020-03-20 | Audio matching method, device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111445922A true CN111445922A (en) | 2020-07-24 |
CN111445922B CN111445922B (en) | 2023-10-03 |
Family
ID=71654307
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010202378.5A Active CN111445922B (en) | 2020-03-20 | 2020-03-20 | Audio matching method, device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111445922B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080281590A1 (en) * | 2005-10-17 | 2008-11-13 | Koninklijke Philips Electronics, N.V. | Method of Deriving a Set of Features for an Audio Input Signal |
JP2008026836A (en) * | 2006-07-25 | 2008-02-07 | Yamaha Corp | Method, device, and program for evaluating similarity of voice |
US20110035373A1 (en) * | 2009-08-10 | 2011-02-10 | Pixel Forensics, Inc. | Robust video retrieval utilizing audio and video data |
US20140205103A1 (en) * | 2011-08-19 | 2014-07-24 | Dolby Laboratories Licensing Corporation | Measuring content coherence and measuring similarity |
US20180349350A1 (en) * | 2017-06-01 | 2018-12-06 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Artificial intelligence based method and apparatus for checking text |
CN109859772A (en) * | 2019-03-22 | 2019-06-07 | 平安科技(深圳)有限公司 | Emotion identification method, apparatus and computer readable storage medium |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110675893A (en) * | 2019-09-19 | 2020-01-10 | 腾讯音乐娱乐科技(深圳)有限公司 | Song identification method and device, storage medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
GRAHAM PERCIVAL et al.: "Streamlined Tempo Estimation Based on Autocorrelation and Cross-correlation With Pulses", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, pages 1765 - 1771 *
YANG FAN et al.: "Cover Song Identification Based on Cross Recurrence Plot and Local Matching", Journal of East China University of Science and Technology, pages 247 - 253 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487236A (en) * | 2020-12-01 | 2021-03-12 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and storage medium for determining associated song list |
CN113763927A (en) * | 2021-05-13 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and readable storage medium |
CN113763927B (en) * | 2021-05-13 | 2024-03-08 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and readable storage medium |
CN115273892A (en) * | 2022-07-27 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Audio processing method, device, equipment, storage medium and computer program product |
Also Published As
Publication number | Publication date |
---|---|
CN111445922B (en) | 2023-10-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111309965B (en) | Audio matching method, device, computer equipment and storage medium | |
CN111400543B (en) | Audio fragment matching method, device, equipment and storage medium | |
CN111444967B (en) | Training method, generating method, device, equipment and medium for generating countermeasure network | |
Chen et al. | The AMG1608 dataset for music emotion recognition | |
CN111445922B (en) | Audio matching method, device, computer equipment and storage medium | |
US10296959B1 (en) | Automated recommendations of audio narrations | |
CN111428074B (en) | Audio sample generation method, device, computer equipment and storage medium | |
CN111309966B (en) | Audio matching method, device, equipment and storage medium | |
CN111444379B (en) | Audio feature vector generation method and audio fragment representation model training method | |
KR20080030922A (en) | Information processing apparatus, method, program and recording medium | |
CN111462761A (en) | Voiceprint data generation method and device, computer device and storage medium | |
CN111445921B (en) | Audio feature extraction method and device, computer equipment and storage medium | |
WO2016102738A1 (en) | Similarity determination and selection of music | |
CN111460215B (en) | Audio data processing method and device, computer equipment and storage medium | |
EP3096242A1 (en) | Media content selection | |
Mirza et al. | Residual LSTM neural network for time dependent consecutive pitch string recognition from spectrograms: a study on Turkish classical music makams | |
Lu et al. | Predicting likability of speakers with Gaussian processes | |
Leleuly et al. | Analysis of feature correlation for music genre classification | |
Blume et al. | Huge music archives on mobile devices | |
Shirali-Shahreza et al. | Fast and scalable system for automatic artist identification | |
Pei et al. | Instrumentation analysis and identification of polyphonic music using beat-synchronous feature integration and fuzzy clustering | |
Cai et al. | Feature selection approaches for optimising music emotion recognition methods | |
Kher | Music Composer Recognition from MIDI Representation using Deep Learning and N-gram Based Methods | |
Chen et al. | Hierarchical representation based on Bayesian nonparametric tree-structured mixture model for playing technique classification | |
Hemgren | Fuzzy Content-Based Audio Retrieval Using Visualization Tools |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40025589; Country of ref document: HK |
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |