CN115221351A - Audio matching method and device, electronic equipment and computer-readable storage medium

Info

Publication number
CN115221351A
Authority
CN
China
Prior art keywords
audio
sample
fingerprint
target
audio fingerprint
Legal status
Pending (assumed; not a legal conclusion)
Application number
CN202210881716.1A
Other languages
Chinese (zh)
Inventor
袁有根
胡鹏飞
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210881716.1A
Publication of CN115221351A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval using metadata automatically derived from the content
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose an audio matching method and apparatus, an electronic device, and a computer-readable storage medium. In the embodiments, an audio to be matched and an audio set are obtained, where the audio set comprises audio and a numerical matrix corresponding to the audio fingerprint of each audio. Fingerprint feature extraction is performed on the audio to be matched to obtain a target audio fingerprint corresponding to it, where the target audio fingerprint comprises a plurality of fingerprint elements. Each fingerprint element in the target audio fingerprint is mapped to a target value within a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, where the difference between each target value and an endpoint value of the preset value interval lies within a preset range. Finally, audio matching the audio to be matched is searched for in the audio set according to the target value matrix and the stored numerical matrices corresponding to the audio fingerprints. The embodiments of the present application can improve the speed of audio matching.

Description

Audio matching method and device, electronic equipment and computer-readable storage medium
Technical Field
The present application relates to the field of audio technologies, and in particular, to an audio matching method and apparatus, an electronic device, and a computer-readable storage medium.
Background
As the internet develops, more and more audio is available online. After a user obtains a piece of audio, audio matching it can be searched for by means of the audio's fingerprint.
However, matching with audio fingerprints extracted by current fingerprint extraction methods is slow.
Disclosure of Invention
The embodiments of the present application provide an audio matching method and apparatus, an electronic device, and a computer-readable storage medium, which can address the technical problem of slow audio matching.
An audio matching method, comprising:
acquiring an audio to be matched and an audio set, wherein the audio set comprises audio and a numerical matrix corresponding to the audio fingerprint of the audio;
extracting fingerprint features of the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, wherein the target audio fingerprint comprises a plurality of fingerprint elements;
mapping each fingerprint element in the target audio fingerprint to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein a difference value between the target value and an end point value of the preset value interval is in a preset range;
and searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
Accordingly, an embodiment of the present application provides an audio matching apparatus, including:
an acquisition module, configured to acquire an audio to be matched and an audio set, wherein the audio set comprises audio and a numerical matrix corresponding to the audio fingerprint of the audio;
the extraction module is used for extracting fingerprint features of the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, and the target audio fingerprint comprises a plurality of fingerprint elements;
a mapping module, configured to map each fingerprint element in the target audio fingerprint to a target value within a preset value interval, to obtain a target value matrix corresponding to the target audio fingerprint, where a difference between the target value and an end value of the preset value interval is within a preset range;
and the searching module is used for searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
The numerical matrix corresponding to the audio fingerprint of the audio includes a binary matrix corresponding to the audio fingerprint of the audio.
Correspondingly, the lookup module is specifically configured to perform:
performing binary mapping processing on the target numerical matrix to obtain a target binary matrix corresponding to the target audio fingerprint;
and searching the audio matched with the audio to be matched from the audio set according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
Optionally, the extracting module is specifically configured to perform:
and extracting fingerprint characteristics of the audio to be matched through the trained audio fingerprint model to obtain a target audio fingerprint corresponding to the audio to be matched.
The mapping module is specifically configured to perform:
and mapping each fingerprint element in the target audio fingerprint to a target numerical value in a preset numerical value interval through the trained audio fingerprint model.
Optionally, the audio matching apparatus further comprises:
a training module to perform:
acquiring a training sample set of an audio fingerprint model to be trained, and performing fingerprint feature extraction on training samples in the training sample set to obtain sample audio fingerprints corresponding to the training samples;
mapping each element in the sample audio fingerprint to a sample numerical value in the preset numerical value interval to obtain a sample numerical value matrix corresponding to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical matrix and the endpoint value corresponding to the preset numerical interval;
and training the audio fingerprint model to be trained according to the target loss value so as to make the audio fingerprint model to be trained converge on the endpoint value, thereby obtaining the trained audio fingerprint model.
Optionally, the training module is specifically configured to perform:
carrying out short-time Fourier transform on training samples in the training sample set to obtain time-frequency characteristic vectors corresponding to the training samples;
and extracting fingerprint features of the time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
Optionally, the fingerprint feature extraction includes temporal feature extraction and spatial feature extraction.
Accordingly, the training module is specifically configured to perform:
extracting time characteristics of the time-frequency characteristic vector to obtain a time characteristic vector corresponding to the training sample;
and extracting the spatial features of the time feature vectors to obtain sample audio fingerprints corresponding to the training samples.
Optionally, the training module is specifically configured to perform:
acquiring each section of target audio;
carrying out windowing and framing processing on the target audio to obtain each sample voice segment corresponding to the target audio;
and determining a training sample set according to the sample voice fragments.
Optionally, the training module is specifically configured to perform:
performing time domain data augmentation processing on the sample voice fragment to obtain a positive sample corresponding to the sample voice fragment;
and determining a training sample set according to the positive sample corresponding to the sample voice fragment and the sample voice fragment.
Optionally, the training module is specifically configured to perform:
performing frequency domain data amplification processing on the time-frequency characteristic vector to obtain a sample time-frequency characteristic vector corresponding to the training sample;
and performing fingerprint feature extraction on the sample time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
Optionally, the training module is specifically configured to perform:
determining a contrast loss value of the audio fingerprint model to be trained according to the sample audio fingerprint;
determining a quantization loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and the corresponding endpoint value of the preset numerical value interval;
and according to the contrast loss value and the quantization loss value, obtaining a target loss value of the audio fingerprint model to be trained.
Optionally, the training module is specifically configured to perform:
screening out a first sample audio fingerprint from the sample audio fingerprints;
determining a first inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprints, wherein when the first sample audio fingerprint is a sample audio fingerprint corresponding to the first sample voice fragment, the second sample audio fingerprint is a sample audio fingerprint of a first positive sample corresponding to the first sample voice fragment, or when the first sample audio fingerprint is a sample audio fingerprint corresponding to the first positive sample, the second sample audio fingerprint is a sample audio fingerprint of a first sample voice fragment corresponding to the first positive sample;
determining a second inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint, and a third sample audio fingerprint, where the third sample audio fingerprint is a sample audio fingerprint of the sample audio fingerprints other than the first sample audio fingerprint and the second sample audio fingerprint;
and determining the contrast loss value of the audio fingerprint model to be trained according to the first inner product value and the second inner product value.
In addition, an electronic device is further provided in an embodiment of the present application, and includes a processor and a memory, where the memory stores a computer program, and the processor is configured to run the computer program in the memory to implement the audio matching method provided in the embodiment of the present application.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing a computer program, the computer program being adapted to be loaded by a processor to execute any one of the audio matching methods provided in the embodiments of the present application.
In addition, the embodiment of the present application also provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements any one of the audio matching methods provided by the embodiment of the present application.
In the embodiment of the application, an audio to be matched and an audio set are obtained first, the audio set comprising audio and numerical matrices corresponding to the audio fingerprints of the audio. Then, fingerprint feature extraction is carried out on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, wherein the target audio fingerprint comprises a plurality of fingerprint elements. Next, each fingerprint element in the target audio fingerprint is mapped to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein the difference value between the target value and an endpoint value of the preset value interval is in a preset range. Finally, the audio matched with the audio to be matched is searched from the audio set according to the target numerical matrix and the numerical matrices corresponding to the audio fingerprints.
In other words, in the embodiment of the present application, each fingerprint element in the target audio fingerprint is mapped to a target value in a preset value interval, so as to obtain a target value matrix corresponding to the target audio fingerprint, and a difference value between the target value and an end value of the preset value interval is within a preset range, so that an audio matched with the audio to be matched can be searched according to the target value in the target value matrix and a value in the value matrix, thereby reducing the time for searching the audio matched with the audio to be matched, and increasing the speed for searching the audio matched with the audio to be matched.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic view of a scene of an audio matching process provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of an audio matching method provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of model training provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a model provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of a model application provided by an embodiment of the present application;
fig. 6 is a schematic structural diagram of an audio matching apparatus provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising interaction modes.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The scheme provided by the embodiments of the present application relates to technologies such as artificial intelligence speech recognition and machine learning, which are explained through the following embodiments.
The embodiment of the application provides an audio matching method, an audio matching device, electronic equipment and a computer-readable storage medium. The audio matching device may be integrated in an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
Moreover, a plurality of servers may form a blockchain, with each server being a node on the blockchain.
The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
For example, as shown in fig. 1, the terminal may acquire the audio to be matched and send the audio to be matched to the server. And the server extracts fingerprint features of the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, the target audio fingerprint comprises a plurality of fingerprint elements, each fingerprint element in the target audio fingerprint is mapped to be a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, and the difference value between the target value and the end point value of the preset value interval is in a preset range. And then, the server searches the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint of the audio in the audio set, and returns the audio matched with the audio to be matched to the terminal.
In addition, "a plurality" in the embodiment of the present application means two or more. "first" and "second" and the like in the embodiments of the present application are used for distinguishing the description, and are not to be construed as implying relative importance.
The following are detailed descriptions. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
In the present embodiment, description is made from the perspective of the audio matching apparatus. For convenience of describing the audio matching method of the present application, the following description takes the audio matching apparatus as integrated in a terminal, that is, with the terminal as the execution subject.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating an audio matching method according to an embodiment of the present application. The audio matching method may include:
s201, obtaining an audio and an audio set to be matched, wherein the audio set comprises audio and a numerical matrix corresponding to audio fingerprints of the audio.
When the terminal receives an acquisition instruction, it collects the audio to be matched through its own microphone according to the instruction. Alternatively, the audio to be matched may be collected through the microphone of another terminal, which then sends it to the terminal. The method by which the terminal acquires the audio to be matched may be selected according to the actual situation, and this embodiment is not limited here.
The terminal can send an acquisition request instruction to the server when acquiring the audio to be matched. And the server returns the audio set to the terminal according to the acquisition request, and the terminal acquires the audio set. Or, the terminal may also send an acquisition request instruction to the server before acquiring the audio to be matched. And the server returns the audio set to the terminal according to the acquisition request, so that the terminal acquires the audio set and locally stores the audio set. And then the terminal acquires the audio set from the local storage, so that when the audio to be matched is acquired, the audio matched with the audio to be matched can be found even if the terminal is in an off-line state. The method for acquiring the audio set by the terminal is not limited in this embodiment.
It should be understood that the audio set sent to the terminal by the server may include only audio; in that case, after acquiring the audio set, the terminal performs feature extraction and mapping processing on the audio in the set to obtain the numerical matrix corresponding to the audio fingerprint of each audio, and stores these numerical matrices.
Alternatively, the server may perform the feature extraction and mapping processing on the audio in the audio set, store the resulting numerical matrices corresponding to the audio fingerprints, and then send the audio set to the terminal.
The process of performing feature extraction and mapping on the audio in the audio set may specifically refer to a process of performing fingerprint feature extraction and mapping on the audio to be matched, which is not described in detail herein.
S202, fingerprint feature extraction is carried out on the audio to be matched, and a target audio fingerprint corresponding to the audio to be matched is obtained and comprises a plurality of fingerprint elements.
An audio fingerprint (fingerprint) is a set of unique identifiers (which may be, for example, symbols or numbers) obtained by extracting features of audio. A fingerprint element refers to a unique identifier in a set of unique identifiers.
The method for extracting the fingerprint features of the audio to be matched to obtain the target audio fingerprint corresponding to the audio to be matched can be selected according to actual conditions, for example, the method can extract the fingerprint features of the audio to be matched through a trained audio fingerprint model to obtain the target audio fingerprint corresponding to the audio to be matched. The present embodiment is not limited herein.
In some embodiments, the extracting fingerprint features of the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched includes:
carrying out short-time Fourier transform on the audio to be matched to obtain a target time-frequency characteristic vector of the audio to be matched;
and fingerprint feature extraction is carried out on the target time-frequency feature vector to obtain a target audio fingerprint corresponding to the audio to be matched.
In this embodiment, a Short-Time Fourier Transform (STFT) is performed on the audio to be matched to obtain its target time-frequency feature vector. This reduces the amount of computation for fingerprint feature extraction and improves its speed, while simultaneously capturing the time domain and frequency domain information of the audio to be matched, so that less information is lost during fingerprint feature extraction and the accuracy of the target audio fingerprint is improved.
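As an illustration, the STFT step can be sketched in a few lines of Python. The sketch below assumes PyTorch; the window length of 1024 sampling points and the window shift of 256 sampling points are taken from the training configuration described later in this document, and the function name stft_features is hypothetical. Note that a 1024-point one-sided STFT yields 513 frequency bins, so the 256-dimensional feature mentioned later would require an additional projection or filterbank not shown here.

```python
import torch

def stft_features(waveform: torch.Tensor) -> torch.Tensor:
    """Magnitude time-frequency matrix for a 1-D mono waveform.

    Rows correspond to time frames (time domain information) and
    columns to frequency bins (frequency domain information).
    """
    window = torch.hann_window(1024)
    spec = torch.stft(waveform, n_fft=1024, hop_length=256,
                      window=window, return_complex=True)
    return spec.abs().T  # shape: (frames, 513)
```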
In other embodiments, the extracting the fingerprint feature of the target time-frequency feature vector to obtain the target audio fingerprint corresponding to the audio to be matched includes:
extracting time features of the target time-frequency feature vector to obtain a target time feature vector corresponding to the audio to be matched;
and extracting the spatial features of the target time feature vector to obtain a target audio fingerprint corresponding to the audio to be matched.
The target time-frequency feature vector includes the time domain information and the frequency domain information of the audio to be matched; for example, each row of the target time-frequency feature vector carries time domain information and each column carries frequency domain information. Time feature extraction on the target time-frequency feature vector can thus be understood as extracting features from each row of elements to obtain the target time feature vector, and spatial feature extraction on the target time feature vector can be understood as extracting features from each column of elements.
The time feature extraction and the space feature extraction of the target time-frequency feature vector may be performed according to actual conditions, for example, the time feature extraction and the space feature extraction of the target time-frequency feature vector may be performed by a Spatially separable convolutional layer (SSCNN) in a trained audio fingerprint model, or the time feature extraction of the target time-frequency feature vector may be performed by a temporal convolutional layer, and the space feature extraction of the target time-frequency feature vector may be performed by a spatial convolutional layer. The present embodiment is not limited herein.
In this embodiment, the time characteristic and the spatial characteristic of the audio to be matched are extracted at the same time, so that the time domain information and the frequency domain information of the audio to be matched are included in the target audio fingerprint at the same time, thereby improving the accuracy of the target audio fingerprint and further improving the accuracy of the found audio matched with the audio to be matched.
It should be understood that the processes of performing temporal feature extraction and spatial feature extraction may be performed once or multiple times, and the embodiment is not limited herein.
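A minimal sketch of one temporally-then-spatially separable convolution block is given below, assuming PyTorch. The channel counts, kernel sizes, and strides are illustrative assumptions, not the patent's actual architecture; a full model would stack several such blocks and end with a layer producing the fingerprint.

```python
import torch
import torch.nn as nn

class SeparableBlock(nn.Module):
    """Temporal convolution over time frames, then spatial convolution
    over frequency bins (a separable-convolution sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Kernel spans the time axis only (rows of the time-frequency matrix).
        self.temporal = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1),
                                  stride=(2, 1), padding=(1, 0))
        # Kernel spans the frequency axis only (columns).
        self.spatial = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3),
                                 stride=(1, 2), padding=(0, 1))
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        return self.act(self.spatial(self.act(self.temporal(x))))
```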
S203, mapping each fingerprint element in the target audio fingerprint to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein the difference value between the target value and the endpoint value of the preset value interval is in a preset range.
The terminal can map each fingerprint element in the target audio fingerprint to a target value in a preset value interval through the trained audio fingerprint model, and a target value matrix corresponding to the target audio fingerprint is obtained. Or, the terminal may map each fingerprint element in the target audio fingerprint to a target value within a preset value interval through a preset mapping function (the preset mapping function may be selected according to the preset value interval, for example, when the preset value interval is (-1, 1), the preset mapping function may be a hyperbolic tangent function tanh), so as to obtain a target value matrix corresponding to the target audio fingerprint. Or, the terminal may map each fingerprint element in the target audio fingerprint to a target value within a preset value interval by using a preset mapping matrix.
The preset value interval may be selected according to actual conditions, for example, the preset value interval may be set to (-1, 1) or set to (0, 1), and the embodiment is not limited herein.
The preset range may be set according to actual conditions, for example, the preset range may be set to [0,1] or [0,0.5], and the embodiment is not limited herein.
In this embodiment, each fingerprint element in the target audio fingerprint is mapped to a target value within a preset value interval, so as to obtain a target value matrix corresponding to the target audio fingerprint, and a difference value between the target value and an end value of the preset value interval is within a preset range, so that the fingerprint element in the target audio fingerprint is limited to the target value within the preset range, where the difference value between the fingerprint element and the end value of the preset value interval is within the preset range, so as to speed up finding of an audio matched with the audio to be matched.
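The mapping itself can be as simple as applying tanh elementwise, as in the hedged Python snippet below (the 128-element fingerprint is hypothetical). With an untrained model the mapped values merely lie inside (-1, 1); it is the quantization loss described later that pushes them toward the endpoint values.

```python
import torch

fingerprint = torch.randn(128) * 4        # hypothetical fingerprint elements
target_values = torch.tanh(fingerprint)   # each element mapped into (-1, 1)
# After training, |target_values| should cluster near the endpoint value 1,
# i.e. the difference from an endpoint stays within a small preset range.
```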
And S204, searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
After the terminal acquires the target numerical matrix and the numerical matrix, the target numerical matrix and the numerical matrix can be matched, and then the audio corresponding to the numerical matrix matched with the target numerical matrix is used as the audio matched with the audio to be matched.
In this embodiment, since the difference between the target value in the target value matrix and the endpoint value of the preset value interval is within the preset range, and the difference between the value in the value matrix and the endpoint value of the preset value interval is within the preset range, the audio matched with the audio to be matched can be found from the audio set more quickly according to the value matrix corresponding to the target value matrix and the audio fingerprint.
In some embodiments, in order to find the audio matched with the audio to be matched from the audio set more quickly, the numerical matrix corresponding to the audio fingerprint of the audio includes a binary matrix corresponding to the audio fingerprint of the audio;
searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint, wherein the searching comprises the following steps:
performing binary mapping processing on the target numerical matrix to obtain a target binary matrix corresponding to the target audio fingerprint;
and searching the audio matched with the audio to be matched from the audio set according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
The binary mapping process refers to using two values to represent the target value in the target value matrix, for example, using 0 and 1 to represent the target value in the target value matrix, that is, the element in the target binary matrix is 0 or 1.
The method for performing binary mapping on the target value matrix to obtain a target binary matrix corresponding to the target audio fingerprint may be selected according to actual conditions, for example, inputting the target value in the target value matrix into a sign function sign to perform binary mapping, so as to obtain the target binary matrix:
$$\operatorname{sign}(t) = \begin{cases} 1, & t \geq 0 \\ 0, & t < 0 \end{cases}$$
where t represents the target value.
Or, the target values in the target value matrix may be input to a piecewise function for binary mapping processing, so as to obtain a target binary matrix. The present embodiment is not limited herein.
In this embodiment, since the target numerical matrix is mapped to the target binary matrix, the audio matched with the audio to be matched can be found from the audio set more quickly according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
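A small sketch of this lookup, assuming NumPy; binarize implements the sign rule above with outputs in {0, 1}, and nearest_audio is a hypothetical brute-force Hamming-distance search (a production system would typically use an inverted index or packed-bit comparison instead).

```python
import numpy as np

def binarize(values: np.ndarray) -> np.ndarray:
    """Binary mapping with the sign rule: 1 where t >= 0, otherwise 0."""
    return (values >= 0).astype(np.uint8)

def nearest_audio(query: np.ndarray, database: dict) -> str:
    """Return the id whose stored binary matrix has the smallest
    Hamming distance (count of differing elements) to the query."""
    return min(database,
               key=lambda k: np.count_nonzero(database[k] != query))
```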
In some other embodiments, in order to more accurately find the audio matched with the audio to be matched, fingerprint feature extraction is performed on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, including:
carrying out windowing and framing processing on the audio to be matched to obtain each target voice fragment of the audio to be matched;
and fingerprint feature extraction is carried out on the target voice fragment to obtain a target sub-audio fingerprint corresponding to the target voice fragment.
Mapping each fingerprint element in the target audio fingerprint to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein the method comprises the following steps:
and mapping each fingerprint element in the target sub-audio fingerprint to a target numerical value in a preset numerical value interval to obtain a target sub-numerical value matrix corresponding to the target sub-audio fingerprint.
Searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint, wherein the searching comprises the following steps:
and searching the audio matched with the audio to be matched from the audio set according to the target sub-numerical matrix and the numerical matrix corresponding to the audio fingerprint.
At this time, the terminal may fuse the target sub-numerical matrices into a target numerical matrix, and then search the audio set for audio matching the audio to be matched according to the target numerical matrix and the numerical matrices corresponding to the audio fingerprints.
Alternatively, the audio set may include both the numerical matrix corresponding to each audio fingerprint and numerical sub-matrices corresponding to voice segments; the terminal can then match the target sub-numerical matrices against the numerical sub-matrices, and determine the matching audio from the voice segments corresponding to the numerical sub-matrices that match the target sub-numerical matrices.
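As one possible reading of the second strategy, the sketch below matches each target sub-numerical (binary) matrix against stored per-segment sub-matrices and lets each query segment vote for an audio; the voting fusion is an assumption, since the document leaves the exact combination open.

```python
import numpy as np

def match_by_segments(query_mats: list, db: dict) -> str:
    """db maps audio_id -> list of per-segment binary sub-matrices.
    Each query sub-matrix votes for the audio holding its nearest
    stored sub-matrix; the audio with the most votes is returned."""
    votes = {}
    for q in query_mats:
        best_id, best_d = None, float("inf")
        for audio_id, seg_mats in db.items():
            for s in seg_mats:
                d = np.count_nonzero(q != s)
                if d < best_d:
                    best_id, best_d = audio_id, d
        votes[best_id] = votes.get(best_id, 0) + 1
    return max(votes, key=votes.get)
```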
The extracting of the fingerprint features of the target voice segment to obtain the target sub-audio fingerprint corresponding to the target voice segment may include:
performing short-time Fourier transform on the target voice segment to obtain a target sub time-frequency characteristic vector of the target voice segment;
and fingerprint feature extraction is carried out on the target sub-time-frequency feature vector to obtain a target sub-audio fingerprint corresponding to the target voice segment.
Optionally, the extracting the fingerprint feature of the target sub-time-frequency feature vector to obtain a target sub-audio fingerprint corresponding to the target speech segment may include:
extracting time characteristics of the target sub time-frequency characteristic vector to obtain a target sub time characteristic vector corresponding to the target voice fragment;
and extracting the spatial features of the target sub-time feature vector to obtain a target sub-audio fingerprint corresponding to the target voice segment.
When the trained audio fingerprint model is used to extract fingerprint features from the audio to be matched to obtain the target audio fingerprint, and is also used to map each fingerprint element in the target audio fingerprint to a target value within the preset value interval, then before the fingerprint feature extraction is performed on the audio to be matched, the method further includes:
acquiring a training sample set of an audio fingerprint model to be trained, and performing fingerprint feature extraction on training samples in the training sample set to obtain sample audio fingerprints corresponding to the training samples;
mapping each element in the sample audio fingerprint to a sample value in a preset value interval to obtain a sample value matrix corresponding to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and the corresponding endpoint value of the preset numerical value interval;
and training the audio fingerprint model to be trained according to the target loss value so as to make the audio fingerprint model to be trained converge at an endpoint value, and obtaining the trained audio fingerprint model.
The sample values in the sample value matrix and the endpoint value corresponding to the preset value interval may be substituted into a loss function to obtain the target loss value of the audio fingerprint model to be trained. The type of loss function may be selected according to the actual situation; for example, it may be a quantization loss function or a cross-entropy loss function, and this embodiment is not limited here.
If the target loss value meets the preset condition, it indicates that the audio fingerprint model to be trained converges to the end point value, that is, it indicates that each fingerprint element in the target audio fingerprint can be mapped to a target value within a preset value interval by the audio fingerprint model to be trained, and the difference value between the target value and the end point value of the preset value interval is within a preset range, then the audio fingerprint model to be trained is taken as the trained audio fingerprint model.
And if the target loss value does not meet the preset condition, updating the network parameters of the audio fingerprint model to be trained according to the target loss value, and returning to execute the step of performing fingerprint feature extraction on the training samples in the training sample set to obtain the sample audio fingerprints corresponding to the training samples.
In this embodiment, when training an audio fingerprint model to be trained, first mapping each element in a sample audio fingerprint to a sample value within a preset value interval to obtain a sample value matrix corresponding to the sample audio fingerprint, then determining a target loss value of the audio fingerprint model to be trained according to the sample value matrix and an endpoint value corresponding to the preset value interval, and finally training the audio fingerprint model to be trained according to the target loss value to converge the audio fingerprint model to the endpoint value, i.e., to make the sample value approach the endpoint value, so that the trained audio fingerprint model can map each fingerprint element in the target audio fingerprint to a target value within the preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, where a difference between the target value and the endpoint value of the preset value interval is within a preset range, thereby increasing a speed of finding an audio matched with an audio to be matched.
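One training iteration might look like the following sketch, assuming PyTorch; contrast_loss and quantization_loss are placeholders implemented in the loss sketches further below, and alpha is the weight coefficient from the combined-loss formula.

```python
import torch

def train_step(model, batch, optimizer, alpha: float = 0.5) -> float:
    """One hedged training iteration for the audio fingerprint model."""
    z = model(batch)              # sample audio fingerprints
    mapped = torch.tanh(z)        # sample value matrices in (-1, 1)
    loss = alpha * contrast_loss(z) + (1 - alpha) * quantization_loss(mapped)
    optimizer.zero_grad()
    loss.backward()               # drive the mapped values toward the endpoints
    optimizer.step()
    return loss.item()
```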
In other embodiments, performing fingerprint feature extraction on a training sample in a training sample set to obtain a sample audio fingerprint corresponding to the training sample includes:
carrying out short-time Fourier transform on training samples in the training sample set to obtain time-frequency characteristic vectors corresponding to the training samples;
and fingerprint feature extraction is carried out on the time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
For example, in the process of performing the short-time Fourier transform on the training samples in the training sample set, the window length of the short-time Fourier transform is set to 1024 sampling points, the window shift is set to 256 sampling points, and the feature dimension is 256.
In this embodiment, a short-time Fourier transform is performed on each training sample to obtain the corresponding time-frequency feature vector, which is used in place of a spectral feature; this increases the training speed of the audio fingerprint model to be trained, reduces the information loss of the training sample, and improves the matching effect of the trained audio fingerprint model on the audio to be matched.
It should be understood that, the training samples in the training sample set may be subjected to short-time fourier transform by the to-be-trained audio fingerprint model, or the training samples in the training sample set may also be subjected to short-time fourier transform by a convolution layer, where the convolution layer may not be located in the to-be-trained audio fingerprint model, and then the time-frequency feature vector is subjected to fingerprint feature extraction by the to-be-trained audio fingerprint model, so as to obtain the sample audio fingerprint corresponding to the training sample.
In other embodiments, the fingerprint feature extraction includes temporal feature extraction and spatial feature extraction. And fingerprint feature extraction is carried out on the time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample, and the method comprises the following steps:
extracting time features of the time-frequency feature vectors to obtain time feature vectors corresponding to the training samples;
and extracting the spatial features of the time feature vector to obtain a sample audio fingerprint corresponding to the training sample.
When the audio fingerprint model to be trained is trained, it is trained to perform time feature extraction on the time-frequency feature vector and spatial feature extraction on the time feature vector. This reduces the dimensionality of the time-frequency feature vector while preserving the timing information of the training sample. The trained audio fingerprint model can therefore perform time feature extraction and spatial feature extraction on the target time-frequency feature vector and the target time feature vector, which improves the accuracy of the audio found to match the audio to be matched; in other words, when matching audio through the trained audio fingerprint model, accuracy is maintained while matching speed is improved.
It should be understood that, the spatial feature extraction may also be performed first, and then the temporal feature extraction is performed, that is, the spatial feature extraction is performed on the time-frequency feature vector first to obtain a spatial feature vector, and then the temporal feature extraction is performed on the spatial feature vector to obtain a sample audio fingerprint corresponding to the training sample.
In other embodiments, obtaining a training sample set includes:
acquiring each section of target audio;
carrying out windowing and framing processing on the target audio to obtain each sample voice segment corresponding to the target audio;
and determining a training sample set according to the sample voice fragments.
In this embodiment, after each section of target audio is obtained, the target audio is not directly combined into a training sample set, but the target audio is subjected to windowing and framing processing to obtain each sample voice segment, and then the training sample set is formed according to the sample voice segments.
Then, in the process of training the audio fingerprint model to be trained on the training sample set, sample voice segments of the same target audio and sample voice segments of different target audios are randomly selected from the training sample set to form each batch of training samples, so as to shuffle the training samples; finally, the audio fingerprint model to be trained is trained on one such batch at a time.
For example, in the windowing and framing process, the terminal converts each segment of target audio into wav format with a sampling rate of 8000 Hz, a mono channel, and 16 quantization bits, and then slides a window with a time length of 1 second in steps of half a second, thereby obtaining the sample voice segments.
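The windowing and framing step can be sketched as follows in Python with NumPy; the function name frame_audio is hypothetical, while the 8000 Hz rate, 1-second window, and 0.5-second shift come from the example above.

```python
import numpy as np

def frame_audio(samples: np.ndarray, sr: int = 8000) -> list:
    """Split mono audio into 1-second sample voice segments,
    sliding the window every half second."""
    win, hop = sr, sr // 2  # 1 s window, 0.5 s shift
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]
```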
In order to improve the robustness of the trained audio fingerprint model, a training sample set is determined according to a sample speech segment, and the method comprises the following steps:
carrying out time domain data augmentation processing on the sample voice fragment to obtain a positive sample corresponding to the sample voice fragment;
and determining a training sample set according to the positive sample corresponding to the sample voice fragment and the sample voice fragment.
The time domain data augmentation of a sample voice segment may involve adding different types of noise to the segment, changing its speed, or changing its pitch. The noise may be, for example, background music, human voice, reverberation, or echo, and the noise level may be between -10 dB and 30 dB.
Time domain data augmentation is performed on the sample voice segments to obtain the corresponding positive samples, and the training sample set is then determined from the sample voice segments and their positive samples, so that the audio fingerprint model trained on this training sample set is more robust.
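A minimal noise-mixing sketch for the time domain augmentation, assuming NumPy; the SNR range of -10 dB to 30 dB follows the description above, and add_noise is a hypothetical helper (speed and pitch changes are not shown).

```python
import numpy as np

def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into a sample voice segment at the given SNR."""
    noise = np.resize(noise, clean.shape)           # repeat/trim noise to fit
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Example: positive = add_noise(segment, babble, np.random.uniform(-10, 30))
```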
In order to further improve the robustness of the trained audio fingerprint model, the method for extracting the fingerprint features of the time-frequency feature vector to obtain the audio fingerprint corresponding to the training sample comprises the following steps:
carrying out frequency domain data amplification processing on the time-frequency characteristic vector to obtain a sample time-frequency characteristic vector corresponding to the training sample;
and fingerprint feature extraction is carried out on the sample time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
The method of frequency domain data augmentation may be selected according to the actual situation. For example, the terminal may randomly set elements in the time-frequency feature vector to zero, or apply frequency domain masking, which exploits the phenomenon that a strong pure tone masks a weaker sound occurring simultaneously at a nearby frequency. This embodiment is not limited here.
In this embodiment, data amplification processing is performed on the sample voice segment in the time domain, and data amplification processing is performed on the time-frequency feature vector in the frequency domain, so that the robustness of the trained audio fingerprint model is further improved.
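A hedged sketch of the frequency domain augmentation, assuming PyTorch: random zeroing of elements plus masking a band of frequency bins. The dropout probability and mask width are illustrative assumptions.

```python
import torch

def freq_augment(tf: torch.Tensor, p: float = 0.1, width: int = 8) -> torch.Tensor:
    """Augment a (frames, bins) time-frequency matrix in the frequency domain."""
    out = tf * (torch.rand_like(tf) > p)                   # random zeroing
    f0 = int(torch.randint(0, tf.shape[1] - width, (1,)))  # mask start bin
    out[:, f0:f0 + width] = 0.0                            # frequency masking
    return out
```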
In order to further improve the robustness of the trained audio fingerprint model, before determining the target loss value of the audio fingerprint model to be trained according to the sample numerical matrix and the endpoint value corresponding to the preset numerical interval, the method further includes:
determining a contrast loss value of an audio fingerprint model to be trained according to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical matrix and the corresponding endpoint value of the preset numerical interval, wherein the method comprises the following steps:
determining a quantization loss value of the audio fingerprint model to be trained according to the sample numerical matrix and the endpoint value corresponding to the preset numerical interval;
and according to the contrast loss value and the quantization loss value, obtaining a target loss value of the audio fingerprint model to be trained.
In the embodiment, the quantization loss value is not directly used as the target loss value, but the target loss value is determined according to the contrast loss value and the quantization loss value, so that the robustness of the trained audio fingerprint model is improved.
The sample numerical matrix and the corresponding endpoint value of the preset numerical interval can be substituted into the following formula to obtain the quantization loss value of the audio fingerprint model to be trained:
$$L_q = \frac{1}{N} \sum_{(z_i,\, z_j) \in S} \Big( \big\| \lvert\tanh(z_i)\rvert - 1 \big\|_1 + \big\| \lvert\tanh(z_j)\rvert - 1 \big\|_1 \Big)$$
where $L_q$ represents the quantization loss value, $N$ the number of training samples in a batch, $S$ the set of training samples in a batch, $z_i$ the sample audio fingerprint of a sample voice segment, and $z_j$ the sample audio fingerprint of the corresponding positive sample; $\tanh(z_i)$ and $\tanh(z_j)$ are the sample value matrices corresponding to these fingerprints. The term $\|\lvert\tanh(z_i)\rvert - 1\|_1$ first takes the absolute value of $\tanh(z_i)$ and then subtracts 1; $\|\cdot\|_1$ represents the quantization (L1) norm, and 1 is the endpoint value.
The contrast loss value and the quantization loss value may be substituted into the following formula to obtain the target loss value:
$$L = \alpha L_c + (1 - \alpha) L_q$$
where $L$ represents the target loss value, $L_c$ the contrast loss value, and $\alpha$ a weight coefficient that can be set according to the actual situation.
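The two formulas translate directly into code; the sketch below assumes PyTorch and follows the reconstruction of $L_q$ given above (mean L1 distance between the absolute mapped values and the endpoint value 1), which is one plausible reading of the garbled original.

```python
import torch

def quantization_loss(mapped: torch.Tensor) -> torch.Tensor:
    """L_q: mapped is the (N, d) batch of sample value matrices tanh(z)."""
    return (mapped.abs() - 1.0).abs().sum(dim=1).mean()

def total_loss(l_c: torch.Tensor, l_q: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Target loss L = alpha * L_c + (1 - alpha) * L_q."""
    return alpha * l_c + (1.0 - alpha) * l_q
```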
Determining a contrast loss value of an audio fingerprint model to be trained according to the sample audio fingerprint, comprising:
screening out a first sample audio fingerprint from the sample audio fingerprints;
determining a first inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprints, wherein when the first sample audio fingerprint is the sample audio fingerprint corresponding to the first sample voice fragment, the second sample audio fingerprint is the sample audio fingerprint of a first positive sample corresponding to the first sample voice fragment, or when the first sample audio fingerprint is the sample audio fingerprint corresponding to the first positive sample, the second sample audio fingerprint is the sample audio fingerprint of the first sample voice fragment corresponding to the first positive sample;
determining a second inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint and a third sample audio fingerprint, wherein the third sample audio fingerprint is a sample audio fingerprint except the first sample audio fingerprint and the second sample audio fingerprint in the sample audio fingerprints;
and determining a contrast loss value of the audio fingerprint model to be trained according to the first inner product value and the second inner product value.
When the first sample audio fingerprint is the sample audio fingerprint corresponding to the first sample voice segment, the first sample audio fingerprint is inner-multiplied with the second sample audio fingerprint to obtain the inner product value corresponding to the first sample audio fingerprint, and with each third sample audio fingerprint to obtain the inner product value corresponding to that third sample audio fingerprint.
When the first sample audio fingerprint is the sample audio fingerprint corresponding to the first positive sample, the second sample audio fingerprint is inner-multiplied with the first sample audio fingerprint to obtain the inner product value corresponding to the first sample audio fingerprint, and with each third sample audio fingerprint to obtain the inner product value corresponding to that third sample audio fingerprint.
The inner product value corresponding to the first sample audio fingerprint is then divided by a preset constant to obtain a first quotient, and the inner product value corresponding to each third sample audio fingerprint is divided by the preset constant to obtain a second quotient.
Next, exponential operations are performed on the first quotient and the second quotients to obtain a first exponential result and second exponential results, where the first exponential result is the first inner product value of the first sample audio fingerprint. The first exponential result and the second exponential results are accumulated to obtain the second inner product value of the first sample audio fingerprint.
Finally, the first exponential result is divided by the second inner product value, and the negative logarithm of the quotient is taken to obtain the first contrast loss value of the first sample audio fingerprint. The step of screening a first sample audio fingerprint from the sample audio fingerprints is then repeated to obtain the first contrast loss value of each first sample audio fingerprint, and these first contrast loss values are accumulated to obtain the contrast loss value of the audio fingerprint model to be trained.
Namely, substituting the first inner product value and the second inner product value into the following comparison learning function formula to obtain a comparison loss value of the audio fingerprint model to be trained:
l(i, j) = −log [ exp(a_ij / τ) / Σ_{y≠i} exp(a_iy / τ) ]

L_c = (1/N) Σ_{k=1}^{N/2} l(2k−1, 2k)

where i = 2k−1, j = 2k, and l(i, j) represents the first contrast loss value of each first sample audio fingerprint. τ denotes the preset constant, i denotes the first sample voice fragment, j denotes the first positive sample, and y denotes another training sample of the batch, i.e. a second sample voice fragment or the positive sample of a second sample voice fragment. a_ij represents the inner product value of the first sample audio fingerprint; when y = j, a_iy represents the inner product value of the first sample audio fingerprint, and when y ≠ j, a_iy represents the inner product value corresponding to a third sample audio fingerprint.

exp(·) denotes the exponential function with the constant e as base, exp(a_ij / τ) represents the first inner product value, each exp(a_iy / τ) is the first exponential result or a second exponential result, and

Σ_{y≠i} exp(a_iy / τ)

represents the second inner product value of the first sample audio fingerprint.
At this time, z_i above may represent the sample audio fingerprint corresponding to the first sample voice fragment, and z_j above may represent the sample audio fingerprint of the first positive sample.
It should be understood that, because the training samples in the training sample set may be sample voice segments or positive samples of the sample voice segments, the third sample audio fingerprint may be a sample audio fingerprint corresponding to the second sample voice segment or a sample audio fingerprint of a positive sample corresponding to the second sample voice segment, where the second sample voice segment is a sample voice segment in the training sample set other than the first sample voice segment.
The first sample speech segment may also be referred to as a reference sample, and the second sample speech segment and the positive sample corresponding to the second sample speech segment may also be referred to as a negative sample.
In this embodiment, the first sample audio fingerprint is a sample audio fingerprint corresponding to the first sample voice fragment, that is, the first sample audio fingerprint is a sample audio fingerprint corresponding to the sample voice fragment, or the first sample audio fingerprint may also be a sample audio fingerprint corresponding to the first positive sample, that is, the first sample audio fingerprint is a sample audio fingerprint corresponding to the positive sample.
In order to further improve the effect of audio matching of the trained audio fingerprint model, according to a first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprint, a first inner product value corresponding to the first sample audio fingerprint is determined, and according to the first sample audio fingerprint, the second sample audio fingerprint and a third sample audio fingerprint, a second inner product value corresponding to the first sample audio fingerprint is determined, which includes:
determining a first inner product value and a third inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprints;
and determining a second inner product value and a fourth inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint and the third sample audio fingerprint.
According to the first inner product value and the second inner product value, determining a contrast loss value of the audio fingerprint model to be trained, comprising:
and determining a contrast loss value of the audio fingerprint model to be trained according to the first inner product value, the second inner product value, the third inner product value and the fourth inner product value.
When the first sample audio fingerprint is the sample audio fingerprint corresponding to the first sample voice fragment and the second sample audio fingerprint is the sample audio fingerprint of the first positive sample corresponding to that fragment, the inner product of the first sample audio fingerprint and the second sample audio fingerprint may be taken to obtain the first inner product value of the first sample audio fingerprint, and the inner product of the second sample audio fingerprint and the first sample audio fingerprint may be taken to obtain the third inner product value of the first sample audio fingerprint. The results of the inner products of the first sample audio fingerprint with the third sample audio fingerprints are then added to the first inner product value to obtain the second inner product value corresponding to the first sample audio fingerprint, and the results of the inner products of the second sample audio fingerprint with the third sample audio fingerprints are added to the third inner product value to obtain the fourth inner product value corresponding to the first sample audio fingerprint.

When the first sample audio fingerprint is the sample audio fingerprint corresponding to the first positive sample and the second sample audio fingerprint is the sample audio fingerprint of the first sample voice fragment corresponding to that positive sample, the inner product of the first sample audio fingerprint and the second sample audio fingerprint may be taken to obtain the third inner product value of the first sample audio fingerprint, and the inner product of the second sample audio fingerprint and the first sample audio fingerprint may be taken to obtain the first inner product value of the first sample audio fingerprint. The results of the inner products of the first sample audio fingerprint with the third sample audio fingerprints are added to the third inner product value to obtain the fourth inner product value corresponding to the first sample audio fingerprint, and the results of the inner products of the second sample audio fingerprint with the third sample audio fingerprints are added to the first inner product value to obtain the second inner product value corresponding to the first sample audio fingerprint.

That is, the inner product of the sample audio fingerprint of the first sample voice fragment and the sample audio fingerprint of the first positive sample gives the first inner product value, and the inner product of the sample audio fingerprint of the first positive sample and the sample audio fingerprint of the first sample voice fragment gives the third inner product value. The results of the inner products of the sample audio fingerprint of the first sample voice fragment with the third sample audio fingerprints are added to the first inner product value to obtain the second inner product value, and the results of the inner products of the sample audio fingerprint of the first positive sample with the third sample audio fingerprints are added to the third inner product value to obtain the fourth inner product value.
In this embodiment, the calculation manner of the first inner product value, the second inner product value, the third inner product value, and the fourth inner product value may specifically refer to the calculation process of the first inner product value and the second inner product value, and this embodiment is not described herein again.
Then, the first comparison loss value of the first sample voice segment can be determined according to the first inner product value and the second inner product value, the first sample comparison loss value of the first positive sample can be determined according to the third inner product value and the fourth inner product value, and finally the comparison loss value can be determined according to the first comparison loss value and the first sample comparison loss value.
At this time, the contrast loss value can be expressed by the following equation:
L_c = (1/N) Σ_{k=1}^{N/2} [ l(2k−1, 2k) + l(2k, 2k−1) ]

where i = 2k−1, j = 2k, and l(j, i) represents the first sample contrast loss value.
The way of calculating the first sample contrast loss value of the first positive sample corresponding to the first sample voice segment may specifically refer to a process of calculating the first contrast loss value of the first sample audio fingerprint, which is not described herein again in this embodiment.
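Building on the pair_contrast_loss sketch above, the symmetric batch loss may be sketched like this (0-based indexing, so the pair (2k−1, 2k) becomes rows 2k and 2k+1):

```python
def batch_contrast_loss(fingerprints, tau=0.05):
    """Contrast loss of the batch, accumulating both directions of each pair."""
    n = fingerprints.shape[0]
    total = 0.0
    for k in range(n // 2):
        i, j = 2 * k, 2 * k + 1                               # one (segment, positive) pair
        total += pair_contrast_loss(fingerprints, i, j, tau)  # l(i, j)
        total += pair_contrast_loss(fingerprints, j, i, tau)  # l(j, i), first sample contrast loss
    return total / n
```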
It should be understood that the first sample voice segment and the second sample voice segment in this embodiment are relative concepts, and the first sample audio fingerprint and the third sample audio fingerprint are also relative concepts. For example, suppose the training sample set includes sample voice fragment 1, positive sample 2 corresponding to sample voice fragment 1, sample voice fragment 3, and positive sample 4 corresponding to sample voice fragment 3, and the sample audio fingerprints include sample audio fingerprint a corresponding to sample voice fragment 1, sample audio fingerprint b corresponding to positive sample 2, sample audio fingerprint c corresponding to sample voice fragment 3, and sample audio fingerprint d corresponding to positive sample 4.

When the first sample audio fingerprint is sample audio fingerprint a, the first sample voice fragment is sample voice fragment 1, the second sample audio fingerprint is sample audio fingerprint b, the second sample voice fragment is sample voice fragment 3, and the third sample audio fingerprints are sample audio fingerprint c and sample audio fingerprint d.

When the first sample audio fingerprint is sample audio fingerprint b, the first sample voice fragment is sample voice fragment 1, the second sample audio fingerprint is sample audio fingerprint a, the second sample voice fragment is sample voice fragment 3, and the third sample audio fingerprints are sample audio fingerprint c and sample audio fingerprint d.

When the first sample audio fingerprint is sample audio fingerprint c, the first sample voice fragment is sample voice fragment 3, the second sample audio fingerprint is sample audio fingerprint d, the second sample voice fragment is sample voice fragment 1, and the third sample audio fingerprints are sample audio fingerprint a and sample audio fingerprint b.

When the first sample audio fingerprint is sample audio fingerprint d, the first sample voice fragment is sample voice fragment 3, the second sample audio fingerprint is sample audio fingerprint c, the second sample voice fragment is sample voice fragment 1, and the third sample audio fingerprints are sample audio fingerprint a and sample audio fingerprint b.
As can be seen from the above, in the embodiment of the present application, an audio to be matched and an audio set are obtained first, where the audio set includes audio and the numerical matrices corresponding to the audio fingerprints of the audio. Fingerprint feature extraction is then performed on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, where the target audio fingerprint includes a plurality of fingerprint elements. Each fingerprint element in the target audio fingerprint is then mapped to a target value within a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, where the difference between the target value and an endpoint value of the preset value interval is within a preset range. Finally, the audio matching the audio to be matched is searched out from the audio set according to the target numerical matrix and the numerical matrices corresponding to the audio fingerprints.
In other words, because each fingerprint element in the target audio fingerprint is mapped to a target value within the preset value interval, and the difference between the target value and an endpoint value of that interval falls within the preset range, the audio matching the audio to be matched can be searched out directly from the target values in the target value matrix and the values in the numerical matrices. This reduces the time taken to find the audio matching the audio to be matched and increases the search speed.
The method described in the above embodiments is further illustrated in detail by way of example.
Referring to fig. 3, fig. 3 is a schematic flow chart of a model training method according to an embodiment of the present application. The model training method process can comprise the following steps:
s301, the terminal obtains each section of target audio, and performs windowing and framing processing on the target audio to obtain each sample voice segment corresponding to each section of target audio.
S302, the terminal performs time domain data augmentation processing on the sample voice fragment to obtain a positive sample corresponding to the sample voice fragment, and determines a training sample set according to the positive sample corresponding to the sample voice fragment and the sample voice fragment.
S303, the terminal performs short-time Fourier transform on the training samples in the training sample set to obtain time-frequency characteristic vectors corresponding to the training samples.
S304, the terminal performs frequency domain data amplification processing on the time-frequency characteristic vector to obtain a sample time-frequency characteristic vector corresponding to the training sample.
For example, as shown in fig. 4, after the target audio is acquired, windowing and framing processing is performed on the target audio to obtain sample voice segments, and sample scrambling is performed on the sample voice segments, so that training samples in the same batch include both sample voice segments of the same target audio and sample voice segments of different target audios.
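Steps S301-S302 can be sketched as follows; the Hann window, the additive-noise augmentation, and all sizes are illustrative assumptions (the patent does not fix the window type or the exact time-domain perturbation here):

```python
import numpy as np

def frame_audio(audio, frame_len, hop_len):
    """Windowing and framing one target audio into sample voice segments."""
    segments = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        segments.append(audio[start:start + frame_len] * np.hanning(frame_len))
    return segments

def time_domain_augment(segment, rng):
    """Time-domain augmentation producing a positive sample (noise is one example)."""
    return segment + rng.normal(0.0, 0.01, size=segment.shape)

def build_training_set(target_audios, frame_len, hop_len, rng):
    """Pair every segment with its positive sample, then shuffle across audios."""
    samples = []
    for audio in target_audios:
        for seg in frame_audio(audio, frame_len, hop_len):
            samples.append((seg, time_domain_augment(seg, rng)))
    rng.shuffle(samples)   # sample scrambling, so one batch mixes different target audios
    return samples
```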
S305, the terminal extracts time characteristics of the sample time-frequency characteristic vectors through the spatially separable convolution layers in the audio fingerprint model to be trained to obtain time characteristic vectors corresponding to the training samples, and extracts spatial characteristics of the time characteristic vectors to obtain initial sample audio fingerprints corresponding to the training samples.
The spatially separable convolutional layers consist of stacked temporal and spatial convolutions. For example, the spatially separable convolutional layers in the audio fingerprint model to be trained may be as shown in fig. 4 (Groupnorm in fig. 4 represents group normalization). In this case, the spatially separable convolutional layers may perform temporal feature extraction and spatial feature extraction multiple times; that is, after spatial feature extraction is performed on the time feature vectors to obtain spatial feature vectors, temporal feature extraction and spatial feature extraction are performed on the spatial feature vectors again.
It should be understood that, when performing the temporal feature extraction and the spatial feature extraction, the elements in the matrix may also be mapped to values in the preset value interval, thereby further improving the accuracy of the trained audio fingerprint model. That is, the elements in the time feature vector are mapped to values in the preset value interval to obtain a time value matrix; spatial feature extraction is performed on the time value matrix to obtain the initial sample audio fingerprint; the elements in the initial sample audio fingerprint are then mapped to values in the preset value interval to obtain a spatial value matrix; and the spatial value matrix is finally input into the full connection layer.
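One stage of such a layer might be sketched in PyTorch as below; the kernel sizes, channel counts, and group count are illustrative assumptions, and tanh is used for the mapping into the preset value interval in line with the quantization loss above (the patent fixes only the temporal-then-spatial structure and the use of group normalization):

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Temporal convolution followed by a spatial (frequency) convolution,
    each with group normalization and a tanh mapping into (-1, 1).
    out_ch must be divisible by groups."""

    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        # 1x3 kernel slides along the time axis only (temporal feature extraction)
        self.temporal = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))
        self.norm_t = nn.GroupNorm(groups, out_ch)
        # 3x1 kernel slides along the frequency axis only (spatial feature extraction)
        self.spatial = nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))
        self.norm_s = nn.GroupNorm(groups, out_ch)

    def forward(self, x):                                # x: (batch, channels, freq, time)
        t = torch.tanh(self.norm_t(self.temporal(x)))    # time value matrix
        return torch.tanh(self.norm_s(self.spatial(t)))  # spatial value matrix
```

Stacking several such blocks gives the repeated temporal and spatial feature extraction described above.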
S306, the terminal performs dimensionality reduction on the initial sample audio fingerprint through the full connection layer in the audio fingerprint model to be trained to obtain the sample audio fingerprint.
The initial sample audio fingerprints are subjected to dimensionality reduction processing, so that the calculated amount of an audio fingerprint model to be trained is reduced, and the independence among the sample audio fingerprints is kept.
Optionally, the initial sample audio fingerprint may first be split by the Split function in the full connection layer to obtain splitting results, and the splitting results are then input into the neurons of the full connection layer.
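A sketch of such a split-then-project full connection layer, with the chunk count and output size as illustrative assumptions (in_dim must be divisible by chunks):

```python
import torch
import torch.nn as nn

class SplitFC(nn.Module):
    """Split the initial sample audio fingerprint into chunks and project each
    chunk with its own small linear layer, which reduces the computation and
    keeps the output dimensions independent of one another."""

    def __init__(self, in_dim=1024, chunks=32, out_per_chunk=4):
        super().__init__()
        self.chunks = chunks
        self.heads = nn.ModuleList(
            nn.Linear(in_dim // chunks, out_per_chunk) for _ in range(chunks)
        )

    def forward(self, x):                                         # x: (batch, in_dim)
        parts = torch.split(x, x.shape[1] // self.chunks, dim=1)  # the Split step
        return torch.cat([h(p) for h, p in zip(self.heads, parts)], dim=1)
```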
S307, the terminal screens out first sample audio fingerprints from the sample audio fingerprints, determines first inner product values and third inner product values corresponding to the first sample audio fingerprints according to the first sample audio fingerprints and second sample audio fingerprints corresponding to the first sample audio fingerprints in the sample audio fingerprints, and when the first sample audio fingerprints are the sample audio fingerprints corresponding to the first sample voice fragments, the second sample audio fingerprints are sample audio fingerprints of first positive samples corresponding to the first sample voice fragments, or when the first sample audio fingerprints are the sample audio fingerprints corresponding to the first positive samples, the second sample audio fingerprints are sample audio fingerprints of the first sample voice fragments corresponding to the first positive samples.
S308, the terminal determines a second inner product value and a fourth inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint and a third sample audio fingerprint, wherein the third sample audio fingerprint is a sample audio fingerprint except the first sample audio fingerprint and the second sample audio fingerprint in the sample audio fingerprint.
S309, the terminal determines a contrast loss value of the audio fingerprint model to be trained according to the first inner product value, the second inner product value, the third inner product value and the fourth inner product value.
S3010, the terminal maps elements in the sample audio fingerprints to sample numerical values in a preset numerical value interval to obtain a sample numerical value matrix corresponding to the sample audio fingerprints, and the quantization loss value of the audio fingerprint model to be trained is determined according to the sample numerical value matrix and end point values corresponding to the preset numerical value interval.
And S3011, the terminal obtains a target loss value of the audio fingerprint model to be trained according to the comparison loss value and the quantization loss value.
And S3012, if the target loss value meets the preset condition, the terminal takes the audio fingerprint model to be trained as the trained audio fingerprint model, and if the target loss value does not meet the preset condition, the terminal updates the network parameters of the audio fingerprint model to be trained according to the target loss value and returns to execute the step S303.
In this embodiment, when the audio fingerprint model to be trained is trained, time domain data amplification processing and frequency domain data amplification processing are performed on the sample voice segment at the same time, so that the coverage rate of the trained audio fingerprint model is improved, and the robustness of the trained audio fingerprint model in a complex application scene is improved.
Performing short-time Fourier transform on the training samples yields the time-frequency feature vectors corresponding to the training samples, which are used in place of spectral features. This improves the training speed of the audio fingerprint model to be trained, reduces the loss of the training samples, and improves how well the trained audio fingerprint model matches the audio to be matched.
Temporal features and spatial features are extracted from the training samples to obtain the sample audio fingerprints, and the sample audio fingerprints are converted into sample values within the preset value interval so that the sample values converge at the endpoint values of that interval. This improves the accuracy of the audio found to match the audio to be matched, so that when audio is matched through the trained audio fingerprint model, the matching speed is increased while the accuracy is guaranteed.
The specific implementation manner and the corresponding beneficial effects in this embodiment may refer to the above embodiment of the audio matching method, and this embodiment is not described herein again.
Referring to fig. 5, fig. 5 is a schematic flow chart of a model application method according to an embodiment of the present disclosure. The model application method process can comprise the following steps:
S501, the terminal obtains an audio to be matched and an audio set, where the audio set includes audio and the binary matrices corresponding to the audio fingerprints of the audio.
S502, the terminal performs short-time Fourier transform on the audio to be matched to obtain a target time-frequency characteristic vector of the audio to be matched.
S503, the terminal extracts time features of the target time-frequency feature vectors through the space separable convolution layers in the trained audio fingerprint model to obtain target time feature vectors corresponding to the audio to be matched, and extracts space features of the target time feature vectors to obtain the audio fingerprints before dimensionality reduction corresponding to the audio to be matched.
It should be understood that when performing the temporal feature extraction and the spatial feature extraction, the elements in the matrix may also be mapped to values in a preset value interval, thereby further improving the accuracy of the audio matching. That is, the spatial feature extraction is performed on the target time feature vector to obtain the audio fingerprint before dimensionality reduction corresponding to the audio to be matched, and the method includes the following steps:
mapping elements in the target time characteristic vector to numerical values of a preset numerical value interval to obtain a target time numerical value matrix;
performing spatial feature extraction on the target time numerical matrix to obtain a spatial sample audio fingerprint;
and mapping elements in the spatial sample audio fingerprint to values in the preset value interval to obtain the audio fingerprint before dimensionality reduction corresponding to the audio to be matched.
S504, the terminal conducts dimension reduction processing on the audio fingerprints before dimension reduction through the full connection layer in the trained audio fingerprint model, and a target audio fingerprint is obtained.
And S505, the terminal maps each fingerprint element in the target audio fingerprint into a target value in a preset value interval through the trained audio fingerprint model to obtain a target value matrix corresponding to the target audio fingerprint, and the difference value between the target value and the endpoint value of the preset value interval is in a preset range.
S506, the terminal performs binary mapping processing on the target numerical matrix to obtain a target binary matrix corresponding to the target audio fingerprint.
And S507, the terminal searches the audio matched with the audio to be matched from the audio set according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
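Steps S506-S507 can be sketched as follows; binarization by sign and comparison by Hamming distance are assumed concrete choices (the patent specifies the binary mapping but not the exact comparison used in the search):

```python
import numpy as np

def binarize(value_matrix):
    """Binary mapping: values near the two endpoints of (-1, 1) become 0/1 bits."""
    return (value_matrix > 0).astype(np.uint8)

def search_matching_audio(target_bits, audio_set_bits):
    """Return the index of the audio whose binary matrix differs from the
    target binary matrix in the fewest bits (Hamming distance via XOR)."""
    distances = [np.bitwise_xor(target_bits, bits).sum() for bits in audio_set_bits]
    return int(np.argmin(distances))
```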
In this embodiment, since the trained audio fingerprint model can perform temporal feature extraction and spatial feature extraction, the accuracy of audio matching of the trained audio fingerprint model in the embodiment of the present application is higher compared to the audio fingerprint extraction method in the related art.
Because each fingerprint element in the target audio fingerprint can be mapped to a target value within the preset value interval to obtain the target value matrix corresponding to the target audio fingerprint, with the difference between the target value and an endpoint value of the interval falling within the preset range, and because the target value matrix is then binary-mapped into the target binary matrix, the trained audio fingerprint model of this embodiment matches audio faster than the audio fingerprint extraction methods in the related art.
The following describes the effect of the trained audio fingerprint model in the embodiment of the present application.
Ten thousand songs are randomly selected from the FMA open-source music dataset as a training set; two thousand of them are extracted, various disturbances are randomly added, and the disturbed songs serve as a test set. Accuracy and the real-time factor (RTF) are then used as evaluation indexes: a higher accuracy indicates a better effect, and a smaller RTF indicates a faster matching speed. The results of testing the trained audio fingerprint model and the audio fingerprint extraction and matching methods in the related art (for example, the band energy method, the Landmark method, the Nowplaying model, and the Sequence-to-Sequence Autoencoder Model for Audio Fingerprinting (SAMAF)) on the test set are shown in the following table:
Method | Accuracy | Real-time factor (RTF)
Band energy | 50.25% | 0.03
Landmark | 53.90% | 0.03
Nowplaying | 65.70% | 0.15
SAMAF | 70.30% | 0.15
Trained audio fingerprint model | 80.10% | 0.05
As the table shows, compared with the spectrum-based audio fingerprint extraction and matching methods (band energy and Landmark), the audio matching methods based on neural network models (Nowplaying, SAMAF and the trained audio fingerprint model) achieve markedly higher accuracy, indicating that the neural-network-based approach is the stronger of the two.

Among the neural-network-based methods, compared with Nowplaying and SAMAF, the trained audio fingerprint model of this embodiment improves accuracy by about 10 percentage points and matches audio roughly three times as fast (RTF 0.05 versus 0.15). The trained audio fingerprint model of this embodiment therefore achieves both a better effect and a higher speed.
The specific implementation manner and the corresponding beneficial effects in this embodiment may refer to the above embodiment of the audio matching method, and this embodiment is not described herein again.
In order to better implement the audio matching method provided by the embodiment of the present application, an embodiment of the present application further provides a device based on the audio matching method. Wherein the meaning of the noun is the same as that in the audio matching method, and the specific implementation details can refer to the description in the method embodiment.
For example, as shown in fig. 6, the audio matching means may include:
the obtaining module 601 is configured to obtain an audio and an audio set to be matched, where the audio set includes an audio and a numerical matrix corresponding to an audio fingerprint of the audio.
The extracting module 602 is configured to perform fingerprint feature extraction on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, where the target audio fingerprint includes a plurality of fingerprint elements.
The mapping module 603 is configured to map each fingerprint element in the target audio fingerprint to a target value within a preset value interval, to obtain a target value matrix corresponding to the target audio fingerprint, where a difference between the target value and an endpoint value of the preset value interval is within a preset range.
The searching module 604 is configured to search for an audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
Optionally, the numerical matrix corresponding to the audio fingerprint of the audio includes a binary matrix corresponding to the audio fingerprint of the audio.
Accordingly, the lookup module 604 is specifically configured to perform:
performing binary mapping processing on the target numerical matrix to obtain a target binary matrix corresponding to the target audio fingerprint;
and searching the audio matched with the audio to be matched from the audio set according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
Optionally, the extracting module 602 is specifically configured to perform:
and performing fingerprint feature extraction on the audio to be matched through the trained audio fingerprint model to obtain a target audio fingerprint corresponding to the audio to be matched.
The mapping module 603 is specifically configured to perform:
and mapping each fingerprint element in the target audio fingerprint to a target numerical value in a preset numerical value interval through the trained audio fingerprint model.
Optionally, the audio matching apparatus further comprises:
a training module to perform:
acquiring a training sample set of an audio fingerprint model to be trained, and performing fingerprint feature extraction on training samples in the training sample set to obtain sample audio fingerprints corresponding to the training samples;
mapping each element in the sample audio fingerprint into a sample numerical value in a preset numerical value interval to obtain a sample numerical value matrix corresponding to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and the corresponding endpoint value of the preset numerical value interval;
and training the audio fingerprint model to be trained according to the target loss value so as to make the audio fingerprint model to be trained converge on the endpoint value, thereby obtaining the trained audio fingerprint model.
Optionally, the training module is specifically configured to perform:
carrying out short-time Fourier transform on training samples in the training sample set to obtain time-frequency characteristic vectors corresponding to the training samples;
and fingerprint feature extraction is carried out on the time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
Optionally, the fingerprint feature extraction includes temporal feature extraction and spatial feature extraction.
Accordingly, the training module is specifically configured to perform:
extracting time characteristics of the time-frequency characteristic vector to obtain a time characteristic vector corresponding to the training sample;
and extracting the spatial features of the time feature vector to obtain a sample audio fingerprint corresponding to the training sample.
Optionally, the training module is specifically configured to perform:
acquiring each section of target audio;
carrying out windowing and framing processing on the target audio to obtain each sample voice segment corresponding to the target audio;
and determining a training sample set according to the sample voice fragment.
Optionally, the training module is specifically configured to perform:
performing time domain data augmentation processing on the sample voice fragment to obtain a positive sample corresponding to the sample voice fragment;
and determining a training sample set according to the positive sample corresponding to the sample voice fragment and the sample voice fragment.
Optionally, the training module is specifically configured to perform:
performing frequency domain data amplification processing on the time-frequency characteristic vector to obtain a sample time-frequency characteristic vector corresponding to the training sample;
and fingerprint feature extraction is carried out on the sample time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
Optionally, the training module is specifically configured to perform:
determining a contrast loss value of an audio fingerprint model to be trained according to the sample audio fingerprint;
determining a quantization loss value of the audio fingerprint model to be trained according to the sample numerical matrix and the endpoint value corresponding to the preset numerical interval;
and according to the contrast loss value and the quantization loss value, obtaining a target loss value of the audio fingerprint model to be trained.
Optionally, the training module is specifically configured to perform:
screening a first sample audio fingerprint from the sample audio fingerprints;
determining a first inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprints, wherein when the first sample audio fingerprint is the sample audio fingerprint corresponding to the first sample voice fragment, the second sample audio fingerprint is the sample audio fingerprint of a first positive sample corresponding to the first sample voice fragment, or when the first sample audio fingerprint is the sample audio fingerprint corresponding to the first positive sample, the second sample audio fingerprint is the sample audio fingerprint of the first sample voice fragment corresponding to the first positive sample;
determining a second inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint and a third sample audio fingerprint, wherein the third sample audio fingerprint is a sample audio fingerprint except the first sample audio fingerprint and the second sample audio fingerprint in the sample audio fingerprints;
and determining a contrast loss value of the audio fingerprint model to be trained according to the first inner product value and the second inner product value.
In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily, and implemented as the same or several entities, and the specific implementation manner and the corresponding beneficial effects of the above modules may refer to the foregoing method embodiments, which are not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device may be a server or a terminal, and as shown in fig. 7, a schematic structural diagram of the electronic device according to the embodiment of the present application is shown, specifically:
the electronic device may include components such as a processor 701 of one or more processing cores, memory 702 of one or more computer-readable storage media, a power supply 703, and an input unit 704. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 7 does not constitute a limitation of the electronic device and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 701 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by operating or executing computer programs and/or modules stored in the memory 702 and calling data stored in the memory 702. Optionally, processor 701 may include one or more processing cores; preferably, the processor 701 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 701.
The memory 702 may be used to store computer programs and modules, and the processor 701 may execute various functional applications and data processing by operating the computer programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, a computer program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 701 with access to the memory 702.
The electronic device further includes a power source 703 for supplying power to each component, and preferably, the power source 703 may be logically connected to the processor 701 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 703 may also include any component including one or more of a dc or ac power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input unit 704, and the input unit 704 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 701 in the electronic device loads an executable file corresponding to one or more processes of the computer program into the memory 702 according to the following instructions, and the processor 701 executes the computer program stored in the memory 702, so as to implement various functions, such as:
acquiring an audio set and an audio set to be matched, wherein the audio set comprises audio and a numerical matrix corresponding to audio fingerprints of the audio;
performing fingerprint feature extraction on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, wherein the target audio fingerprint comprises a plurality of fingerprint elements;
mapping each fingerprint element in the target audio fingerprint to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein the difference value between the target value and an endpoint value of the preset value interval is in a preset range;
and searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the audio matching methods provided in the embodiments of the present application. For example, the computer program may perform the steps of:
acquiring an audio set and an audio set to be matched, wherein the audio set comprises audio and a numerical matrix corresponding to audio fingerprints of the audio;
fingerprint feature extraction is carried out on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, wherein the target audio fingerprint comprises a plurality of fingerprint elements;
mapping each fingerprint element in the target audio fingerprint to a target value in a preset value interval to obtain a target value matrix corresponding to the target audio fingerprint, wherein the difference value between the target value and the end value of the preset value interval is in a preset range;
and searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
The above detailed implementation of each operation and the corresponding beneficial effects can refer to the foregoing embodiments, and are not described herein again.
The computer-readable storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the computer-readable storage medium can execute the steps in any audio matching method provided in the embodiments of the present application, beneficial effects that can be achieved by any audio matching method provided in the embodiments of the present application can be achieved, and detailed descriptions are omitted here for the foregoing embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the audio matching method.
The foregoing describes in detail an audio matching method, apparatus, electronic device, and computer-readable storage medium provided in the embodiments of the present application, and specific examples are applied herein to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (15)

1. An audio matching method, comprising:
acquiring an audio set and an audio set to be matched, wherein the audio set comprises audio and a numerical matrix corresponding to the audio fingerprint of the audio;
performing fingerprint feature extraction on the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, wherein the target audio fingerprint comprises a plurality of fingerprint elements;
mapping each fingerprint element in the target audio fingerprint to a target numerical value in a preset numerical value interval to obtain a target numerical value matrix corresponding to the target audio fingerprint, wherein the difference value between the target numerical value and the end point value of the preset numerical value interval is in a preset range;
and searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
2. The audio processing method of claim 1, wherein the numerical matrix corresponding to the audio fingerprint of the audio comprises a binary matrix corresponding to the audio fingerprint of the audio;
the searching for the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint comprises:
performing binary mapping processing on the target numerical matrix to obtain a target binary matrix corresponding to the target audio fingerprint;
and searching the audio matched with the audio to be matched from the audio set according to the target binary matrix and the binary matrix corresponding to the audio fingerprint.
3. The audio processing method according to claim 1, wherein the audio to be matched is subjected to fingerprint feature extraction to obtain a target audio fingerprint corresponding to the audio to be matched; mapping each fingerprint element in the target audio fingerprint to a target value within a preset value interval, including:
performing fingerprint feature extraction on the audio to be matched through a trained audio fingerprint model to obtain a target audio fingerprint corresponding to the audio to be matched;
and mapping each fingerprint element in the target audio fingerprint into a target numerical value in a preset numerical value interval through the trained audio fingerprint model.
4. The audio processing method according to claim 3, wherein before performing fingerprint feature extraction on the audio to be matched through the trained audio fingerprint model to obtain a target audio fingerprint corresponding to the audio to be matched, the method further comprises:
acquiring a training sample set of an audio fingerprint model to be trained, and performing fingerprint feature extraction on training samples in the training sample set to obtain sample audio fingerprints corresponding to the training samples;
mapping each element in the sample audio fingerprint to a sample numerical value in the preset numerical value interval to obtain a sample numerical value matrix corresponding to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and the endpoint value corresponding to the preset numerical value interval;
and training the audio fingerprint model to be trained according to the target loss value so as to make the audio fingerprint model to be trained converge on the endpoint value, thereby obtaining the trained audio fingerprint model.
5. The model training method according to claim 4, wherein the performing fingerprint feature extraction on the training samples in the training sample set to obtain the sample audio fingerprints corresponding to the training samples comprises:
carrying out short-time Fourier transform on training samples in the training sample set to obtain time-frequency characteristic vectors corresponding to the training samples;
and extracting fingerprint features of the time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
6. The model training method of claim 5, wherein the fingerprint feature extraction comprises temporal feature extraction and spatial feature extraction;
the extracting of the fingerprint characteristics of the time-frequency characteristic vector to obtain the sample audio fingerprint corresponding to the training sample comprises:
extracting time features of the time-frequency feature vectors to obtain time feature vectors corresponding to the training samples;
and extracting spatial features of the temporal feature vector to obtain a sample audio fingerprint corresponding to the training sample.
7. The model training method of claim 5, wherein the obtaining a training sample set comprises:
acquiring each section of target audio;
carrying out windowing and framing processing on the target audio to obtain each sample voice fragment corresponding to the target audio;
and determining a training sample set according to the sample voice fragment.
8. The model training method of claim 7, wherein determining a training sample set from the sample speech segments comprises:
performing time domain data augmentation processing on the sample voice fragment to obtain a positive sample corresponding to the sample voice fragment;
and determining a training sample set according to the positive sample corresponding to the sample voice fragment and the sample voice fragment.
9. The model training method of claim 8, wherein the performing of the fingerprint feature extraction on the time-frequency feature vector to obtain the audio fingerprint corresponding to the training sample comprises:
performing frequency domain data amplification processing on the time-frequency characteristic vector to obtain a sample time-frequency characteristic vector corresponding to the training sample;
and fingerprint feature extraction is carried out on the sample time-frequency feature vector to obtain a sample audio fingerprint corresponding to the training sample.
10. The model training method according to claim 9, wherein before determining the target loss value of the audio fingerprint model to be trained according to the endpoint value corresponding to the sample value matrix and the preset value interval, the method further comprises:
determining a contrast loss value of the audio fingerprint model to be trained according to the sample audio fingerprint;
determining a target loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and the endpoint value corresponding to the preset numerical value interval, wherein the determining comprises the following steps:
determining a quantization loss value of the audio fingerprint model to be trained according to the sample numerical value matrix and an endpoint value corresponding to the preset numerical value interval;
and according to the contrast loss value and the quantization loss value, obtaining a target loss value of the audio fingerprint model to be trained.
11. The model training method according to claim 10, wherein the determining the contrast loss value of the audio fingerprint model to be trained according to the sample audio fingerprint comprises:
screening out a first sample audio fingerprint from the sample audio fingerprints;
determining a first inner product value corresponding to a first sample audio fingerprint according to the first sample audio fingerprint and a second sample audio fingerprint corresponding to the first sample audio fingerprint in the sample audio fingerprints, wherein when the first sample audio fingerprint is a sample audio fingerprint corresponding to a first sample voice fragment, the second sample audio fingerprint is a sample audio fingerprint of a first positive sample corresponding to the first sample voice fragment, or when the first sample audio fingerprint is a sample audio fingerprint corresponding to the first positive sample, the second sample audio fingerprint is a sample audio fingerprint of the first sample voice fragment corresponding to the first positive sample;
determining a second inner product value corresponding to the first sample audio fingerprint according to the first sample audio fingerprint, the second sample audio fingerprint and a third sample audio fingerprint, wherein the third sample audio fingerprint is a sample audio fingerprint except the first sample audio fingerprint and the second sample audio fingerprint in the sample audio fingerprints;
and determining a contrast loss value of the audio fingerprint model to be trained according to the first inner product value and the second inner product value.
12. An audio matching apparatus, comprising:
the system comprises an acquisition module, a matching module and a matching module, wherein the acquisition module is used for acquiring an audio frequency to be matched and an audio frequency set, and the audio frequency set comprises an audio frequency and a numerical matrix corresponding to an audio frequency fingerprint of the audio frequency;
the extraction module is used for extracting fingerprint features of the audio to be matched to obtain a target audio fingerprint corresponding to the audio to be matched, and the target audio fingerprint comprises a plurality of fingerprint elements;
the mapping module is used for mapping each fingerprint element in the target audio fingerprint to a target numerical value in a preset numerical value interval to obtain a target numerical value matrix corresponding to the target audio fingerprint, wherein the difference value between the target numerical value and the end point value of the preset numerical value interval is in a preset range;
and the searching module is used for searching the audio matched with the audio to be matched from the audio set according to the target numerical matrix and the numerical matrix corresponding to the audio fingerprint.
13. An electronic device, comprising a processor and a memory, the memory storing a computer program, the processor being configured to execute the computer program in the memory to perform the audio matching method of any of claims 1-11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded by a processor for performing the audio matching method of any of claims 1-11.
15. A computer program product, characterized in that it stores a computer program adapted to be loaded by a processor for performing the audio matching method of any of claims 1-11.
CN202210881716.1A 2022-07-26 2022-07-26 Audio matching method and device, electronic equipment and computer-readable storage medium Pending CN115221351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210881716.1A CN115221351A (en) 2022-07-26 2022-07-26 Audio matching method and device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210881716.1A CN115221351A (en) 2022-07-26 2022-07-26 Audio matching method and device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN115221351A true CN115221351A (en) 2022-10-21

Family

ID=83614003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210881716.1A Pending CN115221351A (en) 2022-07-26 2022-07-26 Audio matching method and device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115221351A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758936A (en) * 2023-08-18 2023-09-15 腾讯科技(深圳)有限公司 Processing method and device of audio fingerprint feature extraction model and computer equipment
CN116758936B (en) * 2023-08-18 2023-11-07 腾讯科技(深圳)有限公司 Processing method and device of audio fingerprint feature extraction model and computer equipment

Similar Documents

Publication Publication Date Title
CN110600018B (en) Voice recognition method and device and neural network training method and device
CN105976812B (en) A kind of audio recognition method and its equipment
CN109817246A (en) Training method, emotion identification method, device, equipment and the storage medium of emotion recognition model
Pawar et al. Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
CN110415686A (en) Method of speech processing, device, medium, electronic equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN112071330B (en) Audio data processing method and device and computer readable storage medium
Bandela et al. Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition
CN113837299B (en) Network training method and device based on artificial intelligence and electronic equipment
CN114627863A (en) Speech recognition method and device based on artificial intelligence
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
CN111785288A (en) Voice enhancement method, device, equipment and storage medium
Shabani et al. Speech recognition using principal components analysis and neural networks
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN117059068A (en) Speech processing method, device, storage medium and computer equipment
CN108847251A (en) A kind of voice De-weight method, device, server and storage medium
CN115240696B (en) Speech recognition method and readable storage medium
CN112767950A (en) Voiceprint recognition method and device and computer readable storage medium
WO2018001125A1 (en) Method and device for audio recognition
CN113707172B (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN115132170A (en) Language classification method and device and computer readable storage medium
CN112259077B (en) Speech recognition method, device, terminal and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113593525A (en) Method, device and storage medium for training accent classification model and accent classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40074381
Country of ref document: HK