WO2023134549A1 - Encoder generation method, fingerprint extraction method, medium, and electronic device

Info

Publication number: WO2023134549A1
Application number: PCT/CN2023/070796
Authority: WO
Inventors: 于哲松; 杜行健; 刘铭瑀; 朱碧磊; 马泽君
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2022-01-14
Filing date: 2023-01-06
Publication date: 2023-07-20
Also published as: CN114443891B; CN114443891A

Abstract

The present invention relates to an encoder generation method, a fingerprint extraction method, a medium, and an electronic device. The encoder generation method comprises: acquiring a plurality of sample audios; constructing a first group of samples and a second group of samples according to the plurality of sample audios, wherein for each sample in the first group of samples, a corresponding positive sample and a corresponding negative sample exist in the second group of samples; and performing contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of an audio. The trained first encoder can effectively extract fingerprint features of audio and yields more accurate audio fingerprints, thereby improving the accuracy of audio retrieval.

Description

Encoder generation method, fingerprint extraction method, medium and electronic device

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202210045056.3, titled "encoder generation method, fingerprint extraction method, medium and electronic equipment" and filed on January 14, 2022, the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of artificial intelligence, and in particular, to an encoder generation method, a fingerprint extraction method, a medium, and an electronic device.

Background

Audio fingerprints are compact digital signatures extracted from audio content that represent the important acoustic information of a piece of audio. An audio fingerprint provides a unique representation of an audio, through which that audio can be effectively distinguished from other audio. In the related art, a long short-term memory (LSTM) autoencoder is used to generate an audio fingerprint for an audio, and the audio fingerprint is used to complete an audio retrieval task, for example, retrieving other audio related to that audio from a music library. However, for distorted audio, the audio fingerprint generated by the autoencoder cannot effectively represent the audio, which reduces the accuracy of audio retrieval, so the audio retrieval task cannot be completed effectively.

Summary of the invention

This section is provided to introduce concepts in a simplified form that are described in detail later in the Detailed Description. This part of the content is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.

In a first aspect, the present disclosure provides a method for generating an encoder, including:

Acquiring a plurality of sample audios;

Constructing a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, a corresponding positive sample and a corresponding negative sample exist in the second group of samples;

Performing contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of an audio;

Wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vector of the corresponding negative sample, and to make the encoding parameters of the second encoder gradually tend toward the encoding parameters of the first encoder.

In a second aspect, the present disclosure provides an audio fingerprint extraction method, including:

Acquiring an audio to be queried;

Processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as a fingerprint feature of the audio to be queried, wherein the audio fingerprint extractor is a first encoder trained according to the encoder generation method described in the first aspect.

In a third aspect, the present disclosure provides an encoder generation device, including:

A first acquisition module configured to acquire a plurality of sample audios;

A construction module configured to construct a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, a corresponding positive sample and a corresponding negative sample exist in the second group of samples;

A training module configured to perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of an audio.

In a fourth aspect, the present disclosure provides an audio fingerprint extraction device, including:

A second acquisition module configured to acquire an audio to be queried;

A processing module configured to process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as a fingerprint feature of the audio to be queried, wherein the audio fingerprint extractor is a first encoder trained according to the encoder generation method described in the first aspect.

In a fifth aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the methods described in the first aspect and the second aspect are implemented.

In a sixth aspect, the present disclosure provides an electronic device, including:

A storage device on which at least one computer program is stored;

At least one processing device configured to execute the at least one computer program in the storage device to implement the steps of the methods of the first aspect and the second aspect.

With the above technical solution, contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vector of the corresponding negative sample. In other words, the encoding vector output by the first encoder can more effectively distinguish audio that belongs to the same audio from audio that does not, and the contrastive training enables the first encoder to learn higher-level features of the audio. Consequently, the audio fingerprint output by the trained first encoder acting as the audio fingerprint extractor (that is, the encoding vector serving as the fingerprint feature of the audio) can better complete the audio retrieval task and improve the accuracy of audio retrieval.

Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.

Description of drawings

The above and other features, advantages and aspects of the various embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

In the drawings:

Fig. 1 is a flowchart showing a method for generating an encoder according to an exemplary embodiment of the present disclosure.

Fig. 2 is a flowchart showing contrastive training of a first encoder and a second encoder according to an exemplary embodiment of the present disclosure.

Fig. 3 is a flowchart showing an audio fingerprint extraction method according to an exemplary embodiment of the present disclosure.

Fig. 4 is a block diagram showing an apparatus for generating an encoder according to an exemplary embodiment of the present disclosure.

Fig. 5 is a block diagram of an audio fingerprint extraction device according to an exemplary embodiment of the present disclosure.

Fig. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.

It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders, and/or executed in parallel. Additionally, method embodiments may include additional steps and/or omit performing illustrated steps. The scope of the present disclosure is not limited in this respect.

As used herein, the term "comprise" and its variations are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments." Relevant definitions of other terms are given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish devices, modules or units. They are not used to require that these devices, modules or units be different devices, modules or units, nor to limit the sequence or interdependence of the functions performed by these devices, modules or units.

It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative and not restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are used for illustrative purposes only, and are not used to limit the scope of these messages or information.

As mentioned in the background, an audio fingerprint is a compact digital signature extracted from audio content that represents the important acoustic information of a piece of audio. An audio fingerprint provides a unique representation of an audio, through which that audio can be effectively distinguished from other audio. In some embodiments, audio fingerprints can be applied to various scenarios. For example, audio fingerprints can be used for audio deduplication, that is, eliminating repeated audio from a set of audio; as another example, audio fingerprints can be used for audio retrieval, such as recognizing a song by listening to it, that is, finding the original song for a piece of audio.

In the related art, an audio retrieval task may be based on a non-deep-learning algorithm: after a spectrogram is extracted from the original audio, salient points on the spectrogram are computed and hash-coded, and a large-scale audio fingerprint library is built; after fingerprint features are extracted from the query audio, multi-level filtering is performed through hash feature retrieval. Audio retrieval may also be performed with a music detection module and a music recognition module, where the music detection module detects whether music is currently present, and the music recognition module includes several convolutional layers and two-layer segmentation coding blocks, extracts features from the query audio, and retrieves by feature distance. In addition, a long short-term memory (LSTM) autoencoder can be used to generate audio fingerprints for audio retrieval.

However, none of the methods for generating audio fingerprints (or extracting features) described in the above related art can effectively retrieve distorted audio. In some embodiments, distorted audio may refer to edited audio, for example, audio whose speed and/or pitch has been edited.

In view of this, the present disclosure proposes an encoder generation method that trains the first encoder, serving as an audio fingerprint extractor, through contrastive training, so that the first encoder can learn higher-level features of audio. The encoding vector it outputs can better distinguish audio that belongs to the same audio from audio that does not, so the audio fingerprint output by the audio fingerprint extractor can better complete the audio retrieval task and improve the accuracy of audio retrieval.

Fig. 1 is a flowchart showing a method for generating an encoder according to an exemplary embodiment of the present disclosure. As shown in Figure 1, the generation method includes:

Step 110: acquire a plurality of sample audios.

In some embodiments, the sample audios may be training data for contrastive training of the first encoder and the second encoder. In some embodiments, a sample audio may be music data, such as a song or a song segment, where the length of the song segment can be set according to actual conditions; for example, the length of the song segment may be any value from 10 s to 900 s. In some embodiments, the sample audios may be music data with different vocals and different styles.

In some embodiments, the plurality of sample audios can be obtained by reading stored data, calling related interfaces, or other methods.

Step 120: construct a first group of samples and a second group of samples according to the plurality of sample audios.

In some embodiments, the first group of samples and the second group of samples may be constructed from data augmentation results of the plurality of sample audios, and the data augmentation may include adjusting audio parameters. In some embodiments, constructing the first group of samples and the second group of samples according to the plurality of sample audios includes: performing a first parameter adjustment and a second parameter adjustment on the plurality of sample audios, respectively, to obtain the first group of samples and the second group of samples.

In some embodiments, parameter adjustment may refer to adjusting an adjustment parameter of the audio according to a corresponding adjustment manner. In some embodiments, the adjustment parameters of the first parameter adjustment or the second parameter adjustment may include, but are not limited to, at least one of the following: noise, pitch, speed, filter parameters, echo, frequency band of gain or attenuation, and audio format.

In some embodiments, the noise may include white noise, and adjusting the noise may refer to adding white noise at a first preset ratio, where the first preset ratio can be set according to actual conditions; for example, the range of the first preset ratio may be (0, 0.1]. In some embodiments, adjusting the pitch may refer to raising or lowering the pitch, for example, raising or lowering the pitch within an octave. An octave includes 12 semitones; correspondingly, adjusting the pitch may refer to raising or lowering the pitch within the octave in units of semitones.
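The white-noise adjustment can be illustrated with a minimal sketch. The disclosure does not define how the preset ratio is measured, so the code assumes the ratio scales the noise relative to the peak amplitude of the signal; the function name `add_white_noise` and the choice of uniform noise are likewise assumptions made only for illustration.

```python
import numpy as np

def add_white_noise(audio, ratio, rng=None):
    """Mix white noise into `audio`; `ratio` is ASSUMED to be relative to peak amplitude."""
    rng = np.random.default_rng() if rng is None else rng
    peak = np.max(np.abs(audio))
    noise = rng.uniform(-1.0, 1.0, size=audio.shape)  # uncorrelated samples, flat spectrum
    return audio + ratio * peak * noise

# Example: a 1-second 440 Hz tone sampled at 16 kHz, with noise added at a ratio of 0.05
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
noisy = add_white_noise(tone, ratio=0.05, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the augmentation reproducible, which may be convenient when comparing two versions of the same sample audio.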

In some embodiments, adjusting the speed may refer to speeding up or slowing down the playback of the audio by a second preset speed multiple, where the second preset speed multiple can be set according to actual conditions; for example, the range of the second preset speed multiple may be [0.5, 1.5]. In some embodiments, the filter parameter may refer to the filtering frequency, and adjusting the filter parameter may refer to performing high-pass filtering and/or low-pass filtering on the audio, where the filtering frequency of the high-pass or low-pass filtering can be set according to actual needs. For example, the filtering frequency of the high-pass filtering may be 2000 Hz, and the filtering frequency of the low-pass filtering may be 300 Hz.
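The speed and filtering adjustments might be sketched as follows. This is an illustration under stated assumptions, not the method of the disclosure: the speed change is a naive linear-interpolation resampling (which also shifts the pitch), and the filter is an idealized brick-wall filter applied in the frequency domain. The names `change_speed` and `fft_filter` are hypothetical.

```python
import numpy as np

def change_speed(audio, rate):
    """Naive speed change by resampling; rate > 1 is faster, rate < 1 is slower.
    ASSUMPTION: plain resampling, so the pitch changes together with the speed."""
    n_out = int(round(len(audio) / rate))
    src_positions = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(audio)), audio)

def fft_filter(audio, sr, cutoff_hz, kind):
    """Idealized high-pass or low-pass filter that zeroes FFT bins beyond the cutoff."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    if kind == "highpass":
        spectrum[freqs < cutoff_hz] = 0.0
    elif kind == "lowpass":
        spectrum[freqs > cutoff_hz] = 0.0
    else:
        raise ValueError("kind must be 'highpass' or 'lowpass'")
    return np.fft.irfft(spectrum, n=len(audio))

sr = 16000
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 3000 * t)
fast = change_speed(mix, rate=1.5)              # roughly two thirds of the original length
high = fft_filter(mix, sr, 2000, "highpass")    # removes the 100 Hz component
```

A production pipeline would more likely use a time-stretching algorithm that preserves pitch and a filter with a gradual roll-off; the brick-wall version is used here only because it is short and easy to verify.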

In some embodiments, the frequency band of gain or attenuation can be adjusted with an equalizer, which can adjust the following parameters: frequency, gain, and Q (Quantize) value, where the frequency is a parameter representing the frequency point to be adjusted, the gain is a parameter characterizing the gain or attenuation at the set frequency value, and the Q value is a parameter characterizing the "width" of the frequency band to which the gain or attenuation applies. In some embodiments, adjusting the audio format may refer to compressing the audio into a preset format, which can be set according to actual conditions; for example, the preset format may be 32 kbps MP3. In some embodiments, data augmentation can also be implemented by adjusting other parameters of the audio, for example by adding Gaussian noise, which is not described further in this disclosure.
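To make the roles of frequency, gain and Q value concrete, the sketch below computes the coefficients of a second-order peaking equalizer using the widely circulated "audio EQ cookbook" formulas. The disclosure does not say which equalizer design is used, so this particular biquad form, and the helper names `peaking_eq_coeffs` and `gain_at`, are assumptions.

```python
import numpy as np

def peaking_eq_coeffs(sr, freq_hz, gain_db, q):
    """Biquad peaking-EQ coefficients (ASSUMED design; cookbook-style formulas)."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * freq_hz / sr
    alpha = np.sin(w0) / (2.0 * q)          # a larger Q gives a narrower band
    b = np.array([1.0 + alpha * a_lin, -2.0 * np.cos(w0), 1.0 - alpha * a_lin])
    a = np.array([1.0 + alpha / a_lin, -2.0 * np.cos(w0), 1.0 - alpha / a_lin])
    return b / a[0], a / a[0]               # normalize so that a[0] == 1

def gain_at(b, a, sr, freq_hz):
    """Magnitude of the filter's frequency response at a single frequency."""
    z_inv = np.exp(-1j * 2.0 * np.pi * freq_hz / sr)
    num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return abs(num / den)

b, a = peaking_eq_coeffs(sr=16000, freq_hz=1000.0, gain_db=6.0, q=1.0)
center_gain_db = 20.0 * np.log10(gain_at(b, a, 16000, 1000.0))  # boost at the chosen frequency
dc_gain = gain_at(b, a, 16000, 0.0)                             # far from the band: unchanged
```

The coefficients could then be applied to a waveform with any standard IIR filtering routine; a negative `gain_db` attenuates the band instead of boosting it.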

In some embodiments, the adjustment parameters and/or adjustment manners of the first parameter adjustment and the second parameter adjustment are not completely the same, where different adjustment manners may refer to different setting values of the adjustment parameters. For example, the first parameter adjustment may add white noise at a ratio of 0.1, speed up the audio playback by 0.5 times, and raise the pitch of the first two semitones of the audio. The second parameter adjustment may add white noise at a ratio of 0.2, double the audio playback speed, raise the first two semitones of the audio, lower the third semitone, and apply 2000 Hz high-pass filtering. Because the first parameter adjustment and the second parameter adjustment use different adjustment parameters and/or adjustment manners, they implement different versions of data augmentation.

In some embodiments, each sample in the first group of samples is a sample audio after the first parameter adjustment, and each sample in the second group of samples is a sample audio after the second parameter adjustment. In some embodiments, the audios obtained by applying different versions of data augmentation to the same sample audio are similar to each other; that is, when the first parameter adjustment and the second parameter adjustment are performed on the same sample audio, the resulting sample audio after the first parameter adjustment and sample audio after the second parameter adjustment are similar audios.

It can be understood that for any sample in the first group of samples, the similar audio of the sample is a positive sample of the sample, and the dissimilar audio is a negative sample of the sample. Therefore, for each sample in the first group of samples, there are corresponding positive samples and negative samples in the second group of samples. Correspondingly, for each sample in the first group of samples, the sample in the second group of samples corresponding to the same sample audio as the sample is a positive sample, and the other samples are negative samples.
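The pairing rule can be sketched in a few lines. The two adjustment functions below are hypothetical stand-ins for the first and second parameter adjustments, and `build_groups` and `positive_mask` are illustrative names rather than anything defined in the disclosure.

```python
import numpy as np

def build_groups(sample_audios, first_adjust, second_adjust):
    """Apply two different adjustments to every sample audio, keeping the order."""
    first_group = [first_adjust(x) for x in sample_audios]
    second_group = [second_adjust(x) for x in sample_audios]
    return first_group, second_group

def positive_mask(n):
    """mask[i, j] is True when second_group[j] is the positive sample of first_group[i]."""
    return np.eye(n, dtype=bool)

rng = np.random.default_rng(0)
audios = [rng.standard_normal(8000) for _ in range(4)]
first, second = build_groups(
    audios,
    first_adjust=lambda x: x + 0.1 * rng.standard_normal(x.shape),  # hypothetical adjustment
    second_adjust=lambda x: 0.5 * x[::2],                            # hypothetical adjustment
)
mask = positive_mask(len(audios))
negatives_of_0 = [j for j in range(len(second)) if not mask[0, j]]
```

Because both groups keep the order of the sample audios, the positive sample of `first[i]` is simply `second[i]`, and every other entry of `second` is a negative sample.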

In the embodiments of the present disclosure, there are multiple types of adjustment parameters. The first group of samples and the second group of samples are obtained by adjusting these various parameters, and the first encoder and the second encoder are then contrastively trained on the two groups. As a result, the first encoder can handle audio in which various parameters have been adjusted, that is, audio edited in various ways, which improves the robustness of the first encoder. The first encoder can therefore better extract the audio fingerprint of edited audio (for example, audio with adjusted pitch and/or speed), that is, of distorted audio, so that for distorted audio the audio retrieval task can be completed accurately using the audio fingerprint output by the first encoder, improving the accuracy of audio retrieval.

Step 130: perform contrastive training on the first encoder and the second encoder according to the first group of samples and the second group of samples, where the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of an audio.

In some embodiments, the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample. In some embodiments, the first encoder can be used to encode the Mel spectrum of the samples in the first group of samples to obtain the first encoding vector corresponding to each sample, and the second encoder can be used to encode the Mel spectrum of the samples in the second group of samples to obtain the second encoding vector corresponding to each sample. In some embodiments, the first encoder may be an encoder and the second encoder may be a momentum encoder. In some embodiments, the first encoder or the second encoder may be a residual network, e.g., ResNet18.

In some embodiments, contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vector of the corresponding negative sample, and to make the encoding parameters of the second encoder gradually tend toward the encoding parameters of the first encoder. Contrastive training can be a form of self-supervised learning. Self-supervised learning does not require manually labeled information and directly uses the data itself as supervisory information to learn the feature representation of the sample data (for example, the sample audios). For specific details about performing contrastive training on the first encoder and the second encoder, refer to Fig. 2 and the related description, which are not repeated here.

In the embodiments of the present disclosure, contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vector of the corresponding negative sample. That is, the features of a sample audio become more similar to the features of its positive sample and less similar to the features of its negative samples, so that the first encoding vector output by the first encoder can more effectively distinguish audio that belongs to the same audio from audio that does not, and the contrastive training enables the first encoder to learn higher-level features of the audio. Consequently, the audio fingerprint output by the trained first encoder acting as the audio fingerprint extractor (that is, the encoding vector serving as the fingerprint feature of the audio) can better complete the audio retrieval task and improve the accuracy of audio retrieval.

Fig. 2 is a flowchart showing contrastive training of a first encoder and a second encoder according to an exemplary embodiment of the present disclosure. As shown in Fig. 2, the method includes:

Step 210: encode the samples in the first group of samples with the first encoder to obtain the first encoding vector corresponding to each sample, and encode the samples in the second group of samples with the second encoder to obtain the second encoding vector corresponding to each sample.

Step 220: iteratively compute the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively update the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder moves close to the second encoding vector of the corresponding positive sample and away from the second encoding vector of the corresponding negative sample, where the loss value characterizes the similarity between the first encoding vector and the second encoding vector; meanwhile, the encoding parameters of the second encoder gradually tend toward the encoding parameters of the first encoder, until the trained first encoder is obtained.

In some embodiments, the similarity can be determined by the distance between vectors: the smaller the distance, the greater the similarity. In some embodiments, the distance may include, but is not limited to, cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, Minkowski distance, and the like.
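For reference, the distances listed above can be written compactly with NumPy. The function names are illustrative; the Mahalanobis distance additionally needs a covariance matrix, which is passed in explicitly here.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def minkowski_distance(u, v, p):
    """p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

def mahalanobis_distance(u, v, cov):
    """Distance that accounts for the covariance `cov` of the feature dimensions."""
    d = u - v
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
manhattan = minkowski_distance(u, v, p=1)
euclidean = minkowski_distance(u, v, p=2)
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check for an implementation.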

In some embodiments, the contrastive loss function may be chosen according to actual conditions, and the loss value is determined using the contrastive loss function. For example, the similarity between the first encoding vector and the second encoding vector is determined as the loss value. In some embodiments, the contrastive loss function may be the InfoNCE (noise contrastive estimation) loss function. In some embodiments, the loss value of the InfoNCE loss function can be obtained by the following formula (1):

L_q = -log( exp(q·k₊/τ) / Σ_{i=1}^{K} exp(q·k_i/τ) ) (1)

Here, L_q represents the loss value of the InfoNCE loss function; q represents the first encoding vector; k₊ represents the second encoding vector of the positive sample matching the sample corresponding to the first encoding vector; τ represents the temperature hyperparameter, which can be set according to actual conditions, for example, τ may be 0.1; K represents the total number of second encoding vectors involved in the loss calculation; k_i represents the i-th second encoding vector involved in the loss calculation; q·k₊ represents the dot product of q and k₊; q·k_i represents the dot product of q and k_i; exp represents the exponential function with the natural constant e as its base; and log represents the logarithmic function.
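The InfoNCE loss for a single first encoding vector can be sketched as below: the negative logarithm of a softmax over the dot products q·k_i scaled by the temperature τ, evaluated at the positive sample. The function name `info_nce` and the argument `positive_index` are illustrative assumptions, and subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def info_nce(q, keys, positive_index, tau=0.1):
    """InfoNCE loss of one first encoding vector q against K second encoding vectors.
    `keys` has shape (K, D); `positive_index` marks the row holding the positive sample."""
    logits = keys @ q / tau                  # q . k_i / tau for every i
    logits = logits - np.max(logits)         # shift for numerical stability only
    log_denominator = np.log(np.sum(np.exp(logits)))
    return -(logits[positive_index] - log_denominator)

q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
loss_when_positive_is_close = info_nce(q, keys, positive_index=0)
loss_when_positive_is_far = info_nce(q, keys, positive_index=1)
```

The loss is small when q is aligned with its positive sample and large when it is aligned with a negative sample, which is the behaviour the training relies on.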

In some embodiments, iteratively computing the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors includes: when the iteration is the first round, determining the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round; when the iteration is not the first round, replacing the second encoding vectors of a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determining the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue; where the preset queue is used to store the second encoding vectors obtained in each round of iteration, and the target historical round is the earliest historical round stored in the preset queue.

In some embodiments, when the iteration is the first round, the second encoding vectors obtained in the first round can be stored in the preset queue. It can be understood that, in the first round, determining the loss value based on the first encoding vectors and the second encoding vectors obtained in the first round amounts to determining it based on the first encoding vectors obtained in the first round and the second encoding vectors in the preset queue. When the iteration is not the first round, the second encoding vectors of the target historical round in the preset queue are replaced with the second encoding vectors obtained in the current round, that is, the preset queue is updated, and the loss value of the contrastive loss function is then determined based on the first encoding vectors obtained in the current round and the second encoding vectors in the updated preset queue. Thus, during the iterative computation of the loss value, the preset queue is a dynamic queue whose second encoding vectors are updated in every round, and in every round, for each of the first encoding vectors, the preset queue contains one second encoding vector of a positive sample, while the rest are second encoding vectors of negative samples.

For example, with N sample audios, the first encoder produces N first encoding vectors and the second encoder produces N second encoding vectors. During the iterative computation of the loss value, when the iteration is the first round, the loss value of the contrastive loss function can be determined based on the N second encoding vectors obtained in the first round (that is, the N second encoding vectors stored in the preset queue) and the N first encoding vectors obtained in the first round. For example, for each of the N first encoding vectors in the first round, a loss value is computed from that first encoding vector and the N second encoding vectors using formula (1) above, giving N loss values, and the N loss values are averaged to obtain the loss value of the first round. When the iteration is not the first round, for example the second round, the N second encoding vectors obtained in the second round replace those of the earliest historical round stored in the preset queue, namely the N second encoding vectors obtained in the first round, so that the preset queue now holds the N replacement second encoding vectors. The loss value of the second round is then determined from the N first encoding vectors obtained in the second round and the second encoding vectors in the preset queue in the same way as in the first round, which is not repeated here. By analogy, the loss value for each round of the multi-round iteration can be obtained.
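The per-round computation with a preset queue can be sketched as follows. This is a minimal illustration under stated assumptions: the queue is modelled with `collections.deque`, whose `maxlen` (the number of rounds it holds) is a hypothetical setting, and the function name `round_loss` is not from the disclosure. A full deque discards its earliest entry on `append`, which mirrors replacing the earliest historical round.

```python
from collections import deque
import numpy as np

def round_loss(first_vecs, second_vecs, queue, tau=0.1):
    """Store this round's second vectors, then average the InfoNCE loss over the batch."""
    queue.append(second_vecs)                        # a full deque drops its earliest round
    keys = np.concatenate(list(queue), axis=0)       # every second vector currently queued
    offset = len(keys) - len(second_vecs)            # the current round sits at the end
    logits = first_vecs @ keys.T / tau               # shape (N, K)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability only
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = offset + np.arange(len(first_vecs))  # i-th first vector matches i-th new key
    return -log_probs[np.arange(len(first_vecs)), positives].mean()

queue = deque(maxlen=1)   # holds a single round, as in the N-vector example
basis = np.eye(3)         # toy encoding vectors: each sample matches only its own key
loss_round1 = round_loss(basis, basis, queue)
loss_round2 = round_loss(basis, basis, queue)
```

With a larger `maxlen`, second encoding vectors from earlier rounds remain in the queue and act as additional negative samples for the current round.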

In some embodiments, during the process of iteratively updating the encoding parameters of the first encoder based on the loss value, the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder. In some embodiments, the encoding parameters of the second encoder can be changed by the following formula (2):

θ_k ← m·θ_k + (1 − m)·θ_q    (2)

Wherein, θ_k represents the encoding parameters of the second encoder, θ_q represents the encoding parameters of the first encoder, and m represents a constant that can be set according to the actual situation, for example, m = 0.999. Because m is set to a relatively large value, the encoding parameters of the second encoder retain most of their own weight and move toward the encoding parameters of the first encoder with only a small weight, so that the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
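As a minimal sketch of formula (2), the update can be applied element-wise to flat parameter lists. The function name `momentum_update` and the flat-list representation of the parameters are assumptions made for illustration; a real encoder would hold its parameters in tensors.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Apply theta_k <- m * theta_k + (1 - m) * theta_q element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


def run_updates(theta_k, theta_q, steps, m=0.999):
    """Repeat the update while theta_q is held fixed (a simplification)."""
    for _ in range(steps):
        theta_k = momentum_update(theta_k, theta_q, m)
    return theta_k
```

If θ_q were held fixed at 1 and θ_k started at 0, then after n updates θ_k would equal 1 − mⁿ, so with m = 0.999 the second encoder covers roughly 63% of the gap after 1000 updates. This illustrates why a large m makes the second encoder change slowly.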

Fig. 3 is a flowchart showing an audio fingerprint extraction method according to an exemplary embodiment of the present disclosure. As shown in Figure 3, the method includes:

Step 310, acquire the audio to be queried.

Step 320, process the audio to be queried by the audio fingerprint extractor to obtain a coded vector serving as a fingerprint feature of the audio to be queried.

In some embodiments, the audio to be queried may be audio for which an encoding vector serving as the fingerprint feature needs to be obtained, that is, audio whose audio fingerprint needs to be obtained. In some embodiments, the audio to be queried may be distorted audio, that is, edited audio. In some embodiments, the edited audio may refer to audio in which one or more of the following adjustment parameters have been adjusted: noise, pitch, speed, filter parameters, echo, frequency band of gain or attenuation, and audio format. For specific details on adjusting the foregoing adjustment parameters, reference may be made to the above-mentioned step 120 and its related descriptions, which will not be repeated here.

In some embodiments, the audio fingerprint extractor may be the first encoder that has been trained according to the aforementioned steps 110 to 130. For the training process of the first encoder (i.e., the audio fingerprint extractor), reference can be made to the above-mentioned FIG. 1 and FIG. 2 and their related descriptions, which will not be repeated here.

In some embodiments, an audio retrieval task can be completed according to the encoding vector serving as the fingerprint feature of the audio to be queried, that is, the audio fingerprint of the audio to be queried. In some embodiments, other audio that corresponds to the same audio as the audio to be queried can be retrieved from a preset database according to the audio fingerprint of the audio to be queried. For example, when the audio to be queried is a distorted song, the audio retrieval task may refer to retrieving the original song of the audio to be queried from a music library.

In some embodiments, retrieving from the preset database, according to the audio fingerprint of the audio to be queried, other audio that corresponds to the same audio as the audio to be queried may be: retrieving from the database, according to the audio fingerprint of the audio to be queried, other audio whose fingerprint similarity meets a preset condition. The similarity can be determined from the distance between audio fingerprints, i.e., the distance between the encoding vectors serving as the fingerprint features of the audio. Regarding the distance, reference may be made to the foregoing step 210 and its related descriptions, which will not be repeated here. It is worth noting that other audio that corresponds to the same audio as the audio to be queried can also be retrieved from the database by means other than similarity, and this disclosure does not impose any restriction on this.
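The similarity-based retrieval described above can be sketched as follows. The choice of cosine similarity, the threshold of 0.9 as the preset condition, and the names `retrieve` and `cosine_similarity` are all assumptions made for illustration; the distance actually used is the one described in step 210.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two fingerprint vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve(query_fingerprint, database, threshold=0.9):
    """Return (name, similarity) pairs at or above the threshold, best match first.

    `database` is a hypothetical mapping from an audio name to its fingerprint.
    """
    hits = []
    for name, fingerprint in database.items():
        score = cosine_similarity(query_fingerprint, fingerprint)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy music library with made-up three-dimensional fingerprints.
music_library = {"song_a": [1.0, 0.0, 0.0], "song_b": [0.0, 1.0, 0.0]}
matches = retrieve([0.9, 0.1, 0.0], music_library)
```

A linear scan like this is adequate only for small databases; a large music library would typically use an approximate nearest-neighbor index instead.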

Fig. 4 is a block diagram showing an apparatus for generating an encoder according to an exemplary embodiment of the present disclosure. As shown in Figure 4, the generating device 400 includes:

The first acquisition module 410 is configured to acquire a plurality of sample audios;

The construction module 420 is configured to construct a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, there are corresponding positive samples and negative samples in the second group of samples;

The training module 430 is configured to perform comparison training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of audio.

In some embodiments, construction module 420 is further configured to:

Performing a first parameter adjustment and a second parameter adjustment on the plurality of sample audios, respectively, to obtain the first group of samples and the second group of samples, wherein the adjustment parameters and/or adjustments corresponding to the first parameter adjustment and the second parameter adjustment are not exactly the same;

Wherein, each sample in the first group of samples is a sample audio that has undergone the first parameter adjustment, and each sample in the second group of samples is a sample audio that has undergone the second parameter adjustment; for each sample in the first group of samples, the sample in the second group of samples that corresponds to the same sample audio is a positive sample, and the other samples are negative samples.

In some embodiments, the adjustment parameters include but are not limited to one or more of the following: noise, pitch, speed, filter parameters, echo, frequency band of gain or attenuation, and audio format.
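The construction of the two groups and the positive/negative relationship can be illustrated with a toy sketch. Audio is represented as a plain list of floats, and the two adjustments (additive noise and a gain change) along with every function name are hypothetical stand-ins for whichever adjustment parameters are actually chosen.

```python
import random


def add_noise(audio, scale, rng):
    """First adjustment (assumed): add small uniform noise to every value."""
    return [x + rng.uniform(-scale, scale) for x in audio]


def change_gain(audio, factor):
    """Second adjustment (assumed): scale every value by a constant factor."""
    return [x * factor for x in audio]


def build_sample_groups(sample_audios, seed=0):
    """Apply a different adjustment to the same sample audios for each group."""
    rng = random.Random(seed)
    first_group = [add_noise(audio, 0.01, rng) for audio in sample_audios]
    second_group = [change_gain(audio, 0.5) for audio in sample_audios]
    return first_group, second_group


def positive_and_negatives(index, second_group):
    """For the first-group sample at `index`, split the second group by source audio."""
    positive = second_group[index]
    negatives = [s for j, s in enumerate(second_group) if j != index]
    return positive, negatives
```

Because both groups are built in the same order, the sample at the same index in the second group comes from the same sample audio and is the positive sample; every other entry is a negative sample.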

In some embodiments, training module 430 is further configured to:

Encoding the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encoding the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;

Performing an iterative operation on the loss value of the comparison loss function based on the first encoding vector and the second encoding vector, and iteratively updating the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder is close to the second encoding vector of the corresponding positive sample and far away from the second encoding vector of the corresponding negative sample, wherein the loss value is used to characterize the degree of similarity between the first encoding vector and the second encoding vector; and,

Making the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder until the trained first encoder is obtained.

In some embodiments, training module 430 is further configured to:

When the iterative operation is the first round, determine the loss value of the comparison loss function based on the first encoding vector and the second encoding vector obtained in the first round;

When the iterative operation is not the first round, replace the second encoding vector corresponding to a target historical round in the preset queue with the second encoding vector obtained in the current round, and determine the loss value of the comparison loss function based on the first encoding vector obtained in the current round and the second encoding vectors in the preset queue;

Wherein, the preset queue is used to store the second encoding vector obtained in each round of the iterative operation, and the target historical round is the historical round with the earliest number of rounds stored in the preset queue.

Fig. 5 is a block diagram of an audio fingerprint extraction device according to an exemplary embodiment of the present disclosure. As shown in Figure 5, the device 500 includes:

The second obtaining module 510 is configured to obtain the audio to be queried;

The processing module 520 is configured to process the audio to be queried with the audio fingerprint extractor to obtain an encoding vector serving as a fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method of an embodiment of the present disclosure.

Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device 600 suitable for implementing an embodiment of the present disclosure. The electronic device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.

As shown in FIG. 6, an electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Typically, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, a magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 6 shows the electronic device 600 having various devices, it should be understood that implementing or possessing all of the illustrated devices is not a requirement. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, by contrast, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.

In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire a plurality of sample audios; construct a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, there are corresponding positive samples and negative samples in the second group of samples; and perform comparison training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of audio; wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the comparison training is used to make the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and far away from the second encoding vector of the corresponding negative sample, and to make the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.

Alternatively, the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire the audio to be queried; and process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method described in the embodiments of the present disclosure.

Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.

The modules or units described in the embodiments of the present disclosure may be implemented by software or by hardware. Wherein, the name of a module or unit does not constitute a limitation of the unit itself under certain circumstances, for example, a jump module may also be described as "a module for jumping to a next-level page".

The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, Example 1 provides a method for generating an encoder, including:

Acquiring a plurality of sample audios;

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the first group of samples and the second group of samples are constructed according to the plurality of audio samples, including:

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the adjustment parameters include but are not limited to one or more of the following: noise, pitch, speed, filter parameters, echo, frequency band of gain or attenuation, and audio format.

According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, performing comparative training on the first encoder and the second encoder according to the first set of samples and the second set of samples, include:

According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, where performing an iterative operation on the loss value of the comparison loss function based on the first encoding vector and the second encoding vector includes:

According to one or more embodiments of the present disclosure, Example 6 provides an audio fingerprint extraction method, including: acquiring the audio to be queried; and processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method described in any one of Examples 1-5.

According to one or more embodiments of the present disclosure, Example 7 provides an encoder generation device, including:

According to one or more embodiments of the present disclosure, Example 8 provides the device of Example 7, the construction module is further configured to:

According to one or more embodiments of the present disclosure, Example 9 provides the device of Example 8, the adjustment parameters include but are not limited to at least one of the following: noise, pitch, speed, filter parameters, echo, gain or attenuation frequency band, and audio format.

According to one or more embodiments of the present disclosure, Example 10 provides the device of Example 7, the training module is further configured to:

According to one or more embodiments of the present disclosure, Example 11 provides the device of Example 10, the training module is further configured to:

According to one or more embodiments of the present disclosure, Example 12 provides an audio fingerprint extraction device, comprising:

The second obtaining module is configured to obtain the audio to be queried;

The processing module is configured to process the audio to be queried with the audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method described in any one of Examples 1-5.

According to one or more embodiments of the present disclosure, Example 13 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in Examples 1-6 are implemented.

According to one or more embodiments of the present disclosure, Example 14 provides an electronic device, comprising:

storage means on which at least one computer program is stored;

At least one processing device configured to execute the at least one computer program in the storage device to implement the steps of any one of the methods in Examples 1-6.

The above description is only a preferred embodiment of the present disclosure and an illustration of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in this disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned technical features, but also covers other technical solutions formed by any combination of the above-mentioned technical features or their equivalent features, for example, a technical solution formed by replacing the above-mentioned features with (but not limited to) technical features having similar functions disclosed in this disclosure.

In addition, while operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or performed in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while the above discussion contains several specific implementation details, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. Regarding the device in the above-mentioned embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method, and will not be described in detail here.

Claims

A method for generating an encoder, comprising:

Acquiring a plurality of sample audios;

Constructing a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, there are corresponding positive samples and negative samples in the second group of samples;

Performing comparison training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of audio;

Wherein, the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the comparison training is used to make the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and far away from the second encoding vector of the corresponding negative sample, and to make the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
The method according to claim 1, wherein said constructing a first set of samples and a second set of samples according to said plurality of sample audios comprises:

Performing a first parameter adjustment and a second parameter adjustment on the plurality of sample audios, respectively, to obtain the first group of samples and the second group of samples, wherein the adjustment parameters and/or adjustments corresponding to the first parameter adjustment and the second parameter adjustment are not exactly the same;

Wherein, each sample in the first group of samples is a sample audio that has undergone the first parameter adjustment, and each sample in the second group of samples is a sample audio that has undergone the second parameter adjustment; for each sample in the first group of samples, the sample in the second group of samples that corresponds to the same sample audio is a positive sample, and the other samples are negative samples.
The method according to claim 2, wherein the adjustment parameters include but are not limited to at least one of the following: noise, pitch, speed, filter parameters, echo, frequency band of gain or attenuation, and audio format.
The method according to claim 1, wherein the comparative training of the first encoder and the second encoder according to the first set of samples and the second set of samples includes:

Encoding the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encoding the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;

Performing an iterative operation on the loss value of the comparison loss function based on the first encoding vector and the second encoding vector, and iteratively updating the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder is close to the second encoding vector of the corresponding positive sample and far away from the second encoding vector of the corresponding negative sample, wherein the loss value is used to characterize the degree of similarity between the first encoding vector and the second encoding vector; and,

Making the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder until the trained first encoder is obtained.
The method according to claim 4, wherein the iterative operation of the loss value of the comparison loss function based on the first encoding vector and the second encoding vector comprises:

When the iterative operation is the first round, determine the loss value of the comparison loss function based on the first encoding vector and the second encoding vector obtained in the first round;

When the iterative operation is not the first round, replacing the second encoding vector corresponding to a target historical round in the preset queue with the second encoding vector obtained in the current round, and determining the loss value of the comparison loss function based on the first encoding vector obtained in the current round and the second encoding vectors in the preset queue;

Wherein, the preset queue is used to store the second encoding vector obtained in each round of the iterative operation, and the target historical round is the historical round with the earliest number of rounds stored in the preset queue.
A method for extracting audio fingerprints, comprising:

Acquiring the audio to be queried;

Processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method of any one of claims 1-5.
A device for generating an encoder, comprising:

The first acquisition module is configured to acquire a plurality of sample audios;

A construction module configured to construct a first group of samples and a second group of samples according to the plurality of sample audios, wherein, for each sample in the first group of samples, there are corresponding positive samples and negative samples in the second group of samples;

A training module configured to perform comparison training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of audio;

Wherein, the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the comparison training is used to make the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and far away from the second encoding vector of the corresponding negative sample, and to make the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
A device for extracting audio fingerprints, comprising:

The second obtaining module is configured to obtain the audio to be queried;

The processing module is configured to process the audio to be queried with the audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; the audio fingerprint extractor is the first encoder that has been trained according to the encoder generation method of any one of claims 1-5.
A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method according to any one of claims 1-6.
An electronic device comprising:

storage means on which at least one computer program is stored;

At least one processing device configured to execute the at least one computer program in the storage device to implement the steps of the method according to any one of claims 1-6.