WO2023134549A1 - Encoder generation method, fingerprint extraction method, medium, and electronic device - Google Patents

Encoder generation method, fingerprint extraction method, medium, and electronic device

Info

Publication number
WO2023134549A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
encoder
audio
sample
encoding
Prior art date
Application number
PCT/CN2023/070796
Other languages
French (fr)
Chinese (zh)
Inventor
于哲松
杜行健
刘铭瑀
朱碧磊
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture

Definitions

  • the first group of samples and the second group of samples may be constructed according to data enhancement results of multiple sample audios, and the data enhancement may include adjusting audio parameters.
  • constructing the first group of samples and the second group of samples from the plurality of sample audios includes: performing a first parameter adjustment and a second parameter adjustment, respectively, on the plurality of sample audios to obtain the first group of samples and the second group of samples.
  • retrieving from the preset database, according to the audio fingerprint of the audio to be queried, other audio that belongs to the same audio as the audio to be queried may be: retrieving from the database, according to the audio fingerprint of the audio to be queried, other audio whose fingerprint similarity meets a preset condition.
  • the similarity can be determined from the audio fingerprints, i.e. from the distances between the encoding vectors that serve as fingerprint features of the audios. Regarding the distance, reference may be made to the foregoing step 210 and its related descriptions, which will not be repeated here. It is worth noting that other audio belonging to the same audio as the audio to be queried can also be retrieved from the database by means other than similarity, and this disclosure does not impose any restrictions on this.
  • the above-mentioned computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a plurality of sample audios; construct a first group of samples and a second group of samples from the plurality of sample audios, wherein, for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples; and perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector as the fingerprint feature of an audio; wherein the first encoder encodes the samples in the first group of samples to obtain a first encoding vector for each sample, and the second encoder encodes the samples in the second group of samples to obtain a second encoding vector for each sample.
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample
  • the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample
  • the contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an encoder generation method, a fingerprint extraction method, a medium, and an electronic device. The encoder generation method comprises: acquiring a plurality of sample audios; constructing a first group of samples and a second group of samples according to the plurality of sample audios, wherein for each sample in the first group of samples, a corresponding positive sample and a corresponding negative sample exist in the second group of samples; and performing comparison training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can be used as an audio fingerprint extractor to output an encoding vector serving as a fingerprint feature of an audio. The trained first encoder obtained in the present invention can effectively extract fingerprint features of the audio, and more accurate audio fingerprints are obtained, such that the accuracy of audio retrieval is improved.

Description

Encoder generation method, fingerprint extraction method, medium, and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202210045056.3, entitled "Encoder generation method, fingerprint extraction method, medium, and electronic device" and filed on January 14, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular to an encoder generation method, a fingerprint extraction method, a medium, and an electronic device.
Background
An audio fingerprint is a compact digital signature, extracted from audio content, that represents the important acoustic information of a piece of audio. It gives the audio a unique representation by which one piece of audio can be effectively distinguished from others. In the related art, a long short-term memory (LSTM) autoencoder is used to generate an audio fingerprint for a piece of audio, and the fingerprint is then used for audio retrieval tasks, for example retrieving other audio related to that audio from a music library. For distorted audio, however, the fingerprint generated by the autoencoder cannot represent the audio effectively, which lowers retrieval accuracy and prevents the retrieval task from being completed effectively.
Summary
This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description below. It is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides an encoder generation method, including:
acquiring a plurality of sample audios;
constructing a first group of samples and a second group of samples from the plurality of sample audios, wherein, for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples;
performing contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector as the fingerprint feature of an audio;
wherein the first encoder encodes the samples in the first group of samples to obtain a first encoding vector for each sample, and the second encoder encodes the samples in the second group of samples to obtain a second encoding vector for each sample; the contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
In a second aspect, the present disclosure provides an audio fingerprint extraction method, including:
acquiring audio to be queried;
processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector as the fingerprint feature of the audio to be queried, wherein the audio fingerprint extractor is the first encoder trained by the encoder generation method of the first aspect.
In a third aspect, the present disclosure provides an encoder generation apparatus, including:
a first acquisition module configured to acquire a plurality of sample audios;
a construction module configured to construct a first group of samples and a second group of samples from the plurality of sample audios, wherein, for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples;
a training module configured to perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector as the fingerprint feature of an audio;
wherein the first encoder encodes the samples in the first group of samples to obtain a first encoding vector for each sample, and the second encoder encodes the samples in the second group of samples to obtain a second encoding vector for each sample; the contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
In a fourth aspect, the present disclosure provides an audio fingerprint extraction apparatus, including:
a second acquisition module configured to acquire audio to be queried;
a processing module configured to process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector as the fingerprint feature of the audio to be queried, wherein the audio fingerprint extractor is the first encoder trained by the encoder generation method of the first aspect.
In a fifth aspect, the present disclosure provides a computer-readable medium storing a computer program which, when executed by a processing device, implements the steps of the methods of the first aspect and the second aspect.
In a sixth aspect, the present disclosure provides an electronic device, including:
a storage device storing at least one computer program;
at least one processing device configured to execute the at least one computer program in the storage device to implement the steps of the methods of the first aspect and the second aspect.
With the above technical solution, contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples. The encoding vector output by the first encoder can therefore distinguish more effectively between audio that belongs to the same audio as the input and audio that does not, and the contrastive training lets the first encoder learn higher-level features of the audio. The audio fingerprint output by the trained first encoder acting as the audio fingerprint extractor (that is, the encoding vector output as the fingerprint feature of the audio) can thus better complete audio retrieval tasks and improves the accuracy of audio retrieval.
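The two training effects described above (pulling the first encoding vector toward its positive and away from its negatives, and letting the second encoder's parameters drift toward the first encoder's) can be sketched as follows. This is only an illustration: this part of the disclosure does not name a loss function or an update rule, so the InfoNCE-style loss, the exponential-moving-average update, and the names `info_nce_loss`, `momentum_update`, `temperature` and `momentum` are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length so that dot products become cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce_loss(first_vec, second_vecs, positive_index, temperature=0.1):
    """Assumed loss: cross-entropy over similarities between one first encoding
    vector and all second encoding vectors. It is small when the vector is
    closest to its positive and far from the negatives."""
    q = normalize(first_vec)
    logits = [dot(q, normalize(k)) / temperature for k in second_vecs]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[positive_index]


def momentum_update(second_params, first_params, momentum=0.99):
    """Assumed update: move each second-encoder parameter a small step toward
    the matching first-encoder parameter."""
    return [momentum * s + (1.0 - momentum) * f
            for s, f in zip(second_params, first_params)]
```

In a real training loop the loss would be minimized with respect to the first encoder's parameters by gradient descent, and `momentum_update` would be applied to the second encoder after each step, so the second encoder follows the first without being trained directly.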
Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
In the drawings:
Fig. 1 is a flowchart of an encoder generation method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart of contrastive training of a first encoder and a second encoder according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart of an audio fingerprint extraction method according to an exemplary embodiment of the present disclosure.
Fig. 4 is a block diagram of an encoder generation apparatus according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of an audio fingerprint extraction apparatus according to an exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, meaning "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish between devices, modules, or units. They do not require these devices, modules, or units to be different devices, modules, or units, nor do they limit the order or interdependence of the functions that these devices, modules, or units perform.
It should be noted that the modifiers "a/an" and "a plurality of" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
As stated in the Background, an audio fingerprint is a compact digital signature, extracted from audio content, that represents the important acoustic information of a piece of audio. It gives the audio a unique representation by which one piece of audio can be effectively distinguished from others. In some embodiments, audio fingerprints can be applied in a variety of scenarios. For example, they can be used for audio deduplication, that is, eliminating duplicate audio from a set of audio. As another example, they can be used for audio retrieval, such as song recognition, which finds the original song for a piece of audio.
In the related art, an audio retrieval task can be based on a non-deep-learning algorithm: a spectrogram is extracted from the original audio, salient points on the spectrogram are computed and hash-encoded, and a large-scale audio fingerprint library is built; at retrieval time, fingerprint features are extracted from the query audio and multi-level filtering is performed through hash feature retrieval. Audio retrieval can also be performed with a music detection module and a music recognition module, where the music detection module detects whether music is currently present, and the music recognition module, which contains several convolutional layers and a two-layer divide-and-encode block, extracts features from the query audio and retrieves by feature distance. In addition, an LSTM autoencoder can be used to generate an audio fingerprint, which is then used for audio retrieval.
However, none of the methods for generating audio fingerprints (or extracting features) described in the above related art can retrieve distorted audio effectively. In some embodiments, distorted audio may refer to edited audio, for example audio obtained by editing the speed of an audio and/or the audio itself.
In view of this, the present disclosure proposes an encoder generation method that obtains, through contrastive training, a first encoder serving as an audio fingerprint extractor. The first encoder can learn higher-level features of the audio, and the encoding vector it outputs can better distinguish between audio that belongs to the same audio as the input and audio that does not. The audio fingerprint output by this extractor can better complete audio retrieval tasks and improves the accuracy of audio retrieval.
Fig. 1 is a flowchart of an encoder generation method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the generation method includes:
Step 110: acquire a plurality of sample audios.
In some embodiments, the sample audios may be training data for the contrastive training of the first encoder and the second encoder. In some embodiments, a sample audio may be music data, for example a song or a song segment, where the length of the song segment can be set according to the actual situation; for example, it may be any value from 10 s to 900 s. In some embodiments, the sample audios may use music data with different vocals and different styles.
In some embodiments, the plurality of sample audios can be acquired by reading stored data, calling a relevant interface, or other means.
Step 120: construct a first group of samples and a second group of samples from the plurality of sample audios.
In some embodiments, the first group of samples and the second group of samples may be constructed from the results of data augmentation of the plurality of sample audios, and the data augmentation may include adjusting parameters of the audio. In some embodiments, constructing the first group of samples and the second group of samples from the plurality of sample audios includes: performing a first parameter adjustment and a second parameter adjustment, respectively, on the plurality of sample audios to obtain the first group of samples and the second group of samples.
In some embodiments, parameter adjustment may refer to adjusting an adjustment parameter of the audio according to a corresponding adjustment manner. In some embodiments, the adjustment parameters of the first parameter adjustment or the second parameter adjustment may include, but are not limited to, at least one of the following: noise, pitch, speed, filter parameters, echo, the frequency band to be boosted or attenuated, and audio format.
In some embodiments, the noise may include white noise, and adjusting the noise may refer to adding white noise at a first preset ratio, where the first preset ratio can be set according to the actual situation; for example, its range may be (0, 0.1]. In some embodiments, adjusting the pitch may refer to raising or lowering the pitch, for example within an octave. An octave may include 12 semitones; correspondingly, adjusting the pitch may refer to raising or lowering the pitch within the octave in units of semitones.
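The noise and pitch adjustments can be illustrated with a small sketch. The disclosure does not say what the "first preset ratio" is measured against, so scaling the noise by the signal's peak amplitude is an assumption, as are the function names. The pitch part only computes the frequency ratio that corresponds to a semitone shift in 12-tone equal temperament; applying it to real audio would need a pitch-shifting algorithm that is not shown here.

```python
import random


def add_white_noise(samples, ratio, seed=0):
    """Add uniform white noise whose peak is `ratio` times the signal's peak.
    Measuring the ratio against the peak amplitude is an assumption."""
    if ratio <= 0.0:
        raise ValueError("ratio must be positive, e.g. a value in (0, 0.1]")
    rng = random.Random(seed)
    peak = max((abs(s) for s in samples), default=0.0)
    return [s + ratio * peak * rng.uniform(-1.0, 1.0) for s in samples]


def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones`, limited to one octave
    (12 semitones) in either direction."""
    if not -12 <= semitones <= 12:
        raise ValueError("the shift is kept within one octave")
    return 2.0 ** (semitones / 12.0)
```

For instance, `semitone_ratio(12)` doubles every frequency (one octave up), while `semitone_ratio(-12)` halves it.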
In some embodiments, adjusting the speed may refer to speeding up or slowing down the playback of the audio according to a second preset speed multiple, where the second preset speed multiple can be set according to the actual situation; for example, its range may be [0.5, 1.5]. In some embodiments, the filter parameter may refer to the filter frequency, and adjusting the filter parameter may refer to applying high-pass filtering and/or low-pass filtering to the audio, where the filter frequency of the high-pass or low-pass filter can be set according to actual needs. For example, the filter frequency of the high-pass filter may be 2000 Hz, and that of the low-pass filter may be 300 Hz.
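As a rough illustration of the speed and filter adjustments, the sketch below changes speed by linear-interpolation resampling and filters with a first-order (one-pole) low-pass and its complement as a high-pass. These are deliberately simple stand-ins chosen for the example; the disclosure does not specify the algorithms, and plain resampling also changes the pitch along with the speed.

```python
import math


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 shortens (speeds up) the audio."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def low_pass(samples, cutoff_hz, sample_rate):
    """One-pole low-pass filter (discretized RC filter)."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def high_pass(samples, cutoff_hz, sample_rate):
    """Complementary high-pass: the input minus its low-passed version."""
    smoothed = low_pass(samples, cutoff_hz, sample_rate)
    return [x - y for x, y in zip(samples, smoothed)]
```

With the example values from the text, `high_pass(audio, 2000.0, sample_rate)` and `low_pass(audio, 300.0, sample_rate)` would give two differently filtered versions of the same audio.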
In some embodiments, the frequency band to be boosted or attenuated can be adjusted with an equalizer, which can adjust the following parameters: frequency, gain, and Q (Quantize) value. The frequency is the parameter that indicates the frequency point to be adjusted, the gain is the parameter that indicates the boost or attenuation applied at the set frequency value, and the Q value is the parameter that indicates the "width" of the frequency band that is boosted or attenuated. In some embodiments, adjusting the audio format may refer to compressing the audio into a preset format, which can be set according to the actual situation. For example, the preset format may be 32 kbps MP3. In some embodiments, data augmentation can also be achieved by adjusting other parameters of the audio, for example by adding Gaussian noise, which is not described further here.
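One common way to realize an equalizer band with exactly these three controls is a peaking biquad filter. The sketch below uses the widely circulated "Audio EQ Cookbook" peaking-filter formulas and treats Q as the usual quality factor that sets the bandwidth; both choices are assumptions for illustration, not something the disclosure prescribes.

```python
import math


def peaking_eq_coefficients(center_hz, gain_db, q, sample_rate):
    """Return (b, a) for a peaking biquad, normalized so the leading
    denominator coefficient is 1: b = [b0, b1, b2], a = [a1, a2]."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / amp
    b = [(1.0 + alpha * amp) / a0,
         -2.0 * math.cos(w0) / a0,
         (1.0 - alpha * amp) / a0]
    a = [-2.0 * math.cos(w0) / a0,
         (1.0 - alpha / amp) / a0]
    return b, a


def biquad(samples, b, a):
    """Direct form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

A positive `gain_db` boosts the band around `center_hz`, a negative value attenuates it, and a larger `q` makes the affected band narrower.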
In some embodiments, the adjustment parameters and/or adjustment manners of the first parameter adjustment and the second parameter adjustment are not completely the same. A different adjustment manner may mean that an adjustment parameter is set to a different value. For example, the first parameter adjustment may be: adding white noise at a ratio of 0.1, speeding up audio playback by a factor of 0.5, and raising the first two semitones of the audio. The second parameter adjustment may be: adding white noise at a ratio of 0.2, doubling the audio playback speed, raising the first two semitones of the audio and lowering the third, and applying 2000 Hz high-pass filtering. First and second parameter adjustments whose adjustment parameters and/or adjustment manners are not completely the same yield different versions of data augmentation.
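The two example adjustments can be written down as plain configuration records, which makes the "not completely the same" requirement easy to check. The `Adjustment` class, its field names, and the encoding of the semitone changes as "up"/"down" steps are hypothetical choices made for this sketch; the values simply mirror the example in the text.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Adjustment:
    """Hypothetical record of one version of data augmentation."""
    white_noise_ratio: Optional[float] = None
    speed_up_factor: Optional[float] = None
    semitone_steps: Tuple[str, ...] = ()
    high_pass_hz: Optional[float] = None


# Values mirror the example given for the first and second parameter adjustments.
FIRST = Adjustment(white_noise_ratio=0.1, speed_up_factor=0.5,
                   semitone_steps=("up", "up"))
SECOND = Adjustment(white_noise_ratio=0.2, speed_up_factor=1.0,
                    semitone_steps=("up", "up", "down"), high_pass_hz=2000.0)


def differing_fields(first, second):
    """Names of the settings on which the two adjustments differ."""
    a, b = asdict(first), asdict(second)
    return [name for name in a if a[name] != b[name]]
```

Any non-empty result from `differing_fields(FIRST, SECOND)` means the two adjustments produce different augmented versions of the same sample audio.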
In some embodiments, each sample in the first group of samples is a sample audio after the first parameter adjustment, and each sample in the second group of samples is a sample audio after the second parameter adjustment. In some embodiments, audios obtained by applying different versions of data augmentation to the same sample audio are similar audios of each other. That is, when the first parameter adjustment and the second parameter adjustment are applied to the same sample audio, the resulting first-adjusted sample audio and second-adjusted sample audio are similar audios of each other.
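The construction of the two groups can be sketched as follows. This is only an illustrative sketch, not the method of the disclosure: the function names (`add_white_noise`, `change_speed`, `build_groups`) are hypothetical, the "ratio" of white noise is assumed to mean a fraction of the peak amplitude, "speeding up by 0.5 times" and "by 1 time" are assumed to mean playback factors of 1.5 and 2.0, and the pitch and filter adjustments are omitted.

```python
import random


def add_white_noise(samples, ratio, rng):
    """Add uniform white noise scaled to `ratio` of the peak amplitude (assumed meaning of "ratio")."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s + ratio * peak * rng.uniform(-1.0, 1.0) for s in samples]


def change_speed(samples, factor):
    """Naive speed change by nearest-index resampling; factor > 1 shortens the audio."""
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    return [samples[min(last, int(i * factor))] for i in range(n_out)]


def first_adjustment(samples, rng):
    # Assumed reading of the example: noise ratio 0.1, playback 1.5x.
    return change_speed(add_white_noise(samples, 0.1, rng), 1.5)


def second_adjustment(samples, rng):
    # Assumed reading of the example: noise ratio 0.2, playback 2.0x.
    return change_speed(add_white_noise(samples, 0.2, rng), 2.0)


def build_groups(sample_audios, seed=0):
    """Return (first_group, second_group); index i in both groups comes from sample audio i."""
    rng = random.Random(seed)
    first_group = [first_adjustment(audio, rng) for audio in sample_audios]
    second_group = [second_adjustment(audio, rng) for audio in sample_audios]
    return first_group, second_group
```

Because both groups preserve the order of the sample audios, the sample at index i in one group and the sample at index i in the other are similar audios of each other.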
It can be understood that, for any sample in the first group of samples, a similar audio of that sample is a positive sample of it, and a dissimilar audio is a negative sample of it. Therefore, for each sample in the first group of samples, the second group of samples contains a corresponding positive sample and corresponding negative samples. Specifically, for each sample in the first group of samples, the sample in the second group that corresponds to the same sample audio is the positive sample, and the other samples in the second group are negative samples.
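The positive/negative assignment described above can be written down in a few lines. The sketch below assumes that both groups are stored in the same order as the original sample audios, so that matching indices refer to the same sample audio; the function name is hypothetical.

```python
def positives_and_negatives(num_samples):
    """For each index i of the first group, return (positive index, negative indices) in the second group.

    Assumes both groups keep the order of the sample audios, so index i in the
    second group is the positive sample and every other index is a negative sample.
    """
    pairs = []
    for i in range(num_samples):
        negatives = [j for j in range(num_samples) if j != i]
        pairs.append((i, negatives))
    return pairs
```

For N sample audios, each first-group sample thus has exactly one positive sample and N - 1 negative samples in the second group.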
In the embodiments of the present disclosure, the adjustment parameters are of various types. The first group of samples and the second group of samples are obtained by adjusting multiple adjustment parameters, and the first encoder and the second encoder are then contrastively trained on these two groups. As a result, the first encoder can process audio whose various adjustment parameters have been adjusted, that is, audio that has been edited in various ways, which improves the robustness of the first encoder. The first encoder can therefore better extract the audio fingerprint of edited audio (for example, audio whose pitch and/or speed has been adjusted), that is, of distorted audio. For distorted audio, the audio fingerprint output by the first encoder can thus be used to complete the audio retrieval task accurately, improving the accuracy of audio retrieval.
Step 130: perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, where the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio.
In some embodiments, the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample. In some embodiments, the first encoder may encode the mel spectrum of each sample in the first group of samples to obtain the first encoding vector corresponding to that sample, and the second encoder may encode the mel spectrum of each sample in the second group of samples to obtain the second encoding vector corresponding to that sample. In some embodiments, the first encoder may be an encoder and the second encoder may be a momentum encoder. In some embodiments, the first encoder or the second encoder may be a residual network, for example ResNet18.
In some embodiments, the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach those of the first encoder. The contrastive training may be self-supervised learning, which requires no manually annotated label information and instead uses the data itself as the supervisory signal to learn a feature representation of the sample data (for example, the sample audio). For details of the contrastive training of the first encoder and the second encoder, see FIG. 2 and its related description, which are not repeated here.
In the embodiments of the present disclosure, contrastive training brings the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples. In other words, the features of a sample audio become more similar to the features of its positive sample and less similar to the features of its negative samples, so that the first encoding vector output by the first encoder can more effectively distinguish audio that belongs to the same audio from audio that does not. Contrastive training also enables the first encoder to learn higher-level features of the audio. Consequently, the audio fingerprint output by the trained first encoder acting as the audio fingerprint extractor (that is, the output encoding vector of the fingerprint feature of the audio) can better complete the audio retrieval task, improving the accuracy of audio retrieval.
FIG. 2 is a flowchart of contrastive training of the first encoder and the second encoder according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the method includes:
Step 210: encode the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encode the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample.
Step 220: iteratively compute the loss value of a contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively update the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder approaches the second encoding vector of the corresponding positive sample and moves away from the second encoding vectors of the corresponding negative samples, where the loss value characterizes the similarity between the first encoding vectors and the second encoding vectors; and make the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder, until the trained first encoder is obtained.
In some embodiments, the similarity may be determined from the distance between vectors: the smaller the distance, the greater the similarity. In some embodiments, the distance may include, but is not limited to, the cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, or Minkowski distance.
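As a small illustration of the distances listed above, the following sketch computes the cosine distance and the Minkowski distance between two vectors in plain Python. The function names are chosen here for illustration; the Manhattan and Euclidean distances are the Minkowski distance with p = 1 and p = 2. The cosine distance assumes that neither vector is the zero vector.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def minkowski_distance(u, v, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)
```

In each case a smaller value means the two encoding vectors are more similar.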
In some embodiments, the contrastive loss function may be determined according to actual needs and then used to determine the loss value. For example, the similarity between the first encoding vector and the second encoding vector is determined as the loss value. In some embodiments, the contrastive loss function may be the InfoNCE (Noise Contrastive Estimation) loss function. In some embodiments, the loss value of the InfoNCE loss function may be obtained by the following formula (1):
$$L_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=1}^{K} \exp(q \cdot k_i / \tau)} \qquad (1)$$
where $L_q$ denotes the loss value of the InfoNCE loss function; $q$ denotes a first encoding vector; $k_+$ denotes the second encoding vector of the positive sample that matches the sample corresponding to the first encoding vector; $\tau$ denotes a temperature hyperparameter, which may be set according to actual needs, for example $\tau = 0.1$; $K$ denotes the total number of second encoding vectors that participate in the calculation of the loss value; $k_i$ denotes the $i$-th second encoding vector that participates in the calculation of the loss value; $q \cdot k_+$ denotes the dot product of $q$ and $k_+$; $q \cdot k_i$ denotes the dot product of $q$ and $k_i$; $\exp$ denotes the exponential function with the natural constant $e$ as its base; and $\log$ denotes the logarithm function.
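The loss for a single first encoding vector can be sketched as below. This is a minimal sketch that assumes formula (1) takes the standard InfoNCE form, namely the negative log of the softmax weight assigned to the positive key; the function name `info_nce_loss` and the argument `positive_index` are illustrative, and the log-sum-exp trick is added here only for numerical stability.

```python
import math


def info_nce_loss(q, keys, positive_index, tau=0.1):
    """InfoNCE loss for one query q against all keys (second encoding vectors).

    keys[positive_index] is k_+; every other key is treated as a negative sample.
    """
    # Scaled dot products q . k_i / tau for every key.
    logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
    # log(sum_i exp(logit_i)), computed stably by factoring out the largest logit.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # -log(exp(logit_+) / sum_i exp(logit_i))
    return log_denominator - logits[positive_index]
```

The loss is small when q is much closer to its positive key than to the negative keys, and large in the opposite case, which is the behavior the contrastive training relies on.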
In some embodiments, iteratively computing the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors includes: when the iteration is the first round, determining the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round; when the iteration is not the first round, replacing the second encoding vectors corresponding to a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determining the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue. The preset queue is used to store the second encoding vectors obtained in each round of the iteration, and the target historical round is the earliest historical round stored in the preset queue.
In some embodiments, when the iteration is the first round, the preset queue may store the second encoding vectors obtained in the first round. It can be understood that, in the first round, determining the loss value based on the first encoding vectors and the second encoding vectors obtained in the first round is the same as determining it based on the first encoding vectors obtained in the first round and the second encoding vectors in the preset queue. When the iteration is not the first round, the second encoding vectors corresponding to the target historical round in the preset queue may be replaced with the second encoding vectors obtained in the current round, that is, the preset queue is updated, and the loss value of the contrastive loss function is then determined based on the first encoding vectors obtained in the current round and the second encoding vectors in the updated preset queue. Thus, during the iterative computation of the loss value, the preset queue is a dynamic queue: the second encoding vectors in it are updated in every round, and in every round, for each of the multiple first encoding vectors, the preset queue contains one second encoding vector of a positive sample, while the rest are second encoding vectors of negative samples.
For example, suppose the multiple sample audios are N sample audios. The first encoder then produces N first encoding vectors and the second encoder produces N second encoding vectors. During the iterative computation of the loss value of the model, when the iteration is the first round, the loss value of the contrastive loss function may be determined based on the N second encoding vectors obtained in the first round (that is, the N second encoding vectors of the first round stored in the preset queue) and the N first encoding vectors obtained in the first round. For instance, for each of the N first encoding vectors of the first round, a loss value is computed from that first encoding vector and the N second encoding vectors using formula (1) above, giving N loss values, and the N loss values are averaged to obtain the loss value of the first round of iteration. When the iteration is not the first round, for example when it is the second round, the N second encoding vectors obtained in the second round may replace those of the earliest historical round stored in the preset queue, namely the N second encoding vectors obtained in the first round. The preset queue then contains the N replaced second encoding vectors, and the loss value of the contrastive loss function may be determined from the N first encoding vectors obtained in the second round and the N second encoding vectors in the preset queue. The loss value of the second round is obtained in a manner similar to that of the first round, which is not repeated here. By analogy, the loss value corresponding to each round of the multi-round iteration can be obtained.
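The queue behavior described above can be sketched with a bounded double-ended queue. The class name `KeyQueue` and the parameter `max_rounds` are hypothetical; the sketch assumes the queue holds a fixed number of rounds, so that pushing a new round automatically discards the earliest stored round. With `max_rounds=1` it reduces to the example above, where the second round replaces the first.

```python
from collections import deque


class KeyQueue:
    """Stores the second encoding vectors of the most recent rounds (hypothetical helper)."""

    def __init__(self, max_rounds):
        # deque(maxlen=...) drops the oldest entry when a new one is appended to a full queue.
        self._rounds = deque(maxlen=max_rounds)

    def push_round(self, key_vectors):
        """Store the current round's second encoding vectors, evicting the earliest round if full."""
        self._rounds.append(list(key_vectors))

    def all_keys(self):
        """Return every stored second encoding vector, oldest round first."""
        return [key for round_keys in self._rounds for key in round_keys]
```

In each round, the loss would then be computed from the current first encoding vectors and the vectors returned by `all_keys()`.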
In some embodiments, while the encoding parameters of the first encoder are iteratively updated based on the loss value, the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder. In some embodiments, the encoding parameters of the second encoder may be changed according to the following formula (2):
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)$$
where $\theta_k$ denotes the encoding parameters of the second encoder, $\theta_q$ denotes the encoding parameters of the first encoder, and $m$ denotes a constant that may be set according to actual needs, for example $m = 0.999$. Because $m$ is set to a large value, the encoding parameters of the second encoder retain most of their own weight and move toward the encoding parameters of the first encoder with only a small weight, so that the encoding parameters of the second encoder gradually approach those of the first encoder.
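Formula (2) is an element-wise weighted average, which the following sketch applies to parameters stored as flat lists of floats. The function name is illustrative, and representing the parameters as a flat list is an assumption made only to keep the example self-contained.

```python
def momentum_update(theta_k, theta_q, m=0.999):
    """Return m * theta_k + (1 - m) * theta_q, applied element by element."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]
```

If the first encoder's parameters were held fixed, each update would shrink the gap between the two parameter sets by the factor m, so with m = 0.999 roughly 37% of the initial gap would remain after 1000 updates. This illustrates how slowly the second encoder follows the first.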
FIG. 3 is a flowchart of an audio fingerprint extraction method according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, the method includes:
Step 310: acquire audio to be queried.
Step 320: process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried.
In some embodiments, the audio to be queried may be audio for which the encoding vector of the fingerprint feature needs to be obtained, that is, audio for which an audio fingerprint is needed. In some embodiments, the audio to be queried may be distorted audio, that is, audio that has been edited. In some embodiments, edited audio may refer to audio obtained by adjusting one or more of the following adjustment parameters: noise, pitch, speed, filter parameters, echo, the frequency band of boost or attenuation, and audio format. For details on adjusting these parameters, see step 120 above and its related description, which are not repeated here.
In some embodiments, the audio fingerprint extractor may be the trained first encoder obtained through steps 110 to 130 above. For the training process of the first encoder (that is, the audio fingerprint extractor), see FIG. 1 and FIG. 2 above and their related descriptions, which are not repeated here.
In some embodiments, the audio retrieval task may be completed according to the encoding vector of the fingerprint feature of the audio to be queried, that is, according to its audio fingerprint. In some embodiments, other audio that belongs to the same audio as the audio to be queried may be retrieved from a preset database according to the audio fingerprint of the audio to be queried. For example, when the audio to be queried is a distorted song, the audio retrieval task may refer to retrieving the original song of that audio from a music library.
In some embodiments, retrieving from the preset database other audio that belongs to the same audio as the audio to be queried may mean retrieving, according to the audio fingerprint of the audio to be queried, other audio whose similarity to that audio fingerprint satisfies a preset condition. The similarity may be determined from the distance between audio fingerprints, that is, between the encoding vectors of the fingerprint features of the audios. For the distance, see step 210 above and its related description, which are not repeated here. It is worth noting that other audio belonging to the same audio as the audio to be queried may also be retrieved from the database by means other than similarity, and the present disclosure places no restriction on this.
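A similarity-based lookup of this kind might be sketched as follows. The sketch makes several assumptions that are not stated in the disclosure: the database is modelled as an in-memory dictionary from a hypothetical audio identifier to its fingerprint, cosine similarity is used as the similarity measure, and the preset condition is a fixed threshold of 0.8.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_fingerprint, database, threshold=0.8):
    """Return ids whose fingerprint similarity to the query meets the threshold, best match first.

    `database` maps an audio id to its fingerprint (encoding vector).
    """
    scored = [(cosine_similarity(query_fingerprint, fp), audio_id)
              for audio_id, fp in database.items()]
    return [audio_id for score, audio_id in sorted(scored, reverse=True) if score >= threshold]
```

A linear scan like this is only practical for small databases; a large music library would normally use an approximate nearest-neighbor index instead, which is outside the scope of this sketch.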
FIG. 4 is a block diagram of an apparatus for generating an encoder according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, the generating apparatus 400 includes:
a first acquisition module 410, configured to acquire multiple sample audios;
a construction module 420, configured to construct a first group of samples and a second group of samples according to the multiple sample audios, where for each sample in the first group of samples, the second group of samples contains a corresponding positive sample and corresponding negative samples;
a training module 430, configured to perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, where the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio;
where the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
In some embodiments, the construction module 420 is further configured to:
perform a first parameter adjustment and a second parameter adjustment on each of the multiple sample audios to obtain the first group of samples and the second group of samples, where the adjustment parameters and/or adjustment manners corresponding to the first parameter adjustment and the second parameter adjustment are not completely the same;
where each sample in the first group of samples is a sample audio after the first parameter adjustment, each sample in the second group of samples is a sample audio after the second parameter adjustment, and for each sample in the first group of samples, the sample in the second group that corresponds to the same sample audio is the positive sample and the other samples are negative samples.
In some embodiments, the adjustment parameters include, but are not limited to, one or more of the following: noise, pitch, speed, filter parameters, echo, the frequency band of boost or attenuation, and audio format.
In some embodiments, the training module 430 is further configured to:
encode the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encode the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;
iteratively compute the loss value of a contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively update the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder approaches the second encoding vector of the corresponding positive sample and moves away from the second encoding vectors of the corresponding negative samples, where the loss value characterizes the similarity between the first encoding vectors and the second encoding vectors; and
make the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder, until the trained first encoder is obtained.
In some embodiments, the training module 430 is further configured to:
when the iteration is the first round, determine the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round;
when the iteration is not the first round, replace the second encoding vectors corresponding to a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determine the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue;
where the preset queue is used to store the second encoding vectors obtained in each round of the iteration, and the target historical round is the earliest historical round stored in the preset queue.
FIG. 5 is a block diagram of an audio fingerprint extraction apparatus according to an exemplary embodiment of the present disclosure. As shown in FIG. 5, the apparatus 500 includes:
a second acquisition module 510, configured to acquire audio to be queried;
a processing module 520, configured to process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried, where the audio fingerprint extractor is the first encoder trained by the encoder generation method according to the embodiments of the present disclosure.
Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure is shown. The electronic device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 600 may include a processing device (for example, a central processing unit or a graphics processing unit) 601, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows the electronic device 600 with various devices, it should be understood that it is not required to implement or include all of the devices shown. More or fewer devices may alternatively be implemented or included.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device described above, or it may exist separately without being incorporated into the electronic device.
The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a plurality of sample audios; construct a first group of samples and a second group of samples from the plurality of sample audios, wherein for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples; and perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio; wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; and the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
Alternatively, the computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire an audio to be queried; and process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; wherein the audio fingerprint extractor is the first encoder trained according to the encoder generation method described in the embodiments of the present disclosure.
Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules or units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module or unit does not constitute a limitation on the unit itself; for example, a jump module may also be described as "a module for jumping to a next-level page".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a method for generating an encoder, comprising:
acquiring a plurality of sample audios;
constructing a first group of samples and a second group of samples from the plurality of sample audios, wherein for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples;
performing contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio;
wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; and the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein constructing the first group of samples and the second group of samples from the plurality of sample audios comprises:
performing a first parameter adjustment and a second parameter adjustment, respectively, on each of the plurality of sample audios to obtain the first group of samples and the second group of samples, wherein the adjustment parameters and/or adjustment manners corresponding to the first parameter adjustment and the second parameter adjustment are not entirely the same;
wherein each sample in the first group of samples is a sample audio that has undergone the first parameter adjustment, and each sample in the second group of samples is a sample audio that has undergone the second parameter adjustment; and for each sample in the first group of samples, the sample in the second group of samples that corresponds to the same sample audio is the positive sample, and the other samples are negative samples.
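The construction described in Example 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: audio clips are plain lists of floats, the two adjustments are a gain change plus additive noise, and the names `adjust`, `build_groups`, and `positive_and_negatives`, as well as the specific gain and noise values, are hypothetical.

```python
import random

def adjust(audio, gain, noise_std, rng):
    """Scale a clip (list of floats) by `gain` and add Gaussian noise."""
    return [gain * x + rng.gauss(0.0, noise_std) for x in audio]

def build_groups(sample_audios, seed=0):
    """Return (first_group, second_group); index i in both comes from sample_audios[i]."""
    rng = random.Random(seed)
    # Hypothetical settings: the two adjustments differ in gain and noise level.
    first_group = [adjust(a, 0.8, 0.01, rng) for a in sample_audios]
    second_group = [adjust(a, 1.2, 0.05, rng) for a in sample_audios]
    return first_group, second_group

def positive_and_negatives(i, second_group):
    """For first_group[i]: the positive is second_group[i]; all others are negatives."""
    positive = second_group[i]
    negatives = [s for j, s in enumerate(second_group) if j != i]
    return positive, negatives

sample_audios = [[0.0, 0.5, -0.5], [1.0, 1.0, 1.0], [-1.0, 0.25, 0.75]]
first_group, second_group = build_groups(sample_audios)
positive, negatives = positive_and_negatives(1, second_group)
```

Because both groups keep the order of the source audios, the positive for a sample is found by index alone, and no explicit labels are needed.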
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the adjustment parameters include, but are not limited to, one or more of the following: noise, pitch, speed, filter parameters, echo, the frequency band of gain or attenuation, and audio format.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, wherein performing contrastive training on the first encoder and the second encoder according to the first group of samples and the second group of samples comprises:
encoding the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encoding the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;
iteratively computing a loss value of a contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively updating the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder approaches the second encoding vector of the corresponding positive sample and moves away from the second encoding vectors of the corresponding negative samples, wherein the loss value characterizes the similarity between the first encoding vectors and the second encoding vectors; and
causing the encoding parameters of the second encoder to gradually approach the encoding parameters of the first encoder, until the trained first encoder is obtained.
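The two mechanisms in Example 4 can be illustrated with a short sketch. This passage does not fix a particular loss or update rule, so the code below assumes an InfoNCE-style loss over dot-product similarities and an exponential-moving-average (momentum) update, which is one common way to make the second encoder's parameters drift toward those of the first. The names `info_nce` and `momentum_update` and the parameters `temperature` and `momentum` are illustrative, not taken from the disclosure.

```python
import math

def dot(u, v):
    """Dot-product similarity between two encoding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.07):
    """Assumed contrastive loss: small when `query` is near `positive` and far from `negatives`."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

def momentum_update(second_params, first_params, momentum=0.999):
    """Assumed update rule: move the second encoder's parameters slightly toward the first's."""
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(second_params, first_params)]

q = [1.0, 0.0]
loss_aligned = info_nce(q, [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce(q, [0.0, 1.0], [[1.0, 0.0]])
second_params = momentum_update([0.0, 2.0], [1.0, 2.0], momentum=0.9)
```

In this sketch only the first encoder would receive gradient updates from the loss; the second encoder changes solely through `momentum_update`, which keeps its outputs relatively stable from round to round.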
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, wherein iteratively computing the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors comprises:
when the iteration is the first round, determining the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round;
when the iteration is not the first round, replacing the second encoding vectors corresponding to a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determining the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue;
wherein the preset queue is used to store the second encoding vectors obtained in each round of the iterative computation, and the target historical round is the earliest round stored in the preset queue.
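The preset queue of Example 5 behaves like a fixed-capacity first-in, first-out buffer of per-round vectors. The sketch below is one possible reading, assuming the queue simply fills up until it reaches capacity and evicts the earliest round afterwards; the class name `SecondVectorQueue` and the `max_rounds` parameter are hypothetical.

```python
from collections import deque

class SecondVectorQueue:
    """Keeps the second encoding vectors of the most recent `max_rounds` rounds."""

    def __init__(self, max_rounds):
        # A deque with maxlen drops its oldest entry automatically when full.
        self._rounds = deque(maxlen=max_rounds)

    def push_round(self, second_vectors):
        """Store the current round; the earliest stored round is evicted if the queue is full."""
        self._rounds.append(list(second_vectors))

    def vectors(self):
        """All stored second encoding vectors, oldest round first."""
        return [v for round_vectors in self._rounds for v in round_vectors]

queue = SecondVectorQueue(max_rounds=2)
queue.push_round([[0.1, 0.9]])  # round 1
queue.push_round([[0.2, 0.8]])  # round 2
queue.push_round([[0.3, 0.7]])  # round 3 replaces round 1
stored = queue.vectors()
```

Keeping vectors from earlier rounds lets each first encoding vector be compared against more negatives than a single round provides, without encoding those samples again.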
According to one or more embodiments of the present disclosure, Example 6 provides an audio fingerprint extraction method, comprising: acquiring an audio to be queried; and processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; wherein the audio fingerprint extractor is the first encoder trained according to the encoder generation method of any one of Examples 1-5.
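The extraction step of Example 6 amounts to a single forward pass through the trained first encoder. The sketch below only shows the call shape: `toy_encoder` is a hypothetical stand-in for a trained encoder, and the L2 normalisation of the output is an added assumption rather than something stated in the disclosure.

```python
import math

def toy_encoder(audio):
    """Hypothetical stand-in for the trained first encoder: maps a clip to a 2-D vector."""
    mean = sum(audio) / len(audio)
    energy = sum(x * x for x in audio) / len(audio)
    return [mean, energy]

def extract_fingerprint(audio, encoder):
    """Encode the query audio and L2-normalise the resulting vector (assumed post-processing)."""
    vector = encoder(audio)
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector

fingerprint = extract_fingerprint([0.5, -0.5, 1.0, 0.0], toy_encoder)
```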
According to one or more embodiments of the present disclosure, Example 7 provides an apparatus for generating an encoder, comprising:
a first acquisition module configured to acquire a plurality of sample audios;
a construction module configured to construct a first group of samples and a second group of samples from the plurality of sample audios, wherein for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples;
a training module configured to perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio;
wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; and the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
According to one or more embodiments of the present disclosure, Example 8 provides the apparatus of Example 7, wherein the construction module is further configured to:
perform a first parameter adjustment and a second parameter adjustment, respectively, on each of the plurality of sample audios to obtain the first group of samples and the second group of samples, wherein the adjustment parameters and/or adjustment manners corresponding to the first parameter adjustment and the second parameter adjustment are not entirely the same;
wherein each sample in the first group of samples is a sample audio that has undergone the first parameter adjustment, and each sample in the second group of samples is a sample audio that has undergone the second parameter adjustment; and for each sample in the first group of samples, the sample in the second group of samples that corresponds to the same sample audio is the positive sample, and the other samples are negative samples.
According to one or more embodiments of the present disclosure, Example 9 provides the apparatus of Example 8, wherein the adjustment parameters include, but are not limited to, at least one of the following: noise, pitch, speed, filter parameters, echo, the frequency band of gain or attenuation, and audio format.
According to one or more embodiments of the present disclosure, Example 10 provides the apparatus of Example 7, wherein the training module is further configured to:
encode the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encode the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;
iteratively compute a loss value of a contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively update the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder approaches the second encoding vector of the corresponding positive sample and moves away from the second encoding vectors of the corresponding negative samples, wherein the loss value characterizes the similarity between the first encoding vectors and the second encoding vectors; and
cause the encoding parameters of the second encoder to gradually approach the encoding parameters of the first encoder, until the trained first encoder is obtained.
According to one or more embodiments of the present disclosure, Example 11 provides the apparatus of Example 10, wherein the training module is further configured to:
when the iteration is the first round, determine the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round;
when the iteration is not the first round, replace the second encoding vectors corresponding to a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determine the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue;
wherein the preset queue is used to store the second encoding vectors obtained in each round of the iterative computation, and the target historical round is the earliest round stored in the preset queue.
According to one or more embodiments of the present disclosure, Example 12 provides an audio fingerprint extraction apparatus, comprising:
a second acquisition module configured to acquire an audio to be queried;
a processing module configured to process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried; wherein the audio fingerprint extractor is the first encoder trained according to the encoder generation method of any one of Examples 1-5.
According to one or more embodiments of the present disclosure, Example 13 provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method of any one of Examples 1-6.
According to one or more embodiments of the present disclosure, Example 14 provides an electronic device, comprising:
a storage device on which at least one computer program is stored;
at least one processing device configured to execute the at least one computer program in the storage device to implement the steps of the method of any one of Examples 1-6.
The above description covers only preferred embodiments of the present disclosure and explains the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
In addition, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (10)

  1. A method for generating an encoder, comprising:
    acquiring a plurality of sample audios;
    constructing a first group of samples and a second group of samples from the plurality of sample audios, wherein for each sample in the first group of samples, corresponding positive and negative samples exist in the second group of samples;
    performing contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of an audio;
    wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; and the contrastive training is used to bring the first encoding vector output by the first encoder close to the second encoding vector of the corresponding positive sample and away from the second encoding vectors of the corresponding negative samples, while the encoding parameters of the second encoder gradually approach the encoding parameters of the first encoder.
  2. The method according to claim 1, wherein constructing the first group of samples and the second group of samples from the plurality of sample audios comprises:
    performing a first parameter adjustment and a second parameter adjustment, respectively, on each of the plurality of sample audios to obtain the first group of samples and the second group of samples, wherein the adjustment parameters and/or adjustment manners corresponding to the first parameter adjustment and the second parameter adjustment are not entirely the same;
    wherein each sample in the first group of samples is a sample audio that has undergone the first parameter adjustment, and each sample in the second group of samples is a sample audio that has undergone the second parameter adjustment; and for each sample in the first group of samples, the sample in the second group of samples that corresponds to the same sample audio is the positive sample, and the other samples are negative samples.
  3. The method according to claim 2, wherein the adjustment parameters include, but are not limited to, at least one of the following: noise, pitch, speed, filter parameters, echo, the frequency band of gain or attenuation, and audio format.
  4. The method according to claim 1, wherein performing contrastive training on the first encoder and the second encoder according to the first group of samples and the second group of samples comprises:
    encoding the samples in the first group of samples with the first encoder to obtain a first encoding vector corresponding to each sample, and encoding the samples in the second group of samples with the second encoder to obtain a second encoding vector corresponding to each sample;
    iteratively computing a loss value of a contrastive loss function based on the first encoding vectors and the second encoding vectors, and iteratively updating the encoding parameters of the first encoder based on the loss value, so that the first encoding vector output by the first encoder approaches the second encoding vector of the corresponding positive sample and moves away from the second encoding vectors of the corresponding negative samples, wherein the loss value characterizes the similarity between the first encoding vectors and the second encoding vectors; and
    causing the encoding parameters of the second encoder to gradually approach the encoding parameters of the first encoder, until the trained first encoder is obtained.
  5. The method according to claim 4, wherein iteratively computing the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors comprises:
    in the case that the iteration is the first round, determining the loss value of the contrastive loss function based on the first encoding vectors and the second encoding vectors obtained in the first round;
    in the case that the iteration is not the first round, replacing the second encoding vectors corresponding to a target historical round in a preset queue with the second encoding vectors obtained in the current round, and determining the loss value of the contrastive loss function based on the first encoding vectors obtained in the current round and the second encoding vectors in the preset queue;
    wherein the preset queue is used to store the second encoding vectors obtained in each round of the iteration, and the target historical round is the earliest round stored in the preset queue.
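The preset queue of claim 5 can be sketched as a fixed-capacity first-in, first-out buffer of per-round second encoding vectors. This is an illustrative sketch only: the class name `KeyQueue`, the `max_rounds` capacity, and the choice to drop the earliest round only once the queue is full are assumptions, since the claim does not fix the queue length or its behavior before it fills.

```python
from collections import deque

import numpy as np


class KeyQueue:
    """Fixed-capacity FIFO holding the second encoding vectors of recent rounds."""

    def __init__(self, max_rounds):
        # A deque with maxlen silently discards the oldest entry when full,
        # which plays the role of replacing the earliest stored round.
        self._rounds = deque(maxlen=max_rounds)

    def push(self, round_vectors):
        """Store the second encoding vectors produced in the current round."""
        self._rounds.append(np.asarray(round_vectors))

    def all_vectors(self):
        """Return every stored second encoding vector as one (N, d) array."""
        return np.concatenate(list(self._rounds), axis=0)

    def __len__(self):
        return len(self._rounds)
```

In a training loop built on this sketch, each round would push the current second encoding vectors and then compute the contrastive loss between the current first encoding vectors and `all_vectors()`, so earlier rounds supply additional negatives without re-encoding them.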
  6. An audio fingerprint extraction method, comprising:
    acquiring audio to be queried;
    processing the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried, the audio fingerprint extractor being the first encoder trained by the encoder generation method according to any one of claims 1 to 5.
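The extraction step of claim 6 amounts to running the trained first encoder on the query audio and taking its output vector as the fingerprint. The sketch below is a hedged illustration, not the patent's implementation: `extract_fingerprint` is a hypothetical helper, the L2 normalization is an added assumption, and `toy_encoder` is a random projection standing in for a trained first encoder (the 8000-sample input and 128-dimensional output are arbitrary choices).

```python
import numpy as np


def extract_fingerprint(encoder, audio):
    """Encode query audio and return the encoding vector as its fingerprint.

    encoder: any callable mapping an audio array to a 1-D vector
             (stands in for the trained first encoder).
    """
    vector = np.asarray(encoder(audio), dtype=np.float64)
    norm = np.linalg.norm(vector)
    # Normalizing to unit length is an assumption made for this sketch.
    return vector / norm if norm > 0 else vector


# Hypothetical stand-in for a trained encoder: a fixed random projection.
rng = np.random.default_rng(0)
projection = rng.standard_normal((8000, 128))


def toy_encoder(samples):
    return samples @ projection


query = rng.standard_normal(8000)  # placeholder for the audio to be queried
fingerprint = extract_fingerprint(toy_encoder, query)
```

With unit-length fingerprints, a dot product between two fingerprints gives their cosine similarity, which is one common way such vectors could be compared against a reference collection.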
  7. An encoder generation apparatus, comprising:
    a first acquisition module configured to acquire a plurality of sample audios;
    a construction module configured to construct a first group of samples and a second group of samples from the plurality of sample audios, wherein, for each sample in the first group of samples, a corresponding positive sample and corresponding negative samples exist in the second group of samples;
    a training module configured to perform contrastive training on a first encoder and a second encoder according to the first group of samples and the second group of samples, wherein the trained first encoder can serve as an audio fingerprint extractor that outputs an encoding vector serving as the fingerprint feature of audio;
    wherein the first encoder is used to encode the samples in the first group of samples to obtain a first encoding vector corresponding to each sample, and the second encoder is used to encode the samples in the second group of samples to obtain a second encoding vector corresponding to each sample; the contrastive training is used to make the first encoding vector output by the first encoder approach the second encoding vector of the corresponding positive sample and move away from the second encoding vector of the corresponding negative sample, with the encoding parameters of the second encoder gradually approaching the encoding parameters of the first encoder.
  8. An audio fingerprint extraction apparatus, comprising:
    a second acquisition module configured to acquire audio to be queried;
    a processing module configured to process the audio to be queried with an audio fingerprint extractor to obtain an encoding vector serving as the fingerprint feature of the audio to be queried, the audio fingerprint extractor being the first encoder trained by the encoder generation method according to any one of claims 1 to 5.
  9. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing device, implements the steps of the method according to any one of claims 1 to 6.
  10. An electronic device, comprising:
    a storage device having at least one computer program stored thereon;
    at least one processing device configured to execute the at least one computer program in the storage device to implement the steps of the method according to any one of claims 1 to 6.
PCT/CN2023/070796 2022-01-14 2023-01-06 Encoder generation method, fingerprint extraction method, medium, and electronic device WO2023134549A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210045056.3A CN114443891B (en) 2022-01-14 2022-01-14 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN202210045056.3 2022-01-14

Publications (1)

Publication Number Publication Date
WO2023134549A1 (en) 2023-07-20

Family

ID=81367987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070796 WO2023134549A1 (en) 2022-01-14 2023-01-06 Encoder generation method, fingerprint extraction method, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN114443891B (en)
WO (1) WO2023134549A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443891B (en) * 2022-01-14 2022-12-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN116069903A (en) * 2023-03-02 2023-05-05 特斯联科技集团有限公司 Class search method, system, electronic equipment and storage medium
CN116758936B (en) * 2023-08-18 2023-11-07 腾讯科技(深圳)有限公司 Processing method and device of audio fingerprint feature extraction model and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930992A (en) * 2020-08-14 2020-11-13 腾讯科技(深圳)有限公司 Neural network training method and device and electronic equipment
CN112905840A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video processing method, device, storage medium and equipment
US20210295091A1 (en) * 2020-03-19 2021-09-23 Salesforce.Com, Inc. Unsupervised representation learning with contrastive prototypes
CN113887215A (en) * 2021-10-18 2022-01-04 平安科技(深圳)有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN114443891A (en) * 2022-01-14 2022-05-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7516074B2 (en) * 2005-09-01 2009-04-07 Auditude, Inc. Extraction and matching of characteristic fingerprints from audio signals
US9684715B1 (en) * 2012-03-08 2017-06-20 Google Inc. Audio identification using ordinal transformation
SG10201608643PA (en) * 2013-01-29 2016-12-29 Fraunhofer Ges Forschung Decoder for Generating a Frequency Enhanced Audio Signal, Method of Decoding, Encoder for Generating an Encoded Signal and Method of Encoding Using Compact Selection Side Information
CN107731220B (en) * 2017-10-18 2019-01-22 北京达佳互联信息技术有限公司 Audio identification methods, device and server
US10956704B2 (en) * 2018-11-07 2021-03-23 Advanced New Technologies Co., Ltd. Neural networks for biometric recognition
CN110136744B (en) * 2019-05-24 2021-03-26 腾讯音乐娱乐科技(深圳)有限公司 Audio fingerprint generation method, equipment and storage medium
US11335347B2 (en) * 2019-06-03 2022-05-17 Amazon Technologies, Inc. Multiple classifications of audio data
CN114556473A (en) * 2019-10-19 2022-05-27 谷歌有限责任公司 Self-supervised pitch estimation
CN111243620B (en) * 2020-01-07 2022-07-19 腾讯科技(深圳)有限公司 Voice separation model training method and device, storage medium and computer equipment
CN111428078B (en) * 2020-03-20 2023-05-23 腾讯科技(深圳)有限公司 Audio fingerprint coding method, device, computer equipment and storage medium
CN113821658A (en) * 2021-06-30 2021-12-21 腾讯科技(深圳)有限公司 Method, device and equipment for training encoder and storage medium
CN113724695B (en) * 2021-08-30 2023-08-01 深圳平安智慧医健科技有限公司 Electronic medical record generation method, device, equipment and medium based on artificial intelligence
CN113870845A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Speech recognition model training method, device, equipment and medium
CN113889089A (en) * 2021-09-29 2022-01-04 北京百度网讯科技有限公司 Method and device for acquiring voice recognition model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114443891B (en) 2022-12-06
CN114443891A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
WO2023134549A1 (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
WO2020024556A1 (en) Music quality evaluation method and apparatus, and computer device and storage medium
CN103038765B (en) Method and apparatus for being adapted to situational model
WO2022105545A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
WO2022105553A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
WO2022156464A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
WO2023273611A1 (en) Speech recognition model training method and apparatus, speech recognition method and apparatus, medium, and device
WO2023273579A1 (en) Model training method and apparatus, speech recognition method and apparatus, and medium and device
WO2023134550A1 (en) Feature encoding model generation method, audio determination method, and related device
WO2022037388A1 (en) Voice generation method and apparatus, device, and computer readable medium
WO2022156413A1 (en) Speech style migration method and apparatus, readable medium and electronic device
WO2023273596A1 (en) Method and apparatus for determining text correlation, readable medium, and electronic device
CN111159464B (en) Audio clip detection method and related equipment
CN111428078B (en) Audio fingerprint coding method, device, computer equipment and storage medium
CN111597825A (en) Voice translation method and device, readable medium and electronic equipment
CN111462727A (en) Method, apparatus, electronic device and computer readable medium for generating speech
CN108847251B (en) Voice duplicate removal method, device, server and storage medium
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN111898753A (en) Music transcription model training method, music transcription method and corresponding device
CN110955789B (en) Multimedia data processing method and equipment
WO2023155713A1 (en) Method and apparatus for marking speaker, and electronic device
CN112382266A (en) Voice synthesis method and device, electronic equipment and storage medium
CN111462775A (en) Audio similarity determination method, device, server and medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23739893

Country of ref document: EP

Kind code of ref document: A1