CN113345444B - Speaker confirmation method and system - Google Patents

Speaker confirmation method and system

Info

Publication number
CN113345444B
CN113345444B (application CN202110496856.2A)
Authority
CN
China
Prior art keywords
speaker
nested
vector
residual
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110496856.2A
Other languages
Chinese (zh)
Other versions
CN113345444A (en)
Inventor
陈增照
郑秋雨
何秀玲
戴志诚
张婧
孟秉恒
李佳文
吴潇楠
朱胜虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202110496856.2A priority Critical patent/CN113345444B/en
Publication of CN113345444A publication Critical patent/CN113345444A/en
Application granted granted Critical
Publication of CN113345444B publication Critical patent/CN113345444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a speaker confirmation (verification) method and system, comprising the following steps: preprocessing the audio information of a speaker and converting it into data in a preset format; inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; generating an utterance-level speaker vector from the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a pre-acquired target speaker vector to determine whether the speaker is the target speaker. The invention provides a deep nested residual neural network based on a spatial attention mechanism that can extract the voiceprint features of a speaker more accurately.

Description

Speaker confirmation method and system
Technical Field
The invention belongs to the field of speaker recognition, and particularly relates to a speaker confirmation method and a speaker confirmation system.
Background
Voiceprint is a general term for the speech features embedded in speech that characterize and identify a speaker, and for the speech models built from those features. Language is a very important medium of interpersonal communication. A person's voice is difficult to change after adulthood: because pronunciation organs and articulation habits differ, the vocal characteristics of each person are unique, and a speaker's most basic pronunciation and vocal tract characteristics are difficult to imitate. Therefore, exploiting the uniqueness and short-time stationarity of the voiceprint, a model can be built from speech and used for identity authentication.
Voiceprint recognition, also called speaker recognition, is the process of determining a target speaker from the voiceprint characteristics of the speech to be recognized. Speaker recognition tasks fall into two basic categories: speaker identification and speaker verification. Speaker identification determines which speaker in a trained corpus a given voiceprint sample belongs to, a one-to-many selection problem; speaker verification (confirmation) determines whether a given voiceprint sample matches the claimed identity of a target speaker, a one-to-one decision problem.
Although many researchers at home and abroad have proposed a series of mature speaker verification methods, such as the GMM-UBM (Gaussian mixture model-universal background model), the LSTM (long short-term memory) network model, and CNN (convolutional neural network) models, many problems and much room for improvement remain in research methods, model performance, and application scenarios.
The defects of the existing speaker verification methods mainly include the following points: 1. Many methods are still based on the traditional GMM-UBM model, and their technical means remain at the most traditional model-matching stage; as technology develops rapidly, such traditional methods are complex and struggle to process large amounts of data. 2. A speaker's voiceprint carries much useful speaker information, yet speaker audio is highly susceptible to many factors, including the environment, the recording device, and the speaker's own state; existing deep learning speaker verification algorithms cannot extract speaker characteristics well enough, and their overall effect needs improvement. 3. Speaker verification across different scenarios and different languages is a current challenge for voiceprint recognition; existing speaker verification methods cannot minimize the influence of cross-language data on recognition performance, and their results in cross-language recognition research are not ideal.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a speaker verification method and system to solve the problems of low recognition accuracy and large network scale in existing speaker recognition neural network systems.
To achieve the above object, in a first aspect, the present invention provides a speaker verification method, including the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating a speaker vector of an utterance level based on the speaker vector of the frame level, and calculating cosine similarity of the speaker vector of the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
In an optional example, the preprocessing is performed on the audio information of the speaker, and the audio information is converted into data in a preset format, specifically:
converting WAV format audio files of speakers into flac format files by adopting an audio conversion technology, and preprocessing the flac format files to obtain npy format data containing all information of the speakers.
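For illustration only, this preprocessing pipeline might be sketched in Python as follows. The librosa and soundfile calls, the 16 kHz sample rate, the 25 ms/10 ms framing, and the 64 mel bins are assumptions of the sketch; the description above specifies only the WAV-to-flac conversion and the npy output format.

```python
# Illustrative preprocessing sketch (parameters assumed, not from the patent):
# convert a speaker's WAV file to flac, extract log-mel filterbank (fbank)
# features, and save them in .npy format for the network.
import librosa
import numpy as np
import soundfile as sf

def preprocess(wav_path: str, flac_path: str, npy_path: str, sr: int = 16000) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)       # load and resample the WAV audio
    sf.write(flac_path, audio, sr, format="FLAC")  # re-encode as flac, as in the pipeline
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=64
    )                                              # 25 ms windows with a 10 ms hop (assumed)
    fbank = librosa.power_to_db(mel).T             # (frames, 64) log-mel fbank features
    np.save(npy_path, fbank)                       # persist as .npy for the network input
    return fbank
```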
In an optional example, each nested residual block includes two sub-residual blocks, each sub-residual block includes two cells, and each cell is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
In an alternative example, a spatial attention mechanism is introduced after the nested residual block, and a sigmoid function is used in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector.
In an optional example, the calculating a cosine similarity between the speaker vector at the utterance level and the target speaker vector to determine whether the speaker is the target speaker specifically includes:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
In a second aspect, the present invention provides a speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
In an optional example, the speaker audio determining unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technology, and pre-processes the flac format file to obtain npy format data containing all information of the speaker.
In an alternative example, each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, and each cell is a building block; a convolution layer is arranged in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
In an alternative example, the frame-level vector determining unit introduces a spatial attention mechanism after the nested residual block and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector.
In an optional example, the speaker determination unit sets a threshold to the probability value of the cosine similarity, and determines that the speaker is the target speaker when the probability value of the cosine similarity is greater than the threshold, otherwise determines that the speaker is not the target speaker.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a speaker identification method and a speaker identification system, and provides a novel nested residual error neural network. After demonstrating the efficiency of ResNet in speaker verification, we conducted a series of explorations on this study. The concept of using the residual of the residuals is presented herein for the first time; the present invention proposes the use of a spatial attention mechanism in the speaker verification task. The invention adopts a method of adding an attention mechanism to obtain more valuable voiceprint information from a voiceprint energy spectrum, which is to use the spatial attention for picture processing for the processing of the voiceprint information for the first time. The invention provides a speaker confirmation network model with better performance, and experiments prove that the method is superior to the prior art, and then an attention mechanism is added on the basis of the network model to further improve the system performance.
The invention provides a speaker confirmation method and a speaker confirmation system, and provides a deep nested residual error neural network based on a space attention mechanism, wherein the deep neural network is used for more accurately extracting the voiceprint characteristics of a speaker. Experimental results show that the method has high learning capacity and can exceed the performance of the traditional neural network method. Compared with other latest methods, the accuracy of the experimental result on the English public data set LibriSpeech is improved by 6%, and the error rate is also low. The invention also proves that the method has good cross-language adaptability through the training performance on the Chinese data set AISHELL.
Drawings
Fig. 1 is a flow chart of a speaker verification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speaker verification method provided by an embodiment of the present invention;
FIG. 3 is a diagram of nested residual blocks provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a nested residual neural network based on an attention mechanism provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a speaker spatial attention mechanism provided by an embodiment of the present invention;
FIG. 6 shows the performance of different models provided by embodiments of the present invention on an English data set;
FIG. 7 shows the performance of different models on a Chinese dataset according to an embodiment of the present invention;
FIG. 8 is a comparison of performance on different data sets before and after the attention mechanism is added, as provided by an embodiment of the present invention;
FIG. 9 is a graph comparing the performance of different models on Chinese-English cross-language data sets according to an embodiment of the present invention;
fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to extract the fbank voiceprint features of a speaker with a deep nested residual neural network based on a spatial attention mechanism. During training, the fbank voiceprint features are input in the form of sample pairs; the speaker is then represented by the frame-level speaker vector output by the last layer of the deep neural network, and cosine similarity is used to judge whether the current speaker is the target speaker.
Fig. 1 is a flowchart of a speaker verification method according to an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s101, preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
s102, inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual error neural network based on a space attention mechanism to obtain a speaker vector at a frame level; the deep nested residual error neural network based on the spatial attention mechanism comprises: four layers of nested residual error neural networks each comprising two nested residual error blocks and a spatial attention mechanism; introducing a spatial attention mechanism after nesting a residual neural network, wherein the spatial attention mechanism introduces average pooling and maximum pooling in an attention module based on spatial dimensions, combines two parts of pooling results to retain useful information and reduce parameter scale, and uses a sigmoid function in an activation layer of the attention module to obtain a speaker vector at a frame level;
s103, generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating the cosine similarity between the speaker vector at the utterance level and a target speaker vector to judge whether the speaker is a target speaker; the target speaker vector is pre-acquired.
The speaker verification method based on the spatial attention mechanism and the deep nested residual neural network is divided into three parts: audio preprocessing, speaker voiceprint feature extraction, and speaker verification; the overall processing flow is shown in FIG. 2. First, the WAV audio files with speaker labels in the public dataset LibriSpeech are converted into flac files by an audio conversion technique, and the flac files are preprocessed to obtain npy files containing all the speaker information. The npy file is then used as the input to the deep nested residual neural network, and a frame-level vector representation of the speaker is obtained through a series of convolution and pooling operations. Training is carried out in the form of AP and AN sample pairs, where AP denotes anchor-positive pairs and AN denotes anchor-negative pairs. The deep nested neural network improves the model's performance on the speaker verification task by continuously learning the vector similarity within the AP and AN pairs.
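The loss used to learn the similarity within the AP and AN pairs is not specified here; a triplet-style cosine margin loss is one common realization and is sketched below under that assumption (the 0.2 margin is likewise illustrative).

```python
# Hedged sketch of learning from AP/AN sample pairs: a triplet-style cosine
# margin loss (assumed; the patent does not specify the loss function).
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    # Inputs: (batch, emb_dim) utterance-level speaker vectors.
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)  # similarity within AP pairs
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)  # similarity within AN pairs
    # Push AP similarity above AN similarity by at least the margin.
    return F.relu(sim_an - sim_ap + margin).mean()
```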
The invention proposes a nested residual block suitable for the speaker verification task. The detailed network structure is shown in FIG. 3, where ReLU denotes the rectified linear unit, a commonly used activation function, and NR-block denotes the structure of the nested residual block proposed in the present invention.
Two nested residual blocks are stacked in the architecture of the invention, and each nested residual block is formed by stacking two sub-residual blocks. Each sub-residual block contains two cells, each cell being a building block. The present invention places one convolutional layer in front of every two nested residual blocks, with the kernel size set to 5 and the stride set to 2, which copes well with the change caused by the increase in the number of channels.
In the nested residual neural network, H(x) is regarded as the underlying mapping to be fitted by the stacked layers, where x represents the input of the first layer in the nested residual block. In the first building block, the invention assumes the residual function of equation (1); equation (2) is then obtained in the second residual block. The invention thus implements the stacking function of the nested residual block, whose output can be used as the input of the next nested residual block. Changing the objective function from H(x) to F(x) greatly reduces the difficulty of network learning.
H₁(x) = F₁(x) + x (1)
H₂(x) = F₂(x) + H₁(x) (2)
H(x) = H₂(x) + x (3)
where subscript 1 denotes the first sub-residual block in the proposed nested residual block and subscript 2 denotes the second.
Specifically, the nested residual neural network is formed by stacking four identical network layers, each containing two nested residual blocks, and each nested residual block contains two sub-residual blocks. In the nested network, the output H(x) of the first nested residual block serves as the input of the second nested residual block, equations (1) to (3) are executed again, and so on, until the output of the last nested residual block is obtained. This output of the whole stack in the nested residual neural network can be expressed as V, the frame-level speaker vector referred to below.
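A minimal PyTorch sketch of one nested residual block (NR-block) implementing equations (1) to (3) is given below. Only the nested skip connections are taken from the description above; the 3x3 kernels, batch normalization, and ReLU placement are assumptions of the sketch.

```python
# Hedged sketch of an NR-block: H1(x) = F1(x) + x, H2(x) = F2(x) + H1(x),
# H(x) = H2(x) + x, where each F_i is a sub-residual block of two conv cells.
import torch
import torch.nn as nn

class SubResidualBody(nn.Module):
    """The residual function F_i: two convolutional cells. The skip connection
    itself is added by the caller, per equations (1)-(3)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)  # F_i(x)

class NestedResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = SubResidualBody(channels)
        self.f2 = SubResidualBody(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.relu(self.f1(x) + x)    # H1(x) = F1(x) + x
        h2 = self.relu(self.f2(h1) + h1)  # H2(x) = F2(x) + H1(x)
        return self.relu(h2 + x)          # H(x)  = H2(x) + x
```

Four such layers, each placing two of these blocks behind a kernel-5, stride-2 convolution, would then form the frame-level feature extractor described above.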
The speaker vector output by the nested residual neural network acting as the frame-level feature extractor can be represented as V = [v₁ v₂ … v_d], where vₜ ∈ R^D and R^D is the D-dimensional vector space; v denotes the vector corresponding to each frame, and d denotes the input frame size of the layer. The study underlying the present invention found that using a deep neural network alone can cause gradient explosion and vanishing gradients. Moreover, when voiceprint information is transmitted in the form of a voiceprint energy spectrum, trying to use all of the information indiscriminately would undoubtedly increase the workload of the system and may be unreliable.
Therefore, FIG. 4 shows the structure of the nested residual neural network based on the attention mechanism according to an embodiment of the present invention, and FIG. 5 shows the structure of the speaker spatial attention mechanism. The present invention introduces a spatial attention mechanism after the nested residual blocks: average pooling and maximum pooling are applied in the attention module along the spatial dimensions, and the two are merged in equation (4), preserving useful information and reducing the parameter scale. The dimensions of the features are then reduced to a single channel using a two-dimensional convolutional layer. A sigmoid function is used in the activation layer to obtain the spatial attention map of the speaker. Finally, equation (6) multiplies the attention map with the features input to the attention module. At this point, the present invention obtains the frame-level speaker feature vector.
F″ = f{avg_pool(V), max_pool(V)} (4)
F′ = σ(F″) (5)
F = Multiply(V, F′) (6)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector that is finally output, and is the frame-level speaker vector used later for the cosine similarity decision.
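A minimal PyTorch sketch of this spatial attention step, following the CBAM-style formulation that equations (4) to (6) describe, is shown below; the 7x7 convolution kernel size is an assumption of the sketch.

```python
# Hedged sketch of the spatial attention module: channel-wise average and max
# pooling are merged by a 2D convolution into a single-channel map (F''), a
# sigmoid gives the attention map (F'), which is multiplied element-wise with
# the input V to give the frame-level features (F).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, channels, freq, time) output of the nested residual network
        avg = torch.mean(v, dim=1, keepdim=True)     # avg_pool(V) over channels
        mx, _ = torch.max(v, dim=1, keepdim=True)    # max_pool(V) over channels
        f2 = self.conv(torch.cat([avg, mx], dim=1))  # F'' = f{avg_pool(V), max_pool(V)}
        f1 = self.sigmoid(f2)                        # F'  = sigma(F'')
        return v * f1                                # F   = Multiply(V, F')
```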
Specifically, the frame-level speaker vectors are input into a dimension-reduction layer and then averaged over the time dimension to generate the utterance-level speaker vector. That is, the utterance-level speaker vector is generated from the frame-level speaker vectors through dimension reduction, a fully connected layer, averaging, and length normalization.
The fully-connected layer projects the utterance level representation into a 512-dimensional speaker vector. We normalize the features by a length normalization operation and use cosine similarity in the objective function:
cos(X, Y) = XᵀY
where X and Y are two different utterance-level speaker vectors. The probability that X and Y are similar is modeled by the cosine similarity formula above, and whether X and Y belong to the same speaker is judged by setting a threshold on this similarity probability.
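For illustration, this back end might be sketched as follows; the input frame dimension and the 0.5 decision threshold are assumptions of the sketch, while the 512-dimensional projection, time averaging, length normalization, and cosine scoring follow the description above.

```python
# Hedged sketch of the utterance-level back end and the verification decision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceHead(nn.Module):
    def __init__(self, frame_dim: int, emb_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(frame_dim, emb_dim)  # fully connected projection

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, frame_dim) frame-level speaker vectors
        x = frames.mean(dim=0)              # average over the time dimension
        x = self.fc(x)                      # 512-dimensional utterance-level vector
        return F.normalize(x, p=2, dim=-1)  # length normalization

def is_target_speaker(x: torch.Tensor, y: torch.Tensor, threshold: float = 0.5) -> bool:
    # With length-normalized vectors, cos(X, Y) = X^T Y.
    return float(torch.dot(x, y)) > threshold
```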
Four experiments were performed according to the different corpora and whether the attention mechanism was added. The experimental comparison shows that the method has stronger learning ability and can exceed the performance of existing neural network methods.
In fig. 6 to 9, the ordinate ACC represents the accuracy, and the abscissa epoch represents the number of verification rounds.
FIG. 6 shows the experimental results of five different network models on the English public dataset LibriSpeech. One experiment, baseResNet, serves as the control for the present method. On the basis of this comparison experiment, the invention fine-tunes the network structure in different ways to observe the influence of different structures on the speaker verification task. The Proposed-NResNet model in FIG. 6 refers to the nested residual neural network model proposed in the present invention. In the control experiment, the residual block is stacked three times, and ResNet-bd adjusts the positions of the residual blocks in the network. Nesting and stacking two residual blocks in NResNet is one of the innovations of the invention. From the experimental results, baseResNet, ResNet-bd3, and Proposed-NResNet can all achieve reasonable performance, but the Proposed-NResNet model provided by the invention is clearly more stable and more accurate.
To make the study comprehensive, the invention selects the best three sets of experiments in FIG. 7 and then examines the results on the Chinese public dataset AISHELL. The Proposed-NResNet model in FIG. 7 refers to the nested residual neural network model proposed by the present invention. The invention takes the audio of 151 AISHELL speakers as the training set and the audio of 40 speakers as the validation set to verify that the method described here is also feasible for Chinese. The data must first be processed into the format required by the neural network. As the results in FIG. 7 show, the method of the present invention achieves excellent results on both the Chinese and English datasets.
Adding the speaker spatial attention mechanism to the deep nested neural network is the second innovation of this work. The invention experimentally compares the performance of NResNet with the spatial attention mechanism added under different language environments, observing the effect of each group separately; the details can be analyzed in FIG. 8. The Proposed-NResNet+SA model in FIG. 8 refers to the combined model of the nested residual neural network and the speaker attention mechanism proposed by the present invention. Notably, the accuracy of the proposed NResNet+SA reaches the best performance, approaching 0.99, a significant advance over the model without attention, which further illustrates the effectiveness of the attention mechanism proposed here for the speaker verification task.
Furthermore, the invention compares the effect of the method on cross-language datasets in FIG. 9. The Proposed-NResNet model in FIG. 9 refers to the nested residual neural network model proposed by the present invention. In this experiment the invention uses Train-clean-100 from LibriSpeech as the training set, while the audio of 40 speakers from the AISHELL Chinese dataset is used as the validation set. The data in the figure show that a model trained on the English dataset with the method of the invention can obtain satisfactory results on the Chinese dataset.
Table 1 compares the accuracy and equal error rate of the four neural network models used as controls and the two models proposed in the present method on the English dataset LibriSpeech and the Chinese dataset AISHELL. As can be seen from Table 1, compared with existing models, the proposed nested residual neural network Proposed-NResNet improves accuracy by about 3% on both the English and Chinese datasets with a lower equal error rate; the combination of the nested residual neural network and the speaker attention mechanism, NResNet+SA, improves accuracy by a further 3% over the nested residual neural network alone, for an overall improvement of 6% over existing models, while the equal error rate is nearly halved.
TABLE 1 Accuracy and equal error rate of each model
(The table is reproduced as an image in the original publication.)
Fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention, as shown in fig. 10, including:
a speaker audio determining unit 1010, configured to pre-process audio information of a speaker, and convert the audio information into data in a preset format;
a frame-level vector determining unit 1020, configured to input the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker determining unit 1030 configured to generate a speaker vector at an utterance level based on the speaker vector at the frame level, and calculate a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
Specifically, the functions of each unit in fig. 10 can be referred to the description in the foregoing method embodiment, and are not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A speaker verification method, comprising the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating a speaker vector of an utterance level based on the speaker vector of the frame level, and calculating cosine similarity of the speaker vector of the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
2. The speaker verification method according to claim 1, wherein the audio information of the speaker is preprocessed to convert the audio information into data in a preset format, specifically:
converting the WAV format audio file of the speaker into a flac format file by adopting an audio conversion technology, and preprocessing the flac format file to obtain npy format data containing all information of the speaker.
3. The speaker verification method of claim 1, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, each cell being a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
4. The speaker verification method according to claim 1, wherein a spatial attention mechanism is introduced after the nested residual block, and a sigmoid function is used in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector, and Multiply represents the element-wise multiplication operation.
5. The speaker verification method according to any one of claims 1 to 4, wherein the calculating a cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is a target speaker comprises:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
6. A speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling in an attention module along the spatial dimensions, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating a speaker vector at an utterance level based on the speaker vector at the frame level, and calculating a cosine similarity between the speaker vector at the utterance level and a target speaker vector to determine whether the speaker is a target speaker; the target speaker vector is pre-acquired.
7. The speaker verification system according to claim 6, wherein the speaker audio determination unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technique, and preprocesses the flac format file to obtain npy format data containing all the information of the speaker.
8. The speaker verification system of claim 6, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprising two cells, each cell being a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H₁(x) = F₁(x) + x
H₂(x) = F₂(x) + H₁(x)
H(x) = H₂(x) + x
where x represents the input data of the first nested residual block, F₁(x) represents the output of the first sub-residual block in the nested residual block, H₁(x) represents the sum of F₁(x) and x, F₂(x) represents the output of the second sub-residual block in the nested residual block, H₂(x) represents the sum of F₂(x) and H₁(x), and H(x) represents the output of the two nested residual blocks.
9. The speaker verification system of claim 6, wherein the frame-level vector determining unit introduces a spatial attention mechanism after the nested residual block and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector, with the following formulas:
F″ = f{avg_pool(V), max_pool(V)}
F′ = σ(F″)
F = Multiply(V, F′)
where V represents the speaker vector output by the nested residual neural network, avg_pool represents the average pooling operation, max_pool represents the maximum pooling operation, and f{ } represents merging the results of the two pooling operations to obtain a new speaker vector F″; F′ represents the speaker vector obtained by applying the activation function to F″; F represents the frame-level speaker vector, and Multiply represents the element-wise multiplication operation.
10. The speaker verification system according to any one of claims 6 to 9, wherein the speaker confirmation unit sets a threshold value for the probability value of the cosine similarity, and determines that the speaker is the target speaker when the probability value of the cosine similarity is greater than the threshold value, and otherwise determines that the speaker is not the target speaker.
CN202110496856.2A 2021-05-07 2021-05-07 Speaker confirmation method and system Active CN113345444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Publications (2)

Publication Number Publication Date
CN113345444A (en) 2021-09-03
CN113345444B (en) 2022-10-28

Family

ID=77469818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110496856.2A Active CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Country Status (1)

Country Link
CN (1) CN113345444B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115050373B (en) * 2022-04-29 2024-09-06 思必驰科技股份有限公司 Dual path embedded learning method, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11276410B2 (en) * 2019-09-13 2022-03-15 Microsoft Technology Licensing, Llc Convolutional neural network with phonetic attention for speaker verification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on speaker recognition based on the 3A-RCNN network; Li Jianwen; Electronic Technology & Software Engineering; 2020-07-31; full text *

Also Published As

Publication number Publication date
CN113345444A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
JP3412496B2 (en) Speaker adaptation device and speech recognition device
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
CN113488058B (en) Voiceprint recognition method based on short voice
CN109584893B (en) VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN105261367B (en) A kind of method for distinguishing speek person
EP3486903B1 (en) Identity vector generating method, computer apparatus and computer readable storage medium
WO2019047343A1 (en) Voiceprint model training method, voice recognition method, device and equipment and medium
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
Fang et al. Channel adversarial training for cross-channel text-independent speaker recognition
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
CN106601258A (en) Speaker identification method capable of information channel compensation based on improved LSDA algorithm
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Song et al. Triplet network with attention for speaker diarization
CN113345444B (en) Speaker confirmation method and system
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Cumani et al. Speaker recognition using e–vectors
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN110428841A (en) A kind of vocal print dynamic feature extraction method based on random length mean value
Ananthi et al. Speech recognition system and isolated word recognition based on Hidden Markov model (HMM) for Hearing Impaired
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant