CN110556114B - Speaker identification method and device based on attention mechanism - Google Patents


Info

Publication number
CN110556114B
CN110556114B (application CN201910684343.7A)
Authority
CN
China
Prior art keywords
tested
call
voice
caller
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910684343.7A
Other languages
Chinese (zh)
Other versions
CN110556114A (en)
Inventor
林格平
戚梦苑
沈亮
李娅强
刘发强
孙旭东
孙晓晨
宁珊
蔡文强
王玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Original Assignee
Beijing University of Posts and Telecommunications
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications and National Computer Network and Information Security Management Center
Priority to CN201910684343.7A
Publication of CN110556114A
Application granted
Publication of CN110556114B
Expired - Fee Related (current legal status)
Anticipated expiration


Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L17/00 - Speaker identification or verification techniques
                    • G10L17/04 - Training, enrolment or model building
                    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • H - ELECTRICITY
        • H04 - ELECTRIC COMMUNICATION TECHNIQUE
            • H04M - TELEPHONIC COMMUNICATION
                • H04M1/00 - Substation equipment, e.g. for use by subscribers
                    • H04M1/64 - Automatic arrangements for answering calls; Automatic arrangements for recording messages for absent subscribers; Arrangements for recording conversations
                        • H04M1/65 - Recording arrangements for recording a message from the calling party
                            • H04M1/656 - Recording arrangements for recording a message from the calling party for recording conversations
                    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
                        • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
                            • H04M1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
                            • H04M1/72448 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
                                • H04M1/72454 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to context-related or environment-related conditions

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Environmental & Geological Engineering (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a speaker identification method and device based on an attention mechanism, comprising the following steps: collecting call recordings of a plurality of tested callers and call recordings of the test caller; establishing a speaker voice library from the call recordings corresponding to the tested callers; training an attention-based neural network on the voice of the tested callers to obtain a training model; storing the call recording of the test caller to obtain a recording file; and identifying, with the training model and the recording file, whether the tested caller is the target caller. By training an attention-based neural network on the tested callers' voices and identifying the tested caller with the resulting model, the method confirms that the owner of the dialed number is consistent with the actual speaker, avoids the communication security risk of an impersonated caller identity, and thereby improves information security during calls.

Description

Speaker identification method and device based on attention mechanism
Technical Field
The invention relates to the field of voice recognition, in particular to a speaker recognition method and device based on an attention mechanism.
Background
Speech is the most direct and most important mode of communication in human work and life; the human voice carries semantic information, language or dialect information, channel information, and so on. With the continuing progress of computer technology and the arrival of the network era, more and more means are available to disguise a speaker's identity.
Determining the speaker's identity during a call helps to establish the security of the communication. Existing speaker recognition methods include using an approximation of the KL divergence as a measure of similarity between speakers, recognizing speakers with a BP neural network, recognizing speakers with combined MFCC and GFCC features, and so on. However, most applications of these techniques are smart-home scenarios, for example a sweeping robot recognizing from the voice whether the speaker is its owner. The prior art offers no speaker recognition method for the call process and does not consider the silent portions of speech during a call.
Most speech recognition methods in the prior art target scenarios in which the speech signal is captured directly from the environment; a method for recognizing the speaker during a telephone call is lacking, leaving a security risk that a caller's identity can be impersonated.
Disclosure of Invention
The invention aims to provide a speaker identification method and device based on an attention mechanism to solve the above technical problems.
To achieve this purpose, the invention provides the following scheme:
In a first aspect of the embodiments of the present invention, there is provided a method for identifying a speaker based on an attention mechanism, including the following steps:
collecting call recordings of a plurality of tested callers and call recordings of the test caller;
establishing a speaker voice library from the call recordings corresponding to the tested callers;
training an attention-based neural network on the voice of the tested callers to obtain a training model;
storing the call recording of the test caller to obtain a recording file;
and identifying, with the training model and the recording file, whether the tested caller is the target caller.
Optionally, the step of collecting call recordings of a plurality of tested callers and call recordings of the test caller comprises:
during the call, the testing party records the voice of the tested caller using the smartphone's built-in recording function; when the system's call recording function is used, the phone model, the storage format of the recording file, and the external environment characteristics during the call need to be noted; and the call recording is saved as a lossless file in Wave format.
Optionally, the step of establishing a speaker voice library from the call recordings corresponding to the tested callers includes:
acquiring the correspondence between the identity of each tested caller and the tested voice;
and establishing the speaker voice library according to the correspondence, wherein the speaker voice library comprises the call audio data of the identified party, the identity information of the tested party, the environment characteristics of the tested party, the recording equipment information of the testing party, the environment characteristics of the testing party, the call duration, the call time, and the call volume.
Optionally, the step of training an attention-based neural network on the voice of the tested callers to obtain a training model includes:
denoising the recording files with a wiener filter to obtain preprocessed recording files;
and training an attention-based time-recurrent neural network on the preprocessed recording files to obtain the training model.
Optionally, the step of training the attention-based time-recurrent neural network on the preprocessed recording files to obtain the training model includes:
extracting voice features from the preprocessed recording file through the input layer of the time-recurrent neural network to obtain Mel cepstrum coefficient feature vectors of the voice in the preprocessed recording file;
sending the Mel cepstrum coefficient feature vectors to a fully connected layer, which (acting like an autoencoder) performs further feature extraction to obtain second feature vectors of the voice in the preprocessed recording file;
sending the second feature vectors to an attention-based time-recurrent neural network layer comprising a plurality of LSTM layers, which process the second feature vectors to obtain processed data;
and sending the processed data to a softmax (normalized exponential function) layer, which maps the processed data to person names, obtaining the name corresponding to the processed data.
Optionally, the step of identifying whether the tested caller is the target caller with the training model and the recording file includes:
judging whether the voice to be tested exists in the speaker voice library; if so, identifying it as the corresponding tested person in the speaker voice library; otherwise, if the audio file to be recognized belongs to a new, unenrolled person, it is recognized as the closest existing tested person, i.e. the existing class with the maximum confidence value.
In order to achieve the above object, the present invention further provides the following solutions:
a speaker recognition apparatus based on an attention mechanism, comprising:
the collection module is used for collecting call recordings of a plurality of tested callers and call recordings of the test caller;
the voice library establishing module is used for establishing a speaker voice library from the call recordings corresponding to the tested callers;
the training module is used for training an attention-based neural network on the voice of the tested callers to obtain a training model;
the file storage module is used for storing the call recording of the test caller to obtain a recording file;
and the test module is used for identifying, with the training model and the recording file, whether the tested caller is the target caller.
Optionally, the collecting module specifically includes:
the testing party unit is used for recording, during the call, the voice of the tested caller with the smartphone's built-in recording function;
the recording unit is used for using the system's call recording function during the call and noting the phone model, the storage format of the recording file, and the external environment characteristics during the call; the call recording is saved as a lossless file in Wave format.
Optionally, the voice library establishing module specifically includes:
the correspondence acquisition unit is used for acquiring the correspondence between the identity of each tested caller and the tested voice;
and the voice library establishing unit is used for establishing the speaker voice library according to the correspondence, wherein the speaker voice library comprises the call audio data of the identified party, the identity information of the tested party, the environment characteristics of the tested party, the recording equipment information of the testing party, the environment characteristics of the testing party, the call duration, the call time, and the call volume.
Optionally, the training module specifically includes:
the preprocessing unit is used for denoising the sound recording file by adopting a wiener filter to obtain a preprocessed sound recording file;
and the training model establishing unit is used for training the preprocessed sound recording file by adopting a time recurrent neural network based on an attention mechanism to obtain a training model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a speaker recognition method and device based on an attention mechanism. An attention-based neural network is trained on the voice of the tested callers to obtain a training model, and the trained model is used to recognize the tested caller. With only the audio and an existing call voice library, the speaker in a call can be determined, so the user can check whether the displayed number matches the actual speaker and judge its reliability. This effectively defends against fraud carried out by imitating a caller's voice and indirectly protects the user's communication security.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a caller identification method based on an attention mechanism according to embodiment 1 of the present invention;
FIG. 2 is a flowchart of a speaker identification method based on an attention mechanism according to embodiment 2 of the present invention;
fig. 3 is a schematic structural diagram of a device for identifying a caller based on an attention mechanism according to embodiment 3 of the present invention;
fig. 4 is a diagram of a neural network structure provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Example 1
Embodiment 1 of the present invention provides an embodiment of a speaker recognition method based on an attention mechanism, and as shown in fig. 1, the method includes the following steps:
S101: collecting call recordings of a plurality of tested callers and call recordings of the test caller;
S102: establishing a caller voice library from the call recordings corresponding to the tested callers;
S103: training an attention-based neural network on the voice of the tested callers to obtain a training model;
S104: storing the call recording of the test caller to obtain a recording file;
S105: identifying, with the training model and the recording file, whether the tested caller is the target caller.
During answering, the testing party records the call, and the call recording is then associated with the tested caller to build the tested callers' voice library;
the voice characteristics of the callers are learned with an attention-based LSTM neural network to generate a speaker recognition model;
when identifying a caller, the testing party records the call and saves the recording as a Wave audio file;
and the saved audio file is passed through a wiener filter, Mel cepstrum coefficients are extracted, and the features are fed into the trained speaker recognition model to identify the tested caller.
The attention-based time-recurrent neural network can be implemented directly with the open-source tool TensorFlow.
The network parameters used in the embodiment of the invention are as follows: rnn_size can be set as needed, for example 64; attn_length can be set as needed, for example 64. It is emphasized that the core of the present invention is the speaker identification method based on the attention mechanism; modifying network parameters and similar adjustments to the network are all within the scope of the present invention.
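As a concrete illustration, the following is a minimal TensorFlow 1.x sketch of an attention-wrapped stacked-LSTM classifier of the kind described above, using rnn_size = 64 and attn_length = 64 as in this embodiment. The number of stacked layers, the number of enrolled speakers, the MFCC dimension, and the width of the fully connected layer are assumptions for illustration, not values taken from the patent; tf.contrib.rnn.AttentionCellWrapper exists only in TensorFlow 1.x.

```python
import tensorflow as tf  # TensorFlow 1.x; tf.contrib.rnn was removed in TF 2.x

rnn_size = 64      # hidden units per LSTM layer (value from this embodiment)
attn_length = 64   # attention window length (value from this embodiment)
num_layers = 2     # number of stacked LSTM layers (assumed)
num_speakers = 10  # number of enrolled callers (assumed)
n_mfcc = 13        # MFCC dimension per frame (assumed)

# [batch, time, n_mfcc] MFCC sequences extracted from the preprocessed recordings
features = tf.placeholder(tf.float32, [None, None, n_mfcc], name="mfcc")
labels = tf.placeholder(tf.int32, [None], name="speaker_id")

# fully connected layer applied frame by frame (the "second feature vector")
dense = tf.layers.dense(features, 128, activation=tf.nn.relu)

# stacked LSTM layers, each wrapped with an attention window
cells = [
    tf.contrib.rnn.AttentionCellWrapper(
        tf.nn.rnn_cell.LSTMCell(rnn_size), attn_length=attn_length)
    for _ in range(num_layers)
]
outputs, _ = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.MultiRNNCell(cells), dense, dtype=tf.float32)

# softmax (normalized exponential) layer over the enrolled speakers
logits = tf.layers.dense(outputs[:, -1, :], num_speakers)
probs = tf.nn.softmax(logits)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```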
The attention-based speaker recognition provided by the embodiment of the invention can determine the speaker during a call using only audio and an existing call voice library, making it easy for the user to check whether the displayed number matches the actual speaker, judge its reliability, and thereby indirectly protect communication security.
Example 2
Embodiment 2 of the present invention provides a preferred embodiment of a speaker recognition method based on the attention mechanism. Referring to fig. 2, in this embodiment, the method includes the steps of:
S201: collecting the users' call recordings and associating each recording with the corresponding tested caller to build a caller voice library.
During the call, the testing party records the voice of the tested caller with the smartphone's built-in recording equipment or with an earphone that has a recording function.
An Android phone can use the system's call recording function during the call; the phone model, the storage format of the recording file, and the external environment characteristics during the call (e.g., quiet or noisy) need to be noted. An Apple phone does not provide a call recording function because of the system's privacy settings, so recording can be done through an earphone with a recording function; the earphone brand and model, the storage format of the recording file, and the external environment characteristics (e.g., quiet or noisy) need to be noted.
When the testing party is the calling party, the smartphone's built-in recording equipment or the recording earphone can be turned on after the called party answers; when the testing party is the called party, it can be turned on when answering the calling party.
The voice library is a correlation library of the identity of the tested speaker and the tested voice, and aims to provide data for subsequent model training.
The voice database comprises the call audio data of the identified party; identity information of the identified party (e.g., phone number, name, location); environmental characteristics of the identified party (e.g., indoor, street, store); the testing party's recording equipment information (e.g., sampling frequency, noise reduction characteristics, audio storage format); environmental characteristics of the testing party (e.g., indoor, street, store); the call duration; the call time; and the call volume.
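For illustration only, a record of such a voice library might be organized as in the following Python sketch; the field names and example values are assumptions that mirror the items listed above, not a schema defined by the patent.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative sketch of one record in the speaker voice library; field names and
# example values are assumptions, not a schema defined by the patent.
@dataclass
class VoiceLibraryRecord:
    audio_path: str                 # call audio of the identified (tested) party
    tested_party_identity: dict     # e.g. phone number, name, location
    tested_party_environment: str   # e.g. "indoor", "street", "store"
    recording_device_info: dict     # e.g. sampling frequency, noise reduction, format
    testing_party_environment: str  # e.g. "indoor"
    call_duration_s: float          # call duration in seconds
    call_time: datetime             # when the call took place
    call_volume: float              # relative call volume

record = VoiceLibraryRecord(
    audio_path="library/caller_A_001.wav",
    tested_party_identity={"phone": "unknown", "name": "caller_A", "location": "unknown"},
    tested_party_environment="indoor",
    recording_device_info={"sample_rate": 44100, "noise_reduction": "none", "format": "wav"},
    testing_party_environment="indoor",
    call_duration_s=63.0,
    call_time=datetime(2019, 7, 26, 10, 30),
    call_volume=0.7,
)
```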
S202: training the attention-based neural network on the voice of the callers to generate a training model.
Specifically, the voice files in the voice library are preprocessed: a wiener filter is used for simple noise reduction so that noise does not affect the rest of the process.
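A minimal preprocessing sketch along these lines, assuming 16-bit PCM Wave recordings and using scipy's wiener filter, could look as follows; the file names and the filter window size are illustrative placeholders.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import wiener

# Simple noise-reduction sketch with a wiener filter, assuming 16-bit PCM Wave
# recordings; file names and window size are illustrative placeholders.
rate, samples = wavfile.read("tested_caller_call.wav")
if samples.ndim > 1:                      # stereo call recording -> mono
    samples = samples.mean(axis=1)
denoised = wiener(samples.astype(np.float64), mysize=1025)
wavfile.write("tested_caller_call_denoised.wav", rate, denoised.astype(np.int16))
```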
The neural network used is based on a time-recursive neural network of the attention mechanism, the network structure being shown in fig. 4.
S203: during testing, the testing party records the call and stores the recording file.
In particular, the saved audio file format is required to be lossless, such as WAVE, FLAC, APE, ALAC, WavPack, and the like.
WAVE typically describes sound with three parameters: quantization bit depth, sampling frequency, and sample amplitude. The bit depth is commonly 8, 16, or 24 bits; audio may be mono or stereo, with mono amplitude data stored as an n x 1 matrix and stereo as an n x 2 matrix; the sampling frequency is generally 11025 Hz (11 kHz), 22050 Hz (22 kHz), or 44100 Hz (44 kHz). Sound quality is excellent, but the files are relatively large. The encoding mode of the audio file is also recorded.
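For example, the parameters of a saved Wave recording can be checked with Python's standard wave module, as in the short sketch below; the file name is a placeholder.

```python
import wave

# Checking the parameters of a saved Wave call recording; the file name is a placeholder.
wav = wave.open("tested_caller_call.wav", "rb")
channels = wav.getnchannels()          # 1 = mono, 2 = stereo
bit_depth = wav.getsampwidth() * 8     # quantization: 8, 16 or 24 bits
sample_rate = wav.getframerate()       # e.g. 11025, 22050 or 44100 Hz
duration_s = wav.getnframes() / sample_rate
wav.close()
print(channels, bit_depth, sample_rate, round(duration_s, 1))
```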
S204: inputting the recording file into the trained model to identify the tested caller.
Specifically, the stored audio file is preprocessed: the speech signal is denoised with wiener filtering and then cut into speech segments of equal length, for example 10 seconds each. Mel frequency cepstrum coefficient features are extracted from each segment, each segment is fed into the trained model, and the segments are classified; the neural network model is shown in fig. 4.
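Continuing the preprocessing sketch above, the test-time segmentation step might look as follows; the 10-second segment length follows the example in the text, while the file name is a placeholder.

```python
from scipy.io import wavfile

# Cut the denoised recording into equal-length speech segments (10 seconds each,
# as in this example); the file name is a placeholder.
rate, samples = wavfile.read("tested_caller_call_denoised.wav")
segment_len = 10 * rate
segments = [samples[i:i + segment_len]
            for i in range(0, len(samples) - segment_len + 1, segment_len)]
# each segment is then MFCC-encoded and fed to the trained attention-based model
```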
The neural network structure includes:
S301: a feature input layer, which performs feature engineering on the voice in the call voice library and extracts the required features; here the Mel cepstrum coefficients of the voice in the voice library are extracted as feature vectors;
S302: a fully connected layer;
S303: an attention-based time-recurrent neural network layer;
S304: a softmax (normalized exponential function) layer that normalizes the computed results;
S305: an output layer that converts between class codes and person names and outputs the result computed by the softmax layer in S304.
This embodiment can only identify existing tested persons, i.e. those already in the speaker voice library; if the audio file to be identified belongs to a new person, it is identified as the closest existing tested person.
In embodiment 2 of the invention, an attention-based neural network is trained on the tested callers' voices to obtain a training model, the tested caller is identified with the training model, the consistency between the dialed number and its owner is confirmed, and the security risk of an impersonated caller identity is avoided.
Example 3
Embodiment 3 of the present invention further provides a caller identification device based on an attention mechanism, as shown in fig. 3.
The collection module 10 is used for building a tested-caller voice library that associates the identity of each tested caller with the corresponding audio file. The collection module 10 can be further divided into a telephone recording module 11 and a database processing module 12.
The telephone recording module 11 is used for recording the call; specifically, the testing party makes planned or unplanned calls with the tested caller, records the call, and stores the recording as an audio file in Wave format.
During the call, the testing party records the voice of the tested caller with the smartphone's built-in recording equipment or with an earphone that has a recording function.
An Android phone can use the system's call recording function during the call; the phone model, the storage format of the recording file, and the external environment characteristics during the call (e.g., quiet or noisy) need to be noted. An Apple phone does not provide a call recording function because of the system's privacy settings, so recording can be done through an earphone with a recording function; the earphone brand and model, the storage format of the recording file, and the external environment characteristics (e.g., quiet or noisy) need to be noted.
When the testing party is the calling party, the smartphone's built-in recording device or the recording earphone can be turned on after the called party answers; when the testing party is the called party, it can be turned on when answering the calling party.
The database processing module 12 is used for associating the audio file with the identity of the tested person.
The testing party stores the collected audio file, the identity of the tested caller, and the related configuration information of the audio file in a database. The configuration information contains the identity information of the identified party (e.g., telephone number, name, location); environmental characteristics of the identified party (e.g., indoor, street, store); the testing party's recording equipment information (e.g., sampling frequency, noise reduction characteristics, audio storage format); environmental characteristics of the testing party (e.g., indoor, street, store); the call duration; the call time; and the call volume.
The training module 20 is used for training on the audio files in the tested persons' voice library. The network architecture used is shown in fig. 4.
Specifically, before training, noise reduction is applied to the audio files and features are extracted. The noise reduction uses wiener filtering, and the features are the Mel cepstrum coefficients of the audio file. The Python audio processing module python_speech_features may be used here.
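A minimal feature-extraction sketch with python_speech_features could look as follows; the file name and the number of cepstral coefficients are illustrative assumptions.

```python
from python_speech_features import mfcc
from scipy.io import wavfile

# Extract Mel cepstrum coefficients from a denoised library recording; the file name
# and numcep value are illustrative assumptions.
rate, samples = wavfile.read("library_recording_denoised.wav")
features = mfcc(samples, samplerate=rate, numcep=13)  # shape: (num_frames, 13)
print(features.shape)
```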
The testing module 30 is used to identify the speaker to which the new audio file belongs.
During testing, the testing party again records the call with the tested caller using the recording method described above. The audio file is preprocessed, i.e. after the same noise reduction and feature extraction as in the training module 20, it is passed into the trained model and the model's output is received.
The test module 30 can only identify tested persons already known to the training module 20, i.e. those existing in the speaker voice library. If the audio file to be identified in the test module 30 belongs to a new person, it is identified as the closest existing tested person, namely the one whose existing class has the highest confidence value.
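For illustration, mapping the model's softmax output to the closest enrolled tested person might look like the following sketch; the enrolled names and the confidence threshold are assumptions, not values from the patent.

```python
import numpy as np

# Map the model's softmax output to the closest enrolled tested person; the names
# and the threshold below are illustrative assumptions.
enrolled_names = ["tested_person_A", "tested_person_B", "tested_person_C"]

def identify(softmax_probs, names=enrolled_names, threshold=0.5):
    """Return the enrolled person with the highest confidence.

    A recording from a new, unenrolled person is still mapped to the closest
    existing class (the one with the highest confidence); the threshold only
    flags low-confidence matches.
    """
    best = int(np.argmax(softmax_probs))
    confidence = float(softmax_probs[best])
    return names[best], confidence, confidence >= threshold

print(identify(np.array([0.08, 0.84, 0.08])))  # ('tested_person_B', 0.84, True)
```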
Compared with the prior art, the invention has the following technical effects:
The invention discloses a speaker recognition method and device based on an attention mechanism. An attention-based neural network is trained on the voice of the tested callers to obtain a training model, the tested caller is recognized with the training model, the consistency between the dialed number and its owner is confirmed, the communication security risk of an impersonated caller identity is avoided, and information security during calls is thereby improved.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and implementation of the present invention have been explained through specific examples. The above description of the embodiments is only intended to help understand the method of the present invention and its core idea. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

Claims (7)

1. A method for identifying a caller based on an attention mechanism, characterized by comprising the following steps:
collecting call recordings of a plurality of tested callers and call recordings of a test caller;
establishing a speaker voice library from the call recordings corresponding to the tested callers;
training a neural network based on an attention mechanism with the call recording of the tested caller to obtain a training model;
storing the call recordings of the test caller and the tested caller to obtain a recording file;
identifying, with the training model and according to the recording file, whether the tested caller is a target caller,
wherein training the neural network based on the attention mechanism with the call recording of the tested caller to obtain the training model comprises:
denoising the call recording of the tested caller with a wiener filter to obtain a preprocessed recording file;
extracting voice features from the preprocessed recording file through an input layer of a time-recurrent neural network to obtain a Mel cepstrum coefficient feature vector of the voice in the preprocessed recording file;
sending the Mel cepstrum coefficient feature vector to a fully connected layer, which performs further feature extraction on it to obtain a second feature vector of the voice in the preprocessed recording file;
sending the second feature vector to an attention-based time-recurrent neural network layer comprising a plurality of LSTM layers, which process the second feature vector to obtain processed data;
and sending the processed data to a softmax (normalized exponential function) layer, which maps the processed data to person names, obtaining the name corresponding to the processed data.
2. The method according to claim 1, wherein collecting call recordings of a plurality of tested callers and a test caller comprises:
during the call, the testing party records the voice of the tested caller using the smartphone's built-in recording function; when the system's call recording function is used, the phone model, the storage format of the recording file, and the external environment characteristics during the call need to be noted; and the call recording is saved as a lossless file in Wave format.
3. The method according to claim 1, wherein establishing a speaker voice library from the call recording corresponding to the tested caller comprises:
acquiring the correspondence between the identity of the tested caller and the tested voice;
and establishing the speaker voice library according to the correspondence, wherein the speaker voice library comprises the call audio data of the identified party, the identity information of the tested party, the environment characteristics of the tested party, the recording equipment information of the testing party, the environment characteristics of the testing party, the call duration, the call time, and the call volume.
4. The method according to claim 1, wherein identifying, with the training model and according to the recording file, whether the tested caller is a target caller comprises:
judging whether the voice to be tested exists in the speaker voice library; if so, identifying it as the corresponding tested person in the speaker voice library; otherwise, if the audio file to be identified belongs to a new person, identifying it as the closest existing tested person.
5. A speaker recognition device based on an attention mechanism, comprising:
the collection module is used for collecting call recordings of a plurality of tested callers and call recordings of a test caller;
the voice library establishing module is used for establishing a speaker voice library from the call recordings corresponding to the tested callers;
the training module is used for training a neural network based on an attention mechanism with the call recording of the tested caller to obtain a training model;
the file storage module is used for storing the call recordings of the test caller and the tested caller to obtain a recording file;
the test module is used for identifying, with the training model and according to the recording file, whether the tested caller is a target caller,
wherein the training module is configured to:
denoising the call recording of the tested caller with a wiener filter to obtain a preprocessed recording file;
extracting voice features from the preprocessed recording file through an input layer of a time-recurrent neural network to obtain a Mel cepstrum coefficient feature vector of the voice in the preprocessed recording file;
sending the Mel cepstrum coefficient feature vector to a fully connected layer, which performs further feature extraction on it to obtain a second feature vector of the voice in the preprocessed recording file;
sending the second feature vector to an attention-based time-recurrent neural network layer comprising a plurality of LSTM layers, which process the second feature vector to obtain processed data;
and sending the processed data to a softmax (normalized exponential function) layer, which maps the processed data to person names, obtaining the name corresponding to the processed data.
6. The speaker recognition device based on an attention mechanism according to claim 5, wherein the collection module comprises:
the testing party unit is used for recording, during the call, the voice of the tested caller with the smartphone's built-in recording function;
the recording unit is used for using the system's call recording function during the call and noting the phone model, the storage format of the recording file, and the external environment characteristics during the call; the call recording is saved as a lossless file in Wave format.
7. The speaker recognition device based on an attention mechanism according to claim 5, wherein the voice library establishing module comprises:
the correspondence acquisition unit is used for acquiring the correspondence between the identity of the tested caller and the tested voice;
and the voice library establishing unit is used for establishing the speaker voice library according to the correspondence, wherein the speaker voice library comprises the call audio data of the identified party, the identity information of the tested party, the environment characteristics of the tested party, the recording equipment information of the testing party, the environment characteristics of the testing party, the call duration, the call time, and the call volume.
CN201910684343.7A 2019-07-26 2019-07-26 Speaker identification method and device based on attention mechanism Expired - Fee Related CN110556114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910684343.7A CN110556114B (en) 2019-07-26 2019-07-26 Speaker identification method and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910684343.7A CN110556114B (en) 2019-07-26 2019-07-26 Speaker identification method and device based on attention mechanism

Publications (2)

Publication Number Publication Date
CN110556114A CN110556114A (en) 2019-12-10
CN110556114B true CN110556114B (en) 2022-06-17

Family

ID=68736524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910684343.7A Expired - Fee Related CN110556114B (en) 2019-07-26 2019-07-26 Speaker identification method and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110556114B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785287B (en) * 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN114040052B (en) * 2021-11-01 2024-01-19 江苏号百信息服务有限公司 Method for identifying audio collection and effective audio screening of telephone voiceprint

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
US20180374486A1 (en) * 2017-06-23 2018-12-27 Microsoft Technology Licensing, Llc Speaker recognition
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN109256135A (en) * 2018-08-28 2019-01-22 桂林电子科技大学 A kind of end-to-end method for identifying speaker, device and storage medium
CN109637545A (en) * 2019-01-17 2019-04-16 哈尔滨工程大学 Based on one-dimensional convolution asymmetric double to the method for recognizing sound-groove of long memory network in short-term
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100454942C (en) * 2004-06-25 2009-01-21 联想(北京)有限公司 Realizing system and method for multimode communication of mobile terminal
CN101848277A (en) * 2010-04-23 2010-09-29 中兴通讯股份有限公司 Mobile terminal and method for storing conversation contents in real time
CN103391347B (en) * 2012-05-10 2018-06-08 中兴通讯股份有限公司 A kind of method and device of automatic recording
CN103167371A (en) * 2013-04-09 2013-06-19 北京兴科迪科技有限公司 Bluetooth headset with record storage function and vehicle with same
CN104580647B (en) * 2014-12-31 2018-11-06 惠州Tcl移动通信有限公司 A kind of caching method and communication device of calling record
CN205961381U (en) * 2016-07-20 2017-02-15 深圳唯创知音电子有限公司 Recording earphone
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
CN107580102A (en) * 2017-08-22 2018-01-12 深圳传音控股有限公司 Earphone and the method for earphone recording
CN107993663A (en) * 2017-09-11 2018-05-04 北京航空航天大学 A kind of method for recognizing sound-groove based on Android
CN109040444B (en) * 2018-07-27 2020-08-14 维沃移动通信有限公司 Call recording method, terminal and computer readable storage medium
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109155132A (en) * 2016-03-21 2019-01-04 亚马逊技术公司 Speaker verification method and system
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
US20180374486A1 (en) * 2017-06-23 2018-12-27 Microsoft Technology Licensing, Llc Speaker recognition
CN109256135A (en) * 2018-08-28 2019-01-22 桂林电子科技大学 A kind of end-to-end method for identifying speaker, device and storage medium
CN109215662A (en) * 2018-09-18 2019-01-15 平安科技(深圳)有限公司 End-to-end audio recognition method, electronic device and computer readable storage medium
CN109637545A (en) * 2019-01-17 2019-04-16 哈尔滨工程大学 Based on one-dimensional convolution asymmetric double to the method for recognizing sound-groove of long memory network in short-term
CN109801635A (en) * 2019-01-31 2019-05-24 北京声智科技有限公司 A kind of vocal print feature extracting method and device based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
End-to-End Attention based Text-Dependent Speaker Verification; Shi-Xiong Zhang et al.; Spoken Language Technology Workshop; 2017-02-09; pp. 171-178 *

Also Published As

Publication number Publication date
CN110556114A (en) 2019-12-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220617