CN113345444A - Speaker confirmation method and system - Google Patents

Speaker confirmation method and system

Info

Publication number
CN113345444A
CN113345444A
Authority
CN
China
Prior art keywords
speaker
nested
vector
residual
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110496856.2A
Other languages
Chinese (zh)
Other versions
CN113345444B (en)
Inventor
陈增照
郑秋雨
何秀玲
戴志诚
张婧
孟秉恒
李佳文
吴潇楠
朱胜虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202110496856.2A priority Critical patent/CN113345444B/en
Publication of CN113345444A publication Critical patent/CN113345444A/en
Application granted granted Critical
Publication of CN113345444B publication Critical patent/CN113345444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/08 Use of distortion metrics or a particular distance between probe pattern and reference templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a speaker confirmation method and a speaker confirmation system, which comprise the following steps: preprocessing the audio information of a speaker and converting the audio information into data in a preset format; inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; generating an utterance-level speaker vector based on the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance. The invention provides a deep nested residual neural network based on a spatial attention mechanism, which can extract the voiceprint characteristics of a speaker more accurately.

Description

Speaker confirmation method and system
Technical Field
The invention belongs to the field of speaker recognition, and particularly relates to a speaker confirmation method and a speaker confirmation system.
Background
Voiceprint is a general term for the speech features implied in speech that characterize and identify a speaker, and for the speech models built on these features. Speech is a very important medium in interpersonal communication. A person's voice is difficult to change after adulthood; because vocal organs and articulation habits differ, the voice characteristics of each person are unique, and a speaker's most basic pronunciation and vocal-tract characteristics are difficult to imitate. Therefore, exploiting the uniqueness and short-time stationarity of the voiceprint, a model of the voice can be established and used for identity authentication.
Voiceprint recognition, also called speaker recognition, is the process of determining a target speaker from the voiceprint characteristics of the speech to be recognized. Speaker recognition tasks fall into two basic categories: speaker identification and speaker verification. Speaker verification asks, given a voiceprint sample, whether it matches the identity of a claimed target speaker in a trained corpus, and is a one-to-one confirmation problem; speaker identification asks which speaker in the corpus a target voiceprint sample belongs to, and is a many-to-one selection problem.
Although many researchers at home and abroad have proposed a series of mature speaker verification methods, such as the GMM-UBM (Gaussian mixture model-universal background model), the LSTM (long short-term memory) network model, and CNN (convolutional neural network) models, many problems and much room for improvement remain in research methods, model performance, and application scenarios.
The defects of existing speaker recognition methods mainly include the following. 1. Most traditional methods are based on the GMM-UBM Gaussian mixture-universal background model, and their technical means remain at the most traditional model-matching stage. As society develops rapidly and technology advances, traditional methods become complex and difficult when processing large amounts of data. 2. A speaker's voiceprint contains much useful speaker information, but speaker audio is highly susceptible to many factors, including the environment, the recording device, and the speaker's own state. Existing deep learning speaker verification algorithms cannot extract speaker features well enough, and their overall effect still needs improvement. 3. Speaker recognition across different scenarios and different languages is a current challenge for voiceprint recognition; existing speaker recognition methods cannot minimize the influence of cross-language data on recognition performance, and results in cross-language recognition research remain unsatisfactory.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a speaker verification method and a speaker verification system, so as to solve the problems of low recognition accuracy and large network scale in existing speaker recognition neural network systems.
To achieve the above object, in a first aspect, the present invention provides a speaker verification method, including the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling based on the spatial dimension in the attention module, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating an utterance-level speaker vector based on the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance.
In an optional example, the preprocessing is performed on the audio information of the speaker, and the audio information is converted into data in a preset format, specifically:
converting the WAV format audio file of the speaker into a flac format file by adopting an audio conversion technology, and preprocessing the flac format file to obtain npy format data containing all information of the speaker.
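As an illustration of this preprocessing step, the sketch below is one possible realization, not the patented implementation: it converts a WAV file to flac, extracts log-mel filterbank (fbank) features, and stores them in npy format. The choice of the soundfile and librosa libraries and all parameter values (number of mel bands, frame length and shift) are assumptions made only for this example.

```python
# Illustrative sketch of the preprocessing step (assumed libraries and parameters).
import numpy as np
import soundfile as sf
import librosa

def wav_to_flac(wav_path: str, flac_path: str) -> None:
    """Re-encode a speaker's WAV file as a FLAC file."""
    audio, sample_rate = sf.read(wav_path)
    sf.write(flac_path, audio, sample_rate, format="FLAC")

def flac_to_fbank_npy(flac_path: str, npy_path: str,
                      n_mels: int = 64, frame_len: float = 0.025,
                      frame_shift: float = 0.010) -> np.ndarray:
    """Extract log-mel filterbank (fbank) features and save them as .npy data."""
    audio, sample_rate = librosa.load(flac_path, sr=None)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sample_rate, n_mels=n_mels,
        n_fft=int(frame_len * sample_rate),
        hop_length=int(frame_shift * sample_rate))
    fbank = librosa.power_to_db(mel).T          # shape: (frames, n_mels)
    np.save(npy_path, fbank.astype(np.float32))
    return fbank
```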
In an optional example, each of the nested residual blocks includes two sub-residual blocks, each of the sub-residual blocks includes two cells, and each of the cells is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H1(x)=F1(x)+x
H2(x)=F2(x)+H1(x)
H(x)=H2(x)+x
where x represents the input data of the first nested residual block, F1(x) represents the output of the first sub-residual block in the nested residual block, H1(x) represents the sum of F1(x) and x, F2(x) represents the output of the second sub-residual block in the nested residual block, H2(x) represents the sum of F2(x) and H1(x), and H(x) represents the output of the two nested sub-residual blocks, i.e., of the nested residual block.
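A minimal PyTorch sketch of one nested residual block realizing the three formulas above is given below. It is illustrative only: the channel count, 3x3 kernels, batch normalization, the final ReLU, and the choice of applying the second sub-residual block to H1(x) are assumptions; the patent text only fixes the nesting structure itself.

```python
# Hedged sketch of a nested residual block:
# H1(x) = F1(x) + x, H2(x) = F2(x) + H1(x), H(x) = H2(x) + x.
import torch
import torch.nn as nn

class SubResidualBlock(nn.Module):
    """A sub-residual block of two convolutional 'cells' whose output is F_i(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)              # F_i(x), not yet added to the shortcut

class NestedResidualBlock(nn.Module):
    """Nested residual block: a 'residual of residuals' over two sub-blocks."""
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = SubResidualBlock(channels)
        self.f2 = SubResidualBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.f1(x) + x              # H1(x) = F1(x) + x
        h2 = self.f2(h1) + h1            # H2(x) = F2(x) + H1(x), with F2 applied to H1
        return self.relu(h2 + x)         # H(x) = H2(x) + x (trailing ReLU is an assumption)
```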
In an alternative example, a spatial attention mechanism is introduced after nesting the residual block, and a sigmoid function is used in the activation layer of the attention module to obtain a speaker vector at the frame level, with the following specific formula:
F″=f{avg_pool(V),max_pool(V)}
F′=σ(F″)
F=Multiply(V,F′)
V represents the speaker vector output by the nested residual neural network, avg_pool denotes the average pooling operation, max_pool denotes the maximum pooling operation, and f{ } denotes merging the results of the two pooling operations to obtain a new speaker vector F″; F′ denotes the speaker vector obtained by applying the activation function to F″; F denotes the frame-level speaker vector.
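The following PyTorch sketch illustrates one common realization of this spatial attention step, offered as an assumption rather than the patented implementation: average and maximum pooling are taken across the channel axis to form two spatial maps, the maps are merged and reduced to a single channel by a two-dimensional convolution, a sigmoid yields F′, and F′ is multiplied element-wise with V. The 7x7 kernel size is likewise an assumption.

```python
# Hedged sketch of the spatial attention module:
# F'' = f{avg_pool(V), max_pool(V)}, F' = sigmoid(F''), F = Multiply(V, F').
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, channels, freq, time) features from the nested residual network
        avg_map = torch.mean(v, dim=1, keepdim=True)           # avg_pool(V)
        max_map, _ = torch.max(v, dim=1, keepdim=True)         # max_pool(V)
        f2 = self.conv(torch.cat([avg_map, max_map], dim=1))   # F'' = f{avg, max}
        f1 = torch.sigmoid(f2)                                 # F'  = sigmoid(F'')
        return v * f1                                          # F   = Multiply(V, F')
```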
In an optional example, calculating the cosine similarity between the utterance-level speaker vector and the target speaker vector to determine whether the speaker is the target speaker specifically includes:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
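A small numpy sketch of this decision rule follows; it is illustrative only, and the threshold value 0.7 is an arbitrary placeholder rather than a value taken from the patent.

```python
# Hedged sketch of the threshold decision on cosine similarity.
import numpy as np

def is_target_speaker(utt_vec: np.ndarray, target_vec: np.ndarray,
                      threshold: float = 0.7) -> bool:
    """Return True when the cosine similarity exceeds the decision threshold."""
    cos = float(np.dot(utt_vec, target_vec) /
                (np.linalg.norm(utt_vec) * np.linalg.norm(target_vec) + 1e-12))
    return cos > threshold
```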
In a second aspect, the present invention provides a speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling based on the spatial dimension in the attention module, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating an utterance-level speaker vector based on the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance.
In an optional example, the speaker audio determining unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technique, and pre-processes the flac format file to obtain npy format data containing all information of the speaker.
In an alternative example, each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, and each cell is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H1(x)=F1(x)+x
H2(x)=F2(x)+H1(x)
H(x)=H2(x)+x
where x represents the input data of the first nested residual block, F1(x) represents the output of the first sub-residual block in the nested residual block, H1(x) represents the sum of F1(x) and x, F2(x) represents the output of the second sub-residual block in the nested residual block, H2(x) represents the sum of F2(x) and H1(x), and H(x) represents the output of the two nested sub-residual blocks, i.e., of the nested residual block.
In an alternative example, the frame-level vector determination unit introduces a spatial attention mechanism after nesting the residual block, and uses a sigmoid function in the activation layer of the attention module to obtain the speaker vector at the frame level, with the following specific formula:
F″=f{avg_pool(V),max_pool(V)}
F′=σ(F″)
F=Multiply(V,F′)
V represents the speaker vector output by the nested residual neural network, avg_pool denotes the average pooling operation, max_pool denotes the maximum pooling operation, and f{ } denotes merging the results of the two pooling operations to obtain a new speaker vector F″; F′ denotes the speaker vector obtained by applying the activation function to F″; F denotes the frame-level speaker vector.
In an optional example, the speaker determination unit sets a threshold to the probability value of the cosine similarity, and determines that the speaker is the target speaker when the probability value of the cosine similarity is greater than the threshold, otherwise determines that the speaker is not the target speaker.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a speaker identification method and a speaker identification system, and provides a novel nested residual error neural network. After demonstrating the efficiency of ResNet in speaker verification, we conducted a series of explorations on this study. The concept of using the residual of the residuals is presented herein for the first time; the present invention proposes the use of a spatial attention mechanism in the speaker verification task. The invention adopts a method of adding an attention mechanism to obtain more valuable voiceprint information from a voiceprint energy spectrum, which is to use the spatial attention for picture processing for the processing of the voiceprint information for the first time. The invention provides a speaker confirmation network model with better performance, and experiments prove that the method is superior to the prior art, and then an attention mechanism is added on the basis of the network model to further improve the system performance.
The invention provides a speaker confirmation method and system, and proposes a deep nested residual neural network based on a spatial attention mechanism; the deep neural network is used to extract the speaker's voiceprint features more accurately. Experimental results show that the method has strong learning capacity and can exceed the performance of traditional neural network methods. Compared with other recent methods, the accuracy on the English public dataset LibriSpeech is improved by 6%, and the equal error rate is also lower. Training on the Chinese dataset AISHELL further shows that the method has good cross-language adaptability.
Drawings
Fig. 1 is a flow chart of a speaker verification method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speaker verification method provided by an embodiment of the present invention;
FIG. 3 is a diagram of nested residual blocks provided by an embodiment of the present invention;
FIG. 4 is a diagram of a nested residual neural network architecture based on an attention mechanism provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a speaker spatial attention mechanism provided by an embodiment of the present invention;
FIG. 6 shows the performance of different models on the English dataset according to an embodiment of the present invention;
FIG. 7 shows the performance of different models on the Chinese dataset according to an embodiment of the present invention;
FIG. 8 is a comparison of performance on different datasets before and after the attention mechanism is added, as provided by an embodiment of the present invention;
FIG. 9 is a comparison of the performance of different models on the Chinese-English cross-language dataset according to an embodiment of the present invention;
fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to extract the speaker's fbank voiceprint features using a deep nested residual neural network based on a spatial attention mechanism. During training, the fbank voiceprint features are input in the form of sample pairs, the speaker is represented by the frame-level vector output by the last layer of the deep neural network, and cosine similarity is used to judge whether the current speaker is the target speaker.
Fig. 1 is a flow chart of a speaker verification method according to an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s101, preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
s102, inputting data in a preset format corresponding to speaker audio information into a trained deep nested residual error neural network based on a space attention mechanism to obtain a speaker vector at a frame level; the deep nested residual error neural network based on the spatial attention mechanism comprises: four layers of nested residual error neural networks each comprising two nested residual error blocks and a spatial attention mechanism; introducing a spatial attention mechanism after nesting a residual neural network, wherein the spatial attention mechanism introduces average pooling and maximum pooling in an attention module based on spatial dimensions, combines two parts of pooling results to retain useful information and reduce parameter scale, and uses a sigmoid function in an activation layer of the attention module to obtain a speaker vector at a frame level;
s103, generating a speaker vector of an utterance level based on the speaker vector of the frame level, and calculating cosine similarity of the speaker vector of the utterance level and a target speaker vector to judge whether the speaker is a target speaker; the target speaker vector is pre-acquired.
The speaker verification method based on the spatial attention mechanism and the deep nested residual neural network is divided into three parts: audio preprocessing, speaker voiceprint feature extraction, and speaker verification; the overall processing flow is shown in FIG. 2. First, the WAV audio files with speaker labels in the public dataset LibriSpeech are converted into flac files using an audio conversion technique, and the flac files are preprocessed to obtain npy files containing all the speaker information. The npy files are then used as the input of the deep nested residual neural network, which obtains a frame-level vector representation of the speaker through a series of convolution and pooling operations. During training, training is performed with AP and AN sample pairs, where AP denotes anchor-positive pairs and AN denotes anchor-negative pairs. The deep nested neural network continuously learns the vector similarity within the AP and AN pairs, thereby improving the model's performance on the speaker verification task.
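The patent describes training on anchor-positive and anchor-negative sample pairs but does not spell out the objective function. The sketch below assumes a standard triplet-style loss over cosine similarity, which is consistent with that description; the margin value is a placeholder.

```python
# Hedged sketch of an AP/AN training objective (assumed, not taken from the patent):
# push cos(anchor, positive) above cos(anchor, negative) by a margin.
import torch
import torch.nn.functional as F

def ap_an_cosine_loss(anchor: torch.Tensor, positive: torch.Tensor,
                      negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Triplet-style loss on cosine similarity for AP and AN sample pairs."""
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)   # similarity within AP pairs
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)   # similarity within AN pairs
    return torch.clamp(sim_an - sim_ap + margin, min=0.0).mean()
```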
The invention proposes a nested residual block suitable for the speaker verification task. The detailed network structure is shown in FIG. 3. ReLU denotes the rectified linear unit, a commonly used activation function; NR-block denotes the structure of the nested residual block proposed in the present invention.
Two nested residual blocks are stacked in the architecture of the invention, and each nested residual block is formed by stacking two sub-residual blocks. Each sub-residual block contains two cells, and each cell is a building block. The invention places a convolution layer in front of every two nested residual blocks, with the kernel size set to 5 and the stride set to 2, which copes well with the change caused by the increase in the number of channels.
In the nested residual neural network, H(x) is regarded as the underlying mapping to be fitted by the stacked layers, where x denotes the input of the first layer in the nested residual block. In the first building block, the invention assumes the residual function of equation (1); equation (2) is then obtained in the second residual block. Thus the invention implements the stacking function of the nested residual block, whose output can be used as the input of the next nested residual block. Changing the objective function from H(x) to F(x) greatly reduces the difficulty of network learning.
H1(x)=F1(x)+x (1)
H2(x)=F2(x)+H1(x) (2)
H(x)=H2(x)+x (3)
where subscript 1 denotes the first sub-residual block of the proposed nested residual block and subscript 2 denotes the second sub-residual block.
Specifically, the nested residual neural network is formed by stacking four identical network layers, each containing two nested residual blocks, and each nested residual block contains two sub-residual blocks. In the nested neural network, the output H(x) of the first nested residual block is used as the input of the second nested residual block, equations (1) to (3) are executed repeatedly, and so on, until the output of the last nested residual block is obtained, that is, the output of the whole nested residual neural network, which can be denoted V, the frame-level speaker vector referred to below.
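The sketch below illustrates one possible assembly of the frame-level feature extractor described above: four identical stages, each a 5x5 convolution with stride 2 followed by two nested residual blocks, with the spatial attention module applied after the last stage. The channel widths, the single-channel fbank input, and the placement of the attention module after the final stage are assumptions; the nested residual block and spatial attention module are reproduced here in compact form so the snippet is self-contained.

```python
# Hedged sketch of the overall frame-level extractor (assumed channel widths and input layout).
import torch
import torch.nn as nn

class NestedResidualBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.f1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(True),
                                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.f2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(True),
                                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        h1 = self.f1(x) + x            # H1(x) = F1(x) + x
        h2 = self.f2(h1) + h1          # H2(x) = F2(x) + H1(x)
        return torch.relu(h2 + x)      # H(x) = H2(x) + x

class SpatialAttention(nn.Module):
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, v):
        m = torch.cat([v.mean(1, keepdim=True), v.max(1, keepdim=True).values], 1)
        return v * torch.sigmoid(self.conv(m))

class NResNetSA(nn.Module):
    """Four stages of (conv k5 s2 + two nested residual blocks), then spatial attention."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        stages, in_ch = [], 1                  # input: (batch, 1, freq, frames) fbank map
        for out_ch in channels:
            stages += [nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
                       NestedResidualBlock(out_ch), NestedResidualBlock(out_ch)]
            in_ch = out_ch
        self.backbone = nn.Sequential(*stages)
        self.attention = SpatialAttention()

    def forward(self, x):
        return self.attention(self.backbone(x))   # frame-level speaker features F
```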
The speaker vector output by the nested residual neural network acting as the frame-level feature extractor can be represented as V = [v1 v2 ... vd], where vt ∈ R^D and R^D is the D-dimensional vector space. The symbol v denotes the vector corresponding to each frame, and the symbol d denotes the input frame size of the layer. The study underlying the invention found that using only a deep neural network may lead to gradient explosion and gradient vanishing problems, and that if the voiceprint information transmitted in the form of a voiceprint energy spectrum were used in full, the workload of the system would undoubtedly increase and the method might be unreliable.
Therefore, FIG. 4 shows a nested residual neural network structure based on the attention mechanism according to an embodiment of the present invention, and FIG. 5 shows the structure of the speaker spatial attention mechanism provided by an embodiment of the present invention. The invention introduces a spatial attention mechanism after the nested residual blocks. The invention applies average pooling and maximum pooling based on the spatial dimension in the attention module and then merges the two in equation (4), preserving useful information and reducing the parameter scale. A two-dimensional convolution layer then reduces the features to a single channel. A sigmoid function is used in the activation layer to obtain the speaker's spatial attention features. Finally, equation (6) multiplies the input of the feature module by the attention features to give the output. In this way, the invention obtains the frame-level speaker feature vectors.
F″=f{avg_pool(V),max_pool(V)} (4)
F′=σ(F″) (5)
F=Multiply(V,F′) (6)
V represents the speaker vector output by the nested residual neural network, avg_pool denotes the average pooling operation, max_pool denotes the maximum pooling operation, and f{ } denotes merging the results of the two pooling operations to obtain a new speaker vector F″; F′ denotes the speaker vector obtained by applying the activation function to F″; F denotes the finally output frame-level speaker vector, which is subsequently used for the cosine similarity determination.
Specifically, the frame-level speaker vectors are input into a dimension-reduction layer and then averaged along the time dimension to generate the utterance-level speaker vector. That is, the utterance-level speaker vector is generated from the frame-level speaker vectors through dimension reduction, a fully connected layer, averaging, and length normalization.
The fully-connected layer projects the utterance-level representation into a 512-dimensional speaker vector. We normalize the features by a length normalization operation and use cosine similarity in the objective function:
cos(X, Y) = X^T Y
where X and Y are two different utterance-level speaker vectors. The similarity probability of the X and Y vectors is then modeled by the above cosine similarity formula, and whether X and Y belong to the same speaker is judged by setting a threshold on the similarity probability.
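The following sketch illustrates the frame-level to utterance-level step just described: a dimension-reduction layer, averaging over the time dimension, a fully connected projection to a 512-dimensional speaker vector, length normalization, and cosine scoring of two normalized vectors. Apart from the 512-dimensional output, all layer sizes are assumptions made for this example.

```python
# Hedged sketch of utterance-level embedding and cosine scoring (assumed layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceEmbedding(nn.Module):
    def __init__(self, frame_dim: int, reduced_dim: int = 1024, emb_dim: int = 512):
        super().__init__()
        self.reduce = nn.Linear(frame_dim, reduced_dim)    # dimension-reduction layer
        self.project = nn.Linear(reduced_dim, emb_dim)     # fully connected layer -> 512-d

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) frame-level speaker vectors F
        x = self.reduce(frames).mean(dim=1)                # average over the time dimension
        x = self.project(x)
        return F.normalize(x, p=2, dim=-1)                 # length normalization

def cosine_score(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """cos(X, Y) = X^T Y for length-normalized utterance-level vectors."""
    return (x * y).sum(dim=-1)
```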
Four experiments were performed according to the different corpora and whether the attention mechanism was added. The experimental comparison shows that the method has a stronger learning ability and can exceed the performance of existing neural network methods.
In fig. 6 to 9, the ordinate ACC represents the accuracy, and the abscissa epoch represents the number of verification rounds.
Fig. 6 shows the experimental results of five different network models on the English public dataset LibriSpeech. One of the experiments, baseResNet, served as the control for the present method. On the basis of this comparison experiment, the invention fine-tuned the network structure in different ways to observe the influence of different structures on the speaker verification task. The Proposed-NResNet model in FIG. 6 refers to the nested residual neural network model proposed by the invention. In the control experiment, the residual block was stacked three times, and ResNet-bd is an adjustment of the residual block to different positions in the network. Nesting and stacking two residual blocks in NResNet is one of the innovations of the invention. The experimental results show that baseResNet, ResNet-bd3, and Proposed-NResNet can all achieve reasonable performance, but the Proposed-NResNet model provided by the invention is clearly more stable and more accurate.
Comprehensive and scientific evaluation is a consistent aim of this research. In FIG. 7, the invention selects the best three groups of experiments and then examines the results on the Chinese public dataset AISHELL. The Proposed-NResNet model in FIG. 7 refers to the nested residual neural network model proposed by the invention. The invention takes the audio of 151 speakers from AISHELL as the training set and the audio of 40 speakers as the validation set, to verify that the method described here is also feasible for Chinese. The data must first be processed into the format required by the neural network. As the results in FIG. 7 show, the method of the invention achieves excellent results on both the Chinese and English datasets.
Adding a speaker spatial attention mechanism to the deep nested neural network is the second innovation of this work. The invention compares, through experiments, the performance of NResNet with the spatial attention mechanism added under different language environments, observing the effect of each group separately. The details can be analyzed in FIG. 8. The Proposed-NResNet+SA model in FIG. 8 refers to the combination of the nested residual neural network and the speaker spatial attention mechanism proposed by the invention. Notably, the accuracy of the proposed Proposed-NResNet+SA reaches an optimum of nearly 0.99, a significant advance over the model without the attention mechanism, which further illustrates the effectiveness of the proposed attention mechanism in the speaker verification task.
Furthermore, the invention compares the effect of the method on cross-language datasets in FIG. 9. The Proposed-NResNet model in FIG. 9 refers to the nested residual neural network model proposed by the invention. In this experiment the invention used Train-clean-100 from LibriSpeech as the training set, while the audio of 40 speakers from the AISHELL Chinese dataset was used as the validation set. The data in the figure show that the model trained on the English dataset by the proposed method can achieve satisfactory results on the Chinese dataset.
Table 1 compares the accuracy and equal error rate of the four neural network models used as controls and the two models proposed by the method of the invention on the English dataset LibriSpeech and the Chinese dataset AISHELL. As can be seen from Table 1, compared with existing models, the nested residual neural network Proposed-NResNet provided by the invention improves accuracy by about 3% on both the English and Chinese datasets and has a lower equal error rate; compared with using the nested residual neural network alone, the combined model Proposed-NResNet+SA improves accuracy by a further 3%, for an overall improvement of 6% over existing models, and the equal error rate is reduced by nearly half.
TABLE 1. Accuracy and equal error rate of the models
Fig. 10 is an architecture diagram of a speaker verification system according to an embodiment of the present invention, as shown in fig. 10, including:
a speaker audio determining unit 1010, configured to pre-process audio information of a speaker, and convert the audio information into data in a preset format;
a frame-level vector determining unit 1020, configured to input the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling based on the spatial dimension in the attention module, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker determination unit 1030, configured to generate an utterance-level speaker vector based on the frame-level speaker vector, and calculate the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance.
Specifically, the functions of each unit in fig. 10 can be referred to the description in the foregoing method embodiment, and are not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A speaker verification method, comprising the steps of:
preprocessing audio information of a speaker, and converting the audio information into data in a preset format;
inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling based on the spatial dimension in the attention module, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
generating an utterance-level speaker vector based on the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance.
2. The speaker verification method according to claim 1, wherein the audio information of the speaker is preprocessed to convert the audio information into data in a preset format, specifically:
converting the WAV format audio file of the speaker into a flac format file by adopting an audio conversion technology, and preprocessing the flac format file to obtain npy format data containing all information of the speaker.
3. The speaker verification method of claim 1, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, and each cell is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H1(x)=F1(x)+x
H2(x)=F2(x)+H1(x)
H(x)=H2(x)+x
where x represents the input data of the first nested residual block, F1(x) represents the output of the first sub-residual block in the nested residual block, H1(x) represents the sum of F1(x) and x, F2(x) represents the output of the second sub-residual block in the nested residual block, H2(x) represents the sum of F2(x) and H1(x), and H(x) represents the output of the two nested sub-residual blocks, i.e., of the nested residual block.
4. The speaker verification method according to claim 1, wherein a spatial attention mechanism is introduced after the residual block is nested, and a sigmoid function is used in the active layer of the attention module to obtain the speaker vector at the frame level, with the following formula:
F″=f{avg_pool(V),max_pool(V)}
F′=σ(F″)
F=Multiply(V,F′)
V represents the speaker vector output by the nested residual neural network, avg_pool denotes the average pooling operation, max_pool denotes the maximum pooling operation, and f{ } denotes merging the results of the two pooling operations to obtain a new speaker vector F″; F′ denotes the speaker vector obtained by applying the activation function to F″; F denotes the frame-level speaker vector.
5. The speaker verification method according to any one of claims 1 to 4, wherein calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is a target speaker comprises:
setting a threshold value for the probability value of the cosine similarity, and judging that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, otherwise, judging that the speaker is not the target speaker.
6. A speaker verification system, comprising:
the speaker audio determining unit is used for preprocessing audio information of a speaker and converting the audio information into data in a preset format;
the frame-level vector determining unit is used for inputting the data in the preset format corresponding to the speaker audio information into a trained deep nested residual neural network based on a spatial attention mechanism to obtain a frame-level speaker vector; the deep nested residual neural network based on the spatial attention mechanism comprises four network layers, each containing two nested residual blocks, and a spatial attention mechanism; the spatial attention mechanism is introduced after the nested residual network, applies average pooling and maximum pooling based on the spatial dimension in the attention module, merges the two pooling results to retain useful information and reduce the parameter scale, and uses a sigmoid function in the activation layer of the attention module to obtain the frame-level speaker vector;
a speaker confirmation unit for generating an utterance-level speaker vector based on the frame-level speaker vector, and calculating the cosine similarity between the utterance-level speaker vector and a target speaker vector to determine whether the speaker is the target speaker; the target speaker vector is acquired in advance.
7. The speaker verification system according to claim 6, wherein the speaker audio determination unit converts the WAV format audio file of the speaker into a flac format file by using an audio conversion technique, and preprocesses the flac format file to obtain npy format data containing all information of the speaker.
8. The speaker verification system of claim 6, wherein each nested residual block comprises two sub-residual blocks, each sub-residual block comprises two cells, and each cell is a building block; a convolution layer is placed in front of every two nested residual blocks;
the two nested sub-residual blocks implement the stacking function according to the following formulas:
H1(x)=F1(x)+x
H2(x)=F2(x)+H1(x)
H(x)=H2(x)+x
where x represents the input data of the first nested residual block, F1(x) represents the output of the first sub-residual block in the nested residual block, H1(x) represents the sum of F1(x) and x, F2(x) represents the output of the second sub-residual block in the nested residual block, H2(x) represents the sum of F2(x) and H1(x), and H(x) represents the output of the two nested sub-residual blocks, i.e., of the nested residual block.
9. The speaker verification system of claim 6, wherein the frame-level vector determination unit introduces a spatial attention mechanism after nesting the residual block and uses a sigmoid function in the active layer of the attention module to obtain the frame-level speaker vector by the following formula:
F″=f{avg_pool(V),max_pool(V)}
F′=σ(F″)
F=Multiply(V,F′)
V represents the speaker vector output by the nested residual neural network, avg_pool denotes the average pooling operation, max_pool denotes the maximum pooling operation, and f{ } denotes merging the results of the two pooling operations to obtain a new speaker vector F″; F′ denotes the speaker vector obtained by applying the activation function to F″; F denotes the frame-level speaker vector.
10. The speaker identification system according to any one of claims 6 to 9, wherein the speaker identification unit sets a threshold value to the probability value of the cosine similarity, and determines that the speaker is a target speaker when the probability value of the cosine similarity is greater than the threshold value, and otherwise determines that the speaker is not a target speaker.
CN202110496856.2A 2021-05-07 2021-05-07 Speaker confirmation method and system Active CN113345444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110496856.2A CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Publications (2)

Publication Number Publication Date
CN113345444A true CN113345444A (en) 2021-09-03
CN113345444B CN113345444B (en) 2022-10-28

Family

ID=77469818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110496856.2A Active CN113345444B (en) 2021-05-07 2021-05-07 Speaker confirmation method and system

Country Status (1)

Country Link
CN (1) CN113345444B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer
US20210082438A1 (en) * 2019-09-13 2021-03-18 Microsoft Technology Licensing, Llc Convolutional neural network with phonetic attention for speaker verification

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020073694A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Voiceprint identification method, model training method and server
CN110570869A (en) * 2019-08-09 2019-12-13 科大讯飞股份有限公司 Voiceprint recognition method, device, equipment and storage medium
US20210082438A1 (en) * 2019-09-13 2021-03-18 Microsoft Technology Licensing, Llc Convolutional neural network with phonetic attention for speaker verification
CN111179196A (en) * 2019-12-28 2020-05-19 杭州电子科技大学 Multi-resolution depth network image highlight removing method based on divide-and-conquer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI JIANWEN: "Research on speaker recognition based on the 3A-RCNN network", Electronic Technology & Software Engineering *

Also Published As

Publication number Publication date
CN113345444B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
CN110600047B (en) Perceptual STARGAN-based multi-to-multi speaker conversion method
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
JP3412496B2 (en) Speaker adaptation device and speech recognition device
CN109272988B (en) Voice recognition method based on multi-path convolution neural network
WO2019047343A1 (en) Voiceprint model training method, voice recognition method, device and equipment and medium
CN109584893B (en) VAE and i-vector based many-to-many voice conversion system under non-parallel text condition
CN105261367B (en) A kind of method for distinguishing speek person
Fang et al. Channel adversarial training for cross-channel text-independent speaker recognition
CN110047504B (en) Speaker identification method under identity vector x-vector linear transformation
CN110309343A (en) A kind of vocal print search method based on depth Hash
CN115206293B (en) Multi-task air traffic control voice recognition method and device based on pre-training
CN111696522B (en) Tibetan language voice recognition method based on HMM and DNN
CN104882141A (en) Serial port voice control projection system based on time delay neural network and hidden Markov model
Song et al. Triplet network with attention for speaker diarization
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
CN110428841A (en) A kind of vocal print dynamic feature extraction method based on random length mean value
Wang et al. Deep discriminant analysis for i-vector based robust speaker recognition
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
Ananthi et al. Speech recognition system and isolated word recognition based on Hidden Markov model (HMM) for Hearing Impaired
CN113345444B (en) Speaker confirmation method and system
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN113593537B (en) Voice emotion recognition method and device based on complementary feature learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant