CN110634489B - Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium - Google Patents


Info

Publication number
CN110634489B
CN110634489B (application CN201810663588.7A)
Authority
CN
China
Prior art keywords
feature vector
target
voice
sequence
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810663588.7A
Other languages
Chinese (zh)
Other versions
CN110634489A (en)
Inventor
方昕
刘俊华
魏思
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201810663588.7A
Publication of CN110634489A
Application granted
Publication of CN110634489B
Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building

Abstract

The application provides a voiceprint confirmation method, device, equipment and readable storage medium, wherein the method comprises: for each target object of a first object and a second object, generating a feature vector with attention features based on the voice of the target object as the target feature vector of the target object, wherein the attention features can represent the influence of different objects on the confirmation result; and determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object. The voiceprint confirmation method, device, equipment and readable storage medium take the influence of different objects on the confirmation result into consideration, introduce an attention mechanism, and generate different feature vectors with attention features for different objects; performing the discrimination with these feature vectors can effectively improve the discrimination accuracy.

Description

Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium
Technical Field
The present invention relates to the field of identity authentication technologies, and in particular, to a voiceprint confirmation method, apparatus, device, and readable storage medium.
Background
Voiceprint authentication is an important technology in the field of identity authentication, and an important technology in voiceprint authentication is voiceprint confirmation, namely, determining whether an object to be confirmed corresponding to a voice to be confirmed and a target object corresponding to a target voice are the same object.
The voiceprint confirmation scheme in the prior art is as follows: firstly, feature vectors representing voice information of two objects are respectively extracted from the voice to be confirmed of the object to be confirmed and the target voice of the target object by using a probability statistical model, then similarity calculation is carried out by using the two extracted feature vectors, and whether the object to be confirmed corresponding to the voice to be confirmed and the target object corresponding to the target voice are the same object or not is judged according to the similarity obtained through calculation.
Although the voiceprint confirmation scheme in the prior art can judge whether the objects corresponding to the two voices are the same object, the judgment accuracy is low.
Disclosure of Invention
In view of the above, the present invention provides a voiceprint confirmation method, apparatus, device and readable storage medium, to solve the problem of the low determination accuracy of voiceprint confirmation schemes in the prior art. The technical scheme is as follows:
a voiceprint validation method comprising:
generating a feature vector with attention features based on the voice of the target object as a target feature vector of the target object for each of a first object and a second object, wherein the attention features can represent the influence of different objects on a confirmation result;
determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
Wherein, for each target object of the first object and the second object, generating a feature vector with attention feature based on the speech of the target object as a target feature vector of the target object includes:
for each target object in the first object and the second object, extracting a voice feature vector sequence from voice of the target object;
determining a voiceprint feature vector of the target object through the voice feature vector sequence to serve as a basic feature vector of the target object;
and determining a feature vector with attention features as a target feature vector of the target object based on the basic feature vector of another object and the voice feature vector sequence of the target object.
Wherein the extracting of the speech feature vector sequence from the speech of the target object includes:
extracting voice features with preset frame numbers from the voice of the target object to obtain a voice feature sequence;
and characterizing each frame of voice features in the voice feature sequence by using a feature vector, and obtaining the voice feature vector sequence after characterizing each frame of voice features.
Wherein the determining a feature vector with attention feature based on the base feature vector of another object and the speech feature vector sequence of the target object as the target feature vector of the target object comprises:
determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object to obtain the attention weight sequence corresponding to the voice feature vector sequence of the target object;
and determining the feature vector with the attention feature as the target feature vector of the target object through the voice feature vector sequence of the target object and the attention weight sequence corresponding to the voice feature vector sequence of the target object.
Wherein, the determining the attention weight corresponding to each speech feature vector in the speech feature vector sequence of the target object by the base feature vector of the other object and the speech feature vector sequence of the target object includes:
respectively splicing each voice feature vector in the voice feature vector sequence of the target object with the basic feature vector of the other object to obtain a spliced feature vector sequence;
converting the spliced feature vector sequence into a one-dimensional vector as an attention weight vector;
and mapping each weight element in the attention weight vector into a weight value in a preset range to obtain the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object.
Wherein the determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object comprises:
splicing the target characteristic vector of the first object and the target characteristic vector of the second object to obtain a spliced characteristic vector;
and determining whether the first object and the second object are the same object or not based on the spliced feature vector.
The voiceprint confirmation method further comprises:
and splicing the basic feature vector of the target object with the feature vector with the attention feature of the target object, and taking the feature vector obtained after splicing as the target feature vector of the target object.
A voiceprint confirmation apparatus, comprising: a feature vector generation module and a voiceprint confirmation module;
the feature vector generation module is used for generating a feature vector with attention features based on the voice of the target object as a target feature vector of the target object aiming at each target object in a first object and a second object, wherein the attention features can represent the influence of different objects on a confirmation result;
the voiceprint confirmation module is configured to determine whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
Wherein the feature vector generation module comprises: a feature extraction module, a basic feature vector determination module and a target feature vector determination module;
the feature extraction module is configured to, for each target object in the first object and the second object, extract a speech feature vector sequence from speech of the target object;
the basic feature vector determining module is configured to determine a voiceprint feature vector of the target object through the speech feature vector sequence, where the voiceprint feature vector is used as a basic feature vector of the target object;
and the target feature vector determining module is used for determining a feature vector with attention features as a target feature vector of the target object based on a basic feature vector of another object and the voice feature vector sequence of the target object.
Wherein the target feature vector determination module comprises: an attention weight determining submodule and a target feature vector determining submodule;
the attention weight determining submodule is used for determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object, and obtaining the attention weight sequence corresponding to the voice feature vector sequence of the target object;
the target feature vector determining submodule is configured to determine the feature vector with the attention feature as the target feature vector of the target object through the speech feature vector sequence of the target object and the attention weight sequence corresponding to the speech feature vector sequence of the target object.
The voiceprint confirmation apparatus further includes: a feature vector splicing module;
the feature vector splicing module is configured to splice the basic feature vector of the target object and the feature vector with the attention feature of the target object, and use a feature vector obtained after splicing as the target feature vector of the target object.
Voiceprint confirmation equipment, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program, and the program is specifically configured to:
generating a feature vector with attention features based on the voice of the target object as a target feature vector of the target object for each of a first object and a second object, wherein the attention features can represent the influence of different objects on a confirmation result;
determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voiceprint confirmation method.
As can be seen from the above technical solutions, according to the voiceprint confirmation method, apparatus, device and readable storage medium provided by the present invention, first, for each of the first object and the second object, a feature vector with attention features is generated based on the speech of the target object, so that the feature vector with attention features of the first object is obtained as the target feature vector of the first object, and the feature vector with attention features of the second object is obtained as the target feature vector of the second object; then, whether the first object and the second object are the same object is determined based on the target feature vector of the first object and the target feature vector of the second object. The voiceprint confirmation method, apparatus, device and readable storage medium provided by the invention take the influence of different objects on the confirmation result into consideration, introduce an attention mechanism, and generate different feature vectors with attention features for different objects; performing the determination with these feature vectors can effectively improve the determination accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a voiceprint confirmation method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a process of generating a feature vector with attention features based on the voice of a target object as the target feature vector of the target object in a voiceprint confirmation method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an implementation process of determining a feature vector with attention features as the target feature vector of a target object based on the basic feature vector of another object and the speech feature vector sequence of the target object in the voiceprint confirmation method according to the embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep neural network based system for voiceprint validation according to an embodiment of the present invention;
fig. 5 is a schematic network structure diagram of a first attention weight determining network and a second attention weight determining network according to an embodiment of the present invention;
FIG. 6 is another schematic diagram of a deep neural network-based system for performing voiceprint validation according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a voiceprint confirmation apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a voiceprint confirmation apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, there are many schemes for voiceprint confirmation, such as a voiceprint confirmation scheme based on the Gaussian Mixture Model-Universal Background Model (GMM-UBM), a voiceprint confirmation scheme based on the Total Variability (TV) model, a voiceprint confirmation scheme based on a deep neural network, and the like, and the inventors find that:
the existing voiceprint recognition schemes all represent speaker information based on a probability statistical model, and the current research focuses on how to establish a more accurate probability statistical model to represent speaker information, specifically, the existing voiceprint recognition schemes all train feature vectors representing speaker information by using voices of a large number of speakers, for example, the voiceprint recognition scheme based on GMM-UBM extracts a mean value supervector as a feature vector representing speaker information, the voiceprint recognition scheme based on TV extracts an iv vector as a feature vector representing speaker information, and the voiceprint recognition scheme based on a deep neural network extracts a linear output of a certain layer of network as a feature vector representing speaker information.
In any of these voiceprint confirmation schemes, the same probability statistical model is used to extract the feature vector representing speaker information from both the voice to be confirmed of the object to be confirmed and the target voice of the target object, and the influence of the different objects on the confirmation result is not taken into account. In practice, however, when confirming speaker A against speaker B, the decision tends to rest on the salient differences between A and B: for example, if speaker B pronounces a certain word in an obviously different way from speaker A, it can immediately be concluded that A and B are not the same speaker. Likewise, when confirming speaker A against speaker C, the decision tends to rest on the salient differences between A and C: for example, if speaker C's speaking rate is obviously different from speaker A's, it can immediately be concluded that A and C are not the same speaker. Based on this observation, an attention mechanism is introduced so that different feature vectors representing speaker information are generated for different objects, and the final decision is then made with these feature vectors, which effectively improves the determination accuracy.
Referring to fig. 1, a schematic flow chart of a voiceprint confirmation method provided in an embodiment of the present application is shown, where the voiceprint confirmation method includes:
step S101: for each of the first object and the second object, a feature vector with attention features is generated as a target feature vector of the target object based on the speech of the target object.
Wherein the attention feature is capable of characterizing the influence of different objects on the validation result.
Via the above-described step S101, a feature vector with attention feature of the first object may be obtained as a target feature vector of the first object, and a feature vector with attention feature of the second object may be obtained as a target feature vector of the second object.
It should be noted that, in this embodiment, one of the first object and the second object is an object to be confirmed, and the other is a target object, and accordingly, one of the voice of the first object and the voice of the second object is the voice of the object to be confirmed, and the other is the voice of the target object, and as to which of the two voices is the voice of the object to be confirmed, and which is the voice of the target object, the embodiment is not particularly limited, because the voiceprint confirmation method provided in this embodiment does not need to distinguish the two voices deliberately, and as long as the two voices are acquired, it can be determined whether the two objects corresponding to the two voices are the same object.
Step S102: determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
In the voiceprint confirmation method provided by the embodiment of the application, first, for each of the first object and the second object, a feature vector with attention features is generated based on the voice of the target object, so that the feature vector with attention features of the first object is obtained as the target feature vector of the first object and the feature vector with attention features of the second object is obtained as the target feature vector of the second object; then, whether the first object and the second object are the same object is determined based on the target feature vector of the first object and the target feature vector of the second object. Therefore, the voiceprint confirmation method provided by the embodiment of the application considers the influence of different objects on the confirmation result, introduces an attention mechanism, generates different feature vectors with attention features for different objects, and can effectively improve the determination accuracy by performing the determination through these feature vectors.
In another embodiment of the present application, an implementation of step S101 in the previous embodiment is introduced, namely: for each of the first object and the second object, generating a feature vector with attention features based on the speech of the target object as the target feature vector of the target object.
Referring to fig. 2, a flowchart illustrating an implementation process of generating a feature vector with attention feature based on speech of a target object as a target feature vector of the target object for each of a first object and a second object may include:
step S201: for each of the first object and the second object, a sequence of speech feature vectors is extracted from speech of the target object.
In one possible implementation, the process of extracting the speech feature vector sequence from the speech of the target object may include: extracting voice features with preset frame numbers from the voice of the target object to obtain a voice feature sequence; and characterizing each frame of voice feature in the voice feature sequence by using a feature vector, and then obtaining a voice feature vector sequence after characterizing each frame of voice feature. The speech feature vector sequence of the first object and the speech feature vector sequence of the second object may be obtained through the above-described process. The speech features extracted from the speech of the target object may be, but are not limited to, FilterBank features.
Illustratively, let A be the first object, VA be the speech of A, B be the second object, and VB be the speech of B. First, a preset number of frames of speech features, for example 100 frames, are extracted from VA to obtain a speech feature sequence At (t = 1,2,…,100); similarly, 100 frames of speech features are extracted from VB to obtain a speech feature sequence Bt (t = 1,2,…,100). Then, the speech feature sequence At is represented with vectors to obtain the speech feature vector sequence HAt (t = 1,2,…,100), and similarly the speech feature sequence Bt (t = 1,2,…,100) is represented with vectors to obtain the speech feature vector sequence HBt (t = 1,2,…,100).
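To make this step concrete, the following is a minimal Python/numpy sketch, not the patented implementation: the FilterBank extraction is stubbed out with random data, and a single tanh projection stands in for a trained characterization network; all names, dimensions and frame counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, FEAT_DIM, VEC_DIM = 100, 40, 200          # frames, FilterBank dim, feature-vector dim

def extract_fbank(speech: np.ndarray, n_frames: int = T, n_filt: int = FEAT_DIM) -> np.ndarray:
    """Stub for FilterBank feature extraction: returns an (n_frames, n_filt) speech feature sequence."""
    return rng.standard_normal((n_frames, n_filt))

def characterize(frames: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy per-frame projection standing in for a characterization network (e.g. G1/G2)."""
    return np.tanh(frames @ W + b)           # (n_frames, VEC_DIM)

W = 0.1 * rng.standard_normal((FEAT_DIM, VEC_DIM))
b = np.zeros(VEC_DIM)

VA = rng.standard_normal(16000)              # dummy waveform of object A
VB = rng.standard_normal(16000)              # dummy waveform of object B
At, Bt = extract_fbank(VA), extract_fbank(VB)               # speech feature sequences
HAt, HBt = characterize(At, W, b), characterize(Bt, W, b)   # speech feature vector sequences, (100, 200)
```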
Step S202: and determining the voiceprint feature vector of the target object through the voice feature vector sequence to serve as the basic feature vector of the target object.
Specifically, the speech feature vector sequence may be averaged in frames, and the feature vector obtained after averaging may be determined as a voiceprint feature vector of the target object, which is used as a basic feature vector of the target object.
The basic feature vector of the first object and the basic feature vector of the second object may be obtained through step S202.
Illustratively, the speech feature vector sequence of the first object A is HAt (t = 1,2,…,100) and the speech feature vector sequence of the second object B is HBt (t = 1,2,…,100). The speech feature vector sequence HAt (t = 1,2,…,100) is averaged over frames to obtain the basic feature vector HA of the first object A, and similarly the speech feature vector sequence HBt (t = 1,2,…,100) is averaged over frames to obtain the basic feature vector HB of the second object B. Assuming that each speech feature vector in the speech feature vector sequences of the first object A and the second object B is a 200-dimensional vector, the basic feature vectors of the first object A and the second object B are also 200-dimensional vectors.
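A minimal sketch of the frame averaging described above, using the same toy shapes as before (assumed for illustration, not prescribed by the patent):

```python
import numpy as np

def base_feature_vector(feature_seq: np.ndarray) -> np.ndarray:
    """Frame-wise average of a (T, D) speech feature vector sequence -> (D,) basic (voiceprint) vector."""
    return feature_seq.mean(axis=0)

HAt = np.random.default_rng(0).standard_normal((100, 200))   # toy sequence for the first object A
HA = base_feature_vector(HAt)                                 # 200-dimensional basic feature vector HA
```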
It should be noted that, in order to improve the discrimination accuracy, in this embodiment, the voiceprint feature vector determined in this step is not directly used for performing the decision, but the voiceprint feature vector is used as a basis to perform subsequent operations, so as to determine the voiceprint feature vector finally used for the discrimination.
Step S203: and determining a feature vector with attention features as a target feature vector of the target object based on the basic feature vector of the other object and the voice feature vector sequence of the target object.
That is, the target feature vector of the first object is determined based on the base feature vector of the second object and the speech feature vector sequence of the first object, and similarly, the target feature vector of the second object is determined based on the base feature vector of the first object and the speech feature vector sequence of the second object. The target feature vector pair finally used for discrimination, namely the target feature vector of the first object and the target feature vector of the second object, can be obtained through the steps.
In another embodiment of the present application, an implementation of step S203 is introduced, namely: determining a feature vector with attention features based on the basic feature vector of the other object and the speech feature vector sequence of the target object as the target feature vector of the target object.
Referring to fig. 3, a flowchart illustrating an implementation process of determining a feature vector with attention feature as a target feature vector of a target object based on a base feature vector of another object and a speech feature vector sequence of the target object may include:
step S301: and determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of another object and the voice feature vector sequence of the target object to obtain the attention weight sequence corresponding to the voice feature vector sequence of the target object.
In a possible implementation manner, a specific implementation process of this step may include: respectively splicing each voice feature vector in the voice feature vector sequence of the target object with the basic feature vector of another object to obtain a spliced feature vector sequence; converting the spliced feature vector sequence into a one-dimensional vector as an attention weight vector; and mapping each weight element in the attention weight vector into a weight value in a preset range to obtain the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object.
Through the present step, the attention weight corresponding to each speech feature vector in the speech feature vector sequence of the first object can be obtained, that is, the attention weight sequence corresponding to the speech feature vector sequence of the first object is obtained, and similarly, the attention weight corresponding to each speech feature vector in the speech feature vector sequence of the second object can be obtained, that is, the attention weight sequence corresponding to the speech feature vector sequence of the second object is obtained.
For example, let HA be the basic feature vector of the first object A, whose speech feature vector sequence is HAt (t = 1,2,…,100), and let HB be the basic feature vector of the second object B, whose speech feature vector sequence is HBt (t = 1,2,…,100). The basic feature vector HB of the second object B is spliced with each speech feature vector in the speech feature vector sequence HAt (t = 1,2,…,100) of the first object A; assuming HB is a 200-dimensional vector and each speech feature vector is also a 200-dimensional vector, each splice yields a 400-dimensional feature vector, so splicing all 100 speech feature vectors with HB yields 100 400-dimensional feature vectors. These 100 400-dimensional feature vectors are then converted into a 100×1 one-dimensional vector as the weight vector, and finally the 100×1 weight vector is mapped to obtain the attention weight sequence α(At) (t = 1,2,…,100) corresponding to HAt (t = 1,2,…,100). Similarly, the basic feature vector HA of the first object A is spliced with each speech feature vector in the speech feature vector sequence HBt (t = 1,2,…,100) of the second object B to obtain 100 400-dimensional feature vectors, which are converted into a 100×1 vector as the weight vector, and the 100×1 weight vector is mapped to obtain the attention weight sequence α(Bt) (t = 1,2,…,100) corresponding to HBt (t = 1,2,…,100).
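The Python/numpy sketch below illustrates this weight computation for object A; the single linear projection w is a toy stand-in for the trained attention weight determination network described later, and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 200
HAt = rng.standard_normal((T, D))            # speech feature vector sequence of object A
HB = rng.standard_normal(D)                  # basic feature vector of object B

def attention_weights(seq: np.ndarray, other_base: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Splice each frame vector with the other object's basic vector, project each splice to one
    scalar, and map the scalars to weights in (0, 1) with a softmax."""
    spliced = np.concatenate([seq, np.tile(other_base, (seq.shape[0], 1))], axis=1)  # (T, 2D)
    scores = spliced @ w                      # one-dimensional weight vector, shape (T,)
    scores -= scores.max()                    # numerical stability
    e = np.exp(scores)
    return e / e.sum()                        # attention weight sequence, sums to 1

w = 0.05 * rng.standard_normal(2 * D)
alpha_At = attention_weights(HAt, HB, w)      # attention weights for HAt; swap roles to get HBt's weights
```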
Step S302: and determining a feature vector with attention features as a target feature vector of the target object through the voice feature vector sequence of the target object and an attention weight sequence corresponding to the voice feature vector sequence of the target object.
Assuming that the speech feature vector sequence of the target object is HXt (t = 1,2,…,T) and the corresponding attention weight sequence is α(Xt) (t = 1,2,…,T), the target feature vector YX of the target object can be calculated by the following equation:

YX = Σ_{t=1}^{T} α(Xt) · HXt          (1)
wherein, T is the number of the voice feature vectors in the voice feature vector sequence.
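A one-line numpy rendering of formula (1), under the same toy shapes as the earlier sketches (illustrative only):

```python
import numpy as np

def target_feature_vector(seq: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """YX = sum_t alpha(Xt) * HXt for a (T, D) sequence and a (T,) attention weight sequence."""
    return (alpha[:, None] * seq).sum(axis=0)

rng = np.random.default_rng(0)
HXt = rng.standard_normal((100, 200))        # speech feature vector sequence of the target object
alpha_Xt = np.full(100, 1 / 100)             # toy attention weights that sum to 1
YX = target_feature_vector(HXt, alpha_Xt)    # 200-dimensional target feature vector
```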
The target feature vector of the first object and the target feature vector of the second object may be obtained through step S302.
In a possible implementation manner, after obtaining the target feature vector of the first object and the target feature vector of the second object, the target feature vector of the first object and the target feature vector of the second object may be spliced to obtain a spliced feature vector; and determining whether the first object and the second object are the same object or not based on the spliced feature vectors.
It should be noted that, in the above embodiment, the feature vector with attention feature of the first object is used as the target feature vector of the first object, the feature vector with attention feature of the second object is used as the target feature vector of the second object, and then whether the first object and the second object are the same object is determined by the target feature vector of the first object and the target feature vector of the second object, which can significantly improve the accuracy of the determination.
In view of the above, in order to further improve the determination accuracy, in another embodiment of the present application, the basic feature vector of the first object and the feature vector with attention features of the first object are spliced, and the feature vector obtained after the splicing is used as the target feature vector of the first object; similarly, the basic feature vector of the second object and the feature vector with attention features of the second object are spliced, and the feature vector obtained after the splicing is used as the target feature vector of the second object. Whether the first object and the second object are the same object is then determined by the target feature vector of the first object and the target feature vector of the second object.
The embodiment blends more information into the feature vector for discrimination, which further improves the discrimination accuracy of the scheme.
It should be noted that, in a possible implementation manner, the voiceprint confirmation method provided in this embodiment can be implemented based on a deep neural network system, and please refer to fig. 4, which shows a schematic diagram of implementing voiceprint confirmation based on the deep neural network system, and the specific implementation process includes:
(1) The speech feature vector sequence HAt (t = 1,2,…,T) of the first object A and the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B are determined.
Specifically, a speech feature sequence At (t = 1,2,…,T) is extracted from the speech of the first object A, and a speech feature sequence Bt (t = 1,2,…,T) is extracted from the speech of the second object B; the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A is then determined through a pre-established first feature characterization network G1, and the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B is determined through a pre-established second feature characterization network G2.
The first feature characterization network G1 and the second feature characterization network G2 are obtained by pre-training with stochastic gradient descent, taking the voiceprint feature vectors obtained by averaging speech feature vector sequences extracted from a large amount of training speech as training samples, and taking whether the corresponding objects are the same object as the training target.
It should be noted that, when the entire deep neural network system is trained, the training data used is a large number of training speech pairs, and the training speech pairs need to include positive example pairs and negative example pairs. A positive example pair consists of two voices of the same object, and a negative example pair consists of two voices of different objects. To improve the robustness of the neural network, a negative example pair is preferably formed from two similar voices of different objects; for example, the voices of two speakers who sound alike can be selected.
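As an illustration of how such training pairs could be assembled (an assumption about data preparation, not a procedure specified by the patent), a small Python sketch:

```python
import itertools
import random
from typing import Dict, List, Tuple

def build_training_pairs(utts_by_speaker: Dict[str, List[str]],
                         n_negative: int, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Positive pairs: two utterances of the same speaker (label 1).
    Negative pairs: utterances of two different speakers (label 0); ideally the two speakers
    should sound similar, which is not modelled in this toy sketch."""
    rng = random.Random(seed)
    positives = [(u1, u2, 1)
                 for utts in utts_by_speaker.values()
                 for u1, u2 in itertools.combinations(utts, 2)]
    speakers = list(utts_by_speaker)
    negatives = []
    for _ in range(n_negative):
        s1, s2 = rng.sample(speakers, 2)
        negatives.append((rng.choice(utts_by_speaker[s1]), rng.choice(utts_by_speaker[s2]), 0))
    return positives + negatives

pairs = build_training_pairs({"spk1": ["a.wav", "b.wav"], "spk2": ["c.wav", "d.wav"]}, n_negative=2)
```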
In addition, the first feature characterization network G1 and the second feature characterization network G2 in the present embodiment may be, but are not limited to, a recurrent neural network (RNN), a convolutional neural network (CNN), or the like.
(2) The basic feature vector HA of the first object A is determined based on the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A, and the basic feature vector HB of the second object B is determined based on the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B.
Specifically, after the speech feature vector sequence HAt (t = 1,2,…,T) of the first object is determined through the first feature characterization network G1 and the speech feature vector sequence HBt (t = 1,2,…,T) of the second object is determined through the second feature characterization network G2, the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A is averaged over frames by the first average calculation module Avg1 to obtain the basic feature vector HA of the first object A; similarly, the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B is averaged over frames by the second average calculation module Avg2 to obtain the basic feature vector HB of the second object B.
(3) The attention weight sequence α(At) (t = 1,2,…,T) corresponding to HAt (t = 1,2,…,T) is determined based on the basic feature vector HB of the second object B and the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A, and the attention weight sequence α(Bt) (t = 1,2,…,T) corresponding to HBt (t = 1,2,…,T) is determined based on the basic feature vector HA of the first object A and the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B.
Specifically, the basic feature vector HB of the second object B and the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A are input into a pre-established first attention weight determination network att1 to obtain the attention weight sequence α(At) (t = 1,2,…,T) corresponding to the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A, output by the first attention weight determination network att1; similarly, the basic feature vector HA of the first object A and the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B are input into a pre-established second attention weight determination network att2 to obtain the attention weight sequence α(Bt) (t = 1,2,…,T) corresponding to the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B, output by the second attention weight determination network att2.
The first attention weight determination network att1 and the second attention weight determination network att2 may be obtained by taking the speech feature vector sequences extracted from a large amount of training speech as training samples and performing stochastic gradient descent pre-training with whether the speakers are the same speaker as the training target.
Further, referring to fig. 5, the network structure of the first attention weight determining network att1 and the second attention weight determining network att2 is shown. The specific process of determining an attention weight sequence with the network shown in fig. 5 is as follows: the first splicing module concat1 splices each speech feature vector in the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A with the basic feature vector HB of the second object B, the spliced feature vectors are input into the multi-layer fully connected layers FC1 and projected onto one node to obtain a one-dimensional weight vector, and the one-dimensional weight vector finally passes through a softmax function to obtain the attention weight sequence α(At) (t = 1,2,…,T). Similarly, the second splicing module concat2 splices each speech feature vector in the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B with the basic feature vector HA of the first object A, the spliced feature vectors are input into the multi-layer fully connected layers FC2 and projected onto one node to obtain a one-dimensional weight vector, and the one-dimensional weight vector finally passes through the softmax function to obtain the attention weight sequence α(Bt) (t = 1,2,…,T).
(4) The target feature vector of the first object A is determined based on the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A and the corresponding attention weight sequence α(At) (t = 1,2,…,T), and the target feature vector of the second object B is determined based on the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B and the corresponding attention weight sequence α(Bt) (t = 1,2,…,T).
Specifically, after obtaining the attention weight sequence α(At) (t = 1,2,…,T) corresponding to the speech feature vector sequence HAt (t = 1,2,…,T) of the first object A and the attention weight sequence α(Bt) (t = 1,2,…,T) corresponding to the speech feature vector sequence HBt (t = 1,2,…,T) of the second object B, HAt (t = 1,2,…,T) is weighted by α(At) (t = 1,2,…,T) according to the above formula (1) to obtain the target feature vector YA of the first object A, and HBt (t = 1,2,…,T) is weighted by α(Bt) (t = 1,2,…,T) according to the above formula (1) to obtain the target feature vector YB of the second object B.
(5) Whether the first object A and the second object B are the same object is determined based on the target feature vector of the first object A and the target feature vector of the second object B.
Specifically, the target feature vector YA of the first object A and the target feature vector YB of the second object B are spliced by the splicing module concat in fig. 4, the spliced feature vector passes through a fully connected layer FC, and the determining module DIS then determines, based on the CE (cross-entropy) criterion, whether the first object A and the second object B are the same object.
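A minimal numpy sketch of this decision step with untrained toy weights (the real system uses a trained fully connected layer and a CE-trained discrimination module; names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200
YA, YB = rng.standard_normal(D), rng.standard_normal(D)   # target feature vectors of A and B

W_fc = 0.05 * rng.standard_normal((2 * D, 2))             # toy fully connected layer with 2 output classes
logits = np.concatenate([YA, YB]) @ W_fc                  # splice the two target vectors, then project
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax over {different, same}
same_object = bool(probs[1] > probs[0])
```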
It should be noted that, when voiceprint confirmation is implemented based on the deep neural network system, the two feature characterization networks shown in fig. 4 may be used, with one network characterizing the speech features of one voice in the voice pair and the other network characterizing the speech features of the other voice; alternatively, a single feature characterization network may be used to characterize the speech features of the two voices in turn. The same applies to the attention weight determination network, which may likewise be two networks or one.
The voiceprint confirmation method provided by the embodiment of the invention considers the influence of different objects on the confirmation result, introduces the attention mechanism, generates different feature vectors with attention characteristics aiming at different objects, and can effectively improve the judgment accuracy by judging through the feature vectors with the attention characteristics.
Referring to fig. 6, another schematic diagram of implementing voiceprint confirmation based on a deep neural network system is shown. This implementation differs from that of fig. 4 in that the splicing module concat splices the basic feature vector HA of the first object, the target feature vector YA of the first object, the basic feature vector HB of the second object, and the target feature vector YB of the second object, and whether the first object A and the second object B are the same object is judged according to the spliced vector. That is, compared with the implementation shown in fig. 4, this implementation blends the basic feature vectors into the feature vector used for discrimination, and this blending further improves the determination accuracy of the voiceprint confirmation method provided in this embodiment.
Corresponding to the voiceprint confirmation method, an embodiment of the present application further provides a voiceprint confirmation apparatus, please refer to fig. 7, which shows a schematic structural diagram of the voiceprint confirmation apparatus, and the apparatus may include: a feature vector generation module 701 and a voiceprint validation module 702.
A feature vector generating module 701, configured to generate, for each of the first object and the second object, a feature vector with attention features based on the speech of the target object as a target feature vector of the target object.
Wherein the attention feature is capable of characterizing the influence of different objects on the validation result.
A voiceprint confirmation module 702, configured to determine whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
In the voiceprint confirmation apparatus provided in the embodiment of the present application, first, for each of the first object and the second object, the feature vector with attention features is generated based on the speech of the target object, so that the feature vector with attention features of the first object is obtained as the target feature vector of the first object, and the feature vector with attention features of the second object is obtained as the target feature vector of the second object; then, whether the first object and the second object are the same object is determined based on the target feature vector of the first object and the target feature vector of the second object. Therefore, the voiceprint confirmation apparatus provided by the embodiment of the application considers the influence of different objects on the confirmation result, introduces an attention mechanism, generates different feature vectors with attention features for different objects, and can effectively improve the determination accuracy by performing the determination through these feature vectors.
In a possible implementation manner, the feature vector generating module 701 in the voiceprint confirmation apparatus provided in the foregoing embodiment may include: the device comprises a feature extraction module, a basic feature vector determination module and a target feature vector determination module.
The feature extraction module is configured to, for each of the first object and the second object, extract a speech feature vector sequence from speech of the target object.
And the basic feature vector determining module is used for determining the voiceprint feature vector of the target object through the voice feature vector sequence as the basic feature vector of the target object.
And the target feature vector determining module is used for determining a feature vector with attention features as a target feature vector of the target object based on a basic feature vector of another object and the voice feature vector sequence of the target object.
In one possible implementation manner, the feature extraction module includes: a feature extraction sub-module and a feature characterization sub-module.
The feature extraction submodule is used for extracting the voice features with preset frame numbers from the voice of the target object to obtain a voice feature sequence.
And the feature characterization submodule is used for characterizing each frame of voice feature in the voice feature sequence by using a feature vector, and obtaining the voice feature vector sequence after characterizing each frame of voice feature.
In one possible implementation, the target feature vector determination module may include: an attention weight determination submodule and a target feature vector determination submodule.
And the attention weight determining submodule is used for determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object, and obtaining the attention weight sequence corresponding to the voice feature vector sequence of the target object.
The target feature vector determining submodule is configured to determine the feature vector with the attention feature as the target feature vector of the target object through the speech feature vector sequence of the target object and the attention weight sequence corresponding to the speech feature vector sequence of the target object.
In one possible implementation, the attention weight determination submodule includes: a splicing submodule, a conversion submodule and a mapping submodule.
And the splicing submodule is used for splicing each voice feature vector in the voice feature vector sequence of the target object with the basic feature vector of the other object respectively to obtain a spliced feature vector sequence.
And the conversion submodule is used for converting the spliced characteristic vector sequence into a one-dimensional vector as an attention weight vector.
The mapping submodule is configured to map each weight element in the attention weight vector to a weight value in a preset range, and obtain an attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object.
In a possible implementation manner, the voiceprint confirmation module 702 in the voiceprint confirmation apparatus provided in the foregoing embodiment is specifically configured to splice the target feature vector of the first object and the target feature vector of the second object, and obtain a spliced feature vector; and determining whether the first object and the second object are the same object or not based on the spliced feature vector.
In a possible implementation manner, in order to improve the determination accuracy, the voiceprint confirmation apparatus provided in the above embodiment may further include a feature vector splicing module.
The feature vector splicing module is configured to splice the basic feature vector of the target object and the feature vector with the attention feature of the target object, and use a feature vector obtained after splicing as the target feature vector of the target object.
Fig. 8 shows a schematic structural diagram of a voiceprint confirmation apparatus according to an embodiment of the present invention, where the voiceprint confirmation apparatus includes: a memory 801 and a processor 802.
A memory 801 for storing programs;
a processor 802 for executing the program, the program being specifically for:
generating a feature vector with attention features based on the voice of the target object as a target feature vector of the target object for each of a first object and a second object, wherein the attention features can represent the influence of different objects on a confirmation result;
determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
The voiceprint confirmation equipment may further include: a bus, a communication interface 803, an input device 804, and an output device 805.
The processor 802, the memory 801, the communication interface 803, the input device 804, and the output device 805 are connected to each other by a bus. Wherein:
a bus may include a path that transfers information between components of a computer system.
The processor 802 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present solution. It may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 802 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 801 stores programs for executing the technical solution of the present invention, and may also store an operating system and other key services. In particular, the program may include program code including computer operating instructions. More specifically, memory 801 may include a read-only memory (ROM), other types of static storage devices that may store static information and instructions, a Random Access Memory (RAM), other types of dynamic storage devices that may store information and instructions, a disk storage, a flash, and so forth.
The input device 804 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 805 may include devices that allow output of information to a user, such as a display screen, a printer, speakers, and the like.
Communication interface 803 may include any means for using a transceiver or the like to communicate with other devices or communication networks, such as ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.
The processor 802 executes programs stored in the memory 801 and invokes other devices that may be used to implement the steps of the voiceprint validation method provided by embodiments of the present invention.
The embodiment of the present invention further provides a readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the voiceprint validation method provided in any of the above embodiments.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A voiceprint confirmation method, comprising:
for each target object in a first object and a second object, generating a feature vector with attention features based on the voice of the target object in combination with the voice of the other object, as a target feature vector of the target object, wherein the attention features can represent the influence of different objects on a confirmation result;
determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
2. The voiceprint confirmation method according to claim 1, wherein the generating, for each target object in the first object and the second object, a feature vector with attention features based on the voice of the target object in combination with the voice of the other object as the target feature vector of the target object comprises:
for each target object in the first object and the second object, extracting a voice feature vector sequence from voice of the target object;
determining a voiceprint feature vector of the target object through the voice feature vector sequence to serve as a basic feature vector of the target object;
and determining a feature vector with attention features as a target feature vector of the target object based on the basic feature vector of another object and the voice feature vector sequence of the target object.
3. The voiceprint confirmation method according to claim 2, wherein the extracting a voice feature vector sequence from the voice of the target object comprises:
extracting voice features of a preset number of frames from the voice of the target object to obtain a voice feature sequence;
and characterizing each frame of voice features in the voice feature sequence with a feature vector, the voice feature vector sequence being obtained after each frame of voice features has been characterized.
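A sketch of claim 3 under the assumption that the frame-level voice features are MFCCs computed with librosa; the preset frame count and feature dimensionality are illustrative values only.

```python
import numpy as np
import librosa  # assumed front end; any frame-level feature extractor would do

def extract_feature_sequence(wav_path: str, sr: int = 16000,
                             num_frames: int = 300, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) voice feature vector sequence."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # one vector per frame
    if mfcc.shape[0] >= num_frames:           # truncate to the preset number of frames ...
        return mfcc[:num_frames]
    pad = np.zeros((num_frames - mfcc.shape[0], n_mfcc))
    return np.vstack([mfcc, pad])             # ... or zero-pad short utterances
```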
4. The voiceprint confirmation method according to claim 2, wherein the determining a feature vector with attention features as the target feature vector of the target object based on the basic feature vector of the other object and the voice feature vector sequence of the target object comprises:
determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object to obtain the attention weight sequence corresponding to the voice feature vector sequence of the target object;
and determining the feature vector with the attention feature as the target feature vector of the target object through the voice feature vector sequence of the target object and the attention weight sequence corresponding to the voice feature vector sequence of the target object.
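A sketch of claim 4: the attention weight sequence is computed from the other object's basic feature vector (the attention_weights helper is sketched after claim 5 below), and the frame vectors are then collapsed by a weighted sum. The weighted-sum pooling is an assumption about how the weighted sequence is combined.

```python
import numpy as np

def attention_vector(feature_sequence: np.ndarray, other_base: np.ndarray) -> np.ndarray:
    """Target feature vector with attention features for one object."""
    weights = attention_weights(feature_sequence, other_base)  # shape (num_frames,)
    return weights @ feature_sequence                          # shape (feat_dim,)
```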
5. The voiceprint confirmation method according to claim 3, wherein the determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object comprises:
respectively splicing each voice feature vector in the voice feature vector sequence of the target object with the basic feature vector of the other object to obtain a spliced feature vector sequence;
converting the spliced feature vector sequence into a one-dimensional vector as an attention weight vector;
and mapping each weight element in the attention weight vector into a weight value in a preset range to obtain the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object.
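A sketch of claim 5, assuming a learned projection vector converts each spliced feature vector into a scalar and a softmax maps the scalars into the preset range (0, 1); the random projection used here only stands in for trained parameters.

```python
import numpy as np

def attention_weights(feature_sequence: np.ndarray, other_base: np.ndarray,
                      proj=None) -> np.ndarray:
    """Per-frame attention weights conditioned on the other object's basic feature vector."""
    num_frames = feature_sequence.shape[0]
    # Splice every frame-level feature vector with the other object's basic feature vector.
    spliced = np.hstack([feature_sequence, np.tile(other_base, (num_frames, 1))])
    if proj is None:
        # Placeholder for the trained projection that maps each spliced vector to one scalar.
        proj = np.random.default_rng(0).standard_normal(spliced.shape[1])
    scores = spliced @ proj              # one-dimensional attention weight vector
    scores -= scores.max()               # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()       # weights mapped into (0, 1), summing to 1
```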
6. The voiceprint confirmation method according to claim 1, wherein the determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object comprises:
splicing the target feature vector of the first object and the target feature vector of the second object to obtain a spliced feature vector;
and determining whether the first object and the second object are the same object or not based on the spliced feature vector.
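A sketch of claim 6, with a logistic-regression-style scoring layer standing in for the trained discriminator; the weights, bias and decision threshold are illustrative placeholders.

```python
import numpy as np

def same_speaker(target_a: np.ndarray, target_b: np.ndarray,
                 w=None, b: float = 0.0, threshold: float = 0.5) -> bool:
    """Decide from the spliced target feature vectors whether the two objects are the same."""
    spliced = np.concatenate([target_a, target_b])
    if w is None:
        w = np.random.default_rng(1).standard_normal(spliced.size)  # stand-in for trained weights
    score = 1.0 / (1.0 + np.exp(-(w @ spliced + b)))                # probability-like score
    return bool(score >= threshold)
```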
7. The voiceprint confirmation method according to claim 2, further comprising:
splicing the basic feature vector of the target object with the feature vector with the attention feature of the target object, and taking the feature vector obtained after splicing as the target feature vector of the target object.
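In this sketch, claim 7's enrichment step reduces to concatenating the basic feature vector with the attention feature vector before the decision stage; how the enriched vector is consumed downstream is not specified here.

```python
import numpy as np

def enriched_target_vector(basic: np.ndarray, attention_vec: np.ndarray) -> np.ndarray:
    """Splice the basic feature vector with the attention feature vector (claim 7)."""
    return np.concatenate([basic, attention_vec])
```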
8. A voiceprint confirmation apparatus, comprising: a feature vector generation module and a voiceprint confirmation module;
the feature vector generation module is configured to, for each target object in the first object and the second object, generate a feature vector with attention features as the target feature vector of the target object based on the voice of the target object in combination with the voice of the other object, wherein the attention features can represent the influence of different objects on the confirmation result;
the voiceprint confirmation module is configured to determine whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
9. The voiceprint confirmation apparatus according to claim 8, wherein the feature vector generation module comprises: a feature extraction module, a basic feature vector determination module and a target feature vector determination module;
the feature extraction module is configured to, for each target object in the first object and the second object, extract a speech feature vector sequence from speech of the target object;
the basic feature vector determining module is configured to determine a voiceprint feature vector of the target object through the speech feature vector sequence, where the voiceprint feature vector is used as a basic feature vector of the target object;
and the target feature vector determination module is configured to determine a feature vector with attention features as the target feature vector of the target object based on the basic feature vector of the other object and the speech feature vector sequence of the target object.
10. The voiceprint confirmation apparatus according to claim 9, wherein the target feature vector determination module comprises: an attention weight determining submodule and a target feature vector determining submodule;
the attention weight determining submodule is used for determining the attention weight corresponding to each voice feature vector in the voice feature vector sequence of the target object through the basic feature vector of the other object and the voice feature vector sequence of the target object, and obtaining the attention weight sequence corresponding to the voice feature vector sequence of the target object;
the target feature vector determining submodule is configured to determine the feature vector with the attention feature as the target feature vector of the target object through the speech feature vector sequence of the target object and the attention weight sequence corresponding to the speech feature vector sequence of the target object.
11. The voiceprint confirmation apparatus according to claim 9, further comprising: a feature vector splicing module;
the feature vector splicing module is configured to splice the basic feature vector of the target object and the feature vector with the attention feature of the target object, and use a feature vector obtained after splicing as the target feature vector of the target object.
12. A voiceprint confirmation apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program, and the program is specifically configured to:
for each target object in the first object and the second object, generating, based on the voice of the target object in combination with the voice of the other object, a feature vector with attention features as a target feature vector of the target object, wherein the attention features can represent the influence of different objects on a confirmation result;
determining whether the first object and the second object are the same object based on the target feature vector of the first object and the target feature vector of the second object.
13. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the voiceprint confirmation method according to any one of claims 1 to 7.
CN201810663588.7A 2018-06-25 2018-06-25 Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium Active CN110634489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810663588.7A CN110634489B (en) 2018-06-25 2018-06-25 Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810663588.7A CN110634489B (en) 2018-06-25 2018-06-25 Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110634489A (en) 2019-12-31
CN110634489B (en) 2022-01-14

Family

ID=68968726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810663588.7A Active CN110634489B (en) 2018-06-25 2018-06-25 Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110634489B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445913B (en) * 2020-03-24 2023-04-07 南开大学 Voiceprint feature extraction method and device based on neural network
CN111785287B (en) 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198833A (en) * 2013-03-08 2013-07-10 北京理工大学 High-precision method of confirming speaker
KR20180050365A (en) * 2016-07-15 2018-05-14 구글 엘엘씨 Speaker verification

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9384738B2 (en) * 2014-06-24 2016-07-05 Google Inc. Dynamic threshold for speaker verification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198833A (en) * 2013-03-08 2013-07-10 北京理工大学 High-precision method of confirming speaker
KR20180050365A (en) * 2016-07-15 2018-05-14 구글 엘엘씨 Speaker verification
CN108140386A (en) * 2016-07-15 2018-06-08 谷歌有限责任公司 Speaker verification

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wang, Qiongqiong et al.; "Attention Mechanism in Speaker Recognition: What Does It Learn in Deep Speaker Embedding?"; 2018 IEEE Workshop on Spoken Language Technology (SLT 2018); 2018-12-21; pp. 1052-1055 *
Chowdhury, FARR et al.; "Attention-Based Models for Text-Dependent Speaker Verification"; 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018-04-20; pp. 1-5 *
Fang, Xin et al.; "Bidirectional Attention for Text-Dependent Speaker Verification"; Sensors; 2020-11-27; pp. 1-17 *

Also Published As

Publication number Publication date
CN110634489A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
TWI759536B (en) Voiceprint authentication method, account registration method and device
US11017784B2 (en) Speaker verification across locations, languages, and/or dialects
US11508366B2 (en) Whispering voice recovery method, apparatus and device, and readable storage medium
CN107112008B (en) Prediction-based sequence identification
US8180160B2 (en) Method for character recognition
TW202125307A (en) Image processing method and device, electronic equipment and storage medium
KR102249687B1 (en) Vocabulary integration system and method of vocabulary integration in speech recognition
CN110610700B (en) Decoding network construction method, voice recognition method, device, equipment and storage medium
JP6464650B2 (en) Audio processing apparatus, audio processing method, and program
CN111739539B (en) Method, device and storage medium for determining number of speakers
WO2017162053A1 (en) Identity authentication method and device
US20140379346A1 (en) Video analysis based language model adaptation
CN110634489B (en) Voiceprint confirmation method, voiceprint confirmation device, voiceprint confirmation equipment and readable storage medium
CN110399488B (en) Text classification method and device
CN108345581A (en) A kind of information identifying method, device and terminal device
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN108682415B (en) Voice search method, device and system
EP3501024B1 (en) Systems, apparatuses, and methods for speaker verification using artificial neural networks
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
US11132999B2 (en) Information processing device, information processing method, and non-transitory computer readable storage medium
CN106250755B (en) Method and device for generating verification code
US11837227B2 (en) System for user initiated generic conversation with an artificially intelligent machine
US11714960B2 (en) Syntactic analysis apparatus and method for the same
CN112214626B (en) Image recognition method and device, readable storage medium and electronic equipment
TWI809335B (en) Personalized speech recognition method and speech recognition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant