CN113823294B - Cross-channel voiceprint recognition method, device, equipment and storage medium - Google Patents
- Publication number: CN113823294B (application number CN202111390613.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Abstract
The invention provides a cross-channel voiceprint recognition method, device, equipment and storage medium. The method comprises: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are collected in channels of a set channel set, and the set channel set comprises at least two different channels; and inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint recognition model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the processing result. The cross-channel voiceprint recognition model is obtained by training, through multiple iteration processes, on voiceprint audio data collected in the set channel set, and in each iteration process the model parameters are trained on voiceprint audio data collected in two different channels. The technical scheme of the invention can improve the accuracy of cross-channel voiceprint recognition.
Description
Technical Field
The present invention relates to the field of speech processing and voiceprint recognition technologies, and in particular, to a cross-channel voiceprint recognition method, apparatus, electronic device, and non-transitory computer-readable storage medium.
Background
In recent years, with intensive research on voiceprint recognition technology, voiceprint recognition systems have achieved satisfactory performance under single-channel conditions. In practical applications, however, the voice signal may be transmitted through different channels, such as a network channel or a telephone channel. These channel differences distort the speech signal to different degrees and degrade the performance of the voiceprint recognition system. For example, in the registration phase the user's voice may be collected over the network channel, while in the recognition phase it is picked up over the telephone channel; in that case, voiceprint recognition performance degrades greatly due to channel mismatch. Considering the diversity of voiceprint authentication scenarios, single-channel voiceprint recognition technology greatly limits the popularization and application of voiceprint technology.
Therefore, overcoming the influence of channel variation on recognition performance, so as to improve the performance of voiceprint recognition systems under cross-channel conditions, is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention provides a cross-channel voiceprint recognition method and device, electronic equipment and a non-transitory computer readable storage medium, which are used for solving the problem that cross-channel voiceprint recognition is difficult in the prior art and improving the accuracy of cross-channel voiceprint recognition.
The invention provides a cross-channel voiceprint recognition method, which comprises the following steps: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
The cross-channel voiceprint recognition method provided by the invention further comprises a training process of the cross-channel voiceprint recognition model, wherein the training process comprises the following steps: acquiring a sample voiceprint audio data set collected in the set channel set, wherein the set channel set comprises a first channel and a second channel, and the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, selecting sample voiceprint audio data in another channel except the one channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, and completing an iteration process; and reselecting sample voiceprint audio data to perform an iterative process until the second loss function is converged to obtain the cross-channel voiceprint recognition model.
According to the cross-channel voiceprint recognition method provided by the invention, the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
According to the cross-channel voiceprint identification method provided by the invention, the voiceprint audio data to be identified comprise first data collected in a first channel and second data collected in a second channel; after obtaining the voiceprint audio data processing result, the method further includes: acquiring a similarity relation between the first data and the second data according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
According to the cross-channel voiceprint identification method provided by the invention, the voiceprint audio data to be identified comprises third data collected in a first channel; after obtaining the voiceprint audio data processing result, the method further includes: acquiring a similarity relation between the third data and the on-library data according to a voiceprint audio data processing result corresponding to the third data and on-library data in a voiceprint library, wherein the on-library data is obtained according to voiceprint audio data collected in a second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
According to the cross-channel voiceprint recognition method provided by the invention, the similarity relation is obtained by calculating the cosine distance or by performing probabilistic linear discriminant analysis.
According to the cross-channel voiceprint recognition method provided by the invention, in each iteration process the intermediate parameters are updated according to the following formula:

θ′ = θ − α·∇_θ L_i(θ; D_i)

wherein L_i(θ; D_i) is the loss function of the model parameters θ on channel i, D_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; D_i) is the amount of change in θ. The model parameters θ are then updated according to the following formula:

θ ← θ − β·∇_θ L_j(θ′; D_j)

wherein j ≠ i, L_j(θ′; D_j) is the loss function of the intermediate parameters θ′ on channel j, D_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.
The invention provides a cross-channel voiceprint recognition device, which comprises: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voiceprint audio data to be identified, the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; the identification unit is used for inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result so as to identify the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
According to the cross-channel voiceprint recognition device provided by the invention, the device further comprises a training unit used for performing a training process on the cross-channel voiceprint recognition model, and the training unit comprises: a first obtaining subunit, configured to obtain a sample voiceprint audio data set collected in the set channel set, where the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; the iteration subunit is configured to select sample voiceprint audio data in one channel, calculate a first loss function and an updated intermediate parameter of the sample voiceprint audio data in a channel corresponding to the iteration subunit, select sample voiceprint audio data in another channel other than the one channel based on the updated intermediate parameter and the first loss function, calculate a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, complete an iteration process, and reselect the sample voiceprint audio data to perform the iteration process until the second loss function converges, so as to obtain the cross-channel voiceprint identification model.
According to the cross-channel voiceprint recognition device provided by the invention, the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
According to the cross-channel voiceprint recognition device provided by the invention, the voiceprint audio data to be recognized comprise first data collected in the first channel and second data collected in the second channel; the apparatus further includes a first similarity relation determination unit configured to: after the voiceprint audio data processing result is obtained, acquiring the similarity relation between the first data and the second data according to the voiceprint audio data processing result corresponding to the first data and the voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
According to the cross-channel voiceprint recognition device provided by the invention, the voiceprint audio data to be recognized comprise third data collected in the first channel; the apparatus further includes a second similarity relation determination unit configured to: acquiring a similarity relation between the third data and the on-library data according to a voiceprint audio data processing result corresponding to the third data and the on-library data in a voiceprint library, wherein the on-library data is obtained according to the voiceprint audio data collected in the second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
According to the cross-channel voiceprint recognition device provided by the invention, the iteration subunit is further configured to: during each iteration process, update the intermediate parameters according to the following formula:

θ′ = θ − α·∇_θ L_i(θ; D_i)

wherein L_i(θ; D_i) is the loss function of the model parameters θ on channel i, D_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; D_i) is the amount of change in θ; and update the model parameters θ according to the following formula:

θ ← θ − β·∇_θ L_j(θ′; D_j)

wherein j ≠ i, L_j(θ′; D_j) is the loss function of the intermediate parameters θ′ on channel j, D_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-channel voiceprint recognition method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the cross-channel voiceprint recognition method as described in any of the above.
According to the cross-channel voiceprint recognition method and device, the electronic equipment and the non-transitory computer-readable storage medium provided by the invention, the model training of each iteration process is performed on voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model applicable to different channels is obtained, and the voiceprint audio data to be recognized can be accurately recognized by means of this model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a cross-channel voiceprint recognition method provided by the present invention;
FIG. 2 is a flow chart illustrating a training process of a cross-channel voiceprint recognition model provided by the present invention;
FIG. 3 is a flow chart of a two iteration process provided by the present invention;
FIG. 4 is a schematic structural diagram of a cross-channel voiceprint recognition apparatus provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the one or more embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the invention. As used in one or more embodiments of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present invention refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used herein to describe various information in one or more embodiments of the present invention, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
The terms used in the examples of the present invention are explained below:
voiceprint: one type of information in a speech signal is a general term for speech features that characterize the identity of a speaker and speech models built based on these features. Because the different speakers use different vocal organs such as tongue, oral cavity, nasal cavity, vocal cords, lung, etc. in different sizes and forms, and considering the difference of different speakers in age, character, language habit, etc., the characteristics of different speakers such as vocal volume and vocal frequency are greatly different. It can be said that the voiceprint patterns of any two persons are not identical.
Voiceprint recognition: also called speaker recognition, a biometric identification technology that automatically identifies a speaker by computer, using the voiceprint features in a speech signal that represent the speaker's personal information together with various information recognition technologies. Voiceprint recognition is essentially a type of pattern recognition problem. A typical voiceprint recognition system generally consists of two phases: registration and recognition. In registration, a speaker model is trained from the user's enrolled voice; in recognition, the system judges whether an unknown voice comes from a specified speaker.
In the related art, traditional voiceprint recognition technology is based on statistical probability models, the most classical being the Gaussian mixture model plus universal background model (GMM-UBM) architecture. To further enhance the expressive power of speaker characteristics under limited data, various subspace models were proposed in succession, the most notable of which is the i-vector model. The i-vector model introduces an important concept: the speaker characterization vector (speaker embedding), i.e., a continuous vector of fixed length used to characterize the speaker's traits.
In recent years, building on deep learning methods, researchers have proposed a series of voiceprint recognition models, such as the d-vector model and the x-vector model. Such models map a speech signal of arbitrary duration into a continuous vector of fixed length called a deep speaker characterization vector (deep speaker embedding). A space describing speaker characteristics is constructed from these characterization vectors; in this space, scoring and decision making for voiceprint recognition can be carried out.
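The mapping from a variable-duration signal to a fixed-length characterization vector can be illustrated with a minimal sketch; the framing front end and the mean-plus-standard-deviation statistics pooling below are simplified stand-ins for a full d-vector or x-vector network, not the actual model:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames (toy front end)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def statistics_pooling(frame_vectors):
    """Pool per-frame vectors into one fixed-length utterance vector,
    as x-vector systems do: concatenate mean and standard deviation."""
    return np.concatenate([frame_vectors.mean(axis=0), frame_vectors.std(axis=0)])

rng = np.random.default_rng(0)
short_utt = rng.normal(size=8_000)   # 0.5 s at 16 kHz
long_utt = rng.normal(size=48_000)   # 3 s at 16 kHz

# Regardless of duration, the pooled embedding has the same dimensionality.
e1 = statistics_pooling(frame_features(short_utt))
e2 = statistics_pooling(frame_features(long_utt))
```

Because the pooling step collapses the frame axis, utterances of any length land in the same fixed-dimensional space, which is what makes scoring between them possible.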
For mainstream speaker models, the training goal is usually to maximally distinguish different speakers without considering channel disturbance, which makes them hard to apply effectively to the cross-channel task. To address the cross-channel problem, researchers have conducted a series of studies, which mainly fall into two classes: one is channel adaptation; the other is channel generalization. For channel adaptation, the basic idea is to project channel A into channel B through some mapping function and complete registration and identification on channel B; for channel generalization, the basic idea is to learn a channel-independent space, project both channel A and channel B into that space, and perform registration and identification there.
Owing to channel disturbance, it is difficult for existing cross-channel voiceprint recognition schemes to achieve high recognition accuracy.
To solve this problem, an embodiment of the present invention provides a cross-channel voiceprint recognition scheme. The scheme is a channel robustness optimization method that improves the channel generalization of a voiceprint recognition system and thereby addresses the cross-channel recognition problem. The technical scheme of the embodiment of the invention belongs to the second class above, channel generalization.
The following detailed description of exemplary embodiments of the invention refers to the accompanying drawings.
Fig. 1 is a flowchart illustrating a cross-channel voiceprint recognition method according to an embodiment of the present invention. The method provided by the embodiment of the invention can be executed by any electronic device with computer processing capability, such as a terminal or a server. As shown in fig. 1, the cross-channel voiceprint recognition method includes:
Step 102, acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are collected in channels of a set channel set, and the set channel set comprises at least two different channels.
Step 104, inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint recognition model to obtain a voiceprint audio data processing result, so as to identify the voiceprint audio data according to the processing result.
Specifically, the at least two channels may be a first channel and a second channel with different transmission media.
In particular, the cross-channel voiceprint recognition model is a deep neural network model. The data processing result is the feature vector of the voiceprint audio data to be recognized output by the model, i.e., a speaker characterization vector. According to this speaker characterization vector, two pieces of voiceprint audio data can be compared with each other, or the currently input voiceprint can be compared with the voiceprints in the database, within the space describing speaker characteristics.
In the embodiment of the invention, in the training process of the cross-channel voiceprint recognition model, each iteration part adopts voiceprint audio data in two different channels for training, so that channel generalization can be better realized, and the accuracy is higher during cross-channel voiceprint recognition.
Before step 104, a training process for the cross-channel voiceprint recognition model is further included. As shown in fig. 2, the training process includes:
Step 201, acquiring a sample voiceprint audio data set collected in the set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels.
Step 202, selecting sample voiceprint audio data in one channel and calculating a first loss function and an updated intermediate parameter in the corresponding channel; then, based on the updated intermediate parameter and the first loss function, selecting sample voiceprint audio data in another channel and calculating a second loss function and updated model parameters, completing one iteration process.
In step 203, it is determined whether the second loss function has converged; if yes, step 204 is executed, and if not, step 202 is executed.
And step 204, obtaining a cross-channel voiceprint recognition model.
In step 202, the operation of updating the intermediate parameters is a local update phase of model parameter update, and the operation of updating the model parameters is a global update phase of model parameter update. The training data for these two phases come from different channels.
In an embodiment of the invention, the at least two different channels comprise at least one of the following channel classes: a wireless channel, a wired channel, and a storage channel.
The two different channels may be different channels in the same category of channels, for example, channels of two different transmission media in a wired channel, or two channels in different categories, for example, one is a wired channel and one is a wireless channel.
In one embodiment, the training data of the two phases come from two different channels. Let D = {D_i, D_j} denote the data sets collected from channel i and channel j, let B_i ⊆ D_i and B_j ⊆ D_j be the subsets (mini-batches) of channel i and channel j used in one iteration, and let θ denote the model parameters of the model being trained.

The local update is performed according to:

θ′ = θ − α·∇_θ L_i(θ; B_i)

wherein L_i(θ; B_i) is the loss function of θ on channel i, B_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; B_i) is the amount of change in θ.

The global update is performed according to:

θ ← θ − β·∇_θ L_j(θ′; B_j)

wherein j ≠ i, L_j(θ′; B_j) is the loss function of θ′ on channel j, B_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.

In this solution, the model parameters θ are updated only in the global update; the local update merely computes the intermediate parameters θ′ at which the gradient for the global update is evaluated.
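One iteration of this local-plus-global schedule can be sketched numerically; the quadratic per-channel losses and the first-order treatment of the global gradient (evaluating the channel-j gradient at the intermediate parameters and applying it to the model parameters) are illustrative assumptions, not the actual network or losses of this scheme:

```python
import numpy as np

def channel_loss_grad(theta, target):
    """Toy per-channel loss L(theta) = 0.5 * ||theta - target||^2 and its
    gradient; each channel's data is summarised here by one target point."""
    loss = 0.5 * np.sum((theta - target) ** 2)
    return loss, theta - target

def meta_iteration(theta, target_i, target_j, alpha=0.1, beta=0.1):
    # Local update on channel i: theta_prime = theta - alpha * grad L_i(theta).
    _, g_i = channel_loss_grad(theta, target_i)
    theta_prime = theta - alpha * g_i
    # Global update on channel j: the gradient is computed at theta_prime
    # but applied to theta (first-order approximation of the meta-gradient).
    _, g_j = channel_loss_grad(theta_prime, target_j)
    return theta - beta * g_j

theta = np.zeros(2)
target_a = np.array([1.0, 0.0])  # optimum for "channel A" data
target_b = np.array([0.0, 1.0])  # optimum for "channel B" data
for step in range(200):
    # Alternate which channel drives the local vs. the global update.
    if step % 2 == 0:
        theta = meta_iteration(theta, target_a, target_b)
    else:
        theta = meta_iteration(theta, target_b, target_a)
# theta settles near a compromise serving both channels.
```

Because each round updates θ using one channel's gradient evaluated after a step on the other channel, the parameters are pulled toward a region that works for both channels rather than overfitting either one.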
Before step 104, the voiceprint audio data to be recognized also needs to be preprocessed. The preprocessing operation may be a noise reduction operation, a removal of silent-segment data, or both performed together.
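A minimal energy-based silent-segment removal, standing in for a real voice activity detector (the frame sizes and the relative threshold below are illustrative assumptions), might look like:

```python
import numpy as np

def remove_silence(signal, frame_len=400, hop=160, rel_threshold=0.1):
    """Drop low-energy (silent) frames from a waveform.

    A frame is kept when its RMS energy exceeds `rel_threshold` times the
    utterance's peak frame energy -- a simple stand-in for a real VAD.
    """
    frames = [signal[i : i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    keep = energies > rel_threshold * energies.max()
    kept = [f for f, k in zip(frames, keep) if k]
    return np.concatenate(kept) if kept else signal[:0]

rng = np.random.default_rng(1)
speech = rng.normal(scale=1.0, size=8_000)     # loud middle segment
silence = rng.normal(scale=0.001, size=8_000)  # near-silent head and tail
trimmed = remove_silence(np.concatenate([silence, speech, silence]))
# The near-silent head and tail are discarded, shortening the signal.
```

Note that the overlapping frames are concatenated naively here; a production preprocessor would reassemble non-overlapping audio or operate on feature frames instead.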
In voiceprint recognition technology, two pieces of voiceprint audio data can be compared to determine whether they come from the same speaker, i.e., one-to-one confirmation; alternatively, the voiceprint audio data belonging to the same speaker as the current voiceprint audio data can be found among multiple candidates, i.e., one-to-many recognition.
In one embodiment of the invention, the voiceprint audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; after step 104, one-to-one confirmation of the voiceprint audio data may be performed, and specifically, a similarity relationship between the first data and the second data is obtained according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and the set first threshold value.
This embodiment may be used for one-to-one validation of voiceprint audio data collected under different channels of the same determined user. For example, the voiceprint audio data of the user collected at the mobile phone end is compared with the voiceprint audio data of the same user collected at other equipment for confirmation.
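As an illustration of this one-to-one confirmation, the sketch below scores two hypothetical embeddings with the cosine distance and applies a decision threshold; the embedding values and the 0.6 threshold are assumptions, since a real threshold would be tuned on held-out trial pairs:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_channel1, emb_channel2, threshold=0.6):
    """One-to-one confirmation: same speaker iff the score clears the
    threshold. The 0.6 value is illustrative, not a recommended setting."""
    return cosine_score(emb_channel1, emb_channel2) >= threshold

# Hypothetical embeddings: two channels of one speaker, plus an impostor.
same_a = np.array([0.9, 0.1, 0.4])
same_b = np.array([0.8, 0.2, 0.5])
diff_c = np.array([-0.7, 0.6, -0.2])

accept = verify(same_a, same_b)  # first data vs. second data
reject = verify(same_a, diff_c)  # first data vs. an impostor
```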
In another embodiment of the invention, the voiceprint audio data to be identified comprises third data collected in the first channel; after step 104, identifying the voiceprint audio data, specifically, obtaining a similarity relationship between the third data and the in-library data according to a voiceprint audio data processing result corresponding to the third data and the in-library data in the voiceprint library, wherein the in-library data is obtained according to the voiceprint audio data collected in the second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
Wherein the voiceprint library stores voiceprint audio data of a plurality of different speakers of the second channel. This embodiment can be used for one-to-many recognition of voiceprint audio data acquired under different channels of an uncertain user. For example, the voiceprint audio data of a certain user collected at the mobile phone end is compared with the database data collected at other equipment for identification.
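The one-to-many case can be sketched the same way: score the probe against every entry of a hypothetical voiceprint library, keep the best match, and accept it only if it clears the threshold. All names, embedding values, and the threshold below are illustrative:

```python
import numpy as np

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, enrolled, threshold=0.6):
    """One-to-many recognition: score the probe against every enrolled
    embedding, take the best match, and accept it only above the threshold.
    Returns (speaker_id or None, best_score)."""
    best_id, best = None, -1.0
    for spk, emb in enrolled.items():
        s = cosine_score(probe, emb)
        if s > best:
            best_id, best = spk, s
    return (best_id if best >= threshold else None), best

library = {  # hypothetical library built from second-channel audio
    "alice": np.array([1.0, 0.0, 0.2]),
    "bob": np.array([0.1, 1.0, -0.3]),
}
probe = np.array([0.9, 0.1, 0.25])  # first-channel probe embedding
match, score = identify(probe, library)
```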
The similarity relation can be obtained by calculating the cosine distance or by performing probabilistic linear discriminant analysis (PLDA). Calculating similarity with the cosine distance is simple; back-end algorithms such as PLDA are slightly more complex but offer higher scoring accuracy.
The channel robustness optimization method provided by the embodiment of the invention has great advantages under cross-channel conditions.
Assume that the order of channel $c_1$ and channel $c_2$ in the training process is not limited. Then the following formula holds:

$$\mathbb{E}[\mathcal{L}] \approx \mathcal{L}_{c_1}(\theta) + \mathcal{L}_{c_2}(\theta) - \alpha\,\nabla_{\theta}\mathcal{L}_{c_1}(\theta)\cdot\nabla_{\theta}\mathcal{L}_{c_2}(\theta) \tag{2}$$
In formula (2), the first term on the right side of the equation, $\mathcal{L}_{c_1}(\theta)+\mathcal{L}_{c_2}(\theta)$, accumulates the loss value of each channel's data in the data set, which is equivalent to the loss value of mixed training over multiple channels. The second term on the right side of the equation can be regarded as a regularization term: the inner product of the gradients of the loss function on the different channels.
In model training, the optimization objective is to minimize the loss function. Optimizing the first term on the right side of the equation makes the model parameters $\theta$ converge gradually; while the first term converges, the second term keeps the gradient directions of the different channels as consistent as possible, that is, the directions coincide and the inner product is maximal. This means that optimizing the objective function ensures, on the one hand, that recognition performance is optimized on each individual channel and, on the other hand, that these per-channel optimizations remain consistent across channels.
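Where the gradient inner-product term comes from can be sketched by a first-order Taylor expansion of the global-update loss around the local update, assuming a small local learning rate $\alpha$ and the per-channel loss notation used above:

```latex
% Local update on channel c_1, then evaluate the loss on channel c_2:
\mathcal{L}_{c_2}\!\bigl(\theta - \alpha \nabla_{\theta}\mathcal{L}_{c_1}(\theta)\bigr)
  \;\approx\; \mathcal{L}_{c_2}(\theta)
  \;-\; \alpha\,\nabla_{\theta}\mathcal{L}_{c_1}(\theta)\cdot\nabla_{\theta}\mathcal{L}_{c_2}(\theta)
% Averaging over the two possible channel orders and adding the
% local-step loss yields the accumulated per-channel losses minus the
% gradient inner product, i.e. the structure of formula (2).
```

Since the inner-product term enters the minimized objective with a negative sign, minimizing the loss maximizes the agreement of the two channels' gradient directions.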
Hereinafter, the training and testing process of the present invention is illustrated taking the 16 kHz voice data of a network channel and the 8 kHz voice data of a telephone channel as an example.
Fig. 3 is a diagram of the iterative process and shows two rounds of iteration, in which $\theta_0$, $\theta_1$ and $\theta_2$ denote the model parameters at each iteration and the solid arrows represent the parameter update direction of each iteration. As indicated by the dashed arrows in the figure, the first round of training includes two steps, a local update and a global update: the first step, the local update, uses the 8 kHz channel data, and the second step, the global update, uses the 16 kHz channel data. Similarly, in the second round of training, the local update uses the 16 kHz channel data and the global update uses the 8 kHz channel data. After a number of rounds of training, the finally optimized model parameters $\theta^{*}$ are obtained.
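One such round can be sketched as follows; the linear model, the squared-error loss, and the names `loss_grad` and `train_round` are illustrative stand-ins for the real voiceprint network, and the learning rates are arbitrary small values:

```python
import numpy as np

def loss_grad(theta, x, y):
    """Loss and gradient of 0.5*||x @ theta - y||^2 on one channel's batch."""
    err = x @ theta - y
    return 0.5 * float(err @ err), x.T @ err

def train_round(theta, batch_local, batch_global, alpha=0.01, beta=0.01):
    """One iteration: local update on one channel's data, then a global
    update of the model parameters using the other channel's data."""
    x_l, y_l = batch_local
    _, g_local = loss_grad(theta, x_l, y_l)
    theta_mid = theta - alpha * g_local            # intermediate parameters
    x_g, y_g = batch_global
    _, g_global = loss_grad(theta_mid, x_g, y_g)   # gradient at the intermediate point
    return theta - beta * g_global                 # global parameter update

# Toy "channels": two feature distributions sharing one underlying mapping.
rng = np.random.default_rng(1)
true_theta = rng.standard_normal(5)
x8 = rng.standard_normal((20, 5)); ch_8k = (x8, x8 @ true_theta)
x16 = rng.standard_normal((20, 5)); ch_16k = (x16, x16 @ true_theta)

theta = np.zeros(5)
for i in range(50):                                # alternate the channel order per round
    a, b = (ch_8k, ch_16k) if i % 2 == 0 else (ch_16k, ch_8k)
    theta = train_round(theta, a, b)
print(loss_grad(theta, *ch_8k)[0])                 # loss shrinks over the rounds
```

Swapping which channel supplies the local step and which supplies the global step on alternate rounds mirrors the 8 kHz/16 kHz alternation described for Fig. 3.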
In the test stage, the 16 kHz voice data of the network channel and the 8 kHz voice data of the telephone channel can each be mapped by the model into the same parameter space, and registration and confirmation or identification are completed in that space.
In an application scenario of the embodiment of the invention, a user registers a voiceprint through an application program on a mobile terminal and consults services through a call center. In this process, the merchant's business system uses voiceprint recognition to authenticate the user's identity and ensure business security. Voice with a sampling rate of 16 kHz is collected through the network channel of the mobile terminal, while voice with a sampling rate of 8 kHz is collected through the telephone channel; comparing the two belongs to cross-channel comparison, namely cross-channel recognition.
The technical scheme of the embodiment of the invention has a simple training process and can easily be ported to various deep learning frameworks. In addition, it guarantees both the optimization of each channel and the consistency of those optimizations, avoiding optimization deviation between different channels and preventing overfitting on particular channels.
As can be seen from formula (2), the method has a solid mathematical foundation, which supports the effectiveness of the scheme. The technical scheme of the embodiment of the invention is not only suitable for the cross-channel voiceprint recognition problem, but can also be generalized to other pattern-recognition applications, such as cross-channel image recognition tasks like face recognition.
According to the cross-channel voiceprint recognition method provided by the invention, model training in each iteration process is carried out by adopting the voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model suitable for different channels can be obtained, and the cross-channel voiceprint recognition model can be used for accurately recognizing the voiceprint audio data to be recognized.
The cross-channel voiceprint recognition device provided by the invention is described below, and the cross-channel voiceprint recognition device described below and the cross-channel voiceprint recognition method described above can be referred to correspondingly.
As shown in fig. 4, an apparatus for cross-channel voiceprint recognition according to an embodiment of the present invention includes:
an obtaining unit 402, configured to obtain voiceprint audio data to be identified, where the voiceprint audio data to be identified is collected in a channel in a set channel set, and the set channel set includes at least two different channels.
The identification unit 404 is configured to input voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and perform voiceprint audio data identification according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in a set channel through a plurality of iteration processes, and model parameters are trained by the voiceprint audio data collected in two different channels in each iteration process.
In the embodiment of the present invention, the training unit is further included for performing a training process on the cross-channel voiceprint recognition model, and the training unit includes: the first acquisition subunit is used for acquiring a sample voiceprint audio data set acquired in a set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are acquired in at least two different channels; the iteration subunit is used for selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, selecting sample voiceprint audio data in another channel except the channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, completing an iteration process, and reselecting the sample voiceprint audio data to perform an iteration process until the second loss function is converged to obtain a cross-channel voiceprint recognition model.
In the embodiment of the invention, the voice print audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; the apparatus further includes a first similarity relation determination unit configured to: after obtaining the voiceprint audio data processing result, acquiring a similarity relation between the first data and the second data according to the voiceprint audio data processing result corresponding to the first data and the voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and the set first threshold value.
In the embodiment of the invention, the voiceprint audio data to be identified comprises third data collected in a first channel; the apparatus further includes a second similarity relation determination unit configured to: acquire a similarity relation between the third data and the in-library data according to the voiceprint audio data processing result corresponding to the third data and the in-library data in the voiceprint library, wherein the in-library data are obtained from voiceprint audio data collected in a second channel; select, from the in-library data and according to the similarity relation, fourth data having the maximum similarity with the third data; and identify whether the third data and the fourth data come from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
In an embodiment of the invention, the at least two different channels comprise at least one of the following channel classes: a wireless channel, a wired channel, and a storage channel.
The two different channels may be different channels in the same category of channels, for example, channels of two different transmission media in a wired channel, or two channels in different categories, for example, one is a wired channel and one is a wireless channel.
In an embodiment of the present invention, the iteration unit is further configured to: during each iteration, update the intermediate parameters according to the following formula:

$$\theta' = \theta + \Delta\theta, \qquad \Delta\theta = -\alpha\,\nabla_{\theta}\mathcal{L}_{c_i}(\theta; x_{c_i})$$

wherein $\mathcal{L}_{c_i}$ is the loss function of $\theta$ on channel $c_i$, $x_{c_i}$ is the voiceprint audio data collected from channel $c_i$, $\alpha$ is the learning rate of the local update, and $\Delta\theta$ is the amount of change of $\theta$; and update the model parameters according to the following formula:

$$\theta \leftarrow \theta - \beta\,\nabla_{\theta}\mathcal{L}_{c_j}(\theta'; x_{c_j})$$

wherein $\mathcal{L}_{c_j}$ is the loss function of $\theta'$ on channel $c_j$, $x_{c_j}$ is the voiceprint audio data collected from channel $c_j$, and $\beta$ is the learning rate of the global update.
Since each functional module of the cross-channel voiceprint recognition apparatus in the exemplary embodiment of the present invention corresponds to the step of the exemplary embodiment of the cross-channel voiceprint recognition method, for details that are not disclosed in the embodiment of the apparatus of the present invention, please refer to the above-mentioned embodiment of the cross-channel voiceprint recognition method of the present invention.
According to the cross-channel voiceprint recognition device provided by the invention, model training in each iteration process is carried out by adopting the voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model suitable for different channels can be obtained, and the cross-channel voiceprint recognition model can be used for accurately recognizing the voiceprint audio data to be recognized.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a cross-channel voiceprint recognition method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the cross-channel voiceprint recognition method provided by the above methods, the method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program that when executed by a processor is implemented to perform the cross-channel voiceprint recognition methods provided above, the method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A cross-channel voiceprint recognition method, comprising:
acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels;
inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result;
the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through voiceprint audio data collected in two different channels in each iteration process;
the training process of the cross-channel voiceprint recognition model comprises the following steps:
acquiring a sample voiceprint audio data set collected in the set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels;
selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, selecting sample voiceprint audio data in another channel except the one channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, and completing an iteration process;
and reselecting sample voiceprint audio data to perform an iterative process until the second loss function is converged to obtain the cross-channel voiceprint recognition model.
2. The method of claim 1, wherein the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
3. The method according to claim 1, wherein the voiceprint audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; after obtaining the voiceprint audio data processing result, the method further includes:
acquiring a similar relation between the first data and the second data according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data;
and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
4. The method of claim 1, wherein the voiceprint audio data to be identified comprises third data collected in a first channel; after obtaining the voiceprint audio data processing result, the method further comprises:

acquiring a similarity relation between the third data and the in-library data according to a voiceprint audio data processing result corresponding to the third data and the in-library data in a voiceprint library, wherein the in-library data are obtained from voiceprint audio data collected in a second channel;

selecting, from the in-library data and according to the similarity relation, fourth data with the maximum similarity with the third data;
and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
5. The method according to claim 3 or 4, wherein the similarity relation is obtained by calculating a cosine distance or performing a probabilistic linear discriminant analysis.
6. The method of claim 1, wherein during each of said iteration processes, the intermediate parameters are updated according to the following formula:

$$\theta' = \theta + \Delta\theta, \qquad \Delta\theta = -\alpha\,\nabla_{\theta}\mathcal{L}_{c_i}(\theta; x_{c_i})$$

wherein $\mathcal{L}_{c_i}$ is the loss function of $\theta$ on channel $c_i$, $x_{c_i}$ is the voiceprint audio data collected from channel $c_i$, $\alpha$ is the learning rate of the local update, and $\Delta\theta$ is the amount of change of $\theta$; and the model parameters are updated according to the following formula:

$$\theta \leftarrow \theta - \beta\,\nabla_{\theta}\mathcal{L}_{c_j}(\theta'; x_{c_j})$$

wherein $\mathcal{L}_{c_j}$ is the loss function of $\theta'$ on channel $c_j$, $x_{c_j}$ is the voiceprint audio data collected from channel $c_j$, and $\beta$ is the learning rate of the global update.
7. A cross-channel voiceprint recognition apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voiceprint audio data to be identified, the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels;
the identification unit is used for inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result so as to identify the voiceprint audio data according to the voiceprint audio data processing result;
the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through voiceprint audio data collected in two different channels in each iteration process;
a training unit for performing a training process on the cross-channel voiceprint recognition model, the training unit comprising: a first obtaining subunit, configured to obtain a sample voiceprint audio data set collected in the set channel set, where the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; the iteration subunit is configured to select sample voiceprint audio data in one channel, calculate a first loss function and an updated intermediate parameter of the sample voiceprint audio data in a channel corresponding to the iteration subunit, select sample voiceprint audio data in another channel other than the one channel based on the updated intermediate parameter and the first loss function, calculate a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, complete an iteration process, and reselect the sample voiceprint audio data to perform the iteration process until the second loss function converges, so as to obtain the cross-channel voiceprint identification model.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111390613.7A CN113823294B (en) | 2021-11-23 | 2021-11-23 | Cross-channel voiceprint recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113823294A CN113823294A (en) | 2021-12-21 |
CN113823294B true CN113823294B (en) | 2022-03-11 |
Family
ID=78919679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111390613.7A Active CN113823294B (en) | 2021-11-23 | 2021-11-23 | Cross-channel voiceprint recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113823294B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||