CN108268948B - Data processing apparatus and data processing method - Google Patents

Data processing apparatus and data processing method

Info

Publication number
CN108268948B
Authority
CN
China
Prior art keywords
audio data
vector
cluster
training
distance
Prior art date
Legal status
Active
Application number
CN201710001199.3A
Other languages
Chinese (zh)
Other versions
CN108268948A (en)
Inventor
刘柳
刘汝杰
石自强
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201710001199.3A
Publication of CN108268948A
Application granted
Publication of CN108268948B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis characterised by the analysis technique
    • G10L25/30 Speech or voice analysis using neural networks


Abstract

The present invention relates to a data processing apparatus and a data processing method. The data processing apparatus according to the present invention includes: an extraction unit configured to extract an i-vector from each of a plurality of training audio data; a dividing unit configured to divide the i-vectors into a plurality of clusters and calculate the cluster center of each cluster; a calculation unit configured to calculate the distance between the i-vector of each training audio data and the cluster center of each cluster; and a training unit configured to train a deep neural network (DNN) model, wherein the training unit takes the distance between the i-vector of each training audio data and the cluster center of each cluster as the ground-truth output of the DNN model. With the data processing apparatus and the data processing method according to the present invention, the DNN model can be trained to output the distance between the i-vector of audio data and each cluster center, thereby reducing the amount of computation in audio data registration and recognition while yielding richer label information.

Description

Data processing apparatus and data processing method
Technical Field
Embodiments of the present invention relate to the field of data processing, and in particular, to a data processing apparatus and method that can train a deep neural network DNN model, a data processing apparatus and method that can register audio data, and a data processing apparatus and method that can test audio data.
Background
This section provides background information related to the present invention, which is not necessarily prior art.
Speaker recognition, also known as voiceprint recognition, is a biometric identification technique. Conventional speaker recognition technology falls mainly into two types. The first extracts an i-vector (also referred to as an identity vector) based on a Gaussian Mixture Model (GMM) and registers and identifies audio data according to the i-vector. The second extracts a d-vector based on a Deep Neural Network (DNN) and registers and identifies audio data according to the d-vector. Both techniques have drawbacks. In the GMM-based i-vector technique, the i-vector can be obtained only by first extracting the supervector of the audio data and then performing eight matrix operations and one matrix inversion; the algorithm is complex and time-consuming. Furthermore, if the amount of data used to train the GMM is reduced, recognition accuracy may drop significantly. In the DNN-based d-vector technique, structural limitations force the output layer used during training to be discarded, with the output of the last hidden layer taken as the d-vector. Moreover, such a system has a fixed number of output nodes, so the DNN model must be retrained whenever the training set is updated. In addition, only speaker identity is used as the label, discarding a large amount of information such as channel, sentence content, and noise.
In view of the above technical problems, the present invention aims to provide a solution that combines the two recognition technologies to train a suitable DNN model, so as to reduce the amount of computation during audio data registration and recognition, simplify the registration and recognition processes, and obtain richer label information.
Disclosure of Invention
This section provides a general summary of the invention and is not a comprehensive disclosure of its full scope or all of its features.
An object of the invention is to provide a data processing apparatus and a data processing method that can train a suitable DNN model so as to reduce the amount of computation during audio data registration and recognition, simplify the registration and recognition processes, and obtain richer label information.
According to an aspect of the present invention, there is provided a data processing apparatus including: an extraction unit configured to extract an i-vector from each of a plurality of training audio data; a dividing unit configured to divide the i-vectors into a plurality of clusters and calculate a cluster center of each of the plurality of clusters; a calculation unit configured to calculate the distance between the i-vector of each training audio data and the cluster center of each cluster; and a training unit configured to train a deep neural network (DNN) model, wherein the training unit takes the distance between the i-vector of each training audio data and the cluster center of each cluster as the ground-truth output of the DNN model.
According to another aspect of the present invention, there is provided a data processing method including: extracting an i-vector from each of a plurality of training audio data; dividing the i-vectors into a plurality of clusters and calculating a cluster center of each of the plurality of clusters; calculating the distance between the i-vector of each training audio data and the cluster center of each cluster; and training a deep neural network (DNN) model, wherein training the DNN model includes taking the distance between the i-vector of each training audio data and the cluster center of each cluster as the ground-truth output of the DNN model.
According to another aspect of the present invention, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform a data processing method according to the present invention.
According to another aspect of the present invention, there is provided a machine-readable storage medium having embodied thereon a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to execute a data processing method according to the present invention.
With the data processing apparatus and the data processing method according to the present invention, the DNN model can be trained using the i-vectors of the training audio data, and the i-vectors of registration audio data and test audio data need not be calculated in the subsequent registration and recognition processes, which greatly reduces the amount of computation. Further, clustering the i-vectors effectively classifies the training audio data, and defining the ground-truth output of the DNN model as the distance between an i-vector and each cluster center means that the model output reflects the difference between a given training audio data and each class of training audio data, thereby yielding richer label information. When such a DNN model is used to register and identify audio data, the amount of computation is greatly reduced, the registration and recognition processes are simplified, and richer label information is obtained.
The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Drawings
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 shows a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 2 shows a block diagram of a data processing apparatus according to another embodiment of the present invention;
fig. 3 shows a block diagram of a configuration of a registration unit of the data processing apparatus according to an embodiment of the present invention;
FIG. 4 shows a block diagram of a data processing apparatus according to a further embodiment of the invention;
FIG. 5 shows a block diagram of a test unit of a data processing apparatus according to an embodiment of the invention;
FIG. 6 shows a flow diagram of a data processing method according to an embodiment of the invention; and
fig. 7 is a block diagram of an exemplary structure of a general-purpose personal computer in which a data processing method according to the present invention can be implemented.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.
Detailed Description
Examples of the present invention will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
The following example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific units, devices, and methods to provide a thorough understanding of embodiments of the invention. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms, and that neither should be construed to limit the scope of the invention. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
A data processing device 100 according to the invention is described below with reference to fig. 1.
The data processing apparatus 100 according to the present invention includes an extraction unit 110, a division unit 120, a calculation unit 130, and a training unit 140.
According to an embodiment of the present invention, the extraction unit 110 may extract an i-vector from each of a plurality of training audio data. Here, the extraction unit 110 may acquire the training audio data from outside the data processing apparatus 100; since the training audio data are used to train the DNN model, their number may be determined according to actual needs. Next, the extraction unit 110 may extract the i-vector of each training audio data in any suitable manner. Further, the extraction unit 110 may transmit the i-vector of each training audio data to the dividing unit 120.
According to an embodiment of the present invention, the dividing unit 120 may divide the i-vector into a plurality of clusters and calculate a cluster center of each of the plurality of clusters. Here, the dividing unit 120 may acquire an i-vector of each of the plurality of training audio data from the extracting unit 110, and may divide all the i-vectors into a plurality of clusters according to a certain rule and calculate a cluster center of each cluster. Next, the dividing unit 120 may transmit the divided i-vectors and the cluster center positions to the calculating unit 130.
According to an embodiment of the present invention, the calculation unit 130 may calculate the distance between the i-vector of each training audio data and the cluster center of each cluster. Here, the calculation unit 130 may acquire the clustered i-vectors and the positions of the cluster centers from the dividing unit 120 and calculate the distance between each i-vector and each cluster center. That is, for the i-vector of any one training audio data, the distance between that i-vector and the cluster center of every cluster is calculated. Next, the calculation unit 130 may transmit the calculated distances to the training unit 140.
According to an embodiment of the invention, the training unit 140 may train a deep neural network (DNN) model. Here, the training unit 140 may acquire the distance between the i-vector of each training audio data and each cluster center from the calculation unit 130, and may take these distances as the ground-truth output of the DNN model. Thus, the output of the trained DNN model is the distance between the i-vector of a given audio data and the cluster center of each cluster.
It can be seen that, with the data processing apparatus 100 according to the present invention, the DNN model can be trained using the i-vectors of the training audio data, and the i-vectors of registration audio data and test audio data need not be calculated in the subsequent registration and recognition processes, which greatly reduces the amount of computation. Further, clustering the i-vectors effectively classifies the training audio data, and defining the ground-truth output of the DNN model as the distance between an i-vector and each cluster center means that the model output reflects the difference between a given training audio data and each class of training audio data, thereby yielding richer label information. When such a DNN model is used to register and identify audio data, the amount of computation is greatly reduced, the registration and recognition processes are simplified, and richer label information is obtained.
According to an embodiment of the present invention, the extraction unit 110 may extract an i-vector of training audio data according to any method known in the art. Preferably, the extraction unit 110 may extract an i-vector of the training audio data using the gaussian mixture model GMM.
According to an embodiment of the present invention, the dividing unit 120 may divide all i-vectors into a plurality of clusters according to a certain rule. For example, the dividing unit 120 may divide similar i-vectors into the same cluster; that is, the i-vectors may be clustered, with i-vectors of the same class placed in the same cluster. The i-vector of an audio data represents that audio data as a point in a vector space, and the distance between two i-vectors in that space represents the similarity between the audio data they represent: the closer two i-vectors are in space, the more similar the corresponding audio data.
According to an embodiment of the present invention, the dividing unit 120 may calculate the distance between every two i-vectors and divide the i-vectors into a plurality of clusters according to these pairwise distances. More specifically, the dividing unit 120 may divide the i-vectors into clusters such that the distance between any two i-vectors in the same cluster is smaller than a certain threshold.
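The cluster-division rule just described (every pair of i-vectors in a cluster closer than a threshold) can be sketched as follows. The greedy grouping strategy and the use of Euclidean distance are illustrative assumptions; the patent does not fix a particular clustering algorithm:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two i-vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_ivectors(ivectors, threshold):
    """Greedily assign each i-vector to the first cluster in which it is
    within `threshold` of every member; otherwise start a new cluster."""
    clusters = []
    for iv in ivectors:
        for cluster in clusters:
            if all(euclidean(iv, member) < threshold for member in cluster):
                cluster.append(iv)
                break
        else:
            clusters.append([iv])
    return clusters

# Two tight groups of toy 2-D "i-vectors" (real i-vectors are much higher-dimensional).
ivs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = cluster_ivectors(ivs, threshold=1.0)
print(len(clusters))  # 2
```

The threshold controls the granularity of the division: a smaller threshold yields more, tighter clusters.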
In this way, the dividing unit 120 may divide similar i-vectors into the same cluster. That is, the training audio data represented by the i-vectors in the same cluster are likely to be audio data of the same speaker, or of different speakers with similar timbre. Of course, the above is only an exemplary embodiment; the present invention may also implement the clustering of i-vectors in other ways.
According to an embodiment of the present invention, the dividing unit 120 may calculate a cluster center of each cluster according to a certain rule. For example, the dividing unit 120 may obtain a cluster center by averaging all i vectors in the same cluster, or the dividing unit 120 may calculate central points of all i vectors in the same cluster to obtain a cluster center, which is not limited in the present invention.
In this way, the dividing unit 120 may calculate a cluster center for each cluster. Here, the cluster center reflects the center position of the cluster to some extent. In other words, if the space is large enough, the cluster center may represent the position of the cluster in space.
As can be seen, according to an embodiment of the present invention, the dividing unit 120 may cluster the i-vectors and calculate the cluster center of each cluster. In effect, the dividing unit 120 clusters all the training audio data, grouping the training audio data of the same speaker, or of speakers with similar timbre, into one cluster, and calculates each cluster center position. Thus, the distance from any i-vector to each cluster center, as calculated by the calculation unit 130, actually reflects the difference between the training audio data represented by that i-vector and each class of training audio data: the closer an i-vector is to a cluster center, the higher the probability that the audio data it represents belongs to the speaker(s) of the audio data represented by the i-vectors in that cluster.
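The cluster-center and distance computations described above can be sketched as follows. Taking the cluster center as the element-wise mean of the member i-vectors and using Euclidean distance are assumptions for illustration (the patent allows averaging or other center definitions):

```python
import math

def centroid(vectors):
    """Cluster center as the element-wise mean of the member i-vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def distances_to_centers(iv, centers):
    """Ground-truth DNN output for one i-vector: its distance to every cluster center."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(iv, c))) for c in centers]

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(4.0, 4.0), (6.0, 4.0)]]
centers = [centroid(c) for c in clusters]
print(centers)                                     # [(1.0, 0.0), (5.0, 4.0)]
print(distances_to_centers((1.0, 0.0), centers))   # [0.0, ~5.657]
```

The vector returned by `distances_to_centers` is exactly what the training unit later uses as the ground-truth output for one training audio data.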
According to an embodiment of the present invention, the extraction unit 110 may further extract a supervector from each of the plurality of training audio data. As noted above, in the process of extracting an i-vector, the i-vector of the audio data is obtained by first extracting the supervector of the audio data and then performing eight matrix operations and one matrix inversion on the supervector. Therefore, the extraction unit 110 may extract the supervector of each training audio data and then calculate the i-vector of each training audio data from its supervector.
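For orientation, the matrix operations and inversion mentioned above correspond, in the conventional total-variability formulation (which the patent does not spell out; the symbols here are the standard ones), to evaluating the i-vector posterior mean:

```latex
% s: utterance GMM supervector, m: UBM mean supervector,
% T: total-variability matrix, w: i-vector,
% N, F: zeroth- and first-order Baum-Welch statistics, \Sigma: UBM covariance.
s = m + T w, \qquad
w = \left( I + T^{\top} \Sigma^{-1} N T \right)^{-1} T^{\top} \Sigma^{-1} F
```

The matrix inversion in the expression for w is the dominant cost that the proposed DNN model avoids at registration and test time.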
According to an embodiment of the present invention, the training unit 140 may also use the supervector of each training audio data as the input feature of the DNN model. Here, the training unit 140 may acquire the supervector of each training audio data from the extraction unit 110, and may train the DNN model using the supervector of each training audio data as the input feature and the distance between the i-vector of each training audio data and each cluster center as the ground-truth output.
According to embodiments of the present invention, the training unit 140 may employ various algorithms to train the DNN model. For example, the DNN model may be trained using the back-propagation algorithm.
It can be seen that, according to the embodiment of the present invention, the input of the DNN model trained by the training unit 140 is the supervector of a given audio data, and the output is the distance between the i-vector of that audio data and each cluster center. Thus, i-vectors need to be computed only for the training audio data used to train the DNN model. After the DNN model is trained, the distance between the i-vector of any audio data and each cluster center can be obtained simply by computing the supervector of that audio data and feeding it to the DNN model, without computing the i-vector itself. Since calculating the i-vector from the supervector is very complicated and computationally expensive, computing only the supervector greatly reduces the amount of computation. Furthermore, such a DNN model may be applied to audio data registration and recognition, thereby saving registration and recognition time. Meanwhile, the output of the DNN model reflects the difference between the audio data and each class of audio data, i.e., the content of the output label is enriched.
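A minimal sketch of this training setup: a toy one-hidden-layer network is trained by back-propagation to map supervectors to distance vectors. All dimensions, the random toy data, and the hyperparameters are illustrative assumptions, not the patent's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): supervector -> distances to K cluster centers.
D_IN, D_HID, K, N = 8, 16, 3, 64

X = rng.normal(size=(N, D_IN))        # stand-in supervectors (input features)
Y = np.abs(rng.normal(size=(N, K)))   # stand-in ground-truth distances (non-negative)

W1 = rng.normal(scale=0.1, size=(D_IN, D_HID)); b1 = np.zeros(D_HID)
W2 = rng.normal(scale=0.1, size=(D_HID, K));    b2 = np.zeros(K)

def forward(X):
    """One hidden tanh layer, linear output (predicted distances)."""
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, out0 = forward(X)
loss0 = float(np.mean((out0 - Y) ** 2))

lr = 0.5
for _ in range(300):
    H, out = forward(X)
    grad_out = 2 * (out - Y) / (N * K)            # d(MSE)/d(output)
    grad_h = (grad_out @ W2.T) * (1 - H ** 2)     # back-propagate through tanh
    W2 -= lr * (H.T @ grad_out)
    b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * (X.T @ grad_h)
    b1 -= lr * grad_h.sum(axis=0)

_, out1 = forward(X)
loss1 = float(np.mean((out1 - Y) ** 2))
print(f"MSE before/after training: {loss0:.3f} -> {loss1:.3f}")
```

At deployment, only `forward` on a new supervector is needed; no i-vector extraction is performed, which is the computational saving the patent claims.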
The data processing apparatus 100 as described above may be used as an apparatus for training a DNN model. Further, the data processing apparatus 100 may also register the registered audio data using the trained DNN model, and such data processing apparatus 100 will be described in detail below.
Fig. 2 shows a block diagram of a data processing device according to another embodiment of the present invention. As shown in fig. 2, the data processing apparatus 100 includes an extraction unit 110, a division unit 120, a calculation unit 130, a training unit 140, and a registration unit 150. The extracting unit 110, the dividing unit 120, the calculating unit 130 and the training unit 140 have been described in detail in the foregoing, and are not described herein again.
According to an embodiment of the present invention, the registration unit 150 may perform a registration process for each of a plurality of registration audio data. Here, the registration unit 150 may acquire a plurality of registration audio data from outside the data processing apparatus 100 and perform a registration process on each of them. The registration audio data are audio data that need to be registered in an audio database for subsequent audio recognition; that is, the speaker of each registration audio data is known. According to an embodiment of the present invention, the registration process may include registering the registration audio data in the audio database. The audio database may be located in the registration unit 150, in another unit of the data processing apparatus 100, or in an apparatus or unit outside the data processing apparatus 100; the present invention is not limited in this respect.
Fig. 3 shows a block diagram of a configuration of a registration unit of the data processing apparatus according to an embodiment of the present invention. As shown in fig. 3, the registration unit 150 may include a supervector determination unit 151, a distance determination unit 152, and a parameter determination unit 153.
According to an embodiment of the present invention, the supervector determination unit 151 may extract the supervector of the registered audio data. Here, the supervector determination unit 151 may extract the supervector in a manner similar to the extraction unit 110, for example, using the GMM model. Further, the supervector determination unit 151 may transmit the extracted supervector to the distance determination unit 152.
According to an embodiment of the present invention, the distance determining unit 152 may determine a distance between an i-vector of the registered audio data and a cluster center of each cluster according to the DNN model. Here, the distance determination unit 152 may acquire the supervectors of the registered audio data from the supervector determination unit 151, and may acquire the trained DNN model from the training unit 140. Further, the distance determining unit 152 may input the super vector of the registered audio data to a DNN model, the output of which is the distance between the i-vector of the registered audio data and the cluster center of each cluster. Further, the distance determining unit 152 may transmit the distance between the i-vector of the registered audio data and the cluster center of each cluster to the parameter determining unit 153.
According to an embodiment of the present invention, the parameter determination unit 153 may determine the parameters of the registered audio data to be stored in the audio database according to the distance between the i-vector of the registered audio data and the cluster center of each cluster.
According to an embodiment of the present invention, a plurality of registered audio data may be stored in the audio database, and a record of each registered audio data may include speaker information of the registered audio data and parameters of the registered audio data.
According to an embodiment of the present invention, the parameter determination unit 153 may take the distances between the i-vector of the registered audio data and the cluster centers of all clusters, together with the cluster center corresponding to each distance, as the parameters of the registered audio data. Alternatively, to further reduce the amount of computation, the parameter determination unit 153 may select M distances (M being an integer greater than or equal to 2) from the distances between the i-vector of the registered audio data and the cluster centers, and take the selected distances and their corresponding cluster centers as the parameters of the registered audio data.
According to an embodiment of the present invention, the parameter determination unit 153 may select the M smallest distances (M being an integer greater than or equal to 2) from the distances between the i-vector of the registered audio data and the cluster center of each cluster, and take the selected distances and their corresponding cluster centers as the parameters of the registered audio data.
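The selection of the M smallest distances can be sketched as follows. The record layout (cluster-center index, distance, sequence) mirrors Table 1 below, and the sample DNN output values are hypothetical:

```python
def registration_parameters(distances, m=3):
    """Keep the m smallest DNN-output distances and their cluster-center
    indices as the parameters stored for one registered audio data."""
    ranked = sorted(enumerate(distances), key=lambda pair: pair[1])
    return [{"cluster_center": idx, "distance": dist, "sequence": rank + 1}
            for rank, (idx, dist) in enumerate(ranked[:m])]

# Hypothetical DNN output: distances from one registered audio data's
# i-vector to five cluster centers (indices 0..4).
params = registration_parameters([4.2, 1.1, 3.5, 0.7, 2.9], m=3)
print(params)
# sequence 1 -> center 3 (d=0.7), sequence 2 -> center 1 (d=1.1), sequence 3 -> center 4 (d=2.9)
```

Storing only the m nearest centers bounds the record size regardless of how many clusters the training data produced.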
The following table shows a record of one piece of registered audio data stored in the audio database when M is 3.
TABLE 1
Speaker: A
Cluster center:   a     b     c
Distance:         d1    d2    d3
Sequence:         1     2     3
In the example shown in the above table, the speaker of the registered audio data is A, and the i-vector of the registered audio data has distances d1, d2, and d3 from cluster centers a, b, and c, respectively; that is, cluster center a corresponds to distance value d1, cluster center b to d2, and cluster center c to d3. Here d1 < d2 < d3, and d1, d2, and d3 are each smaller than the distance between the i-vector of the registered audio data and any other cluster center. The sequence of cluster center a is 1, that of cluster center b is 2, and that of cluster center c is 3, where a smaller sequence number means a smaller corresponding distance value. It is noted that Table 1 shows only an exemplary storage format; the audio database may store each record in other ways. Further, a plurality of such records may be stored in the audio database, and M may be 2 or an integer greater than 3.
The data processing apparatus 100 as described above can be used as an apparatus for registering audio data. In such an embodiment, the data processing apparatus 100 may register the registration audio data of any known speaker using the trained DNN model. In this process, the data processing apparatus 100 need not calculate the i-vector of the registered audio data; it need only calculate the supervector, which greatly reduces the amount of computation. In addition, the record of the registered audio data includes its distances to the various cluster centers, i.e., the degree of similarity between the registered audio data and each class of training audio data, so the registered information is richer.
Embodiments of the data processing apparatus 100 as an apparatus for training a DNN model and as an apparatus for registering audio data are described above in detail. Further, the data processing apparatus 100 may also test the test audio data using the trained DNN model, and such data processing apparatus 100 will be described in detail below.
Fig. 4 shows a block diagram of a data processing device 100 according to a further embodiment of the invention. As shown in fig. 4, the data processing apparatus 100 may include an extraction unit 110, a division unit 120, a calculation unit 130, a training unit 140, a registration unit 150, and a test unit 160. The extracting unit 110, the dividing unit 120, the calculating unit 130, the training unit 140, and the registering unit 150 have been described in detail in the foregoing, and are not described herein again.
According to an embodiment of the present invention, the test unit 160 may perform a test for each of the plurality of test audio data. Here, the test unit 160 may acquire a plurality of test audio data from the outside of the data processing apparatus 100. The test audio data may be audio data of a speaker to be identified. That is, the speaker of the test audio data is unknown. The testing process performed by the testing unit 160 on the test audio data may include identifying a speaker of the test audio data.
As described in the foregoing, in the present invention there are three kinds of audio data: training audio data, registered audio data, and test audio data. These three kinds of audio data do not differ in content; all are audio data produced by a speaker, and they differ only in purpose. The training audio data are used for training the DNN model, for which both the supervector and the i-vector of each training audio data need to be extracted. The registered audio data are the audio data of known speakers to be registered in the audio database, for which only the supervector needs to be extracted in order to obtain the parameters of the registered audio data. The test audio data are the audio data of unknown speakers whose identity is to be determined, for which only the supervector needs to be extracted, and the speaker of the test audio data can then be identified from the registered audio data stored in the audio database.
Fig. 5 shows a block diagram of the test unit 160 of the data processing device according to an embodiment of the present invention. As shown in fig. 5, the test unit 160 may include a supervector determination unit 161, a distance determination unit 162, a parameter determination unit 163, and a speaker determination unit 164.
According to an embodiment of the present invention, the supervector determination unit 161 may extract a supervector of the test audio data. Here, the supervector determination unit 161 may extract a supervector of the test audio data according to a similar manner to the extraction unit 110 and the supervector determination unit 151, and for example, may extract a supervector of the test audio data using a GMM model. Further, the super vector determination unit 161 may transmit the extracted super vector to the distance determination unit 162.
According to an embodiment of the present invention, the distance determining unit 162 may determine a distance between the i-vector of the test audio data and the cluster center of each cluster according to the DNN model. Here, the distance determination unit 162 may acquire the supervectors of the test audio data from the supervector determination unit 161, and may acquire the trained DNN model from the training unit 140. Further, the distance determining unit 162 may input the supervector of the test audio data to the DNN model, and an output of the DNN model is a distance between the i-vector of the test audio data and a cluster center of each cluster. Further, the distance determining unit 162 may transmit the distance between the i-vector of the test audio data and the cluster center of each cluster to the parameter determining unit 163.
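The inference step performed by the distance determining unit 162 can be illustrated with a stand-in network. Everything below (layer sizes, random weights, the dimensions D and K) is invented for illustration; in the apparatus the weights would come from the training unit 140, trained so that the outputs approximate the true i-vector-to-cluster-center distances.

```python
import numpy as np

# Hypothetical dimensions: supervector length D, number of clusters K.
D, K = 12, 4

rng = np.random.default_rng(0)

# Stand-in for a trained DNN: one hidden layer, ReLU, linear output.
W1, b1 = rng.normal(size=(16, D)), np.zeros(16)
W2, b2 = rng.normal(size=(K, 16)), np.zeros(K)

def predict_distances(supervector):
    """Map one supervector to K predicted distances (one per cluster center)."""
    h = np.maximum(0.0, W1 @ supervector + b1)  # hidden layer with ReLU
    return W2 @ h + b2                          # one output per cluster center

supervector = rng.normal(size=D)
distances = predict_distances(supervector)
print(distances.shape)  # one predicted distance per cluster center: (4,)
```

The point of the sketch is only the input/output shape: a single supervector in, K distances out, with no i-vector extraction at test time.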
According to an embodiment of the present invention, the parameter determining unit 163 may determine the parameter of the test audio data according to a distance between the i-vector of the test audio data and the cluster center of each cluster. Further, the parameter determination unit 163 may transmit the parameters of the test audio data to the speaker determination unit 164.
According to an embodiment of the present invention, the parameter determination unit 163 may take the distances between the i-vector of the test audio data and the cluster centers of all clusters, together with the cluster center corresponding to each distance, as the parameters of the test audio data. Alternatively, in order to further reduce the amount of calculation, the parameter determination unit 163 may select a plurality of distances (for example, N distances, N being an integer greater than or equal to 2) from the distances between the i-vector of the test audio data and the cluster center of each cluster, and take the selected distances and their corresponding cluster centers as the parameters of the test audio data.
According to an embodiment of the present invention, the parameter determination unit 163 may select the smallest N distances (N being an integer greater than or equal to 2) from the distances between the i-vector of the test audio data and the cluster center of each cluster, and take the selected distances and their corresponding cluster centers as the parameters of the test audio data.
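The selection of the smallest N distances can be sketched as follows. The distance values and cluster-center labels are hypothetical; the 1-based position in the sorted list corresponds to the "ordering" stored with each parameter in Tables 1 and 2.

```python
# Hypothetical DNN outputs: distance from the test i-vector to each cluster center.
distances = {"a": 0.21, "b": 0.35, "c": 0.90, "d": 0.52}

N = 3  # number of smallest distances kept as the parameters of the audio data

# Sort cluster centers by distance and keep the N closest; the 1-based rank in
# the sorted list is the "ordering" associated with each cluster center.
params = [
    {"center": center, "distance": dist, "order": rank}
    for rank, (center, dist) in enumerate(
        sorted(distances.items(), key=lambda kv: kv[1])[:N], start=1)
]
print(params)  # centers a, b, d with orderings 1, 2, 3
```

With these invented values the result reproduces the shape of Table 2: the three closest centers a, b, and d, each paired with its distance and its ordering.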
Preferably, N = M. That is, the number of distances selected by the parameter determination unit 163 may be equal to the number of distances selected by the parameter determination unit 153.
The following table shows a set of parameters for one test audio data when N = 3.
TABLE 2
Cluster center:  a    b    d
Distance:        d4   d5   d6
Ranking:         1    2    3
In the example shown in the above table, the i-vector of the test audio data has distances d4, d5, and d6 from cluster centers a, b, and d, respectively; that is, cluster center a corresponds to distance value d4, cluster center b corresponds to distance value d5, and cluster center d corresponds to distance value d6, where d4 < d5 < d6 and d4, d5, and d6 are each smaller than the distance between the i-vector of the test audio data and any other cluster center. The ranking of cluster center a is 1, the ranking of cluster center b is 2, and the ranking of cluster center d is 3, where a smaller ranking means a smaller corresponding distance value.
According to an embodiment of the invention, speaker determination unit 164 may determine the speaker of the test audio data from the parameters of the test audio data and the audio database. Here, the speaker determining unit 164 may acquire the parameters of the test audio data from the parameter determining unit 163, and may acquire records of all the registered audio data from the audio database, thereby determining the speaker of the test audio data.
According to an embodiment of the invention, speaker determination unit 164 may perform the following operations: comparing the parameter of the test audio data with the parameter of each of a plurality of registered audio data stored in an audio database; determining the similarity between the test audio data and each registered audio data stored in the audio database according to the comparison result; and determining a speaker of the registered audio data having the highest similarity with the test audio data as a speaker of the test audio data.
According to an embodiment of the present invention, speaker determination unit 164 may determine parameters of all registered audio data in the audio database and compare with parameters of the test audio data, respectively. According to an embodiment of the present invention, speaker determination unit 164 may determine a similarity score of the test audio data and each registered audio data stored in the audio database, and may determine the registered audio data having the highest similarity score. Further, the speaker determination unit 164 may determine speaker information of the registered audio data having the highest similarity score from the audio database, thereby determining the speaker as the speaker of the test audio data.
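The comparison against every record in the audio database reduces to taking the enrollee with the highest similarity score. A minimal sketch follows, with a toy similarity function (a count of shared cluster centers) standing in for the weighted score the unit actually computes, and an invented two-record database:

```python
def identify_speaker(test_params, database, similarity_fn):
    """Compare the test parameters with every enrollment record and return
    the speaker of the record with the highest similarity score."""
    best_speaker, best_score = None, float("-inf")
    for speaker, enroll_params in database.items():
        s = similarity_fn(test_params, enroll_params)
        if s > best_score:
            best_speaker, best_score = speaker, s
    return best_speaker, best_score

# Toy similarity: number of shared cluster centers (stands in for the
# weighted cluster-center score; purely illustrative).
toy_sim = lambda t, e: len(set(t) & set(e))

database = {"A": ["a", "b", "c"], "B": ["d", "e", "f"]}
print(identify_speaker(["a", "b", "d"], database, toy_sim))  # ('A', 2)
```

Only the argmax structure is the point here; the actual similarity function is the weighted sum over matched cluster centers described in the text.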
According to an embodiment of the invention, speaker determination unit 164 may determine the similarity score of the test audio data and the enrollment audio data according to the following operations: comparing each cluster center in the parameters of the test audio data with each cluster center in the parameters of the registration audio data; when one cluster center in the parameters of the test audio data is matched with one cluster center in the parameters of the registration audio data, determining the score and the weight of the cluster center; and determining a weighted sum of the scores of all the matched cluster centers as a similarity score of the test audio data and the registration audio data. Here, if any one of the cluster centers in the parameters of the test audio data does not match any one of the cluster centers in the parameters of the registered audio data, it is determined that the similarity score of the test audio data and the registered audio data is 0.
According to an embodiment of the invention, speaker determination unit 164 may determine the score for each cluster center according to the following operations: calculating the absolute value of the difference value between the distance corresponding to the cluster center in the parameters of the registered audio data and the distance corresponding to the cluster center in the parameters of the tested audio data; and determining the score of the cluster center according to the absolute value of the difference. Specifically, speaker determination unit 164 may determine the score of the cluster center such that the smaller the absolute value of the difference in distance, the higher the score of the cluster center.
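The interval-based scoring mentioned in the worked example can be sketched as follows. The interval boundaries and score values are assumptions; the text only requires that a smaller absolute difference yield a higher score.

```python
def center_score(d_enroll, d_test,
                 intervals=((0.05, 10), (0.10, 6), (0.20, 3))):
    """Score a matched cluster center from |d_enroll - d_test|.

    The (upper_bound, score) pairs are hypothetical; any monotone mapping
    where a smaller absolute difference gives a higher score would do.
    """
    diff = abs(d_enroll - d_test)
    for upper, score in intervals:
        if diff <= upper:
            return score
    return 0  # difference too large: this center contributes nothing

print(center_score(0.30, 0.33))  # difference 0.03 falls in the first interval
print(center_score(0.30, 0.45))  # difference 0.15 falls in the third interval
```

A table of intervals like this is one simple way to realize "the smaller the absolute value of the difference, the higher the score".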
According to an embodiment of the invention, speaker determination unit 164 may determine the weight for each cluster center according to the following operations: determining an ordering of the cluster center in the parameters of the registered audio data and an ordering in the parameters of the tested audio data; and determining the weight of the cluster center according to the two sorts. Specifically, the speaker determination unit 164 may determine the weight of the ranking of the cluster center in the parameters of the enrollment audio data, and may determine the weight of the ranking of the cluster center in the parameters of the test audio data; and the product of the two weights is taken as the weight of the cluster center.
According to an embodiment of the present invention, speaker determination unit 164 may set a weight of the ranking of the cluster centers in the parameters of the test audio data such that the smaller the ranking value, the larger the weight. Further, the speaker determination unit 164 may set the weight of the ranking of the cluster center in the parameters of the registered audio data such that the smaller the ranking value, the larger the weight.
As described above, speaker determination unit 164 may determine the similarity score S_i of the test audio data and the ith (1 ≤ i ≤ Q, where Q is the number of registered audio data in the audio database) registered audio data in the audio database as:
S_i = Σ_{j=1}^{P} w_j × c_j
where j denotes the index of a matched cluster center, P (1 ≤ P ≤ N) is the number of matched cluster centers, w_j denotes the weight of the jth matched cluster center, and c_j denotes the score of the jth matched cluster center.
Further, the speaker determining unit 164 may determine the weight w_j of the jth matched cluster center as:
w_j = w_{j1} × w_{j2}
where w_{j1} denotes the weight of the ranking of the jth matched cluster center in the parameters of the registered audio data, and w_{j2} denotes the weight of the ranking of the jth matched cluster center in the parameters of the test audio data.
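The weighted-sum similarity can be sketched as follows. The rank weights 5, 3, and 2 mirror the values assumed in the worked example; the distance values and the linear score function are invented for illustration.

```python
# Hypothetical rank-to-weight table (rank 1 is the closest cluster center):
# ranks 1, 2, 3 weigh 5, 3, 2, as assumed in the worked example.
RANK_WEIGHT = {1: 5, 2: 3, 3: 2}

def similarity(test_params, enroll_params, score_fn):
    """S_i = sum over matched cluster centers j of w_j * c_j,
    with w_j = w_j1 * w_j2 (product of the two rank weights)."""
    total = 0.0
    for center, (d_test, rank_test) in test_params.items():
        if center in enroll_params:                 # matched cluster center
            d_enr, rank_enr = enroll_params[center]
            w = RANK_WEIGHT[rank_enr] * RANK_WEIGHT[rank_test]
            total += w * score_fn(d_enr, d_test)
    return total

# Parameters as {center: (distance, rank)}; distance values are illustrative.
enroll = {"a": (0.20, 1), "b": (0.40, 2), "c": (0.70, 3)}   # cf. Table 1
test   = {"a": (0.22, 1), "b": (0.45, 2), "d": (0.60, 3)}   # cf. Table 2

# A toy score function: 10 minus the scaled absolute distance difference.
score = similarity(test, enroll, lambda de, dt: 10 - 10 * abs(de - dt))
print(score)
```

With these values, centers a and b match (weights 25 and 9), center d does not, and the result is 25 × 9.8 + 9 × 9.5 = 330.5, matching the w_1 × c_1 + w_2 × c_2 form of the worked example.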
How to calculate the similarity score will be specifically described below taking the parameters of the registered audio data shown in table 1 and the parameters of the test audio data shown in table 2 as examples.
For the parameters of the test audio data shown in Table 2, the parameters of the registered audio data shown in Table 1 are first searched for a match with the cluster center ranked 1, whereby it is determined that cluster center a ranked 1 in the test audio data matches cluster center a ranked 1 in the registered audio data. Thus, cluster center a is the 1st matched cluster center, i.e., j = 1 in the two formulas above. Next, the weight and score of cluster center a are calculated. It is assumed here that, in the parameters of the registered audio data, the cluster centers ranked 1, 2, and 3 have weights 5, 3, and 2, respectively; likewise, in the parameters of the test audio data, the cluster centers ranked 1, 2, and 3 have weights 5, 3, and 2, respectively. Since cluster center a is ranked 1 in the parameters of both the registered audio data and the test audio data, it can be determined that w_{11} = w_{12} = 5, and hence w_1 = 25. Next, the distance value corresponding to cluster center a is d1 in the registered audio data and d4 in the test audio data; the absolute value of the difference between d1 and d4 is calculated, and the value of c_1 is determined from this absolute value. Here, the absolute value of the distance difference may be divided into a plurality of intervals, with a different score assigned to each interval. In this way, the score c_1 and the weight w_1 of cluster center a are determined. In a similar manner, the score and weight of the 2nd matched cluster center b (i.e., j = 2) may be determined. Since cluster center b is ranked 2 in the parameters of both the registered audio data and the test audio data, it can be determined that w_{21} = w_{22} = 3, and hence w_2 = 9.
Next, the distance value corresponding to cluster center b is d2 in the registered audio data and d5 in the test audio data; the absolute value of the difference between d2 and d5 is calculated, and the value of c_2 is determined from this absolute value. In this way, the score c_2 and the weight w_2 of cluster center b are determined. Since the cluster center ranked 3 in the test audio data does not match any cluster center in the registered audio data, it is determined that P = 2. The similarity score of the registered audio data described in Table 1 and the test audio data shown in Table 2 may then be determined as w_1 × c_1 + w_2 × c_2. Speaker determination unit 164 may calculate, in the manner described above, a similarity score between each registered audio data and the test audio data, and select the registered audio data with the highest score. Assuming that the registered audio data having the highest similarity score with the test audio data shown in Table 2 is the registered audio data shown in Table 1, the speaker determination unit 164 may determine that the speaker of the test audio data shown in Table 2 is speaker A.
Although how to calculate the similarity score of the enrollment audio data and the test audio data is described in detail in the foregoing by way of example, this is not limitative, and other embodiments may of course be employed to calculate the similarity score of the enrollment audio data and the test audio data.
According to an embodiment of the present invention, the test unit 160 in the data processing apparatus 100 may also set a threshold for the distance between the i-vector of the test audio data and each cluster center. When any one of the distances from the i-vector of the test audio data to the cluster centers determined by the distance determination unit 162 is smaller than this threshold, the above-described procedure may be performed, that is, the parameter determination unit 163 determines the parameters of the test audio data and the speaker determination unit 164 determines the speaker of the test audio data. Conversely, when every distance from the i-vector of the test audio data to the cluster centers determined by the distance determination unit 162 is greater than the threshold, the test audio data is not similar to any type of training audio data, which indicates that the trained DNN model may not be suitable for the test audio data, i.e., identification of the test audio data using such a DNN model may not be accurate. In such a case, it may be necessary to increase the amount of training audio data, for example by adding training audio data associated with or similar to the test audio data, and to retrain the DNN model.
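This gating step can be sketched in a few lines; the threshold value and distance values below are assumptions, since the text does not fix them.

```python
def usable_for_identification(predicted_distances, threshold):
    """Proceed with identification only if the test i-vector is close enough
    to at least one cluster center; otherwise the DNN model may need to be
    retrained with additional training audio data. The threshold value is
    application-dependent and assumed here."""
    return min(predicted_distances) < threshold

print(usable_for_identification([0.9, 0.4, 0.7], threshold=0.5))  # True
print(usable_for_identification([0.9, 0.8, 0.7], threshold=0.5))  # False
```

Only the minimum distance matters: one sufficiently close cluster center means the test audio resembles at least one type of training audio data.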
The data processing device 100, which can test audio data, is described in detail above. In such an embodiment, the data processing apparatus 100 may test the test audio data of any one unknown speaker using the trained DNN model, i.e. identify the speaker of the test audio data. In this process, the data processing apparatus 100 need not calculate the i-vector of the test audio data, but only the super-vector of the test audio data, which can greatly reduce the amount of calculation. In addition, the distance between the test audio data and different cluster centers, namely the similarity degree of the test audio data and each type of training audio data, can be determined, so that the output label information is richer.
The data processing apparatus 100 according to the embodiment of the present invention is described above in detail. Next, a data processing method according to an embodiment of the present invention will be described in detail.
Fig. 6 is a flowchart of a data processing method according to an embodiment of the present invention.
As shown in fig. 6, in step S610, an i-vector of training audio data is extracted from each of a plurality of training audio data.
Next, in step S620, the i-vector is divided into a plurality of clusters, and a cluster center of each of the plurality of clusters is calculated.
Next, in step S630, the distance between the i-vector of each training audio data and the cluster center of each cluster is calculated.
Next, in step S640, a deep neural network DNN model is trained.
Wherein training the DNN model comprises: the distance between the i-vector of each training audio data and the cluster center of each cluster is taken as the true output value of the DNN model.
Preferably, extracting the i-vector of the training audio data comprises: and extracting an i vector of the training audio data by using a Gaussian Mixture Model (GMM).
Preferably, dividing the i-vector into a plurality of clusters comprises: calculating the distance between every two i vectors; and dividing the i-vectors into a plurality of clusters according to the distance between every two i-vectors.
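The two steps above (pairwise distances, then division into clusters) admit many clustering algorithms; the patent does not fix one. The sketch below shows one simple instance: merging i-vectors whose pairwise Euclidean distance falls below a threshold, with the cluster center taken as the mean of the members. The vectors and the threshold are invented for illustration.

```python
import numpy as np

def cluster_by_pairwise_distance(vectors, threshold):
    """Group i-vectors whose pairwise distance is below a threshold
    (one possible realization of 'divide by pairwise distance')."""
    n = len(vectors)
    parent = list(range(n))            # union-find over vector indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(vectors[i] - vectors[j]) < threshold:
                parent[find(i)] = find(j)      # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Cluster center = mean of the member i-vectors.
    centers = [np.mean([vectors[i] for i in members], axis=0)
               for members in clusters.values()]
    return list(clusters.values()), centers

vecs = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
        np.array([5.0, 5.0]), np.array([5.1, 5.0])]
members, centers = cluster_by_pairwise_distance(vecs, threshold=1.0)
print(members)  # two clusters: [[0, 1], [2, 3]]
```

These centers are what the calculation unit then measures each training i-vector against to produce the DNN training targets.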
Preferably, training the DNN model further comprises: extracting a supervector for each of a plurality of training audio data; and using the supervectors of each training audio data as input features of the DNN model.
Preferably, the data processing method further includes performing registration for each of the plurality of registration audio data, including: extracting a super vector of the registered audio data; determining the distance between the i vector of the registered audio data and the cluster center of each cluster according to the DNN model; and determining parameters of the registered audio data to be stored in the audio database according to the distance between the i-vector of the registered audio data and the cluster center of each cluster.
Preferably, determining the parameters of the registered audio data comprises: selecting a plurality of distances from the distances between the i-vector of the registered audio data and the cluster center of each cluster; and using the selected distance and the cluster center corresponding to the distance as parameters of the registered audio data.
Preferably, the data processing method further comprises performing a test for each of the plurality of test audio data, including: extracting a supervector of the test audio data; determining the distance between the i vector of the test audio data and the cluster center of each cluster according to the DNN model; determining parameters of the test audio data according to the distance between the i vector of the test audio data and the cluster center of each cluster; and determining the speaker of the test audio data according to the parameters of the test audio data and the audio database.
Preferably, determining the parameters of the test audio data comprises: selecting a plurality of distances from the distances between the i-vector of the test audio data and the cluster center of each cluster; and taking the selected distance and the cluster center corresponding to the distance as parameters of the test audio data.
Preferably, determining the speaker of the test audio data comprises: comparing the parameter of the test audio data with the parameter of each of a plurality of registered audio data stored in an audio database; determining the similarity between the test audio data and each registered audio data stored in the audio database according to the comparison result; and determining a speaker of the registered audio data having the highest similarity with the test audio data as a speaker of the test audio data.
Preferably, training the DNN model further comprises: the DNN model is trained using a backward algorithm.
The data processing method described above can be implemented by the data processing apparatus 100 according to the embodiment of the present invention, and therefore, various embodiments of the data processing apparatus 100 described above are applicable thereto, and will not be described repeatedly herein.
It can be seen that, with the data processing apparatus and the data processing method according to the present invention, the DNN model can be trained such that the input of the trained DNN model is the supervector of certain audio data and the output is the distances between the i-vector of that audio data and the cluster centers. In this way, in the subsequent registration process, the data processing apparatus 100 can register the registered audio data of any known speaker using the trained DNN model. In this process, the data processing apparatus 100 need not calculate the i-vector of the registered audio data and need only calculate its supervector, which can greatly reduce the amount of calculation. In addition, the record of the registered audio data includes the distances from the registered audio data to different cluster centers, i.e., the degree of similarity between the registered audio data and each type of training audio data, so that the registration information is richer. Further, in the subsequent testing process, the data processing apparatus 100 may use the trained DNN model to test the test audio data of any unknown speaker, i.e., to identify the speaker of the test audio data. In this process, the data processing apparatus 100 need not calculate the i-vector of the test audio data and need only calculate its supervector, which can greatly reduce the amount of calculation. In addition, the distances from the test audio data to different cluster centers, i.e., the degree of similarity between the test audio data and each type of training audio data, can be determined, so that the output label information is richer.
It is apparent that the respective operational procedures of the data processing method according to the present invention can be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present invention can also be achieved by: a storage medium storing the above executable program code is directly or indirectly supplied to a system or an apparatus, and a computer or a Central Processing Unit (CPU) in the system or the apparatus reads out and executes the program code. At this time, as long as the system or the apparatus has a function of executing a program, the embodiment of the present invention is not limited to the program, and the program may be in any form, for example, an object program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, magnetic disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information, etc.
In addition, the technical solution of the present invention can also be realized by a computer connecting to a corresponding website on the internet, downloading and installing the computer program code according to the present invention into the computer and then executing the program.
Fig. 7 is a block diagram of an exemplary structure of a general-purpose personal computer in which a data processing method according to the present invention can be implemented.
As shown in fig. 7, the CPU 701 executes various processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 to a Random Access Memory (RAM) 703. In the RAM 703, data necessary when the CPU 701 executes various processes and the like is also stored as necessary. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706 (including a keyboard, a mouse, and the like), an output section 707 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like), a storage section 708 (including a hard disk and the like), a communication section 709 (including a network interface card such as a LAN card, a modem, and the like). The communication section 709 performs communication processing via a network such as the internet. A driver 710 may also be connected to the input/output interface 705, as desired. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that the computer program read out therefrom is mounted in the storage section 708 as necessary.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 711.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 711 shown in fig. 7 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc-read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a mini-disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage section 708, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.
In the system and method of the present invention, it is apparent that each unit or each step can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are only for illustrating the present invention and do not constitute a limitation to the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made in the above-described embodiments without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is to be defined only by the claims appended hereto, and by their equivalents.
With respect to the embodiments including the above embodiments, the following remarks are also disclosed:
supplementary note 1. a data processing apparatus comprising:
an extraction unit configured to extract an i-vector of training audio data from each of a plurality of training audio data;
a dividing unit, configured to divide the i-vector into a plurality of clusters, and calculate a cluster center of each of the plurality of clusters;
the computing unit is used for computing the distance between the i vector of each training audio data and the cluster center of each cluster; and
a training unit for training a deep neural network DNN model,
wherein the training unit takes a distance between the i-vector of each training audio data and a cluster center of each cluster as an output true value of the DNN model.
Supplementary note 2. the data processing apparatus according to supplementary note 1, wherein the extraction unit extracts an i-vector of the training audio data using a gaussian mixture model GMM.
Note 3 the data processing apparatus according to note 1, wherein the dividing unit calculates a distance between every two i vectors, and divides the i vectors into a plurality of clusters according to the distance between every two i vectors.
Note 4. the data processing apparatus according to note 1, wherein the extraction unit is further configured to extract a supervector for each of the plurality of training audio data, and the training unit is further configured to use the supervector for each of the plurality of training audio data as an input feature of the DNN model.
Note 5 the data processing apparatus according to note 1, wherein the data processing apparatus further includes a registration unit operable to perform registration for each of a plurality of registered audio data, the registration unit including:
a first super-vector determining unit configured to extract a super-vector of the registered audio data;
a first distance determining unit configured to determine a distance between an i-vector of the registered audio data and a cluster center of each cluster according to the DNN model; and
and the first parameter determining unit is used for determining the parameters of the registered audio data to be stored in an audio database according to the distance between the i vector of the registered audio data and the cluster center of each cluster.
Supplementary note 6 the data processing apparatus according to supplementary note 5, wherein the first parameter determination unit selects a plurality of distances from distances between an i-vector of the registered audio data and a cluster center of each cluster, and takes the selected distances and the cluster centers corresponding to the distances as parameters of the registered audio data.
Supplementary note 7 the data processing apparatus according to supplementary note 5, wherein the data processing apparatus further comprises a test unit for performing a test for each of a plurality of test audio data, the test unit comprising:
a second supervector determination unit for extracting a supervector of the test audio data;
a second distance determining unit, configured to determine, according to the DNN model, a distance between an i-vector of the test audio data and a cluster center of each cluster;
a second parameter determining unit, configured to determine a parameter of the test audio data according to a distance between an i-vector of the test audio data and a cluster center of each cluster; and
a speaker determining unit for determining a speaker of the test audio data according to the parameters of the test audio data and the audio database.
Note 8 the data processing apparatus according to note 7, wherein the second parameter determination unit selects a plurality of distances from distances between an i-vector of the test audio data and a cluster center of each cluster, and takes the selected distances and the cluster centers corresponding to the distances as parameters of the test audio data.
Supplementary note 9. The data processing apparatus according to supplementary note 7, wherein the speaker determination unit is configured to perform the following operations:
comparing the parameters of the test audio data with the parameters of each of a plurality of registered audio data stored in the audio database;
determining the similarity between the test audio data and each registered audio data stored in the audio database according to the comparison result; and
determining a speaker of the registered audio data having the highest similarity with the test audio data as a speaker of the test audio data.
Supplementary note 10. The data processing apparatus according to supplementary note 1, wherein the training unit trains the DNN model using a back-propagation algorithm.
Supplementary note 11. A data processing method, comprising:
extracting an i-vector of training audio data from each of a plurality of training audio data;
dividing the i-vectors into a plurality of clusters, and calculating a cluster center of each of the plurality of clusters;
calculating the distance between the i-vector of each training audio data and the cluster center of each cluster; and
training a Deep Neural Network (DNN) model,
wherein training the DNN model comprises: taking the distance between the i-vector of each training audio data and the cluster center of each cluster as an output true value of the DNN model.
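The training procedure of supplementary note 11 can be sketched end-to-end as follows. This is a minimal illustration, not the patented implementation: it assumes k-means clustering and Euclidean distance (the note does not fix either choice), and the i-vectors are random stand-ins for vectors extracted from real audio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for i-vectors extracted from training audio data (one row per utterance).
ivectors = rng.normal(size=(100, 8))

# Step 1: divide the i-vectors into clusters (k-means is one common choice).
def kmeans(x, k, iters=20):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(x[:, None] - centers[None], axis=2), axis=1)
        # Keep the old center if a cluster happens to become empty.
        centers = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

centers = kmeans(ivectors, k=4)

# Step 2: distance between each i-vector and every cluster center;
# each row is then used as the output true value for the DNN.
targets = np.linalg.norm(ivectors[:, None] - centers[None], axis=2)
print(targets.shape)  # (100, 4): one distance per (utterance, cluster) pair
```

With these targets in hand, the DNN of supplementary note 14 is trained to regress from a supervector to this distance vector.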
Supplementary note 12. The data processing method according to supplementary note 11, wherein extracting an i-vector of the training audio data comprises:
extracting the i-vector of the training audio data using a Gaussian mixture model (GMM).
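A rough sketch of the GMM step is below. In a real i-vector system the GMM is a universal background model (UBM) trained on many speakers, and the i-vector is obtained through a learned total variability matrix T (s = m + Tw); here the GMM is fit on one utterance and T is a random matrix purely for illustration, so only the shapes are meaningful.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Frame-level features for one utterance (e.g. MFCCs) - stand-in data.
frames = rng.normal(size=(200, 12))

# Fit a small GMM to the utterance; a real system would instead adapt
# a UBM trained on a large multi-speaker corpus.
gmm = GaussianMixture(n_components=4, random_state=0).fit(frames)

# GMM supervector: the component mean vectors stacked into one long vector.
supervector = gmm.means_.ravel()           # shape (4 * 12,) = (48,)

# Crude stand-in for i-vector extraction: project the supervector into a
# low-dimensional space. T is learned in practice; random here (hypothetical).
T = rng.normal(size=(48, 10))
ivector = np.linalg.pinv(T) @ supervector  # illustrative 10-dim "i-vector"
print(ivector.shape)
```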
Supplementary note 13. The data processing method according to supplementary note 11, wherein dividing the i-vectors into a plurality of clusters comprises:
calculating the distance between every two i-vectors; and
dividing the i-vectors into a plurality of clusters according to the distances between every two i-vectors.
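The pairwise-distance clustering of supplementary note 13 can be sketched as follows. The note does not name a clustering rule, so this example uses a simple threshold-based single-linkage grouping (union-find over pairs closer than the median pairwise distance) as one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(2)
ivectors = rng.normal(size=(6, 3))
n = len(ivectors)

# Distance between every two i-vectors.
diff = ivectors[:, None] - ivectors[None]
pairwise = np.linalg.norm(diff, axis=2)          # (6, 6), symmetric, zero diagonal

# Merge any two i-vectors closer than a threshold (union-find).
threshold = np.median(pairwise[pairwise > 0])
parent = list(range(n))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
for a in range(n):
    for b in range(a + 1, n):
        if pairwise[a, b] < threshold:
            parent[find(a)] = find(b)

# Collect clusters and compute each cluster center as the member mean.
clusters = {}
for i in range(n):
    clusters.setdefault(find(i), []).append(i)
centers = [ivectors[idx].mean(axis=0) for idx in clusters.values()]
print(len(clusters), "clusters")
```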
Supplementary note 14. The data processing method according to supplementary note 11, wherein training the DNN model further comprises:
extracting a supervector for each of the plurality of training audio data; and
taking the supervector of each training audio data as an input feature of the DNN model.
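Supplementary notes 11 and 14 together define a regression task: supervector in, distance-to-each-cluster-center out. A small multi-output regressor can stand in for the DNN; the data below is random and only illustrates the input/output contract, not real training.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

# Stand-in supervectors (DNN input) and distance vectors (output true values):
# one row per training utterance, one distance per cluster.
supervectors = rng.normal(size=(200, 48))
distance_targets = np.abs(rng.normal(size=(200, 4)))

# Small multi-layer network mapping supervector -> distances to cluster centers.
dnn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
dnn.fit(supervectors, distance_targets)

predicted = dnn.predict(supervectors[:1])
print(predicted.shape)  # one predicted distance per cluster
```

At registration and test time (supplementary notes 15 and 17) the same network predicts this distance vector directly from a supervector, so no i-vector needs to be extracted for new audio.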
Supplementary note 15. The data processing method according to supplementary note 11, wherein the data processing method further comprises performing registration for each of a plurality of registered audio data, including:
extracting a supervector of the registered audio data;
determining a distance between an i-vector of the registered audio data and a cluster center of each cluster according to the DNN model; and
determining parameters of the registered audio data to be stored in an audio database according to the distance between the i-vector of the registered audio data and the cluster center of each cluster.
Supplementary note 16. The data processing method according to supplementary note 15, wherein determining the parameters of the registered audio data comprises:
selecting a plurality of distances from the distances between the i-vector of the registered audio data and the cluster center of each cluster; and
taking the selected distances and the cluster centers corresponding to those distances as the parameters of the registered audio data.
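The parameter selection of supplementary note 16 amounts to keeping the clusters nearest to the utterance. A minimal sketch, assuming "a plurality of distances" means the k smallest (k is a design choice not fixed by the text):

```python
import numpy as np

# Distances from one registered utterance's i-vector to each cluster
# center, as predicted by the DNN (illustrative values).
distances = np.array([0.9, 0.2, 1.4, 0.5, 0.7])

# Keep the k closest clusters; store (cluster index, distance) pairs
# as this utterance's parameters in the audio database.
k = 3
nearest = np.argsort(distances)[:k]
parameters = [(int(j), float(distances[j])) for j in nearest]
print(parameters)  # [(1, 0.2), (3, 0.5), (4, 0.7)]
```

Storing only the nearest clusters keeps the database entry compact while preserving the part of the distance profile most characteristic of the speaker.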
Supplementary note 17. The data processing method according to supplementary note 15, wherein the data processing method further comprises performing a test for each of a plurality of test audio data, including:
extracting a supervector of the test audio data;
determining a distance between an i-vector of the test audio data and a cluster center of each cluster according to the DNN model;
determining parameters of the test audio data according to the distance between the i-vector of the test audio data and the cluster center of each cluster; and
determining a speaker of the test audio data from the parameters of the test audio data and the audio database.
Supplementary note 18. The data processing method according to supplementary note 17, wherein determining the parameters of the test audio data comprises:
selecting a plurality of distances from the distances between the i-vector of the test audio data and the cluster center of each cluster; and
taking the selected distances and the cluster centers corresponding to those distances as the parameters of the test audio data.
Supplementary note 19. The data processing method according to supplementary note 17, wherein determining the speaker of the test audio data comprises:
comparing the parameters of the test audio data with the parameters of each of a plurality of registered audio data stored in the audio database;
determining the similarity between the test audio data and each registered audio data stored in the audio database according to the comparison result; and
determining a speaker of the registered audio data having the highest similarity with the test audio data as a speaker of the test audio data.
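The identification step of supplementary note 19 can be sketched as below. The text does not define the similarity measure, so negative Euclidean distance between parameter vectors is used here as one plausible choice; the names and values are illustrative.

```python
import numpy as np

# Audio database: speaker -> parameters (distances to cluster centers)
# stored at registration time. Entries are illustrative.
database = {
    "alice": np.array([0.2, 0.9, 1.1]),
    "bob":   np.array([1.0, 0.3, 0.8]),
}

# Parameters computed for the test utterance via the DNN.
test_params = np.array([0.25, 0.85, 1.05])

# One plausible similarity: negative Euclidean distance between the
# parameter vectors (higher means more similar).
def similarity(a, b):
    return -np.linalg.norm(a - b)

scores = {name: similarity(test_params, p) for name, p in database.items()}
speaker = max(scores, key=scores.get)
print(speaker)  # "alice": the registered entry with the highest similarity
```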
Supplementary note 20. A machine-readable storage medium on which a program product is embodied, the program product comprising machine-readable instruction code, wherein the instruction code, when read and executed by a computer, causes the computer to execute the data processing method according to any one of supplementary notes 11 to 19.

Claims (9)

1. A data processing apparatus comprising:
an extraction unit configured to extract an i-vector of training audio data from each of a plurality of training audio data;
a dividing unit configured to divide the i-vectors into a plurality of clusters and calculate a cluster center of each of the plurality of clusters;
a computing unit configured to compute the distance between the i-vector of each training audio data and the cluster center of each cluster; and
a training unit configured to train a deep neural network (DNN) model,
wherein the training unit takes the distance between the i-vector of each training audio data and the cluster center of each cluster as an output true value of the DNN model,
wherein the input of the DNN model trained by the training unit is a supervector of the audio data, the output is the distance between the i-vector of the audio data and the cluster center of each cluster, and
wherein the data processing apparatus further includes a registration unit operable to perform registration for each of a plurality of registration audio data, the registration unit including:
a first supervector determining unit configured to extract a supervector of the registered audio data;
a first distance determining unit configured to determine a distance between an i-vector of the registered audio data and a cluster center of each cluster according to the DNN model; and
a first parameter determining unit configured to determine parameters of the registered audio data to be stored in an audio database according to the distance between the i-vector of the registered audio data and the cluster center of each cluster.
2. The data processing apparatus according to claim 1, wherein the extraction unit extracts an i-vector of the training audio data using a gaussian mixture model GMM.
3. The data processing apparatus according to claim 1, wherein the dividing unit calculates a distance between every two i-vectors, and divides the i-vectors into a plurality of clusters according to the distance between every two i-vectors.
4. The data processing apparatus of claim 1, wherein the extraction unit is further configured to extract a supervector of each of the plurality of training audio data, and the training unit is further configured to use the supervector of each training audio data as an input feature of the DNN model.
5. The data processing apparatus according to claim 1, wherein the first parameter determination unit selects a plurality of distances from distances between an i-vector of the registration audio data and a cluster center of each cluster, and takes the selected distances and the cluster centers corresponding to the distances as parameters of the registration audio data.
6. The data processing apparatus according to claim 1, wherein the data processing apparatus further comprises a test unit for performing a test for each of a plurality of test audio data, the test unit comprising:
a second supervector determination unit for extracting a supervector of the test audio data;
a second distance determining unit, configured to determine, according to the DNN model, a distance between an i-vector of the test audio data and a cluster center of each cluster;
a second parameter determining unit, configured to determine a parameter of the test audio data according to a distance between an i-vector of the test audio data and a cluster center of each cluster; and
a speaker determining unit for determining a speaker of the test audio data according to the parameters of the test audio data and the audio database.
7. The data processing apparatus according to claim 6, wherein the second parameter determination unit selects a plurality of distances from distances between an i-vector of the test audio data and a cluster center of each cluster, and takes the selected distances and the cluster centers corresponding to the distances as parameters of the test audio data.
8. The data processing apparatus of claim 6, wherein the speaker determination unit is configured to:
comparing the parameters of the test audio data with the parameters of each of a plurality of registered audio data stored in the audio database;
determining the similarity between the test audio data and each registered audio data stored in the audio database according to the comparison result; and
determining a speaker of the registered audio data having the highest similarity with the test audio data as a speaker of the test audio data.
9. A method of data processing, comprising:
extracting an i-vector of training audio data from each of a plurality of training audio data;
dividing the i-vectors into a plurality of clusters, and calculating a cluster center of each of the plurality of clusters;
calculating the distance between the i-vector of each training audio data and the cluster center of each cluster; and
training a Deep Neural Network (DNN) model,
wherein training the DNN model comprises: taking the distance between the i-vector of each training audio data and the cluster center of each cluster as the output true value of the DNN model,
wherein the input of the DNN model is a supervector of the audio data, the output is the distance between the i-vector of the audio data and the cluster center of each cluster, and
wherein the data processing method further comprises performing registration for each of a plurality of registration audio data, including:
extracting a supervector of the registered audio data;
determining a distance between an i-vector of the registered audio data and a cluster center of each cluster according to the DNN model; and
determining parameters of the registered audio data to be stored in an audio database according to the distance between the i-vector of the registered audio data and the cluster center of each cluster.
CN201710001199.3A 2017-01-03 2017-01-03 Data processing apparatus and data processing method Active CN108268948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710001199.3A CN108268948B (en) 2017-01-03 2017-01-03 Data processing apparatus and data processing method

Publications (2)

Publication Number Publication Date
CN108268948A CN108268948A (en) 2018-07-10
CN108268948B (en) 2022-02-18

Family

ID=62771355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710001199.3A Active CN108268948B (en) 2017-01-03 2017-01-03 Data processing apparatus and data processing method

Country Status (1)

Country Link
CN (1) CN108268948B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method
CN102664011A (en) * 2012-05-17 2012-09-12 吉林大学 Method for quickly recognizing speaker
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104464738A (en) * 2014-10-31 2015-03-25 北京航空航天大学 Vocal print recognition method oriented to smart mobile device
CN105095920A (en) * 2015-09-10 2015-11-25 大连理工大学 Large-scale multi-label classification method based on clustering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant