WO2018029777A1

WO2018029777A1 - Speaker adaptation device, speech recognition apparatus and speech recognition method

Info

Publication number: WO2018029777A1
Application number: PCT/JP2016/073408
Authority: WO
Inventors: 勇気太刀岡
Original assignee: 三菱電機株式会社
Priority date: 2016-08-09
Filing date: 2016-08-09
Publication date: 2018-02-15
Also published as: JPWO2018029777A1; JP6324647B1

Abstract

An adaptation unit (7) calculates a weight for each weight matrix indicating connection weights between nodes in a DNN (5), either one weight per learning speaker (N in total) or one weight per learning speaker and per output dimension (D_out) of the output (x_out) of a speaker adaptation layer (5-3), such that the error calculated by an error calculation unit (6) is reduced.

Description

Speaker adaptation device, speech recognition device, and speech recognition method

The present invention relates to a speaker adaptation device that adapts an acoustic model using a deep neural network (hereinafter referred to as DNN) to a speaker, a speech recognition device and a speech recognition method using the same.

In speech recognition, recognition performance is improved by adapting the acoustic model to the speaker. For example, in speech recognition using a Hidden Markov Model (hereinafter referred to as HMM), a Gaussian Mixture Model (hereinafter referred to as GMM) is widely used as the output probability distribution of acoustic features (see Non-Patent Document 1). In a GMM, model parameters are adapted to a speaker by learning them based on a maximum likelihood criterion. To further improve the accuracy of speech recognition, however, it has been proposed to use a DNN instead of a GMM in HMM-based speech recognition.

Examples of speaker adaptation methods using a DNN include the adaptation methods described in Patent Document 1 and Non-Patent Document 3. In these methods, a specific layer among the plurality of layers in the DNN is used as a speaker adaptation layer.
Non-Patent Document 2 describes a technique for adapting a DNN to a speaker using an auxiliary feature such as an i-vector.

Japanese Patent Laying-Open No. 2015-102806

The adaptation methods described in Patent Document 1 and Non-Patent Document 3 are effective when a large amount of adaptation data is used, but it is usually difficult to use so much adaptation data.

In addition, since the adaptation method described in Non-Patent Document 2 uses auxiliary feature amounts, there are problems in that the amount of computation for speaker adaptation is large and the accuracy of speaker adaptation varies greatly with the accuracy of the auxiliary feature amounts.

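The present invention solves the above-mentioned problems, and an object thereof is to obtain a speaker adaptation device, a speech recognition device, and a speech recognition method that can appropriately perform speaker adaptation of a DNN without using auxiliary feature amounts and without using a large amount of adaptation data.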

The speaker adaptation device according to the present invention includes an error calculation unit and a first adaptation unit. The error calculation unit calculates the error between teacher data and the output data of the output layer of a DNN that has an input layer, an output layer, and one or more intermediate layers between the input layer and the output layer, one of the intermediate layers being a speaker adaptation layer. The first adaptation unit receives weight matrices that indicate connection weights between nodes in the DNN and that are obtained from the learning data of learning speakers, and calculates the weights of the weight matrices in the speaker adaptation layer, either for each learning speaker or for each learning speaker and each dimension of the output of the speaker adaptation layer, so that the error calculated by the error calculation unit is reduced.

According to the present invention, the weights of the weight matrices indicating connection weights between nodes in the speaker adaptation layer are calculated, either for each learning speaker or for each learning speaker and each output dimension of the speaker adaptation layer, so that the error between the output data of the DNN output layer and the teacher data is reduced. Therefore, DNN speaker adaptation is possible without using auxiliary feature amounts. DNN speaker adaptation can also be performed appropriately without using a large amount of adaptation data.

FIG. 1 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a configuration example of the speaker adaptation device and the DNN according to Embodiment 1.
FIG. 3A is a block diagram showing a hardware configuration that realizes the functions of the speaker adaptation device according to Embodiment 1. FIG. 3B is a block diagram showing a hardware configuration that executes software implementing the functions of the speaker adaptation device according to Embodiment 1.
FIG. 4 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 1.
FIG. 5 is a diagram showing an output example of the DNN.
FIG. 6 is a block diagram showing a configuration example of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 7 is a block diagram showing a configuration example of the speaker adaptation device and the DNN according to Embodiment 2.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus according to Embodiment 2.
FIG. 9 is a diagram showing a configuration example of the DNN in Embodiment 3 of the present invention.
FIG. 10 is a flowchart showing part of the operation of the speaker adaptation device according to Embodiment 3.
FIG. 11 is a block diagram showing a configuration example of the speaker adaptation device and the DNN according to Embodiment 4 of the present invention.
FIG. 12 is a block diagram showing a configuration example of the speaker adaptation device and the DNN according to Embodiment 5 of the present invention.

Hereinafter, in order to describe the present invention in more detail, modes for carrying out the present invention will be described with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a block diagram showing a configuration example of a speech recognition apparatus 1 according to Embodiment 1 of the present invention. FIG. 2 is a block diagram illustrating a configuration example of the speaker adaptation device 4 and the DNN 5.
As shown in FIG. 1, the speech recognition apparatus 1 includes a feature amount extraction unit 2, speech recognition units 3a and 3b, a speaker adaptation device 4, and a DNN 5. As shown in FIG. 2, the speaker adaptation device 4 includes an error calculation unit 6, an adaptation unit 7, and a storage unit 8.

The feature amount extraction unit 2 inputs a speaker voice collected by a microphone (not shown), and extracts a voice feature amount from the input speaker voice. For example, the feature amount extraction unit 2 extracts a time series of feature vectors as a feature amount by performing an acoustic feature amount analysis on the speaker voice.
The voice recognition unit 3a performs voice recognition of the speaker voice based on the voice feature amount extracted by the feature amount extraction unit 2, and obtains alignment information based on the voice recognition result.
The alignment information indicates, for each time in the time-series speech recognition result, the state (state number) of the HMM at that time.

The voice recognition unit 3b performs voice recognition of the speaker voice using the DNN 5 adapted to the adaptation target speaker. The recognition result obtained by the voice recognition unit 3b is output to a subsequent output device as a final voice recognition result.
Although FIG. 1 shows a configuration in which the voice recognition unit 3a and the voice recognition unit 3b are separately provided, one voice recognition unit may be provided, and the voice recognition unit may have both functions.

The speaker adaptation device 4 adapts the DNN 5 to the adaptation target speaker based on the alignment information input from the speech recognition unit 3a.
The DNN 5 is a neural network having a number of layers, and includes an input layer 5-1, an output layer 5-5, and one or more intermediate layers 5-2 to 5-4 provided between the input layer 5-1 and the output layer 5-5.

The input layer 5-1 is the layer of the DNN 5 that first receives information, and has a plurality of input nodes. The output layer 5-5 has as many output nodes as there are recognition targets. Each of the intermediate layers 5-2 to 5-4 has a plurality of nodes, and any one of these layers serves as the intermediate layer for speaker adaptation. In the example of FIG. 2, the intermediate layer between the intermediate layer 5-2 and the intermediate layer 5-4 is the speaker adaptation layer 5-3.

The error calculation unit 6 calculates an error between the output data of the output layer 5-5 in the DNN 5 and the teacher data. For example, based on the alignment information input from the speech recognition unit 3a, the error calculation unit 6 specifies the output data that should be output from the output layer 5-5 when the feature amount of the speech uttered by the adaptation target speaker is input to the input layer 5-1. Then, using that output data as teacher data, the error calculation unit 6 calculates the error between it and the data actually output from the output layer 5-5. This way of calculating the error is known from the error backpropagation method.
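As an illustration of the error calculation described above, the following minimal sketch turns alignment information (one HMM state number per frame) into one-hot teacher data and compares it with the posteriors from the output layer. The text does not name the error function; the cross-entropy used here, the function names `one_hot` and `cross_entropy_error`, and the example values are all assumptions made for illustration.

```python
import math

def one_hot(state, num_states):
    """Teacher vector for one frame: 1.0 at the aligned HMM state, 0.0 elsewhere."""
    return [1.0 if s == state else 0.0 for s in range(num_states)]

def cross_entropy_error(posteriors, alignment):
    """Mean cross-entropy between output-layer posteriors and one-hot teacher data.

    posteriors: list of per-frame probability lists (one value per HMM state).
    alignment:  list of per-frame HMM state numbers (the alignment information).
    Assumption: cross-entropy is the error measure; the text only says "error".
    """
    total = 0.0
    for frame_post, state in zip(posteriors, alignment):
        teacher = one_hot(state, len(frame_post))
        # Only the aligned state contributes, because the teacher is one-hot.
        total -= sum(t * math.log(max(p, 1e-12))
                     for t, p in zip(teacher, frame_post))
    return total / len(alignment)

# Hypothetical example: 2 frames, 3 HMM states.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
alignment = [0, 1]
error = cross_entropy_error(posteriors, alignment)
```

The error shrinks toward zero as the posterior of the aligned state approaches 1 in every frame, which is the direction in which the adaptation unit is described as moving the parameters.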

The adaptation unit 7 embodies the first adaptation unit of the present invention, and adapts the speaker adaptation layer 5-3 in the DNN 5 to the adaptation target speaker. When adapting the speaker adaptation layer 5-3 to the adaptation target speaker, if adaptation data composed of the speech of the adaptation target speaker is used, the effect of speaker adaptation on the adaptation target speaker is enhanced. However, for this purpose, it is necessary to collect a large amount of adaptation data consisting of the speech of the adaptation target speaker.

Therefore, the adaptation unit 7 sets any one of the intermediate layers 5-2 to 5-4 in the DNN 5 as the speaker adaptation layer 5-3 and uses, for speaker adaptation, N weight matrices W_n obtained in advance by training the DNN 5 on the learning data of N learning speakers.
The subscript n indicates one of the N learning speakers and is a positive integer from 1 to N. Each node of the DNN 5 is given a connection weight and a bias, and the weight matrix W_n is a matrix whose elements are connection weights between nodes in the DNN 5.

The adaptation unit 7 calculates the weight w_n of the weight matrix W_n in the speaker adaptation layer 5-3 so that the error calculated by the error calculation unit 6 is reduced.
Alternatively, the adaptation unit 7 calculates the weight w_n of the weight matrix W_n for each dimension of the output of the speaker adaptation layer 5-3.

The storage unit 8 stores speaker-independent learning data that does not depend on the characteristics of the specific speaker described above.
In Embodiment 1, the storage unit 8 stores weight matrix data 8-1 to 8-N obtained from the learning data of the N learning speakers. The weight matrix data 8-1 to 8-N are the weight matrices W_n (n = 1 to N).
Although FIG. 2 shows a configuration in which the speaker adaptation device 4 includes the storage unit 8, the present invention is not limited to this. That is, the storage unit 8 may be constructed in an external storage device that can be read from the speaker adaptation device 4.

Each function of the error calculation unit 6 and the adaptation unit 7 in the speaker adaptation device 4 is realized by a processing circuit. That is, the speaker adaptation device 4 includes a processing circuit for calculating the error between the output data of the output layer 5-5 in the DNN 5 and the teacher data, and for calculating the weight w_n of the weight matrix W_n in the speaker adaptation layer 5-3 so that the error is reduced.
The processing circuit may be dedicated hardware or a CPU (Central Processing Unit) that executes a program stored in the memory.

FIG. 3A shows a hardware processing circuit that implements the functions of the speaker adaptation device 4, and FIG. 3B shows a hardware configuration that executes software implementing those functions. As shown in FIG. 3A, when the processing circuit is a dedicated hardware processing circuit 100, the processing circuit 100 may be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. The functions of the error calculation unit 6 and the adaptation unit 7 may each be realized by a separate processing circuit, or may be realized together by a single processing circuit.

As shown in FIG. 3B, when the processing circuit is the CPU 101, the functions of the error calculation unit 6 and the adaptation unit 7 are realized by software, firmware, or a combination of software and firmware. Software and firmware are described as programs and stored in the memory 102.
The CPU 101 reads out and executes the programs stored in the memory 102, thereby realizing the functions of each unit. That is, the speaker adaptation device 4 includes the memory 102 for storing programs that, when executed by the CPU 101, result in execution of the process of calculating the error between the output data of the output layer 5-5 and the teacher data and the process of calculating the weight w_n so that the error is reduced. These programs also cause a computer to execute the procedures or methods of the error calculation unit 6 and the adaptation unit 7.

The memory 102 is, for example, a nonvolatile or volatile semiconductor memory such as a RAM (Random Access Memory), ROM, flash memory, EPROM (Erasable Programmable ROM), or EEPROM (Electrically Erasable Programmable ROM), or a magnetic disk, flexible disk, optical disc, compact disc, mini disc, or DVD (Digital Versatile Disc).

In addition, the functions of the error calculation unit 6 and the adaptation unit 7 may be realized partly by dedicated hardware and partly by software or firmware.
For example, the error calculation unit 6 realizes its function by the dedicated hardware processing circuit 100, and the adaptation unit 7 realizes the function by the CPU 101 executing the program stored in the memory 102.
As described above, the processing circuit can realize the above-described functions by hardware, software, firmware, or a combination thereof.

Next, the operation will be described.
FIG. 4 is a flowchart showing the operation of the speech recognition apparatus 1.
First, the feature quantity extraction unit 2 inputs the speaker voice collected by the microphone, and extracts the feature quantity from the input voice (step ST1). The audio feature amount is, for example, a time series of feature vectors. Further, data indicating the voice feature amount is input from the feature amount extraction unit 2 to the voice recognition unit 3a and DNN5.

Next, the speech recognition unit 3a performs speech recognition of the speaker speech based on the speech feature amount extracted by the feature amount extraction unit 2 (step ST2).
Furthermore, the speech recognition unit 3a acquires alignment information based on the speech recognition result (step ST3). The alignment information obtained in this way is input to the speaker adaptation device 4 from the speech recognition unit 3a.

The error calculation unit 6 calculates an error between the output data of the DNN 5 to which the feature amount of the speech voice of the adaptation target speaker is input and the teacher data (step ST4). The teacher data is determined from the alignment information.
As described above, the alignment information may be obtained by voice recognition of the utterance voice by the voice recognition unit 3a without teacher data. Alignment information may be obtained based on the utterance content.

The adaptation unit 7 receives the N weight matrices W_n from the storage unit 8 and calculates the weight w_n of each weight matrix W_n so that the error calculated by the error calculation unit 6 is reduced (step ST5).
Based on the weights w_n calculated in this way, the adaptation unit 7 adapts the speaker adaptation layer 5-3 to the adaptation target speaker (step ST6).

For example, in Embodiment 1, the output x_out of the speaker adaptation layer 5-3 is calculated according to the following equation (1). In equation (1), the output x_out is represented by a vector having elements of a plurality of dimensions. W_n is the weight matrix for the learning data of learning speaker n, and w_n is the weight of the weight matrix W_n. Thus, in equation (1), one weight is defined for each weight matrix. The input x_in is the output of the intermediate layer 5-2 preceding the speaker adaptation layer 5-3, that is, the input of the speaker adaptation layer 5-3. The input x_in is also represented by a vector having elements of a plurality of dimensions.
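The equation itself is not reproduced in this text. A form consistent with the surrounding description (one scalar weight per weight matrix, and an average over the N weighted results, as the later discussion of equation (1) states) would be the following. This is a reconstruction under those assumptions rather than the published formula, and it omits any bias term or activation function.

```latex
x_{out} = \frac{1}{N} \sum_{n=1}^{N} w_n \, W_n \, x_{in} \tag{1}
```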

When the feature amount of the speech uttered by the adaptation target speaker is input to the input layer 5-1 of the DNN 5, this information propagates in order through the intermediate layer 5-2, the speaker adaptation layer 5-3, and the intermediate layer 5-4, and is output from the output layer 5-5.
The adaptation unit 7 obtains the input x_in and the output x_out of the speaker adaptation layer 5-3 using the speech feature amount input to the input layer 5-1, the alignment information, and equation (1) above. Next, the adaptation unit 7 reads the weight matrix W_n for the learning data of learning speaker n from the storage unit 8 and calculates the weight w_n according to equation (1) above, using the weight matrix W_n, the input x_in, and the output x_out.

The adaptation unit 7 corrects the value of the weight w_n so that the error sequentially calculated by the error calculation unit 6 decreases. The adaptation unit 7 then sets the weight w_n obtained when the error falls below a predetermined threshold in equation (1) above as the final weight for the weight matrix W_n of the learning data of learning speaker n. The adaptation unit 7 performs this process for each of the N weight matrices W_n, whereby the speaker adaptation layer 5-3 is adapted to the adaptation target speaker. That is, the number of parameters that need to be adapted is N.
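The iterative correction described above can be sketched as a small gradient-descent loop over the N scalar weights. This is a simplified illustration under stated assumptions: the layer output is taken to be the average of w_n W_n x_in over n, the error is a squared error measured directly on the adaptation layer output (in the text it is measured at the output layer 5-5), and the names `adapt_weights`, `lr`, `threshold`, and `max_iters` as well as the example data are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def layer_output(weights, matrices, x_in):
    """Average over n of w_n * (W_n x_in); an assumed form of equation (1)."""
    n = len(matrices)
    projected = [matvec(W, x_in) for W in matrices]
    dim = len(projected[0])
    return [sum(w * p[d] for w, p in zip(weights, projected)) / n
            for d in range(dim)]

def adapt_weights(matrices, x_in, target, lr=1.0, threshold=1e-6, max_iters=1000):
    """Correct the N scalar weights until the squared error is below threshold.

    Simplification (assumption): the error is measured on the adaptation layer
    output itself instead of being backpropagated from the DNN output layer.
    """
    n = len(matrices)
    weights = [1.0] * n                      # start from equal weights
    projected = [matvec(W, x_in) for W in matrices]
    error = float("inf")
    for _ in range(max_iters):
        out = layer_output(weights, matrices, x_in)
        diff = [o - t for o, t in zip(out, target)]
        error = 0.5 * sum(d * d for d in diff)
        if error < threshold:
            break                            # keep the current weights as final
        # dE/dw_n = (1/N) * (W_n x_in) . (x_out - target)
        grads = [sum(p_d * d for p_d, d in zip(p, diff)) / n for p in projected]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, error

# Hypothetical data: two learning speakers, two-dimensional layer.
matrices = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
x_in = [1.0, 0.0]
target = layer_output([0.4, 1.6], matrices, x_in)  # output of a "true" speaker
weights, error = adapt_weights(matrices, x_in, target)
```

Only N numbers are updated in the loop, regardless of the size of the matrices W_n, which is the point the paragraph above makes about the number of adapted parameters.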

The adaptation unit 7 may instead calculate the output x_out of the speaker adaptation layer 5-3 according to the following equation (2), in which ".*" denotes the element-wise product of vectors.
In this case, the weight w_n of the weight matrix W_n is represented by a vector having the same number of dimensions D_out as the output x_out.
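Equation (2) is likewise not reproduced in this text. Assuming it differs from equation (1) only in replacing the scalar weight by a D_out-dimensional weight vector applied element by element, a consistent form would be the following, where the symbol \odot stands for the element-wise product written ".*" in the text. This is a hedged reconstruction, not the published formula.

```latex
x_{out} = \frac{1}{N} \sum_{n=1}^{N} w_n \odot \left( W_n \, x_{in} \right),
\qquad w_n \in \mathbb{R}^{D_{out}} \tag{2}
```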

The adaptation unit 7 corrects the value of the weight w_n so that the error sequentially calculated by the error calculation unit 6 decreases, and sets the weight w_n obtained when the error falls below a predetermined threshold in equation (2) above as the final weight for the weight matrix W_n of the learning data of learning speaker n.
The adaptation unit 7 performs this process for each of the N weight matrices W_n and for each dimension of the output x_out, whereby the speaker adaptation layer 5-3 is adapted to the adaptation target speaker.
That is, when the number of dimensions of the output x_out is D_out, the number of parameters that need to be adapted is N × D_out.

The output x_out of the speaker adaptation layer 5-3 obtained by equation (1) above is the average, over the N weight matrices, of the values computed by weighting the input x_in of the speaker adaptation layer 5-3 with the weight matrix W_n and then with the weight w_n; however, the output is not limited to this.
For example, as shown in the following equation (3), the maximum value among the N calculated values may be used as the output x_out. Here, max_r denotes returning the maximum element for each row.

Moreover, the adaptation unit 7 may weight the input x_in of the speaker adaptation layer 5-3 with each weight matrix W_n and then weight the result element by element with the weight w_n, as in equation (2).
The maximum value among the N × D_out values obtained in this way may be used as the output x_out of the speaker adaptation layer 5-3.
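The maximum-based variants can be illustrated with the sketch below. It assumes that "maximum element for each row" means taking, for each output dimension, the largest of the N weighted results, both for scalar weights and for element-wise weight vectors. The function name `max_layer_output` and the example values are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def max_layer_output(weights, matrices, x_in):
    """Per-dimension maximum over the N weighted results w_n * (W_n x_in).

    weights[n] may be a scalar (one weight per matrix) or a list holding one
    weight per output dimension (element-wise weighting).
    """
    candidates = []
    for w, W in zip(weights, matrices):
        projected = matvec(W, x_in)
        if isinstance(w, (int, float)):
            w = [w] * len(projected)         # broadcast the scalar weight
        candidates.append([wi * p for wi, p in zip(w, projected)])
    # For each output dimension, keep the largest of the N candidate values.
    return [max(values) for values in zip(*candidates)]

# Hypothetical data: two learning speakers, two-dimensional layer.
matrices = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
x_in = [1.0, 1.0]

scalar_out = max_layer_output([0.5, 2.0], matrices, x_in)                # [2.0, 3.5]
vector_out = max_layer_output([[1.0, 0.1], [2.0, 1.0]], matrices, x_in)  # [3.0, 1.0]
```

In the scalar case the two candidates are [1.5, 3.5] and [2.0, 2.0], so each output dimension is taken from a different learning speaker.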

In step ST7, the speech recognition unit 3b performs speech recognition using the DNN 5 in which the speaker adaptation layer 5-3 has been adapted to the adaptation target speaker. For example, the output of the output layer 5-5 of the DNN 5 is a posterior probability for each state of the HMM used for speech recognition. The speech recognition unit 3b performs pattern matching on the feature pattern of the speech extracted by the feature amount extraction unit 2 using the posterior probability for each HMM state output from the output layer 5-5, and calculates a similarity based on the pattern matching. The speech recognition unit 3b generates and outputs a speech recognition result based on the similarity calculated in this way.

Further, speech recognition may be performed using the output from the intermediate layer 5-4 of the DNN 5.
FIG. 5 is a diagram illustrating an output example of the DNN 5, and illustrates a case where the feature amount obtained in the intermediate layer 5-4 is output. In this case, the output from the intermediate layer 5-4 is used, for example, for speech recognition of the subsequent speech recognition unit 3b as a bottleneck feature amount.
Here, the bottleneck feature value is a feature value extracted from DNN 5 having a bottleneck structure in which the number of nodes in the intermediate layer is reduced.

As described above, in the speaker adaptation device 4 according to Embodiment 1, the adaptation unit 7 calculates the weight w_n of the weight matrix W_n in the speaker adaptation layer 5-3 so that the error calculated by the error calculation unit 6 is reduced.
Alternatively, the adaptation unit 7 calculates the weight w_n of the weight matrix W_n for each of the D_out dimensions of the output x_out of the speaker adaptation layer 5-3.
In the conventional technique, the number of parameters that need to be adapted is D_in × D_out, whereas in the speaker adaptation device 4 the number of parameters is N or N × D_out.
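To make the difference in parameter counts concrete, the short sketch below evaluates the three expressions for one hypothetical layer size. The values 1024 and 10, the reading of D_in as the input dimension of the speaker adaptation layer (by analogy with D_out), and the function name `adaptation_parameter_counts` are assumptions for illustration and do not come from the text.

```python
def adaptation_parameter_counts(d_in, d_out, n_speakers):
    """Number of parameters adapted under each scheme described in the text."""
    return {
        "full_matrix": d_in * d_out,               # conventional: adapt the whole layer
        "scalar_per_speaker": n_speakers,          # one weight w_n per matrix
        "vector_per_speaker": n_speakers * d_out,  # one weight per matrix and dimension
    }

# Hypothetical sizes: a 1024 x 1024 layer and 10 learning speakers.
counts = adaptation_parameter_counts(d_in=1024, d_out=1024, n_speakers=10)
```

With these sizes the conventional scheme adapts 1,048,576 parameters, against 10 or 10,240 for the two schemes of Embodiment 1.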
Thus, the speaker adaptation apparatus 4 can appropriately perform speaker adaptation of the DNN 5 without using a large amount of adaptation data.
Further, since an auxiliary feature quantity such as an i-vector is unnecessary, the amount of calculation is reduced, and the accuracy of speaker adaptation is not affected by the accuracy of the auxiliary feature quantity.

Further, in the conventional technique, a large amount of adaptation data is necessary to perform speaker adaptation with high accuracy. In the speaker adaptation device 4, by contrast, the average value or the maximum value over the N calculated values is used as the output x_out, as in equations (1) to (3) above. As a result, the accuracy of speaker adaptation can be maintained even when there is little adaptation data. That is, robustness when adaptation data is scarce can be improved.

Furthermore, the speech recognition apparatus 1 according to Embodiment 1 includes the speaker adaptation device 4, the DNN 5, and the speech recognition unit 3b, which performs speech recognition using the DNN 5 whose speaker adaptation layer 5-3 has been adapted to the adaptation target speaker by the speaker adaptation device 4. With this configuration, it is possible to realize a speech recognition apparatus 1 that obtains the above-described effects of the speaker adaptation device 4.

Furthermore, in the speech recognition method according to Embodiment 1, the speaker adaptation device 4 adapts the DNN 5 to the adaptation target speaker, and the speech recognition unit 3b performs speech recognition using the DNN 5 whose speaker adaptation layer 5-3 has been adapted to the adaptation target speaker. This provides a speech recognition method that obtains the above-described effects of the speaker adaptation device 4.

Embodiment 2
FIG. 6 is a block diagram showing a configuration example of a speech recognition apparatus 1A according to Embodiment 2 of the present invention. FIG. 7 is a block diagram illustrating a configuration example of the speaker adaptation device 4A and the DNN 5A.
As shown in FIG. 6, the speech recognition apparatus 1A includes a feature amount extraction unit 2, speech recognition units 3a and 3b, a speaker adaptation device 4A, and a DNN 5A.
As shown in FIG. 7, the speaker adaptation device 4A includes an error calculation unit 6, an adaptation unit 7A, and a storage unit 8. In FIGS. 6 and 7, the same components as those in FIGS. 1 and 2 are denoted by the same reference numerals, and their description is omitted.

The speaker adaptation device 4A adapts the DNN 5A to the adaptation target speaker based on the offset o_n of the output x_out of the speaker adaptation layer 5A-3. The DNN 5A is a neural network having a number of layers, and includes an input layer 5-1, an output layer 5-5, and one or more intermediate layers 5-2 to 5-4 provided between the input layer 5-1 and the output layer 5-5. In FIG. 7, the intermediate layer between the intermediate layer 5-2 and the intermediate layer 5-4 is the speaker adaptation layer 5A-3.
The speaker adaptation layer 5A-3 is an intermediate layer that is adapted to the adaptation target speaker based on the offset o_n.

The adaptation unit 7A embodies the second adaptation unit of the present invention, and adapts the speaker adaptation layer 5A-3 in the DNN 5A to the adaptation target speaker. Specifically, the adaptation unit 7A calculates the offset o_n of the output x_out of the speaker adaptation layer 5A-3 weighted by the weight matrix W_n so that the error calculated by the error calculation unit 6 is reduced. In this case, either a one-dimensional offset o_n or an offset o_n with the same number of dimensions as the output x_out of the speaker adaptation layer 5A-3 is calculated.

The functions of the error calculation unit 6 and the adaptation unit 7A in the speaker adaptation device 4A are realized by a processing circuit. These functions may be realized partly by dedicated hardware and partly by software or firmware.
For example, the function of the error calculation unit 6 is realized by the dedicated hardware processing circuit 100 shown in FIG. 3A, and the function of the adaptation unit 7A is realized by the CPU 101 shown in FIG. 3B executing a program stored in the memory 102.
As described above, the processing circuit can realize the above-described functions by hardware, software, firmware, or a combination thereof.

Next, the operation will be described.
FIG. 8 is a flowchart showing the operation of the speech recognition apparatus 1A. The processes from step ST1 to step ST4 and the process of step ST7 in FIG. 8 are the same as those in FIG. 4.
In step ST5a, the adaptation unit 7A receives the N weight matrices W_n from the storage unit 8 and calculates the offset o_n of the output x_out of the speaker adaptation layer 5A-3 weighted by the weight matrix W_n so that the error calculated by the error calculation unit 6 decreases.
After this, based on the offset o_n calculated in this way, the adaptation unit 7A adapts the speaker adaptation layer 5A-3 to the adaptation target speaker (step ST6a).

For example, in Embodiment 2, the output x_out of the speaker adaptation layer 5A-3 is calculated according to the following equation (4), in which o_n is the offset for the weight matrix W_n.
In equation (4), a one-dimensional offset is defined as the offset o_n of the output x_out of the speaker adaptation layer 5A-3.
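Equation (4) is not reproduced in this text. Based on the later statement that its output is the average over N of the values obtained by adding the offset to the input weighted by W_n, a consistent form would be the following, where \mathbf{1} is an all-ones vector of dimension D_out that spreads the one-dimensional offset over every output element. This is a reconstruction under those assumptions, not the published formula.

```latex
x_{out} = \frac{1}{N} \sum_{n=1}^{N} \left( W_n \, x_{in} + o_n \mathbf{1} \right),
\qquad o_n \in \mathbb{R} \tag{4}
```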

When the feature amount of the speech uttered by the adaptation target speaker is input to the input layer 5-1 of the DNN 5A, this information propagates in order through the intermediate layer 5-2, the speaker adaptation layer 5A-3, and the intermediate layer 5-4, and is output from the output layer 5-5.
The adaptation unit 7A obtains the input x_in and the output x_out of the speaker adaptation layer 5A-3 using the speech feature amount input to the input layer 5-1, the alignment information, and equation (4) above. Next, the adaptation unit 7A reads the weight matrix W_n for the learning data of learning speaker n from the storage unit 8 and calculates the offset o_n according to equation (4) above, using the weight matrix W_n, the input x_in, and the output x_out.

Here, the adaptation unit 7A corrects the value of the offset o_n so that the error sequentially calculated by the error calculation unit 6 decreases. The adaptation unit 7A then sets the offset o_n obtained when the error falls below a predetermined threshold in equation (4) above as the final offset. The adaptation unit 7A performs this process for each of the N weight matrices W_n, whereby the speaker adaptation layer 5A-3 is adapted to the adaptation target speaker. That is, the number of parameters that need to be adapted is N.

Note that the adaptation unit 7A may instead calculate the output x_out of the speaker adaptation layer 5A-3 according to the following equation (5). The offset o_n in equation (5) is represented by a vector having the same number of dimensions D_out as the output x_out of the speaker adaptation layer 5A-3.

The adaptation unit 7A corrects the value of the offset o_n so that the error sequentially calculated by the error calculation unit 6 decreases. The adaptation unit 7A then sets the offset o_n obtained when the error falls below a predetermined threshold in equation (5) above as the final offset.
The adaptation unit 7A performs this process for each of the N weight matrices W_n and for each of the D_out dimensions of the output x_out, whereby the speaker adaptation layer 5A-3 is adapted to the adaptation target speaker. That is, the number of parameters that need to be adapted is N × D_out.
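The two offset variants can be sketched with one function, under the assumption that the layer output is the average over n of W_n x_in plus the offset o_n, with a one-dimensional offset applied equally to every output element. The function name `offset_layer_output` and the example values are hypothetical, and no bias or activation function is modeled.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def offset_layer_output(offsets, matrices, x_in):
    """Average over n of (W_n x_in + o_n); assumed form of equations (4) and (5).

    offsets[n] may be a scalar (one-dimensional offset, N parameters in total)
    or a list with one offset per output dimension (N x D_out parameters).
    """
    n = len(matrices)
    total = None
    for o, W in zip(offsets, matrices):
        projected = matvec(W, x_in)
        if isinstance(o, (int, float)):
            o = [o] * len(projected)         # spread the scalar offset
        shifted = [p + oi for p, oi in zip(projected, o)]
        total = shifted if total is None else [t + s for t, s in zip(total, shifted)]
    return [t / n for t in total]

# Hypothetical data: two learning speakers, two-dimensional layer.
matrices = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
x_in = [1.0, 1.0]

scalar_out = offset_layer_output([1.0, -1.0], matrices, x_in)                # [2.0, 4.0]
vector_out = offset_layer_output([[1.0, 0.0], [0.0, 2.0]], matrices, x_in)   # [2.5, 5.0]
```

The weight matrices W_n stay fixed in both cases; only the offsets would be corrected during adaptation.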

The output x_out of the speaker adaptation layer 5A-3 obtained by equation (4) above is the average, over the N weight matrices, of the values computed by adding the one-dimensional offset o_n to the input x_in of the speaker adaptation layer 5A-3 weighted by the weight matrix W_n; however, the output is not limited to this.
For example, as in equation (3) above, the maximum value among the N calculated values may be used as the output x_out. Moreover, the adaptation unit 7A may add an offset o_n with the same number of dimensions as the output x_out of the speaker adaptation layer 5A-3 to the input x_in of the speaker adaptation layer 5A-3 weighted by the weight matrix W_n. The maximum value among the N × D_out values calculated in this way may be used as the output x_out of the speaker adaptation layer 5A-3.

As described above, in the speaker adaptation device 4A according to Embodiment 2, the adaptation unit 7A calculates a one-dimensional offset o_n, or an offset o_n with the same number of dimensions as the output x_out of the speaker adaptation layer 5A-3, so that the error calculated by the error calculation unit 6 is reduced.
By adapting the offset o_n in this way, the number of parameters that need to be adapted is N or N × D_out, as in Embodiment 1. Accordingly, speaker adaptation of the DNN 5A can be performed appropriately without using a large amount of adaptation data.
Further, since an auxiliary feature quantity such as an i-vector is unnecessary, the amount of calculation is reduced, and the accuracy of speaker adaptation is not affected by the accuracy of the auxiliary feature quantity.

Further, in the conventional technique, a large amount of adaptation data is necessary to perform speaker adaptation with high accuracy. In contrast, in the speaker adaptation device 4A, the accuracy of speaker adaptation can be maintained by, for example, using the average or the maximum over N as the output x_out. That is, robustness when little adaptation data is available can be improved.

Furthermore, the speech recognition device 1A according to Embodiment 2 includes the speaker adaptation device 4A, the DNN 5A, and a speech recognition unit 3b that performs speech recognition using the DNN 5A in which the speaker adaptation layer 5A-3 has been adapted to the adaptation target speaker by the speaker adaptation device 4A. With this configuration, it is possible to realize a speech recognition device 1A that obtains the above-described effects of the speaker adaptation device 4A.

Furthermore, in the speech recognition method according to Embodiment 2, the speaker adaptation device 4A adapts the DNN 5A to the adaptation target speaker, and the speech recognition unit 3b performs speech recognition using the DNN 5A in which the speaker adaptation layer 5A-3 has been adapted to the adaptation target speaker.
Thereby, it is possible to provide a speech recognition method capable of obtaining the above effects of the speaker adaptation device 4A.

Embodiment 3.
In addition to calculating the offset of the speaker adaptation layer output, the speaker adaptation apparatus according to Embodiment 3 calculates the weight of the weight matrix so that the error calculated by the error calculation unit is reduced.
Therefore, in the following description, FIG. 7 is referred to for the configuration of the speaker adaptation device according to the third embodiment.

FIG. 9 is a diagram showing a configuration example of the DNN 5B according to the third embodiment of the present invention.
Although omitted in FIG. 9, the intermediate layers 5-2 and 5-4 are provided between the input layer 5-1 and the speaker adaptation layer 5B-3 and between the speaker adaptation layer 5B-3 and the output layer 5-5, respectively.
In the DNN 5B shown in FIG. 9, the speaker adaptation layer 5B-3 has been adapted to the adaptation target speaker by the weight w_n of the weight matrix W_n and the offset o_n of the output x_out.
In the speaker adaptation layer 5B-3, one weight is set for each weight matrix as the weight w_n, as in the above equation (1), and a one-dimensional offset is set as the offset o_n, as in the above equation (4).

With x_in as the input and x_out as the output of the speaker adaptation layer 5B-3, the output x_out is represented, for example, by the average over N of the values calculated by adding the one-dimensional offset o_n to w_n W_n x_in. Alternatively, the maximum of the N calculated values may be used as the output x_out of the speaker adaptation layer 5B-3.

The weight w_n of the weight matrix W_n may be a weight set for each of the D_out dimensions of the output x_out of the speaker adaptation layer 5B-3, as in the above equation (2). Further, the offset o_n of the output x_out may be an offset with the same number of dimensions as the output x_out, as in the above equation (5). In this case, the output x_out of the speaker adaptation layer 5B-3 is represented, for example, by the average or the maximum of the values calculated by adding the offset o_n, which has the same number of dimensions as the output x_out, to w_n .* (W_n x_in).

Further, the output x_out of the speaker adaptation layer 5B-3 may be the average or the maximum of the values calculated by adding the offset o_n, which has the same number of dimensions as the output x_out, to w_n W_n x_in.
Further, the output x_out of the speaker adaptation layer 5B-3 may be the average or the maximum of the values calculated by adding the one-dimensional offset o_n to w_n .* (W_n x_in).
That is, the speaker adaptation layer 5B-3 in Embodiment 3 need only be adapted to the adaptation target speaker by parameters that combine the weight w_n of the weight matrix W_n with the offset o_n of the output of the speaker adaptation layer 5B-3.
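The four weight and offset combinations described for Embodiment 3 can be expressed with a single function. The following NumPy sketch rests on assumptions: the names `layer_output` and `num_adapted_params` are hypothetical, `.*` is read as element-wise multiplication, and "maximum" is read as an element-wise maximum over the N speakers.

```python
import numpy as np

def layer_output(W, x_in, w, o, reduce="mean"):
    """Speaker adaptation layer output combining weights w_n and offsets o_n (sketch).

    W : (N, D_out, D_in) fixed weight matrices W_n
    w : (N,) one weight per matrix, or (N, D_out) one weight per output dimension
    o : (N,) one-dimensional offsets, or (N, D_out) per-dimension offsets
    """
    W = np.asarray(W, dtype=float)
    n, d_out, _ = W.shape
    # Expand per-speaker scalars (N,) to (N, D_out) so all four combinations share one path.
    w = np.broadcast_to(np.asarray(w, dtype=float).reshape(n, -1), (n, d_out))
    o = np.broadcast_to(np.asarray(o, dtype=float).reshape(n, -1), (n, d_out))
    y = w * (W @ x_in) + o        # w_n .* (W_n x_in) + o_n for every speaker n
    return y.mean(axis=0) if reduce == "mean" else y.max(axis=0)

def num_adapted_params(w, o):
    """Number of values that must be adapted for a given choice of w and o."""
    return int(np.size(w) + np.size(o))
```

For example, scalar weights with scalar offsets give 2N adapted parameters, while per-dimension weights with scalar offsets give N × D_out + N.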

Next, the operation will be described.
FIG. 10 is a flowchart showing a part of the operation of the speaker adaptation device 4A according to Embodiment 3, and shows a part related to the adaptation process of the speaker adaptation layer 5B-3. Note that step ST5b and step ST6b shown in FIG. 10 are executed instead of step ST5a and step ST6a in the series of processing shown in FIG.
Hereinafter, description of processes other than step ST5b and step ST6b is omitted.

In step ST5b, the adaptation unit 7A receives the N weight matrices W_n from the storage unit 8 and calculates the offset o_n of the output x_out of the speaker adaptation layer 5B-3 weighted by the weight matrix W_n so that the error calculated by the error calculation unit 6 decreases.
Furthermore, the adaptation unit 7A calculates the weight w_n of the weight matrix W_n so that the error calculated by the error calculation unit 6 decreases.
In step ST6b, the adaptation unit 7A adapts the speaker adaptation layer 5B-3 to the adaptation target speaker on the basis of the offset o_n and the weight w_n calculated in step ST5b.
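Steps ST5b and ST6b can be illustrated with a small gradient-descent loop. This sketch makes several simplifying assumptions that are not in the text: the error is a squared error measured directly at the speaker adaptation layer output (the document measures it between the output layer and the teacher data), the weights and offsets are scalars per speaker, and the names `adapt`, `lr`, `threshold`, and `max_iter` are hypothetical.

```python
import numpy as np

def _error(w, o, h, target):
    """Squared error of the averaged layer output against the target."""
    diff = (w[:, None] * h + o[:, None]).mean(axis=0) - target
    return diff, 0.5 * float(diff @ diff)

def adapt(W, x_in, target, lr=0.01, threshold=1e-6, max_iter=5000):
    """Adapt scalar weights w_n and scalar offsets o_n while W_n stays fixed."""
    n = W.shape[0]
    w = np.ones(n)                # assumed initial weights
    o = np.zeros(n)               # assumed initial offsets
    h = W @ x_in                  # (N, D_out); constant because W_n is not updated
    for _ in range(max_iter):
        diff, err = _error(w, o, h, target)
        if err < threshold:       # stop once the error is below the threshold
            break
        w -= lr * (h @ diff) / n  # gradient of the error with respect to each w_n
        o -= lr * diff.sum() / n  # gradient of the error with respect to each o_n
    _, err = _error(w, o, h, target)
    return w, o, err
```

Only 2N values are updated here, which is why little adaptation data is needed compared with updating every entry of the weight matrices.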

As described above, in the speaker adaptation device 4A according to Embodiment 3, the adaptation unit 7A calculates, in addition to the offset o_n of the output x_out, the weight w_n of the weight matrix W_n so that the error calculated by the error calculation unit 6 decreases.
Even with this configuration, speaker adaptation of the DNN 5B can be performed appropriately without using a large amount of adaptation data.

Embodiment 4.
In the speaker adaptation devices according to Embodiments 1 to 3, as the number N of learning speakers increases, the number of parameters to be adapted increases accordingly. For this reason, when the number N of learning speakers increases excessively, the amount of calculation required for speaker adaptation also increases excessively.
Therefore, the speaker adaptation device according to Embodiment 4 clusters the N weight matrices W_n into M classes, where M is smaller than N, reducing them to M weight matrices W_m. Thereby, even if N increases excessively, the increase in the amount of calculation required for speaker adaptation can be suppressed. The subscript m is a positive integer from 1 to M.

FIG. 11 is a block diagram showing a configuration example of the speaker adaptation device 4B and DNN 5 according to Embodiment 4 of the present invention. The speaker adaptation device 4B includes an error calculation unit 6, an adaptation unit 7B, a storage unit 8, and a clustering unit 9. In FIG. 11, the same components as those in FIG.

The clustering unit 9 clusters the N weight matrices W_n stored in the storage unit 8 into classes 10-1 to 10-M to obtain M weight matrices W_m.
One possible clustering method is k-means clustering based on the distance between the weight matrices W_n.
Alternatively, the clustering unit 9 may vectorize each weight matrix W_n to obtain a matrix of D_in × D_out rows and N columns, and perform spectral clustering on the obtained matrix.
Hereinafter, the weight matrices clustered into classes 10-1 to 10-M are denoted W′_1, ..., W′_M.
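A minimal k-means sketch of this clustering step is shown below. The text does not state how the representative matrix of each class is formed, so taking the class centroid as W′_m is an assumption, as are the function name `cluster_weight_matrices`, the random initialization, and the fixed iteration count.

```python
import numpy as np

def cluster_weight_matrices(W, M, n_iter=50, seed=0):
    """Cluster N weight matrices into M classes with plain k-means (sketch).

    W : (N, D_out, D_in) weight matrices W_n
    Returns (W_clustered, labels): W_clustered has shape (M, D_out, D_in) and holds
    the class centroids (assumed representatives); labels[n] is the class of W_n.
    """
    N = W.shape[0]
    flat = W.reshape(N, -1)                               # vectorize each W_n
    rng = np.random.default_rng(seed)
    centers = flat[rng.choice(N, size=M, replace=False)].copy()
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # Euclidean distance from every matrix to every center: shape (N, M)
        dist = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for m in range(M):
            members = flat[labels == m]
            if len(members):                              # keep old center if a class is empty
                centers[m] = members.mean(axis=0)
    return centers.reshape((M,) + W.shape[1:]), labels
```

After clustering, the adaptation step works on M matrices instead of N, so the number of adapted weights drops from N to M (or from N × D_out to M × D_out).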

The adaptation unit 7B receives the clustered weight matrices W′_1, ..., W′_M and calculates the weight w_m of each weight matrix W′_m so that the error calculated by the error calculation unit 6 decreases. For example, the output x_out of the speaker adaptation layer 5-3 is calculated according to the following equation (6).
In the following equation (6), W′_m is a weight matrix clustered into classes 10-1 to 10-M, and w_m is the weight of the weight matrix W′_m.
In the following equation (6), one weight is defined for each weight matrix of classes 10-1 to 10-M.

The adaptation unit 7B corrects the value of the weight w_m so that the errors sequentially calculated by the error calculation unit 6 decrease. Next, the adaptation unit 7B sets the weight w_m obtained when the error falls below a predetermined threshold as the final weight for the weight matrix W′_m in the above equation (6). The adaptation unit 7B performs this process for each of the M weight matrices W′_m, whereby the speaker adaptation layer 5-3 is adapted to the adaptation target speaker. That is, the number of parameters that need to be adapted is M.

Note that the adaptation unit 7B may calculate the output x_out of the speaker adaptation layer 5-3 according to the following equation (7). The weight w_m of the weight matrix W′_m in the following equation (7) is a vector with the same number of dimensions D_out as the output x_out.

The adaptation unit 7B corrects the value of the weight w_m so that the errors sequentially calculated by the error calculation unit 6 decrease.
Next, the adaptation unit 7B sets the weight w_m obtained when the error falls below a predetermined threshold as the final weight for the weight matrix W′_m in the above equation (7). The adaptation unit 7B performs this process for each of the M weight matrices W′_m and for each of the D_out dimensions of the output x_out, whereby the speaker adaptation layer 5-3 is adapted to the adaptation target speaker. That is, the number of parameters that need to be adapted is M × D_out.
In the above equations (6) and (7), the output x_out is the average over M, but the maximum of the M values may instead be used as the output x_out.
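The equation images are not reproduced in this text. Based only on the surrounding description (one weight per clustered matrix for equation (6), a D_out-dimensional weight vector for equation (7), and an average over M), the two equations plausibly take the following form. This is a hedged reading, not the published equations; the symbol ⊙ for element-wise multiplication is an assumed notation for the `.*` used elsewhere in the description.

```latex
\begin{align}
x_{\mathrm{out}} &= \frac{1}{M} \sum_{m=1}^{M} w_m \, W'_m \, x_{\mathrm{in}},
  & w_m &\in \mathbb{R} \tag{6} \\
x_{\mathrm{out}} &= \frac{1}{M} \sum_{m=1}^{M} w_m \odot \left( W'_m \, x_{\mathrm{in}} \right),
  & w_m &\in \mathbb{R}^{D_{\mathrm{out}}} \tag{7}
\end{align}
```

Under this reading, equation (6) has M adaptable parameters and equation (7) has M × D_out, in agreement with the counts stated in the text.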

Further, the functions of the error calculation unit 6, the adaptation unit 7B, and the clustering unit 9 in the speaker adaptation device 4B are realized by a processing circuit. A part of the functions of the error calculation unit 6, the adaptation unit 7B, and the clustering unit 9 may be realized by dedicated hardware, and a part may be realized by software or firmware.
For example, the error calculation unit 6 realizes its function with the dedicated hardware processing circuit 100 shown in FIG. 3A, and the adaptation unit 7B and the clustering unit 9 realize their functions by the CPU 101 executing a stored program.
As described above, the processing circuit can realize the above-described functions by hardware, software, firmware, or a combination thereof.

Further, the case where the clustering unit 9 is provided in the configuration of the first embodiment has been described so far, but the clustering unit 9 may be provided in the configuration of the second or third embodiment.
Even if comprised in this way, the increase in the computational complexity required for speaker adaptation can be reduced.

For example, when the clustering unit 9 is provided in the configuration of Embodiment 2, the speaker adaptation layer 5A-3 is adapted to the adaptation target speaker by the offset o_m of the output x_out.
The adaptation unit 7A calculates the offset o_m in accordance with the equation obtained by replacing o_n and W_n in the above equation (4) or (5) with o_m and W′_m.

Furthermore, when the clustering unit 9 is provided in the configuration of Embodiment 3, the speaker adaptation layer 5B-3 is adapted to the adaptation target speaker by the weight w_m of the weight matrix W′_m and the offset o_m of the output x_out. The adaptation unit 7A calculates the offset o_m and the weight w_m with w_n and W_n replaced by w_m and W′_m.

As described above, the speaker adaptation device 4B according to Embodiment 4 includes the clustering unit 9. The clustering unit 9 clusters the weight matrices W_n into M classes, where M is smaller than the number N of learning speakers. The adaptation unit according to Embodiment 4 calculates at least one of the weight w_m and the offset o_m for each class obtained by the clustering unit 9. Thereby, even if N increases excessively, speaker adaptation of the DNN 5 can be performed appropriately.

Embodiment 5.
FIG. 12 is a block diagram showing a configuration example of the speaker adaptation device 4C and the DNN 5 according to Embodiment 5 of the present invention. The speaker adaptation device 4C includes an error calculation unit 6, adaptation units 7 and 11, a storage unit 8, and a switching unit 12. In FIG. 12, the same components as those in FIG.

The adaptation unit 11 embodies the third adaptation unit of the present invention and adapts the speaker adaptation layer 5-3 in the DNN 5 to the adaptation target speaker. Specifically, the adaptation unit 11 receives the N weight matrices W_n from the storage unit 8 and modifies the weight matrix W_n in the speaker adaptation layer 5-3 so that the error calculated by the error calculation unit 6 decreases.
Since the input x_in of the speaker adaptation layer 5-3 is weighted by the weight matrix W_n, the number of parameters that need to be adapted is D_in × D_out.

The switching unit 12 switches between adaptation of the speaker adaptation layer 5-3 by the adaptation unit 7 and adaptation of the speaker adaptation layer 5-3 by the adaptation unit 11 in accordance with a predetermined condition.
When the number N of learning speakers is large, speaker adaptation based on the N weight matrices W_n themselves is more effective than the adaptation process based on the weights w_n.

Therefore, the switching unit 12 switches from adaptation by the adaptation unit 7 to adaptation by the adaptation unit 11 when the learning speaker number N is equal to or greater than the threshold. Thereby, the effect of speaker adaptation can be improved.
In addition, the switching unit 12 may switch between the adaptation performed by the adaptation unit 7 and the adaptation performed by the adaptation unit 11 so that the error calculated by the error calculation unit 6 is smaller.
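The two switching conditions described for the switching unit 12 can be sketched as a small decision function. The text presents them as alternatives; giving the error comparison priority when both errors are available is an assumption of this example, as are the function name `choose_adaptation` and the string labels.

```python
def choose_adaptation(n_speakers, threshold, err_unit7=None, err_unit11=None):
    """Select which adaptation unit to use (sketch of the switching unit 12).

    Returns "unit7"  for adaptation of the weights w_n (adaptation unit 7), or
            "unit11" for direct modification of W_n (adaptation unit 11).
    """
    # If both errors have been measured, pick the unit with the smaller error.
    if err_unit7 is not None and err_unit11 is not None:
        return "unit7" if err_unit7 <= err_unit11 else "unit11"
    # Otherwise switch to adaptation unit 11 once N reaches the threshold.
    return "unit11" if n_speakers >= threshold else "unit7"
```

For instance, `choose_adaptation(100, 50)` selects adaptation unit 11, while `choose_adaptation(10, 50)` keeps adaptation unit 7.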

Moreover, each function of the error calculation unit 6, the adaptation unit 7, the adaptation unit 11, and the switching unit 12 in the speaker adaptation device 4C is realized by a processing circuit. A part of the functions of the error calculation unit 6, the adaptation unit 7, the adaptation unit 11, and the switching unit 12 may be realized by dedicated hardware, and a part may be realized by software or firmware.
For example, the error calculation unit 6 realizes its function by the dedicated hardware processing circuit 100 shown in FIG. 3A, and the adaptation units 7 and 11 and the switching unit 12 realize their functions by the CPU 101 executing a stored program.
As described above, the processing circuit can realize the above-described functions by hardware, software, firmware, or a combination thereof.

FIG. 12 shows the case where the adaptation unit 11 and the switching unit 12 are provided in the configuration of Embodiment 1. However, the adaptation unit 11 and the switching unit 12 may be provided in each of the configurations described in Embodiments 2 to 4.
That is, the switching unit 12 may switch between adaptation by the adaptation unit 7A or the adaptation unit 7B and adaptation by the adaptation unit 11 according to a predetermined condition.

As described above, the speaker adaptation device 4C according to the fifth embodiment includes the adaptation unit 11 and the switching unit 12. The adaptation unit 11 modifies the weight matrix W _n in the speaker adaptation layer 5-3 so that the error calculated by the error calculation unit 6 is reduced. The switching unit 12 switches between adaptation by the adaptation unit 7 and adaptation by the adaptation unit 11. With this configuration, the effect of speaker adaptation can be improved.

In the present invention, within the scope of the invention, a free combination of each embodiment, a modification of an arbitrary component of each embodiment, or an omission of any component in each embodiment is possible.

The speaker adaptation device according to the present invention can be widely applied to speech recognition technology using HMM.

1, 1A: speech recognition device; 2: feature extraction unit; 3a, 3b: speech recognition unit; 4, 4A to 4C: speaker adaptation device; 5, 5A, 5B: DNN; 5-1: input layer; 5-2, 5-4: intermediate layer; 5-3, 5A-3, 5B-3: speaker adaptation layer; 5-5: output layer; 6: error calculation unit; 7, 7A, 7B, 11: adaptation unit; 8: storage unit; 8-1 to 8-N: weight matrix data; 9: clustering unit; 10-1 to 10-M: class; 12: switching unit; 100: processing circuit; 101: CPU; 102: memory.

Claims

A speaker adaptation device comprising: an error calculation unit that, for a deep neural network having an input layer, an output layer, and one or more intermediate layers between the input layer and the output layer, with any one of the one or more intermediate layers serving as a speaker adaptation layer, calculates an error between output data of the output layer and teacher data; and
a first adaptation unit that receives weight matrices indicating connection weights between nodes of the deep neural network obtained from learning data of learning speakers, and calculates, in the speaker adaptation layer, a weight of each weight matrix for each of the learning speakers, or for each of the learning speakers and each dimension of the output of the speaker adaptation layer, so that the error calculated by the error calculation unit decreases.
A speaker adaptation device comprising: an error calculation unit that, for a deep neural network having an input layer, an output layer, and one or more intermediate layers between the input layer and the output layer, with any one of the one or more intermediate layers serving as a speaker adaptation layer, calculates an error between output data of the output layer and teacher data; and
a second adaptation unit that receives weight matrices indicating connection weights between nodes of the deep neural network obtained from learning data of learning speakers, and calculates, for each of the learning speakers, a one-dimensional offset of the output of the speaker adaptation layer weighted by the weight matrix, or an offset with the same number of dimensions as the output of the speaker adaptation layer, so that the error calculated by the error calculation unit decreases.
The speaker adaptation device according to claim 2, wherein, in addition to calculating the offset of the output of the speaker adaptation layer, the second adaptation unit calculates the weight of the weight matrix for each of the learning speakers so that the error calculated by the error calculation unit decreases.
The speaker adaptation device according to claim 1, further comprising a clustering unit that clusters the weight matrices into a number of classes smaller than the number of learning speakers,
wherein the first adaptation unit calculates the weight of the weight matrix for each class.
The speaker adaptation device according to claim 2, further comprising a clustering unit that clusters the weight matrices into a number of classes smaller than the number of learning speakers,
wherein the second adaptation unit calculates the offset for each class.
The speaker adaptation device according to claim 3, further comprising a clustering unit that clusters the weight matrices into a number of classes smaller than the number of learning speakers,
wherein the second adaptation unit calculates the offset of the output of the speaker adaptation layer and the weight of the weight matrix for each class.
A third adaptation unit that modifies the weight matrix in the speaker adaptation layer so that the error calculated by the error calculation unit is reduced;
The switching unit for switching between adaptation of the speaker adaptation layer by the first adaptation unit and adaptation of the speaker adaptation layer by the third adaptation unit. The speaker adaptation device described.
A third adaptation unit that modifies the weight matrix in the speaker adaptation layer so that the error calculated by the error calculation unit is reduced;
3. The switching unit for switching between adaptation of the speaker adaptation layer by the second adaptation unit and adaptation of the speaker adaptation layer by the third adaptation unit. The speaker adaptation device described.
A speech recognition device comprising: the speaker adaptation device according to claim 1;
the deep neural network; and
a speech recognition unit that performs speech recognition using the deep neural network in which the speaker adaptation layer has been adapted to an adaptation target speaker by the speaker adaptation device.
A speech recognition method comprising: adapting, by the speaker adaptation device according to claim 1, the deep neural network to an adaptation target speaker; and
performing, by a speech recognition unit, speech recognition using the deep neural network in which the speaker adaptation layer has been adapted to the adaptation target speaker.