JP6615736B2

JP6615736B2 - Spoken language identification apparatus, method thereof, and program

Info

Publication number: JP6615736B2
Application number: JP2016231976A
Authority: JP
Inventors: 亮増村; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2019-12-04
Anticipated expiration: 2036-11-30
Also published as: JP2018087935A

Description

The present invention relates to a spoken language identification technique for identifying the language in which an input utterance is spoken.

Spoken language identification is a technique for identifying the language of input speech, for example, whether the input is English, Japanese, or Chinese. The common framework statistically models the characteristics of each language in advance and then determines which language an input utterance is closest to. Specifically, a large number of pairs of speech data and language labels (a label indicating the language of the speech data, for example, information meaning "this speech is spoken in Japanese") are prepared, and a language identification device is constructed by capturing the characteristics of each language within a machine learning framework.

As such a framework, we focus on the neural network based spoken language identification described in Non-Patent Document 1. Non-Patent Document 1 realizes the above framework using a statistical model called a neural network, in particular a deep neural network. Specifically, the characteristics of each language are captured statistically in units of frames of about several milliseconds, and a neural network is used as the statistical model for this purpose. For example, when there are three possible language labels (Japanese, English, and Chinese) and a frame-level model is trained for a certain frame width (for example, 25 milliseconds), the neural network provides a framework that, given an arbitrary frame, calculates the probability that the frame belongs to each of the three languages. Acoustic features such as MFCCs (mel-frequency cepstral coefficients) are used as the frame-level input to the neural network. The training method for deep neural networks and MFCCs are well-known techniques, so their description is omitted here.

When language identification is then performed with the trained neural network, input speech whose language is unknown is divided into frames of the width adopted by the neural network, and for each frame the acoustic features described above are input to the neural network to obtain a probability for each language. This probability is the estimated probability that the input acoustic features belong to that language. The logarithm of each probability is then taken, and the log probabilities are averaged over all frames. The language with the highest average log probability is identified as the language of the input speech.

For example, if the input speech is 3 seconds long and one frame is defined as 25 milliseconds, the input speech contains 120 frames. When the first of the 120 frames is input to the trained deep neural network, the probability of each language for that frame is output, for example, 0.5 for English, 0.3 for Japanese, and 0.2 for Chinese. The same processing is applied to each of the remaining 119 frames. The average log probability is then calculated for each language (English, Japanese, and Chinese). For English, for example, it is obtained by summing the English log probabilities from the 1st to the 120th frame and dividing by 120. Suppose that this processing yields an average log probability of -10 for English, -50 for Japanese, and -100 for Chinese. The spoken language identification device then identifies the input speech as English, the language with the largest average log probability.
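The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `identify_language` and the dictionary-per-frame data layout are hypothetical, and the per-frame probabilities are toy values standing in for the output of a trained network.

```python
import math

def identify_language(frame_probs):
    """Pick the language with the highest average log probability.

    frame_probs: list with one dict per frame, mapping language -> probability.
    (Hypothetical data layout chosen for this sketch.)
    """
    languages = list(frame_probs[0].keys())
    num_frames = len(frame_probs)
    # Average the log probabilities of each language over all frames.
    avg_log_prob = {
        lang: sum(math.log(p[lang]) for p in frame_probs) / num_frames
        for lang in languages
    }
    best = max(avg_log_prob, key=avg_log_prob.get)
    return best, avg_log_prob

# Toy input: 120 frames (3 s of speech, 25 ms per frame), same posterior in every frame.
frames = [{"English": 0.5, "Japanese": 0.3, "Chinese": 0.2}] * 120
label, scores = identify_language(frames)
print(label, scores)
```

With these toy values every frame favors English, so English has the largest average log probability. Real per-frame posteriors vary from frame to frame, which is why the scores are averaged over the whole utterance.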

(Non-Patent Document 1) Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, Joaquin Gonzalez-Rodriguez, "Frame by Frame Language Identification in Short Utterances using Deep Neural Networks", Neural Networks Special Issue: Neural Network Learning in Big Data, 2014.

However, conventional spoken language identification using a neural network cannot precisely capture the phonological information contained in speech. Conventional methods model the posterior probability of each language directly from acoustic features such as MFCCs, and this framework contains no structure for capturing phonemes precisely. In other words, although acoustic features such as MFCCs contain phoneme-related information, the neural network described above is a model for language identification and cannot be said to have a structure that captures phonemes precisely. Yet the sequence of phonemes is considered important in spoken language identification.

An object of the present invention is therefore to provide a spoken language identification apparatus, a method thereof, and a program that can robustly capture the phonological information contained in speech and use that information to perform spoken language identification.

To solve the above problem, according to one aspect of the present invention, a spoken language identification apparatus includes, where $s=1,2,\ldots,S$ and $S$ is an integer of 1 or more: a bottleneck feature calculation unit that calculates a bottleneck feature sequence from a frame-level acoustic feature sequence $X$ obtained from target speech data, using a neural network acoustic model that takes frame-level acoustic features obtained from speech data as input and outputs, frame by frame, phonological information of a predetermined language $s$ for that speech data; a parallel bottleneck feature construction unit that constructs a parallel bottleneck feature sequence including the $S$ bottleneck feature sequences and the acoustic feature sequence $X$; and a language identification unit that identifies the language of the target speech data by using the parallel bottleneck feature sequence as the input to a neural network for spoken language identification. The bottleneck feature is the output value of a bottleneck layer, which is either an intermediate layer or the output layer of the neural network acoustic model.

To solve the above problem, according to another aspect of the present invention, a spoken language identification method includes, where $s=1,2,\ldots,S$ and $S$ is an integer of 1 or more: a bottleneck feature calculation step in which a bottleneck feature calculation unit calculates a bottleneck feature sequence from a frame-level acoustic feature sequence $X$ obtained from target speech data, using a neural network acoustic model that takes frame-level acoustic features obtained from speech data as input and outputs, frame by frame, phonological information of a predetermined language $s$ for that speech data; a parallel bottleneck feature construction step in which a parallel bottleneck feature construction unit constructs a parallel bottleneck feature sequence including the $S$ bottleneck feature sequences and the acoustic feature sequence $X$; and a language identification step in which a language identification unit identifies the language of the target speech data by using the parallel bottleneck feature sequence as the input to a neural network for spoken language identification. The bottleneck feature is the output value of a bottleneck layer, which is either an intermediate layer or the output layer of the neural network acoustic model.

According to the present invention, spoken language identification can be performed using phonological information that is captured more robustly than before.

FIG. 1 is a functional block diagram of the spoken language identification apparatus according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the spoken language identification apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of a neural network acoustic model for speech recognition in which the number of units is reduced.
FIG. 4 is a diagram showing an example of a neural network acoustic model for speech recognition from which the intermediate layers and output layer after the bottleneck layer have been deleted.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise specified.

<Points of the first embodiment>
In the present embodiment, a component for capturing phonological information is placed in front of the language identification unit, and neural network based spoken language identification is performed using the captured phonological information as well.

As a specific implementation, multiple neural network acoustic models of the kind used in speech recognition systems (hereinafter also simply called "NN acoustic models") are used in order to capture phonological information robustly. The multiple NN acoustic models correspond to multiple languages, for example, an NN acoustic model for Japanese speech recognition, one for English speech recognition, and one for Chinese speech recognition. Each NN acoustic model identifies, frame by frame, the phonemes of its target language. In the present embodiment, the intermediate-layer information of the neural network that can be extracted frame by frame from an NN acoustic model (the bottleneck features) is extracted with each of the multiple NN acoustic models. These are combined into "parallel bottleneck features", which are used in addition to the usual input of the neural network for spoken language identification, and spoken language identification is performed on that basis.

<First embodiment>
FIG. 1 is a functional block diagram of the spoken language identification apparatus 100 according to the first embodiment, and FIG. 2 shows its processing flow.

The spoken language identification apparatus 100 takes as input a frame-level acoustic feature sequence $X=\{x_1,x_2,\ldots,x_T\}$ obtained from the target input speech, $S$ NN acoustic models for speech recognition $F^s$, and a neural network for spoken language identification $F$. It identifies the language of the input speech and outputs the identification result as a language label $L$. Here, $T$ is the length of the time series (the total number of frames), $t=1,2,\ldots,T$, and $x_t$ is the acoustic feature vector of the $t$-th frame. $S$ is the total number of languages covered by the NN acoustic models for speech recognition, and $s$ is an index representing the language, with $s=1,2,\ldots,S$. For example, if the three languages French, German, and Portuguese are the languages to be identified, then $S=3$.

For example, the spoken language identification apparatus 100 is a computer including a CPU, a RAM, and a ROM storing a program for executing the spoken language identification processing described below, and is functionally configured as follows. The spoken language identification apparatus 100 includes an acoustic model deletion unit 110, a bottleneck feature calculation unit 120, a parallel bottleneck feature construction unit 130, and a language identification unit 140. The processing of each unit is described below.

<Acoustic model deletion unit 110>
The acoustic model deletion unit 110 takes the $S$ NN acoustic models for speech recognition $F^s$ as input, deletes the intermediate layers and the output layer that follow a predetermined bottleneck layer $k_s$, and outputs the $S$ resulting NN acoustic models for speech recognition $F^{s,k_s}$. Note that the k_s in the superscript denotes $k_s$.

Let $N_s$ be the number of intermediate layers of the NN acoustic model for speech recognition $F^s$, where $N_s$ is an integer of 1 or more. The bottleneck layer $k_s$ ($N_s \geq k_s \geq 1$) is chosen manually. For example, for a model with $N_s=5$, $k_s$ is any integer from 1 to 5, such as $k_s=4$. The bottleneck layer $k_s$ of each model $F^s$ may be determined experimentally.

The NN acoustic model for speech recognition $F^s$ may have any structure as long as a bottleneck layer $k_s$ can be set. For example, the intermediate layer that is to be used as the bottleneck layer $k_s$ may be given fewer units than the other intermediate layers (see FIG. 3). Reducing the number of units controls the dimensionality of the bottleneck features. For example, in a model $F^s$ with five intermediate layers, the fourth layer is set as the bottleneck layer, and the model is prepared with 64 units in the fourth layer only and 512 units in each of the other layers. Any existing technique may be used to train the NN acoustic model for speech recognition, for example, the training method of Reference 1.
(Reference 1) Geoffrey Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", IEEE Signal Processing Magazine, Volume 29, Issue 6, Pages 82-97, Nov. 2012.

FIG. 3 shows an example of an NN acoustic model for speech recognition $F^s$ with a reduced number of units, and FIG. 4 shows an example of an NN acoustic model for speech recognition $F^{s,k_s}$ from which the intermediate layers and output layer after the bottleneck layer $k_s$ have been deleted.
Note that the intermediate layers and the output layer cannot be deleted while the model $F^s$ is being trained. They are therefore deleted after training is complete, and the resulting model is used at spoken language identification time.
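The deletion step can be illustrated with a small Python sketch. It is a toy under loudly stated assumptions, not the apparatus itself: the weights are random placeholders standing in for a trained acoustic model, the 39-dimensional input and 100 output classes are invented for the example, a sigmoid is applied after every layer for simplicity, and the helper names (`make_layer`, `forward`, `truncate_at_bottleneck`) are hypothetical. Only the layer sizes 512 and 64 and the choice of the fourth layer as bottleneck come from the text.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer with placeholder weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layers, x):
    """Propagate one frame x through the given layers, with a sigmoid after each layer."""
    for weights, biases in layers:
        x = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x

def truncate_at_bottleneck(layers, k):
    """Keep hidden layers 1..k; drop every later hidden layer and the output layer."""
    return layers[:k]

rng = random.Random(0)
# Assumed sizes: 39-dim input, five hidden layers (512, 512, 512, 64, 512), 100 outputs.
sizes = [39, 512, 512, 512, 64, 512, 100]
full_model = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

bottleneck_model = truncate_at_bottleneck(full_model, k=4)   # k_s = 4
x_t = [rng.uniform(-1.0, 1.0) for _ in range(39)]            # one frame of acoustic features
v_t = forward(bottleneck_model, x_t)                         # bottleneck feature of this frame
print(len(bottleneck_model), len(v_t))
```

Because the layers after the bottleneck are simply dropped, the truncated model produces exactly the activation that the fourth hidden layer of the full model would produce, while skipping the computation of the later layers.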

<Bottleneck feature calculation unit 120>
The bottleneck feature calculation unit 120 takes as input the frame-level acoustic feature sequence $X=\{x_1,x_2,\ldots,x_T\}$ obtained from the target speech data, uses the NN acoustic model for speech recognition $F^{s,k_s}$ to calculate the bottleneck feature sequence $V^s=\{v_1^s,v_2^s,\ldots,v_T^s\}$ from $X$ (S120), and outputs it. The bottleneck feature $v_t^s$ is the output value of the bottleneck layer $k_s$ described above (the output value of the $k_s$-th intermediate layer of $F^s$) and is a numerical vector that explicitly represents phonological information.

The bottleneck feature calculation unit 120 performs this processing for each NN acoustic model for speech recognition $F^{s,k_s}$. That is, when the $S$ models $F^{1,k_1},F^{2,k_2},\ldots,F^{S,k_S}$ are used, $S$ bottleneck feature sequences $V^1,V^2,\ldots,V^S$ are obtained.

The bottleneck feature sequence $V^s$ is the output of the bottleneck layer when the acoustic feature sequence $X$ is input to the NN acoustic model for speech recognition $F^{s,k_s}$:
$$v_t = f(x_t)$$
Here, $f(\cdot)$ denotes the computation up to the bottleneck layer of $F^{s,k_s}$, and $v_t$ denotes the bottleneck feature vector of the $t$-th frame (that is, $v_t^s$ with the language index $s$ omitted). The bottleneck features are thus calculated frame by frame, so an acoustic feature sequence $X$ of length $T$ yields a bottleneck feature sequence $V^s$ of length $T$.

<Parallel bottleneck feature construction unit 130>
The parallel bottleneck feature construction unit 130 takes the $S$ bottleneck feature sequences $V^1,V^2,\ldots,V^S$ and the acoustic feature sequence $X$ as input, constructs a parallel bottleneck feature sequence $P$ containing this information (S130), and outputs it.

For example, the parallel bottleneck feature construction unit 130 concatenates the $S$ bottleneck feature sequences $V^1,V^2,\ldots,V^S$ and the acoustic feature sequence $X$ frame by frame to create a new vector sequence. Let $v_t^1,v_t^2,\ldots,v_t^S$ be the $S$ bottleneck features of the $t$-th frame. The parallel bottleneck feature $p_t$ of the $t$-th frame is constructed as
$$p_t=\left[(v_t^1)^\top,(v_t^2)^\top,\ldots,(v_t^S)^\top,x_t^\top\right]^\top$$
where the superscript $\top$ denotes transposition. In other words, $p_t$ is the vector formed by arranging the elements of the bottleneck features $v_t^1,v_t^2,\ldots,v_t^S$ together with the original acoustic feature $x_t$. The final parallel bottleneck feature sequence is $P=\{p_1,p_2,\ldots,p_T\}$.
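The frame-by-frame concatenation can be sketched as follows. This is a minimal illustration: the function name `build_parallel_bottleneck`, the list-of-lists representation, and the toy dimensions (S = 2, T = 3) are assumptions made for the example and do not come from the source.

```python
def build_parallel_bottleneck(bottleneck_seqs, acoustic_seq):
    """Build P = [p_1, ..., p_T] with p_t = v_t^1 ++ v_t^2 ++ ... ++ v_t^S ++ x_t.

    bottleneck_seqs: list of S sequences V^s, each a list of T feature vectors.
    acoustic_seq:    the acoustic feature sequence X, a list of T feature vectors.
    """
    num_frames = len(acoustic_seq)
    if any(len(v) != num_frames for v in bottleneck_seqs):
        raise ValueError("all sequences must contain the same number of frames")
    parallel = []
    for t in range(num_frames):
        p_t = []
        for v in bottleneck_seqs:      # bottleneck features first, in order s = 1..S
            p_t.extend(v[t])
        p_t.extend(acoustic_seq[t])    # original acoustic feature x_t last
        parallel.append(p_t)
    return parallel

# Toy data: S = 2 acoustic models, T = 3 frames.
V1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
V2 = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5], [1.6, 1.7, 1.8]]
X = [[9.0, 9.1], [9.2, 9.3], [9.4, 9.5]]
P = build_parallel_bottleneck([V1, V2], X)
print(P[0])
```

The dimension of each $p_t$ is the sum of the bottleneck dimensions plus the acoustic feature dimension, so the number of units in each bottleneck layer directly determines the input size of the language identification network.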

<Language identification unit 140>
The language identification unit 140 takes the parallel bottleneck feature sequence $P$ and the neural network for spoken language identification $F$ as input, uses them to identify the language of the target speech data (S140), and outputs the language label $L$ as the identification result.

This framework is realized by feeding the parallel bottleneck feature sequence to the neural network based spoken language identification in place of the acoustic feature sequence used conventionally. That is, in the present embodiment, the parallel bottleneck feature sequence $P$ is used as the input to the neural network for spoken language identification $F$ to identify the language of the target speech data. Each frame $t$ of $P=\{p_1,\ldots,p_T\}$ is input to $F$ to obtain a probability for each language. This probability is the estimated probability that the acoustic feature $x_t$ belongs to that language. The logarithm of each probability is then taken, and the log probabilities are averaged over all frames. The language with the highest average log probability is identified as the language of the input speech.
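As a rough sketch of this step, the code below feeds each parallel bottleneck feature to a stand-in classifier and averages the log probabilities. The stand-in for the network $F$ is a single linear layer followed by a softmax with hand-picked weights; this is purely an assumption to keep the example self-contained, since the source does not specify the architecture of $F$. The names `classify_frame` and `identify` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_frame(weights, p_t):
    """Stand-in for network F: one linear layer followed by softmax (an assumption)."""
    return softmax([sum(w * v for w, v in zip(row, p_t)) for row in weights])

def identify(weights, labels, parallel_seq):
    """Average per-frame log probabilities and return the best label and all averages."""
    totals = [0.0] * len(labels)
    for p_t in parallel_seq:
        for i, prob in enumerate(classify_frame(weights, p_t)):
            totals[i] += math.log(prob)
    averages = [total / len(parallel_seq) for total in totals]
    return labels[averages.index(max(averages))], averages

labels = ["Japanese", "English", "Chinese"]
# Hand-picked toy weights: one row per language, one column per element of p_t.
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
P = [[0.1, 2.0, 0.3, 0.0],
     [0.2, 1.5, 0.1, 0.5]]
label, averages = identify(weights, labels, P)
print(label, averages)
```

The only difference from the conventional procedure is the input: each frame is represented by $p_t$ instead of $x_t$, and the averaging and selection of the best language are unchanged.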

The neural network for spoken language identification $F$ can be trained by calculating a parallel bottleneck feature sequence for each item of training data as well and using pairs of a parallel bottleneck feature sequence and a language label as the supervised training data.

<Effect>
With the above configuration, spoken language identification can be performed using phonological information that is captured more robustly than before. As noted above, the sequence of phonemes is considered important in spoken language identification, so using more robustly captured phonological information can greatly improve identification performance, and higher accuracy than before can be expected.

<Modifications>
The languages of the acoustic models used in the bottleneck feature calculation unit 120 are entirely independent of the languages targeted by the final spoken language identification, and acoustic models of languages not handled in the identification may be used. The point is that it suffices to be able to calculate bottleneck features (numerical vectors that explicitly represent phonological information) from the acoustic feature sequence $X$ of speech data uttered in a language A. For example, an NN acoustic model of another language B whose phonemes are similar to those of language A may be used in place of an NN acoustic model of language A, or an NN acoustic model of another language B that contains most of the phonemes of language A may be used in its place. For example, if language B has more phoneme types than Japanese and includes all Japanese phoneme types, the NN acoustic model of language B may be used to calculate the bottleneck feature sequence $V$ from the acoustic feature sequence $X$ obtained from Japanese speech data, and language B need not be among the languages targeted by the final spoken language identification.

The NN acoustic model does not have to be one built for speech recognition. Any acoustic model may be used as long as it takes frame-level acoustic features obtained from speech data as input and outputs, frame by frame, phonological information of a predetermined language $s$ for that speech data. Incorporating the spoken language identification apparatus of the present embodiment into a speech recognition apparatus has the advantage that the acoustic model for speech recognition can be shared. A further advantage is that the speech recognition apparatus can determine the language with higher accuracy than before, which in turn can improve speech recognition accuracy.

In the present embodiment, the acoustic model deletion unit 110 deletes the intermediate layers and output layer after the bottleneck layer, but the NN acoustic model for speech recognition may also be used as it is, without deletion. In that case, the acoustic model deletion unit 110 need not be provided. When the model is used as it is, the bottleneck feature calculation unit 120 is informed of the bottleneck layer $k_s$ for each NN acoustic model for speech recognition $F^s$ and outputs the output value of the bottleneck layer $k_s$ of $F^s$ as the bottleneck feature.

In the present embodiment, the bottleneck layer is one of the intermediate layers of the neural network, but it does not have to be an intermediate layer, and the output layer may be used as the bottleneck layer.

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described. They may be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each apparatus described in the above embodiment and modifications may be realized by a computer. In that case, the processing of the functions that each apparatus should have is described by a program. Executing this program on a computer realizes the various processing functions of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer over a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the program it has read. As another way of executing the program, the computer may read the program directly from the portable recording medium and execute the process according to it. The computer may also execute the process according to the received program sequentially, each time a program is transferred to it from the server computer. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of this processing content may be implemented in hardware.

Claims

A spoken language identification device comprising:
a bottleneck feature amount calculation unit that, for s = 1, 2, ..., S where S is an integer of 1 or more, uses a neural network acoustic model, which takes frame-level acoustic feature amounts obtained from speech data as input and outputs phonological information of a predetermined language s for that speech data frame by frame, to calculate a bottleneck feature amount sequence from a frame-level acoustic feature amount sequence X obtained from target speech data;
a parallel bottleneck feature amount construction unit that constructs a parallel bottleneck feature amount sequence comprising the S bottleneck feature amount sequences and the acoustic feature amount sequence X; and
a language identification unit that takes the parallel bottleneck feature amount sequence as input to a neural network for spoken language identification and identifies which language the target speech data is spoken in,
wherein the bottleneck feature amount is an output value of a bottleneck layer that is either an intermediate layer or the output layer of the neural network acoustic model.

The spoken language identification device of claim 1,
wherein the neural network for spoken language identification is trained using a frame-level parallel bottleneck feature amount sequence obtained from training speech data and a language label indicating the language of the training speech data.

A spoken language identification method comprising:
a bottleneck feature amount calculation step in which, for s = 1, 2, ..., S where S is an integer of 1 or more, a bottleneck feature amount calculation unit uses a neural network acoustic model, which takes frame-level acoustic feature amounts obtained from speech data as input and outputs phonological information of a predetermined language s for that speech data frame by frame, to calculate a bottleneck feature amount sequence from a frame-level acoustic feature amount sequence X obtained from target speech data;
a parallel bottleneck feature amount construction step in which a parallel bottleneck feature amount construction unit constructs a parallel bottleneck feature amount sequence comprising the S bottleneck feature amount sequences and the acoustic feature amount sequence X; and
a language identification step in which a language identification unit takes the parallel bottleneck feature amount sequence as input to a neural network for spoken language identification and identifies which language the target speech data is spoken in,
wherein the bottleneck feature amount is an output value of a bottleneck layer that is either an intermediate layer or the output layer of the neural network acoustic model.

The spoken language identification method of claim 3,
wherein the neural network for spoken language identification is trained using a frame-level parallel bottleneck feature amount sequence obtained from training speech data and a language label indicating the language of the training speech data.

A program for causing a computer to function as the spoken language identification device of claim 1 or 2.