JP2016156870A

JP2016156870A - Language identification model learning device, language identification device, language identification model learning method, language identification method, program, and recording medium

Info

Publication number: JP2016156870A
Application number: JP2015032887A
Authority: JP
Inventors: 亮増村; Akira Masumura; 浩和政瀧; Hirokazu Masataki
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-02-23
Filing date: 2015-02-23
Publication date: 2016-09-01
Anticipated expiration: 2035-02-23
Also published as: JP6389776B2

Abstract

PROBLEM TO BE SOLVED: To perform highly accurate language identification. SOLUTION: A learning data storage unit 1 stores a plurality of learning data items, each pairing voice data in one of a plurality of languages with a language label representing the language of that voice data. A conversion model learning unit 2 uses the learning data to learn a discrete symbol sequence conversion model that takes voice data as input and outputs a discrete symbol sequence obtained by discretizing the posterior probability distribution of the language labels. A discrete symbol sequence conversion unit 4 uses the discrete symbol sequence conversion model to convert the voice data of the learning data into discrete symbol sequences. A language identification model learning unit 5 uses the discrete symbol sequences and language labels of the learning data to learn a language identification model that takes the discrete symbol sequence of voice data as input and outputs a generation probability for each language label. SELECTED DRAWING: Figure 1

Description

The present invention relates to a language identification technique for identifying the language in which an input utterance is spoken.

As society becomes more global, speech recognition technology increasingly needs to accept input in multiple languages. This creates demand for more advanced language identification technology that identifies which language the input speech is in (for example, English, Japanese, or Chinese).

Language identification generally follows a framework in which the characteristics of each language (its "language-likeness") are captured statistically in advance, and identification is performed by computing which language an input utterance is closest to. Specifically, a large number of pairs of speech data and language labels (information such as "this speech is spoken in Japanese") are prepared, and a language classifier is built by capturing the language-likeness of each language.

One conventional language identification technique is the method described in Non-Patent Document 1, which realizes language identification using a statistical model called a deep neural network. Specifically, the neural network statistically captures language-likeness in units of frames a few milliseconds long. For example, when there are three possible language labels (Japanese, English, Chinese), the statistical model is trained on frames of a fixed length (for example, 25 milliseconds), and when an arbitrary frame is input, the network calculates probability values indicating which of the three languages the frame belongs to. Methods for training deep neural networks are well known.

When language identification is performed with the trained neural network, input speech whose language is unknown is divided into frames of the same length used in training, and a probability value for each language is calculated for every frame. The probability values are then averaged over all frames, and the speech is identified as the language with the highest average. For example, if the input speech is 3 seconds long and one frame is defined as 25 milliseconds, the input speech contains 120 frames. Inputting the first of the 120 frames into the trained deep neural network yields probability values such as "probability of English 0.5, probability of Japanese 0.3, probability of Chinese 0.2". The same processing is applied to the remaining 119 frames, and the average probability is then calculated for each language (English, Japanese, and Chinese). That is, the probability values for English (or Japanese, or Chinese) from the 1st to the 120th frame are summed and divided by the number of frames, 120. Suppose the result is an average probability of 0.7 for English, 0.1 for Japanese, and 0.2 for Chinese. In this case, the language classifier identifies the input speech as English, the language with the highest average probability.

The probability of each language is obtained by averaging over all frames because a single frame does not contain enough information to identify the language. However finely a neural network can model, it cannot determine the language completely from a few milliseconds of speech. Specifically, when probability values are calculated frame by frame for some English speech, it is quite possible that the first frame is judged to have a high probability of being English while the second frame is judged to have a low probability. The conventional method addresses this by averaging over the whole utterance, and achieves language identification with a certain degree of accuracy.

Non-Patent Document 1: Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, Joaquin Gonzalez-Rodriguez, “Frame by Frame Language Identification in Short Utterances using Deep Neural Networks”, Neural Networks Special Issue: Neural Network Learning in Big Data, 2014.

The framework of Non-Patent Document 1 compensates for the low discriminative power at the frame level by averaging the probability values of all frames. However, averaging over the whole utterance makes it impossible to capture how the values change across the sequence. For example, when some English speech is passed through the language classifier, three out of four frames may show a high probability of English while one out of four shows a high probability of Japanese. Even if such a pattern actually occurs, averaging over the whole utterance discards it, so knowledge such as "English speech tends to show this pattern" cannot be exploited.

A more concrete example shows how this problem leads to identification errors. Suppose that for some English speech the per-frame identification probabilities are "P(English) = 0.5, P(Japanese) = 0.4, P(Chinese) = 0.1" for frames 1 to 3 and "P(English) = 0.1, P(Japanese) = 0.8, P(Chinese) = 0.1" for frame 4, where P(*) denotes the probability that the frame is in language *. Averaged over the sequence, the probabilities become "P(English) = 0.4, P(Japanese) = 0.5, P(Chinese) = 0.1". Because Japanese has a higher probability than English, the speech is misidentified as Japanese even though it is actually English. Language-likeness often cannot be captured well at the frame level in this way, and averaging alone is not precise enough for language identification.
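The averaging scheme and its failure on this four-frame example can be reproduced with a short sketch. The function names `average_posteriors` and `identify_by_average` are illustrative assumptions and do not come from Non-Patent Document 1; only the probability values are taken from the example above.

```python
def average_posteriors(frames):
    """Average the per-frame language posteriors over one utterance."""
    languages = frames[0].keys()
    n = len(frames)
    return {lang: sum(f[lang] for f in frames) / n for lang in languages}


def identify_by_average(frames):
    """Conventional decision rule: pick the language with the highest average."""
    avg = average_posteriors(frames)
    return max(avg, key=avg.get)


# Frames 1 to 3 lean toward English; frame 4 leans strongly toward Japanese.
frames = [{"English": 0.5, "Japanese": 0.4, "Chinese": 0.1}] * 3 + [
    {"English": 0.1, "Japanese": 0.8, "Chinese": 0.1}
]

avg = average_posteriors(frames)
label = identify_by_average(frames)
print(avg, label)
```

The single strongly Japanese frame outweighs three mildly English frames, so the averaged decision is "Japanese" although three of the four frames favour English.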

In other words, the problem with the prior art described in Non-Patent Document 1 is a lack of precision in language identification, caused by representing language-likeness as an average value that carries no information about how language-likeness varies across the sequence.

In view of the above, an object of the present invention is to realize more accurate language identification by using sequence information on frame-level language-likeness.

To solve the above problem, a language identification model learning device according to a first aspect of the present invention includes: a learning data storage unit that stores a plurality of learning data items, each pairing speech data in one of a plurality of languages with a language label representing the language of that speech data; a conversion model learning unit that uses the learning data to learn a discrete symbol sequence conversion model that takes speech data as input and outputs a discrete symbol sequence obtained by discretizing the posterior probability distribution of the language labels; a discrete symbol sequence conversion unit that uses the discrete symbol sequence conversion model to convert the speech data of the learning data into discrete symbol sequences; and a language identification model learning unit that uses the discrete symbol sequences and language labels of the learning data to learn a language identification model that takes the discrete symbol sequence of speech data as input and outputs a generation probability for each language label.

A language identification device according to a second aspect of the present invention includes: a conversion model storage unit that stores the discrete symbol sequence conversion model generated by the language identification model learning device; a language identification model storage unit that stores the language identification model generated by the language identification model learning device; a discrete symbol sequence conversion unit that uses the discrete symbol sequence conversion model to convert input speech data into a discrete symbol sequence; and a language identification unit that uses the language identification model to obtain a generation probability for each language label from the discrete symbol sequence of the input speech data and outputs the language label that gives the maximum generation probability.

The language identification technique of the present invention models, for each language, the sequence information of frame-level language-likeness captured by a neural network, and uses the resulting language identification model to identify the language of input speech. The present invention can therefore realize more accurate language identification than identification based on the average of language-likeness.

FIG. 1 is a diagram illustrating the functional configuration of the language identification model learning device.
FIG. 2 is a diagram illustrating the functional configuration of the conversion model learning unit.
FIG. 3 is a diagram illustrating the functional configuration of the language identification model learning unit.
FIG. 4 is a diagram illustrating the processing flow of the language identification model learning method.
FIG. 5 is a diagram illustrating the functional configuration of the language identification device.
FIG. 6 is a diagram illustrating the processing flow of the language identification method.

Before the embodiments are described, the basic idea of the present invention is explained.

In the present invention, frame-level language-likeness is discretized. The language-likeness information for each frame is, for example, "P(English) = 0.5, P(Japanese) = 0.4, P(Chinese) = 0.1". Handling such continuous representations as a sequence is very complex, whereas a sequence of discretized information is easy to handle. The language-likeness information for each frame is therefore discretized, and the entire speech data is converted into a discrete symbol sequence. The language-likeness information for a frame can be regarded as a vector whose elements are the probability values of the languages. By partitioning this vector space according to some criterion and assigning a cluster number to each region, the language-likeness information can be converted into discrete symbols. For example, if the language-likeness of frames 1 to 3 (for example, "P(English) = 0.5, P(Japanese) = 0.4, P(Chinese) = 0.1") falls in cluster number 10 and the language-likeness of frame 4 (for example, "P(English) = 0.1, P(Japanese) = 0.8, P(Chinese) = 0.1") falls in cluster number 3, the discrete symbol sequence of this speech can be written as "10, 10, 10, 3".

The discrete symbol sequences are modeled separately for each language (in the above example, for English, Japanese, and Chinese) using existing techniques for modeling symbol sequences. A known technique called a language model can be used for this purpose. For example, an N-gram language model directly models the generation probability of sequences of N symbols.

At identification time, the input speech data is converted into a discrete symbol sequence, the generation probability under each language is obtained from the discrete symbol sequence using the learned models, and the speech is identified as the language that gives the maximum generation probability.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

The embodiment describes a language identification model learning device and method that learn a discrete symbol sequence conversion model and a language identification model from learning data, and a language identification device and method that identify the language of speech data using the learned discrete symbol sequence conversion model and language identification model.

<Language identification model learning>
As shown in FIG. 1, the language identification model learning device of the embodiment includes, for example, a learning data storage unit 1, a conversion model learning unit 2, a conversion model storage unit 3, a discrete symbol sequence conversion unit 4, a language identification model learning unit 5, and a language identification model storage unit 6. As shown in FIG. 2, the conversion model learning unit 2 includes, for example, a neural network learning unit 21 and a centroid generation unit 22. As shown in FIG. 3, the language identification model learning unit 5 includes, for example, a learning data dividing unit 51 and a model learning unit 52.

The language identification model learning device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The language identification model learning device executes each process, for example, under the control of the central processing unit. Data input to the language identification model learning device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed and used in other processes. At least part of each processing unit of the language identification model learning device may be implemented in hardware such as an integrated circuit.

Each storage unit of the language identification model learning device can be implemented, for example, as a main storage device such as RAM (Random Access Memory), as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or as middleware such as a relational database or a key-value store. The storage units of the language identification model learning device need only be logically separate, and may reside on a single physical storage device.

The learning data storage unit 1 stores K (≧ 2) learning data items. Each learning data item is a pair of speech data recording an utterance spoken in one of T (≧ 2) languages and a language label indicating the language in which that speech data is spoken.

The processing procedure of the language identification model learning method of the embodiment is described with reference to FIG. 4.

In steps S11 to S12, the conversion model learning unit 2 uses the K learning data items stored in the learning data storage unit 1 to learn a discrete symbol sequence conversion model that converts speech data into a discrete symbol sequence. The discrete symbol sequence conversion model is stored in the conversion model storage unit 3. A discrete symbol sequence is a sequence obtained by discretizing the posterior probability distributions that represent the language-likeness of each frame. The discrete symbol sequence conversion model consists of a neural network that obtains a posterior probability distribution from the speech data of each frame, and the centroids of the clusters obtained by clustering the posterior probability distributions. As in the prior art, the conversion model learning unit 2 learns a frame-level neural network from the learning data and uses it to convert each learning data item into a sequence of posterior probability distributions. Then, to discretize all frames of all data efficiently, it classifies the posterior probability distributions of the learning data into M (≧ 2) clusters by K-means clustering and learns the centroid of each cluster. The processing of the conversion model learning unit 2 is described in more detail below.

In step S11, the neural network learning unit 21 learns a neural network from the K learning data items stored in the learning data storage unit 1. Like the network used in the prior art, this neural network takes as input a feature of the speech data of a frame and outputs a posterior probability distribution indicating which language the utterance in that speech data is spoken in. The frame length is on the order of several milliseconds; for example, it is set to 25 milliseconds as in the prior art. Any feature can be used for the speech data of a frame, such as the well-known Mel-frequency cepstral coefficients (MFCC).

The size of the output layer of this neural network equals T, the number of distinct language labels in the learning data. For example, if the language labels attached to the K speech data items are of three kinds, "Japanese", "English", and "Chinese", then T = 3 and the output layer has size 3. As for the total number K of learning data items, three-language identification with 10,000 speech data items per language gives K = 30,000. The number and shape of the layers of the neural network are arbitrary. For example, a neural network with 8 hidden layers and 2,048 nodes per layer can be used. For the method of training the neural network, see, for example, Non-Patent Document 1 above.

In step S12, the centroid generation unit 22 uses the neural network learned by the neural network learning unit 21 to convert each frame of the K learning data items into a posterior probability distribution, obtaining K sequences of posterior probability distributions. In three-language identification as above, each posterior probability distribution is obtained as a three-dimensional vector. The centroid generation unit 22 then clusters the posterior probability distributions (vectors) in order to convert them into discrete symbols. The number of clusters M is specified manually; for example, M is set to 64 or 128. The well-known K-means clustering method can be used. K-means clustering yields a centroid vector for each cluster; like a posterior probability distribution, a centroid is expressed as a vector. Each centroid is assigned an index (identifier). For example, the first centroid is assigned the identifier "1" and the second centroid the identifier "2".
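Step S12 can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm for K-means, not the implementation of the embodiment: the function name `kmeans`, the iteration count, the random seed, the random-sample initialization, and the toy posterior vectors are all assumptions made for illustration.

```python
import math
import random


def kmeans(vectors, m, iterations=20, seed=0):
    """Plain Lloyd's algorithm: return m centroids for the given vectors."""
    rng = random.Random(seed)
    # Initialize centroids with m vectors sampled from the data (assumption).
    centroids = [list(v) for v in rng.sample(vectors, m)]
    for _ in range(iterations):
        # Assignment step: group each vector with its nearest centroid.
        clusters = [[] for _ in range(m)]
        for v in vectors:
            nearest = min(range(m), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy frame posteriors (English, Japanese, Chinese); real data would hold
# every frame of all K learning data items.
posteriors = [
    (0.5, 0.4, 0.1), (0.6, 0.3, 0.1),
    (0.1, 0.8, 0.1), (0.2, 0.7, 0.1),
]
centroids = kmeans(posteriors, m=2)

# Attach identifiers 1..M to the centroids, as described in step S12.
codebook = {i + 1: c for i, c in enumerate(centroids)}
print(codebook)
```

Because every centroid is a mean of probability vectors, its elements still sum to 1, so a centroid can itself be read as a typical frame-level posterior.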

In step S13, the discrete symbol sequence conversion unit 4 uses the discrete symbol sequence conversion model learned by the conversion model learning unit 2 to convert the speech data of the learning data into discrete symbol sequences. Specifically, it uses the neural network to convert each frame of the K learning data items into a posterior probability distribution (vector), measures for each frame the Euclidean distance between the posterior probability distribution and each of the M centroids, and converts the frame into the identifier of the nearest cluster. For example, if a speech data item has 5 frames, a posterior probability distribution is obtained for each of the 5 frames and discretized by the above method. If the 1st frame is converted to identifier "1", the 2nd and 3rd frames to identifier "5", the 4th frame to identifier "2", and the 5th frame to identifier "3", the result is the discrete symbol sequence "1, 5, 5, 2, 3". This is done for all K learning data items, converting all of the learning data into discrete symbol sequences.
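The nearest-centroid conversion of step S13 can be illustrated with the sketch below. The function name `to_symbol_sequence` and the two centroid vectors are hypothetical; the identifiers 10 and 3 echo the illustrative cluster numbers used earlier in the description, and the neural network that would produce the posteriors is omitted.

```python
import math


def to_symbol_sequence(posteriors, centroids):
    """Map each frame's posterior vector to the identifier of its nearest centroid.

    `centroids` maps an identifier (the discrete symbol) to a centroid vector;
    distance is Euclidean, as in step S13.
    """
    return [
        min(centroids, key=lambda sym: math.dist(frame, centroids[sym]))
        for frame in posteriors
    ]


# Hypothetical codebook with two of the M centroids.
centroids = {10: (0.5, 0.4, 0.1), 3: (0.1, 0.8, 0.1)}

# Per-frame posteriors (English, Japanese, Chinese) for a four-frame utterance.
posteriors = [
    (0.52, 0.38, 0.10),
    (0.48, 0.41, 0.11),
    (0.50, 0.40, 0.10),
    (0.12, 0.79, 0.09),
]

symbols = to_symbol_sequence(posteriors, centroids)
print(symbols)  # [10, 10, 10, 3]
```

Unlike the averaged posterior, the symbol sequence keeps the information that the Japanese-leaning frame came last, after three English-leaning frames.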

In steps S14 to S15, the language identification model learning unit 5 learns a language identification model using the discrete symbol sequences converted from the learning data by the discrete symbol sequence conversion unit 4 and the language labels of the learning data stored in the learning data storage unit 1. The language identification model is stored in the language identification model storage unit 6. The language identification model consists of T discrete symbol sequence models, one learned for each language label. The processing procedure of the language identification model learning unit 5 is described in more detail below.

In step S14, the learning data dividing unit 51 divides the discrete symbol sequences of the learning data by language label. Because each of the K learning data items has a language label, the items are divided into sets that share the same language label. This yields as many sets of learning data as there are kinds of language labels, T. For example, if the language labels are of three kinds, "Japanese", "English", and "Chinese", the learning data is divided into three sets.

In step S15, the model learning unit 52 uses the learning data divided by language label to learn a discrete symbol sequence model for each language label, and combines the discrete symbol sequence models of the languages into the language identification model. A technique called a language model can be used to model discrete symbol sequences. Any language model can be used; for example, the well-known N-gram model can be used. An N-gram model directly models the generation probability of sequences of N discrete symbols. Learning an N-gram model yields generation probabilities for symbol orderings, for example "after the discrete symbols '1, 2', the probability that '3' follows is 0.3 and the probability that '1' follows is 0.2". The learned N-gram model can then be used to obtain the generation probability of an arbitrary discrete symbol sequence. The model learning unit 52 combines the T discrete symbol sequence models learned for the languages into the language identification model and stores it in the language identification model storage unit 6.
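Steps S14 and S15 can be sketched together as below. The sketch makes several assumptions that the description does not specify: it uses a bigram model (N = 2) for brevity, whereas the "1, 2" followed by "3" example above corresponds to N = 3; it applies add-one smoothing so that unseen symbol pairs keep a nonzero probability; and the names `split_by_label`, `BigramModel`, and `train_language_id_model` as well as the toy sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def split_by_label(sequences, labels):
    """Step S14: group the discrete symbol sequences by language label."""
    groups = defaultdict(list)
    for seq, label in zip(sequences, labels):
        groups[label].append(seq)
    return dict(groups)


class BigramModel:
    """Bigram model over discrete symbols with add-one smoothing (assumption)."""

    START = "<s>"  # sentence-start marker used as context for the first symbol

    def __init__(self, sequences, num_symbols):
        self.num_symbols = num_symbols
        self.pair_counts = Counter()
        self.context_counts = Counter()
        for seq in sequences:
            padded = [self.START] + list(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def prob(self, prev, cur):
        """P(cur | prev) with add-one smoothing over the M possible symbols."""
        return (self.pair_counts[(prev, cur)] + 1) / (
            self.context_counts[prev] + self.num_symbols
        )

    def log_prob(self, seq):
        """Log generation probability of a whole symbol sequence."""
        padded = [self.START] + list(seq)
        return sum(math.log(self.prob(p, c)) for p, c in zip(padded, padded[1:]))


def train_language_id_model(sequences, labels, num_symbols):
    """Step S15: one sequence model per language label, collected in a dict."""
    groups = split_by_label(sequences, labels)
    return {label: BigramModel(seqs, num_symbols) for label, seqs in groups.items()}


# Toy data: symbols are cluster identifiers 1..3 (M = 3).
sequences = [[1, 1, 2], [1, 1, 1], [3, 3, 2], [3, 2, 3]]
labels = ["English", "English", "Japanese", "Japanese"]
model = train_language_id_model(sequences, labels, num_symbols=3)
print({lang: m.log_prob([1, 1, 1]) for lang, m in model.items()})
```

Working with log probabilities avoids numerical underflow, which matters because a real utterance contains on the order of a hundred frames or more.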

<Language identification>
As shown in FIG. 5, the language identification device of the embodiment includes, for example, a conversion model storage unit 3, a discrete symbol sequence conversion unit 4, a language identification model storage unit 6, and a language identification unit 7. The conversion model storage unit 3, the discrete symbol sequence conversion unit 4, and the language identification model storage unit 6 are the same as the corresponding components of the language identification model learning device. The conversion model storage unit 3 stores the discrete symbol sequence conversion model generated by the language identification model learning device. The language identification model storage unit 6 stores the language identification model generated by the language identification model learning device.

The language identification device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The language identification device executes each process under the control of the central processing unit, for example. Data input to the language identification device and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device is read out as needed and used for other processing. At least a part of each processing unit of the language identification device may be configured by hardware such as an integrated circuit.

The processing procedure of the language identification method of the embodiment is described with reference to FIG. 6.

In step S21, the discrete symbol sequence conversion unit 4 converts the input speech data into a discrete symbol sequence using the discrete symbol sequence conversion model stored in the conversion model storage unit 3. The processing of the discrete symbol sequence conversion unit 4 is the same as in step S13 described above.
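As the claims state, the discretization is based on the distance between each frame's posterior probability distribution and the cluster centroids. A minimal sketch of that assignment follows. It assumes that the frame-level posteriors have already been produced by the neural network, that the distance is Euclidean, and that each discrete symbol is simply the index of the nearest centroid; none of these details is fixed by the text, and the function names are hypothetical.

```python
import math


def nearest_centroid(posterior, centroids):
    """Return the index of the centroid closest to one frame's posterior distribution.

    Euclidean distance is an assumption; the text only says "distance".
    """
    return min(range(len(centroids)), key=lambda k: math.dist(posterior, centroids[k]))


def to_discrete_symbols(frame_posteriors, centroids):
    """Convert a sequence of frame-level posteriors into a discrete symbol sequence."""
    return [nearest_centroid(p, centroids) for p in frame_posteriors]


# Toy example with three language labels and three centroids.
centroids = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
frame_posteriors = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
symbols = to_discrete_symbols(frame_posteriors, centroids)
```

In this toy case each frame falls closest to a different centroid, so `symbols` is `[0, 1, 2]`.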

In step S22, the language identification unit 7 uses the language identification model stored in the language identification model storage unit 6 to obtain, from the discrete symbol sequence of the input speech data produced by the discrete symbol sequence conversion unit 4, a language label representing the language of that input speech data. Specifically, the discrete symbol sequence model of each language label constituting the language identification model is used to calculate, for each language label, a probability value expressing how likely the input is to be in that language. Because the discrete symbol sequence model of each language label is expressed as a language model, a generation probability can be obtained for an arbitrary symbol sequence, as described above. It is therefore possible to calculate the generation probability for each language label for input speech data that has been converted into a discrete symbol sequence. Language identification is realized by obtaining the generation probability for each language label and then outputting the language label that gives the maximum probability value. For example, suppose there are discrete symbol sequence models for "Japanese", "English", and "Chinese", and the generation probabilities are P(Japanese) = 0.05, P(English) = 0.02, and P(Chinese) = 0.0001. The language label giving the maximum is "Japanese", so "Japanese" is output as the language label.
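The decision rule of step S22 amounts to an argmax over language labels. The sketch below illustrates it with the example values from the text. The function name `identify_language` and the representation of each per-language model as a scoring function are assumptions made for illustration; here the scoring functions return fixed values instead of evaluating a real discrete symbol sequence model.

```python
def identify_language(symbols, scorers):
    """Return the language label with the maximum generation probability.

    scorers maps each language label to a function that returns the generation
    probability of the given discrete symbol sequence (hypothetical interface).
    """
    scores = {label: score(symbols) for label, score in scorers.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Stand-in scorers that reproduce the example values in the text.
example_scorers = {
    "Japanese": lambda symbols: 0.05,
    "English": lambda symbols: 0.02,
    "Chinese": lambda symbols: 0.0001,
}

label, scores = identify_language([1, 2, 3], example_scorers)
print(label)
```

Because the logarithm is monotonic, the same rule applies unchanged when the scoring functions return log probabilities, which is the usual choice for long sequences.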

With the configuration described above, the language identification technique of the present invention realizes language identification that explicitly exploits the sequence information of frame-level language likelihood captured by the neural network. As a result, language identification performance can be improved compared with the prior art, which performs language identification based on the average of the language likelihood.

The present invention is not limited to the embodiment described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the present invention. The various processes described in the above embodiment may be executed not only in time series in the order described, but also in parallel or individually, depending on the processing capability of the device that executes the processes or as necessary.

[Program, recording medium]
When the various processing functions of each device described in the above embodiment are realized by a computer, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to it, or, each time the program is transferred from the server computer to the computer, the computer may sequentially execute the process according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be realized by hardware.

1 Learning data storage unit
2 Conversion model learning unit
21 Neural network learning unit
22 Centroid generation unit
3 Conversion model storage unit
4 Discrete symbol sequence conversion unit
5 Language identification model learning unit
51 Learning data dividing unit
52 Model learning unit
6 Language identification model storage unit

Claims

A language identification model learning device including:
a learning data storage unit that stores a plurality of learning data items, each pairing speech data in one of a plurality of languages with a language label representing the language of that speech data;
a conversion model learning unit that uses the learning data to learn a discrete symbol sequence conversion model that takes speech data as input and outputs a discrete symbol sequence obtained by discretizing the posterior probability distribution of the language label;
a discrete symbol sequence conversion unit that converts the speech data of the learning data into the discrete symbol sequence using the discrete symbol sequence conversion model; and
a language identification model learning unit that uses the discrete symbol sequence and the language label of the learning data to learn a language identification model that takes a discrete symbol sequence of speech data as input and outputs a generation probability for each language label.

The language identification model learning device according to claim 1, wherein
the conversion model learning unit includes:
a neural network learning unit that uses the learning data to learn a neural network that takes frame-level feature amounts of speech data as input and outputs a posterior probability distribution of the language label; and
a centroid generation unit that clusters the posterior probability distributions generated from the learning data using the neural network and generates a centroid for each cluster,
the discrete symbol sequence conversion unit obtains the posterior probability distribution of the language label from the frame-level feature amounts of the learning data, and discretizes the posterior probability distribution based on the distance between the posterior probability distribution and the centroid, and
the language identification model learning unit includes:
a learning data dividing unit that divides the learning data by language label to generate learning data for each language label; and
a model learning unit that uses the discrete symbol sequences and language labels of the learning data for each language label to learn, for each language label, a discrete symbol sequence model that obtains the generation probability of the language label from a discrete symbol sequence of speech data, and aggregates the discrete symbol sequence models of the respective language labels to generate the language identification model.

A language identification device including:
a conversion model storage unit that stores a discrete symbol sequence conversion model generated by the language identification model learning device according to claim 1;
a language identification model storage unit that stores a language identification model generated by the language identification model learning device according to claim 1;
a discrete symbol sequence conversion unit that converts input speech data into a discrete symbol sequence using the discrete symbol sequence conversion model; and
a language identification unit that uses the language identification model to obtain a generation probability for each language label from the discrete symbol sequence of the input speech data, and outputs the language label that gives the maximum generation probability.

A language identification model learning method in which a learning data storage unit stores a plurality of learning data items, each pairing speech data in one of a plurality of languages with a language label representing the language of that speech data, the method including:
a conversion model learning step of learning, using the learning data, a discrete symbol sequence conversion model that takes speech data as input and outputs a discrete symbol sequence obtained by discretizing the posterior probability distribution of the language label;
a discrete symbol sequence conversion step in which a discrete symbol sequence conversion unit converts the speech data of the learning data into the discrete symbol sequence using the discrete symbol sequence conversion model; and
a language identification model learning step in which a language identification model learning unit uses the discrete symbol sequence and the language label of the learning data to learn a language identification model that takes a discrete symbol sequence of speech data as input and outputs a generation probability for each language label.

A language identification method in which a conversion model storage unit stores a discrete symbol sequence conversion model generated by the language identification model learning method according to claim 4, and a language identification model storage unit stores a language identification model generated by the language identification model learning method according to claim 4, the method including:
a discrete symbol sequence conversion step in which a discrete symbol sequence conversion unit converts input speech data into a discrete symbol sequence using the discrete symbol sequence conversion model; and
a language identification step of obtaining, using the language identification model, a generation probability for each language label from the discrete symbol sequence of the input speech data, and outputting the language label that gives the maximum generation probability.

A program for causing a computer to function as the language identification model learning device according to claim 1 or 2 or the language identification device according to claim 3.

A computer-readable recording medium on which a program for causing a computer to function as the language identification model learning device according to claim 1 or 2 or the language identification device according to claim 3 is recorded.