JP4769223B2

JP4769223B2 - Text phonetic symbol conversion dictionary creation device, recognition vocabulary dictionary creation device, and speech recognition device

Info

Publication number: JP4769223B2
Application number: JP2007116607A
Authority: JP
Inventors: 浩範吉田; 敏幸宮崎
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2007-04-26
Filing date: 2007-04-26
Publication date: 2011-09-07
Anticipated expiration: 2027-04-26
Also published as: JP2008275731A

Abstract

PROBLEM TO BE SOLVED: To provide a text phonetic symbol conversion dictionary creation device that creates a text phonetic symbol conversion dictionary for accurately generating phonetic symbol strings that are likely to be pronounced when a word is pronounced.

SOLUTION: The text phonetic symbol conversion dictionary creation device 100 acquires, from learning data, words, segment division information for dividing the words into segments, and the phonetic symbols of each segment; calculates the occurrence probability of each phoneme segment pair in the learning data and the connection probability of each connected phoneme segment pair sequence; and creates a text phonetic symbol conversion dictionary that includes these calculated probabilities.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a text phonetic symbol conversion dictionary creation device, a text phonetic symbol conversion dictionary creation program, and a text phonetic symbol conversion dictionary creation method; to a recognition vocabulary dictionary creation device, a recognition vocabulary dictionary creation program, and a recognition vocabulary dictionary creation method that use the created text phonetic symbol conversion dictionary; and to a speech recognition device, a speech recognition program, and a speech recognition method.

A speech synthesizer that converts arbitrary input words or sentences (text) into speech performs spelling-to-phonetic-symbol conversion, which converts the input text into a phonetic symbol string. The first conventional spelling-to-phonetic-symbol conversion method is based on a word dictionary. For example, each character string, such as a word, can be associated with a phonetic symbol string and stored in a word dictionary. FIG. 10 shows an example of a word dictionary. As illustrated, searching the word dictionary for the word “abaca” identifies the corresponding phonetic symbol string /abaka/. Hereinafter in this specification, characters enclosed in slashes (/) denote phonetic symbols.
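The word-dictionary method can be sketched as a plain table lookup. This is a minimal illustration, not the layout of FIG. 10: only the “abaca” → /abaka/ entry comes from the text, and the names `WORD_DICTIONARY` and `lookup_pronunciation` are hypothetical.

```python
from typing import Optional

# Hypothetical word dictionary: spelling -> phonetic symbol string.
# Only the "abaca" entry is taken from the text; the structure is an assumption.
WORD_DICTIONARY = {"abaca": "abaka"}


def lookup_pronunciation(word: str) -> Optional[str]:
    """Return the phonetic symbol string for a word, or None if it is not registered."""
    return WORD_DICTIONARY.get(word)


print(lookup_pronunciation("abaca"))    # registered word
print(lookup_pronunciation("abacule"))  # unregistered word: no pronunciation is available
```

The second call shows the limitation of this approach: a word that is not registered yields no phonetic symbol string at all.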

The second conventional spelling-to-phonetic-symbol conversion method is rule-based. For example, rules on the arrangement of characters are used. As an example of such rules, shown in FIG. 11, an “a” at the beginning of a character string is pronounced /a/, whereas the “a” in the character sequence “abe” is pronounced /e/. Furthermore, the “a” in the character sequence “aben” is pronounced /o/.
As prior art for these spelling-to-phonetic-symbol conversion methods, Non-Patent Document 1 describes a technique called Grapheme-to-Phoneme (G2P) conversion, which converts written characters into the smallest units of speech (phonemes).
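The three example rules for “a” can be sketched as context rules in which the most specific matching context wins. FIG. 11 is not reproduced here, so the rule representation, the longest-match policy, and the names `A_RULES` and `pronounce_leading_a` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# (context that the character string starts with, phoneme assigned to the leading "a").
# The three rules come from the text; treating them as prefix contexts is an assumption.
A_RULES: List[Tuple[str, str]] = [("a", "a"), ("abe", "e"), ("aben", "o")]


def pronounce_leading_a(text: str) -> Optional[str]:
    """Return the phoneme for a leading "a", preferring the longest matching context."""
    for context, phoneme in sorted(A_RULES, key=lambda rule: len(rule[0]), reverse=True):
        if text.startswith(context):
            return phoneme
    return None  # no rule applies (the string does not start with "a")


for sample in ("abaca", "abel", "abend"):
    print(sample, pronounce_leading_a(sample))
```

Each additional context adds another rule to store and test, which hints at why accurate rule sets grow in both size and processing cost.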

An Introduction to Text-to-Speech Synthesis (Kluwer Academic Publishers; by Thierry Dutoit)

Spelling-to-phonetic-symbol conversion may be used not only in speech synthesizers but also in speech recognition devices. One example of such a speech recognition device is a mobile phone that lets the user select, by speech recognition, a contact name that the user has registered in the phone book data. To make contact names part of the recognition target vocabulary, a phonetic symbol string must be generated from the spelling of each contact name.
With the word-dictionary-based method described above, no phonetic symbol string can be obtained when a word matching the recognition target vocabulary is not registered in the word dictionary. Moreover, obtaining phonetic symbol strings for many words requires registering more words, which makes the word dictionary larger.

The rule-based method described above, on the other hand, yields some phonetic symbol string for any input word, but raising the accuracy of the output requires complex rules, which increases the storage area needed to hold them. Because the rules are complex, the amount of processing needed to output a phonetic symbol string from an input word also increases.
Furthermore, there is no index indicating how plausible an output phonetic symbol string is for the input word. Consequently, when a predetermined number of phonetic symbol strings must be selected from among the multiple candidates obtained for one word, it is difficult to select the most plausible ones.

The present invention has been made in view of these points. Its object is to provide a text phonetic symbol conversion dictionary creation device that creates a text phonetic symbol conversion dictionary for accurately generating phonetic symbol strings that are likely to be pronounced when a word is pronounced, a recognition vocabulary dictionary creation device that creates a recognition vocabulary dictionary using that text phonetic symbol conversion dictionary, and a speech recognition device that recognizes speech at a high recognition rate using that recognition vocabulary dictionary. A further object is to output, together with each phonetic symbol string, an index indicating its plausibility.

To solve the above problems, the text phonetic symbol conversion dictionary creation device according to claim 1 of the present invention is a device that creates a text phonetic symbol conversion dictionary used for converting text into phonetic symbols, and comprises: learning data storage means for storing, as learning data, data including a word, segment division information obtained by dividing the word into segments, and a phonetic symbol for each segment; learning data acquisition means for acquiring the word, the segment division information, and the phonetic symbol for each segment from the learning data; occurrence probability calculation means for generating, from the segment division information and the phonetic symbol for each segment acquired by the learning data acquisition means, phoneme segment pairs, each being a pair of a segment name and the phonetic symbol corresponding to that segment name, and for calculating an occurrence probability based on the frequency with which each phoneme segment pair appears in the learning data; connection probability calculation means for calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs adjacent to one another within a word, appears in the learning data; and text phonetic symbol conversion dictionary saving means for saving a text phonetic symbol conversion dictionary that includes the occurrence probability of each phoneme segment pair calculated by the occurrence probability calculation means and the connection probability of each connected phoneme segment pair sequence calculated by the connection probability calculation means, wherein, when the word is formed by concatenating a plurality of character strings each of which can be regarded as a single word, the learning data storage means stores segment division information obtained by dividing the word into segments corresponding to those character strings, and a phonetic symbol for each segment corresponding to those character strings.

The recognition vocabulary dictionary creation device according to claim 2 is a device that creates a recognition vocabulary dictionary used for recognizing speech, and comprises: text phonetic symbol conversion dictionary storage means for storing the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation device according to claim 1; recognition vocabulary dictionary storage means for storing a recognition vocabulary dictionary in which the recognition vocabulary subject to speech recognition is registered; recognition vocabulary acquisition means for acquiring the recognition vocabulary registered in the recognition vocabulary dictionary; segment sequence generation means for dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary and generating segment sequences, each being a sequence of segments; cumulative cost calculation means for generating, from the segment sequences generated by the segment sequence generation means, phoneme segment pair sequences, each being a sequence of phoneme segment pairs, and for calculating a cumulative cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary; speech candidate selection means for selecting, based on the cumulative costs calculated by the cumulative cost calculation means, the top-ranked phoneme segment pair sequences from among the phoneme segment pair sequences as phoneme segment pair sequence candidates; and speech candidate registration means for registering, in the recognition vocabulary dictionary, the sequences of phonetic symbols corresponding to the phoneme segment pair sequence candidates selected by the speech candidate selection means, wherein the cumulative cost calculation means calculates the cumulative cost based on the occurrence probabilities of the phoneme segment pairs in the phoneme segment pair sequence and the connection probabilities of the connected phoneme segment pair sequences.

The speech recognition device according to claim 3 recognizes speech based on the recognition vocabulary dictionary created by the recognition vocabulary dictionary creation device according to claim 2.
The text phonetic symbol conversion dictionary creation program according to claim 4 causes a computer to create a text phonetic symbol conversion dictionary used for converting text into phonetic symbols, and comprises: a learning data acquisition step of acquiring a word, segment division information, and a phonetic symbol for each segment from learning data storage means, which stores, as learning data, data including the word, the segment division information obtained by dividing the word into segments, and the phonetic symbol for each segment, and which, when the word is formed by concatenating a plurality of character strings each of which can be regarded as a single word, further stores segment division information obtained by dividing that word into segments corresponding to the character strings, and a phonetic symbol for each segment corresponding to the character strings; an occurrence probability calculation step of generating, from the segment division information and the phonetic symbol for each segment acquired in the learning data acquisition step, phoneme segment pairs, each being a pair of a segment name and the phonetic symbol corresponding to that segment name, and calculating an occurrence probability based on the frequency with which each phoneme segment pair appears in the learning data; a connection probability calculation step of calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs adjacent to one another within a word, appears in the learning data; and a text phonetic symbol conversion dictionary saving step of saving a text phonetic symbol conversion dictionary that includes the occurrence probability of each phoneme segment pair calculated in the occurrence probability calculation step and the connection probability of each connected phoneme segment pair sequence calculated in the connection probability calculation step.

The recognition vocabulary dictionary creation program according to claim 5 causes a computer to create a recognition vocabulary dictionary used for recognizing speech, and comprises: a text phonetic symbol conversion dictionary acquisition step of acquiring the text phonetic symbol conversion dictionary from text phonetic symbol conversion dictionary storage means that stores the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation program according to claim 4; a recognition vocabulary dictionary storage step of storing a recognition vocabulary dictionary in which the recognition vocabulary subject to speech recognition is registered; a recognition vocabulary acquisition step of acquiring the recognition vocabulary registered in the recognition vocabulary dictionary; a segment sequence generation step of dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary acquired in the text phonetic symbol conversion dictionary acquisition step and generating segment sequences, each being a sequence of segments; a cumulative cost calculation step of generating, from the segment sequences generated by the segment sequence generation means, phoneme segment pair sequences, each being a sequence of phoneme segment pairs, and calculating a cumulative cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary; a speech candidate selection step of selecting, based on the cumulative costs calculated in the cumulative cost calculation step, the top-ranked phoneme segment pair sequences from among the phoneme segment pair sequences as phoneme segment pair sequence candidates; and a speech candidate registration step of registering, in the recognition vocabulary dictionary, the sequences of phonetic symbols corresponding to the phoneme segment pair sequence candidates selected in the speech candidate selection step, wherein the cumulative cost calculation step includes a step of calculating the cumulative cost based on the occurrence probabilities of the phoneme segment pairs in the phoneme segment pair sequence and the connection probabilities of the connected phoneme segment pair sequences.

The speech recognition program according to claim 6 causes execution of processing that includes a step of recognizing speech based on the recognition vocabulary dictionary created by the recognition vocabulary dictionary creation program according to claim 5.
The text phonetic symbol conversion dictionary creation method according to claim 7 is a method of creating a text phonetic symbol conversion dictionary used for converting text into phonetic symbols, and comprises: a learning data acquisition step of acquiring a word, segment division information, and a phonetic symbol for each segment from learning data storage means, which stores, as learning data, data including the word, the segment division information obtained by dividing the word into segments, and the phonetic symbol for each segment, and which, when the word is formed by concatenating a plurality of character strings each of which can be regarded as a single word, further stores segment division information obtained by dividing that word into segments corresponding to the character strings, and a phonetic symbol for each segment corresponding to the character strings; an occurrence probability calculation step of generating, from the segment division information and the phonetic symbol for each segment acquired in the learning data acquisition step, phoneme segment pairs, each being a pair of a segment name and the phonetic symbol corresponding to that segment name, and calculating an occurrence probability based on the frequency with which each phoneme segment pair appears in the learning data; a connection probability calculation step of calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs adjacent to one another within a word, appears in the learning data; and a text phonetic symbol conversion dictionary saving step of saving a text phonetic symbol conversion dictionary that includes the occurrence probability of each phoneme segment pair calculated in the occurrence probability calculation step and the connection probability of each connected phoneme segment pair sequence calculated in the connection probability calculation step.

The recognition vocabulary dictionary creation method according to claim 8 is a method of creating a recognition vocabulary dictionary used for recognizing speech, and comprises: a text phonetic symbol conversion dictionary acquisition step of acquiring the text phonetic symbol conversion dictionary from text phonetic symbol conversion dictionary storage means that stores the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation method according to claim 7; a recognition vocabulary dictionary storage step of storing a recognition vocabulary dictionary in which the recognition vocabulary subject to speech recognition is registered; a recognition vocabulary acquisition step of acquiring the recognition vocabulary registered in the recognition vocabulary dictionary; a segment sequence generation step of dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary acquired in the text phonetic symbol conversion dictionary acquisition step and generating segment sequences, each being a sequence of segments; a cumulative cost calculation step of generating, from the segment sequences generated by the segment sequence generation means, phoneme segment pair sequences, each being a sequence of phoneme segment pairs, and calculating a cumulative cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary; a speech candidate selection step of selecting, based on the cumulative costs calculated in the cumulative cost calculation step, the top-ranked phoneme segment pair sequences from among the phoneme segment pair sequences as phoneme segment pair sequence candidates; and a speech candidate registration step of registering, in the recognition vocabulary dictionary, the sequences of phonetic symbols corresponding to the phoneme segment pair sequence candidates selected in the speech candidate selection step, wherein the cumulative cost calculation step calculates the cumulative cost based on the occurrence probabilities of the phoneme segment pairs in the phoneme segment pair sequence and the connection probabilities of the connected phoneme segment pair sequences.
The speech recognition method according to claim 9 includes a step of recognizing speech based on the recognition vocabulary dictionary created by the recognition vocabulary dictionary creation method according to claim 8.

According to the text phonetic symbol conversion dictionary creation device according to claim 1, the text phonetic symbol conversion dictionary creation program according to claim 4, and the text phonetic symbol conversion dictionary creation method according to claim 7 of the present invention, the occurrence probability calculated from the appearance frequency of each phoneme segment pair, and the connection probability of each connected phoneme segment pair sequence, which represents the likelihood of a pattern of adjacent phoneme segment pairs, are obtained from large-scale learning data and set in the text phonetic symbol conversion dictionary. Because this yields occurrence probabilities and connection probabilities grounded in statistics, using this text phonetic symbol conversion dictionary for text-to-phonetic-symbol conversion makes it possible to accurately generate phonetic symbol strings that are likely to be pronounced when a word is pronounced. In addition, by selecting the top-ranked occurrence probability values and connection probability values according to the size of the text phonetic symbol conversion dictionary, the statistics of highest importance can be set in the dictionary.
Also, according to the text phonetic symbol conversion dictionary creation device according to claim 1, phonemes are used as segments, so phonemes can easily be extracted as learning data from the phonetic symbols used in ordinary dictionaries. Furthermore, because not only segments corresponding to a single phoneme but also segments corresponding to multiple consecutive phonemes are used, occurrence probability values and connection probability values for segments that are, for example, words can be set in the text phonetic symbol conversion dictionary. As a result, using this dictionary in text-to-phonetic-symbol conversion makes it possible to generate phonetic symbols based on the phonetic symbols of known words. Accordingly, when a word whose pronunciation is unknown to the user can be regarded as a concatenation of several known words, the user is likely to pronounce it with those known words in mind, so the accuracy of text-to-phonetic-symbol conversion increases.
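The connection probability can be illustrated with a simple relative-frequency estimate over adjacent phoneme segment pairs. The text states only that the probability is based on how often a connected phoneme segment pair sequence appears in the learning data; conditioning each pair on the one immediately before it (a bigram model) is an assumption of this sketch, as is the function name `connection_probabilities`. The single learning entry is “abaca” with /abaka/.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (segment, phoneme), i.e. one phoneme segment pair


def connection_probabilities(words: Iterable[List[Pair]]) -> Dict[Tuple[Pair, Pair], float]:
    """Estimate P(current pair | previous pair) from adjacent pairs within each word."""
    bigram_counts: Counter = Counter()
    left_counts: Counter = Counter()
    for pairs in words:
        for prev, cur in zip(pairs, pairs[1:]):
            bigram_counts[(prev, cur)] += 1
            left_counts[prev] += 1  # how often `prev` is followed by anything
    return {bigram: n / left_counts[bigram[0]] for bigram, n in bigram_counts.items()}


# "abaca" segmented as a|b|a|c|a with phonemes a, b, a, k, a.
abaca = [("a", "a"), ("b", "b"), ("a", "a"), ("c", "k"), ("a", "a")]
for bigram, probability in connection_probabilities([abaca]).items():
    print(bigram, probability)
```

With real learning data the counts would come from many words, and a longer history than one preceding pair could be used.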

According to the recognition vocabulary dictionary creation device according to claim 2, the recognition vocabulary dictionary creation program according to claim 5, and the recognition vocabulary dictionary creation method according to claim 8 of the present invention, the occurrence probability calculated from the appearance frequency of each phoneme segment pair, and the connection probability of each connected phoneme segment pair sequence, which represents the likelihood of a pattern of adjacent phoneme segment pairs, have been obtained from large-scale learning data and set in the text phonetic symbol conversion dictionary. This dictionary can therefore be used to accurately generate phonetic symbol strings that are likely to be pronounced when a word is pronounced and to register them in the recognition vocabulary dictionary. In addition, because the plausibility of a generated phonetic symbol string can be judged from the value of its cumulative cost, the phonetic symbol strings to register in the recognition vocabulary dictionary can be selected flexibly from among the candidates.
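One way to picture the cumulative cost and the candidate selection is to sum negative log probabilities along a phoneme segment pair sequence and keep the lowest-cost candidates. The text says only that the cost is based on the occurrence and connection probabilities; the negative-log form, the `floor` value for unseen events, the toy probabilities, and the names `cumulative_cost` and `select_top` are all assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (segment, phoneme)


def cumulative_cost(seq: Sequence[Pair],
                    occ: Dict[Pair, float],
                    conn: Dict[Tuple[Pair, Pair], float],
                    floor: float = 1e-6) -> float:
    """Sum -log(probability) over the pairs and their adjacent connections (lower is better)."""
    cost = 0.0
    for i, pair in enumerate(seq):
        cost += -math.log(occ.get(pair, floor))
        if i > 0:
            cost += -math.log(conn.get((seq[i - 1], pair), floor))
    return cost


def select_top(candidates: Sequence[Sequence[Pair]],
               occ: Dict[Pair, float],
               conn: Dict[Tuple[Pair, Pair], float],
               n: int) -> List[Tuple[float, str]]:
    """Return the n lowest-cost candidates as (cost, phonetic symbol string)."""
    scored = [(cumulative_cost(c, occ, conn), "".join(ph for _, ph in c)) for c in candidates]
    scored.sort(key=lambda item: item[0])
    return scored[:n]


# Toy probabilities invented for illustration; they are not values from the patent.
occ = {("a", "a"): 0.75, ("a", "e"): 0.25, ("b", "b"): 1.0}
conn = {(("a", "a"), ("b", "b")): 0.5, (("a", "e"), ("b", "b")): 0.5}
candidates = [[("a", "e"), ("b", "b")], [("a", "a"), ("b", "b")]]
print(select_top(candidates, occ, conn, n=1))
```

Because every candidate carries a numeric cost, the number of pronunciations registered per word can be chosen freely, for example by keeping the top few or by applying a cost threshold.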

According to the speech recognition device according to claim 3, the speech recognition program according to claim 6, and the speech recognition method according to claim 9 of the present invention, phonetic symbol strings that are likely to be pronounced when a word is pronounced are accurately generated, using a text phonetic symbol conversion dictionary created from statistics obtained from large-scale learning data, and registered in the recognition vocabulary dictionary, so the recognition accuracy of speech recognition can be improved.

[First Embodiment]
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings. FIGS. 1 to 3 show embodiments of the text phonetic symbol conversion dictionary creation device, the text phonetic symbol conversion dictionary creation program, and the text phonetic symbol conversion dictionary creation method according to the present invention.
First, the configuration of the text phonetic symbol conversion dictionary creation device according to the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the text phonetic symbol conversion dictionary creation device 100 according to the present invention.

The learning data storage unit 10 stores, as learning data, data including words written as character strings, segment division information that divides each word into segments, and the phonetic symbol corresponding to each segment. Here, a segment is a part of the text (character string) of a word that corresponds to one or more consecutive phonetic symbols. In the present embodiment, characters that each correspond to one phonetic symbol, such as “a”, “b”, “c”, “u”, and “le” in the words “abaca” and “abacule”, are used as segments. Phoneme notation is used for the phonetic symbols.

The learning data acquisition unit 11 acquires the words, the segment division information, and the phonetic symbol corresponding to each segment from the learning data stored in the learning data storage unit 10.
The appearance frequency counter 12 counts how often each pair of a segment and its corresponding phoneme (hereinafter referred to as a phoneme segment pair) appears in the learning data, and stores the count in the memory 16. It also counts how often each segment appears in the learning data and stores that count in the memory 16. When all the words in the learning data have been processed, the appearance frequency of each phoneme segment pair and the total frequency, which is the appearance frequency of each segment, are fixed and stored in the memory 16.
The occurrence probability calculation unit 13 calculates an occurrence probability for each phoneme segment pair by dividing the appearance frequency of the phoneme segment pair by the total frequency of the segment contained in that pair, and stores the result in the memory 16.

The connection probability calculation unit 14 counts how often each connected phoneme segment pair sequence, that is, a sequence of phoneme segment pairs that are adjacent within a word, appears in the learning data, and stores the count in the memory 16. When all the words in the learning data have been processed, the appearance frequency of each connected phoneme segment pair sequence is fixed and stored in the memory 16. The unit then calculates a connection probability for each connected phoneme segment pair sequence by dividing the appearance frequency of the sequence by the total appearance frequency of the preceding phoneme segment pair in that sequence, and stores the result in the memory 16.
The text phonetic symbol conversion dictionary storage unit 15 reads from the memory 16 the occurrence probability calculated for each phoneme segment pair and the connection probability calculated for each connected phoneme segment pair sequence, and saves them as the text phonetic symbol conversion dictionary.

Next, the flow of the process for creating the text phonetic symbol conversion dictionary in the text phonetic symbol conversion dictionary creation device 100 configured as above will be described with reference to FIG. 2. FIG. 2 is a flowchart for explaining the method of creating the text phonetic symbol conversion dictionary that is executed in the text phonetic symbol conversion dictionary creation device 100.
As shown in the flowchart of FIG. 2, the creation process first proceeds to step S201, where the learning data acquisition unit 11 acquires a word, its segment division information, and the phonetic symbol corresponding to each segment from the learning data stored in the learning data storage unit 10. The process then proceeds to step S202.

In step S202, the appearance frequency counter 12 determines whether all the words in the learning data have been processed. If processing is determined to be complete (Yes), the process proceeds to step S205; otherwise (No), it proceeds to step S203.
In step S203, the appearance frequency counter 12 counts how often each phoneme segment pair appears in the learning data and stores the count in the memory 16. It also counts how often each segment appears in the learning data, stores that count in the memory 16, and the process proceeds to step S204.

In step S204, the connection probability calculation unit 14 counts how often each connected phoneme segment pair sequence, that is, a sequence of phoneme segment pairs that are adjacent within a word, appears in the learning data, stores the count in the memory 16, and the process returns to step S201.
In step S205, the occurrence probability calculation unit 13 calculates an occurrence probability for each phoneme segment pair by dividing the appearance frequency of the phoneme segment pair by the total frequency of the segment contained in that pair, stores the result in the memory 16, and the process proceeds to step S206.

In step S206, the connection probability calculation unit 14 calculates a connection probability for each connected phoneme segment pair sequence by dividing the appearance frequency of the sequence by the total appearance frequency of the preceding phoneme segment pair in that sequence, stores the result in the memory 16, and the process proceeds to step S207.
In step S207, the text phonetic symbol conversion dictionary storage unit 15 reads from the memory 16 the occurrence probability calculated for each phoneme segment pair and the connection probability calculated for each connected phoneme segment pair sequence, saves them as the text phonetic symbol conversion dictionary, and the process ends.
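The flow of steps S201 to S207 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the device: the function name build_conversion_dictionary, the input format (each word given as a list of (segment, phoneme) tuples), and the use of in-memory counters in place of the memory 16 are all hypothetical choices made for this sketch.

```python
from collections import Counter


def build_conversion_dictionary(training_words):
    """Hypothetical sketch of steps S201 to S207.

    training_words: iterable of words, each a list of (segment, phoneme)
    tuples in spelling order (an assumed input format).
    Returns (occurrence, connection) as plain probabilities.
    """
    pair_freq = Counter()      # frequency of each phoneme segment pair (S203)
    segment_freq = Counter()   # total frequency of each segment (S203)
    sequence_freq = Counter()  # frequency of each connected pair sequence (S204)

    for word in training_words:  # S201/S202: repeat until every word is processed
        pairs = [(phoneme, segment) for segment, phoneme in word]
        for phoneme, segment in pairs:
            pair_freq[(phoneme, segment)] += 1
            segment_freq[segment] += 1
        for preceding, following in zip(pairs, pairs[1:]):
            sequence_freq[(preceding, following)] += 1

    # S205: pair frequency divided by the total frequency of its segment.
    occurrence = {pair: n / segment_freq[pair[1]] for pair, n in pair_freq.items()}
    # S206: sequence frequency divided by the frequency of the preceding pair.
    connection = {seq: n / pair_freq[seq[0]] for seq, n in sequence_freq.items()}
    return occurrence, connection  # S207: the two tables form the dictionary


# The word "abaca" with the segments and phonemes used in this description.
abaca = [("a", "a"), ("b", "b"), ("a", "a"), ("c", "k"), ("a", "a")]
occurrence, connection = build_conversion_dictionary([abaca])
print(occurrence[("a", "a")], connection[(("a", "a"), ("b", "b"))])
```

Pairs are keyed as (phoneme, segment) and sequences as (preceding pair, following pair). The denominator in S206 is the full frequency of the preceding pair, including word-final occurrences, as the description states.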

Example 1
The operation of the text phonetic symbol conversion dictionary creation device 100 will now be described concretely with reference to FIGS. 3(a), 3(b), and 3(c).
The learning data stored in the learning data storage unit 10 is generated from data extracted from a general dictionary that contains word spellings and their phonetic symbols. In this example, as shown in FIG. 3(a), the segment division information for the word “abaca”, for instance, consists of the entries “(a)baca”, “a(b)aca”, “ab(a)ca”, “aba(c)a”, and “abac(a)”. The word “abaca” is divided into five segments, “a”, “b”, “a”, “c”, “a”, and the word “abacule” is divided into six segments, “a”, “b”, “a”, “c”, “u”, “le”.

The phonetic symbol corresponding to each segment is expressed in a form such as “(a)baca→a”. The phonemes corresponding to the five segments “a”, “b”, “a”, “c”, “a” of the word “abaca” are /a/, /b/, /a/, /k/, /a/, respectively. The phonemes corresponding to the six segments “a”, “b”, “a”, “c”, “u”, “le” of the word “abacule” are /a/, /b/, /a/, /k/, /y/, /l/, respectively.

The learning data acquisition unit 11 acquires the words, the segment division information, and the phonetic symbol corresponding to each segment from the learning data, and divides each acquired word into segments. The appearance frequency counter 12 counts how often each phoneme segment pair appears in the learning data and stores the count in the memory 16. Hereinafter, a segment “y” that is pronounced as the phonetic symbol “X” is written “X|y”. When only the word “abaca” has been processed, the appearance frequencies of the phoneme segment pairs “a|a”, “b|b”, and “k|c” are 3, 1, and 1, respectively.

When all the words in the learning data have been processed, the appearance frequency of each phoneme segment pair and the total frequency, which is the appearance frequency of each segment, are fixed and stored in the memory 16. In the example of FIG. 3(a), the appearance frequencies of the phoneme segment pairs “A|a”, “a|a”, “b|b”, and “p|b” are 1223, 45142, 12372, and 267, respectively.

The occurrence probability calculation unit 13 calculates an occurrence probability for each phoneme segment pair by dividing the appearance frequency of the phoneme segment pair by the total frequency of the segment contained in that pair, and stores the result in the memory 16. In the example of FIG. 3(a), the total frequency of the segment “a” is 46365. Two phonemes, /A/ and /a/, correspond to the segment “a”, and the phoneme segment pairs “A|a” and “a|a” appear 1223 times and 45142 times, respectively. The occurrence probabilities are therefore 0.03 and 0.97, respectively. To reduce the amount of computation in the recognition vocabulary dictionary creation device described later, the logarithm of the occurrence probability is used. For the occurrence probabilities 0.03 and 0.97, for example, the logarithmic values are -3.63 and -0.02, respectively.

The connection probability calculation unit 14 counts how often each connected phoneme segment pair sequence, that is, a sequence of phoneme segment pairs that are adjacent within a word, appears in the learning data, and stores the count in the memory 16. In the word “abaca”, there are four connected phoneme segment pair sequences made up of two adjacent phoneme segment pairs, namely “a|a→b|b”, “b|b→a|a”, “a|a→k|c”, and “k|c→a|a”, and each appears once. When all the words in the learning data have been processed, the appearance frequency of each such sequence of phoneme segment pairs is fixed and stored in the memory 16.

The connection probability is then calculated for each connected phoneme segment pair sequence by dividing the appearance frequency of the sequence by the total appearance frequency of its preceding phoneme segment pair, and is stored in the memory 16. As shown in FIG. 3(a), the total frequency of “a|a” is 45142, and “a|a→b|b” appears 2487 times. The connection probability of “a|a→b|b” is therefore 0.055, and its logarithmic value is -2.89. Hereinafter, logarithmic values are used for both the occurrence probability and the connection probability.

As shown in FIG. 3(a), the information on the occurrence probability of each phoneme segment pair consists of labels naming the phoneme segment pairs, such as “A|a”, “a|a”, “b|b”, and “p|b”, together with the occurrence probability of each pair, such as -3.63, -0.02, -0.02, and -3.85. Likewise, as shown in FIG. 3(b), the information on the connection probability of each connected phoneme segment pair sequence consists of labels naming the sequences, such as “a|a→b|b”, “a|a→p|b”, “a|a→k|c”, and “a|a→s|c”, together with the connection probability of each sequence, such as -2.89, -8.23, -3.18, and -4.69.
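The logarithmic values quoted for FIG. 3 can be checked with a few lines of Python. The sketch below assumes natural logarithms, and it truncates (rather than rounds) to two decimals, because that is what reproduces the quoted figures; both points are inferences from the numbers and are not stated in the description. The helper name truncated_log is hypothetical.

```python
import math


def truncated_log(count, total, digits=2):
    """Natural log of count/total, truncated toward zero to `digits` decimals.

    Natural logarithm and truncation are assumptions inferred from the
    quoted values, not something the description specifies.
    """
    scale = 10 ** digits
    return math.trunc(math.log(count / total) * scale) / scale


total_a = 1223 + 45142                       # total frequency of segment "a"
log_A_a = truncated_log(1223, total_a)       # occurrence of "A|a"
log_a_a = truncated_log(45142, total_a)      # occurrence of "a|a"
log_aa_bb = truncated_log(2487, 45142)       # connection of "a|a→b|b"
print(total_a, log_A_a, log_a_a, log_aa_bb)
```

Under these assumptions the script yields the total 46365 and the values -3.63, -0.02, and -2.89 given in the text.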

The present embodiment is not limited to the configuration described above, in which characters corresponding to a single phonetic symbol are used as segments; character strings that correspond to a plurality of phonetic symbols can also be used as segments. Such character strings are, for example, words, prefixes, and suffixes.
For example, the character strings “up” and “hall” that make up the word “uphall” can each be regarded as one word.

When these character strings are used as segments, the corresponding phoneme segment pairs are “ap|up” and “hol|hall”, respectively. The text phonetic symbol conversion dictionary may be created with such segments mixed in among the single-phonetic-symbol segments described above. Doing so allows occurrence probabilities and connection probabilities for segments that are whole words to be set in the text phonetic symbol conversion dictionary, so that the recognition vocabulary dictionary creation device described later can generate phonetic symbols based on the phonetic symbols of known words. When a word whose pronunciation is unknown to the user can be seen as several known words joined together, the user is likely to pronounce it with those known words in mind, so the accuracy of conversion from text to phonetic symbols is improved.

Note that when a segment that corresponds to a plurality of phonetic symbols appears only infrequently in the learning data, its occurrence probability and connection probability may each be set to a predetermined value.
The text phonetic symbol conversion dictionary storage unit 15 saves the information on the occurrence probability of each phoneme segment pair and the information on the connection probability of each connected phoneme segment pair sequence as the text phonetic symbol conversion dictionary. In the first embodiment, the learning data storage unit 10 corresponds to the learning data storage means of claim 1, the learning data acquisition unit 11 corresponds to the learning data acquisition means of claim 1, the appearance frequency counter 12 and the occurrence probability calculation unit 13 correspond to the occurrence probability calculation means of claim 1, the connection probability calculation unit 14 corresponds to the connection probability calculation means of claim 1, and the text phonetic symbol conversion dictionary storage unit 15 corresponds to the text phonetic symbol conversion dictionary storage means of claim 1.

Also in the first embodiment, step S201 corresponds to the learning data acquisition step of claim 6 or claim 9, steps S202 to S203 and S205 correspond to the occurrence probability calculation step of claim 6 or claim 9, steps S204 and S206 correspond to the connection probability calculation step of claim 6 or claim 9, and step S207 corresponds to the text phonetic symbol conversion dictionary storing step of claim 6 or claim 9.

The text phonetic symbol conversion dictionary creation program according to the first embodiment described above can be executed by a general computer system that includes a storage unit. In that case, the text phonetic symbol conversion dictionary creation operation described above is carried out when the computer executes the text phonetic symbol conversion dictionary creation program stored in the storage unit. The program may be supplied to the computer system via a communication medium. Alternatively, the program may be recorded on a recording medium such as an optical disc and read from that recording medium by the computer system.

[Second Embodiment]
Next, a second embodiment of the present invention will be described with reference to the drawings. FIGS. 4 to 8 show an embodiment of the recognition vocabulary dictionary creation device, the recognition vocabulary dictionary creation program, and the recognition vocabulary dictionary creation method according to the present invention.
First, the configuration of the recognition vocabulary dictionary creation device according to the present invention will be described with reference to FIG. 4. FIG. 4 is a block diagram illustrating the recognition vocabulary dictionary creation device 400 according to the present invention. The first recognition vocabulary dictionary storage unit 40 stores a first recognition vocabulary dictionary in which, for each recognition vocabulary item to be recognized, the phonetic symbol string of that item is registered in advance. The second recognition vocabulary dictionary storage unit 41 stores a second recognition vocabulary dictionary that holds recognition vocabulary registered by the user as targets of speech recognition. The two dictionaries differ in that the phonetic symbol strings of the recognition vocabulary are registered in advance in the first recognition vocabulary dictionary, whereas in the second recognition vocabulary dictionary they are newly registered by the recognition vocabulary dictionary creation device according to the present invention.

The text phonetic symbol conversion dictionary storage unit 42 stores the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation device described above. The recognition vocabulary acquisition unit 43 acquires the recognition vocabulary stored in the second recognition vocabulary dictionary. The segment sequence generation unit 44 divides the acquired recognition vocabulary into segments and generates segment sequences. In the present embodiment, the segment parts of the phoneme segment pairs saved in the text phonetic symbol conversion dictionary are used as the segment labels referred to for the division, but a table in which segment labels are registered may be used instead.
Taking the word “abaca” as an example, two segment sequences are generated: a sequence of five segments {“a”, “b”, “a”, “c”, “a”} and a sequence of four segments {“a”, “b”, “ac”, “a”}.
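The generation of segment sequences can be illustrated with a short recursive sketch in Python. The function name segment_sequences and the small segment inventory {"a", "b", "c", "ac"} are assumptions chosen so that the sketch matches the “abaca” example; the description does not specify the search procedure or list the full set of segment labels.

```python
def segment_sequences(word, segment_labels):
    """Return every way of splitting `word` into known segment labels.

    A simple exhaustive recursion; the actual procedure used by the
    segment sequence generation unit 44 is not specified in the text.
    """
    if not word:
        return [[]]  # one way to split the empty string: no segments
    sequences = []
    for label in sorted(segment_labels):
        if word.startswith(label):
            for rest in segment_sequences(word[len(label):], segment_labels):
                sequences.append([label] + rest)
    return sequences


# Hypothetical segment inventory, sufficient for the "abaca" example.
labels = {"a", "b", "c", "ac"}
for sequence in segment_sequences("abaca", labels):
    print(sequence)
```

With this inventory the sketch produces the five-segment and four-segment sequences mentioned above and no others. A word that cannot be covered by the known labels simply yields an empty list.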

The accumulated cost calculation unit 45 generates sequences of phoneme segment pairs from a segment sequence by assigning every possible phonetic symbol to each segment of the segment sequence generated by the segment sequence generation unit 44. The phoneme segment pairs saved in the text phonetic symbol conversion dictionary are referred to in order to assign phonetic symbols to the segments. Next, for each sequence of phoneme segment pairs, the accumulated cost is calculated by referring to the occurrence probabilities and connection probabilities saved in the text phonetic symbol conversion dictionary.
The speech candidate selection unit 46 selects speech candidates, on the basis of accumulated cost, from the sequences of phoneme segment pairs for which the accumulated cost was calculated. The speech candidate registration unit 47 registers the speech candidates selected by the speech candidate selection unit 46 in the second recognition vocabulary dictionary as phonetic symbol strings of the corresponding recognition vocabulary.
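Assigning every possible phonetic symbol to each segment amounts to taking a Cartesian product over the per-segment choices. The following Python sketch shows this under stated assumptions: the function name phoneme_pair_sequences and the lookup table phonemes_for are hypothetical, and the table contains only the pairs named in this description (A|a, a|a, b|b, p|b, k|c, s|c), not a complete dictionary.

```python
from itertools import product


def phoneme_pair_sequences(segment_sequence, phonemes_for):
    """Expand a segment sequence into all phoneme segment pair sequences.

    phonemes_for maps a segment to the phonemes it is paired with in the
    conversion dictionary (an assumed lookup structure).
    """
    choices = [
        [(phoneme, segment) for phoneme in phonemes_for[segment]]
        for segment in segment_sequence
    ]
    return [list(combination) for combination in product(*choices)]


# Only the pairs mentioned in the description; a real dictionary has more.
phonemes_for = {"a": ["a", "A"], "b": ["b", "p"], "c": ["k", "s"]}
candidates = phoneme_pair_sequences(["a", "b", "a", "c", "a"], phonemes_for)
print(len(candidates))
```

Because the number of sequences grows as the product of the per-segment choices, a practical implementation would probably prune or use dynamic programming; the description does not say which approach is taken.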

Next, the flow of the recognition vocabulary dictionary creation process in the recognition vocabulary dictionary creation device 400 configured as above will be described with reference to FIG. 5. FIG. 5 is a flowchart for explaining the recognition vocabulary dictionary creation method executed in the recognition vocabulary dictionary creation device 400.
As shown in the flowchart of FIG. 5, the creation process first proceeds to step S501, where the recognition vocabulary acquisition unit 43 inputs a word contained in the second recognition vocabulary dictionary 41, and then proceeds to step S502. In step S502, the segment sequence generation unit 44 divides the input word into segments to generate a segment sequence, and the process proceeds to step S503. In step S503, the accumulated cost calculation unit 45 generates sequences of phoneme segment pairs from the segment sequence by assigning every possible phonetic symbol to each segment of the segment sequence, and the process proceeds to step S504.

In step S504, the accumulated cost calculation unit 45 calculates the accumulated cost of each sequence of phoneme segment pairs on the basis of the occurrence probabilities and connection probabilities saved in the text phonetic symbol conversion dictionary, and the process proceeds to step S505. In step S505, the speech candidate selection unit 46 compares each calculated accumulated cost with the boundary likelihood, selects the speech candidates whose accumulated cost is equal to or greater than the boundary likelihood, and the process proceeds to step S506. In step S506, the speech candidate registration unit 47 registers the speech candidates selected by the speech candidate selection unit 46 in the second recognition vocabulary dictionary as phonetic symbol strings of the corresponding recognition vocabulary, and the process ends.

Example 1
Next, the operation of the accumulated cost calculation unit 45 will be described concretely with reference to FIG. 6. FIG. 6 shows an example in which the accumulated cost calculation unit 45 calculates the accumulated cost of speech candidates for the word “abaca”. The accumulated cost calculation unit 45 calculates, for example, the accumulated cost of the speech candidate /abaka/ for the word “abaca” as follows. The first segment “a” of the word “abaca” has two speech candidates, the phonetic symbol /a/ and the phonetic symbol /A/. The accumulated cost calculation unit 45 accumulates the occurrence probability -0.02 of the phonetic symbol /a/, the connection probability -2.89 between the phonetic symbol /a/ and the phonetic symbol /b/, and the occurrence probability of the phonetic symbol /b/, to obtain the accumulated cost of the character string “ab”.

Because the present embodiment uses the logarithmic values of the occurrence probability and the connection probability as described above, the accumulated cost can be calculated with addition instead of multiplication, which reduces the amount of computation.
The accumulated cost calculation unit 45 further accumulates the connection probability from the phonetic symbol /b/ to the phonetic symbol /a/ and the occurrence probability that the speech candidate of the third segment “a” is the phonetic symbol /a/. It then accumulates the connection probability from the phonetic symbol /a/ to the phonetic symbol /k/ and the occurrence probability that the speech candidate of the segment “c” is the phonetic symbol /k/, followed by the connection probability from the phonetic symbol /k/ to the phonetic symbol /a/ and the occurrence probability that the speech candidate of the segment “a” is the phonetic symbol /a/. Through these operations, the accumulated cost calculation unit 45 obtains an accumulated cost of -10.03 for the phonetic symbol string /abaka/.
In the same way, the accumulated cost calculation unit 45 calculates the accumulated cost of the other speech candidates, such as the phonetic symbol strings /abAka/, /abakA/, /Abaka/, and /abasa/.
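The accumulation described above is a running sum of log occurrence and log connection values along the sequence. A minimal Python sketch follows; the function name cumulative_cost and the dictionary layout are assumptions, and only the three log values that the description actually quotes for the prefix “ab” are filled in, so the full -10.03 for /abaka/ is not reproduced here.

```python
def cumulative_cost(pairs, occurrence_log, connection_log):
    """Sum log occurrence and log connection values along a pair sequence.

    pairs: list of (phoneme, segment) tuples.
    occurrence_log / connection_log: assumed lookup tables of log values.
    """
    cost = occurrence_log[pairs[0]]
    for preceding, following in zip(pairs, pairs[1:]):
        cost += connection_log[(preceding, following)]
        cost += occurrence_log[following]
    return cost


# Log values quoted in the description for FIG. 3(a) and FIG. 3(b).
occurrence_log = {("a", "a"): -0.02, ("b", "b"): -0.02}
connection_log = {(("a", "a"), ("b", "b")): -2.89}

# Accumulated cost of the prefix "ab" pronounced /ab/.
cost_ab = cumulative_cost([("a", "a"), ("b", "b")], occurrence_log, connection_log)
print(round(cost_ab, 2))
```

A lookup for a pair or sequence that is absent from the tables raises a KeyError in this sketch; how the device treats unseen combinations is not covered by this passage.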

Next, the speech candidate selection unit 46 selects speech candidates, on the basis of accumulated cost, from the phonetic symbol strings for which the accumulated cost was calculated. The speech candidate selection unit 46 of the present embodiment selects the speech candidates whose accumulated cost is greater than a predetermined threshold. The threshold may be a fixed value determined in advance, or it may be the value obtained by subtracting a predetermined value from the largest of the calculated accumulated costs.

FIG. 7 is a diagram for explaining an example in which the threshold is set to the value obtained by subtracting a predetermined value from the largest of the calculated accumulated costs. In the illustrated example, the accumulated cost is calculated for the phonetic symbol strings /abaka/, /abAka/, /abakA/, /Abaka/, and /abasa/. In the present embodiment, these speech candidates are ranked from first candidate to fifth candidate in descending order of accumulated cost. The accumulated cost of each phonetic symbol string is as follows.
First candidate: phonetic symbol string /abaka/, accumulated cost -10.03
Second candidate: phonetic symbol string /abAka/, accumulated cost -12.70
Third candidate: phonetic symbol string /abakA/, accumulated cost -14.25
Fourth candidate: phonetic symbol string /Abaka/, accumulated cost -16.53
Fifth candidate: phonetic symbol string /abaSa/, accumulated cost -17.64

In the example shown in FIG. 7, the predetermined value is 5.00. The speech candidate selection unit 46 subtracts 5.00 from the largest accumulated cost, -10.03, to set the threshold value (boundary likelihood) to -15.03. The speech candidate selection unit 46 selects the phonetic symbol strings /abaka/, /abAka/, and /abakA/, which are the speech candidates whose accumulated cost is equal to or higher than the boundary likelihood. The speech candidate registration unit 47 stores the selected speech candidates in the second recognition vocabulary dictionary. The phonetic symbol strings /Abaka/ and /abaSa/, whose accumulated cost falls below the boundary likelihood, are excluded.
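
The relative threshold of FIG. 7 can be sketched as follows. The accumulated costs and the margin of 5.00 are the values given in the text; the function name `select_by_relative_threshold` and the dictionary layout are assumptions made for this illustration.

```python
def select_by_relative_threshold(costs, margin):
    """Keep candidates whose cost is at or above (largest cost - margin)."""
    boundary = max(costs.values()) - margin
    selected = [name for name, cost in costs.items() if cost >= boundary]
    return boundary, selected


# Accumulated costs listed for FIG. 7
fig7_costs = {
    "abaka": -10.03,
    "abAka": -12.70,
    "abakA": -14.25,
    "Abaka": -16.53,
    "abaSa": -17.64,
}

boundary, selected = select_by_relative_threshold(fig7_costs, 5.00)
```

Here `boundary` evaluates to the boundary likelihood of -15.03 (up to floating-point rounding), and `selected` contains the first three candidates, matching the selection described above.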

With this configuration, the threshold value depends on the accumulated cost of the first candidate, so that, using the first candidate's accumulated cost as the reference, speech candidates that are unlikely to be uttered can be excluded from registration in the second recognition vocabulary dictionary. Therefore, the recognition rate for input speech can be increased while growth in the size of the second recognition vocabulary dictionary is suppressed.

Further, the present embodiment is not limited to selecting speech candidates as described above; a predetermined number of speech candidates with the largest accumulated costs may be selected instead. For example, when three speech candidates per word are registered in the second recognition vocabulary dictionary, the speech candidate selection unit 46 selects, for the word “abaca”, the three speech candidates with the highest accumulated costs: the phonetic symbol strings /abaka/, /abAka/, and /abakA/.
With this configuration, the number of speech candidates to be registered, and the capacity of the second recognition vocabulary dictionary required to register them, can be predicted in advance.
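
The fixed-count alternative can be sketched as follows, using the accumulated costs given for FIG. 7. The helper name `select_top_n` and the vocabulary size used for the capacity estimate are hypothetical.

```python
import heapq


def select_top_n(costs, n):
    """Return the n candidates with the largest accumulated cost, best first."""
    return heapq.nlargest(n, costs, key=costs.get)


costs = {
    "abaka": -10.03,
    "abAka": -12.70,
    "abakA": -14.25,
    "Abaka": -16.53,
    "abaSa": -17.64,
}

top_three = select_top_n(costs, 3)

# With a fixed count per word, the dictionary size has a simple upper bound.
vocabulary_size = 1000  # hypothetical number of words
max_entries = vocabulary_size * 3
```

Because every word contributes at most a fixed number of entries, the upper bound on dictionary size is known before any accumulated cost is computed, which is the predictability the text refers to.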

Further, the present embodiment is not limited to the configuration described above, in which a character corresponding to one phonetic symbol is treated as a segment; a character string corresponding to a plurality of phonetic symbols may also be treated as a segment. Such a character string is, for example, a word, a prefix, or a suffix. Furthermore, the calculation of the accumulated cost is not limited to taking the sum of the occurrence probabilities and the connection probabilities; the occurrence probabilities and the connection probabilities may each be multiplied by a weighting factor before the sum is taken.
Before the recognition vocabulary acquired from the second recognition vocabulary dictionary is divided into segments by the segment sequence generation unit 44, it may be determined whether the acquired recognition vocabulary is registered in the first recognition vocabulary dictionary; if it is, the phonetic symbol string registered in the first recognition vocabulary dictionary may be registered in the second recognition vocabulary dictionary.
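
The weighted variant of the accumulated cost can be written as follows. This is a sketch under stated assumptions: the symbols $w_o$, $w_c$, $P_o$, $P_c$, $s_i$ (the $i$-th segment) and $p_i$ (its phonetic symbol) are notation introduced here, and taking logarithms of conditional probabilities is an assumption consistent with the negative accumulated costs listed for FIG. 7.

```latex
C(p_1, \dots, p_n) = w_o \sum_{i=1}^{n} \log P_o(p_i \mid s_i) + w_c \sum_{i=2}^{n} \log P_c(p_i \mid p_{i-1})
```

Setting $w_o = w_c = 1$ recovers the unweighted sum of occurrence and connection terms.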

FIG. 8 is a diagram for explaining an example in which a character string corresponding to a plurality of phonetic symbols is treated as a segment. In the present embodiment, a segment corresponding to one phonetic symbol is also referred to as a normal segment, and a segment corresponding to a plurality of connected phonetic symbols is also referred to as an extended segment. The text phonetic symbol conversion dictionary used in this case is one created by the text phonetic symbol conversion dictionary creation device with normal segments and extended segments mixed.

In the example shown in FIG. 8, segments corresponding to a plurality of connected phonetic symbols are set for the word “uphall”. The range indicated by reference numeral 82 shows the sequences of phoneme segment pairs generated from segments that each correspond to one phonetic symbol. The range indicated by reference numeral 81 shows the sequences of phoneme segment pairs generated from segments that each correspond to a plurality of connected phonetic symbols.
In the example indicated by reference numeral 81, the word “uphall” is divided into a plurality of words, namely the character strings “up” and “hall”, and these are used as extended segments. As illustrated, the recognition vocabulary dictionary creation device of the present embodiment mixes normal segments and extended segments.

Introducing extended segments makes it possible to set, in the text phonetic symbol conversion dictionary, occurrence probability values and connection probability values for segments that are themselves words. When this text phonetic symbol conversion dictionary is used to convert text into phonetic symbols, phonetic symbols based on the pronunciation of known words can be generated. When a user does not know how a word is pronounced but the word can be regarded as several known words joined together, the user is likely to pronounce it with those known words in mind, so the accuracy of conversion from text to phonetic symbols increases.
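
Mixing normal and extended segments means a word can be divided in more than one way. The following sketch enumerates every division of “uphall” over a small segment inventory. The inventory contents and the function name `enumerate_segmentations` are hypothetical; in the embodiment the available segments come from the text phonetic symbol conversion dictionary.

```python
def enumerate_segmentations(word, inventory):
    """Return every way to split `word` into segments found in `inventory`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in inventory:
            # Recurse on the remainder and prepend the matched segment
            for tail in enumerate_segmentations(word[end:], inventory):
                results.append([head] + tail)
    return results


# Hypothetical inventory: normal segments plus the extended segments "up" and "hall"
inventory = {"u", "p", "h", "a", "l", "ll", "ph", "up", "hall"}
segmentations = enumerate_segmentations("uphall", inventory)
```

The result includes both a division into normal segments such as `["u", "ph", "a", "ll"]` and the extended-segment division `["up", "hall"]`, each of which would then be scored by accumulated cost.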

In the second embodiment, the second recognition vocabulary dictionary 41 corresponds to the recognition vocabulary dictionary storage means described in claim 4, the text phonetic symbol conversion dictionary storage unit 42 corresponds to the text phonetic symbol conversion dictionary storage means described in claim 4, the recognition vocabulary acquisition unit 43 corresponds to the recognition vocabulary acquisition means described in claim 4, the segment sequence generation unit 44 corresponds to the segment sequence generation means described in claim 4, the accumulated cost calculation unit 45 corresponds to the accumulated cost calculation means described in claim 4, the speech candidate selection unit 46 corresponds to the speech candidate selection means described in claim 4, and the speech candidate registration unit 47 corresponds to the speech candidate registration means described in claim 4.

Also in the second embodiment, step S501 corresponds to the recognition vocabulary acquisition step described in claim 7 or claim 10, step S502 corresponds to the segment sequence generation step described in claim 7 or claim 10, steps S503 to S504 correspond to the accumulated cost calculation step described in claim 7 or claim 10, step S504 corresponds to the text phonetic symbol conversion dictionary acquisition step described in claim 7 or claim 10, step S505 corresponds to the speech candidate selection step described in claim 7 or claim 10, and step S506 corresponds to the speech candidate registration step described in claim 7 or claim 10.

The recognition vocabulary dictionary creation program according to the second embodiment described above can be executed by a general computer system including a storage unit. In this case, the recognition vocabulary dictionary creation operation described above is performed when the computer executes the recognition vocabulary dictionary creation program stored in the storage unit. The recognition vocabulary dictionary creation program may be supplied to the computer system via a communication medium. Alternatively, the program may be recorded on a recording medium such as an optical disc and read from that recording medium by the computer system.

[Third Embodiment]
Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 9 is a diagram showing an embodiment of a speech recognition device according to the present invention. The speech recognition device 900 of the present embodiment includes a recognition vocabulary dictionary 93 created by the recognition vocabulary dictionary creation device 400 described above, and uses it to perform speech recognition.
As shown in FIG. 9, the speech recognition device 900 includes a speech input unit 90 that receives input speech, a feature amount extraction unit 91 that extracts a time series of feature amounts from the input speech, an acoustic model storage unit 92 that stores acoustic models, the recognition vocabulary dictionary 93 that stores speech candidates, a matching unit 94 that performs pattern matching, a recognition result output unit 95 that outputs recognition result candidates for the input speech, and an operation unit 96 for inputting control signals to the speech recognition device.

The speech input unit 90 performs A/D conversion on speech that the user inputs through a microphone (not shown) or the like. The feature amount extraction unit 91 extracts a time series of feature amounts from the input speech. The acoustic model storage unit 92 stores acoustic models expressed, for example, as continuous-density HMMs (Hidden Markov Models). An acoustic model is created for every phoneme using speech data from a large number of speakers.

The recognition vocabulary dictionary 93 is created by a recognition vocabulary dictionary creation device in which normal segments and extended segments are mixed as shown in FIG. 7. For example, the extended segments include the character strings “up” and “hall”. For each speech candidate stored in the recognition vocabulary dictionary 93, the matching unit 94 generates a speech pattern model by connecting acoustic models according to the phoneme notation (phonetic symbols) of that speech candidate. When it receives a time series of feature amounts from the feature amount extraction unit 91, it obtains the acoustic likelihood of each speech candidate by pattern matching between that time series and the speech pattern models generated in advance for the speech candidates, and takes the several speech candidates with the highest acoustic likelihood as the recognition result candidates for the input speech.

When recognition result candidates are obtained by the pattern matching of the matching unit 94, the recognition result output unit 95 outputs the words corresponding to those speech candidates to a display (not shown) or the like. The operation unit 96 is used to select one of the output words and to instruct the speech recognition device to start or stop speech recognition. It is also used for operations by which the user gives the next instruction based on the recognition result or on the result of processing executed based on the recognition result.

Next, the present embodiment will be described concretely.
First, when the user speaks into a microphone (not shown), the speech input unit 90 performs A/D conversion on the input speech picked up by the microphone.
The feature amount extraction unit 91 analyzes the digital signal of the input speech and extracts a time series of feature amounts such as MFCCs (Mel Frequency Cepstrum Coefficients).

For each speech candidate stored in advance in the recognition vocabulary dictionary 93, the matching unit 94 connects the acoustic models stored in the acoustic model storage unit 92 according to the phoneme notation (phonetic symbols) of that speech candidate, and generates a speech pattern model (a model of the time-series pattern of feature amounts). When the time series of feature amounts extracted by the feature amount extraction unit 91 is input, the matching unit 94 obtains the acoustic likelihood of each speech candidate by pattern matching between that time series and the speech pattern model of each speech candidate, using, for example, the Viterbi algorithm. For example, when the word “uphall” is registered in the recognition vocabulary dictionary 93 as a recognition target and the user utters the word “uphall”, pattern matching is performed between the time series of feature amounts extracted from the input speech and the speech pattern models corresponding to the speech candidates stored in the recognition vocabulary dictionary 93, such as /aphal/, /apfal/, /ufol/, /uphol/, and /aphol/.
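
The Viterbi scoring mentioned above can be sketched for a simple left-to-right model as follows. This is an illustration under assumptions, not the embodiment itself: the per-frame emission log likelihoods are supplied directly as hypothetical numbers, whereas an actual system would compute them from continuous-density HMMs over the extracted feature amounts, and the function name `viterbi_log_likelihood` is introduced here.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_likelihood(emission_logp, log_self, log_next):
    """Best-path log likelihood through a left-to-right HMM.

    emission_logp[t][s] is the log likelihood of frame t under state s.
    Each state may loop (log_self) or advance by one state (log_next).
    The path must start in state 0 and end in the last state.
    """
    n_states = len(emission_logp[0])
    score = [NEG_INF] * n_states
    score[0] = emission_logp[0][0]
    for frame in emission_logp[1:]:
        new_score = [NEG_INF] * n_states
        for s in range(n_states):
            stay = score[s] + log_self
            move = score[s - 1] + log_next if s > 0 else NEG_INF
            new_score[s] = max(stay, move) + frame[s]
        score = new_score
    return score[-1]


# Hypothetical emission likelihoods for 3 frames and 2 states
emissions = [
    [math.log(0.8), math.log(0.1)],
    [math.log(0.6), math.log(0.3)],
    [math.log(0.1), math.log(0.9)],
]
likelihood = viterbi_log_likelihood(emissions, math.log(0.5), math.log(0.5))
```

Running this scoring once per speech candidate and sorting the results in descending order would give the ranking from which the top recognition result candidates are taken.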

Because the speech candidate /aphol/, which corresponds to the extended segments “up” and “hall”, is stored in the recognition vocabulary dictionary 93, the acoustic likelihood is high even when the user pronounces the word as /aphol/, and the speech candidate /aphol/ is included among the recognition result candidates.
The recognition result output unit 95 outputs the recognition result candidates obtained by the matching unit 94 to a display or the like. Whether the user pronounces the word as /ufol/ or as /aphol/, the word “uphall” becomes a recognition result candidate.
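
The effect described above, where several pronunciations lead back to the same word, can be sketched as a lookup from phonetic symbol string to word. The speech candidates are those named in the text; the data layout and the function name `build_pronunciation_index` are assumptions for illustration.

```python
from collections import defaultdict

# Speech candidates registered for the word "uphall", as listed in the text
recognition_vocabulary = {
    "uphall": ["aphal", "apfal", "ufol", "uphol", "aphol"],
}


def build_pronunciation_index(vocabulary):
    """Map each phonetic symbol string to the set of words it may represent."""
    index = defaultdict(set)
    for word, candidates in vocabulary.items():
        for phonetic in candidates:
            index[phonetic].add(word)
    return index


index = build_pronunciation_index(recognition_vocabulary)
words_for_ufol = index["ufol"]
words_for_aphol = index["aphol"]
```

Both lookups return the word “uphall”, so either pronunciation yields the same recognition result candidate.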

The user operates the operation unit 96 to select the word “uphall” from among the recognition result candidates. The recognition result of the speech recognition device 900 is input to, for example, the map search function of a car navigation device. In that case, the map search function treats the word “uphall” as an input place name and performs control such as extracting the latitude and longitude of the point indicated by the word “uphall”.

In the third embodiment, the recognition vocabulary dictionary 93 corresponds to the recognition vocabulary dictionary created by the recognition vocabulary dictionary creation device described in claim 4.
The speech recognition program according to the third embodiment described above can be executed by a general computer system including a storage unit. In this case, the speech recognition operation described above is performed when the computer executes the speech recognition program stored in the storage unit. The speech recognition program may be supplied to the computer system via a communication medium. Alternatively, the program may be recorded on a recording medium such as an optical disc and read from that recording medium by the computer system.

By using a text phonetic symbol conversion dictionary created from statistics obtained from large-scale learning data, the present invention accurately generates the phonetic symbol strings that are likely to be produced when a word is pronounced. It can therefore be used to create recognition vocabulary dictionaries for speech recognition.

FIG. 1 is a block diagram showing the configuration of the text phonetic symbol conversion dictionary creation device 100 according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the process of creating a text phonetic symbol conversion dictionary in the text phonetic symbol conversion dictionary creation device 100 according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the operation of the text phonetic symbol conversion dictionary creation device 100 shown in FIG. 2.
FIG. 4 is a block diagram showing the configuration of the recognition vocabulary dictionary creation device 400 of the present invention.
FIG. 5 is a flowchart showing the process of creating a recognition vocabulary dictionary in the recognition vocabulary dictionary creation device 400 shown in FIG. 4.
FIG. 6 is a diagram for explaining the operation of the accumulated cost calculation unit shown in FIG. 4.
FIG. 7 is a diagram for explaining an example of setting the threshold value of the accumulated cost according to the second embodiment of the present invention.
FIG. 8 is a diagram for explaining an example in which segments corresponding to a plurality of connected phonetic symbols are set according to the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the speech recognition device 900 according to the third embodiment of the present invention.
FIG. 10 is a diagram for explaining an example of a word dictionary used in a conventional spelling-to-phonetic-symbol conversion method.
FIG. 11 is a diagram for explaining an example of rules used in a conventional spelling-to-phonetic-symbol conversion method.

Explanation of symbols

100 Text phonetic symbol conversion dictionary creation device
400 Recognition vocabulary dictionary creation device
900 Speech recognition device
10 Learning data storage unit
11 Learning data acquisition unit
12 Appearance frequency counter
13 Occurrence probability calculation unit
14 Connection probability calculation unit
15 Text phonetic symbol conversion dictionary saving unit
16 Memory
40 First recognition vocabulary dictionary storage unit
41 Second recognition vocabulary dictionary storage unit
42 Text phonetic symbol conversion dictionary storage unit
43 Recognition vocabulary acquisition unit
44 Segment sequence generation unit
45 Accumulated cost calculation unit
46 Speech candidate selection unit
47 Speech candidate registration unit
90 Speech input unit
91 Feature amount extraction unit
92 Acoustic model storage unit
93 Recognition vocabulary dictionary
94 Matching unit
95 Recognition result output unit
96 Operation unit

Claims

A text phonetic symbol conversion dictionary creation device for creating a text phonetic symbol conversion dictionary used for conversion from text to phonetic symbols, comprising:
learning data storage means for storing, as learning data, data including a word, segment division information obtained by dividing the word into segments, and a phonetic symbol for each segment;
learning data acquisition means for acquiring the word, the segment division information, and the phonetic symbol for each segment from the learning data;
occurrence probability calculation means for generating, from the segment division information and the phonetic symbol for each segment acquired by the learning data acquisition means, a phoneme segment pair that is a pair of a segment name and the phonetic symbol corresponding to that segment name, and for calculating an occurrence probability based on the frequency with which the phoneme segment pair appears in the learning data;
connection probability calculation means for calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs connected within a word, appears in the learning data; and
text phonetic symbol conversion dictionary saving means for saving a text phonetic symbol conversion dictionary including the occurrence probability for each phoneme segment pair calculated by the occurrence probability calculation means and the connection probability for each connected phoneme segment pair sequence calculated by the connection probability calculation means,
wherein, when the word is formed by connecting a plurality of character strings each of which can be regarded as one word, the learning data storage means further stores segment division information obtained by dividing the word into segments corresponding to the character strings, and a phonetic symbol for each segment corresponding to the character strings.
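
For illustration only, the occurrence and connection probabilities can be estimated from counts as sketched below. The claim states only that each probability is based on frequency of appearance in the learning data; normalizing by the segment count and by the preceding-pair count is an assumption of this sketch, and the learning data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical learning data: each word as a list of (segment, phonetic symbol) pairs
learning_data = [
    [("a", "a"), ("b", "b"), ("a", "a"), ("c", "k"), ("a", "a")],
    [("c", "k"), ("a", "A"), ("b", "b")],
    [("c", "s"), ("a", "a"), ("b", "b")],
]

pair_counts = Counter()     # occurrences of each phoneme segment pair
segment_counts = Counter()  # occurrences of each segment, whatever its symbol
bigram_counts = Counter()   # occurrences of two pairs connected within a word
history_counts = Counter()  # occurrences of a pair as the left side of a bigram

for word in learning_data:
    for segment, symbol in word:
        pair_counts[(segment, symbol)] += 1
        segment_counts[segment] += 1
    for left, right in zip(word, word[1:]):
        bigram_counts[(left, right)] += 1
        history_counts[left] += 1

# Assumed normalization: relative frequency of the symbol given the segment
occurrence_probability = {
    pair: count / segment_counts[pair[0]] for pair, count in pair_counts.items()
}
# Assumed normalization: relative frequency of the next pair given the previous pair
connection_probability = {
    bigram: count / history_counts[bigram[0]]
    for bigram, count in bigram_counts.items()
}
```

In this toy data the segment “c” is pronounced /k/ in two of its three occurrences, so its occurrence probability is 2/3; the two dictionaries together play the role of the saved text phonetic symbol conversion dictionary.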

A recognition vocabulary dictionary creation device for creating a recognition vocabulary dictionary used for recognizing speech, comprising:
text phonetic symbol conversion dictionary storage means for storing a text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation device according to claim 1;
recognition vocabulary dictionary storage means for storing a recognition vocabulary dictionary in which recognition vocabulary to be subjected to speech recognition is registered;
recognition vocabulary acquisition means for acquiring the recognition vocabulary registered in the recognition vocabulary dictionary;
segment sequence generation means for dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary, and generating a segment sequence that is a sequence of segments;
accumulated cost calculation means for generating phoneme segment pair sequences, each being a sequence of phoneme segment pairs, from the segment sequence generated by the segment sequence generation means, and for calculating an accumulated cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary;
speech candidate selection means for selecting, as phoneme segment pair sequence candidates, higher-ranked phoneme segment pair sequences from among the phoneme segment pair sequences based on the accumulated cost calculated by the accumulated cost calculation means; and
speech candidate registration means for registering, in the recognition vocabulary dictionary, the sequence of phonetic symbols corresponding to each phoneme segment pair sequence candidate selected by the speech candidate selection means,
wherein the accumulated cost calculation means calculates the accumulated cost based on the occurrence probabilities of the phoneme segment pairs in the phoneme segment pair sequence and the connection probabilities of the connected phoneme segment pair sequences.

A speech recognition apparatus for recognizing speech based on a recognition vocabulary dictionary created by the recognition vocabulary dictionary creation apparatus according to claim 2.

A text phonetic symbol conversion dictionary creation program for causing a computer to create a text phonetic symbol conversion dictionary used for conversion from text to phonetic symbols, the program comprising:
a learning data acquisition step of acquiring a word, segment division information, and a phonetic symbol for each segment from learning data storage means, the learning data storage means storing, as learning data, data including the word, the segment division information obtained by dividing the word into segments, and the phonetic symbol for each segment, and further storing, when the word is formed by connecting a plurality of character strings each of which can be regarded as one word, segment division information obtained by dividing the word into segments corresponding to the character strings and a phonetic symbol for each segment corresponding to the character strings;
an occurrence probability calculation step of generating, from the segment division information and the phonetic symbol for each segment acquired in the learning data acquisition step, a phoneme segment pair that is a pair of a segment name and the phonetic symbol corresponding to that segment name, and calculating an occurrence probability based on the frequency with which the phoneme segment pair appears in the learning data;
a connection probability calculation step of calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs connected within a word, appears in the learning data; and
a text phonetic symbol conversion dictionary saving step of saving a text phonetic symbol conversion dictionary including the occurrence probability for each phoneme segment pair calculated in the occurrence probability calculation step and the connection probability for each connected phoneme segment pair sequence calculated in the connection probability calculation step.

A recognition vocabulary dictionary creation program for causing a computer to create a recognition vocabulary dictionary used for recognizing speech, the program comprising:
a text phonetic symbol conversion dictionary acquisition step of acquiring a text phonetic symbol conversion dictionary from text phonetic symbol conversion dictionary storage means that stores the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation program according to claim 4;
a recognition vocabulary dictionary storage step of storing a recognition vocabulary dictionary in which recognition vocabulary to be subjected to speech recognition is registered;
a recognition vocabulary acquisition step of acquiring the recognition vocabulary registered in the recognition vocabulary dictionary;
a segment sequence generation step of dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary acquired in the text phonetic symbol conversion dictionary acquisition step, and generating a segment sequence that is a sequence of segments;
an accumulated cost calculation step of generating phoneme segment pair sequences, each being a sequence of phoneme segment pairs, from the segment sequence generated in the segment sequence generation step, and calculating an accumulated cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary;
a speech candidate selection step of selecting, as phoneme segment pair sequence candidates, higher-ranked phoneme segment pair sequences from among the phoneme segment pair sequences based on the accumulated cost calculated in the accumulated cost calculation step; and
a speech candidate registration step of registering, in the recognition vocabulary dictionary, the sequence of phonetic symbols corresponding to each phoneme segment pair sequence candidate selected in the speech candidate selection step,
wherein the accumulated cost calculation step includes a step of calculating the accumulated cost based on the occurrence probabilities of the phoneme segment pairs in the phoneme segment pair sequence and the connection probabilities of the connected phoneme segment pair sequences.

A speech recognition program that causes a computer to execute processing including a step of recognizing speech based on a recognized vocabulary dictionary created by the recognized vocabulary dictionary creating program according to claim 5.

A text phonetic symbol conversion dictionary creation method for creating a text phonetic symbol conversion dictionary used for conversion from text to phonetic symbols,
A learning data acquisition step of acquiring the word, the segment division information, and the phonetic symbol for each segment from learning data storage means that stores, as learning data, data including a word, segment division information obtained by dividing the word into segments, and a phonetic symbol for each segment, and that, when the word is a word obtained by connecting a plurality of character strings that can be regarded as one word, further stores segment division information obtained by dividing the connected word into segments corresponding to the character strings, and a phonetic symbol for each segment corresponding to the character strings;
An occurrence probability calculating step of generating, from the segment division information and the phonetic symbol for each segment acquired in the learning data acquisition step, a phoneme segment pair that is a set of a segment name and the phonetic symbol corresponding to that segment name, and calculating an occurrence probability based on the frequency with which the phoneme segment pair appears in the learning data;
A connection probability calculating step of calculating a connection probability based on the frequency with which a connected phoneme segment pair sequence, which is a sequence of phoneme segment pairs connected within a word, appears in the learning data;
A text phonetic symbol conversion dictionary saving step of saving a text phonetic symbol conversion dictionary that includes the occurrence probability for each phoneme segment pair calculated in the occurrence probability calculating step and the connection probability for each connected phoneme segment pair sequence calculated in the connection probability calculating step,
wherein the text phonetic symbol conversion dictionary creation method comprises the foregoing steps.
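The occurrence probability and connection probability calculating steps can be sketched as relative-frequency estimates. This is only an illustration under stated assumptions: the claim says the probabilities are "based on" frequencies in the learning data but does not fix a formula, so the sketch assumes that the occurrence probability is the count of a phoneme segment pair divided by the total number of pairs, and that a connected phoneme segment pair sequence is a pair of adjacent phoneme segment pairs within one word. The function name, the data layout, and the sample entries are hypothetical.

```python
from collections import Counter


def build_conversion_dictionary(learning_data):
    """Estimate occurrence and connection probabilities by relative frequency.

    `learning_data` is a list of (word, [(segment, phonetic_symbol), ...]);
    this layout is an assumption made for illustration.
    """
    pair_counts = Counter()    # how often each phoneme segment pair appears
    bigram_counts = Counter()  # how often two pairs appear adjacently in a word
    left_counts = Counter()    # how often a pair appears with a successor
    for _word, pairs in learning_data:
        pair_counts.update(pairs)
        for left, right in zip(pairs, pairs[1:]):
            bigram_counts[(left, right)] += 1
            left_counts[left] += 1
    total = sum(pair_counts.values())
    occurrence = {p: c / total for p, c in pair_counts.items()}
    connection = {
        (left, right): c / left_counts[left]
        for (left, right), c in bigram_counts.items()
    }
    return {"occurrence": occurrence, "connection": connection}


# Toy learning data with made-up phonetic symbols.
learning_data = [
    ("phone", [("ph", "f"), ("o", "ou"), ("ne", "n")]),
    ("photo", [("ph", "f"), ("o", "ou"), ("t", "t"), ("o", "ou")]),
    ("pin", [("p", "p"), ("i", "i"), ("n", "n")]),
]
dictionary = build_conversion_dictionary(learning_data)
```

The returned mapping plays the role of the saved text phonetic symbol conversion dictionary; a practical implementation would likely add smoothing for unseen pairs, which the claim does not address.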

A recognition vocabulary dictionary creating method for creating a recognition vocabulary dictionary used for recognizing speech,
A text phonetic symbol conversion dictionary acquisition step of acquiring the text phonetic symbol conversion dictionary from text phonetic symbol conversion dictionary storage means that stores the text phonetic symbol conversion dictionary created by the text phonetic symbol conversion dictionary creation method according to claim 7;
A recognition vocabulary dictionary storage step for storing a recognition vocabulary dictionary in which recognition vocabulary to be recognized by speech recognition is registered;
A recognition vocabulary acquisition step of acquiring a recognition vocabulary registered in the recognition vocabulary dictionary;
A segment sequence generation step of dividing the acquired recognition vocabulary into segments by referring to the text phonetic symbol conversion dictionary acquired in the text phonetic symbol conversion dictionary acquisition step, and generating a segment sequence which is a sequence of segments;
An accumulated cost calculating step of generating, from the segment sequence generated in the segment sequence generation step, a phoneme segment pair sequence that is a sequence of phoneme segment pairs, and calculating an accumulated cost for each phoneme segment pair sequence by referring to the text phonetic symbol conversion dictionary;
A speech candidate selection step of selecting, as a phoneme segment pair sequence candidate, a top-ranked phoneme segment pair sequence from among the phoneme segment pair sequences, based on the accumulated cost calculated in the accumulated cost calculating step;
A speech candidate registration step of registering, in the recognition vocabulary dictionary, a sequence of phonetic symbols corresponding to the phoneme segment pair sequence candidate selected in the speech candidate selection step,
wherein the accumulated cost calculating step of the recognition vocabulary dictionary creating method calculates the accumulated cost based on the occurrence probability of the phoneme segment pair in the phoneme segment pair sequence and the connection probability of the connected phoneme segment pair sequence.
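The accumulated cost calculating step and the speech candidate selection step can be illustrated as follows. The claim states only that the accumulated cost is based on the occurrence probabilities and connection probabilities, so this sketch makes several assumptions: the cost is the sum of negative log probabilities, a lower cost means a higher rank, and unseen pairs or connections receive a small floor probability. `FLOOR`, the function names, and the sample values are hypothetical.

```python
import heapq
import math

FLOOR = 1e-6  # assumed fallback probability for unseen pairs or connections


def accumulated_cost(pair_sequence, occurrence, connection):
    """Sum of negative log probabilities along one phoneme segment pair sequence."""
    cost = 0.0
    for pair in pair_sequence:
        cost -= math.log(occurrence.get(pair, FLOOR))
    for left, right in zip(pair_sequence, pair_sequence[1:]):
        cost -= math.log(connection.get((left, right), FLOOR))
    return cost


def select_candidates(pair_sequences, occurrence, connection, top_n=1):
    """Return the phonetic symbol strings of the `top_n` lowest-cost sequences."""
    ranked = heapq.nsmallest(
        top_n,
        pair_sequences,
        key=lambda seq: accumulated_cost(seq, occurrence, connection),
    )
    return [" ".join(symbol for _segment, symbol in seq) for seq in ranked]


# Illustrative probabilities (assumed values, not taken from the patent).
occurrence = {
    ("ph", "f"): 0.2, ("p", "p"): 0.1, ("h", "h"): 0.05,
    ("o", "ou"): 0.3, ("ne", "n"): 0.1,
}
connection = {
    (("ph", "f"), ("o", "ou")): 1.0,
    (("o", "ou"), ("ne", "n")): 0.5,
}
candidates = [
    [("ph", "f"), ("o", "ou"), ("ne", "n")],
    [("p", "p"), ("h", "h"), ("o", "ou"), ("ne", "n")],
]
best = select_candidates(candidates, occurrence, connection, top_n=1)
```

The strings in `best` correspond to the sequences of phonetic symbols that the speech candidate registration step would register in the recognition vocabulary dictionary. Using log costs rather than multiplying raw probabilities is a common design choice to avoid numerical underflow on long sequences, not something the claim requires.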

A speech recognition method comprising a step of recognizing speech based on a recognition vocabulary dictionary created by the recognition vocabulary dictionary creating method according to claim 8.