JP2003295888A

JP2003295888A - Speech recognition device and program

Info

Publication number: JP2003295888A
Application number: JP2002102287A
Authority: JP
Inventors: Jun Ishii; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2002-04-04
Filing date: 2002-04-04
Publication date: 2003-10-15

Abstract

PROBLEM TO BE SOLVED: To obtain a speech recognition device, and a program therefor, that allow symbols such as punctuation marks to be entered by voice in a voice-driven document input device.

SOLUTION: The device acoustically analyzes input speech and comprises symbol character grammar storage means 101, which stores a symbol character grammar giving the rules for connecting words and symbol characters, and collating means 1102, which selects an acoustic pattern from among word acoustic patterns and symbol character acoustic patterns according to the symbol character grammar, collates that acoustic pattern with the speech feature quantity, and outputs the word or symbol character of the matching acoustic pattern as the speech recognition result for the input speech.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device and a speech recognition program that recognize speech uttered by a person (user) and display the recognition result.

[0002]

[Description of the Related Art] Speech recognition is a technology by which a machine automatically extracts linguistic information from human speech, and it is highly practical for uses such as creating documents by voice input. Speech recognition devices are described in detail in, for example, Chapter 5 of "Speech Information Processing" by Sadaoki Furui, Morikita Publishing, June 1998 (hereinafter Reference 1); "Fundamentals of Speech Recognition" by L. Rabiner and B. H. Juang, translation supervised by Sadaoki Furui, NTT Advanced Technology, November 1995 (hereinafter Reference 2); and "Speech Recognition by Probabilistic Models" by Seiichi Nakagawa, Corona Publishing, July 1988 (hereinafter Reference 3).

[0003] The speech recognition device described in Reference 1 is explained below as a conventional example, with reference to the block diagram shown in FIG. 14.

[0004] In FIG. 14, 1001 is a user of the speech recognition device, and 1002 is the input speech uttered by the user 1001. 1003 is speech recognition means, which extracts a speech feature quantity from the input speech 1002 and performs collation using standard patterns of predetermined recognition targets such as words and sentences. 1004 is the speech recognition result of the collation by the speech recognition means 1003 and is normally output as text. 1005 is speech recognition display means, which displays the speech recognition result 1004 output by the speech recognition means 1003.

[0005] FIG. 15 is a block diagram for explaining the internals of the speech recognition means 1003. In FIG. 15, 1101 is speech feature quantity extraction means, which A/D-converts the signal of the input speech 1002, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract the speech feature quantity. 1105 is a recognition target word dictionary, which stores the notation and reading of predetermined recognition target words. 1106 is a standard pattern table, which stores the standard patterns of subword units together with their labels. 1104 is word acoustic pattern generation means, which converts the reading of each recognition target word stored in the recognition target word dictionary 1105 into labels of subword speech units, extracts from the standard pattern table 1106 the standard patterns of the subword speech units corresponding to those labels, and concatenates these standard patterns in the order of the subword speech units that make up the reading of the word, thereby generating the acoustic pattern of the recognition target word. 1103 is a normal grammar, which stores the rules for connecting words to one another. The normal grammar 1103 contains only connection rules between words; it contains no connection rules between words and symbol characters such as "、" and "。". 1102 is collating means, which, in accordance with the normal grammar 1103, selects an acoustic pattern from the acoustic patterns of the recognition target words generated by the word acoustic pattern generation means 1104, collates it with the speech feature quantity output from the speech feature quantity extraction means 1101, and outputs the word corresponding to the matching acoustic pattern, and the document formed by concatenating such words, as the speech recognition result 1004. In the description that follows, saying that the collating means 1102 collates the speech feature quantity with the acoustic patterns of the recognition target words in accordance with a grammar means that the collating means 1102 narrows down the recognition target words based on the word context defined by that grammar and thereby selects the acoustic pattern of the recognition target word to be collated next with the speech feature quantity.
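
The framing step performed by the speech feature quantity extraction means 1101 can be illustrated with a minimal sketch. It assumes non-overlapping 10 ms frames and uses log energy as a stand-in for the acoustic analysis; the function names, the frame length, and the choice of feature are illustrative assumptions and are not taken from the patent or from Reference 1.

```python
import math

def frame_signal(samples, sample_rate_hz, frame_ms=10):
    """Cut digitized samples into consecutive fixed-length frames.

    frame_ms is an assumed value inside the 5 to 20 ms range named in the text.
    A trailing partial frame is discarded.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """Stand-in acoustic analysis: log of the summed squared amplitude."""
    energy = sum(x * x for x in frame)
    return math.log(energy + 1e-10)  # small offset avoids log(0) on silence

def extract_features(samples, sample_rate_hz, frame_ms=10):
    """Return one feature value per frame."""
    return [log_energy(f) for f in frame_signal(samples, sample_rate_hz, frame_ms)]
```

A real recognizer would normally compute a multi-dimensional feature vector per frame; a single value is used here only to keep the sketch short.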

[0006] The operation is described next. When the input speech 1002 is input, the speech feature quantity extraction means 1101 performs acoustic analysis and extracts the speech feature quantity of the input speech 1002. The word acoustic pattern generation means 1104 generates the recognition target word acoustic patterns from the recognition target word dictionary 1105 and the standard pattern table 1106. The collating means 1102 then uses the normal grammar 1103 to collate the speech feature quantity with the recognition target word acoustic patterns generated by the word acoustic pattern generation means 1104, and outputs the speech recognition result 1004.

[0007] As described above, the conventional speech recognition device uses the normal grammar to collate the speech feature quantity of the input speech uttered by the user with predetermined recognition target words, and recognizes and displays the uttered speech.

[0008]

[Problems to Be Solved by the Invention] When creating a document, symbol characters such as the touten "、" (Japanese comma) and the kuten "。" (Japanese full stop) must also be used where appropriate. With document input by a conventional speech recognition device, however, even if the user utters "touten" in order to enter such a symbol character, the recognition result is the word "読点" (touten, the name of the mark) and the character "、" itself is not recognized. One conceivable approach is to register "、" in advance as a recognition result for the utterance "touten", but the user must then select the best word from a list of homophone candidates such as "、", "読点", and "当店" (touten, "our shop"), which makes the selection operation cumbersome. The present invention solves the above problems, and its object is to obtain a speech recognition device and a speech recognition program with which symbol characters can be entered easily.

[0009]

[Means for Solving the Problems] A speech recognition device according to the present invention comprises: speech feature quantity extraction means for acoustically analyzing input speech and extracting a speech feature quantity representing the features of the speech; word acoustic pattern generation means for generating, from the standard patterns stored in a standard pattern table for obtaining acoustic scores in speech recognition and from a word stored in a recognition target word dictionary, a word acoustic pattern for that word; symbol character acoustic pattern generation means for generating, from a symbol character stored in a dictionary of symbol character notations and readings and from the standard patterns, a symbol character acoustic pattern for that symbol character; symbol character grammar storage means for storing a symbol character grammar, which is the set of rules for connecting words and symbol characters; and collating means for selecting an acoustic pattern from the word acoustic patterns and the symbol character acoustic patterns in accordance with the symbol character grammar, collating that acoustic pattern with the speech feature quantity, and outputting the word or symbol character of the matching acoustic pattern as the speech recognition result for the input speech.

[0010] In the speech recognition device according to the present invention, the device further comprises normal grammar storage means for storing a normal grammar, which is the set of rules for connecting words to one another, and a grammar changeover switch for selecting either the normal grammar or the symbol character grammar; when the grammar changeover switch has selected the normal grammar, the collating means selects an acoustic pattern from the word acoustic patterns in accordance with the normal grammar, collates it with the speech feature quantity, and outputs the word of the matching acoustic pattern as a speech recognition result for the input speech that contains no symbol characters.

[0011] A speech recognition device according to the present invention also comprises: speech feature quantity extraction means for acoustically analyzing input speech and extracting a speech feature quantity representing the features of the speech; word acoustic pattern generation means for generating, from standard patterns for obtaining acoustic scores in speech recognition and from a word stored in a recognition target word dictionary, a word acoustic pattern for that word; normal grammar storage means for storing a normal grammar, which is the set of rules for connecting words to one another; collating means for collating an acoustic pattern selected from the word acoustic patterns in accordance with the normal grammar with the speech feature quantity, and outputting the word of the matching acoustic pattern as an intermediate speech recognition result for the input speech; symbol character grammar storage means for storing a symbol character grammar, which is the set of rules for connecting words and symbol characters; and symbol character insertion means for inserting symbol characters into the intermediate speech recognition result in accordance with the symbol character grammar to generate a document containing symbol characters, and outputting that document.

[0012] In the speech recognition device according to the present invention, the speech feature quantity extraction means acoustically analyzes input speech that includes utterances of symbol characters and extracts the speech feature quantity of those utterances; the collating means outputs an intermediate speech recognition result that includes the collation result portions for the utterances of symbol characters; and the symbol character insertion means identifies those collation result portions in accordance with the symbol character grammar and replaces them with symbol characters to generate a document containing symbol characters.
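
The replacement described in this paragraph can be sketched as follows. This is only an illustration under strong assumptions: the intermediate recognition result is modeled as a list of romanized tokens, and the symbol character grammar is reduced to a hypothetical table stating after which token each symbol may appear. The patent does not specify these data structures, and the names below are invented for the example.

```python
# Readings of symbol characters (the readings "kuteN" and "tooteN" appear in
# paragraph [0026]; the token representation itself is an assumption).
SYMBOL_READINGS = {"kuteN": "。", "tooteN": "、"}

# Hypothetical stand-in for the symbol character grammar:
# symbol -> set of tokens after which that symbol is permitted.
SYMBOL_ALLOWED_AFTER = {"、": {"wa"}, "。": {"hare"}}

def replace_spoken_symbols(tokens, symbol_readings, symbol_allowed_after):
    """Replace a spoken symbol reading with its symbol where the grammar allows it."""
    output = []
    previous = "<s>"  # sentence-start marker
    for token in tokens:
        symbol = symbol_readings.get(token)
        if symbol is not None and previous in symbol_allowed_after.get(symbol, set()):
            output.append(symbol)
            previous = symbol
        else:
            # Not a symbol reading, or the grammar does not permit a symbol here,
            # so the token is kept as an ordinary word.
            output.append(token)
            previous = token
    return output
```

With these toy tables, the sequence `["asu", "wa", "tooteN", "hare", "kuteN"]` becomes `["asu", "wa", "、", "hare", "。"]`, while a sentence-initial "tooteN" is left as a word, which mirrors the homophone problem described in paragraph [0008].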

[0013] In the speech recognition device according to the present invention, the speech feature quantity extraction means acoustically analyzes input speech that includes pauses of a fixed duration and extracts a speech feature quantity that includes information on the pause positions; and the symbol character insertion means identifies the pause positions in accordance with the symbol character grammar and inserts symbol characters at those positions to generate a document containing symbol characters.

[0014] In the speech recognition device according to the present invention, the speech feature quantity extraction means extracts prosodic information from the input speech; the symbol character grammar storage means stores, in the symbol character grammar, the associations between prosodic information and symbol characters; and the symbol character insertion means inserts symbol characters into the intermediate speech recognition result in accordance with the prosodic information extracted by the speech feature quantity extraction means and the symbol character grammar, to generate a document containing symbol characters.

[0015] The speech recognition device according to the present invention further comprises normal grammar storage means for storing a normal grammar, which is the set of rules for connecting words to one another, and a grammar changeover switch for selecting either the normal grammar or the symbol character grammar; when the grammar changeover switch has selected the normal grammar, the symbol character insertion means outputs the intermediate speech recognition result as a document containing no symbol characters.

[0016] A speech recognition program according to the present invention causes a computer to execute: a speech feature quantity extraction procedure for acoustically analyzing input speech and extracting a speech feature quantity representing the features of the speech; a word acoustic pattern generation procedure for generating, from the standard patterns stored in a standard pattern table for obtaining acoustic scores in speech recognition and from a word stored in a recognition target word dictionary, a word acoustic pattern of that word; a symbol character acoustic pattern generation procedure for generating, from a symbol character stored in a dictionary of symbol character notations and readings and from the standard patterns, a symbol character acoustic pattern of that symbol character; a symbol character grammar storage procedure for storing a symbol character grammar, which is the set of rules for connecting words and symbol characters; and a collating procedure for selecting an acoustic pattern from the word acoustic patterns and the symbol character acoustic patterns in accordance with the symbol character grammar, collating that acoustic pattern with the speech feature quantity, and outputting the word or symbol character of the matching acoustic pattern as the speech recognition result for the input speech.

[0017] The speech recognition program according to the present invention further causes the computer to execute a normal grammar storage procedure for storing a normal grammar, which is the set of rules for connecting words to one another, and a grammar changeover procedure for selecting one of the normal grammar and the symbol character grammar; when the grammar changeover switch has selected the normal grammar, the collating procedure selects an acoustic pattern from the word acoustic patterns in accordance with the normal grammar, collates it with the speech feature quantity, and outputs the word of the matching acoustic pattern as a speech recognition result for the input speech that contains no symbol characters.

[0018] A speech recognition program according to the present invention also causes a computer to execute: a speech feature quantity extraction procedure for acoustically analyzing input speech and extracting a speech feature quantity representing the features of the speech; a word acoustic pattern generation procedure for generating, from standard patterns for obtaining acoustic scores in speech recognition and from a word stored in a recognition target word dictionary, a word acoustic pattern of that word; a normal grammar storage procedure for storing a normal grammar, which is the set of rules for connecting words to one another; a collating procedure for collating an acoustic pattern selected from the word acoustic patterns in accordance with the normal grammar with the speech feature quantity, and outputting the word of the matching acoustic pattern as an intermediate speech recognition result for the input speech; a symbol character grammar storage procedure for storing a symbol character grammar, which is the set of rules for connecting words and symbol characters; and a symbol character insertion procedure for inserting symbol characters into the intermediate speech recognition result in accordance with the symbol character grammar to generate a document containing symbol characters, and outputting that document.

[0019] In the speech recognition program according to the present invention, the speech feature quantity extraction procedure acoustically analyzes input speech that includes utterances of symbol characters and extracts the speech feature quantity of those utterances; the collating procedure outputs an intermediate speech recognition result that includes the collation result portions for the utterances of symbol characters; and the symbol character insertion procedure identifies those collation result portions in accordance with the symbol character grammar and replaces them with symbol characters to generate a document containing symbol characters.

[0020] In the speech recognition program according to the present invention, the speech feature quantity extraction procedure acoustically analyzes input speech that includes pauses of a fixed duration and extracts a speech feature quantity that includes information on the pause positions; and the symbol character insertion procedure identifies the pause positions in accordance with the symbol character grammar and inserts symbol characters at those positions to generate a document containing symbol characters.

[0021] In the speech recognition program according to the present invention, the speech feature quantity extraction procedure extracts prosodic information from the input speech; the symbol character grammar storage procedure stores, in the symbol character grammar, the associations between prosodic information and symbol characters; and the symbol character insertion procedure inserts symbol characters into the intermediate speech recognition result in accordance with the prosodic information extracted by the speech feature extraction means and the symbol character grammar, to generate a document containing symbol characters.

[0022] The speech recognition program according to the present invention further causes the computer to execute a normal grammar storage procedure for storing a normal grammar, which is the set of rules for connecting words to one another, and a grammar changeover procedure for selecting one of the normal grammar and the symbol character grammar; when the grammar changeover procedure has selected the normal grammar, the symbol character insertion procedure outputs the intermediate speech recognition result as a document containing no symbol characters.

[0023]

DETAILED DESCRIPTION OF THE INVENTION

[Embodiment 1] FIG. 1 is a block diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. In FIG. 1, 101 is a symbol character grammar, which stores the connection rules for words and symbol characters. These connection rules are expressed by, for example, the finite-state automaton described on p. 21 in Chapter 2 of Reference 3, or by a statistical language model such as an N-gram or an HMM (Hidden Markov Model). Statistical language models are introduced in detail in "Statistical Language Models" by Kenji Kita, University of Tokyo Press, November 1999 (hereinafter Reference 4).

[0024] 102 is a symbol character input selection signal, which is transmitted when the user wants the speech recognition means to accept symbol character input. For example, when the user presses a specific key (for example, the Shift key of the keyboard) on the terminal used for document creation in order to select symbol character input, the symbol character input selection signal 102 is generated and transmitted to the grammar changeover switch 103. Alternatively, the configuration may be such that the symbol character input selection signal 102 is generated and transmitted to the grammar changeover switch 103 when the specific key is released (no longer pressed), and is not transmitted when the key is pressed.

[0025] 103 is a grammar changeover switch. When symbol character input is selected and the symbol character input selection signal 102 is transmitted, the grammar changeover switch 103 in FIG. 1 is connected to contact B, so the grammar used by the collating means 1102 is the symbol character grammar 101. While the user is not performing symbol character input and the symbol character input selection signal 102 is not being transmitted, the grammar changeover switch 103 is connected to contact A, so the grammar used by the collating means 1102 is the normal grammar 1103.
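
The effect of the grammar changeover switch on the collating means can be sketched as below. The two grammars are modeled as hypothetical tables that map a previous token to the set of tokens allowed next, which is one simple way to express the "narrowing by context" described in paragraph [0005]. The table contents, the romanized tokens, and the function names are assumptions made for illustration only.

```python
# Hypothetical toy grammars: previous token -> tokens allowed next.
# The normal grammar holds word-to-word rules only.
NORMAL_GRAMMAR = {
    "<s>": {"asu", "kyoo"},
    "asu": {"wa"},
    "kyoo": {"wa"},
    "wa": {"hare"},
    "hare": set(),
}

# The symbol character grammar additionally connects words and symbol characters.
SYMBOL_GRAMMAR = {
    "<s>": {"asu", "kyoo"},
    "asu": {"wa"},
    "kyoo": {"wa"},
    "wa": {"hare", "、"},
    "、": {"hare"},
    "hare": {"。"},
    "。": set(),
}

def select_grammar(symbol_input_selected):
    """Contact B (symbol character grammar) while the selection signal is present,
    contact A (normal grammar) otherwise."""
    return SYMBOL_GRAMMAR if symbol_input_selected else NORMAL_GRAMMAR

def allowed_next(grammar, previous_token):
    """Candidates whose acoustic patterns would be collated next."""
    return grammar.get(previous_token, set())
```

In this sketch, a symbol such as "。" becomes a candidate after "hare" only while the selection signal is present, so its acoustic pattern is never collated during ordinary dictation.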

[0026] 105 is a dictionary of symbol character notations and readings, which stores the notations and readings of predetermined symbol characters, for example as shown in FIG. 4. Here, symbol characters are characters that are not normally read aloud, such as "。", "、", and "？". It is advisable to assign in advance, as the readings in the dictionary 105, readings that the user can easily associate with each symbol. In this embodiment, the reading of "。" is "kuteN" and the reading of "、" is "tooteN". A single reading can also be assigned to a string of several symbol characters; for example, the string "(^_^)" could be registered with the reading "kao" (face).
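
As a minimal sketch, the dictionary of symbol character notations and readings can be held as notation and reading pairs and indexed by reading. The three entries are the ones named in the text; the list-of-pairs layout and the function name are assumptions, since FIG. 4 is not reproduced here.

```python
# (notation, reading) pairs taken from the examples in the text.
SYMBOL_DICTIONARY = [
    ("。", "kuteN"),
    ("、", "tooteN"),
    ("(^_^)", "kao"),  # one reading assigned to a string of several symbols
]

def build_reading_index(entries):
    """Map each reading to its notation, rejecting ambiguous readings."""
    index = {}
    for notation, reading in entries:
        if reading in index:
            raise ValueError(f"duplicate reading: {reading}")
        index[reading] = notation
    return index
```

Rejecting duplicate readings is a design choice of this sketch, not a requirement stated in the patent.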

[0027] 104 is symbol character acoustic pattern generation means, which generates symbol character acoustic patterns from the symbol character readings stored in the dictionary 105 of symbol character notations and readings and from the standard pattern table 1106. In the following, functional blocks identical to those in FIGS. 14 and 15, which illustrate the prior art, are given the same reference numerals and their description is omitted.

[0028] The operation is described next with reference to FIG. 2. First, in step ST101, the word acoustic pattern generation means 1104 generates word acoustic patterns using the recognition target word dictionary 1105 and the standard pattern table 1106. The recognition target word dictionary 1105 stores the notation and reading of predetermined recognition target words. Here, the notation is the text written in kanji, kana, digits, and so on, and the reading is the pronunciation used when that text is read aloud, expressed in phoneme symbols or the like. For example, when news speech is the recognition target, the dictionary stores pairs of notation and reading as shown in FIG. 3 (e.g. the reading "koQkaigiiN" for the notation "国会議員", member of the Diet). The standard pattern table 1106 stores the standard patterns used to obtain acoustic scores in speech recognition. The standard patterns are, for example, patterns of subword speech units [λl1, λl2, ..., λlM] (where l1, l2, ..., lM are label names and M is the total number of labels), for which HMMs (Hidden Markov Models) whose parameters have been trained on normally uttered speech data from many speakers are used.

【００２９】単語音響パタンの生成方法について、サブ
ワード音声単位（音素や音節などの音声片単位）の標準
パタンを用いた場合を例にして説明する。まず単語音響
パタン生成手段１１０４は、認識対象単語辞書１１０５
が記憶している認識対象単語の読み[wr(1), wr(2),
..., wr(N)](括弧内は単語番号)をサブワード音声単位
のラベル表記へ変換する。次に単語音響パタン生成手段
１１０４は、標準パタンテーブル１１０６に格納されて
いる標準パタンから、前記サブワード音声単位のラベル
表記に対応したサブワード音声単位の標準パタンを選択
して、単語内でそのラベル表記が出現する順に各標準パ
タンを連結し、単語音響パタン[Λ(1), Λ(2), ... ,
Λ(N)]（括弧内は単語番号）を生成する。The method of generating word acoustic patterns is described using, as an example, standard patterns of subword speech units (speech-segment units such as phonemes or syllables). First, the word acoustic pattern generation means 1104 converts the readings [wr(1), wr(2), ..., wr(N)] of the recognition target words stored in the recognition target word dictionary 1105 (the number in parentheses is the word number) into label notation of subword speech units. Next, the word acoustic pattern generation means 1104 selects, from the standard patterns stored in the standard pattern table 1106, the subword standard patterns corresponding to those labels and concatenates them in the order in which the labels appear within each word, thereby generating the word acoustic patterns [Λ(1), Λ(2), ..., Λ(N)] (the number in parentheses is the word number).
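The concatenation step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: each standard pattern is represented by a plain list of state names, and the names `build_word_pattern` and `StandardPattern` are hypothetical.

```python
from typing import Dict, List

# Assumption: a "standard pattern" is modelled as a list of HMM state names.
StandardPattern = List[str]


def build_word_pattern(labels: List[str],
                       table: Dict[str, StandardPattern]) -> StandardPattern:
    """Concatenate the standard patterns of the subword labels, in label order."""
    pattern: StandardPattern = []
    for label in labels:
        if label not in table:
            raise KeyError(f"no standard pattern for label {label!r}")
        pattern.extend(table[label])
    return pattern


# Toy standard pattern table (hypothetical contents).
toy_table = {"a": ["a1", "a2"], "s": ["s1"], "u": ["u1"]}
word_pattern = build_word_pattern(["a", "s", "u"], toy_table)
```

Running the same function over every reading in the dictionary would yield the list of word acoustic patterns, one per word number.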

【００３０】単語音響パタンΛ(n)の生成方法につい
て、前後環境依存の音素をサブワード音声単位とした標
準パタンを例に説明する。認識対象単語辞書１１０５が
記憶するｎ番目の認識対象単語として、「明日（as
u）」という単語が存在するものとする。この場合、
「明日」の前後に単語を接続する連続単語音声認識にお
いては、「明日」は音素系列で/$asu*/と表される。な
お、$は先行単語の最後の音素、*は後続単語の先頭音素
である。この例でサブワード音声単位のラベルは、(a)
中心音素が/a/であり、先行音素が先行単語の最後の音
素$、後続音素が/s/のラベル{$as}と、(b)中心音素が/s
/であり、先行音素が/a/、後続音素が/u/であるラベル
{asu}と、(c)中心音素が/u/であり、先行音素が/s/、後
続音素が後続単語の先頭音素/*/であるラベル{su*}とに
変換できる。このサブワード音声単位ラベルに対応する
標準パタン λ$as、λasu、λsu* を標準パタンテーブ
ル１１０６から抽出し、これらを連結した標準パタンΛ
(n)が単語「明日」の標準パタンとなる。最近では、前
後音素環境依存の音素のサブワード音声単位標準パタン
を用い、認識対象単語が数万単語以上の音声認識システ
ムの検討が行われている。The method of generating a word acoustic pattern Λ(n) is described using, as an example, standard patterns whose subword speech units are context-dependent phonemes. Suppose that the n-th recognition target word stored in the recognition target word dictionary 1105 is the word "明日 (asu)" ("tomorrow"). In continuous word speech recognition, where other words are connected before and after "明日", the word is represented by the phoneme sequence /$asu*/, where $ is the last phoneme of the preceding word and * is the first phoneme of the following word. In this example the word is converted into three subword labels: (a) the label {$as}, whose central phoneme is /a/, whose preceding phoneme is the last phoneme $ of the preceding word, and whose following phoneme is /s/; (b) the label {asu}, whose central phoneme is /s/, preceding phoneme is /a/, and following phoneme is /u/; and (c) the label {su*}, whose central phoneme is /u/, preceding phoneme is /s/, and following phoneme is the first phoneme * of the following word. The standard patterns λ$as, λasu, and λsu* corresponding to these labels are extracted from the standard pattern table 1106, and the pattern Λ(n) obtained by concatenating them serves as the standard pattern of the word "明日". Recently, speech recognition systems that use such context-dependent phoneme standard patterns and handle tens of thousands of recognition target words or more have been studied.
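The label conversion in this example can be sketched in a few lines. The function name `context_dependent_labels` is hypothetical, and the sketch assumes each phoneme is written as one symbol, with `$` and `*` as the placeholders used in the text.

```python
from typing import List, Sequence


def context_dependent_labels(phonemes: Sequence[str],
                             prev: str = "$",
                             nxt: str = "*") -> List[str]:
    """Return one label per phoneme: preceding + central + following phoneme."""
    padded = [prev] + list(phonemes) + [nxt]
    return ["".join(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


labels_asu = context_dependent_labels("asu")      # ['$as', 'asu', 'su*']
labels_kuten = context_dependent_labels("kuteN")  # ['$ku', 'kut', 'ute', 'teN', 'eN*']
```

The second call reproduces the five labels that the description later derives for the symbol character reading "kuteN".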

【００３１】次にステップＳＴ１０２において、記号文
字音響パタン生成手段１０４は、記号文字の表記と読み
の辞書１０５と標準パタンテーブル１１０６とを用い
て、記号文字音響パタンを生成する。Next, in step ST102, the symbol character acoustic pattern generation means 104 generates symbol character acoustic patterns using the symbol character notation-and-reading dictionary 105 and the standard pattern table 1106.

【００３２】ステップＳＴ１０２における記号文字音響
パタンの生成方法について、サブワード音声単位の標準
パタンを用いた場合を例にして説明する。記号文字音響
パタン生成手段１０４では、記号文字の表記と読みの辞
書１０５で設定されている記号文字の読み[wr'(1), wr'
(2), ..., wr'(M)]（括弧内は単語番号）をサブワード
音声単位のラベル表記へ変換し、標準パタンテーブル１
１０６が記憶する標準パタンからラベルに対応したサブ
ワード音声単位の標準パタンを選択し、記号文字の読み
において各サブワード音声単位が出現する順にその標準
パタンを連結することで記号文字音響パタン[Λ'(1),
Λ'(2), ... , Λ'(M)]（括弧内は単語番号）を生成す
る。The method of generating the symbol character acoustic patterns in step ST102 is described using, as an example, standard patterns of subword speech units. The symbol character acoustic pattern generation means 104 converts the readings [wr'(1), wr'(2), ..., wr'(M)] of the symbol characters registered in the symbol character notation-and-reading dictionary 105 (the number in parentheses is the word number) into label notation of subword speech units, selects the subword standard patterns corresponding to those labels from the standard patterns stored in the standard pattern table 1106, and concatenates them in the order in which the subword speech units appear in the reading of each symbol character, thereby generating the symbol character acoustic patterns [Λ'(1), Λ'(2), ..., Λ'(M)] (the number in parentheses is the word number).

【００３３】記号文字音響パタンΛ'(n)の生成方法につ
いて、前後環境依存の音素をサブワード音声単位とした
標準パタンを例として説明する。記号文字の表記と読み
の辞書１０５が記憶するｎ番目の記号文字として、「。
（kuteN）」が存在するものとする。この場合
は、「。」の前後に単語が接続する連続単語音声認識で
は、「。」は音素系列で/$kuteN*/と表される。なお、$
は先行単語の最後の音素、*は後続単語の先頭音素であ
る。この例でサブワード音声単位のラベルは、(a)中心
音素が/k/であり、先行音素が先行単語の最後の音素$、
後続音素が/u/のラベル{$ku}と、(b)中心音素が/u/であ
り、先行音素が/k/、後続音素が/t/であるラベル{kut}
と、(c)中心音素が/t/であり、先行音素が/u/、後続音
素が/e/ラベル{ute}と、(d)中心音素が/e/であり、先行
音素が/t/、後続音素が/N/のラベル{teN}と、(e)中心音
素が/N/であり、先行音素が/e/、後続音素が後続単語の
先頭音素/*/のラベル{eN*}とに変換できる。このサブワ
ード音声単位ラベルに対応する標準パタン λ$ku、λku
t、λute、λteN、λeN* を標準パタンテーブル１１０
６から抽出し、これらを連結した標準パタンΛ'(n)が記
号文字「。」の標準パタンとなる。The method of generating a symbol character acoustic pattern Λ'(n) is described using, as an example, standard patterns whose subword speech units are context-dependent phonemes. Suppose that the n-th symbol character stored in the symbol character notation-and-reading dictionary 105 is "。 (kuteN)", the Japanese full stop. In continuous word speech recognition, where words are connected before and after "。", the symbol character is represented by the phoneme sequence /$kuteN*/, where $ is the last phoneme of the preceding word and * is the first phoneme of the following word. In this example the symbol character is converted into five subword labels: (a) the label {$ku}, whose central phoneme is /k/, whose preceding phoneme is the last phoneme $ of the preceding word, and whose following phoneme is /u/; (b) the label {kut}, whose central phoneme is /u/, preceding phoneme is /k/, and following phoneme is /t/; (c) the label {ute}, whose central phoneme is /t/, preceding phoneme is /u/, and following phoneme is /e/; (d) the label {teN}, whose central phoneme is /e/, preceding phoneme is /t/, and following phoneme is /N/; and (e) the label {eN*}, whose central phoneme is /N/, preceding phoneme is /e/, and following phoneme is the first phoneme * of the following word. The standard patterns λ$ku, λkut, λute, λteN, and λeN* corresponding to these labels are extracted from the standard pattern table 1106, and the pattern Λ'(n) obtained by concatenating them serves as the standard pattern of the symbol character "。".

【００３４】ステップＳＴ１０３において音声特徴量抽
出手段１１０１は、入力音声１００２の音声信号をＡ／
Ｄ変換し、Ａ／Ｄ変換された信号を５ミリ秒〜２０ミリ
秒程度の一定時間間隔のフレームで切り出し、音響分析
を行って音声特徴量ベクトルO = [o(1), o(2), ... , o
(T)]（Tは総フレーム数）を抽出する。ここで音声特徴
量とは、少ない情報量で音声の特徴を表現できるもので
あり、例えばケプストラムの動的特徴の物理量で構成す
る特徴量ベクトルである。音声特徴量については例えば
文献１の５−２音響処理に詳しく述べられている。In step ST103, the speech feature extraction means 1101 A/D-converts the speech signal of the input speech 1002, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract the speech feature vector sequence O = [o(1), o(2), ..., o(T)] (T is the total number of frames). Here, a speech feature is a quantity that can represent the characteristics of speech with a small amount of information, for example a feature vector composed of physical quantities of the dynamic features of the cepstrum. Speech features are described in detail in, for example, Section 5-2 (acoustic processing) of Document 1.
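The framing part of this step can be sketched as below. The sketch makes several assumptions that the description does not state: a 16 kHz sampling rate in the usage lines, non-overlapping 10 ms frames, and log energy as a stand-in per-frame value (the description mentions cepstrum-based features, which are not computed here). The function names are hypothetical.

```python
import math
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 10.0) -> List[List[float]]:
    """Cut the digitized signal into consecutive frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len])
            for i in range(0, last_start + 1, frame_len)]


def log_energy(frame: Sequence[float]) -> float:
    """Stand-in feature: log of the frame energy (not a cepstral feature)."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# 20 ms of toy signal at an assumed 16 kHz: 10 ms of silence, then 10 ms of ones.
signal = [0.0] * 160 + [1.0] * 160
frames = split_into_frames(signal, 16000, frame_ms=10.0)
features = [log_energy(f) for f in frames]  # one value o(t) per frame
```

A real front end would typically use overlapping windows and a vector per frame; the structure of O as one entry per frame is the point illustrated here.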

【００３５】ステップＳＴ１０４において文法切換スイ
ッチは、記号文字入力選択信号１０２の有無によりユー
ザが記号文字入力が選択したかどうか判定する。ユーザ
が記号文字入力を選択すると、記号文字入力選択信号１
０２が伝送されて、文法切換スイッチ１０３は記号文字
文法を選択する。したがってこの場合は、ステップＳＴ
１０６において、照合手段１１０２は、記号文字文法１
０１を用いて記号文字音響パタンと単語音響パタンから
音響パタンを選択し、この音響パタンと音声特徴量抽出
手段１１０１からの出力である音声特徴量とを照合し
て、一致した音響パタンに対応する記号文字又は認識対
象単語を順次音声認識結果１００４として、テキスト形
式で出力する。In step ST104, the grammar changeover switch 103 determines, from the presence or absence of the symbol character input selection signal 102, whether the user has selected symbol character input. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted and the grammar changeover switch 103 selects the symbol character grammar. In this case, in step ST106, the collating means 1102 uses the symbol character grammar 101 to select acoustic patterns from among the symbol character acoustic patterns and the word acoustic patterns, collates them with the speech features output by the speech feature extraction means 1101, and sequentially outputs the symbol characters or recognition target words corresponding to the matching acoustic patterns as the speech recognition result 1004 in text form.

【００３６】次に、ステップＳＴ１０６における照合の
方法について詳しく説明する。照合手段１１０２は、音
声特徴量抽出手段１１０１からの出力である音声特徴量
Oに対して、数式１によって認識結果の単語系列 W'を抽
出する。数式１において、第一項 P(O｜W) は音響的な
確率である。ここでは、単語音響パタン生成手段１１０
４からの出力である単語音響パタン [Λ(1), Λ(2),
…，Λ(N)]と、記号文字音響パタン生成手段１０４の出
力である記号文字音響パタン[Λ’(1), Λ’(2), …,
Λ’(M)]とを、記号文字文法１０１で定められている単
語および記号文字の接続の規則にしたがって接続するこ
とで単語系列 W を仮定し、確率P(O｜W)を評価する。最
近では音響的な確率を計算するために、ＨＭＭ（Ｈｉｄ
ｄｅｎＭａｒｋｏｖＭｏｄｅｌｓ）を用いることが
多い。なお、音響的な確率の計算方法については、文献
１、文献２、文献３に詳しく紹介されている。Next, the collation method in step ST106 is described in detail. For the speech feature sequence O output by the speech feature extraction means 1101, the collating means 1102 extracts the word sequence W' of the recognition result according to Formula 1. In Formula 1, the first term P(O|W) is the acoustic probability. Here, a word sequence W is hypothesized by connecting the word acoustic patterns [Λ(1), Λ(2), ..., Λ(N)] output by the word acoustic pattern generation means 1104 and the symbol character acoustic patterns [Λ'(1), Λ'(2), ..., Λ'(M)] output by the symbol character acoustic pattern generation means 104 according to the rules for connecting words and symbol characters defined in the symbol character grammar 101, and the probability P(O|W) is evaluated. Recently, HMMs (Hidden Markov Models) are often used to compute the acoustic probability. Methods for computing the acoustic probability are described in detail in Documents 1, 2, and 3.

【００３７】[0037]

【数式１】 [Formula 1]
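The image for Formula 1 is not reproduced in this text. Judging from the surrounding description, which names P(O|W) as its first term and P(W) as its second term and says that it yields the recognized sequence W', it presumably has the standard maximum a posteriori form below. This is a reconstruction under that assumption, not the patent's own typesetting.

```latex
W' = \operatorname*{argmax}_{W} \; P(O \mid W)\, P(W)
```

Here O is the speech feature vector sequence and the maximization runs over the sequences W permitted by the selected grammar.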

【００３８】また第二項P(W) は言語的な確率であっ
て、記号文字文法１０１で定められている単語および記
号文字の接続の規則にしたがって仮定された単語と記号
の系列W の生起確率を表すものである。最近では言語的
な確率を求めるために、単語連鎖の確率を与える統計的
言語モデルを用いることが多い。統計的言語モデルにつ
いては文献４に詳しい。The second term P(W) is the linguistic probability; it represents the occurrence probability of the sequence W of words and symbol characters hypothesized according to the rules for connecting words and symbol characters defined in the symbol character grammar 101. Recently, statistical language models that give word-chain probabilities are often used to obtain the linguistic probability. Statistical language models are described in detail in Document 4.

【００３９】次に、記号文字文法１０１がＮ−ｇｒａｍ
モデルである場合を例にして、記号文字を含む系列の生
起確率を求める方法を説明する。Ｎ−ｇｒａｍモデルと
は、統計的言語モデルの中の一つである。Ｎ−ｇｒａｍ
モデルによるＬ個の単語（数式２で表される）の生起す
る確率（数式３で表される）は、数式５で与えられる。
数式５に示すように、Ｎ−ｇｒａｍは直前の（Ｎ−１）
単語から現在の単語が生起する条件付き確率（数式４で
表される）をパラメータとして保持し、単語列の確率を
計算するものである。このパラメータは、一般に大量の
テキストコーパスで確率を計算しておく。Ｎが２の２−
ｇｒａｍの場合、記号文字文法１０１のパラメータとし
て、例えば条件付き確率Ps(、(tooteN)|私は(watasiw
a))、Ps(。(kuteN)|行く(iku))などが保持されている。Next, taking the case where the symbol character grammar 101 is an N-gram model as an example, the method of obtaining the occurrence probability of a sequence containing symbol characters is described. The N-gram model is one kind of statistical language model. Under the N-gram model, the probability (Formula 3) that a sequence of L words (Formula 2) occurs is given by Formula 5. As Formula 5 shows, an N-gram model holds as parameters the conditional probabilities (Formula 4) that the current word occurs given the immediately preceding (N-1) words, and computes the probability of a word string from them. These parameters are generally estimated in advance from a large text corpus. For a 2-gram (N = 2), the symbol character grammar 101 holds as parameters conditional probabilities such as Ps(、(tooteN) | 私は(watasiwa)) and Ps(。(kuteN) | 行く(iku)).

【００４０】[0040]

【数式２】 [Formula 2]

【００４１】[0041]

【数式３】 [Formula 3]

【００４２】[0042]

【数式４】 [Formula 4]

【００４３】[0043]

【数式５】 [Formula 5]
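The images for Formulas 2 to 5 are not reproduced in this text. From the description (a sequence of L words, its probability, the conditional probability of the current word given the preceding N-1 words, and the resulting product), they presumably take the usual N-gram form below. The symbols w_i and the index convention are assumptions.

```latex
\begin{align}
  w_1^L &= w_1 w_2 \cdots w_L && \text{(Formula 2: sequence of } L \text{ words)} \\
  &P(w_1^L) && \text{(Formula 3: its occurrence probability)} \\
  &P(w_i \mid w_{i-N+1}^{\,i-1}) && \text{(Formula 4: N-gram parameter)} \\
  P(w_1^L) &= \prod_{i=1}^{L} P(w_i \mid w_{i-N+1}^{\,i-1}) && \text{(Formula 5)}
\end{align}
```

For N = 2 the conditioning context reduces to the single preceding word, which gives parameters of the form P(w_i | w_{i-1}).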

【００４４】記号文字を含む系列、例えば「駅へ行く。
(ekie iku kuteN) 」の２−ｇｒａｍによる生起確率に
ついては、次の数式６のように計算する。なお、数式６
において#は文頭と文末を表すものである。The 2-gram occurrence probability of a sequence containing a symbol character, for example "駅へ行く。 (ekie iku kuteN)" ("go to the station."), is computed as in Formula 6 below. In Formula 6, # denotes the beginning and the end of the sentence.
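A 2-gram computation of this kind can be sketched as follows. The probability values are invented for illustration, and the function name `bigram_log_prob` and the dictionary layout are hypothetical; only the use of # as the sentence boundary comes from the text.

```python
import math
from typing import Dict, List, Tuple

# (previous token, current token) -> P(current | previous)
Bigram = Dict[Tuple[str, str], float]


def bigram_log_prob(tokens: List[str], bigram: Bigram,
                    boundary: str = "#") -> float:
    """Log probability of the token sequence, with the boundary at both ends."""
    seq = [boundary] + tokens + [boundary]
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        p = bigram.get((prev, cur), 0.0)
        if p <= 0.0:
            return float("-inf")  # unseen 2-gram: the sequence is not allowed
        total += math.log(p)
    return total


# Hypothetical parameters of a symbol character grammar.
ps = {("#", "駅へ"): 0.5, ("駅へ", "行く"): 0.5,
      ("行く", "。"): 0.5, ("。", "#"): 1.0}
log_p = bigram_log_prob(["駅へ", "行く", "。"], ps)
```

Working in the log domain avoids underflow for long sequences; a practical model would also smooth unseen 2-grams rather than rejecting them outright.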

【００４５】[0045]

【数式６】 [Formula 6]
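The image for Formula 6 is not reproduced in this text. For the example sequence "ekie iku kuteN" with # at both ends, a 2-gram expansion using the symbol character grammar probabilities Ps would read as follows. This is a reconstruction under that assumption; the romanized token names stand for 駅へ, 行く, and 。.

```latex
P(\#\ \text{ekie}\ \text{iku}\ \text{kuteN}\ \#)
  = P_s(\text{ekie} \mid \#)\,
    P_s(\text{iku} \mid \text{ekie})\,
    P_s(\text{kuteN} \mid \text{iku})\,
    P_s(\# \mid \text{kuteN})
```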

【００４６】照合手段１１０２は、数式１によって得ら
れた単語系列 W' のテキスト表記を音声認識結果１００
４として出力する。The collating means 1102 outputs the text notation of the word sequence W' obtained by Formula 1 as the speech recognition result 1004.

【００４７】ユーザ１００１が記号文字入力を選択しな
い状態では、記号文字入力選択信号１０２が伝送されな
いため、ステップＳＴ１０４において文法切換スイッチ
は通常文法を選択する。その結果、ステップＳＴ１０５
において照合手段１１０２は、通常文法１０１を用いて
ステップＳＴ１０１で生成された単語音響パタンから音
響パタンを選択し、この音響パタンと音声特徴量とを照
合し、一致した音響パタンに対応する認識対象単語を順
次音声認識結果１００４として、テキスト形式で出力す
る。When the user 1001 has not selected symbol character input, the symbol character input selection signal 102 is not transmitted, so in step ST104 the grammar changeover switch 103 selects the normal grammar. As a result, in step ST105, the collating means 1102 uses the normal grammar 1103 to select acoustic patterns from the word acoustic patterns generated in step ST101, collates them with the speech features, and sequentially outputs the recognition target words corresponding to the matching acoustic patterns as the speech recognition result 1004 in text form.

【００４８】通常文法１１０３は、認識対象単語辞書１
１０５に格納されている単語間の接続規則が格納されて
いる。単語間の接続の規則は、有限状態オートマトン
や、Ｎ−ｇｒａｍやＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏ
ｖＭｏｄｅｌｓ）のような統計的言語モデルで表現さ
れている。統計的言語モデルの一つであるＮ−ｇｒａｍ
（Ｎ＝２）を用いた場合は、条件付き確率P(駅へ|私
は)、P(行く|駅へ)等がパラメータとして通常文法１１
０３に格納されている。The normal grammar 1103 stores the rules for connecting the words stored in the recognition target word dictionary 1105. These connection rules are expressed by a finite-state automaton or by a statistical language model such as an N-gram or an HMM (Hidden Markov Model). When an N-gram with N = 2, one of the statistical language models, is used, conditional probabilities such as P(駅へ | 私は) and P(行く | 駅へ) are stored in the normal grammar 1103 as parameters.

【００４９】次に、照合の方法について詳しく説明す
る。音声特徴量抽出手段１１０１は音響特徴量を抽出す
るが、この音響特徴量の特徴量ベクトルをOとする。照
合手段１１０２は、この特徴量ベクトル Oに対して数式
１によって認識結果の単語系列W'を抽出する。数式１に
おいて、第一項 P(O｜W) は音響的な確率であり、単語
音響パタン生成手段１１０４からの出力である単語音響
パタン [Λ(1), Λ(2),... , Λ(N)]と通常文法１１０
３で定められている単語間の接続規則にしたがって単語
系列 W を仮定して確率を評価する。音響的な確率を計
算するためにＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭ
ｏｄｅｌｓ）を用いることが多い。Next, the collation method is described in detail. The speech feature extraction means 1101 extracts acoustic features; let O denote the resulting feature vector sequence. For O, the collating means 1102 extracts the word sequence W' of the recognition result according to Formula 1. In Formula 1, the first term P(O|W) is the acoustic probability; it is evaluated by hypothesizing a word sequence W from the word acoustic patterns [Λ(1), Λ(2), ..., Λ(N)] output by the word acoustic pattern generation means 1104 according to the word connection rules defined in the normal grammar 1103. HMMs (Hidden Markov Models) are often used to compute the acoustic probability.

【００５０】また数式１の第二項P(W)は言語的な確率で
あって、通常文法１１０３で定められている単語間の接
続規則にしたがって仮定された単語系列Ｗの確率を表
す。このような言語的な確率を求めるために、単語連鎖
の確率を与える統計的言語モデルを用いることが多い。The second term P(W) of Formula 1 is the linguistic probability; it represents the probability of the word sequence W hypothesized according to the word connection rules defined in the normal grammar 1103. Statistical language models that give word-chain probabilities are often used to obtain this linguistic probability.

【００５１】記号文字を含む系列、例えば「駅へ行く
(ekie iku) 」の２−ｇｒａｍによる生起確率について
は、次の数式７のように計算する。The 2-gram occurrence probability of a sequence, for example "駅へ行く (ekie iku)" ("go to the station"), which here contains no symbol character, is computed as in Formula 7 below.

【００５２】[0052]

【数式７】 [Formula 7]
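The image for Formula 7 is not reproduced in this text. For the sequence "ekie iku" under the normal grammar, and assuming the same # boundary convention as in Formula 6, a 2-gram expansion would read as follows. This is a reconstruction under those assumptions; the romanized token names stand for 駅へ and 行く.

```latex
P(\#\ \text{ekie}\ \text{iku}\ \#)
  = P(\text{ekie} \mid \#)\,
    P(\text{iku} \mid \text{ekie})\,
    P(\# \mid \text{iku})
```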

【００５３】照合手段１１０２は、数式１によって得ら
れた単語系列 W' のテキスト表記を音声認識結果１００
４として出力する。The collating means 1102 outputs the text notation of the word sequence W' obtained by Formula 1 as the speech recognition result 1004.

【００５４】ステップＳＴ１０７において、発声が終了
したかどうかを判断する。発声が終了していない場合
は、ステップＳＴ１０３に戻り処理を繰り返す。一方、
発声が終了した場合は処理を終了する。In step ST107, it is determined whether the utterance has ended. If it has not ended, the process returns to step ST103 and is repeated. If it has ended, the process terminates.

【００５５】本実施の形態は、一つの文単位に記号文字
入力の選択有無をを切り換える場合で説明したが、単語
や記号文字単位で記号文字入力選択信号１０２を切り換
える構成も可能である。例えば「私は、駅へ行く。（wa
tasiwa tooteN ekie iku kuteN）」という発声におい
て、アンダーラインの部分で記号文字入力選択信号１０
２が伝送されたとすると、この場合の生起確率は数式８
によって計算される。数式８において条件付き確率Ps
(・|・)は記号文字文法１０１に格納されているもので
ある。また、条件付き確率P(・|・)が、通常文法１１０
３に格納されているものである。This embodiment has been described for the case where symbol character input is switched on or off sentence by sentence, but a configuration in which the symbol character input selection signal 102 is switched word by word or symbol character by symbol character is also possible. For example, suppose that in the utterance "私は、駅へ行く。 (watasiwa tooteN ekie iku kuteN)" ("I go to the station.") the symbol character input selection signal 102 is transmitted during the underlined portions; the occurrence probability in this case is computed by Formula 8. In Formula 8, the conditional probabilities Ps(・|・) are those stored in the symbol character grammar 101, and the conditional probabilities P(・|・) are those stored in the normal grammar 1103.

【００５６】[0056]

【数式８】 [Formula 8]

【００５７】なお、実施の形態１において、音声特徴量
抽出手段１１０１、照合手段１１０２、単語音響パタン
生成手段１１０４、記号文字音響パタン生成手段１０４
をハードウェアで構成してもよいが、これらの処理を行
う音声認識プログラムを作成し、コンピュータがこの音
声認識プログラムを実行するようにしてもよい。In Embodiment 1, the speech feature extraction means 1101, the collating means 1102, the word acoustic pattern generation means 1104, and the symbol character acoustic pattern generation means 104 may be implemented in hardware; alternatively, a speech recognition program that performs these processes may be created and executed by a computer.

【００５８】以上のように、この実施の形態１における
音声認識装置によれば、記号文字を入力する場合は、ユ
ーザが記号文字入力を選択し記号文字の読みを発声する
ことで記号文字が入力できるので、効率的に文書が作成
できる効果がある。As described above, with the speech recognition device of Embodiment 1, the user can input a symbol character by selecting symbol character input and uttering the reading of the symbol character, so documents can be created efficiently.

【００５９】[0059]

【実施の形態２】図５は、本発明の実施の形態２による
音声認識装置の構成を示すブロック図である。図におい
て２０１は記号文字挿入切換スイッチであって、ユーザ
が記号文字入力を選択すると、記号文字入力選択信号１
０２が伝送され、その結果記号文字挿入切換スイッチ２
０１は接点Ａに接続される。また、ユーザが記号文字入
力を選択せず、記号文字入力選択信号１０２が伝送され
ていない状態は、記号文字入力切換スイッチ２０１は接
点Ｂに接続される。[Embodiment 2] FIG. 5 is a block diagram showing the configuration of a speech recognition device according to Embodiment 2 of the present invention. In the figure, 201 is a symbol character insertion changeover switch. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted and, as a result, the symbol character insertion changeover switch 201 is connected to contact A. When the user has not selected symbol character input and the symbol character input selection signal 102 is not transmitted, the symbol character insertion changeover switch 201 is connected to contact B.

【００６０】２０２は記号文字挿入手段であって、記号
文字文法１０１を用いて、単語列の間の最も適切な位置
に記号文字を挿入する。なお、実施の形態１と同一の機
能ブロックについては同一の符号を付し説明を省略す
る。202 is a symbol character insertion means, which uses the symbol character grammar 101 to insert symbol characters at the most appropriate positions between the words of a word string. Functional blocks identical to those of Embodiment 1 are given the same reference numerals, and their description is omitted.

【００６１】次に図６を用いて、動作について説明す
る。まず、ステップＳＴ２０１において単語音響パタン
生成手段１１０４は、認識対象単語辞書１１０５と標準
パタン１１０６を用いて単語音響パタンを生成する。Next, the operation is described with reference to FIG. 6. First, in step ST201, the word acoustic pattern generation means 1104 generates word acoustic patterns using the recognition target word dictionary 1105 and the standard patterns 1106.

【００６２】ステップＳＴ２０２において音声特徴量抽
出手段１１０１は、入力音声１００２の音声信号をＡ／
Ｄ変換し、Ａ／Ｄ変換された信号を５ミリ秒〜２０ミリ
秒程度の一定時間間隔のフレームで切り出し、音響分析
を行って音声特徴量ベクトルO = [o(1), o(2), ... , o
(T)]（Tは総フレーム数）を抽出する。In step ST202, the speech feature extraction means 1101 A/D-converts the speech signal of the input speech 1002, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract the speech feature vector sequence O = [o(1), o(2), ..., o(T)] (T is the total number of frames).

【００６３】ステップＳＴ２０３において照合手段１１
０２は、通常文法１０１に従ってステップＳＴ２０１で
生成された単語音響パタンと音声特徴量抽出手段１１０
１で抽出された音声特徴量Oとの照合を行い、数式１に
よって記号文字を含まない音声認識結果を中間音声認識
結果Wcとして出力する。In step ST203, the collating means 1102 collates, in accordance with the normal grammar 1103, the word acoustic patterns generated in step ST201 with the speech features O extracted by the speech feature extraction means 1101, and outputs the speech recognition result obtained by Formula 1, which contains no symbol characters, as the intermediate speech recognition result Wc.

【００６４】ステップＳＴ２０４において、記号文字入
力選択信号１０２によりユーザが記号文字入力を選択し
ているかどうかを判定する。ユーザが記号文字入力を選
択すると、記号文字入力選択信号１０２が伝送されるた
め、ステップＳＴ２０４において記号文字挿入切換スイ
ッチ２０１は記号文字入力が選択されたと判断し、接点
Ａに接続する。この場合、次にステップＳＴ２０５にお
いて記号文字挿入手段２０２は、記号文字文法１０１に
従って中間音声認識結果Wcの単語間において、記号文字
を挿入する位置を決定して、記号文字の挿入を行い、音
声認識結果１００４を出力する。In step ST204, it is determined from the symbol character input selection signal 102 whether the user has selected symbol character input. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted, so in step ST204 the symbol character insertion changeover switch 201 determines that symbol character input has been selected and connects to contact A. In this case, in step ST205, the symbol character insertion means 202 determines, in accordance with the symbol character grammar 101, the positions between the words of the intermediate speech recognition result Wc at which symbol characters are to be inserted, inserts the symbol characters, and outputs the speech recognition result 1004.

【００６５】記号文字挿入手段２０２における記号文字
挿入処理について、例を用いて説明する。記号文字文法
１０１は２−ｇｒａｍとし、照合手段１１０２からの出
力である記号文字を含まない音声認識結果Wcが「私は駅
へ行く」であったとする。記号文字挿入手段２０２は、
記号文字文法１０１を用いて、単語列Wcの単語間に最も
適切な記号文字を挿入する。ここで、「私は」と「駅
へ」の間に記号文字を挿入するかどうかを決定するため
に、数式９の値を最大にする記号文字について、その値
が予め定められた閾値より大きいか否かを判定する。記
号文字「、」について、数式９の値が最大となり、かつ
その値がその閾値より大きい場合には、その記号文字を
挿入し、「私は、駅へ」となる。The symbol character insertion process in the symbol character insertion means 202 is described using an example. Assume that the symbol character grammar 101 is a 2-gram and that the speech recognition result Wc output by the collating means 1102, which contains no symbol characters, is "私は駅へ行く" ("I go to the station"). The symbol character insertion means 202 uses the symbol character grammar 101 to insert the most appropriate symbol characters between the words of the word string Wc. To decide whether to insert a symbol character between "私は" and "駅へ", it takes the symbol character that maximizes the value of Formula 9 and determines whether that value is greater than a predetermined threshold. If the value of Formula 9 is largest for the symbol character "、" and that value exceeds the threshold, the symbol character is inserted, giving "私は、駅へ".
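The insertion decision can be sketched as below. Formula 9 is an image that is not reproduced in this text, so the score used here, Ps(c | previous word) multiplied by Ps(next word | c), is an assumption about its form; the probability values, the threshold, and the function names are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (previous token, current token) -> Ps(current | previous)
Bigram = Dict[Tuple[str, str], float]


def best_symbol(prev: str, nxt: str, symbols: List[str],
                ps: Bigram, threshold: float) -> Optional[str]:
    """Return the best-scoring symbol if its score exceeds the threshold."""
    best, best_score = None, 0.0
    for c in symbols:
        # Assumed stand-in for Formula 9.
        score = ps.get((prev, c), 0.0) * ps.get((c, nxt), 0.0)
        if score > best_score:
            best, best_score = c, score
    return best if best_score > threshold else None


def insert_symbols(words: List[str], symbols: List[str],
                   ps: Bigram, threshold: float) -> List[str]:
    """Insert at most one symbol character between each pair of adjacent words."""
    out: List[str] = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words):
            c = best_symbol(word, words[i + 1], symbols, ps, threshold)
            if c is not None:
                out.append(c)
    return out


ps = {("私は", "、"): 0.6, ("、", "駅へ"): 0.5,
      ("私は", "。"): 0.1, ("。", "駅へ"): 0.1,
      ("駅へ", "、"): 0.05, ("、", "行く"): 0.1}
result = insert_symbols(["私は", "駅へ", "行く"], ["、", "。"], ps, threshold=0.1)
```

With these toy values only the gap between "私は" and "駅へ" clears the threshold, which mirrors the example in the text.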

【００６６】[0066]

【数式９】 [Formula 9]

【００６７】ユーザが記号文字入力を選択しない状態で
は、記号文字入力選択信号１０２が伝送されないため、
ステップＳＴ２０４において、記号文字挿入切換スイッ
チ２０１は接点Ｂに接続される。その結果ステップＳＴ
２０６では、中間音声認識結果Wcを、音声認識結果１０
０４として出力する。When the user has not selected symbol character input, the symbol character input selection signal 102 is not transmitted, so in step ST204 the symbol character insertion changeover switch 201 is connected to contact B. As a result, in step ST206, the intermediate speech recognition result Wc is output as the speech recognition result 1004.

【００６８】ステップＳＴ２０７において、発声が終了
したかどうかが判断される。発声が終了していない場合
は、ステップＳＴ２０２に戻り処理を繰り返す。一方、
発声が終了している場合は処理を終了する。In step ST207, it is determined whether the utterance has ended. If it has not ended, the process returns to step ST202 and is repeated. If it has ended, the process terminates.

【００６９】なお、実施の形態２において、音声特徴量
抽出手段１１０１、照合手段１１０２、単語音響パタン
生成手段１１０４、記号文字挿入手段２０２をハードウ
ェアで構成してもよいが、これらの処理を行う音声認識
プログラムを作成し、コンピュータが音声認識プログラ
ムを実行するようにしてもよい。In Embodiment 2, the speech feature extraction means 1101, the collating means 1102, the word acoustic pattern generation means 1104, and the symbol character insertion means 202 may be implemented in hardware; alternatively, a speech recognition program that performs these processes may be created and executed by a computer.

【００７０】以上のように、この実施の形態２における
音声認識装置によれば、記号文字を入力する場合には、
ユーザが記号文字入力を選択して単語発声することで記
号文字が自動的に入力できるので、効率的に文書が作成
できる効果がある。As described above, with the speech recognition device of Embodiment 2, when symbol characters are to be input, the user only has to select symbol character input and utter the words; the symbol characters are then inserted automatically, so documents can be created efficiently.

【００７１】[0071]

【実施の形態３】図７は、本発明の実施の形態３による
音声認識装置の構成を示すブロック図である。図におい
て３０１はポーズ位置への記号文字挿入手段であって、
記号文字文法１０１に従って、単語列のポーズの位置に
最も適切な記号文字を挿入する。ここでポーズ位置と
は、単語の発話の間に存在するある一定時間以上の無音
区間のことをいう。なお、実施の形態１及び２と同一の
機能ブロックについては同一の符号を付し説明を省略す
る。[Embodiment 3] FIG. 7 is a block diagram showing the configuration of a speech recognition device according to Embodiment 3 of the present invention. In the figure, 301 is a pause-position symbol character insertion means, which inserts the most appropriate symbol character at each pause position in a word string in accordance with the symbol character grammar 101. Here, a pause position is a silent section of at least a certain duration that occurs between the utterances of words. Functional blocks identical to those of Embodiments 1 and 2 are given the same reference numerals, and their description is omitted.

【００７２】次に図８を用いて、処理の方法について説
明する。まず、ステップＳＴ３０１において単語音響パ
タン生成手段１１０４は、認識対象単語辞書１１０５と
標準パタン１１０６を用いて単語音響パタンを生成す
る。Next, the processing method is described with reference to FIG. 8. First, in step ST301, the word acoustic pattern generation means 1104 generates word acoustic patterns using the recognition target word dictionary 1105 and the standard patterns 1106.

【００７３】音声特徴量抽出手段１１０１はステップＳ
Ｔ３０２において、入力音声１００２の音声信号をＡ／
Ｄ変換し、Ａ／Ｄ変換された信号を５ミリ秒〜２０ミリ
秒程度の一定時間間隔のフレームで切り出し、音響分析
を行って音声特徴量ベクトルO = [o(1), o(2), … , o
(T)]（Tは総フレーム数）を抽出する。In step ST302, the speech feature extraction means 1101 A/D-converts the speech signal of the input speech 1002, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract the speech feature vector sequence O = [o(1), o(2), ..., o(T)] (T is the total number of frames).

【００７４】ステップＳＴ３０３において照合手段１１
０２は、通常文法１０１に従って、ステップＳＴ３０１
で生成された単語音響パタンと、音声特徴量抽出手段１
１０１で抽出された音声特徴量Oとの照合を行い、数式
１によって記号文字を含まない音声認識結果を中間音声
認識結果Wpcとして出力するとともに、単語間の無音区
間を検出して、ポーズ位置情報も出力する。In step ST303, the collating means 1102 collates, in accordance with the normal grammar 1103, the word acoustic patterns generated in step ST301 with the speech features O extracted by the speech feature extraction means 1101, outputs the speech recognition result obtained by Formula 1, which contains no symbol characters, as the intermediate speech recognition result Wpc, and also detects silent sections between words and outputs pause position information.
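One way to derive pause position information from a recognition result is sketched below. The description does not say how the silent sections are detected, so this sketch assumes that the recognizer supplies start and end times for each word and that a gap of at least `min_pause_sec` counts as a pause; the 0.3 second default, the marker string, and the function name are hypothetical.

```python
from typing import List, Tuple

# (word, start time in seconds, end time in seconds), assumed to come from alignment
AlignedWord = Tuple[str, float, float]

PAUSE = "<pause>"  # hypothetical marker for a pause position


def mark_pauses(aligned: List[AlignedWord],
                min_pause_sec: float = 0.3) -> List[str]:
    """Return the word string with a marker wherever the inter-word gap is long."""
    out: List[str] = []
    for i, (word, start, _end) in enumerate(aligned):
        if i > 0 and start - aligned[i - 1][2] >= min_pause_sec:
            out.append(PAUSE)
        out.append(word)
    return out


aligned = [("私は", 0.0, 0.5), ("駅へ", 0.9, 1.3), ("行く", 1.35, 1.7)]
wpc = mark_pauses(aligned)
```

The resulting list plays the role of the intermediate result Wpc together with its pause position information.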

【００７５】ステップＳＴ３０４において、記号文字入
力選択信号１０２によりユーザが記号文字入力を選択し
ているかどうかを判定する。ユーザが記号文字入力を選
択すると、記号文字入力選択信号１０２が伝送されるた
め、ステップＳＴ３０４において記号文字挿入切換スイ
ッチ２０１は記号文字入力が選択されたと判断し、接点
Ａに接続する。In step ST304, it is determined from the symbol character input selection signal 102 whether the user has selected symbol character input. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted, so in step ST304 the symbol character insertion changeover switch 201 determines that symbol character input has been selected and connects to contact A.

【００７６】この場合、次にステップＳＴ３０５におい
てポーズ位置への記号文字挿入手段３０１は、記号文字
文法１０１に従って中間音声認識結果Wpcの単語間にお
いて、記号文字を挿入する位置を決定し、この位置に記
号文字の挿入を行って、音声認識結果１００４を出力す
る。In this case, in step ST305, the pause-position symbol character insertion means 301 determines, in accordance with the symbol character grammar 101, the positions between the words of the intermediate speech recognition result Wpc at which symbol characters are to be inserted, inserts symbol characters at those positions, and outputs the speech recognition result 1004.

【００７７】次に、ポーズ位置への記号文字挿入手段３
０１におけるポーズ位置への記号文字挿入処理につい
て、例を用いて説明する。記号文字文法１０１は２−ｇ
ｒａｍとし、中間音声認識結果Wpcが、「私は＜ポーズ
＞駅へ行く」であったとする。ポーズ位置への記号文字
挿入手段３０１は、記号文字文法１０１に従って、単語
列Wpcのポーズの位置に最も適切な記号文字を挿入す
る。ここで、「私は」と「駅へ」の間に存在するポーズ
位置に記号文字を挿入するかどうかを決定するために、
数式９の値を最大にする記号文字について、その値が予
め定められた閾値より大きいか否かを判定する。記号文
字「、」について、数式９の値が最大となり、かつその
値がその閾値より大きい場合には、その記号文字を挿入
し、「私は、駅へ」となる。Next, the process of inserting symbol characters at pause positions in the pause-position symbol character insertion means 301 is described using an example. Assume that the symbol character grammar 101 is a 2-gram and that the intermediate speech recognition result Wpc is "私は＜pause＞駅へ行く" ("I ＜pause＞ go to the station"). The pause-position symbol character insertion means 301 inserts the most appropriate symbol character at the pause position in the word string Wpc in accordance with the symbol character grammar 101. To decide whether to insert a symbol character at the pause position between "私は" and "駅へ", it takes the symbol character that maximizes the value of Formula 9 and determines whether that value is greater than a predetermined threshold. If the value of Formula 9 is largest for the symbol character "、" and that value exceeds the threshold, the symbol character is inserted, giving "私は、駅へ".

【００７８】ユーザが記号文字入力を選択していない状
態では、記号文字入力選択信号１０２が伝送されないた
め、ステップＳＴ３０４において記号文字挿入切換スイ
ッチ２０１は接点Ｂに接続される。その結果、ステップ
ＳＴ３０６において、中間音声認識結果Wpcを音声認識
結果１００４として出力する。When the user has not selected symbol character input, the symbol character input selection signal 102 is not transmitted, so in step ST304 the symbol character insertion changeover switch 201 is connected to contact B. As a result, in step ST306, the intermediate speech recognition result Wpc is output as the speech recognition result 1004.

【００７９】本実施の形態では、発声の途中にポーズ位
置が存在する場合について説明したが、発声の最後の単
語の後に記号文字を挿入することも可能である。「行
く」の例では、数式１０を最大にする記号文字におい
て、その最大値が予め定めた閾値より大きい場合に記号
文字を挿入するようにする。記号文字「。」の場合に数
式１０が最大となり、最大値が閾値より大きければ「行
く。」となる。This embodiment has been described for the case where a pause position occurs in the middle of an utterance, but a symbol character can also be inserted after the last word of an utterance. In the example of "行く" ("go"), the symbol character that maximizes Formula 10 is taken, and it is inserted if that maximum value is greater than a predetermined threshold. If Formula 10 is largest for the symbol character "。" and the maximum exceeds the threshold, the result is "行く。".

【００８０】[0080]

【数式１０】 [Formula 10]

【００８１】ステップＳＴ３０７において、発声が終了
したか否かが判断される。発声が終了していない場合
は、ステップＳＴ３０２に戻り処理を繰り返す。一方、
発声が終了している場合は処理を終了する。In step ST307, it is determined whether the utterance has ended. If it has not ended, the process returns to step ST302 and is repeated. If it has ended, the process terminates.

【００８２】なお、本実施の形態において、音声特徴量
抽出手段１１０１、照合手段１１０２、単語音響パタン
生成手段１１０４、ポーズ位置への記号文字挿入手段３
０１をハードウェアで構成してもよいが、これらの処理
を行う音声認識プログラムを作成し、コンピュータが音
声認識プログラムを実行するようにしてもよい。In this embodiment, the speech feature extraction means 1101, the collating means 1102, the word acoustic pattern generation means 1104, and the pause-position symbol character insertion means 301 may be implemented in hardware; alternatively, a speech recognition program that performs these processes may be created and executed by a computer.

【００８３】 As described above, with the speech recognition device of the present embodiment, the user selects symbol character input and then utters words, and symbol characters are inserted automatically at the pause positions, so documents can be created efficiently.

【００８４】[0084]

【実施の形態４】 Fourth Embodiment. FIG. 9 is a block diagram showing the configuration of a speech recognition device according to the fourth embodiment of the present invention. In the figure, 401 denotes prosody information extraction means, which extracts prosody information from the user's speech. Prosody information here means accent and intonation information; it is described in Chapter 4 of Document 1.

【００８５】 402 denotes symbol character insertion means using prosody information. Functional blocks identical to those of the first to third embodiments are given the same reference numerals, and their description is omitted.

【００８６】 Next, the operation is described with reference to FIG. 10. In step ST401, the word acoustic pattern generation means 1104 generates word acoustic patterns using the recognition target word dictionary 1105 and the standard pattern 1106.

【００８７】 In step ST402, the speech feature quantity extraction means 1101 A/D-converts the speech signal of the input speech 1002, cuts the converted signal into frames at a fixed interval of about 5 to 20 milliseconds, and performs acoustic analysis to extract the speech feature quantity vector O = [o(1), o(2), ..., o(T)], where T is the total number of frames.
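The framing described in step ST402 can be sketched as follows. The patent does not specify the acoustic analysis itself, so the per-frame log energy used here is only a placeholder feature chosen for illustration; the function names, the 10 ms default, and the non-overlapping frames are likewise assumptions.

```python
import math
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate_hz: int,
                      frame_ms: float = 10.0) -> List[List[float]]:
    """Cut digitized samples into consecutive fixed-length frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    # Trailing samples that do not fill a whole frame are dropped.
    return [list(samples[i:i + frame_len])
            for i in range(0, last_start + 1, frame_len)]


def log_energy(frame: Sequence[float]) -> float:
    """Placeholder feature o(t): log of the mean squared amplitude."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return math.log(mean_square + 1e-10)


def extract_features(samples: Sequence[float],
                     sample_rate_hz: int,
                     frame_ms: float = 10.0) -> List[float]:
    """Return O = [o(1), ..., o(T)], one placeholder feature per frame."""
    return [log_energy(f)
            for f in split_into_frames(samples, sample_rate_hz, frame_ms)]
```

For a signal sampled at 16 kHz, a 10 ms frame holds 160 samples, so 1600 samples yield T = 10 frames. A practical recognizer would replace `log_energy` with a richer feature vector per frame.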

【００８８】 In step ST403, the collating means 1102 collates, in accordance with the normal grammar 1103, the word acoustic patterns generated in step ST401 with the speech feature quantity O extracted by the speech feature quantity extraction means 1101 and, using Formula 1, outputs a speech recognition result containing no symbol characters as the intermediate speech recognition result Wc.

【００８９】 In step ST404, the symbol character input selection signal 102 is used to determine whether the user has selected symbol character input. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted, so in step ST404 the symbol character insertion changeover switch 201 determines that symbol character input has been selected and connects to contact A.

【００９０】 In this case, in step ST405, the prosody information extraction means 401 extracts prosody information from the user's speech.

【００９１】 In step ST406, the symbol character insertion means 402 using prosody information determines, in accordance with the symbol character grammar 101 and the prosody information output by the prosody information extraction means 401 in step ST405, the positions between the words of the intermediate speech recognition result Wc at which symbol characters are to be inserted, inserts the symbol characters at those positions, and outputs the speech recognition result 1004.

【００９２】 Next, the symbol character insertion processing of the symbol character insertion means 402 using prosody information is explained with an example. Assume that the symbol character grammar 101 is a 2-gram and that the intermediate speech recognition result Wc is 「私は駅へ行く」 ("I go to the station"). The prosody information extraction means 401 extracts prosody information; here it extracts the pitch frequency, the most dominant factor of prosody. As shown on p. 71 of Document 1, within a speech segment uttered in one breath and bounded by pauses, that is, an exhalation phase, the pitch frequency is high at the onset and then gradually falls, for example because the subglottal pressure drops. FIG. 11 shows the pitch frequency variation qualitatively for an example in which 「私は」 ("I") is uttered in one breath and 「駅へ行く」 ("go to the station") is uttered next. Because 「、」 is generally inserted where the speaker takes a breath, a point where a gap appears in the pitch frequency, such as point X in FIG. 11, is a candidate for symbol character insertion. Based on the symbol character grammar 101, the symbol character insertion means 402 using prosody information inserts the most appropriate symbol character at the position of the pitch frequency gap in the word string Wc. In this example, to decide whether to insert a symbol character at the pitch frequency gap between 「私は」 and 「駅へ」 ("to the station"), it determines whether the value of Formula 9 for the symbol character that maximizes Formula 9 exceeds a predetermined threshold. If Formula 9 is maximized by the symbol character 「、」 and that value exceeds the threshold, the symbol character is inserted, giving 「私は、駅へ」.
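The pitch-gap procedure of paragraph 0092 can be sketched as below. Several points are assumptions made for illustration: the utterance is pre-split into units with known pitch at their start and end, a gap is an upward pitch reset larger than `min_reset_hz`, and the score standing in for Formula 9 (not reproduced in this text) is the product of two hypothetical 2-gram probabilities. None of the names or numbers come from the patent.

```python
from typing import Dict, List, Sequence, Tuple

Bigram = Dict[Tuple[str, str], float]


def find_pitch_gaps(start_pitch_hz: Sequence[float],
                    end_pitch_hz: Sequence[float],
                    min_reset_hz: float) -> List[int]:
    """Return indices i where pitch resets upward between unit i and i+1."""
    gaps = []
    for i in range(len(end_pitch_hz) - 1):
        if start_pitch_hz[i + 1] - end_pitch_hz[i] > min_reset_hz:
            gaps.append(i)
    return gaps


def insert_symbols_at_gaps(units: Sequence[str],
                           gaps: Sequence[int],
                           bigram: Bigram,
                           symbols: Sequence[str],
                           threshold: float) -> List[str]:
    """Insert the best-scoring symbol at each gap if it beats the threshold.

    The score P(symbol | previous unit) * P(next unit | symbol) is an
    assumed 2-gram stand-in for Formula 9.
    """
    gap_set = set(gaps)
    out: List[str] = []
    for i, unit in enumerate(units):
        out.append(unit)
        if i in gap_set and i + 1 < len(units) and symbols:
            scored = [(bigram.get((unit, s), 0.0)
                       * bigram.get((s, units[i + 1]), 0.0), s)
                      for s in symbols]
            best_score, best_symbol = max(scored)
            if best_score > threshold:
                out.append(best_symbol)
    return out


# Hypothetical data for the example utterance in the text.
units = ["私は", "駅へ", "行く"]
start_pitch = [160.0, 150.0, 105.0]   # Hz at the start of each unit
end_pitch = [110.0, 100.0, 90.0]      # Hz at the end of each unit
bigram = {("私は", "、"): 0.4, ("、", "駅へ"): 0.5,
          ("私は", "。"): 0.05, ("。", "駅へ"): 0.1}

gaps = find_pitch_gaps(start_pitch, end_pitch, min_reset_hz=20.0)
result = insert_symbols_at_gaps(units, gaps, bigram, ["、", "。"], 0.1)
```

In this made-up data, the only pitch reset above 20 Hz lies between 「私は」 and 「駅へ」, and 「、」 scores highest there, so the comma is inserted at that position and nowhere else.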

【００９３】 If the user has not selected symbol character input, the symbol character input selection signal 102 is not generated and is not transmitted to the symbol character insertion changeover switch 201, so in step ST404 the switch is connected to contact B. As a result, in step ST407, the intermediate speech recognition result Wc is output as the speech recognition result 1004.

【００９４】 In step ST408, it is determined whether the utterance has ended. If it has not ended, the process returns to step ST402 and repeats. If it has ended, the process terminates.

【００９５】 In the present embodiment, the prosody information extraction means 401 is configured as a block separate from the speech feature quantity extraction means 1101, but the speech feature quantity extraction means 1101 may instead perform the prosody information extraction.

【００９６】 In the present embodiment, the speech feature quantity extraction means 1101, the collating means 1102, the word acoustic pattern generation means 1104, the prosody information extraction means 401, and the symbol character insertion means 402 using prosody information may be implemented in hardware. Alternatively, a speech recognition program that performs these processes may be written and executed by a computer.

【００９７】 As described above, with the speech recognition device and speech recognition method of the fourth embodiment, a user who wants to input symbol characters selects symbol character input and then speaks, and the symbol characters are input automatically using the prosody information of the utterance, so documents can be created efficiently.

【００９８】[0098]

【実施の形態５】 Fifth Embodiment. FIG. 12 is a block diagram showing the configuration of a voice dialogue apparatus according to the fifth embodiment of the present invention. In FIG. 12, 501 denotes symbol character input selection means. Functional blocks identical to those of the speech recognition device of the first embodiment shown in FIG. 1 and of the conventional voice dialogue apparatus shown in FIG. 14 are given the same reference numerals, and their description is omitted.

【００９９】 Next, the operation is described with reference to FIG. 13. In step ST501, the user 1001 selects symbol character input using the symbol character input selection means 501. One conceivable configuration is that pressing the Shift key of a keyboard selects symbol character input. When the user selects symbol character input, the symbol character input selection signal 102 is transmitted. The speech recognition means 1003 determines from the symbol character input selection signal 102 whether the user 1001 has selected symbol character input.

【０１００】 If symbol character insertion has been selected in step ST501, the speech recognition means 1003 outputs the speech recognition result 1004 in step ST503. Here the speech recognition means 1003 is, for example, the speech recognition means described in the first embodiment. In that case, the speech recognition means 1003 acoustically analyzes the input speech, selects an acoustic pattern from the word acoustic patterns and the symbol character acoustic patterns in accordance with the symbol character grammar, collates this acoustic pattern with the extracted speech feature quantity, and outputs the speech recognition result 1004 containing symbol characters.

【０１０１】 The speech recognition means 1003 has been described here as that of the first embodiment, but it may be the speech recognition means of any one of the second to fourth embodiments.

【０１０２】 If symbol character insertion has not been selected, the speech recognition means 1003 outputs the speech recognition result 1004 in step ST502. Here the speech recognition means 1003 is, for example, the speech recognition means described in the first embodiment. In that case, the speech recognition means 1003 acoustically analyzes the input speech, selects a word acoustic pattern in accordance with the normal grammar 1103, collates this acoustic pattern with the speech feature quantity, and outputs the speech recognition result 1004 containing no symbol characters.

【０１０３】 In step ST504, the speech recognition result 1004 output by the speech recognition means 1003 in step ST502 or step ST503 is displayed on the display device.

【０１０４】 In step ST505, it is determined whether the utterance has ended. If it has not ended, the process returns to step ST501 and repeats. If it has ended, the process terminates.
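The control flow of steps ST501 to ST505 can be sketched as a simple dispatch loop. The two recognizer callables are hypothetical stand-ins for the speech recognition means 1003 operating with and without the symbol character grammar; the patent does not define such an interface, and the list of displayed strings stands in for the display device.

```python
from typing import Callable, Iterable, List, Tuple

Recognizer = Callable[[str], str]


def run_dialogue(utterances: Iterable[Tuple[str, bool]],
                 recognize_with_symbols: Recognizer,
                 recognize_plain: Recognizer) -> List[str]:
    """Process (audio, symbol_input_selected) pairs until input ends.

    symbol_input_selected models the symbol character input selection
    signal 102, for example the state of the Shift key.
    """
    displayed: List[str] = []
    for audio, symbol_input_selected in utterances:      # ST501
        if symbol_input_selected:
            result = recognize_with_symbols(audio)       # ST503
        else:
            result = recognize_plain(audio)              # ST502
        displayed.append(result)                         # ST504
    return displayed                                     # ST505: finished
```

Passing the recognizers in as parameters keeps the sketch independent of which embodiment supplies the symbol-aware recognition, in line with the statement that any of the first to fourth embodiments may be used.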

【０１０５】 In the fifth embodiment, the symbol character input selection means 501, the speech recognition means 1003, and the speech recognition result display means 1005 may be implemented in hardware. Alternatively, a program that performs these processes may be written and executed by a computer.

【０１０６】 As described above, with the speech recognition device and speech recognition method of the fifth embodiment, the user gives an instruction from auxiliary input means such as a keyboard and then speaks, whereby symbol characters can be input and the recognition result is displayed, so documents can be created efficiently.

【０１０７】[0107]

【発明の効果】 Effects of the Invention. The device and program according to the present invention use the symbol character grammar to collate the uttered portions of the input speech that correspond to symbol characters with the readings of those symbol characters, and therefore make it possible to input documents containing symbol characters by speech recognition.

【０１０８】 The device and program according to the present invention also switch between the normal grammar and the symbol character grammar with the grammar changeover switch, select word acoustic patterns and symbol character acoustic patterns in accordance with the selected grammar, and collate those acoustic patterns with the acoustic feature quantity extracted from the input speech. The grammar can therefore be switched to suit the document the user is about to input, and as a result the recognition rate improves.

【０１０９】 The device and program according to the present invention also determine symbol character insertion positions, in accordance with the symbol character grammar, in the intermediate speech recognition result obtained by collating the acoustic feature quantity extracted from the input speech in accordance with the normal grammar, and insert symbol characters at those positions. The output of a conventionally configured document input device that uses speech recognition without a symbol character grammar can therefore be used as the intermediate speech recognition result of the present invention, and symbol characters can then be inserted into it.

【０１１０】 The device and program according to the present invention also extract silent intervals from the input speech as pause positions and insert symbol characters into the intermediate speech recognition result at those pause positions in accordance with the symbol character grammar. The user can therefore input symbol characters in document input by speech recognition without uttering anything that corresponds to the symbol characters.

【０１１１】 The device and program according to the present invention also extract prosody information from the input speech and insert symbol characters into the intermediate speech recognition result at the gap positions in this prosody information in accordance with the symbol character grammar. The user can therefore input symbol characters in document input by speech recognition without uttering anything that corresponds to the symbol characters.

[Brief description of drawings]

FIG. 1 is a configuration diagram of the first embodiment of the present invention.

FIG. 2 is a flowchart of processing according to the first embodiment of the present invention.

FIG. 3 is a diagram showing an example of the recognition target word dictionary in the first embodiment of the present invention.

FIG. 4 is a diagram showing an example of the dictionary of notations and readings of symbol characters in the first embodiment of the present invention.

FIG. 5 is a configuration diagram of the second embodiment of the present invention.

FIG. 6 is a flowchart of processing according to the second embodiment of the present invention.

FIG. 7 is a configuration diagram of the third embodiment of the present invention.

FIG. 8 is a flowchart of processing according to the third embodiment of the present invention.

FIG. 9 is a configuration diagram of the fourth embodiment of the present invention.

FIG. 10 is a flowchart of processing according to the fourth embodiment of the present invention.

FIG. 11 is a time-series graph showing pitch frequency variation in the fourth embodiment of the present invention.

FIG. 12 is a configuration diagram of the fifth embodiment of the present invention.

FIG. 13 is a flowchart of processing according to the fifth embodiment of the present invention.

FIG. 14 is a configuration diagram of the conventional technique.

FIG. 15 is a configuration diagram of a speech recognition device according to the conventional technique.

[Explanation of symbols]

101: Symbol character grammar
102: Symbol character input selection signal
103: Grammar changeover switch
104: Symbol character acoustic pattern generation means
105: Dictionary of notations and readings of symbol characters
201: Symbol character changeover switch
202: Symbol character insertion means
301: Pause-position symbol character insertion means
401: Prosody information extraction means
402: Symbol character insertion means using prosody information
501: Symbol character input selection means
1001: User
1002: Input speech
1003: Speech recognition means
1004: Speech recognition result
1005: Speech recognition result display means
1101: Speech feature quantity extraction means
1102: Collating means
1103: Normal grammar
1104: Word acoustic pattern
1105: Recognition target word dictionary
1106: Standard pattern table

Claims

[Claims]

1. A speech recognition device comprising: speech feature quantity extraction means for acoustically analyzing an input speech to extract a speech feature quantity indicating a feature of the speech; word acoustic pattern generation means for generating a word acoustic pattern for a word from a standard pattern, stored in a standard pattern table, for obtaining an acoustic score in speech recognition and from the word stored in a recognition target word dictionary; symbol character acoustic pattern generation means for generating a symbol character acoustic pattern for a symbol character from the symbol character stored in a dictionary of notations and readings of symbol characters and from the standard pattern; symbol character grammar storage means for storing a symbol character grammar that is a connection rule between words and symbol characters; and collating means for selecting an acoustic pattern from the word acoustic pattern and the symbol character acoustic pattern in accordance with the symbol character grammar, collating the selected acoustic pattern with the speech feature quantity, and outputting the word or symbol character of the matching acoustic pattern as a speech recognition result for the input speech.

2. The speech recognition device according to claim 1, further comprising: normal grammar storage means for storing a normal grammar that is a connection rule between words; and a grammar changeover switch for selecting either the normal grammar or the symbol character grammar, wherein, when the grammar changeover switch selects the normal grammar, the collating means selects an acoustic pattern from the word acoustic patterns in accordance with the normal grammar, collates it with the speech feature quantity, and outputs the word of the matching acoustic pattern as a speech recognition result for the input speech that contains no symbol characters.

3. A speech recognition device comprising: speech feature quantity extraction means for acoustically analyzing an input speech to extract a speech feature quantity indicating a feature of the speech; word acoustic pattern generation means for generating a word acoustic pattern for a word from a standard pattern for obtaining an acoustic score in speech recognition and from the word stored in a recognition target word dictionary; normal grammar storage means for storing a normal grammar that is a connection rule between words; collating means for collating an acoustic pattern selected from the word acoustic patterns in accordance with the normal grammar with the speech feature quantity, and outputting the word of the matching acoustic pattern as an intermediate speech recognition result for the input speech; symbol character grammar storage means for storing a symbol character grammar that is a connection rule between words and symbol characters; and symbol character insertion means for inserting symbol characters into the intermediate speech recognition result in accordance with the symbol character grammar to generate a document containing the symbol characters, and outputting the document.

4. The speech feature quantity extraction means acoustically analyzes the input speech including utterances of symbol characters to extract the speech feature quantity of those utterances, the collating means outputs an intermediate speech recognition result including a collation result portion for the utterances of the symbol characters, and the symbol character insertion means is configured to identify the collation result portion in accordance with the symbol character grammar and to replace the collation result portion with the symbol character to generate a document containing the symbol character. The voice recognition device described.

5. The voice feature amount extracting means acoustically analyzes the input voice including a pause time at a constant interval to extract a voice feature amount including information about a pause position, and the symbol / character inserting means determines the pause position. 4. The speech recognition apparatus according to claim 3, wherein the speech recognition apparatus is configured to identify the character according to the symbol character grammar, and insert the symbol character at the pause position to generate a document including the symbol character.

6. The speech recognition device according to claim 3, wherein the speech feature quantity extraction means extracts prosody information from the input speech, the symbol character grammar storage means stores the relation between prosody information and symbol characters in the symbol character grammar, and the symbol character insertion means is configured to insert symbol characters into the intermediate speech recognition result in accordance with the prosody information extracted by the speech feature quantity extraction means and the symbol character grammar, thereby generating a document containing the symbol characters.

7. The speech recognition device according to any one of claims 3 to 6, further comprising: normal grammar storage means for storing a normal grammar that is a connection rule between words; and a grammar changeover switch for selecting either the normal grammar or the symbol character grammar, wherein the symbol character insertion means is configured to output the intermediate speech recognition result as a document containing no symbol characters when the grammar changeover switch selects the normal grammar.

8. A speech recognition program for causing a computer to execute: a speech feature quantity extraction procedure for acoustically analyzing an input speech to extract a speech feature quantity indicating a feature of the speech; a word acoustic pattern generation procedure for generating a word acoustic pattern for a word from a standard pattern, stored in a standard pattern table, for obtaining an acoustic score in speech recognition and from the word stored in a recognition target word dictionary; a symbol character acoustic pattern generation procedure for generating a symbol character acoustic pattern for a symbol character from the symbol character stored in a dictionary of notations and readings of symbol characters and from the standard pattern; a symbol character grammar storage procedure for storing a symbol character grammar that is a connection rule between words and symbol characters; and a collation procedure for selecting an acoustic pattern from the word acoustic pattern and the symbol character acoustic pattern in accordance with the symbol character grammar, collating the selected acoustic pattern with the speech feature quantity, and outputting the word or symbol character of the matching acoustic pattern as a speech recognition result for the input speech.

9. The speech recognition program according to claim 8, further causing the computer to execute: a normal grammar storing procedure for storing a normal grammar that is a connection rule between words; and a grammar switching procedure for selecting either the normal grammar or the symbol character grammar, wherein, when the grammar switching procedure selects the normal grammar, the collation procedure selects an acoustic pattern from the word acoustic patterns in accordance with the normal grammar, collates it with the speech feature quantity, and outputs the word of the matching acoustic pattern as a speech recognition result for the input speech that contains no symbol characters.

10. A speech recognition program for causing a computer to execute: a speech feature quantity extraction procedure for acoustically analyzing an input speech to extract a speech feature quantity indicating a feature of the speech; a word acoustic pattern generation procedure for generating a word acoustic pattern for a word from a standard pattern for obtaining an acoustic score in speech recognition and from the word stored in a recognition target word dictionary; a normal grammar storing procedure for storing a normal grammar that is a connection rule between words; a collation procedure for collating an acoustic pattern selected from the word acoustic patterns in accordance with the normal grammar with the speech feature quantity, and outputting the word of the matching acoustic pattern as an intermediate speech recognition result for the input speech; a symbol character grammar storage procedure for storing a symbol character grammar that is a connection rule between words and symbol characters; and a symbol character insertion procedure for inserting symbol characters into the intermediate speech recognition result in accordance with the symbol character grammar to generate a document containing the symbol characters, and outputting the document.

11. The speech feature quantity extraction procedure acoustically analyzes an input speech including a utterance of a symbol character to extract a speech feature quantity about a utterance of a symbol character, and the above matching procedure includes a collation about an utterance of a symbol character. The intermediate speech recognition result including the result portion is output, and the symbol character insertion procedure identifies the matching result portion according to the symbol character grammar and replaces the matching result portion with the symbol character to generate a document including the symbol character. The voice recognition program according to claim 10, wherein the voice recognition program is configured.

12. The voice feature amount extraction procedure acoustically analyzes an input voice including a pause time at a constant interval to extract a voice feature amount including information about a pause position, and the symbol character insertion procedure includes the pause position. 11. The speech recognition program according to claim 10, wherein the program is identified according to the symbol character grammar, and the symbol character is inserted at the pause position to generate a document including the symbol character.

13. The speech recognition program according to claim 10, wherein the speech feature quantity extraction procedure extracts prosody information from the input speech, the symbol character grammar storage procedure stores the relation between prosody information and symbol characters in the symbol character grammar, and the symbol character insertion procedure is configured to insert symbol characters into the intermediate speech recognition result in accordance with the prosody information extracted by the speech feature quantity extraction procedure and the symbol character grammar, thereby generating a document containing the symbol characters.

14. The speech recognition program according to claim 10, further causing the computer to execute: a normal grammar storing procedure for storing a normal grammar that is a connection rule between words; and a grammar switching procedure for selecting either the normal grammar or the symbol character grammar, wherein the symbol character insertion procedure is configured to output the intermediate speech recognition result as a document containing no symbol characters when the grammar switching procedure selects the normal grammar.