JPH11143487A

JPH11143487A - Method and device for converting voice to character

Info

Publication number: JPH11143487A
Application number: JP9308252A
Authority: JP
Inventors: Hiroyuki Ono; 弘幸大野; Tadashi Teramine; 正寺峰
Original assignee: Osaka Gas Co Ltd
Current assignee: Osaka Gas Co Ltd
Priority date: 1997-11-11
Filing date: 1997-11-11
Publication date: 1999-05-28

Abstract

PROBLEM TO BE SOLVED: To provide a speech-to-text conversion technology capable of converting speech into the correct characters without the same word having to be repeated. SOLUTION: The device is provided with a phoneme recognition means 3 that divides input speech into a plurality of segments and assigns one or more phonemes to each segment, a character string conversion means 4 that determines one or more words based on the phonemes, and a confirmed word selection means 6 that displays one of the words stored in a storage part 51 on a monitor 52 as a confirmed word and treats the words other than the confirmed word as next candidates for display.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention]

The present invention relates to a speech-to-text conversion technology that recognizes an input speech signal and converts it into character data.

[0002]

[Prior art]

With the advance of factory automation, office automation, and home automation, a great many machines have been introduced into workplaces and homes, and keyboards and pointing devices such as mice are the main input interfaces used to convey commands to these machines. However, entering commands through a keyboard is very troublesome for people who are not good at keyboard operation, and input errors are hard to avoid. A more user-friendly alternative is the graphical interface using icons and illustrations, in which the desired command is selected with a mouse or the like from a displayed menu. This works well when the number of commands is small, but once the number of commands grows to a certain size the menu hierarchy becomes deep, and reaching the desired command takes a long time. Voice input methods based on speech recognition have begun to appear in order to overcome these drawbacks. Speech recognition has the advantages that input requires no skill and that, because it does not use the eyes or hands, input can be performed while other work is being done; expectations for it are therefore high.

[0003]

[Problem to be solved by the invention]

However, current speech recognition technology cannot be said to be established with high reliability, and misrecognition often occurs. In particular, in speaker-independent continuous speech recognition, it frequently happens that the same kind of misrecognition persists because of the speaker's habits, or that a word is not recognized correctly unless the speaker deliberately varies the tone while pronouncing it. An object of the present invention is to provide a speech-to-text conversion method and a speech-to-text conversion device capable of converting speech into correct character data without the same word having to be repeated many times.

[0004]

[Means for solving the problem]

To achieve the above object, the speech-to-text conversion method according to the present invention is characterized in that input speech is divided into a plurality of segments and subjected to phoneme recognition, one or more phonemes are assigned to each segment, one or more words determined on the basis of these phonemes are stored in a storage unit, one of the stored words is output as a confirmed word, and the words other than the confirmed word are treated as next candidates.

[0005] In this method, one or more words are associated with the phonemes assigned by a phoneme recognition technique that is known per se, and are stored in the storage unit. One word selected from these words is taken as the confirmed word and used as a control command, while the words other than the confirmed word become next display candidates; if the confirmed word is incorrect, a next display candidate is promoted to confirmed word. To check whether the confirmed word is correct, the confirmed word can, for example, be displayed on a monitor for the operator's judgment, or output as speech for the operator's judgment. In either case, if for example the first confirmed word is incorrect, sending some command for displaying the next candidate causes the next candidate to be output as the next confirmed word. This avoids the situation in which the same wrong word is output again and again even though the operator repeats the same word many times in search of correct recognition.

[0006] As a preferred way of separating the confirmed word from the next display candidates, it is proposed, for example, that the words determined on the basis of the phonemes be stored in the storage unit together with a phoneme matching probability value, and that the words be taken as the confirmed word in descending order of probability value. Because a phoneme matching probability value is linked to each word corresponding to the phonemes, displaying the words in descending order of that value not only eliminates the error of the same character data being repeated, but also raises the likelihood of reaching the correct converted character data in a small number of selections.
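The ordering and promotion described in the paragraph above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the class name `CandidateStore` and its methods are hypothetical, and the two probability values are borrowed from the worked example given later in the description.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateStore:
    """Hypothetical stand-in for the storage unit: words kept with their probability values."""
    words: list[tuple[str, float]] = field(default_factory=list)
    position: int = 0  # index of the current confirmed word

    def store(self, candidates: dict[str, float]) -> None:
        # Sort once, highest probability first, so promotion is a simple index step.
        self.words = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        self.position = 0

    def confirmed(self) -> str:
        return self.words[self.position][0]

    def next_candidates(self) -> list[str]:
        return [word for word, _ in self.words[self.position + 1:]]

    def promote_next(self) -> str:
        # Move to the next candidate; stay on the last one if the list is exhausted.
        if self.position < len(self.words) - 1:
            self.position += 1
        return self.confirmed()


store = CandidateStore()
store.store({"LNG": 0.36, "LPG": 0.64})
first = store.confirmed()      # word with the highest probability value
second = store.promote_next()  # the rejected word is not offered again
```

Because the list is sorted once and only an index moves, a rejected word can never reappear as the confirmed word for the same utterance, which is the property the paragraph relies on.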

[0007] Further, as a preferred embodiment of the present invention, it is proposed to adopt a scheme in which, when the same speech is input within a predetermined time, the next candidate is output as the confirmed word, for example by displaying it on a monitor. In this case, input of the same speech within the predetermined time serves as the trigger for outputting the next candidate as the confirmed word in place of the previous confirmed word. Even when the speech-to-text conversion is incorrect, repeating the utterance therefore causes different words to be output one after another as the confirmed word, so the operator is spared the stress, common in the past, of repeated utterances producing the same erroneous conversion.

[0008] As another method of triggering output of the next candidate, for example, the most easily recognized speech may be set in advance as the next-candidate output command; when an output word has been misrecognized, uttering the speech that serves as the next-candidate output command causes the next candidates to be displayed in turn.

[0009] To achieve the above object, the speech-to-text conversion device according to the present invention comprises: phoneme recognition means for dividing input speech into a plurality of segments and assigning one or more phonemes to each segment; character string conversion means for determining one or more words on the basis of the phonemes; a storage unit for storing the determined words; and confirmed word selection means for outputting one of the stored words as a confirmed word and treating the words other than the confirmed word as next candidates for display.

[0010] In this device, the character string conversion means associates one or more words with the assigned phonemes and stores them in the storage unit, and the confirmed word selection means outputs one selected word as the confirmed word, which is used as a control command, while keeping the words other than the output confirmed word ready as next candidates for that confirmed word. On receiving some command for outputting a next candidate as the confirmed word, the device outputs the next candidate in place of the previous confirmed word. This avoids the situation in which the same wrong word is displayed again and again even though the operator repeats the same word many times in search of correct recognition. Naturally, it is proposed that the confirmed word be checked by, for example, displaying it on a monitor.

[0011] In this speech-to-text conversion device as well, as described for the proposed method, the confirmed word selection means selects the words stored in the storage unit together with a word matching probability value as the confirmed word in descending order of probability value, so that the words are output in descending order of that value. This not only eliminates the error of the same character data being repeated, but also raises the likelihood of reaching the correct converted character data in a small number of selections.

[0012] Furthermore, a configuration may be adopted in which input speech evaluation means is provided for determining whether speech inputs made within a predetermined time are identical, and in which the confirmed word selection means outputs the next candidate as the confirmed word (for example, displays it on a monitor) when the same speech is determined to have been input within the predetermined time. With this configuration, even when the speech-to-text conversion is incorrect, repeating the utterance causes different words to be displayed one after another, and the operator is spared the stress of repeated utterances producing the same erroneous conversion. Other features and advantages of the present invention will become apparent from the following description of an embodiment with reference to the drawings.

[0013]

[Embodiment of the invention]

FIG. 1 shows a functional block diagram of a voice command input system using the speech-to-text conversion technology according to the present invention. In this system, a surveillance camera installed at a plant site is operated by the operator speaking commands. For example, when the operator says "LNG vaporizer No. 1", the surveillance camera moves to aim at LNG vaporizer No. 1 and the state of LNG vaporizer No. 1 is displayed on the monitor.

[0014] This system includes a microphone 1 that converts the speech uttered by the operator into an analog speech signal; an A/D conversion unit 2 that converts the analog speech signal sent from the microphone 1 into a digital speech signal; phoneme recognition means 3 that analyzes the speech signal and replaces it with phonemes resembling phonetic symbols; character string conversion means 4 that assigns appropriate words to the phoneme sequence while accessing a dictionary file 41; a storage unit 51 that temporarily stores the assigned words; and confirmed word selection means 6 that selects the most suitable of the words stored in the storage unit 51 as the confirmed word. The word selected as the confirmed word by the confirmed word selection means 6 is displayed on a monitor 52 and is also passed to surveillance camera control means 7. A command recognition unit 71, which forms part of the surveillance camera control means 7, accesses a command dictionary file 72 with the confirmed word passed to the surveillance camera control means 7 and retrieves the corresponding control command; a driver 73 converts this control command into a control signal that operates the drive motor of a surveillance camera 74.

[0015] The phoneme recognition means 3 uses a phonological recognition algorithm that is known per se, and comprises an acoustic feature extraction unit 31 that analyzes the speech spectrum over time from the digital speech signal and extracts its feature parameters, a phoneme code conversion unit 32 that generates phoneme codes from these feature parameters, and a phoneme conversion unit 33 that supplies the phonemes corresponding to the phoneme codes. When the phoneme conversion unit 33 selects the phonemes corresponding to a phoneme code, a plurality of phonemes are selected together with their phoneme matching probability values, except where the phoneme is uniquely determined (that is, where the phoneme matching probability value is 1). The character string conversion means 4 combines the phonemes in descending order of phoneme matching probability value to generate one or more words serving as control commands, and stores them in the storage unit 51. A word matching probability value is linked to each generated word as an attribute value. The word matching probability value can be obtained simply, for example, by multiplying together the phoneme matching probability values of the phonemes from which the word is built. When a plurality of words exist for a given utterance, the confirmed word selection means 6 first takes the word with the highest word matching probability value as the confirmed word, displays it on the monitor 52, and sends it to the camera control means 7 to start operation of the surveillance camera.
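The multiplication rule mentioned above can be illustrated with a short Python sketch. It is an illustration only: letters stand in for phonemes, the per-segment values are hypothetical, and a real system would also filter the combinations against the dictionary file, which is omitted here.

```python
import math
from itertools import product

# Hypothetical per-segment phoneme candidates with phoneme matching probability values.
# A value of 1.0 means the phoneme was determined uniquely.
segments = [
    [("L", 1.0)],
    [("P", 0.64), ("N", 0.36)],
    [("G", 1.0)],
]


def word_candidates(segments):
    """Combine one phoneme per segment and score each word by the product of its phoneme values."""
    scored = []
    for combo in product(*segments):
        word = "".join(phoneme for phoneme, _ in combo)
        probability = math.prod(p for _, p in combo)
        scored.append((word, probability))
    # Highest word matching probability value first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = word_candidates(segments)
```

With these assumed inputs the first entry of `ranked` is the word that would become the confirmed word, and the remaining entries are the next candidates in order.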

[0016] If the word displayed on the monitor 52 differs from what the operator said, this confirmed word must be cancelled so that the correct word is recognized. However, if the misrecognition is caused by, for example, the operator's pronunciation habits, the speech will not necessarily be recognized correctly even if the operator utters the correct control command again. In the present invention, therefore, a word other than the previously determined confirmed word, namely the word with the second-highest word matching probability value, is kept ready as the next candidate for the case of misrecognition, and this next candidate can simply be made the confirmed word. This replacement of the confirmed word is carried out in turn until the correct confirmed word is output.

[0017] In this embodiment, the trigger for making the next candidate the confirmed word is the operator repeating the same control command within a predetermined time (for example, 2 seconds), that is, the same speech signal being input to the system; input speech evaluation means 8 is provided for this purpose. The input speech evaluation means 8 compares the speech signal input from the A/D conversion unit 2 with the speech signal that was input within the preceding predetermined time, and outputs a word replacement command to the confirmed word selection means 6 if the two signals match. On receiving the word replacement command, the confirmed word selection means 6 makes the next candidate word the confirmed word and makes the word with the next-highest word matching probability the next candidate. Naturally, speech uttered by the operator after the predetermined time has elapsed is recognized afresh.
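A minimal Python sketch of the repeat trigger described above follows. The class and method names are hypothetical, timestamps are passed in explicitly so the behavior is easy to follow, and the byte-for-byte comparison is only a placeholder: the description does not say how two speech signals are judged to match, and two real utterances would need some similarity measure.

```python
from typing import Optional

REPEAT_WINDOW_S = 2.0  # the "predetermined time" of the embodiment (2 seconds)


class InputSpeechEvaluator:
    """Hypothetical sketch of the input speech evaluation means."""

    def __init__(self, window_s: float = REPEAT_WINDOW_S) -> None:
        self.window_s = window_s
        self.last_signal: Optional[bytes] = None
        self.last_time: Optional[float] = None

    def is_repeat(self, signal: bytes, now: float) -> bool:
        """Return True when `signal` matches the previous input and arrived within the window."""
        repeat = (
            self.last_signal is not None
            and self.last_time is not None
            and now - self.last_time <= self.window_s
            and signal == self.last_signal  # placeholder for a real similarity measure
        )
        self.last_signal, self.last_time = signal, now
        return repeat


evaluator = InputSpeechEvaluator()
events = [
    evaluator.is_repeat(b"utterance-a", now=0.0),  # first input: recognized normally
    evaluator.is_repeat(b"utterance-a", now=1.0),  # same input after 1 s: word replacement
    evaluator.is_repeat(b"utterance-a", now=4.0),  # same input 3 s later: recognized afresh
]
```

A `True` result corresponds to issuing the word replacement command; a `False` result corresponds to running recognition on the new input as usual.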

[0018] The operation of this voice command input system is described below with reference to FIG. 2, taking as an example the case in which the operator says "LNG vaporizer No. 1". The speech input from the microphone 1 is gain-controlled and then digitized by 16 kHz sampling and 16-bit quantization. It is then converted, for each frame of about 6.6 milliseconds, into a representation of 23 acoustic feature parameters using a 20-channel filter bank or the like. These acoustic feature parameters are converted into phoneme codes by a phoneme encoder consisting of a two-stage decision tree.
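The figures in this paragraph imply a frame length of roughly 106 samples, which the following Python sketch works out. The rounding to a whole number of samples and the use of consecutive, non-overlapping frames are assumptions made here; the description gives only the approximate frame duration.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling (from the description)
BITS_PER_SAMPLE = 16      # 16-bit quantization (from the description)
FRAME_MS = 6.6            # "about 6.6 milliseconds" per frame

# Assumption: the frame length is rounded to a whole number of samples.
samples_per_frame = round(SAMPLE_RATE_HZ * FRAME_MS / 1000)
bytes_per_frame = samples_per_frame * BITS_PER_SAMPLE // 8


def split_into_frames(samples, frame_len=samples_per_frame):
    """Split a sample sequence into consecutive non-overlapping frames, dropping any remainder."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, frame_len)]


one_second = [0] * SAMPLE_RATE_HZ  # one second of silence as stand-in input
frames = split_into_frames(one_second)
```

Under these assumptions one second of speech yields about 150 frames, each of which would be reduced to the 23 acoustic feature parameters mentioned above.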

[0019] In this two-stage decision tree, the first-stage decision tree uses as its input vector, for each frame of about 6.6 milliseconds, a total of 184 features: the 23 acoustic features of the frame plus 161 features derived from the relationship with the acoustic features of the preceding and following frames. The output of the first-stage decision tree is one of nine classes, numbered 0 to 8, used for segmentation in the next stage. These nine classes are called segment classes. The second-stage decision tree uses as its input vector a total of 282 features: features such as the segment class of the segment and the average of the acoustic features of the frames making up the segment, plus features derived from the relationship with the features of the preceding and following segments. The output of the second-stage decision tree is one of about 1800 phoneme codes, a unit even smaller than the phoneme.
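The feature counts above and one plausible reading of the segmentation step can be sketched in Python as follows. The description does not state how the per-frame segment classes are turned into segments; grouping runs of consecutive frames with the same class is an assumption made here, and the example class sequence is hypothetical.

```python
from itertools import groupby

N_ACOUSTIC = 23        # acoustic features of the frame itself
N_CONTEXT = 161        # features derived from the neighbouring frames
STAGE1_INPUT_DIM = N_ACOUSTIC + N_CONTEXT  # 184, as stated in the description
N_SEGMENT_CLASSES = 9  # segment classes 0..8
STAGE2_INPUT_DIM = 282  # per-segment input vector size of the second stage


def segment_frames(frame_classes):
    """Group consecutive frames with the same segment class into (class, start, length) segments."""
    segments = []
    start = 0
    for segment_class, run in groupby(frame_classes):
        length = len(list(run))
        segments.append((segment_class, start, length))
        start += length
    return segments


# Hypothetical first-stage outputs for ten consecutive frames.
frame_classes = [0, 0, 3, 3, 3, 3, 7, 7, 0, 0]
segments = segment_frames(frame_classes)
```

Each resulting segment would then be described by its own 282-element vector and passed to the second-stage decision tree, which is not modelled here.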

[0020] The resulting phoneme code sequence is converted from phoneme codes to phonemes, with phoneme matching probabilities, by reference to a phoneme code file in which a plurality of phonemes are assigned with probabilities to each phoneme code. For each segment, one or more phonemes having at least a predetermined probability value are selected. This phoneme list is converted into words by reference to a grammar; in this example, the word "LPG" is given a word matching probability value of 0.64 and the word "LNG" a word matching probability value of 0.36. The words converted in this way are stored in the storage unit 51.
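The lookup and thresholding step described above might look like the following Python sketch. The table contents, the code numbers, and the threshold of 0.2 are all hypothetical; the description states only that each phoneme code maps to several phonemes with probabilities and that phonemes at or above a predetermined value are kept.

```python
# Hypothetical excerpt of a phoneme code file: code -> [(phoneme, probability), ...].
PHONEME_CODE_FILE = {
    412: [("p", 0.64), ("n", 0.36)],
    87: [("g", 1.0)],
    1503: [("e", 0.70), ("i", 0.25), ("a", 0.05)],
}

MIN_PROBABILITY = 0.2  # assumed value for the "predetermined probability value"


def phonemes_for(code, threshold=MIN_PROBABILITY):
    """Return the phonemes for one phoneme code whose probability is at least the threshold."""
    return [(ph, p) for ph, p in PHONEME_CODE_FILE.get(code, []) if p >= threshold]


# One phoneme code per segment is assumed for simplicity.
phoneme_list = [phonemes_for(code) for code in (412, 87, 1503)]
```

The threshold trades recall for list size: a lower value keeps more alternative phonemes per segment and therefore produces more next candidates downstream.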

[0021] From the words that have been recognized and converted into characters as described above, the confirmed word selection means 6 combines those with high probability values and designates the confirmed word and the next candidates in order from the most probable. In this example, "LPG vaporizer No. 1", which had the highest probability value, is the first confirmed word, "LNG vaporizer No. 1" is the next candidate, and "LPG vaporizer No. 2" follows. Accordingly, "LPG vaporizer No. 1" is displayed on the monitor 52 as shown in FIG. 3, and the surveillance camera 74 begins moving to capture LPG vaporizer No. 1.
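One way to arrive at the ordering in this example is sketched below in Python. The values for "LPG" and "LNG" come from the description; the values for the two vaporizer numbers are hypothetical, and scoring a combined command by multiplying the word values is an assumption, since the description says only that high-probability words are combined.

```python
from itertools import product

# Word matching probability values. The first pair comes from the description;
# the second pair is hypothetical, chosen only to illustrate the ordering.
gas_words = {"LPG": 0.64, "LNG": 0.36}
unit_words = {"vaporizer No. 1": 0.7, "vaporizer No. 2": 0.3}

phrases = sorted(
    (
        (f"{gas} {unit}", p_gas * p_unit)
        for (gas, p_gas), (unit, p_unit) in product(gas_words.items(), unit_words.items())
    ),
    key=lambda item: item[1],
    reverse=True,
)

# The first phrase is the confirmed word; the rest are next candidates in order.
confirmed, *next_candidates = [phrase for phrase, _ in phrases]
```

With these assumed values the ranking reproduces the order given in the paragraph: "LPG vaporizer No. 1" first, then "LNG vaporizer No. 1", then "LPG vaporizer No. 2".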

[0022] However, what the operator said was "LNG vaporizer No. 1", so this result is a misrecognition, and the operator says "LNG vaporizer No. 1" again one second later. The input speech evaluation means 8 confirms that the same speech signal has been input within the predetermined time that signifies re-input due to misrecognition, here 2 seconds. Without performing new speech recognition processing, or ignoring its result if such processing is performed, it acts on the confirmed word selection means 6 so that the next candidate, "LNG vaporizer No. 1", becomes the confirmed word and is displayed on the monitor 52; as a result, the surveillance camera 74 changes its movement so as to capture LNG vaporizer No. 1. The correct control command has now been sent, so the operator, if necessary, waits for at least 2 seconds to elapse and then speaks to send the next control command.

[0023] In this embodiment, uttering the same content once more is used as the trigger for displaying the next candidate, but other methods are possible. For example, the most easily recognized speech may be set in advance as the next-candidate display command, and when a displayed word has been misrecognized, uttering the speech that serves as the next-candidate display command causes the next candidates to be displayed in turn.

[0024] Also, for checking the confirmed word, instead of the configuration in which the confirmed word is displayed on the monitor 52, the confirmed word may be output as speech so that the operator can check it.

[0025] The essential point of the present invention is that, in order to avoid the same misrecognition occurring again and again, the other words obtained as results of a single speech recognition pass are used as next candidates. Provided this gist is not departed from, using other known methods as the speech recognition method also falls within the scope of the present invention.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of a voice command input system using the speech-to-text conversion technology according to the present invention.

FIG. 2 is an explanatory diagram showing the flow of speech-to-text conversion according to the present invention.

FIG. 3 is an explanatory diagram showing the state of the storage unit and the monitor screen in speech-to-text conversion according to the present invention.

[Explanation of symbols]

2: A/D conversion unit
3: Phoneme recognition means
4: Character string conversion means
6: Confirmed word selection means
7: Camera control means
8: Input speech evaluation means
51: Storage unit
52: Monitor

Claims

1. A speech-to-text conversion method characterized in that input speech is divided into a plurality of segments and subjected to phoneme recognition, one or more phonemes are assigned to each segment, one or more words determined on the basis of the phonemes are stored in a storage unit, one of the stored words is output as a confirmed word, and the words other than the confirmed word are treated as next candidates.

2. The speech-to-text conversion method according to claim 1, wherein the confirmed word is output to a monitor and displayed on the monitor.

3. The speech-to-text conversion method according to claim 1, wherein the words determined on the basis of the phonemes are stored in the storage unit together with a phoneme matching probability value, and the words are taken as the confirmed word in descending order of probability value.

4. The speech-to-text conversion method according to claim 1, wherein, when the same speech is input within a predetermined time, the next candidate is output as the confirmed word.

5. A speech-to-text conversion device comprising: phoneme recognition means for dividing input speech into a plurality of segments and assigning one or more phonemes to each segment; character string conversion means for determining one or more words on the basis of the phonemes; a storage unit for storing the determined words; and confirmed word selection means for outputting one of the stored words as a confirmed word and treating the words other than the confirmed word as next candidates for display.

6. The speech-to-text conversion device according to claim 5, further comprising a monitor that displays the confirmed word so that the confirmed word can be checked.

7. The speech-to-text conversion device according to claim 5 or 6, wherein the words determined on the basis of the phonemes are stored in the storage unit together with a word matching probability value, and the confirmed word selection means selects the words as the confirmed word in descending order of probability value.

8. The speech-to-text conversion device according to claim 5, further comprising input speech evaluation means for determining whether speech inputs made within a predetermined time are identical, wherein, when it is determined that the same speech has been input within the predetermined time, the confirmed word selection means outputs the next candidate as the confirmed word.