JPH11352986A

JPH11352986A - Recognition error moderating method of device utilizing voice recognition

Info

Publication number: JPH11352986A
Application number: JP10160676A
Authority: JP
Inventors: Tetsutada Sakurai; 哲真桜井; Yoshio Nakadai; 芳夫中台; Yoshitake Suzuki; 義武鈴木; Yutaka Nishino; 豊西野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-06-09
Filing date: 1998-06-09
Publication date: 1999-12-24

Abstract

PROBLEM TO BE SOLVED: To improve the speech recognition of a voice-dialing PHS portable telephone and make it easier to use. SOLUTION: In this method for mitigating recognition errors, when the name of the party to be called, 'Meguro', is input by voice (S2), its similarity to a standard pattern is calculated (S3). If the similarity does not exceed a prescribed value (S4), the user is prompted to speak again by a display or synthesized speech whose content differs each time and takes a conversational form such as 'Sorry, I was thinking about something else. What was that?', and the process returns to step S2 (S6). If the similarity exceeds the prescribed value, 'Meguro 03-1234-5678' is displayed and that telephone number is dialed automatically. The standard pattern is speaker-adapted using the input speech of the same user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method for mitigating misrecognition in a processing device that, as in voice dialing for example, performs speech recognition on input speech and carries out processing based on the recognition result.

[0002]

[Prior Art] In recent years, speech recognition has attracted considerable attention as a means of supporting user activity. Many products advertising a speech recognition function have appeared, and many conference presentations have been made on them. For example, Company A sells speech recognition word processor software named VOICE TYPE, and Company B sells a car navigation system with a speech recognition function. In addition, as widely reported in newspapers, magazines, and television broadcasts, a wristwatch-type PHS telephone with a speech recognition function that allows dialing by voice, developed by the inventors, was provided to the operating staff of the Nagano Olympic Games held in February 1998.

[0003] As is well known, speech recognition lets a user enter various commands and names into a target system by voice, that is, by mouth, and has therefore been regarded as a very desirable input means for users whose hands are occupied with other tasks. For example, a driver using a car navigation system must concentrate on driving, with both hands and both feet busy operating the steering wheel and brakes. Attempts have therefore been made to use speech recognition as a means of switching car navigation screens and retrieving information such as the distance to the destination. In addition, speech recognition is used in mobile telephones and the like to ease the inconvenience of operating small dial buttons. For example, with Company C's mobile telephone, although the names must be recorded beforehand in the user's own voice, uttering a name displays the other party's telephone number and allows that number to be dialed.

[0004] Despite their convenience, products with a speech recognition function such as those cited above have not yet won a large user base. The reason is that a 100% recognition rate is difficult to achieve: in voice-operated car navigation, the vehicle's running noise and engine noise interfere, and in voice dialing on the street, street noise and nearby conversations interfere. To compensate at the system level for a recognition rate below 100%, a method of having the user confirm the recognition result is adopted. A specific example follows.

[0005]
User: activates the speech recognition function (S1).
User: utters "Aoki-san" (S2).
The speech recognition function calculates the similarity between the utterance and pre-stored standard speech (S3). Suppose this value is, for example, 60 (a schematic figure with an upper limit of 100). It is then determined whether the calculated similarity exceeds a predetermined value (threshold) (S4). If it exceeds the threshold:
Terminal: displays the recognition result "Aoki-san 03-1234-5678" on the liquid crystal display screen (S5).
User: presses the dial button.
Terminal: sends the dial tones to the telephone line.
This is the usual flow. If the similarity did not exceed the predetermined value in step S4, the terminal presents the stock phrase "Please speak again" on the liquid crystal screen or the like, or utters it as synthesized speech, and returns to step S2 (S6). If the result displayed in step S5 was a misrecognition of "Aoki-san" as "Aoyama-san", the user had to utter the name again. Much the same procedure applied when a driver operated a car navigation system. In sum, because speech recognition cannot achieve a 100% recognition rate, confirmation by the user (driver) is unavoidable. This leads to the judgment that the function is "a hassle", and it has not achieved wide market acceptance.

[0006] Toys using speech recognition are also known. For example, several years ago a puppy toy with a kind of speech recognition function went on sale. It barked when its infrared sensor detected a person, but on hearing the person say "shh, shh" it responded with a whimpering "kuun, kuun". Detecting a human body with an infrared sensor is a commonly used technique not limited to toys. What distinguished this toy was its low-cost means of recognizing the human voice: it detected the frequency components of the consonant in the "shh" sound with a kind of wavelength filter. Unsurprisingly, its responses were uniform and it did not become very popular.

[0007] Further, although somewhat different from recognition using speech, a computer that responded as if it had intelligence appeared some ten-odd years ago. When a person typed a sentence such as "I had a rough day today" on the keyboard, a reply such as "What happened?" was displayed on the screen, and it attracted attention as a computer (or a machine at the other end of the line) that recognized language (in this case, sentences entered from the keyboard). The mechanism was a computer with built-in software that generated noncommittal sentences from keywords in the sentences the other party entered, and it was self-evident that it could hardly give constructive replies. Like the other examples, it was soon forgotten.

[0008] Also, although they can hardly be called intelligent, so-called electronic pets existed before this invention: small game devices that fit in a pocket and, through programmed behavior, make the owner take care of them. An electronic pet is a kind of game in which the pet grows or falls ill depending on the care the owner gives (feeding, cleaning up droppings, discipline, and so on). Several dozen varieties were released, but it is well known that users tired of them as the way to "finish" the game became widely known.

[0009]

[Problem to Be Solved by the Invention] The object of the present invention is to provide a method and system that, unlike the examples listed above, have a speech recognition function, which is one recognition technology, and additionally have a function that covers, without a sense of awkwardness, the "misrecognition" that speech recognition has been unable to avoid.

[0010]

[Means for Solving the Problem] According to the present invention, a device having a speech recognition function responds to an utterance presumed to be misrecognized, or to an utterance that is difficult to recognize, by presenting a display or guidance (an utterance by speech synthesis) that prompts the user to speak again, in such a way that the user does not "tire of" the system concerned. To keep the user from tiring of it, the guidance is not given with the same content twice in succession.
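The requirement that guidance not repeat the same content in succession can be illustrated with a minimal Python sketch. The class name `GuidanceSelector`, the random choice among the remaining prompts, and the prompt texts are assumptions made for illustration. The first prompt paraphrases the example given in this document; the other two are invented.

```python
import random

# Illustrative prompts. Only the first paraphrases the document's example;
# the other two are hypothetical.
PROMPTS = [
    "Sorry, I was thinking about something else. What was that?",
    "I missed that. Could you say it once more?",
    "Hmm, who did you want to call?",
]


class GuidanceSelector:
    """Choose a re-utterance prompt that differs from the previous one."""

    def __init__(self, prompts, rng=None):
        if len(prompts) < 2:
            raise ValueError("need at least two prompts to avoid repetition")
        self._prompts = list(prompts)
        self._rng = rng or random.Random()
        self._last = None  # index of the prompt used most recently

    def next_prompt(self):
        # Exclude the prompt used last time, then pick among the rest.
        choices = [i for i in range(len(self._prompts)) if i != self._last]
        self._last = self._rng.choice(choices)
        return self._prompts[self._last]
```

Each call to `next_prompt()` would be made at step S6, whenever the device asks the user to speak again. Random selection is only one option; a round-robin order would satisfy the same requirement.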

[0011] Furthermore, because the user continues to use the device without tiring of it, that is, uses it more, the user's speech is input many times, and the speech recognition function reaches a sufficient recognition rate in a relatively short time through so-called speaker adaptation.

[0012]

[Embodiments of the Invention] FIG. 1 shows an embodiment of the present invention. Here, the main body 1 is a communication terminal such as a telephone, a game machine, a so-called remote control, a household electrical appliance, or a composite or combination of these. In this example the main body 1 is described as a PHS telephone, which is a kind of mobile telephone, but the invention is clearly applicable to the various systems, terminals, and products mentioned earlier. The PHS telephone 1 comprises a system basic function unit 2 that implements the basic functions of a telephone, an RF unit 3 that implements the radio function, an interface unit 4 that provides the user with the communication interface, a speech recognition unit 5 that enables operation based on the user's voice, and so on. The interface unit 4 comprises a microphone 41 and a speaker 42, which are essential for sending and receiving speech, a screen display unit 43 such as a liquid crystal display that shows dialed digits, a key operation unit 44 (not shown), an interface control unit 45 that controls these in an integrated manner, and so on. In addition, a voice guidance unit 46 (not shown), a memory 47 (not shown) for storing telephone numbers and guidance speech, and the like are added as needed. The part that characterizes the present invention, shown at the center of FIG. 1, is provisionally called the user specialization unit 6.

[0013] Here, the general configuration of a PHS telephone system, one application example of the invention, is briefly described with reference to FIG. 2. The shapes in FIG. 2 were drawn with priority on ease of understanding, and other configurations and shapes are of course entirely acceptable. In FIG. 2, an ISDN 71 exists as the network. So-called base stations 721, 722, and so on connected to the ISDN 71, a service interface unit 73 that manages them and provides the telephone service, and terminals held by users such as terminal 741 and terminal 742 (not shown) make up the PHS telephone system, which provides the microwave-band radio communication service described below (hereinafter, the PHS service). Naturally, the service is not limited to communication between PHS terminals: the system is or can be connected to other networks 75 (not shown), such as the PSTN or the Internet, so calls with ordinary home telephones are also possible.

[0014] The PHS service is a personal-use communication service that sends and receives voice or digital data using radio waves in the 1.9 GHz band, with a radio (hereinafter, RF) output of 10 mW or less from the terminal and 500 mW or less from the public base station side (721, 722, etc.), the output level being controlled according to the intended coverage of the base station. In the PHS telephone system, transmit and receive time slots (625 μs per slot) are allocated within a 5 ms unit of time called a TDMA/TDD frame, and voice channels for three terminals are provided per base station. A control channel for controlling these voice channels is also provided between one base station and the three terminals. Such a low-power PHS telephone can be made small, but making it smaller forces the key operation unit to shrink as well, so miniaturization was constrained by the size of human fingers and hands. One way of overcoming this constraint is the dialing function based on speech recognition technology realized by the inventors. Instead of the key operation unit that a telephone normally has, it uses a microphone and a speech recognition function to convert a person's voice (digits or a name) into dial digits and send them to the telephone line. It is well known that adding this function made it possible to shrink a PHS telephone to about the size of a wristwatch, and 40 wristwatch-type PHS telephones were used in venue operations at the Nagano Olympic Games held in February 1998.
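The frame figures quoted above can be checked with a short calculation. The sketch below takes the 5 ms frame, the 625 μs slot, the three voice channels, and the one control channel from the text. The even split of the frame between transmit and receive directions is an assumption about how time-division duplexing is applied, and the function name `slot_budget` is hypothetical.

```python
FRAME_US = 5000       # 5 ms TDMA/TDD frame (from the text)
SLOT_US = 625         # 625 microseconds per time slot (from the text)
VOICE_CHANNELS = 3    # voice channels per base station (from the text)
CONTROL_CHANNELS = 1  # control channel per base station (from the text)


def slot_budget(frame_us=FRAME_US, slot_us=SLOT_US):
    """Return (slots per frame, slots per direction) for a TDMA/TDD frame."""
    slots, remainder = divmod(frame_us, slot_us)
    if remainder:
        raise ValueError("frame length is not a whole number of slots")
    # Assumption: time-division duplex splits the frame evenly into
    # transmit and receive halves.
    return slots, slots // 2


slots_per_frame, slots_per_direction = slot_budget()
print(slots_per_frame, slots_per_direction)  # 8 4
```

Under that assumption, the four slots available in each direction match the three voice channels plus one control channel described in the paragraph.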

[0015] Here, the specific configuration of the speech recognition function actually used to verify the effect of the invention is also described. In FIG. 3, the speech input unit 11 is a means for receiving speech, for example an audio microphone or an analog signal input terminal that receives acoustic waveform data. The waveform conversion unit 12 is a means for converting the speech data obtained from the speech input unit 11 into digital values for analysis; for example, it converts an analog speech waveform into digital data. The speech feature extraction unit 13 extracts, from the speech waveform data obtained by the waveform conversion unit 12, the feature values used for speech interval detection and speech recognition. For the DP matching method (a so-called speaker-dependent recognition technique) taken as the example in this description, it uses analysis methods well known in acoustic recognition, such as short-time log power analysis and cepstrum analysis. Naturally, analysis procedures for speech recognition based on hidden Markov models (a so-called speaker-independent recognition technique), and comparable alternatives, were also tried experimentally with good results. For recognition of Japanese, the speaker-independent method based on hidden Markov models gave good results; when languages other than Japanese were also targeted, the speaker-dependent DP matching method gave good results.
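As an illustration of one of the analysis methods named above, the following Python sketch computes a short-time log power for each frame of a sampled waveform and uses it for a simple speech interval detection. The frame length, hop size, decibel floor, and fixed threshold are assumptions chosen for the example, and the function names are hypothetical. The document does not specify how units 13 and 15 are actually implemented.

```python
import math


def short_time_log_power(samples, frame_len=160, hop=80, floor=1e-10):
    """Return one log-power value (in dB) per frame of a sampled waveform."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # The floor avoids log10(0) for completely silent frames.
        powers.append(10.0 * math.log10(max(mean_square, floor)))
    return powers


def detect_speech_interval(log_powers, threshold_db):
    """Return (first, last) frame index above the threshold, or None."""
    active = [i for i, p in enumerate(log_powers) if p > threshold_db]
    return (active[0], active[-1]) if active else None
```

With an assumed 8 kHz sampling rate, `frame_len=160` corresponds to 20 ms frames. A practical detector would usually adapt the threshold to the background noise level rather than use a fixed value.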

[0016] The procedure described above gave similar results even for input from a telephone handset, where the speech waveform is band-limited to 300 to 3400 Hz. The speech interval detection unit 15 determines which portion of the speech feature values obtained from the speech feature extraction unit 13 is to be stored. The input pattern storage unit 16 is a storage unit that takes in the speech feature values, with weight placed on vowels, over the interval from the speech start point to the end point determined by the speech interval detection unit 15, and holds them as the unknown input pattern. The standard pattern storage unit 17 stores the standard patterns against which the unknown input pattern held in the input pattern storage unit 16 is matched. Naturally, what the standard pattern storage unit 17 stores depends on the recognition technique applied. For DP matching, which is widely used for speaker-dependent recognition and can also handle foreign languages, it stores feature patterns of speech registered in advance, that is, feature patterns made from the actual voice of the person to be recognized. For recognition by hidden Markov models, which is widely used for speaker-independent recognition, it stores speech divided into units called phonemes, usually 43 elements or 26 elements, held as so-called vector information. In this case, unlike DP matching, the stored units have no useful meaning on their own, so the system must also have a hidden Markov model network that models the recognition targets. This is shown explicitly as the recognition target model storage unit 24. These are of course notational conveniences, and 17 and 24 could equally be drawn as a single block (hereinafter, the part that stores, in both or in part of them, the standard patterns to be matched against the input pattern is called the standard pattern storage unit 700). Put simply, the standard pattern storage unit 700 is a storage means holding multiple labeled standard speech patterns for recognition, analyzed and stored by the same procedure as for the input pattern storage unit.
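Matching an unknown input pattern against labeled standard patterns by DP matching can be sketched as follows. This is a textbook dynamic time warping formulation with a Euclidean local distance and length normalization, offered as an assumption about what DP matching might look like here rather than as the inventors' implementation. The names `dtw_distance` and `rank_templates` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic-programming (DTW) distance between two feature sequences.

    Each sequence is a list of feature vectors of equal dimension.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance a only
                                 cost[i][j - 1],      # advance b only
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] / (n + m)  # normalize by the sequence lengths


def rank_templates(unknown, templates):
    """Return (label, distance) pairs sorted from best to worst match."""
    scored = [(label, dtw_distance(unknown, pattern))
              for label, pattern in templates.items()]
    return sorted(scored, key=lambda item: item[1])
```

Here `templates` plays the role of the labeled standard patterns, as a dictionary from label to feature sequence, and `unknown` plays the role of the unknown input pattern. Because the warping path may repeat frames, utterances spoken at different speeds can still match the same template.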

[0017] The likelihood calculation unit 22 compares the similarity between the unknown input speech pattern held in the input pattern storage unit 16 and the multiple standard patterns generated or output by the standard pattern storage unit 700. The similarity is defined as a distance value between the input speech pattern and a standard pattern (a distance in feature space defined by a formula such as the Mahalanobis distance). Alternatively, it is defined as the probability that, assuming the standard pattern was uttered, the actually observed input pattern would be produced (in general the logarithm of this probability is used). In the former case the candidate with the smallest distance, and in the latter case the candidate with the largest probability, is judged to have the highest likelihood. Making the threshold of the likelihood calculation unit 22 settable from outside yields a highly practical configuration; this is shown in FIG. 3 as the likelihood threshold setting unit 25. Changing the likelihood setting makes it possible, for example, to adjust the system's sensitivity to noise or to configure the system to select several candidates with very close similarities at the same time. Further, by providing a part (not shown) that records the similarity obtained when the output matched the speaker's intention and connecting it to the likelihood threshold setting unit 25, the system can be given a learning function. Learning functions, and speaker adaptation techniques that make the target speaker's voice easier to recognize, are known as general methods for improving speech recognition.
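The two similarity definitions differ in direction: a distance is best when smallest, while a log probability is best when largest. The Python sketch below shows both conventions for a single feature vector under an assumed diagonal-covariance Gaussian model per standard pattern. Reducing each pattern to one vector with a mean and variance is a simplification made for illustration, and all function names are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance assuming a diagonal covariance matrix."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))


def gaussian_log_prob(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
               for xi, mi, vi in zip(x, mean, var))


def best_by_distance(x, models):
    # Distance convention: the smallest value marks the best candidate.
    return min(models, key=lambda name: mahalanobis_diag(x, *models[name]))


def best_by_log_prob(x, models):
    # Probability convention: the largest value marks the best candidate.
    return max(models, key=lambda name: gaussian_log_prob(x, *models[name]))
```

`models` maps each label to a `(mean, var)` pair. Whichever convention is used, an externally settable threshold must be compared in the matching direction: below a maximum distance, or above a minimum log probability.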

[0018] The likelihood comparison unit 23 receives the likelihood calculation results and determines which standard pattern the input speech most resembles. If several standard patterns have similar similarity values, it outputs all of them to the result aggregation unit 19; if only one does, it outputs that standard pattern. The result aggregation unit 19 arranges the recognition results for the standard patterns in order of likelihood and sends them to the output unit 20. Whether the likelihood values are sent along depends on how the system is used. If the pattern judged most similar did not reach a sufficient threshold, the result aggregation unit 19 hands processing to the user specialization unit 6, which carries out the characteristic procedure of the invention, namely a display or utterance (by synthesized speech) that puts the speaker at ease. This is described concretely below.

[0019] Put simply, the user specialization unit 6 is a mechanism that, via the interface control unit 45 of the main body 1, captures the user's habits and characteristics and helps that user's operations produce more accurate input. It is also a mechanism that records the habits and tone of the user's speech as data and changes the data in the standard pattern storage unit 700 of FIG. 3 to suit the user, that is, so-called speaker adaptation. Furthermore, when the intended function is not achieved because of ambient noise or the like, it seeks to ease the user's irritation by replying with a touch of human or pet-like character rather than with a stock instruction such as "Please speak again". The effects and configuration of the invention are described below using the example of a PHS telephone with a speech recognition function.

[0020] FIG. 4 shows the procedure for an operation (called voice repertory dialing, voice dialing, or the like) in which the user says "Meguro" using the speech recognition function built into the PHS telephone and calls up the dial digits registered under "Meguro". The point to emphasize in this example is the case where speech recognition misrecognizes. As is well known, misrecognition occurs when the start or end of the user's utterance cannot be identified accurately, or when another person's voice or ambient noise is superimposed during the utterance so that information other than the intended speech waveform is captured. In such cases the similarity (also called likelihood) between the captured speech waveform and the recognition candidates is known to be lower than the value it should have. In that situation, conventional speech recognition systems either returned the candidate with the highest value among the low similarities as the recognition result or returned a fixed response such as "Could not recognize".

[0021] For example, if the middle sound "gu" of the utterance "Meguro (me/gu/ro)" is drowned out by ambient noise, the information obtained is "me/**/ro". In this case the two place names "Mejiro" and "Meguro" have similarity values that are lower than would be expected had "meguro" been heard clearly, but are about equal to each other. As a result, depending on how the word was spoken, the system quite often returned "Mejiro" and misrecognized. The present invention notes that in such cases the similarity values of the first and second candidates are extremely close, and is characterized by returning a different kind of response. In the procedure of FIG. 4, the speech recognition function is first activated (S1), "Meguro" is uttered (S2), the similarity value is calculated for this first utterance (S3), it is determined whether the value exceeds a predetermined value (S4), and it is also determined whether the difference between the first and second candidates exceeds a predetermined value. If these values do not satisfy the predetermined similarity, or if the difference between the first and second candidates is small, the first candidate is not returned as the recognition result. One feature is that, instead of the flat stock response of the "Please speak again" type common in conventional systems, the PHS telephone replies as though it had a mind of its own. In this example the telephone prompts a natural re-utterance by behaving as if it had been doing something else: "Sorry, I was thinking about something else. What was that?" (S6). Various patterns of such dialogue are stored in the user specialization unit 6 and displayed (on the screen) or spoken (through the speaker) as needed, and by not giving the same response twice in a row the human-like character can be heightened. In other words, the re-utterance is prompted with different words (sentences) each time.
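The acceptance test described in this paragraph combines two checks: the best similarity must exceed a threshold, and it must lead the second candidate by a sufficient margin. A minimal Python sketch follows, assuming similarity scores where higher is better, as on the schematic 0 to 100 scale used earlier. The function name `decide` and the numeric values in the usage note are illustrative assumptions.

```python
def decide(candidates, min_similarity, min_margin):
    """Return the accepted label, or None if the user should be asked again.

    candidates: list of (label, similarity) pairs, higher similarity is better.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_label, best_score = ranked[0]
    if best_score <= min_similarity:
        return None  # S4: similarity does not exceed the threshold
    if len(ranked) > 1 and best_score - ranked[1][1] <= min_margin:
        return None  # first and second candidates are too close to call
    return best_label
```

For example, `decide([("Meguro", 82), ("Mejiro", 40)], 60, 10)` returns `"Meguro"`, whereas `decide([("Meguro", 70), ("Mejiro", 68)], 60, 10)` returns `None` because the two candidates are nearly tied, which is the "me/**/ro" situation. A `None` result corresponds to step S6, where the device prompts the user to speak again.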

【００２２】[0022]

In addition to such responses, preliminary experiments confirmed that incorporating services specific to the PHS telephone into the conversation further enhances the personification, or pet-like quality, and increases the user's attachment to the terminal. This is shown in the latter half of the flow of FIG. 4. In this example, the response is linked to the location information service of the PHS telephone. If, for example, Yokohama has been set as a condition and the PHS location information service indicates that the telephone is in Yokohama, the condition is satisfied and the display shows [To Meguro-san: "Tell them it's hot in Yokohama today." "OK, I'll connect you."] (S8). The user thereby learns that "Meguro" was recognized correctly, sees the fixed phrase that follows "Yokohama", and understands from "OK, I'll connect you" that voice dialing is being performed. If many such conditions are set up, a different condition is satisfied on each use and a corresponding sentence or guidance message appears. The user becomes curious about what will appear next and grows attached to using the device, including repeating utterances. As a result, the same user repeats utterances and gives voice input over many days, adaptation of the standard pattern to that user's voice progresses, and the recognition rate for that user rises. As the recognition rate rises, a device using this kind of voice recognition becomes easier to use and meets with less resistance.
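The location-conditioned response of step S8 can be sketched as follows. This is only an illustration in Python under stated assumptions: the function name `build_response`, the dictionary `LOCATION_PHRASES`, and the English wording of the phrases are hypothetical, and the location string is assumed to come from the PHS location information service in some form the description does not specify.

```python
from typing import Dict, List

# Hypothetical table: location condition -> fixed phrase to weave into the reply.
LOCATION_PHRASES: Dict[str, str] = {
    "Yokohama": "Tell them it's hot in Yokohama today.",
}


def build_response(callee: str, terminal_location: str,
                   location_phrases: Dict[str, str]) -> List[str]:
    """Return the lines to display (or speak) before dialing `callee`."""
    lines = []
    phrase = location_phrases.get(terminal_location)
    if phrase is not None:
        # Condition satisfied: insert the fixed phrase for this location.
        lines.append(f'To {callee}-san: "{phrase}"')
    # The confirmation line is shown whether or not a condition matched.
    lines.append("OK, I'll connect you.")
    return lines
```

Calling `build_response("Meguro", "Yokohama", LOCATION_PHRASES)` yields both the location phrase and the confirmation line, while an unlisted location yields the confirmation line alone.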

【００２３】[0023]

In the example above, which is based on the location information service, the location Yokohama was the condition, and the fixed response described above was generated when it was satisfied. Various other conditions can also be set, with a corresponding response sentence generated whenever one of them is satisfied. For example, by using information obtained through links to the weather forecasts or fortune-telling information offered by the PHS telephone service provider, the location of the terminal or the weather can be worked into the conversation, which makes the terminal feel less impersonal. These are only examples. In another embodiment of the invention, when the condition that the number is one the user calls often is satisfied, a message such as "The usual person, right? I wonder if they're in." is displayed, or spoken by synthesized voice, while the call is being dialed. Other conditions can use personal information registered by the user, such as name, date of birth, and place of residence, so that response sentences (including guidance sentences) suited to the user of the main body 1 are displayed or spoken, increasing attachment to the main body (processing device) 1. For example, on the first call made on the user's birthday, with the date of birth as the condition, the device says or displays "Happy birthday, Mr./Ms. XX" and then gives a display or utterance that emphasizes the personification, such as "A call to Mr./Ms. YY, right? Leave it to me." Conversation and displays that create this personification or pet-like quality quickly become tiresome if they are fixed phrases, but the data transfer function of the PHS telephone allows the PHS telephone service provider to send a variety of responses or conversational sentences to the user specializing unit 6 in FIG. 1. Naturally, to avoid charging the user for the call, this data transfer would be carried out at the expense of the PHS telephone service provider. Also naturally, the incoming call needed for this data transfer should be a so-called silent (non-ringing) incoming call, so that the user does not mistake it for a call from an acquaintance and answer it.
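A rough Python sketch of the personal-information conditions and of refreshing the phrase store by data transfer is given below. Everything named here is hypothetical: `greeting_lines`, `merge_downloaded_phrases`, the `frequent_threshold` of 10 calls, and the dictionary layout of the phrase store are assumptions for illustration, not details given in the description.

```python
from datetime import date
from typing import Dict, List


def greeting_lines(callee: str, today: date, birthday: date, user_name: str,
                   call_counts: Dict[str, int], first_call_today: bool,
                   frequent_threshold: int = 10) -> List[str]:
    """Pick personalized lines from conditions on registered user data."""
    lines = []
    is_birthday = (today.month, today.day) == (birthday.month, birthday.day)
    if first_call_today and is_birthday:
        # Birthday condition: greet first, then announce the call.
        lines.append(f"Happy birthday, {user_name}-san!")
        lines.append(f"A call to {callee}-san, right? Leave it to me.")
    elif call_counts.get(callee, 0) >= frequent_threshold:
        # Frequently called number condition.
        lines.append("The usual person, right? I wonder if they're in.")
    return lines


def merge_downloaded_phrases(store: Dict[str, List[str]],
                             downloaded: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Add phrases pushed by the service provider, skipping duplicates."""
    for condition, phrases in downloaded.items():
        known = store.setdefault(condition, [])
        for phrase in phrases:
            if phrase not in known:
                known.append(phrase)
    return store
```

Keeping the phrases in a store keyed by condition is one way to let newly transferred sentences take effect without changing the selection logic.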

【００２４】[0024]

As the examples above show, a feature of the present invention is that, when the user's utterance is hard to recognize or the likelihood of misrecognition is high, returning a personified response clearly eases the user's dissatisfaction with voice recognition performance. In addition, because the user repeats utterances in this way, the characteristics of the user's speech are collected, and speaker adaptation of the standard pattern improves the recognition rate. This is the result of having the system learn the user's utterances.

【００２５】[0025]

The user specializing unit 6, which characterizes the present invention, needs an arithmetic processing function, a memory function, and the like to achieve its purpose. Arithmetic processing and memory functions are generally already included in the system basic function unit or the voice recognition unit of a PHS telephone, and using them lowers the cost of the system as a whole. Configurations that reuse the digital signal processor (DSP; not shown) that executes the voice recognition function or the CPU (not shown) that controls the telephone functions, or that use the processing capacity left over after these have performed their primary functions, worked without problems and fulfilled the required functions. On the other hand, giving the user specializing unit its own dedicated arithmetic processing function and memory, rather than reusing these components, is effective in avoiding interference and memory contention between the user specializing unit 6 and the system basic function unit. This arrangement was suited to the configuration, tried among the application examples of the invention, that provides more advanced responses and conversation.

【００２６】[0026]

Application to devices other than telephones is now described. Because the conversational and response sentences described above cannot be delivered by communication, the range of expressions available for responses is unavoidably limited. Even so, the presence of the user specializing unit 6 can bring a large benefit. Two reasons are known for why the recognition rate of conventional remote controls with a voice recognition function failed to reach a practical level (generally said to be 90% or higher). One is the immaturity of voice recognition programs, which, as is well known, has improved greatly in recent years. The other is that users were not accustomed to speaking in a way suited to the voice recognition function. This stemmed from mistimed utterances and from users' habit of adding unnecessary filler words such as "er" and "um". When even hard-to-recognize utterances were forcibly recognized, the proportion of misrecognitions increased, and users ended up with the impression that the device was "useless". In contrast, the present invention is characterized by a humorous re-utterance procedure that brings the personification or pet-like quality to the fore for utterances that are likely to be misrecognized. For example, there were cases in which ambient noise was high and the similarity calculated after the utterance did not reach the predetermined value. In such cases, presenting the subject with the message "Sorry, it was too noisy around here and I couldn't catch that. Tell me again." reduced the subject's aversion to repeating the utterance to the remote control.

【００２７】[0027]

When resistance to re-utterance is reduced in this way, more re-utterances are obtained from the user and more data specific to that user's voice can be accumulated, which in turn raises the recognition rate for the user's utterances. To this end, when a recognition result is accepted by the user, it is compared with the utterances that were repeated for recognition immediately beforehand. In this process, voice recognition problems are analyzed, such as whether the user tends to add unnecessary words, whether the timing of the utterances is off, or whether ambient noise is high. The recognition algorithm is then modified to suit the user as needed (this is also called learning), and data or a program is transferred from the user specializing unit 6 to the voice recognition unit 5 in the configuration of FIG. 1. The voice recognition function can thereby be made to suit the user's utterances.
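One way to picture the comparison between an accepted utterance and the failed attempts that preceded it is the following Python sketch. It is an assumption-laden illustration: the `Attempt` fields, the filler list `FILLERS`, the tolerances `delay_tol_s` and `noise_tol_db`, and the idea of counting causes with a `Counter` are all hypothetical, since the description names the three problem types but not how they are measured.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# Hypothetical filler words (romanized); the description cites "e~, ano~".
FILLERS = frozenset({"e", "ano", "eh", "um"})


@dataclass
class Attempt:
    transcript: str       # rough word sequence heard for this attempt
    onset_delay_s: float  # time from prompt to start of speech
    noise_db: float       # ambient noise estimate during the attempt


def diagnose(failed: List[Attempt], accepted: Attempt,
             delay_tol_s: float = 0.5, noise_tol_db: float = 10.0) -> Counter:
    """Count likely causes of failure by comparing with the accepted attempt."""
    causes = Counter()
    accepted_words = set(accepted.transcript.lower().split())
    for attempt in failed:
        # Words present in the failed attempt but not in the accepted one.
        extra = set(attempt.transcript.lower().split()) - accepted_words
        if extra & FILLERS:
            causes["filler_words"] += 1
        if abs(attempt.onset_delay_s - accepted.onset_delay_s) > delay_tol_s:
            causes["timing"] += 1
        if attempt.noise_db - accepted.noise_db > noise_tol_db:
            causes["noise"] += 1
    return causes
```

The resulting counts could then guide which user-specific adjustment is passed from the user specializing unit 6 to the voice recognition unit 5, for example stripping fillers or widening the listening window; that mapping is left open here.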

【００２８】[0028]

[Effects of the Invention] With the configuration of the present invention, the user can raise the recognition rate of the voice recognition function while enjoying the processing device that has it, whether that device is a communication terminal such as a telephone, a game machine, a so-called remote control, a household electrical appliance, or a composite or combination of these. In addition, personification and pet-like behavior specialized to the user's utterances increase attachment to the processing device, such as the terminal or product concerned, and have the effect of lengthening the period of use.

【００２９】[0029]

In particular, in a telephone or similar device with a communication function, the personification and pet-like quality can be enhanced further by rewriting the built-in data as appropriate and by displaying or speaking response messages suited to the season or place, which strengthens the user's attachment. As a result, this can also be an effective countermeasure against PHS telephone subscriptions and the like being cancelled soon after purchase.

[Brief description of the drawings]

FIG. 1 is a schematic diagram showing the configuration of the present invention.

FIG. 2 is a schematic diagram of the functional configuration of a PHS telephone with a voice recognition function to which the present invention is applied.

FIG. 3 is a diagram showing one embodiment of the voice recognition function, which is one of the constituent elements of the present invention.

FIG. 4 is a flowchart showing an example of a procedure for using the voice recognition function of the present invention.

FIG. 5 is a flowchart showing an example of a procedure for using a conventional voice recognition function.

Continued from the front page: (72) Inventor: Yutaka Nishino, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo

Claims

[Claims]

1. A misrecognition mitigation method for a device using voice recognition, in which input voice is recognized by a voice recognition function and processing is performed based on the recognition result, characterized in that, in the recognition processing of the input voice, the similarity is compared with a threshold value, and when the comparison determines that the recognition is erroneous or difficult, re-utterance is prompted by a display or voice whose content differs from the previous time.

2. A misrecognition mitigation method for a device using voice recognition, in which input voice is recognized by a voice recognition function and response processing is performed based on the recognition result, characterized in that the response based on the recognition result is displayed or uttered in a manner suited to the user.

3. The method according to claim 1, in which the input voice is recognized by a voice recognition function and response processing is performed based on the recognition result, comprising: determining whether or not a condition is satisfied; and, if the condition is satisfied, reading out the fixed phrase corresponding to the condition, generating a response sentence, and displaying or uttering the response sentence.

4. The method according to claim 1, wherein the device is provided with a data transfer function, and the data required for the display or utterance is transferred from outside the device.