JP2010197644A - Speech recognition system

Info

Publication number: JP2010197644A
Application number: JP2009041794A
Authority: JP
Inventors: Yuzo Takahashi; 優三高橋; Takashi Kato; 隆加藤
Original assignee: Urimina; URIMINA KK; Gifu University NUC
Current assignee: Urimina; URIMINA KK; Gifu University NUC
Priority date: 2009-02-25
Filing date: 2009-02-25
Publication date: 2010-09-09

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition system which does not require registering work for utilizing an enroll function before speech recognition, and which significantly improves a recognition rate of speech recognition without causing incorrect recognition and malfunction.

SOLUTION: A recognition computer 2 in the recognition system 1 includes: a speech information acquiring means 8 for detecting speech V which is output by a speaker S, and for acquiring speech information 17; an utterance tendency specifying means 9 for analyzing and specifying utterance tendency regarding the speech V, based on the speech information 17; a dictionary group storage means 10 for storing a reference dictionary SD and a plurality of utterance tendency dictionaries X1 etc.; a dictionary selection means 11 for selecting one utterance tendency dictionary X1 etc. which matches utterance tendency; a reference collating means 12 for collating a vocabulary by using the reference dictionary SD; an utterance tendency collating means 13 for collating the vocabulary by using the utterance tendency dictionary X1 etc.; and a vocabulary output means 14 for outputting the vocabulary regarding the recognized speech V.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech recognition system, and in particular to a speech recognition system that can be used in man-machine interface technology for speech input, such as in simulators for medical education, and that can improve the recognition rate of speech uttered by a speaker.

Speech recognition technology has long been under development: speech uttered by a speaker is received by a speech input device such as a microphone, acquired as speech information, and recognized by analyzing that speech information. This technology makes it possible to operate a computer or the like by speech input alone, without an operation input device such as a keyboard or mouse, so that even elderly people unaccustomed to computers and people requiring nursing care whose limb movement is restricted can operate a computer easily. In addition, adopting speech input and speech recognition in car navigation systems and the like allows a driver to set a destination or start guidance by voice without taking their hands off the steering wheel, which improves safety. Speech input and speech recognition technology is thus used in a wide range of fields, and its use in new technical fields is expected.

In speech recognition technology, failure to recognize the speaker's speech accurately can lead to problems such as device malfunctions and errors. How accurately the acquired speech information can be recognized without misrecognition is therefore a particularly important issue for practical use. When recognition by speech input is performed, a recognition dictionary (database) is used in which phonemes, frequency characteristics, and the like are registered in advance for each vocabulary item contained in the speech. Because commercially available speech recognition products target a wide range of unspecified users, they are set up to properly recognize vocabulary uttered with the accent and pronunciation of a standard pronunciation tendency (the so-called "standard language" or "common language").

Consequently, for speakers who habitually speak extremely quickly or, conversely, slowly, speakers whose voice quality (high or low) differs markedly, and speakers whose accent or pronunciation differs from the standard language because of a regional dialect, the recognition rate of speech recognition technology (speech recognition software) set up for the standard language may drop significantly. Such speakers cannot obtain stable speech recognition, and tasks such as composing text can become more cumbersome than with keyboard input.

To improve the recognition rate for speakers whose pronunciation tendency is not the standard one, the speech of a specific speaker is therefore registered in advance, analyzed, and stored in a database. Specifically, speech recognition software has been developed with a function (the so-called "enrollment function") that dramatically improves the recognition rate by having the speaker read prescribed sentences aloud in advance and acquiring and registering them as speech information. In this case, a speech recognition device such as a computer running the speech recognition software constructs an acoustic model based on the speaker's pronunciation tendency, and when that speaker speaks, recognition is performed using the acoustic model constructed for that individual. As a result, the recognition rate is kept at or above a certain level and improved to the point where it poses no practical problem.

However, speech recognition technology that uses the enrollment function described above can have the following problems. First, the preliminary registration work needed to improve the recognition rate can be very cumbersome. The pronunciation tendency of the specific speaker has to be analyzed in detail by statistical processing, so the speaker is forced to read a huge amount of text aloud at registration. With commercially available speech recognition software, for example, registration for the enrollment function requires the speaker to read aloud each of a number of sentences specified by the software, and the software to analyze the content and register it as data, over and over. Completing registration can therefore take 30 minutes or more. For anyone other than a specialist operator who uses a speech recognition system constantly, spending time on such cumbersome registration is often wasteful, and the enrollment function itself often goes unused. The software is then used without the enrollment function, at a low recognition rate.

In addition, what speech recognition technology analyzes is the frequency characteristics of the speaker's voice, the intensity of the sound, and the like, and a speaker's pronunciation tendency is not always constant. Depending on the speaker's emotions (joy, anger, sorrow, pleasure), a speaker may speak with an accent or the like that differs from their usual pronunciation tendency, and even a speaker who normally speaks the standard language may, with a change in emotion, speak with the dialect of their home region. Thus even a speaker who enrolled calmly in the standard language may show a different pronunciation tendency during actual speech input because of emotion or the like, so that the enrollment function cannot work fully. The recognition rate may then decrease instead.

In view of the above circumstances, an object of the present invention is to provide a speech recognition system that requires no registration work for an enrollment function before speech recognition, identifies a pronunciation tendency dictionary appropriate to each situation, and can dramatically improve the recognition rate of speech recognition.

To solve the above problem, the speech recognition system of the present invention mainly comprises: "speech information acquisition means for detecting a speaker's speech and acquiring speech information; pronunciation tendency specifying means for analyzing and specifying, based on the acquired speech information, the speaker's pronunciation tendency, including the highness or lowness of the voice, frequency characteristics, accent, and pitch; dictionary group storage means for storing a standard dictionary constructed on the basis of a standardized standard pronunciation tendency, and a plurality of pronunciation tendency dictionaries each constructed on the basis of a specific pronunciation tendency that differs from the standard pronunciation tendency; dictionary selection means for selecting, from the stored pronunciation tendency dictionaries, one pronunciation tendency dictionary that matches or is similar to the specified pronunciation tendency of the speech; standard collation means for collating and recognizing the vocabulary contained in the speech using the speech information and the standard dictionary; pronunciation tendency collation means for collating and recognizing vocabulary judged unrecognized by the standard collation means, using the pronunciation tendency dictionary selected by the dictionary selection means; and vocabulary output means for outputting the vocabulary of the speech recognized by at least one of the standard collation means and the pronunciation tendency collation means."

Here, the pronunciation tendency refers to properties of the speech uttered by the speaker, such as the highness or lowness of the voice, the frequency distribution of its acoustic characteristics, accent and intonation, and pitch, by which it can be distinguished from other people's speech. Specifying the pronunciation tendency may also involve distinctions based on voice quality, tone of voice, and the speaker's emotion (joy, anger, sorrow, pleasure) at the time of utterance. The pronunciation tendency may be quantified by, for example, a frequency distribution.

The standard dictionary stored in the dictionary group storage means, on the other hand, is defined on the basis of the standard pronunciation tendency, that is, the accent of each vocabulary item in the standard language (common language) and an average voice quality and pitch. Speech recognition can therefore be performed for a wide range of people without identifying the speaker. The pronunciation tendency dictionaries, by contrast, come in multiple types, each deviating from the standard pronunciation tendency of the standard dictionary and reflecting a pronunciation tendency that varies by speaker. More specifically, they can be set up individually for various types such as high or low voices, accent differences due to regional variation such as dialects, differences in speaking speed, and age groups. In other words, a pronunciation tendency dictionary is constructed for each class of pronunciation tendency. These dictionaries together form a dictionary group, which is stored in the dictionary group storage means.

Accordingly, in the speech recognition system of the present invention, the speaker's pronunciation tendency is specified from the acquired speech information, and a pronunciation tendency dictionary that matches or is similar to it is selected from the dictionary group storage means. In speech recognition, the vocabulary is first collated and recognized using the standard dictionary based on the standard pronunciation tendency. Vocabulary not recognized with the standard dictionary is then collated and recognized using the selected pronunciation tendency dictionary. The vocabulary output means outputs the result of collation with the two dictionaries. Because the system performs recognition in two stages using two dictionaries, the recognition rate can be improved. In particular, because the pronunciation tendency dictionary chosen from the pronunciation tendency lets recognition take account of the speaker's speaking habits, the recognition rate can be improved dramatically.
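The two-stage collation described here can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionaries are modeled as plain mappings from a string key (standing in for whatever acoustic pattern the real system matches on) to a vocabulary item, and the names `recognize_two_stage`, `standard_dictionary`, and `tendency_dictionary` are assumptions, not terms from the patent.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in: a dictionary maps an observed sound pattern
# (a string key here) to the vocabulary item it represents.
Dictionary = Dict[str, str]


def recognize_two_stage(
    patterns: Sequence[str],
    standard_dictionary: Dictionary,
    tendency_dictionary: Dictionary,
) -> List[Optional[str]]:
    """Collate each pattern against the standard dictionary first, then
    against the selected pronunciation tendency dictionary if unrecognized."""
    results: List[Optional[str]] = []
    for pattern in patterns:
        # Stage 1: standard collation.
        word = standard_dictionary.get(pattern)
        if word is None:
            # Stage 2: only vocabulary left unrecognized by stage 1 is
            # collated against the pronunciation tendency dictionary.
            word = tendency_dictionary.get(pattern)
        results.append(word)  # None means unrecognized by both stages
    return results
```

Note that the second dictionary is consulted only for items the first stage failed on, so speech that already matches the standard pronunciation tendency incurs no extra lookup.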

In addition to the above configuration, the speech recognition system of the present invention may be one in which "the pronunciation tendency specifying means further has earliest specifying means for specifying the pronunciation tendency based on the speech uttered by the speaker at the earliest timing, and the dictionary selection means has fixed selection means for fixing the pronunciation tendency dictionary for the speaker based on the pronunciation tendency specified by the earliest specifying means."

In this configuration, the pronunciation tendency is specified from the speech the speaker utters first, and recognition for that speaker thereafter proceeds with the pronunciation tendency dictionary fixed. That is, the speaker's pronunciation tendency can be specified from recognition of the first sentence. The pronunciation tendency is thus specified only once, and no further load is placed on the system afterward.

In addition to the above configuration, the speech recognition system of the present invention may be one in which "the pronunciation tendency specifying means further has sequential specifying means for specifying the pronunciation tendency anew each time the speaker utters speech, and the dictionary selection means has dictionary reselection means for reselecting the pronunciation tendency dictionary based on the sequentially specified pronunciation tendency."

In this configuration, in contrast to the one above that fixes the pronunciation tendency dictionary for a speaker, the pronunciation tendency is specified and the pronunciation tendency dictionary selected for every utterance. Even the same speaker may speak faster or in a different tone of voice depending on emotion. If recognition stayed fixed to the pronunciation tendency dictionary selected first, the recognition rate with that dictionary could fall. Repeating the specification of the pronunciation tendency and the selection of the dictionary as needed therefore prevents a drop in the recognition rate, at the cost of some additional load on the system.

In addition to the above configuration, the speech recognition system of the present invention may be one in which "the speech information acquisition means acquires conversational speech information formed by a mixture of speech uttered by each of a plurality of speakers."

In this configuration, the speech recognition system acquires speech information for conversational speech in which a plurality of speakers gather and converse. A pronunciation tendency can thus be specified for each speaker taking part in the conversation, and speech recognition performed on that basis.

As an effect of the present invention, performing speech recognition in two stages, with the standard dictionary and then a pronunciation tendency dictionary, improves the recognition rate. Furthermore, specifying the pronunciation tendency every time the speaker speaks allows stable speech recognition that is not affected by the speaker's emotions and the like.

FIG. 1 is an explanatory diagram showing the schematic configuration of the speech recognition system of the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the recognition computer in the speech recognition system.
FIG. 3 is a flowchart showing the processing flow of the recognition computer.
FIG. 4 is a flowchart showing the processing flow of the recognition computer.

A speech recognition system 1 according to an embodiment of the present invention (hereinafter simply "recognition system 1") is described below with reference to FIGS. 1 to 4. FIG. 1 is an explanatory diagram showing the schematic configuration of the speech recognition system 1 of the present embodiment, FIG. 2 is a block diagram showing the functional configuration of the recognition computer 2 in the speech recognition system 1, and FIGS. 3 and 4 are flowcharts showing the processing flow of the recognition computer 2.

As shown in FIGS. 1 to 4, the recognition system 1 of the present embodiment mainly consists of a recognition computer 2 that functions as a speech recognition device. As shown in FIGS. 1 and 2, the recognition computer 2 is built mainly on a commercially available personal computer, with the following connected to a computer main body 7: a speech input device 3, such as a microphone, that picks up the voice V uttered by the speaker S; an operation input device 4, such as a keyboard, for entering data and performing operations; and a liquid crystal display 6 having an output screen 5 on which recognition results are output as text.

Inside the computer main body 7 are implemented a speech recognition function SR that analyzes and recognizes the voice V, and a vocabulary collation function VC that collates the vocabulary contained in the recognized voice V using the various dictionaries (standard dictionary SD, pronunciation tendency dictionary X1, etc.). The computer main body 7 further includes interfaces and control mechanisms for exchanging signals with the devices 3, 4, 6 and so on described above; a communication function that enables connection to a network environment such as the Internet; storage means 19 (dictionary group storage means 10 etc.), such as a hard disk, holding the recognition system software (not shown) that makes the computer function as the recognition system 1; and an arithmetic processing unit, including a CPU, that performs the various processes according to the recognition system software. The configuration and functions of such personal computers are well known, so their description is omitted here.

As its functional configuration, the recognition computer 2 mainly comprises: speech information acquisition means 8 that detects, via the speech input device 3, the voice V uttered by the speaker S, converts the audio signal of the voice V into an electrical signal, and acquires it as speech information 17; pronunciation tendency specifying means 9 that analyzes and specifies the pronunciation tendency of the voice V based on the acquired speech information 17; dictionary group storage means 10 that converts into electronic data, and stores as a database, a group of dictionaries (dictionary group) consisting of a standard dictionary SD, built by registering each vocabulary item according to a standard pronunciation tendency prescribed in advance for pronouncing the standard language and the like, and a plurality of pronunciation tendency dictionaries X1, X2, X3, ..., Xn, each built by registering each vocabulary item according to a voice quality, accent, pitch, dialect, or the like that differs from the standard pronunciation tendency; dictionary selection means 11 that selects, from the pronunciation tendency dictionaries X1 etc. stored in the dictionary group storage means 10, the one pronunciation tendency dictionary X1 etc. whose parameters, such as frequency characteristics, match or are most similar to the pronunciation tendency specified from the acquired speech information 17; standard collation means 12 that collates and recognizes the vocabulary contained in the voice V using the acquired speech information 17 and the standard dictionary SD in the dictionary group storage means 10; pronunciation tendency collation means 13 that collates and recognizes the vocabulary not recognized by the standard collation means 12 (unrecognized vocabulary) using the pronunciation tendency dictionary X1 etc. selected according to the pronunciation tendency; and vocabulary output means 14 that outputs the vocabulary of the voice V recognized with the standard dictionary SD and with the pronunciation tendency dictionary X1 etc.
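The selection of the dictionary whose parameters "match or are most similar" can be illustrated as a nearest-neighbor choice over numeric parameter vectors. The following Python sketch rests on assumptions not stated in the patent: that each pronunciation tendency dictionary is summarized by a fixed-length vector of parameters (for example, a coarse frequency distribution), and that similarity is measured by Euclidean distance. The names `select_tendency_dictionary` and `profiles` are hypothetical.

```python
import math
from typing import Dict, Sequence


def select_tendency_dictionary(
    tendency: Sequence[float],
    profiles: Dict[str, Sequence[float]],
) -> str:
    """Return the name of the dictionary whose parameter vector is closest
    to the specified pronunciation tendency.

    Euclidean distance is an assumed similarity measure; the patent only
    says the parameters should match or be most similar.
    """
    if not profiles:
        raise ValueError("no pronunciation tendency dictionaries registered")
    # An exact match has distance 0 and is therefore always preferred.
    return min(profiles, key=lambda name: math.dist(tendency, profiles[name]))
```

For example, with `profiles = {"X1": (1.0, 0.0), "X2": (0.0, 1.0)}`, a tendency of `(0.9, 0.2)` would select `"X1"`.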

In the recognition system 1 of the present embodiment, two methods are available for specifying the pronunciation tendency and then selecting the pronunciation tendency dictionary X1 etc., and the speaker S can choose between them. The first is the selected-dictionary fixing method, which uses earliest specifying means 15, a sub-function of the pronunciation tendency specifying means 9 that specifies the pronunciation tendency from the voice V the speaker S utters first in the recognition system 1, and fixed selection means 16, a sub-function of the dictionary selection means 11 that fixes the pronunciation tendency dictionary X1 etc. for the speaker S according to the pronunciation tendency specified by the earliest specifying means 15 and uses the same pronunciation tendency dictionary X1 to recognize all subsequent voice V. The second is the selected-dictionary variation method, which uses sequential specifying means 17, a sub-function of the pronunciation tendency specifying means 9 that specifies the pronunciation tendency anew each time the speaker utters the voice V, and dictionary reselection means 18, a sub-function of the dictionary selection means 11 that reselects the pronunciation tendency dictionary X1 etc. according to the pronunciation tendency specified each time by the sequential specifying means 17.

With the selected-dictionary fixing method, the pronunciation tendency is specified and the pronunciation tendency dictionary X1 etc. selected from the first voice V of the speaker S, and that pronunciation tendency dictionary X1 etc. stays fixed throughout speech recognition in the recognition system 1. Because specifying the pronunciation tendency and selecting the pronunciation tendency dictionary X1 are completed in one pass, subsequent speech recognition can proceed quickly. The method therefore has the advantage of not placing an excessive load on the recognition system 1 during speech recognition. However, the pronunciation tendency of even the same speaker S can shift when the speaker's emotion changes, for example when the speaker talks faster or louder. In that case, the pronunciation tendency dictionary X1 fixed earlier may cause misrecognition or make vocabulary collation impossible, and the recognition rate may fall.

With the selected-dictionary variation method, on the other hand, the pronunciation tendency is specified and the pronunciation tendency dictionary X1 etc. selected again for every utterance, so the system can respond immediately to changes in pronunciation tendency during speech. Moreover, even when recognizing voice V in a meeting or similar setting where a plurality of speakers S gather and converse together, there is no need to identify individual speakers S; recognition can simply follow the pronunciation tendency of the voice V itself, which is an advantage over the selected-dictionary fixing method. However, because the pronunciation tendency is specified and the pronunciation tendency dictionary X1 etc. selected for every utterance, the method may place an excessive load on the recognition system 1. In particular, when a plurality of speakers S speak at once, selection and related processing must be performed for each utterance (voice V), which may delay the final output of the vocabulary. The speaker S can therefore freely choose which method to use according to the circumstances in which speech recognition is performed.
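The trade-off between the two methods can be made concrete with a small Python sketch. It is an illustration under assumptions: the class name `DictionarySelector`, the `select` callback, and the `selection_count` counter are all hypothetical, and the pronunciation tendency is reduced to a single number for brevity.

```python
from typing import Callable, Optional


class DictionarySelector:
    """Chooses a pronunciation tendency dictionary per utterance.

    fixed=True  : select once from the first utterance, then reuse it.
    fixed=False : reselect on every utterance.
    """

    def __init__(self, select: Callable[[float], str], fixed: bool) -> None:
        self._select = select          # hypothetical selection routine
        self._fixed = fixed
        self._current: Optional[str] = None
        self.selection_count = 0       # how often selection actually ran

    def dictionary_for(self, tendency: float) -> str:
        # Run selection on the first utterance, and on later utterances
        # only when the dictionary is not fixed.
        if self._current is None or not self._fixed:
            self._current = self._select(tendency)
            self.selection_count += 1
        return self._current
```

In fixed mode the selection routine runs once regardless of how many utterances follow, which keeps the load low but ignores later shifts in tendency; in variable mode it runs once per utterance and follows those shifts.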

Next, an example of the flow of speech recognition processing by the recognition system 1 of the present embodiment will be described with reference to FIGS. 3 and 4. First, before speech recognition is performed, input of a selection instruction specifying the method for selecting the pronunciation tendency dictionary X1 or the like is accepted (step S1). The speaker S decides on this selection instruction after weighing the advantages and disadvantages described above. Once the speaker S has chosen a selection method and the input of the selection instruction has been accepted (step S1), a flag F is set so that subsequent processing can distinguish between the selection methods (step S2). Specifically, in the recognition system 1 of the present embodiment, F = 1 is assigned when the selection dictionary fixing method is selected, and F ≠ 1 is assigned when the selection dictionary variation method is selected (step S2). The flag F is used for the determination made in a later step.

Thereafter, detection processing is executed to determine whether the speaker S is uttering a voice V to be recognized (step S3). When a voice V to be recognized is uttered by the speaker S and detected via the voice input device 3 (YES in step S3), the recognition computer 2 of the recognition system 1 receives it via the voice input device 3, converts the voice V into an electrical signal based on frequency characteristics, waveform information, and the like, and acquires the result as voice information 17 (step S4). The processing of converting the voice V into an electrical signal and acquiring the voice information 17 as electronic data is a well-known technique, and its description is omitted here. The voice information 17 is thereby stored in the storage means 19 of the recognition computer 2. When no voice V is detected by the voice input device 3 (NO in step S3), the processing of step S3 is repeated, and the system waits until a voice V is detected and the voice information 17 is acquired.

Then, the pronunciation tendency of the voice V (speaker S) is analyzed and specified from the acquired voice information 17 (step S5). The pronunciation tendency is determined from the characteristics of the voice V obtained by analyzing the voice information 17 (the voice quality of the speaker S; the highness or lowness of the voice V; the frequency characteristics, frequency distribution, and voiceprint of the voice V; pronunciation, accent, and intonation; and the pitch and speaking rate of the voice V). That is, the pronunciation tendency differs from one speaker S to another, so individual speakers S can be distinguished by differences in the waveform of the voice V (the so-called "voiceprint"). The pronunciation tendency may also vary with the emotional state of the speaker S at the time of utterance. An acoustic model SM for specifying the pronunciation tendency of the speaker S is stored in advance in the storage means 19; the characteristics of the detected voice V are compared against this model, and the pronunciation tendency is thereby specified. The specified pronunciation tendency may be stored in the storage means 19 as electronic data (not shown). Alternatively, because it is basically no longer needed once the pronunciation tendency dictionary X1 or the like described later has been selected, it may be held temporarily in a memory or the like and erased after the selection.

Thereafter, from among the plurality of pronunciation tendency dictionaries X1 and the like with differing pronunciation tendencies, each stored in the dictionary group storage means 10 of the storage means 19 as a database of vocabulary in electronic form, one pronunciation tendency dictionary X1 or the like that matches, or is most similar to, the pronunciation tendency specified using the acoustic model SM is selected (step S6). In this case, the degree of similarity of the pronunciation tendency is expressed numerically, and the most similar dictionary is selected according to that value.
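As a non-limiting illustration, the numerical similarity comparison of step S6 might be sketched as follows. The specification does not disclose how the degree of similarity is computed; the representation of a pronunciation tendency as a numeric feature vector, the use of cosine similarity, and the names `similarity`, `select_dictionary`, and `profiles` are all assumptions made for this sketch.

```python
import math


def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors.

    Assumed measure: the specification only says the degree of
    similarity is expressed numerically.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_dictionary(tendency, profiles):
    """Return the name of the dictionary whose profile is most similar (step S6).

    `profiles` maps a dictionary name (e.g. "X1") to a hypothetical
    feature vector describing the pronunciation tendency it was built for.
    """
    return max(profiles, key=lambda name: similarity(tendency, profiles[name]))


# Hypothetical profiles for three pronunciation tendency dictionaries.
profiles = {
    "X1": [1.0, 0.0, 0.0],
    "X2": [0.0, 1.0, 0.0],
    "X3": [0.0, 0.0, 1.0],
}
selected = select_dictionary([0.1, 0.9, 0.2], profiles)
```

Any other numerical measure could be substituted for `similarity` without changing the selection logic, which only requires that the highest value identify the most similar dictionary.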

Next, the vocabulary contained in the acquired voice information 17 is collated using the standard dictionary SD in the dictionary group storage means 10, and speech recognition processing is performed (step S7). The voice V in the voice information 17 is composed of a plurality of vocabulary items, and the vocabulary constituting the voice V can be recognized by collating each item, using speech recognition technology, against the vocabulary registered in advance in the standard dictionary SD. When all the vocabulary constituting the voice V has been recognized with the standard dictionary SD (YES in step S8), that is, when the entire voice V uttered by the speaker S was pronounced with a standard pronunciation tendency and could be recognized with the standard dictionary SD, the recognized vocabulary is output as character information to the output screen 5 of the liquid crystal display 6 (step S9). The voice V is thereby recognized and converted into character information. This character information may be stored in the storage means 19 (not shown).

When recognition of all the vocabulary has not been completed with the standard dictionary SD (NO in step S8), that is, when some (or all) of the vocabulary could not be collated and recognized with the standard dictionary SD, the previously selected pronunciation tendency dictionary X1 or the like is used to collate the vocabulary of the voice V that was not recognized with the standard dictionary SD (step S10). When recognition of all the vocabulary constituting the voice V is completed with the selected pronunciation tendency dictionary X1 or the like (including the results of the collation and recognition already obtained with the standard dictionary SD) (YES in step S11), processing moves to step S9 and the recognized vocabulary is output as character information. When recognition of all the vocabulary has not been completed (NO in step S11), character information that includes the unrecognized portions is output to the output screen 5 (step S12). This character information may likewise be stored in the storage means 19.
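As a non-limiting illustration, the two-stage collation of steps S7 to S12 might be sketched as follows. Modeling each dictionary as a mapping from a detected sound unit to a vocabulary item, the function name `recognize`, and the `"[?]"` marker for unrecognized portions are assumptions made for this sketch; the actual collation would rely on speech recognition technology that the specification does not detail.

```python
def recognize(units, standard_dict, tendency_dict, placeholder="[?]"):
    """Two-stage vocabulary collation (sketch of steps S7 to S12).

    `units` is a list of hypothetical sound units extracted from the
    voice information; each dictionary maps a unit to a vocabulary item.
    Returns (vocabulary list, True if everything was recognized).
    """
    # Step S7: collate every unit against the standard dictionary SD.
    first = [standard_dict.get(u) for u in units]

    # Step S8 YES: everything recognized, go straight to output (S9).
    if all(word is not None for word in first):
        return first, True

    # Step S10: collate only the still-unrecognized units against the
    # selected pronunciation tendency dictionary.
    second = [
        word if word is not None else tendency_dict.get(u)
        for u, word in zip(units, first)
    ]

    # Step S11: check whether recognition is now complete.
    complete = all(word is not None for word in second)

    # Steps S9 / S12: unrecognized portions are kept as a visible marker.
    return [w if w is not None else placeholder for w in second], complete


# Hypothetical dictionaries and input.
standard = {"kyou": "today", "wa": "is"}
tendency = {"ee": "good"}
words, complete = recognize(["kyou", "wa", "ee", "tenki"], standard, tendency)
```

In this sketch the pronunciation tendency dictionary is consulted only for units the standard dictionary missed, which mirrors the order of processing described above.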

Thereafter, the recognition computer 2 detects whether the speaker S has uttered a new voice V (step S13). When a voice V is detected via the voice input device 3 (YES in step S13), the value of the previously set flag F is determined (step S14). When F = 1 (YES in step S14), that is, in the selection dictionary fixing method, in which the previously selected pronunciation tendency dictionary X1 or the like is also used to recognize the new voice V, the voice information 17 for the new voice V is acquired (step S15) and processing moves to step S7. Speech recognition processing using the standard dictionary SD and the fixed pronunciation tendency dictionary X1 or the like is thereby performed on that voice information 17, and the processing from step S7 to step S13 is repeated.

When F ≠ 1 (NO in step S14), that is, when the pronunciation tendency dictionary X1 or the like is selected for each detected voice V, processing moves to step S4. The pronunciation tendency is specified based on the new voice V, a pronunciation tendency dictionary X1 or the like is selected, and speech recognition processing is performed using the standard dictionary SD and the reselected pronunciation tendency dictionary X1 or the like; the processing from step S4 to step S13 is repeated.

Accordingly, when the speaker S continues to speak, the recognition computer 2 detects each voice V and, through two-stage vocabulary collation processing using the two types of dictionaries (the standard dictionary SD and the pronunciation tendency dictionary X1 or the like), converts the voice V into character information and outputs it. Furthermore, because the method of selecting the pronunciation tendency dictionary X1 or the like can be chosen before recognition starts, the method best suited to the voice V to be recognized can be chosen and the recognition rate can be improved.

When no new voice V is uttered by the speaker S and none is detected by the voice input device 3 (NO in step S13), the processing of steps S14 and S15 is skipped. The presence or absence of an instruction to terminate the system is then detected (step S16). When the instruction is detected (YES in step S16), the system is terminated (step S17). When the instruction is not detected (NO in step S16), processing returns to step S1 and the speech recognition processing continues.
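As a non-limiting illustration, the branching on the flag F across successive utterances (steps S13 to S15) might be sketched as follows. The constants `FIXED` and `VARIABLE`, the function `run_session`, the string labels used as pronunciation tendencies, and the stand-in `select` function are assumptions made for this sketch; the end of the input list stands in for the case where no new voice V is detected.

```python
FIXED = 1      # F = 1: selection dictionary fixing method
VARIABLE = 0   # F != 1: selection dictionary variation method


def run_session(tendencies, flag, select):
    """Return the pronunciation tendency dictionary used for each utterance.

    `tendencies` holds one hypothetical pronunciation-tendency label per
    detected voice V; `select` maps a label to a dictionary name and
    stands in for steps S5 and S6.
    """
    used = []
    current = None
    for tendency in tendencies:
        # The first utterance always triggers a selection. Later
        # utterances trigger one only in the variation method
        # (NO in step S14, back to step S4).
        if current is None or flag != FIXED:
            current = select(tendency)
        used.append(current)
    return used


def select(tendency):
    """Hypothetical stand-in for specifying a tendency and selecting a dictionary."""
    return "X1" if tendency == "calm" else "X2"


utterances = ["calm", "excited", "calm"]
fixed_result = run_session(utterances, FIXED, select)
variable_result = run_session(utterances, VARIABLE, select)
```

With the fixing method the dictionary chosen for the first utterance is reused even after the tendency changes, whereas the variation method reselects for every utterance at the cost of additional processing.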

With the recognition system 1 of the present embodiment, pronunciation tendency dictionaries X1 and the like constructed in advance according to a plurality of pronunciation tendencies are stored, and the optimal dictionary is selected based on the pronunciation tendency of the speaker S. Furthermore, by first performing collation with the standard dictionary SD, which is constructed according to a standard pronunciation tendency, the amount of vocabulary collation performed with the pronunciation tendency dictionary X1 or the like can be reduced. The standard dictionary SD is a common dictionary for recognizing voices in general speech, and it is constructed so that it is applicable to, and can recognize the vocabulary of, most of a voice even when that voice has a different pronunciation tendency. Most of the vocabulary in the voice V is therefore recognized first with the standard dictionary SD, and the pronunciation tendency dictionary X1 or the like is used only for the remaining vocabulary that could not be recognized with the standard dictionary SD (unrecognized vocabulary). The recognition rate for the voice V is thereby stabilized. Although it is possible to apply the pronunciation tendency dictionary X1 or the like directly without using the standard dictionary SD, recognition of the portions having a standard pronunciation tendency may then be poorer, and it may become difficult to convert the voice V completely and output it as character information.

The present invention has been described above with reference to a preferred embodiment. However, the present invention is not limited to this embodiment, and, as shown below, various improvements and design changes are possible without departing from the gist of the present invention.

That is, although the recognition system 1 of the present embodiment has been shown as recognizing the voice V of a single speaker S, it is not limited to this and may be directed to situations, such as a conference, in which a plurality of speakers S converse. In this case, if the pronunciation tendency of each speaker S is specified in advance and the speakers S can be identified, the selection dictionary fixing method described above, in which the pronunciation tendency dictionary X1 or the like is fixed, can be adopted. However, because many speakers S may speak at once, utterances may overlap and the likelihood of misrecognition increases. It is therefore considered preferable to adopt the selection dictionary variation method, in which the pronunciation tendency is specified for each utterance and the pronunciation tendency dictionary X1 or the like is changed accordingly.

Furthermore, although the recognition system 1 of the present embodiment has been shown as including processing for choosing the method of selecting the pronunciation tendency dictionary X1 or the like, it is not limited to this and may of course be restricted to either one of the methods. For example, when the target of speech recognition is fixed and no variation in the pronunciation tendency of the speaker S is expected, the selection dictionary fixing method may be adopted to perform quick and stable speech recognition.

1 Recognition system (speech recognition system)
2 Recognition computer
3 Voice input device
5 Output screen
6 Liquid crystal display
7 Computer main body
8 Voice information acquisition means
9 Pronunciation tendency specifying means
10 Dictionary group storage means
11 Dictionary selecting means
12 Standard collating means
13 Pronunciation tendency collating means
14 Vocabulary output means
15 Earliest specifying means
16 Fixed selection means
17 Sequential specifying means
18 Dictionary reselecting means
S Speaker
SD Standard dictionary
V Voice
X1, X2, X3 ... Xn Pronunciation tendency dictionaries

Claims

Voice information acquisition means for detecting the voice of the speaker and acquiring voice information;
Based on the acquired voice information, the pronunciation tendency specifying means for analyzing and specifying the speaker's pronunciation tendency including the height, frequency characteristics, accent, and pitch of the voice;
A dictionary group storage means for storing a standard dictionary constructed based on a standardized pronunciation tendency and a plurality of pronunciation tendency dictionaries each constructed based on a specific pronunciation tendency different from the standard pronunciation tendency;
A dictionary selection means for selecting one of the pronunciation tendency dictionaries that matches or resembles the pronunciation tendency of the specified voice from among the plurality of pronunciation tendency dictionaries stored;
Standard collating means for collating and recognizing vocabulary contained in the voice using the voice information and the standard dictionary;
Pronunciation tendency collating means for collating and recognizing, using the pronunciation tendency dictionary selected by the dictionary selecting means, the vocabulary determined to be unrecognized by the standard collating means;
A speech recognition system comprising: vocabulary output means for outputting the vocabulary related to the speech recognized by at least one of the standard collating means and the pronunciation tendency collating means.

The speech recognition system according to claim 1, wherein
the pronunciation tendency specifying means further includes earliest specifying means for specifying the pronunciation tendency based on the voice uttered by the speaker at the earliest timing, and
the dictionary selecting means further includes fixed selection means for fixing the pronunciation tendency dictionary corresponding to the speaker based on the pronunciation tendency specified by the earliest specifying means.

The speech recognition system according to claim 1, wherein
the pronunciation tendency specifying means further includes sequential specifying means for sequentially specifying the pronunciation tendency each time the speaker utters the voice, and
the dictionary selecting means further includes dictionary reselecting means for reselecting the pronunciation tendency dictionary based on the sequentially specified pronunciation tendency.

The speech recognition system according to any one of claims 1 to 3, wherein the voice information acquisition means acquires the voice information in a conversation format formed by mixing a plurality of the voices uttered by the speakers.