JP7363307B2

JP7363307B2 - Automatic learning device and method for recognition results in voice chatbot, computer program and recording medium

Info

Publication number: JP7363307B2
Application number: JP2019179539A
Authority: JP
Inventors: 正二朗森部
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2023-10-18
Anticipated expiration: 2039-09-30
Also published as: JP2021056392A; JP2023080132A

Description

The present invention relates to the technical field of devices and methods for automatically learning the recognition results of a speaker's utterances in an AI chatbot (chatbot or chatterbot) or voice response system, and of computer programs and recording media that cause a computer to function as such a device. The AI chatbot or voice response system includes an interactive speech recognition device, such as an AI (Artificial Intelligence) speaker or smart speaker, a speech recognition IVR (Interactive Voice Response) system, or a chat IF (interface), and automatically converses with the speaker through voice, or through voice and text.

In recent years, AI chatbots and voice response systems have been commercialized by companies in Japan and other countries, and various proposals have been made to improve their functionality. For example, Patent Document 1 proposes a device that determines whether the intention of a user's utterance is an inquiry. Patent Document 2 proposes an information processing device that expands the knowledge needed to answer questions that a dialogue system could not answer. Patent Document 3 proposes a voice inquiry system that uses knowledge already accumulated in AI chatbot servers.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2019-114141
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2019-061482
Patent Document 3: Japanese Patent No. 6555838

In an AI chatbot or voice response system, it is desirable to improve the accuracy of speech recognition efficiently, or to improve that accuracy with little manual work and data processing.

However, Patent Document 1, for example, estimates a dialogue act indicating the intention of the user's utterance and the subject of that utterance, and determines that the utterance is an inquiry when the estimated dialogue act indicates an inquiry and the estimated subject is an inquiry target that can be answered. Patent Document 2 analyzes the cause of a failure, generates and outputs a question from the dialogue log data according to that cause, and adds the answer to the question to the dialogue data as new knowledge. In Patent Document 3, many chatbot server devices each perform learning processing in a distributed manner, and the learning models already accumulated in those server devices are used to answer spoken questions acquired by individual AI speakers.

Accordingly, both the systems of this background art and existing products require training data for the acoustic model, the language model, and so on in order to raise the accuracy of speech recognition. Moreover, training data such as the reading that corresponds to a sound and the written form that corresponds to a reading currently requires a person to supply the correct answer, which is very laborious. Nor does blindly adding training data necessarily improve accuracy. In addition, an increase in registered data inflates the amount of data to be handled and increases the data processing load.

Thus, owing to the nature of speech recognition, it is technically difficult to improve its accuracy efficiently, or to improve it with less manual work and less data processing while reducing the burden on the people involved, such as users (in other words, the speakers who use the system) and the workers who supply teacher data for the acoustic model, language model, and so on (in other words, the workers who supply correct-answer data).

The present invention has been made in view of, for example, the technical problems described above. Its object is to provide an automatic learning device and method for recognition results in an AI chatbot that can efficiently improve the accuracy of speech recognition, as well as a computer program and a recording medium that cause a computer to function as such a device.

To solve the above problem, one aspect of the automatic learning device for recognition results in an AI chatbot according to the present invention includes: a repeating unit that repeats back the AI chatbot's recognition result for an utterance from a speaker; a determination unit that determines whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and a learning unit that, when the determination by the determination unit becomes incorrect and then correct during the dialogue between the speaker and the AI chatbot about the utterance, extracts learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct.

To solve the above problem, one aspect of the automatic learning method for recognition results in an AI chatbot according to the present invention includes: a repeating step of repeating back the AI chatbot's recognition result for an utterance from a speaker; a determination step of determining whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and a learning step of, when the determination in the determination step becomes incorrect and then correct during the dialogue between the speaker and the AI chatbot about the utterance, extracting learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct.

One aspect of the computer program according to the present invention causes a computer to execute one aspect of the above-described automatic learning method for recognition results in an AI chatbot.

One aspect of the recording medium of the present invention is a recording medium on which one aspect of the above-described computer program is recorded.

According to one aspect of each of the above-described automatic learning device and method for recognition results in an AI chatbot, and of the computer program and recording medium, the accuracy of speech recognition can be improved automatically and efficiently without burdening the user. More specifically, for example, learning can proceed automatically by prompting the registration of correct-answer data without burdening the user or speaker, recognition accuracy for similar questions improves, and the more the system learns, the fewer exchanges with the AI chatbot are needed. Furthermore, because conversion based on similar-sounding words can also have adverse effects, performing the automatic learning within a limited range of scenarios (in other words, building or separating the learning data for each scenario) makes the above effects more pronounced in each scenario.
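The per-scenario separation of learning data mentioned above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the patented implementation: the class name `ScenarioLearningStore`, the scenario labels, and the romanized sample words (hotaru, "firefly", misheard for hoteru, "hotel") are all hypothetical.

```python
from collections import defaultdict


class ScenarioLearningStore:
    """Keeps similar-sounding word pairs separately for each dialogue scenario.

    Hypothetical helper for illustration; the source does not define it.
    """

    def __init__(self):
        # scenario name -> {misrecognized word -> intended word}
        self._pairs = defaultdict(dict)

    def register(self, scenario, heard, intended):
        self._pairs[scenario][heard] = intended

    def correct(self, scenario, word):
        # Only pairs learned in the same scenario are applied, so a pair
        # learned for hotel booking cannot rewrite words in other scenarios.
        return self._pairs.get(scenario, {}).get(word, word)


store = ScenarioLearningStore()
store.register("hotel_booking", "hotaru", "hoteru")
print(store.correct("hotel_booking", "hotaru"))  # hoteru
print(store.correct("nature_guide", "hotaru"))   # hotaru
```

Keeping one mapping per scenario is one simple way to limit the side effects of similar-sounding-word conversion; other partitioning schemes would serve the same purpose.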

These operations and effects of the present invention will become clearer from the embodiments described below.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the overall configuration of a voice response system according to the first embodiment, as an example of the "AI chatbot" according to the present invention, which includes an automatic learning device for recognition results in that voice response system.
FIG. 2 is a flowchart showing the processing flow of automatic learning by repetition according to the first embodiment.
FIG. 3 is a schematic conceptual diagram showing a usage image of automatic learning by repetition according to the first embodiment.
FIG. 4 is a schematic conceptual diagram showing the speech recognition mechanism according to the first embodiment.
FIG. 5 is a schematic conceptual diagram showing a "learning data creation method" in one comparative example of the first embodiment.
FIG. 6 is a schematic conceptual diagram showing a "learning data creation method" in another comparative example of the first embodiment.
FIG. 7 is a schematic conceptual diagram showing the process of correcting a speech recognition result using similar-sounding words according to the first embodiment.
FIG. 8 is a schematic conceptual diagram (part 1) showing a usage image of determining similar-sounding words using the characteristics of an utterance according to the first embodiment.
FIG. 9 is a schematic conceptual diagram (part 2) showing a usage image of determining similar-sounding words using the characteristics of an utterance according to the first embodiment.
FIG. 10 is a schematic conceptual diagram showing a usage image after the learning data has been reflected according to the first embodiment.
FIG. 11 is a flowchart showing the processing flow of automatic learning based on evaluation of answers according to the second embodiment.
FIG. 12 is a schematic conceptual diagram showing a usage image of automatic learning based on evaluation of answers according to the second embodiment.

<First embodiment>
The first embodiment will be described with reference to FIGS. 1 to 10. First, the overall configuration of the first embodiment is described with reference to FIG. 1. FIG. 1 schematically shows the overall configuration of a voice response system 1 that has an automatic learning device 100 for recognition results in speech recognition according to the first embodiment. In this embodiment, the voice response system 1 is an example of the "AI chatbot" according to the present invention.

As shown in FIG. 1, the voice response system 1 includes an AI chatbot (QA search) unit 10, a voice capture device 11, a speech recognition device 12, a storage device containing a word dictionary DB 400, an acoustic model DB 401, a language model DB 402, a conversion rule DB 403, and a learning data DB 404, and the automatic learning device 100. The word dictionary DB 400 stores a word dictionary, which maps the written form of each word to its reading; an example is the word dictionary 400a in FIG. 4. The acoustic model DB 401 stores an acoustic model, which maps phonemes to readings; an example is the acoustic model 401a in FIG. 4. The language model DB 402 stores a language model such as a hidden Markov model; an example is the language model 402a in FIG. 4. The conversion rule DB 403 stores conversion rules, which map conversion-source words to conversion-destination words; an example is the conversion rule 403a in FIG. 7. The learning data DB 404 stores learning data; examples are the learning data 404a in FIG. 1 and the learning data 404b or 404c in FIG. 3. The automatic learning device 100 includes a transmission unit 101, a voice response control unit 102, and an automatic learning unit 103.
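As a rough picture of the storage device described above, the five databases can be modeled as plain in-memory containers. This is a minimal sketch, assuming simple dictionary and list representations; the names `Storage` and `LearningRecord`, the field layout, and the sample entries are hypothetical, and a real language model would not be a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LearningRecord:
    """One learning data entry (hypothetical layout)."""
    kind: str    # e.g. "similar_sound" or "co_occurrence"
    word1: str
    word2: str


@dataclass
class Storage:
    """In-memory stand-in for the storage device holding DB 400 to DB 404."""
    word_dictionary: dict = field(default_factory=dict)   # written form -> reading (DB 400)
    acoustic_model: dict = field(default_factory=dict)    # phonemes -> reading (DB 401)
    language_model: dict = field(default_factory=dict)    # simplified stand-in (DB 402)
    conversion_rules: dict = field(default_factory=dict)  # source word -> destination word (DB 403)
    learning_data: list = field(default_factory=list)     # LearningRecord entries (DB 404)


storage = Storage()
storage.word_dictionary["ホテル"] = "hoteru"
storage.conversion_rules["hotaru"] = "hoteru"
storage.learning_data.append(LearningRecord("similar_sound", "hotaru", "hoteru"))
print(len(storage.learning_data))  # 1
```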

The voice capture device 11 captures the voice that the user 20 (that is, the user of the system, or the speaker) utters into the microphone of a terminal 21, and does so via a communication network such as the Internet. The terminal 21 is, for example, a smartphone, a PC (Personal Computer), a tablet, a mobile phone, or a wearable device such as a wristwatch-type or eyeglass-type device. The voice capture device 11 is not limited to the microphone of the terminal 21; any device that can capture the utterances of the user 20 may be used, such as a device connected to the terminal 21 by wire or wirelessly. The voice capture device 11 outputs the audio data of the captured voice to the speech recognition device 12.

The speech recognition device 12 recognizes the voice captured by the voice capture device 11 based on the acoustic model 401 and the language model 402, which are automatically learned by the automatic learning device 100 and stored in the storage device, as detailed later. The speech recognition mechanism is also detailed later (see FIG. 4 and elsewhere). The speech recognition device 12 outputs the resulting speech recognition result to the automatic learning device 100.

The AI chatbot (QA search) unit 10 is configured to search the QA knowledge 10a, which stores various kinds of knowledge data, for the QA or Q&A (Question and Answer, that is, the answer to a question) corresponding to the question indicated by the speech recognition result. More specifically, when the automatic learning device 100 has repeated back the speech recognition result and the recognition result for the user's voice has been determined to be correct, the AI chatbot unit 10 returns an answer to the user's question. The AI chatbot unit 10 is also configured to perform a QA search when the answer to the user's question is wrong or there is no answer.

In the automatic learning device 100, under the control of the voice response control unit 102, the transmission unit 101 first repeats back the recognition result from the speech recognition device 12 to the terminal 21 of the user 20 as synthesized speech, or as a combination of synthesized speech and text (hereinafter simply "synthesized speech, etc." where appropriate). The transmission unit 101, together with the voice response control unit 102, constitutes an example of the "repeating unit" according to the present invention. Specifically, the voice response control unit 102 first accepts the speech recognition result as input and, based on it, generates synthesized speech to be output to the user 20. The transmission unit 101 transmits the generated synthesized speech data, or the text data of the synthesized speech, to the terminal 21. The voice response control unit 102 also outputs the input speech recognition result to the AI chatbot (QA search) unit 10.

In this embodiment, "repetition" may mean returning the recognition result as it is, as synthesized speech (hereinafter also "repetition synthesized speech") or text (hereinafter also "repetition text"), that is, repeating it literally. It may also mean returning repetition synthesized speech, etc. or repetition text whose content confirms the recognition result, adds a suitable preface or suffix to it, or summarizes it, that is, repeating the same content in different words. For example, for a recognition result of "○○", the system may return (that is, repeat back) an utterance such as "Is ○○ correct?". If the user 20 says "I want to call a taxi." to the terminal 21, the terminal 21 outputs the repetition synthesized speech or repetition text "Is 'I want to call a taxi.' correct?". Furthermore, the recognition result may be returned (that is, repeated back) with its content replaced by synonyms, near-synonyms, or sentences of similar meaning that are registered in the word dictionary or the like as more general expressions with the same meaning.
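The confirmation-style repetition described above amounts to wrapping the recognition result in a fixed question. The following is a minimal sketch, assuming an English prompt template and an optional lookup table of registered paraphrases; the function name `build_repeat_prompt` and the `paraphrases` parameter are hypothetical.

```python
def build_repeat_prompt(recognized, paraphrases=None):
    """Return a confirmation prompt that repeats the recognition result.

    `paraphrases` is an optional, hypothetical mapping from a recognized
    expression to a more general registered expression of the same meaning.
    """
    text = recognized
    if paraphrases:
        # Substitute the registered expression when one exists.
        text = paraphrases.get(recognized, recognized)
    return f"Is '{text}' correct?"


print(build_repeat_prompt("I want to call a taxi."))
# Is 'I want to call a taxi.' correct?
```

The resulting string would then be passed to speech synthesis, sent as text, or both.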

Furthermore, in the automatic learning device 100, under the control of the voice response control unit 102, the transmission unit 101 is configured to transmit to the terminal 21 of the user 20 the synthesized speech or text for the search result that the AI chatbot (QA search) unit 10 obtained by performing a QA search on the recognition result from the speech recognition device 12.

In this embodiment in particular, as detailed later, the voice response control unit 102, which constitutes an example of the "determination unit" according to the present invention, determines during this series of dialogues whether each recognition result is correct or incorrect. When the first or an earlier recognition result from the speech recognition device 12 is "incorrect" and a later recognition result is "correct", the automatic learning unit 103, which constitutes an example of the "learning unit" according to the present invention, extracts the difference between the incorrect and correct recognition results as automatic learning data 404a and registers it in the word dictionary.

In other words, in this embodiment, when the correctness determination performed by the voice response control unit 102 finds the recognition result "correct" from the start (that is, there is no incorrect recognition result) or "incorrect" to the end (that is, the correct answer is never found), no such difference exists, so the process of extracting a difference as automatic learning data 404a is not performed. Apart from this difference extraction, however, recognition results that were "correct" from the start or "incorrect" to the end may optionally be used for other statistical data processing, or accumulated as data and used later for some form of data analysis.
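The three cases above (correct from the start, incorrect to the end, and incorrect then correct) can be separated with a single pass over the dialogue. This is a minimal sketch, assuming each turn has already been judged and is given as a (recognition result, judged correct) pair; the function name `wrong_right_pairs` and the romanized sample sentences are illustrative only.

```python
def wrong_right_pairs(turns):
    """Pair each misrecognition with the first later correct result.

    turns: list of (recognition_result, judged_correct) in dialogue order.
    Returns an empty list when the result was correct from the start or
    was never judged correct, since no difference exists in those cases.
    """
    wrong = []
    for result, judged_correct in turns:
        if judged_correct:
            return [(w, result) for w in wrong]
        wrong.append(result)
    return []  # never judged correct: the right answer is unknown


dialogue = [
    ("hotaru wo iyaku shitai", False),
    ("otaru wo yoyaku shitai", False),
    ("hoteru wo yoyaku shitai", True),
]
for pair in wrong_right_pairs(dialogue):
    print(pair)
```

Each returned pair is a candidate for difference extraction; the same function can be run live or over a recorded dialogue log.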

The "difference" in this embodiment may be a difference in written form or sentence structure between the correct and incorrect recognition results, for example a difference in words. However, the "difference" is not limited to this. In addition to or instead of this, it may be a difference in the characteristics of the utterance, such as a difference in the loudness of the voice, a difference in its tempo, or a difference in the speaker's emotion as estimated from the voice or the content of the utterance.

As described above, according to the present invention, the automatic learning data 404a can in every case be registered in the word dictionary by machine learning or AI learning while markedly reducing the burden on personnel such as the workers who supply teacher data (see FIGS. 5 and 6), as detailed later. Moreover, this automatic learning based on the correctness determination in the voice response system 1 can be performed in real time during the dialogue, or afterwards by referring to the recorded log of the dialogue between the user 20 and the voice response system 1.

In this embodiment, for example, as illustrated in FIG. 1, the automatic learning unit 103 registers, through the dialogue between the user 20 and the voice response system 1, automatic learning data 404a whose type is "similar-sounding word" and whose mutually corresponding "word 1" and "word 2" are hotaru (ホタル, "firefly") and hoteru (ホテル, "hotel"), each extracted as a "difference" as described above. This automatic learning data is based on the difference between the recognition result hotaru wo iyaku shitai ("I want to freely translate firefly"), which was determined to be incorrect (in other words, incorrect-answer data), and hoteru wo yoyaku shitai ("I want to reserve a hotel"), which was later determined to be correct (in other words, correct-answer data). The unit further registers automatic learning data 404a of type "similar-sounding word" with Otaru (小樽, a place name) and hoteru as "word 1" and "word 2"; this is based on the difference between the recognition result Otaru wo yoyaku shitai ("I want to reserve Otaru"), determined to be incorrect, and hoteru wo yoyaku shitai, later determined to be correct. Likewise, it registers automatic learning data 404a of type "similar-sounding word" with iyaku (意訳, "free translation") and yoyaku (予約, "reservation") as "word 1" and "word 2". It is also configured to register automatic learning data 404a of type "co-occurrence" (that is, a combination or pair that is likely to occur together in the same dialogue) with hoteru and yoyaku as "word 1" and "word 2".
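The word-level difference extraction in this example can be sketched with a standard sequence diff. The code below is an illustration under stated assumptions, not the method of the source: it assumes the utterances have already been split into words (given here in romanized form, since Japanese text would first need a morphological analyzer), it uses Python's `difflib.SequenceMatcher` as the diff, and it treats corrected words that appear together in the correct utterance as co-occurring, which is a guess at how the "co-occurrence" entry might be derived. The function name `extract_learning_data` and the record layout are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def extract_learning_data(wrong_tokens, right_tokens):
    """Return (kind, word1, word2) records from a wrong/right token pair."""
    records = []
    corrected = []
    matcher = SequenceMatcher(a=wrong_tokens, b=right_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only one-for-one substitutions are treated as similar-sounding pairs.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for heard, intended in zip(wrong_tokens[i1:i2], right_tokens[j1:j2]):
                records.append(("similar_sound", heard, intended))
                corrected.append(intended)
    # Assumption: corrected words in the same correct utterance co-occur.
    for word1, word2 in combinations(corrected, 2):
        records.append(("co_occurrence", word1, word2))
    return records


wrong = ["hotaru", "wo", "iyaku", "shitai"]
right = ["hoteru", "wo", "yoyaku", "shitai"]
for record in extract_learning_data(wrong, right):
    print(record)
# ('similar_sound', 'hotaru', 'hoteru')
# ('similar_sound', 'iyaku', 'yoyaku')
# ('co_occurrence', 'hoteru', 'yoyaku')
```

Substitutions of unequal length are skipped here for simplicity; a fuller treatment would need alignment at the reading or phoneme level.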

More specifically, as illustrated in FIG. 1, when the user 20 says (I) "I would like to reserve a hotel." to the terminal 21, the voice capture device 11 captures this voice. Then, under the control of the voice response control unit 102, the transmission unit 101 repeats back the recognition result (I) "I want to freely translate firefly", which the speech recognition device 12 produced using the acoustic model 401 and the language model 402. In this example, the repetition (I) "Is 'I want to freely translate firefly' correct?" is sent as synthesized speech data, or as text data of the synthesized speech, and is repeated from the terminal 21 to the user 20, the very person who made the utterance underlying the recognition result.

In response, the user 20 makes the utterance (II) "No. I would like to reserve a hotel.", which means that the most recent recognition result is incorrect (NG). In response to this, after the most recent recognition result has been determined to be incorrect (NG), an utterance such as (II) "'There was no answer.' Once more..." may be made to the user 20 by synthesized speech, etc. Further exchanges follow, such as the utterance (III) "I want to reserve a hotel!" by the user 20 (spoken with irritation at the wrong answer), from which the speech recognition device 12 side learns that the most recent recognition result was incorrect (NG). A wide variety of exchanges take place in this way, depending on the thinking of the user 20 and on the results of the voice response control unit 102's correctness determination for the recognition results of the speech recognition device 12, until the voice response system 1 finally utters the correct (OK) recognition result (III) "A hotel reservation, then. The reservation date and number of guests...".
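One simple way to turn the speaker's reaction into a correct/incorrect judgment is to look for denial or confirmation phrases in the reply. The sketch below is an assumption-laden illustration: the source does not specify how the reaction is analyzed, and the cue lists, the function name `judge_reaction`, and the three-way return value are all hypothetical. The cue phrases are Japanese, as in the original dialogue (for example 違います, "that is wrong").

```python
# Hypothetical cue phrases; a deployed system would need a far richer model.
DENIAL_CUES = ("違います", "ちがいます", "いいえ")
CONFIRM_CUES = ("はい", "そうです", "お願いします")


def judge_reaction(reply):
    """Return False for a denial, True for a confirmation, None if unclear."""
    # Denials are checked first so that a reply containing both kinds of
    # cue is treated as a rejection of the repeated recognition result.
    if any(cue in reply for cue in DENIAL_CUES):
        return False
    if any(cue in reply for cue in CONFIRM_CUES):
        return True
    return None


print(judge_reaction("違います。ホテルを予約したいです。"))  # False
print(judge_reaction("はい、そうです。"))                    # True
print(judge_reaction("ホテルを予約したいです！"))            # None
```

An unclear reply (None) leaves room for other signals, such as the tone of the voice or the reaction to the QA search result, to decide the judgment.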

When a series of dialogues about a hotel reservation takes place between the user 20 and the voice response system 1 in this way, the voice response system 1 finally arrives at the correct (OK) recognition result "I want to reserve a hotel", and the AI chatbot (QA search) unit 10 uses the QA knowledge 10a to perform a QA search for the question indicated by the correct recognition result. In cooperation with the QA search, the voice response control unit 102 can then determine, for example, that the recognition result "I want to reserve Otaru" is incorrect (NG) and that the recognition result "I want to reserve a hotel" is correct (OK). That is, according to this embodiment, the correctness determination of a recognition result can be performed at least in part based on the reaction of the user 20 to the QA search result (for example, pressing a "Like" button), which makes it possible to improve the recognition function efficiently.

The automatic learning unit 103 of the automatic learning device 100 is configured so that it can extract the difference between the recognition results described above and register it as automatic learning data 404a before, after, or in parallel with the execution of such a QA search, or afterwards using the recorded log.

For convenience of explanation, FIG. 1 shows the voice response system 1, which includes the automatic learning device 100 and other components, with each device and each unit as a separate body. However, as long as the system provides the voice capture function, voice recognition function, voice response control function, transmission function, automatic learning function, QA search function of the AI chatbot and so on, and can be accommodated in the same communication network as the terminal 21, it may be implemented flexibly in various forms, in hardware or in software, from one or more computers or terminal devices, one or more server devices, one or more databases or storage devices, and the like. At least one of the functions described above can be executed in the cloud. The system may also be realized by a dedicated computer program that causes a general-purpose computer to perform the automatic learning method according to the present embodiment, or by loading such a program into a computer from a storage medium that stores it, either directly or after downloading.

Next, the operation of the first embodiment, which has the configuration shown in FIG. 1, is described in detail with reference to FIGS. 2 to 10.

In FIG. 2, the voice recognition device 12 performs voice recognition on the speech that the user 20 inputs at the terminal 21 and that arrives via the communication network and the voice capture device 11, and obtains a recognition result such as "I want to freely translate a hotel" (ホテルを意訳したい) (step S10). Next, in the automatic learning device 100, the response 'Is "I want to freely translate a hotel" correct?' is generated under the control of the voice response control unit 102. The transmission unit 101 then transmits the generated response to the terminal 21, so that the recognition result is repeated back to the user 20 as synthesized speech (or as text, or as both synthesized speech and text) (step S11).

Next, the voice recognition device 12 again performs voice recognition on speech input by the user 20 at the terminal 21 and received via the communication network and the voice capture device 11, and obtains a recognition result such as "Yes" or "No" (step S12).

The voice response control unit 102 then determines whether the recognition result is correct (step S13). If the recognition result is determined to be incorrect in step S13 (step S13: "No"), the voice response control unit 102 generates the response "Please say that again", which prompts the user 20 to restate the utterance, and the transmission unit 101 transmits the generated response to the terminal 21 (step S14). The process then returns to step S10, and the subsequent steps are repeated (steps S10 to S13). In other words, the voice response system 1 uses the response to prompt the user 20 to restate the earlier utterance within the same series of dialogues.

If, on the other hand, the recognition result is determined to be correct in step S13 (step S13: "Yes"), the automatic learning unit 103 determines whether the user 20 restated the utterance, that is, whether step S13 produced "No" at least once earlier in the same series of dialogues (the recognition result was determined to be incorrect). In other words, it is determined whether, within the series of dialogues between the user 20 and the voice response system 1, the voice response control unit 102 made a "correct" determination after an "incorrect" determination, so that there is learning data that can be extracted as the difference between the recognition result determined to be incorrect and the recognition result determined to be correct (step S15).

If it is determined in step S15 that there was no restatement (step S15: NO), there is no learning data to extract, and the series of processes ends. If it is determined in step S15 that there was a restatement (step S15: YES), learning data to extract exists (that is, within the series of dialogues the determination became "correct" after having been "incorrect"), so the automatic learning unit 103 extracts the difference as learning data (step S16), and the series of processes ends.
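The flow of steps S10 to S16 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `find_learning_pair` and the representation of a dialogue as a list of `(recognition result, user confirmed?)` pairs are hypothetical and were chosen only for this example.

```python
from typing import Iterable, Optional, Tuple


def find_learning_pair(
    turns: Iterable[Tuple[str, bool]],
) -> Optional[Tuple[str, str]]:
    """Return (rejected result, confirmed result), or None if nothing to learn.

    Each turn is (recognition result from S10, user's answer from S12/S13),
    where True stands for "Yes" and False for "No".
    """
    last_rejected: Optional[str] = None
    for recognized, confirmed in turns:
        if not confirmed:
            # S13 "No": remember the rejected result; S14 would ask for a retry.
            last_rejected = recognized
            continue
        # S13 "Yes": S15 checks whether a rejection preceded this confirmation.
        if last_rejected is None:
            return None  # S15 "NO": nothing to extract
        return (last_rejected, recognized)  # S15 "YES": input for S16
    return None  # the dialogue ended without a confirmed result


pair = find_learning_pair(
    [("ホテルを意訳したい", False), ("ホテルを予約したい", True)]
)
print(pair)
```

The sketch keeps only the most recent rejected result; a system that wanted to learn from every rejected attempt in a dialogue would collect them in a list instead.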

Next, with reference to FIG. 3, a usage example of the automatic learning method, which learns automatically through the repeat-back described above, is explained with a concrete example. In FIG. 3, the dialogues C10A to C14A, corresponding to steps S10 to S14 of FIG. 2, take place between the user 20 and the voice response system 1 in order from top to bottom in the figure. The determination of whether learning data exists (step S15 in FIG. 2) and, if it does, the extraction of the learning data (step S16 in FIG. 2) may be executed in real time, or afterwards from a recorded log of the processing in steps S10 to S14 of FIG. 2.

In FIG. 3, the system first accepts the voice input C10A, "I want to reserve a hotel." (ホテルを予約したい。), from the user 20 via the terminal 21. As an example, assume here that the word "reserve" (予約) was not spoken clearly by the user 20, or could not be captured clearly because of noise or the like.

In response, the voice response system 1 performs the processing described with FIG. 2 (mainly steps S10 and S11) and sends the user 20 the dialogue C11A: 'Is "I want to freely translate a hotel" (ホテルを意訳したい) correct?' The dialogue C11A may be delivered not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech. In that case, the voice response system 1 transmits the message to the terminal 21 of the user 20 (for example, through an application such as SMS or LINE).

In response, the user 20 replies via the terminal 21 with the dialogue C12A, "No." That is, the dialogue C12A tells the voice response system 1 that the dialogue C11A is wrong (incorrect-answer data). The voice response system 1 then performs the processing described with FIG. 2 (mainly steps S12, S13 and S14) and delivers the dialogue C14A, "Please say that again.", to the user 20. The dialogue C14A may be sent not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech.

In response, the user 20 makes the dialogue C10B, "I want to reserve a hotel" (ホテルを予約したい), via the terminal 21. As an example, assume that this time the word "reserve" (予約) was spoken clearly by the user 20, or was captured clearly without interference from noise or the like. The voice response system 1 then performs the processing described with FIG. 2 (mainly steps S10 and S11) and delivers the dialogue C11B, 'Is "I want to reserve a hotel" correct?', to the user 20. The dialogue C11B may be sent not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech.

In response, the user 20 replies via the terminal 21 with the dialogue C12B, "Yes." That is, the dialogue C12B tells the voice response system 1 that the dialogue C11B is correct (correct-answer data). The voice response system 1 then performs the processing described with FIG. 2 (mainly steps S12 and S13) and delivers the confirming dialogue C13B, "Understood.", to the user 20. The dialogue C13B may be sent not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech.

In the above series of dialogues, the dialogue C11A is incorrect (incorrect-answer data) and the dialogue C11B is correct (correct-answer data), so the voice response system 1 executes the learning data extraction process (step S16 in FIG. 2).

More specifically, the automatic learning unit 103 of the voice response system 1 extracts the difference between the incorrect-answer data in the dialogue C11A and the correct-answer data in the dialogue C11B. To do so, it breaks the automatic learning data 404c, which corresponds to the whole utterance, into its constituent words (words 1 to 4) and takes the words that make up the differing part (the part whose "judgment" is "×" in FIG. 3), namely "free translation" (意訳) and "reservation" (予約), as the automatic learning data 404b.

The automatic learning unit 103 of the voice response system 1 then registers the pair extracted for word 3, "free translation" (意訳) and "reservation" (予約), in the conversion rule DB 403 as a conversion rule 403b corresponding to the automatic learning data 404b, with the two words typed as "similar-sounding words" of each other (in the form: "reservation" is the correct answer and "free translation" is the incorrect answer). In parallel with, or before or after, the registration of the conversion rule 403b, the conversion rule 403a corresponding to the automatic learning data 404a (see FIG. 1) is registered in the conversion rule DB 403. The registration of the automatic learning data 404b based on such a difference may be performed in real time without delay, or afterwards using a recorded log.
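The word-level difference extraction and rule registration described above can be illustrated with Python's standard `difflib` module. This is a hedged sketch rather than the method of the patent: it assumes the two recognition results have already been split into words (Japanese text would need a morphological analyzer for that step), and the function names and the dictionary used as a stand-in for the conversion rule DB 403 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple


def extract_word_differences(
    rejected: List[str], confirmed: List[str]
) -> List[Tuple[str, str]]:
    """Return (wrong, correct) pairs where the two word sequences differ."""
    matcher = SequenceMatcher(a=rejected, b=confirmed, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # a run of words was substituted by another run
            pairs.append(
                ("".join(rejected[i1:i2]), "".join(confirmed[j1:j2]))
            )
    return pairs


def register_rules(
    rule_db: Dict[str, Set[str]], pairs: List[Tuple[str, str]]
) -> None:
    """Store each wrong form under its correct form (stand-in for DB 403)."""
    for wrong, correct in pairs:
        rule_db.setdefault(correct, set()).add(wrong)


# Words 1 to 4 of the two recognition results from FIG. 3 (pre-segmented).
rejected = ["ホテル", "を", "意訳", "したい"]
confirmed = ["ホテル", "を", "予約", "したい"]

rule_db: Dict[str, Set[str]] = {}
register_rules(rule_db, extract_word_differences(rejected, confirmed))
print(rule_db)  # {'予約': {'意訳'}}
```

Only `replace` opcodes are turned into rules here; pure insertions or deletions are ignored because they have no counterpart word to convert to.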

Next, with reference to FIG. 4, the voice recognition mechanism of the voice recognition device 12 (see FIG. 1) in the voice response system 1 described above, that is, the processing method used for the voice recognition processing shown in FIG. 2 (step S10 or S12), is explained with a concrete example.

In FIG. 4, the voice recognition device 12 first converts the sound waveform 301 captured by the voice capture device 11 (see FIG. 1) into a reading 305 (in this example, さーびすおもーしこみたいのですが) by referring to the acoustic model DB 401, which contains a large number of acoustic models 401a, that is, data associating phonemes with readings (STEP 1, shown in the upper part of FIG. 4).

Next, the voice recognition device 12 converts the reading 305 into the kana-kanji notation 306 (in this example, サービスを申し込みたいのですが, "I would like to apply for the service") by referring to the word dictionary 400, which contains a large number of entries 400a pairing notations with readings, and to the language model DB 402, which contains a large number of language models 402a in which user utterances are broken into words and associated with kana and kanji (STEP 2, shown in the lower part of FIG. 4).

As shown in FIG. 4, voice recognition in this example of the embodiment (for example, steps S10 and S12 in FIG. 2) is performed by converting the sound waveform into a reading and then converting the reading into kana-kanji notation.

Next, with reference to FIGS. 5 and 6, the notable effects and major advantages of the learning data extraction method of step S16 in FIG. 2, in other words the learning data creation method, are explained.

In the comparative example illustrated in FIG. 5, learning data is created by transcribing from speech recognition results: a worker 30 (a worker who supplies training data for the acoustic model, the language model and so on, or who supplies correct-answer data) manually converts 50 to 100 hours of acoustic data 301 into text. For example, after a dialogue in which the telephone operator (OP) says "Thank you for calling.", the user or customer (CU) says "My computer is broken.", the operator (OP) says "What are the symptoms?", and the user or customer (CU) says "It won't power on.", the worker 30 manually creates the text speech recognition result 306 from the acoustic data 301 containing that dialogue. In the process, unknown words in the text are registered in the word dictionary 400, and the language model 402 is built by adding word occurrence rules.

In this comparative example, the manual transcription that the worker 30 must perform takes roughly 10 times as long as the calls themselves. In this example, that amounts to 500 to 1000 hours of manual labor (manual transcription).

In the comparative example illustrated in FIG. 6, learning data is created by transcribing from recognition results: the worker 30 manually finds and corrects the misrecognized portions. For example, consider a text speech recognition result 306a that contains wrongly transcribed parts, in which the operator (OP) says "Thank you for calling.", the user or customer (CU) says "The computer was begged for." (パソコンが乞われました。), the operator (OP) says "What kind of certificate is it?" (どのような賞状ですか？), and the user or customer (CU) says "It won't power on." The worker manually identifies "be begged for" (乞われる) and "certificate" (賞状) as the misrecognized portions, and a correctly transcribed speech recognition result 306b is created. The worker also manually registers occurrence rules and unknown words stating that 乞われる and 賞状 should correctly be "break" (壊れる) and "symptom" (症状). Unknown words in the text (for example, 症状) are then registered in the word dictionary 400, and the language model 402 is built by adding word occurrence rules.

In this comparative example as well, the manual work required of the worker 30 is long, time-consuming and stressful.

As is clear from the comparison with the comparative examples in FIGS. 5 and 6, the present embodiment described above (see FIGS. 1 to 4) automatically extracts the difference between incorrect-answer data and correct-answer data within a series of dialogues and automatically registers it as learning data, without the grueling manual work of having a worker transcribe from recognition results. This is markedly advantageous for growing the automatic learning data efficiently and accurately without manual effort. The embodiment is especially advantageous in that the work of FIG. 6, transcribing from recognition results, can be carried out very efficiently through automatic learning.

Next, with reference to FIG. 7, the correction of speech recognition results using similar-sounding words in the present embodiment is explained. In FIG. 7, the upper part shows an example of the voice recognition processing performed by the voice recognition device 12 (see FIG. 1) when similar-sounding words are present, and the lower part shows an example of the correction processing performed by the recognition result correction device 12c in that case. The correction uses a scheme that automates the generation of conversion rules, and it makes correct correction possible by preparing dedicated conversion rules for a limited scope of use.

For convenience of explanation, FIG. 7 shows the recognition result correction device 12c separately from the voice recognition device 12, but in practice it suffices for the voice recognition function of the voice recognition device 12 to include the recognition result correction function. That is, in hardware terms, the recognition result correction device 12c may be included in the voice recognition device 12. Likewise, the conversion rules 403, which the recognition result correction device 12c references and registers and which are stored in a storage device, may be built as part of the word dictionary 400, the acoustic model 401 and the language model 402.

In FIG. 7, as shown in the upper part, when voice data is passed in, the voice recognition device 12 performs voice recognition by referring to the acoustic model 401, the language model 402 and the word dictionary 400. The more general-purpose a voice recognition device is, the more prone it is to misrecognizing similar-sounding words and homophones. Assume here that the text speech recognition result 306a output as the recognition result contains an error: "What kind of certificate is it?" (どのような賞状ですか？).

Then, as shown in the lower part of FIG. 7, when the erroneous text speech recognition result 306a, "What kind of certificate is it?" (どのような賞状ですか？), is passed in, the recognition result correction device 12c corrects the recognition result by referring to the conversion rules 403. Here in particular, correct correction is achieved by preparing dedicated conversion rules for a limited scope of use. As the correction result, the corrected text speech recognition result 306b, "What are the symptoms?" (どのような症状ですか？), which is correct, is output. Such dedicated conversion rules are prepared for a scope of use limited to a scenario such as "hospital", "medical care" or "overseas travel". As dedicated conversion rules, various data 403a defining the rules are prepared: for example, for the conversion destination 予約 (yoyaku, "reservation"), conversion source 1 is 与薬 (yoyaku, "administering medication"), conversion source 2 is 意訳 (iyaku, "free translation"), conversion source 3 is 要約 (youyaku, "summary"), and so on. Likewise, for the conversion destination 症状 (shoujou, "symptom"), conversion rules 403a are prepared in which conversion source 1 is 賞状 (shoujou, "certificate"), conversion source 2 is 少々 (shoushou, "a little"), and so on.

By adopting the scheme of FIG. 7, which automates the generation of dedicated conversion rules for correction, and by preparing conversion rules for a limited scope of use, recognition results can be corrected accurately and with relative ease.
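The scenario-limited correction of FIG. 7 can be sketched as a per-scenario lookup table. The conversion sources and destinations below are the ones listed in the description; everything else is an assumption made for illustration, including the scenario keys `"travel"` and `"medical"`, the assignment of each rule to a scenario, the function names, and the premise that the recognition result is already split into words.

```python
from typing import Dict, List

# Conversion destination -> conversion sources, grouped by a hypothetical
# scenario key. The description does not say which rule belongs to which
# scenario; this grouping is illustrative only.
SCENARIO_RULES: Dict[str, Dict[str, List[str]]] = {
    "travel": {"予約": ["与薬", "意訳", "要約"]},
    "medical": {"症状": ["賞状", "少々"]},
}


def build_lookup(rules: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert destination -> sources into source -> destination."""
    return {
        source: destination
        for destination, sources in rules.items()
        for source in sources
    }


def correct_words(words: List[str], scenario: str) -> List[str]:
    """Replace each word that matches a conversion source in this scenario."""
    lookup = build_lookup(SCENARIO_RULES.get(scenario, {}))
    return [lookup.get(word, word) for word in words]


recognized = ["どのような", "賞状", "ですか"]
print(correct_words(recognized, "medical"))  # ['どのような', '症状', 'ですか']
print(correct_words(recognized, "travel"))   # unchanged: no rule applies
```

Keeping the rules per scenario reflects the point made in the text: a rule such as 少々 to 症状 is only safe where the vocabulary is restricted, because outside that scope 少々 is often the word the speaker actually meant.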

Next, with reference to FIGS. 8 and 9, the determination of similar-sounding words using features of the utterance in the present embodiment is explained. In the present embodiment, the "difference" that the automatic learning unit 103 extracts as automatic learning data may be a difference in notation or sentence structure between the correct and the incorrect recognition results. With such differences alone, however, when the phrasing changes it can become impossible to determine where the mistake was, that is, which part was incorrect and which was correct.

As shown in FIG. 8, in the learning data 404d, words 3, 4, 5, 7 and 8 do not match between the two recognition results, but these mismatched words include words that were merely rephrased or restated and were not misrecognized. It is therefore desirable for the correct/incorrect determination to rely not only on match or mismatch but also on some additional conversion rule. In this example, "inquiry" (照会) and "confirmation" (確認), the words that actually form the incorrect/correct pair in the recognition results, cannot be identified as that pair.

Therefore, as shown in FIG. 9, the language model 402a is configured to include learning data 404e that has the types "sound waveform" and "emphasis" in addition to the type "match". Because "confirmation" (確認), the counterpart of "inquiry" (照会), is an emphasized word, the two can be identified as the incorrect/correct pair. Whether a portion is emphasized can be determined from the "sound waveform", as a portion that the user 20 (the speaker) stresses, based on differences in volume, tempo, emotion (for example, anger) and the like in the restated utterance.

As FIGS. 8 and 9 show, users naturally stress the part that was wrong when they restate an utterance. By using whether a portion is emphasized as a conversion rule, the system can determine which of the several or many words that mismatch in the recognition results form the incorrect/correct pair. Alternatively, the user 20 may be instructed in advance, through a user manual prepared for the voice response system 1, to stress the wrong part when restating, or may be told so at a suitable time by a synthesized-voice message. In either case, the part that was originally wrong is emphasized when the user 20 restates the utterance, so the correct and incorrect parts of the recognition result can be identified, which makes the present embodiment highly advantageous in practice.
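The idea of using emphasis to pick the truly misrecognized word out of several mismatches can be sketched as a simple filter. The description does not specify how emphasis is measured, so the numeric `emphasis` score (for instance, the volume of the restated word relative to the first utterance), the threshold value, the class and function names, and the second example row are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mismatch:
    rejected_word: str   # word in the recognition result judged incorrect
    restated_word: str   # word at the same position in the confirmed result
    emphasis: float      # hypothetical emphasis score of the restated word


def pick_corrections(
    mismatches: List[Mismatch], threshold: float = 1.5
) -> List[Tuple[str, str]]:
    """Keep only the mismatches whose restated word was emphasized."""
    return [
        (m.rejected_word, m.restated_word)
        for m in mismatches
        if m.emphasis >= threshold
    ]


# Illustrative values only: 照会/確認 is the pair named in the description,
# the second row is an invented example of a mere rephrasing.
mismatches = [
    Mismatch("照会", "確認", emphasis=2.0),
    Mismatch("したい", "お願いします", emphasis=1.0),
]
print(pick_corrections(mismatches))  # [('照会', '確認')]
```

A fixed threshold is the simplest possible choice; selecting the single most emphasized mismatch, or combining volume, tempo and emotion cues into one score, would be equally compatible with the text.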

Next, with reference to FIG. 10, a concrete usage example of the first embodiment is described for the state after the learning data automatically learned through the operations described above (see FIGS. 2 to 9) has been reflected in the voice response system 1 (more specifically, after the word dictionary 400, the acoustic model 401, the language model 402, the conversion rules 403 and so on have been updated by machine learning). In FIG. 10, the dialogues C10A to C13B, corresponding to steps S10 to S13 of FIG. 2, take place between the user 20 and the voice response system 1 in order from top to bottom in the figure.

In FIG. 10, the user 20 first makes the dialogue C10A, "I want to reserve a hotel." (ホテルを予約したい。), via the terminal 21. As an example, assume again that the word "reserve" (予約) is not spoken clearly by the user 20, or cannot be captured clearly because of noise or the like. Up to this point, including the unclear utterance, the situation is the same as for the dialogue C10A in FIG. 3, before the learning data was reflected.

This time, however, the voice response system 1 applies a conversion based on learning data, including the learning data 404f, that it has already learned from the initial incorrect recognition result of the voice recognition device 12, "I want to freely translate a firefly" (ホタルを意訳したい), and derives the correct recognition result "I want to reserve a hotel" (ホテルを予約したい) at this stage. As a result, after the processing described with FIG. 2 (mainly steps S10 and S11), the dialogue C11B, 'Is "I want to reserve a hotel" correct?', is delivered to the user 20. The dialogue C11B may be sent not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech.

In response, the user 20 replies via the terminal 21 with the dialogue C12B, "Yes." That is, the dialogue C12B tells the voice response system 1 that the dialogue C11B is correct (correct-answer data). The voice response system 1 then performs the processing described with FIG. 2 (mainly steps S12 and S13) and delivers the confirming dialogue C13B, "Understood.", to the user 20. The dialogue C13B may be sent not only as synthesized speech but also as a text message, in addition to or instead of the synthesized speech. In the usage shown in FIG. 10, the dialogue C11B is correct from the start, so the voice response system 1 does not execute the learning data extraction process (step S16 in FIG. 2).

As described in detail above, the first embodiment uses automatic learning by repetition together with the process of extracting the above-described difference as learning data, so learning can be performed automatically, relatively easily, and efficiently. Moreover, the more learning is performed, the fewer exchanges with the voice response system 1 are needed. Note that conversion using similar-sounding words also has drawbacks, such as the number of similar-sounding words growing unnecessarily large. The first embodiment therefore yields a more pronounced effect when it is implemented in a limited scenario or scope of use (for example, by restricting the scope of use of the voice response system 1 according to the first embodiment to a specific industry for each scenario).

<Second Embodiment>
A second embodiment will be described with reference to FIGS. 11 and 12. The overall hardware configuration of the voice response system 1 according to the second embodiment is the same as that of the first embodiment (see FIG. 1), so FIG. 1 is referred to again and its description is omitted as appropriate. The operation and usage of the voice response system 1 according to the second embodiment differ from those of the first embodiment (see FIGS. 2 and 3) and are described in detail below. FIG. 11 is a flowchart for the second embodiment corresponding to FIG. 2 of the first embodiment, and FIG. 12 is a schematic conceptual diagram for the second embodiment corresponding to FIG. 3 of the first embodiment. As shown in FIGS. 11 and 12, the second embodiment evaluates the answer given to the user 20 (see FIG. 1) in a QA search by the AI chatbot (QA search) unit 10 (see FIG. 1).

Specifically, in FIG. 11, first, after a dialogue similar to the speech recognition (step S10 in FIG. 2), response (step S11 in FIG. 2), and speech recognition (step S12 in FIG. 2) of the first embodiment, the speech recognition device 12, the automatic learning device 100, and related components extract the question sentence from the utterance of the user 20 (step S20). Here, for example, the question sentence "予約の照会の仕方を知りたい" ("I want to know how to inquire about a reservation") is extracted.

Next, within the voice response system 1, the AI chatbot (QA search) unit 10 executes a QA search (step S21). Here, for example, the QA search result "'At the outpatient reception, a letter of introduction...' Is that correct?" is obtained. Then, in the automatic learning device 100, the voice response control unit 102 generates this QA search result as a response. The transmission unit 101 then sends the response to the terminal 21, so that the QA search result is presented to the user 20 as synthesized speech (or as synthesized speech and text) (step S22).

In response, the speech recognition device 12 again performs speech recognition on the speech that the user 20 enters at the terminal 21 and that arrives via the communication network and the voice capture device 11, and obtains a recognition result such as "Yes" or "No" (step S23).

Next, the voice response control unit 102 judges whether the recognition result is correct or incorrect (step S24). If the recognition result is judged incorrect in step S24 (step S24: "No"), then under the control of the voice response control unit 102 the transmission unit 101 sends the response "Please say that again", which prompts the user 20 to rephrase (step S25). The process then returns to step S20, and the subsequent processing is repeated (steps S20 to S23). In other words, the voice response system 1 uses the response to prompt the user 20 to restate, within the same series of dialogues, the utterance containing the earlier question sentence.

On the other hand, if the recognition result is judged correct in step S24 (step S24: "Yes"), the automatic learning unit 103 determines whether the user 20 rephrased after step S24 returned "No" at least once during the series of dialogues. That is, it determines whether, in the series of dialogues between the user 20 and the voice response system 1, the judgment by the voice response control unit 102 was incorrect and later became correct, so that learning data can be extracted as the difference between the recognition result judged incorrect and the recognition result judged correct (step S26).

If step S26 finds no learning data to register, that is, if the QA search results contained no error up to this point (step S26: NO), there is no learning data to extract, and the series of processes ends as is. If, on the other hand, step S26 finds learning data to register, that is, if the QA search result was wrong at least once up to this point (step S26: YES), the automatic learning unit 103 extracts, as learning data, the difference between the question sentence previously judged incorrect and the question sentence now judged correct (step S27), and the series of processes ends. How the "difference" in step S27 is defined and handled, and how the extracted learning data is registered, are the same as in the first embodiment.
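
The decision in step S26 can be illustrated with a short Python sketch under stated assumptions. The `Turn` type and the function `find_learning_pair` are hypothetical names introduced here. The sketch pairs the most recent question judged incorrect with the first question judged correct, which is one possible reading of "previously judged incorrect" and is not confirmed by the specification.

```python
from typing import List, Optional, Tuple

# One turn of the dialogue: (extracted question sentence, S24 judgment).
# True means the recognition result was judged correct.
Turn = Tuple[str, bool]


def find_learning_pair(turns: List[Turn]) -> Optional[Tuple[str, str]]:
    """Return (incorrect question, correct question), or None if S26 is NO."""
    last_incorrect: Optional[str] = None
    for question, judged_correct in turns:
        if not judged_correct:
            # S24: No -> S25, the user is asked to rephrase.
            last_incorrect = question
            continue
        # S24: Yes -> S26: was there an earlier incorrect judgment?
        if last_incorrect is None:
            return None  # S26: NO, nothing to extract
        return last_incorrect, question  # S26: YES, pass the pair to S27
    return None  # the dialogue ended without a correct judgment


turns = [
    ("予約の照会の仕方を知りたい", False),
    ("予約の確認の仕方を知りたい", True),
]
print(find_learning_pair(turns))
```

Returning the pair rather than the difference keeps S26 (is there anything to learn?) separate from S27 (what exactly differs?), mirroring the two steps of the flowchart.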

Next, with reference to FIG. 12, usage of the automatic learning method, in which automatic learning is driven by evaluating QA answers as described above, is explained with a concrete example. The determination of whether learning data exists (step S26 in FIG. 11) and, if it does, the extraction of that learning data (step S27 in FIG. 11) may each be executed in real time, or may be executed afterwards from a log recording the processing of steps S20 to S25 in FIG. 11.
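
As a hedged illustration of the after-the-fact variant, the following Python sketch scans a recorded log in a single pass and collects, per dialogue, the question pair that would be handed to step S27. The record layout `LogRecord`, the dialogue identifiers, and the function `pairs_from_log` are assumptions made for this example; the specification does not describe the log format. Records are assumed to be in chronological order.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical log record: (dialogue id, question sentence, S24 judgment).
LogRecord = Tuple[str, str, bool]


def pairs_from_log(records: Iterable[LogRecord]) -> List[Tuple[str, str]]:
    """Collect (incorrect question, correct question) pairs from a log."""
    last_incorrect: Dict[str, str] = {}  # most recent failure per dialogue
    finished: Set[str] = set()           # dialogues that reached "correct"
    pairs: List[Tuple[str, str]] = []
    for dialogue_id, question, judged_correct in records:
        if dialogue_id in finished:
            continue  # ignore anything logged after the dialogue resolved
        if not judged_correct:
            last_incorrect[dialogue_id] = question
        else:
            if dialogue_id in last_incorrect:
                pairs.append((last_incorrect[dialogue_id], question))
            finished.add(dialogue_id)
    return pairs


log = [
    ("d1", "予約の照会の仕方を知りたい", False),
    ("d2", "ホテルを予約したい", True),
    ("d1", "予約の確認の仕方を知りたい", True),
]
print(pairs_from_log(log))
```

Dialogue `d2` is correct on the first attempt and therefore yields no pair, which corresponds to the "step S26: NO" branch.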

In FIG. 12, first, the user 20 says "予約の照会の仕方を知りたい。" ("I want to know how to inquire about a reservation.") via the terminal 21 (dialogue C20). In response, dialogue C21, "'I want to know how to inquire about a reservation.' Is that correct?", is presented to the user 20. Dialogue C21 may be delivered not only as synthesized speech but also as a text message in addition to or instead of the synthesized speech.

In response, the user 20 replies "Yes." via the terminal 21 (dialogue C22). That is, dialogue C22 tells the voice response system 1 that dialogue C21 is correct (the user judges it correct because the utterance, or its pronunciation, is not in itself wrong).

In response, in the voice response system 1, the AI chatbot (QA search) unit 10 executes the QA search process described with reference to FIG. 11 (step S21 in FIG. 11), and as a result dialogue C23, "At the outpatient reception, a letter of introduction..." ... "Did that solve your problem?", is presented to the user 20. Dialogue C23 may be delivered not only as synthesized speech but also as a text message in addition to or instead of the synthesized speech.

In response, the user 20 says "いいえ。予約の確認の仕方を知りたい。" ("No. I want to know how to confirm a reservation.") via the terminal 21 (dialogue C24). That is, dialogue C24 tells the voice response system 1 that dialogue C21 was incorrect (the utterance or pronunciation itself was not wrong, but judging from the QA search result it was in fact incorrect-answer data). Here the user 20, more or less aware that homophones exist, uses the word "確認" (confirm) instead of "照会" (inquire) in dialogue C24. In other words, the user 20 deliberately rephrases with an expression that is similar to, but not the same as, the earlier dialogue C20. Note that a "similar expression" is not necessarily a similar-sounding word.

In response, dialogue C25, "'I want to know how to confirm a reservation.' Is that correct?", is presented to the user 20. Dialogue C25 may be delivered not only as synthesized speech but also as a text message in addition to or instead of the synthesized speech.

In response, in the voice response system 1, the AI chatbot (QA search) unit 10 executes the QA search process described with reference to FIG. 11 (step S21 in FIG. 11), and as a result dialogue C27, "View it from the reservation site..." ... "Did that solve your problem?", is presented to the user 20. Dialogue C27 may be delivered not only as synthesized speech but also as a text message in addition to or instead of the synthesized speech.

In response, the user 20 replies "Yes." via the terminal 21 (dialogue C28). That is, dialogue C28 tells the voice response system 1 that dialogue C25 is correct (i.e., that it is correct-answer data).

In this embodiment, instead of or in addition to the input of dialogue C28, the voice response system 1 may judge that the corresponding recognition result is correct (i.e., that it is correct-answer data) based on a positive message, such as a "Like" (いいね), that the user 20 sends when satisfied with the QA search result.

In the series of dialogues above, dialogue C21 is incorrect (incorrect-answer data) and dialogue C25 is correct (correct-answer data), so the voice response system 1 executes the learning data extraction process (step S27 in FIG. 11).

More specifically, the voice response system 1 extracts the difference between the incorrect-answer data in dialogue C21 and the correct-answer data in dialogue C25. It does so by breaking each whole dialogue into multiple words (words 1 to 4) and taking as learning data 404g the words "照会" (inquire) and "確認" (confirm), which make up the part where the dialogues differ (the part whose "judgment" is "×" in FIG. 12). The voice response system 1 then registers "照会" and "確認", extracted in this way for word 3, in the word dictionary as automatic learning data 400e, with the type "synonym" relative to each other (in the form "確認" as "correct" and "照会" as "incorrect"). Registration of the automatic learning data 404h based on such a difference may be performed in real time without delay, or afterwards by using a recorded log. "確認" (confirm), "紹介" (introduction), and "照会" (inquire) may also be stored in association with one another as learning data. In addition, the correct-answer data "照会" and the incorrect-answer data "紹介", which are similar-sounding words, may be stored in association with each other as learning data.
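
The word-level difference extraction described above can be sketched in Python with the standard library's `difflib.SequenceMatcher`. This is only an illustration under stated assumptions: the four-word split of each sentence, the functions `extract_diff` and `register`, and the dictionary entry layout are hypothetical, and the specification does not say which comparison algorithm is used.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def extract_diff(incorrect: List[str], correct: List[str]) -> List[Tuple[str, str]]:
    """Return (incorrect words, correct words) for each span that differs."""
    matcher = SequenceMatcher(None, incorrect, correct, autojunk=False)
    diffs: List[Tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # the spans marked "×" in the comparison
            diffs.append(("".join(incorrect[i1:i2]), "".join(correct[j1:j2])))
    return diffs


def register(word_dictionary: List[Dict[str, str]],
             diffs: List[Tuple[str, str]], kind: str) -> None:
    """Append one entry per differing word pair (hypothetical entry layout)."""
    for wrong, right in diffs:
        word_dictionary.append({"correct": right, "incorrect": wrong, "type": kind})


# Assumed split into words 1 to 4; word 3 is the part that differs.
c21 = ["予約", "の", "照会", "の仕方を知りたい"]
c25 = ["予約", "の", "確認", "の仕方を知りたい"]

word_dictionary: List[Dict[str, str]] = []
register(word_dictionary, extract_diff(c21, c25), kind="synonym")
print(word_dictionary)
```

Only `replace` spans are collected because the text describes learning data as a pairing of a differing incorrect word with its correct counterpart; pure insertions or deletions would need a separate policy.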

As described in detail above, the second embodiment uses the evaluation of QA answers together with the process of extracting the above-described difference as learning data, so learning can be performed automatically, relatively easily, and efficiently. Moreover, the more learning is performed, the fewer exchanges with the voice response system 1 are needed. Note that conversion using similar-sounding words also has drawbacks, such as the number of similar-sounding words growing unnecessarily large. The second embodiment therefore yields a more pronounced effect when it is implemented in a limited scenario or scope of use (for example, by restricting the scope of use of the voice response system 1 according to the second embodiment to a specific industry for each scenario).

Supplementary Notes
The following supplementary notes are further disclosed in relation to the embodiments described above.

[Supplementary Note 1]
The automatic learning device for recognition results in an AI chatbot according to Supplementary Note 1 comprises: a repeating unit that repeats back a recognition result produced by the AI chatbot for an utterance from a speaker; a determination unit that determines whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and a learning unit that, when the determination by the determination unit is incorrect and subsequently becomes correct in a dialogue between the speaker and the AI chatbot concerning the utterance, extracts learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct.

[Supplementary Note 2]
The automatic learning device according to Supplementary Note 2 is the automatic learning device according to Supplementary Note 1, wherein the repeating unit repeats back the recognition result by outputting it, either as is or after at least partially replacing it with other words having the same meaning, as an utterance from the AI chatbot or in an output format recognizable by the speaker.

[Supplementary Note 3]
The automatic learning device according to Supplementary Note 3 is the automatic learning device according to Supplementary Note 1 or 2, wherein the determination unit determines whether the recognition result is correct or incorrect based on, as the speaker's reaction to the repeated recognition result, a further utterance by the speaker or input in a format that the AI chatbot can detect, identify, or recognize.

[Supplementary Note 4]
The automatic learning device according to Supplementary Note 4 is the automatic learning device according to any one of Supplementary Notes 1 to 3, wherein, as the extraction of the learning data, when the speaker makes a further utterance as the reaction and the determination for that further utterance is correct, the learning unit registers in a dictionary, as similar-sounding words, the words that differ between the recognition result determined to be incorrect and the recognition result determined to be correct.

[Supplementary Note 5]
The automatic learning device according to Supplementary Note 5 is the automatic learning device according to any one of Supplementary Notes 1 to 4, wherein, when the speaker makes a further utterance as the reaction, the AI chatbot performs emotion recognition of the speaker from the audio of the further utterance, identifies the portion of the utterance that caused the error, and recognizes the further utterance on the premise that the identified portion is incorrect.

[Supplementary Note 6]
The automatic learning device according to Supplementary Note 6 is the automatic learning device according to any one of Supplementary Notes 1 to 5, wherein the determination unit determines whether the recognition result is correct or incorrect based on, in addition to or instead of the speaker's reaction to the repeated recognition result, the speaker's reaction to a search result of a QA search executed by the AI chatbot in accordance with the recognition result produced by the AI chatbot.

[Supplementary Note 7]
The automatic learning method for recognition results in an AI chatbot according to Supplementary Note 7 comprises: a repeating step of repeating back a recognition result produced by the AI chatbot for an utterance from a speaker; a determination step of determining whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and a learning step of, when the determination in the determination step is incorrect and subsequently becomes correct in a dialogue between the speaker and the AI chatbot concerning the utterance, extracting learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct.

[Supplementary Note 8]
The computer program according to Supplementary Note 8 is a computer program that causes a computer to execute the automatic learning method according to Supplementary Note 7.

[Supplementary Note 9]
The recording medium according to Supplementary Note 9 is a recording medium on which the computer program according to Supplementary Note 8 is recorded.

The present invention may be modified as appropriate within a scope that does not depart from the gist or idea of the invention that can be read from the claims and the specification as a whole, and automatic learning devices and methods for recognition results in an AI chatbot, as well as computer programs and recording media, that involve such modifications are also included in the technical idea of the present invention.

1...Voice response system (AI chatbot)
10...AI chatbot (QA search) unit
11...Voice capture device
12...Speech recognition device
20...User
21...Terminal
100...Automatic learning device
101...Transmission unit
102...Voice response control unit
103...Automatic learning unit
400...Word dictionary DB
401...Acoustic model DB
402...Language model DB
403...Conversion rule DB
404...Learning data DB

Claims

1. An automatic learning device for recognition results in an AI chatbot, comprising:
a repeating unit that repeats back a recognition result produced by the AI chatbot for an utterance from a speaker;
a determination unit that determines whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and
a learning unit that, when the determination by the determination unit is incorrect and subsequently becomes correct in a dialogue between the speaker and the AI chatbot concerning the utterance, extracts learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct,
wherein the determination unit determines whether the recognition result is correct or incorrect based on, in addition to or instead of the speaker's reaction to the repeated recognition result, the speaker's reaction to a search result of a QA search executed by the AI chatbot in accordance with the recognition result produced by the AI chatbot.

2. The automatic learning device for recognition results in an AI chatbot according to claim 1, wherein the repeating unit repeats back the recognition result by outputting it, either as is or after at least partially replacing it with other words having the same meaning, as an utterance from the AI chatbot or in an output format recognizable by the speaker.

3. The automatic learning device for recognition results in an AI chatbot according to claim 1 or 2, wherein the determination unit determines whether the recognition result is correct or incorrect based on, as the speaker's reaction to the repeated recognition result, a further utterance by the speaker or input in a format that the AI chatbot can detect, identify, or recognize.

4. The automatic learning device for recognition results in an AI chatbot according to any one of claims 1 to 3, wherein, as the extraction of the learning data, when the speaker makes a further utterance as the reaction and the determination for that further utterance is correct, the learning unit registers in a dictionary, as similar-sounding words, the words that differ between the recognition result determined to be incorrect and the recognition result determined to be correct.

5. The automatic learning device for recognition results in an AI chatbot according to any one of claims 1 to 4, wherein, when the speaker makes a further utterance as the reaction, the AI chatbot performs emotion recognition of the speaker from the audio of the further utterance, identifies the portion of the utterance that caused the error, and recognizes the further utterance on the premise that the identified portion is incorrect.

6. An automatic learning method for recognition results in an AI chatbot, the method comprising:
a repeating step of repeating back a recognition result produced by the AI chatbot for an utterance from a speaker;
a determination step of determining whether the recognition result is correct or incorrect based on the speaker's reaction to the repeated recognition result; and
a learning step of, when the determination in the determination step is incorrect and subsequently becomes correct in a dialogue between the speaker and the AI chatbot concerning the utterance, extracting learning data for the utterance based on the difference between the recognition result determined to be incorrect and the recognition result determined to be correct,
wherein the determination step determines whether the recognition result is correct or incorrect based on, in addition to or instead of the speaker's reaction to the repeated recognition result, the speaker's reaction to a search result of a QA search executed by the AI chatbot in accordance with the recognition result produced by the AI chatbot.

7. A computer program that causes a computer to execute the automatic learning method for recognition results in an AI chatbot according to claim 6.

8. A recording medium on which the computer program according to claim 7 is recorded.