JP2021081527A

JP2021081527A - Voice recognition device, voice recognition method, and voice recognition program

Info

Publication number: JP2021081527A
Application number: JP2019207512A
Authority: JP
Inventors: 光洋高波; Mitsuhiro Takanami
Original assignee: NTT Communications Corp
Current assignee: NTT Communications Corp
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2021-05-27
Anticipated expiration: 2039-11-15
Also published as: JP7495220B2

Abstract

To improve recognition accuracy of voice. SOLUTION: A reading device 10 comprises: an utterance information acquisition part 133; a learning part 134; and an output processing part 135. The utterance information acquisition part 133 acquires utterance information including information indicating movement of a user's mouth when the user speaks in a whisper and voice information of the user. The learning part 134 generates a model outputting a recognition result of an utterance content which the utterance information indicates by learning using the utterance information acquired by the utterance information acquisition part 133 and the utterance content which the utterance information indicates. The output processing part 135 outputs the recognition result of the utterance content which the utterance information indicates by using the model with the utterance information being an object of recognition as input. SELECTED DRAWING: Figure 3

Description

The present invention relates to a voice recognition device, a voice recognition method, and a voice recognition program.

Technological advances have improved the accuracy of technologies such as converting input voice into text, and have broadened the available means of communication.

Japanese Translation of PCT International Application Publication No. 2012-510088; Japanese Unexamined Patent Publication No. 2019-60921; Japanese Unexamined Patent Publication No. 2012-003162

However, voice input depends heavily on the environment in which it is performed, which may interfere with necessary communication. For example, the input voice may not be recognized accurately in cases such as: 1. voice input in an environment where the user cannot speak loudly, such as a public place; 2. voice input by a user who is in poor physical condition, such as with a cold, or whose voice quality is difficult to hear; 3. voice input in a place with loud surrounding sounds, such as a main road or an event venue; and 4. voice input by a user with a speech disability or the like. As a result, there are cases where advanced communication tools cannot be used.

Therefore, an object of the present invention is to solve the above-mentioned problems and improve voice recognition accuracy.

In order to solve the above-mentioned problems, the present invention is characterized by including: a first acquisition unit that acquires utterance information including information indicating the movement of a user's mouth when the user speaks in a whisper and voice information of the user; and an output unit that, using a model created by learning from the utterance information acquired by the first acquisition unit and the utterance content indicated by that utterance information, takes utterance information to be recognized as input and outputs a recognition result of the utterance content indicated by that utterance information.

According to the present invention, voice recognition accuracy can be improved.

FIG. 1 is a diagram showing a configuration example of a system including a reading device.
FIG. 2 is a diagram illustrating an outline of the system.
FIG. 3 is a diagram showing a configuration example of the reading device.
FIG. 4 is a diagram showing a configuration example of the terminal device.
FIG. 5 is a flowchart showing an example of a processing procedure of the reading device.
FIG. 6 is a sequence diagram showing an example of a processing procedure of the system.
FIG. 7 is a diagram showing an example of a computer that executes a voice recognition program.

Hereinafter, modes for carrying out the present invention (embodiments) will be described with reference to the drawings. The present invention is not limited to the embodiments described below.

The outline of a system including the reading device (voice recognition device) 10 of the present embodiment will be described with reference to FIGS. 1 and 2. As shown in FIG. 1, the system includes, for example, the reading device 10 and a user's terminal device 20. The reading device 10 recognizes what the user has uttered based on the movement of the user's mouth and the user's voice when the user speaks in a whisper, both of which are acquired from the terminal device 20. The reading device 10 then transmits the recognition result (for example, text information) to the terminal device 20. In the following description, a whisper is a voice (unvoiced sound) that the user produces without vibrating the vocal cords.

The terminal device 20 is a mobile phone, a smartphone, a tablet terminal, a personal computer, or the like. The terminal device 20 and the reading device 10 are communicably connected to each other via a network such as the Internet. The number of terminal devices 20 and reading devices 10 installed in the system is not limited to the number shown in FIG. 2.

Next, the outline of the system will be described with reference to FIG. 2. For example, the terminal device 20 first acquires, with a camera or the like, the changes between the coordinates of the contour of the user's mouth when the user speaks in a whisper, and acquires, with a microphone or the like, the voice waveform of the whisper. Next, the terminal device 20 creates, for example, a multiplexed digital signal in which information indicating the acquired change in the mouth contour (reading information) and information indicating the voice waveform of the whisper (voice information) are multiplexed, and transmits it to the reading device 10. The reading device 10 identifies the user's utterance content based on the multiplexed digital signal transmitted from the terminal device 20. As a result, the reading device 10 can identify the user's utterance content more easily even when that content cannot be identified (recognized) from the reading information alone or from the voice information alone. For example, by using the voice information, the reading device 10 can identify consonants and the boundaries between words, phrases, and the like in an utterance, which are difficult to identify from the reading information alone. As a result, the reading device 10 can improve the accuracy of identifying the user's utterance content.
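The multiplexing step described above can be illustrated with a minimal sketch. The patent does not specify a wire format, so the stream identifiers, the header layout (`HEADER`), and the function names `multiplex` and `demultiplex` below are all assumptions made for illustration only.

```python
import struct

# Hypothetical stream identifiers; the patent does not define a wire format.
STREAM_LIP = 0x01    # reading information (mouth-contour changes)
STREAM_AUDIO = 0x02  # voice information (whisper waveform)

# Assumed header: stream id (1 byte), timestamp in ms (4 bytes), payload length (4 bytes).
HEADER = struct.Struct("!BII")


def multiplex(chunks):
    """Interleave (stream_id, timestamp_ms, payload) chunks into one byte string, ordered by time."""
    out = bytearray()
    for stream_id, ts, payload in sorted(chunks, key=lambda c: (c[1], c[0])):
        out += HEADER.pack(stream_id, ts, len(payload))
        out += payload
    return bytes(out)


def demultiplex(data):
    """Split a multiplexed byte string back into per-stream lists of (timestamp_ms, payload)."""
    streams = {STREAM_LIP: [], STREAM_AUDIO: []}
    offset = 0
    while offset < len(data):
        stream_id, ts, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        streams[stream_id].append((ts, data[offset:offset + length]))
        offset += length
    return streams
```

Carrying a timestamp on every chunk is one way to let the receiving side align mouth movement with the whisper waveform; the patent itself only states that the two kinds of information are multiplexed.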

Further, the reading device 10 learns the reading information and the voice information obtained when the user speaks in a whisper, and uses the learning result to identify the user's utterance content. This further improves the accuracy of identifying the user's utterance content. For example, even when the utterance content cannot be identified from the combination of the reading information and the voice information alone, using the learning result makes the utterance content easier to identify.

[Reading device]
Next, a configuration example of the reading device 10 will be described with reference to FIG. 3. As shown in FIG. 3, the reading device 10 includes a communication unit 11, a storage unit 12, and a control unit 13.

The communication unit 11 is realized by, for example, a NIC (Network Interface Card) or the like. The communication unit 11 is connected to the network by wire or wirelessly, and transmits and receives information to and from the terminal device 20.

The storage unit 12 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 12 stores information that the control unit 13 refers to when performing various processes, and information created by those processes. For example, the storage unit 12 stores the model learned (created) by the learning unit 134. This model is created for each user of the terminal device 20. Details of the model will be described later.

The control unit 13 is a controller, and is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing various programs stored in a storage device inside the reading device 10 (corresponding to an example of the voice recognition program), using the RAM as a work area.

The control unit 13 includes a reading processing unit 131 and a voice-text conversion unit (text conversion unit) 132.

The reading processing unit 131 recognizes the user's utterance content based on information indicating the movement of the user's mouth at the time of the utterance and the voice information of the user's whisper (collectively referred to as "utterance information"). For example, the reading processing unit 131 generates, based on the user's utterance information, voice data indicating the user's utterance content. The reading processing unit 131 then outputs the generated voice data to the voice-text conversion unit 132. The voice-text conversion unit 132 converts the voice data output from the reading processing unit 131 into text information.

The reading processing unit 131 will be described in detail. The reading processing unit 131 includes an utterance information acquisition unit (first acquisition unit) 133, a learning unit 134, an output processing unit (output unit) 135, and a correction information acquisition unit (second acquisition unit) 136.

The utterance information acquisition unit 133 acquires the user's utterance information from the terminal device 20. For example, the utterance information acquisition unit 133 acquires from the terminal device 20 the utterance information produced when the user utters the text for initial setting, and the utterance information of the user that is to be recognized.

The learning unit 134 performs learning using the user's utterance information acquired by the utterance information acquisition unit 133 and the utterance content indicated by that utterance information. For example, the learning unit 134 learns the user's utterance information together with the utterance content it indicates, and creates a model for outputting a recognition result of the utterance content indicated by the user's utterance information.

As an example, the learning unit 134 first registers, as the initial information of the model, information that associates the utterance information obtained via the utterance information acquisition unit 133 when the user reads out the text for initial setting with the content of that text.

After that, when the correction information acquisition unit 136 (described later) receives from the terminal device 20 correction information regarding a recognition result of utterance content obtained with the model after registration of the initial information, the learning unit 134 corrects the user's model based on that correction information. Likewise, when the correction information acquisition unit 136 receives from the terminal device 20 correction information regarding a recognition result obtained with the corrected model, the learning unit 134 again corrects the user's model based on that correction information. By repeating this processing, the learning unit 134 can create a model capable of accurately recognizing the user's utterance content.

When performing learning based on the user's utterance information, the learning unit 134 learns homonyms by using the words before and after the word of interest and the voice. For example, the movement of the user's mouth is almost the same when pronouncing the Japanese words "ツール" (tsūru, "tool"), "ルーツ" (rūtsu, "roots"), and "クール" (kūru, "cool"). Therefore, for example, the learning unit 134 learns such homonyms by using the words before and after "tool", "roots", and "cool" in the following sentences, together with the voice.

・The tools invented by the Japanese are excellent.
・The roots of the Japanese are the Jomon and Yayoi people.
・Everything the Japanese use is cool.

In this way, the learning unit 134 can create a model that can accurately recognize homonyms among the words uttered by the user.
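As a rough illustration of using neighbouring words to separate candidates that look alike on the lips, the following sketch counts which context words were seen with each candidate during learning and picks the best-matching candidate at recognition time. It is a toy stand-in, not the patent's learning method: the class name `ContextDisambiguator` and the co-occurrence scoring are assumptions, and the sketch ignores the voice information that the learning unit 134 also uses.

```python
from collections import Counter, defaultdict


class ContextDisambiguator:
    """Toy chooser among visually similar words based on surrounding words."""

    def __init__(self):
        # For each target word, how often each context word appeared near it.
        self.context_counts = defaultdict(Counter)

    def learn(self, target, context_words):
        """Record the words that surrounded `target` in one training sentence."""
        self.context_counts[target].update(context_words)

    def choose(self, candidates, context_words):
        """Return the candidate whose learned context best overlaps the given context.

        Ties are resolved in favour of the earliest candidate in the list.
        """
        def score(word):
            return sum(self.context_counts[word][w] for w in context_words)

        return max(candidates, key=score)
```

For example, after learning "tool" with the context words of the first sentence above and "roots" with those of the second, a new utterance whose surrounding words include "Jomon" would be resolved to "roots".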

The output processing unit 135 takes the utterance information of the user to be recognized as input, and outputs a recognition result of the utterance content indicated by that utterance information by using the learning result of the learning unit 134 (for example, the above model). For example, the output processing unit 135 takes the utterance information of the user to be recognized as input, generates voice data indicating the user's utterance content by using the above model, and outputs the voice data to the voice-text conversion unit 132. After that, on receiving the text information for the voice data from the voice-text conversion unit 132, the output processing unit 135 transmits the text information to the user's terminal device 20.

The correction information acquisition unit 136 acquires, from the terminal device 20, correction information regarding the recognition result of the user's utterance content. For example, the correction information acquisition unit 136 receives from the terminal device 20 correction information regarding the text data indicating the user's utterance content. The correction information acquisition unit 136 then outputs the correction information to the learning unit 134.

[Terminal device]
Next, a configuration example of the terminal device 20 will be described with reference to FIG. 4. As shown in FIG. 4, the terminal device 20 includes a communication unit 21, a storage unit 22, a microphone 23, a camera 24, an input unit 25, an output unit 26, and a control unit 27.

The communication unit 21 is an interface such as a NIC that communicates with the reading device 10, which is communicably connected to the network.

The storage unit 22 is realized by, for example, a semiconductor memory element such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk.

The microphone 23 acquires voice. For example, the microphone 23 acquires the user's voice when the user speaks in a whisper. The camera 24 captures images (moving or still images). For example, the camera 24 captures images of the movement of the user's mouth when the user speaks in a whisper.

The input unit 25 is an input device that receives various operations from the user. For example, the input unit 25 is realized by a keyboard, a mouse, operation keys, or the like. The output unit 26 is a display device for displaying various information. For example, the output unit 26 is realized by a liquid crystal display or the like. When a touch panel is adopted for the terminal device 20, the input unit 25 and the output unit 26 are integrated.

The control unit 27 is a controller, and is realized by, for example, a CPU or an MPU executing various programs (the reading program) stored in a storage device inside the terminal device 20, using the RAM as a work area.

The control unit 27 includes an utterance reception unit 271, an utterance information transmission unit 272, a text reception unit 273, a display unit 274, and a correction information transmission unit 275.

The utterance reception unit 271 acquires, with the camera 24, information indicating the movement of the user's mouth at the time of utterance, and acquires, with the microphone 23, the user's voice information at the time of utterance.
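The system outline describes mouth movement as the change between the coordinates of the mouth contour. A minimal sketch of that idea follows, assuming the contour has already been extracted from each camera frame as a fixed-order list of (x, y) points; the landmark extraction itself and the function name `contour_deltas` are not from the patent.

```python
def contour_deltas(frames):
    """Return, for each consecutive pair of frames, the displacement of every contour point.

    frames: list of frames; each frame is a list of (x, y) contour coordinates,
    with the same number of points in the same order in every frame.
    """
    deltas = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("every frame must contain the same number of contour points")
        deltas.append([(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)])
    return deltas
```

With N frames the function yields N - 1 displacement lists, which could then serve as the "information indicating the movement of the mouth" sent to the reading device 10.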

The utterance information transmission unit 272 creates utterance information that includes the information indicating the movement of the user's mouth at the time of utterance and the user's voice information, both acquired by the utterance reception unit 271, and transmits the utterance information to the reading device 10. For example, the utterance information transmission unit 272 converts the mouth-movement information and the voice information acquired by the utterance reception unit 271 into a digital signal, and transmits the digital signal to the reading device 10 as the utterance information.

The text reception unit 273 receives, from the reading device 10, text information indicating the user's utterance content. The display unit 274 displays various information on the output unit 26. For example, the display unit 274 displays the text information for initial setting on the output unit 26, and displays the text information received by the text reception unit 273 on the output unit 26.

For example, when the reading program of the terminal device 20 is started, the display unit 274 displays on the output unit 26 a message prompting the user to read out the text information for initial setting in a whisper. When the user reads out the text information in a whisper, the utterance reception unit 271 acquires, with the camera 24, information indicating the movement of the user's mouth during the reading, and acquires, with the microphone 23, the voice information during the reading. The utterance information transmission unit 272 then creates utterance information that includes the mouth-movement information and the voice information acquired by the utterance reception unit 271 while the text for initial setting was read out, and transmits it to the reading device 10.

The correction information transmission unit 275 transmits, to the reading device 10, correction information for the text information received from the reading device 10. For example, when the display unit 274 has displayed the text information received from the reading device 10 on the output unit 26 and correction information for that text information is then received via the input unit 25, the correction information transmission unit 275 transmits the correction information to the reading device 10.

[Processing procedure]
Next, an example of the processing procedure of the reading device 10 will be described with reference to FIG. 5.

Although omitted from FIG. 5, the reading device 10 identifies which user the accessing terminal device 20 belongs to, for example by performing user authentication on the terminal device 20. This allows the reading device 10 to manage a model for each user.

First, the utterance information acquisition unit 133 of the reading device 10 acquires the user's utterance information for initial setting from the terminal device 20 (S1). For example, the utterance information acquisition unit 133 acquires the utterance information produced when the user reads out the text for initial setting in a whisper. The learning unit 134 then registers the utterance information for initial setting acquired in S1 as the initial information of the model (S2: registration of the user's utterance information for initial setting). For example, the learning unit 134 registers, as the initial information of the model, information that associates the utterance information produced when the user read out the text for initial setting in a whisper with the content of that text.

After S2, when the utterance information acquisition unit 133 of the reading device 10 acquires from the terminal device 20 the utterance information of the user whose utterance content is to be recognized (S3), the output processing unit 135 uses the user's model to output a recognition result of the utterance content indicated by that utterance information (S4). For example, when the utterance information to be recognized is acquired from the terminal device 20, the output processing unit 135 uses the user's model to generate voice data indicating the utterance content indicated by the utterance information. The output processing unit 135 then outputs the generated voice data to the voice-text conversion unit 132. After that, the output processing unit 135 receives the text information for the voice data from the voice-text conversion unit 132 and transmits the received text information to the user's terminal device 20.

After that, when the correction information acquisition unit 136 acquires, from the user's terminal device 20, correction information for the recognition result output in S4 (Yes in S5), the learning unit 134 corrects the user's model based on the correction information (S6), and the process returns to S3. On the other hand, when the correction information acquisition unit 136 does not acquire correction information for the recognition result output in S4 from the user's terminal device 20 (No in S5), the process returns to S3.

By repeating the processes of S3 to S6 above, the reading device 10 can learn the characteristics of the user's mouth movement and voice when the user speaks in a whisper. As a result, the reading device 10 can accurately recognize the content of the user's whispered utterances.
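The control flow of S1 to S6 can be sketched as follows. The sketch only mirrors the order of the steps: `LookupModel` is a deliberately trivial stand-in (an exact-match table) for the learned per-user model, and `run_session` and `get_correction` are hypothetical names, not part of the patent.

```python
class LookupModel:
    """Toy stand-in for a per-user model: exact-match lookup from utterance features to text."""

    def __init__(self):
        self.table = {}

    def fit(self, features, text):
        """Associate one piece of utterance information with its utterance content."""
        self.table[features] = text

    def predict(self, features):
        """Return the registered text, or an empty string if the features are unknown."""
        return self.table.get(features, "")


def run_session(setup_features, setup_text, utterances, get_correction):
    """Run the S1-S6 loop for one user and return the recognition results in order.

    get_correction(features, text) returns the corrected text, or None when
    the user sends no correction.
    """
    model = LookupModel()
    model.fit(setup_features, setup_text)            # S1-S2: register initial information
    results = []
    for features in utterances:                      # S3: acquire utterance information
        text = model.predict(features)               # S4: output recognition result
        results.append(text)
        corrected = get_correction(features, text)   # S5: correction information received?
        if corrected is not None:
            model.fit(features, corrected)           # S6: correct the user's model
    return results
```

In this toy version an utterance that is misrecognized once and corrected is recognized correctly the next time it appears, which is the behaviour the repeated S3 to S6 loop is meant to achieve with a real learned model.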

[Example of processing procedure]
Next, an example of the processing procedure of the system including the reading device 10 will be described with reference to FIG. 6. The processing procedure is divided into, for example, the phases of (1) initial information registration, (2) use of the voice recognition service, and (3) utilization of the recognition result.

(1) Initial information registration
For example, when the reading application of the terminal device 20 is started, the terminal device 20 displays a text sentence for initial setting on the output unit 26. The user of the terminal device 20 then reads the text sentence for initial setting aloud in a whisper toward the terminal device 20 (S11). At this time, the utterance reception unit 271 acquires, with the camera 24, information indicating the movement of the user's mouth while the text sentence is read out, and acquires, with the microphone 23, the user's voice information while the text sentence is read out. After that, the utterance information transmission unit 272 converts the acquired mouth-movement information and voice information into a digital signal and transmits it to the reading device 10 (S12).

After S12, when the utterance information acquisition unit 133 of the reading device 10 receives the digital signal from the terminal device 20, the learning unit 134 collates the received digital signal with the text for initial setting and registers it in the model (S13). That is, the learning unit 134 registers the initial information of the user's utterance in the model.

(2) Use of voice recognition service
Next, the user speaks in a whisper toward the terminal device 20 (S21). The utterance reception unit 271 acquires, with the camera 24, information indicating the movement of the user's mouth during the utterance, and acquires, with the microphone 23, the user's voice information during the utterance. After that, the utterance information transmission unit 272 converts the acquired mouth-movement information and voice information into a digital signal and transmits it to the reading device 10 (S22).

After S22, when the utterance information acquisition unit 133 of the reading device 10 receives the above digital signal from the user's terminal device 20, the output processing unit 135 converts the received digital signal into a voice signal by using the model in which the user's initial information is registered, and outputs the voice signal to the voice-text conversion unit 132 (S23). The voice-text conversion unit 132 then converts the voice signal into text information and outputs it to the output processing unit 135 (S24: voice-to-text conversion). The output processing unit 135 transmits the text information converted in S24 to the user's terminal device 20 (S25).

Ｓ２５の後、端末装置２０のテキスト受信部２７３が、読話装置１０からテキスト情報を受信すると、表示部２７４は、受信したテキスト情報を出力部２６に表示する（Ｓ２６：テキスト表示）。次に、ユーザは、端末装置２０に表示されたテキスト情報を確認し（Ｓ２７）、当該テキスト情報に修正が必要な部分があれば、入力部２５等によりテキスト情報の修正情報を入力する。そして、修正情報送信部２７５は、入力されたテキスト情報の修正情報を読話装置１０へ送信する（Ｓ２８：テキスト修正）。その後、読話装置１０の修正情報取得部１３６が、ユーザの端末装置２０からテキスト情報の修正情報を受信すると、学習部１３４は当該修正情報を用いて、当該ユーザのモデルの修正を行う（Ｓ２９：修正情報を用いたモデルの修正）。 After S25, when the text receiving unit 273 of the terminal device 20 receives the text information from the reading device 10, the display unit 274 displays the received text information on the output unit 26 (S26: text display). Next, the user checks the text information displayed on the terminal device 20 (S27) and, if any part of it needs correcting, enters correction information for the text information via the input unit 25 or the like. The correction information transmission unit 275 then transmits the entered correction information to the reading device 10 (S28: text correction). After that, when the correction information acquisition unit 136 of the reading device 10 receives the correction information from the user's terminal device 20, the learning unit 134 uses it to modify that user's model (S29: model modification using correction information).

なお、ここでは説明を省略しているが、Ｓ２９の後、システムが再度当該ユーザの発話を受け付けた場合、読話装置１０は、修正後の当該ユーザのモデルに基づき、再度Ｓ２３以降の処理を実行する。上記の処理を繰り返すことで、読話装置１０は、ユーザにカスタマイズされた精度の高い変換を実現するモデルを作成することができる。 Although the description is omitted here, when the system accepts an utterance from the same user again after S29, the reading device 10 executes the processing from S23 onward again based on that user's modified model. By repeating the above processing, the reading device 10 can create a model that achieves highly accurate conversion customized to the user.

なお、Ｓ２７においてユーザが端末装置２０に表示されたテキスト情報を確認し、修正の必要な部分がなければ、修正の必要がない旨を端末装置２０に入力してもよい。その場合、端末装置２０は、当該テキスト情報に修正の必要がない旨の情報を読話装置１０へ送信する。 In S27, the user may check the text information displayed on the terminal device 20, and if there is no portion that needs to be corrected, the user may input to the terminal device 20 that the correction is not necessary. In that case, the terminal device 20 transmits information to the reading device 10 that the text information does not need to be modified.
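The registration step (S13) and the correction loop (S26 to S29) can be illustrated with a toy per-user model. The sketch below substitutes a 1-nearest-neighbour lookup over stored examples for the learned model, and maps features directly to text rather than going through an intermediate voice signal; both simplifications, and the names `UserModel` and `feedback_round`, are assumptions for illustration and are not taken from the patent.

```python
from typing import List, Optional, Sequence, Tuple


class UserModel:
    """Toy per-user model: 1-nearest-neighbour over stored (features, text) examples."""

    def __init__(self) -> None:
        self.examples: List[Tuple[Tuple[float, ...], str]] = []

    def register(self, features: Sequence[float], text: str) -> None:
        """Store an example, replacing any example with identical features (S13, S29)."""
        key = tuple(features)
        self.examples = [(f, t) for f, t in self.examples if f != key]
        self.examples.append((key, text))

    def recognize(self, features: Sequence[float]) -> str:
        """Return the text of the closest stored example (squared Euclidean distance)."""
        if not self.examples:
            raise ValueError("no initial information registered for this user")

        def distance(stored: Tuple[float, ...]) -> float:
            return sum((a - b) ** 2 for a, b in zip(stored, features))

        return min(self.examples, key=lambda ex: distance(ex[0]))[1]


def feedback_round(
    model: UserModel, features: Sequence[float], user_correction: Optional[str] = None
) -> str:
    """One pass of S23-S29: recognize, then apply the user's correction if one was given."""
    text = model.recognize(features)
    if user_correction is not None and user_correction != text:
        model.register(features, user_correction)  # S29: modify the model
    return text
```

For example, after `model.register([0.0, 0.0], "hello")` and `model.register([1.0, 1.0], "goodbye")`, a call to `feedback_round(model, [0.4, 0.4], "help")` returns the uncorrected result and stores the correction, so a later utterance with the same features is recognized as the corrected text. Passing `user_correction=None` corresponds to the case where the user reports that no correction is needed.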

（３）認識結果の活用 (3) Utilization of the recognition result
また、端末装置２０は、読話装置１０から受信したテキスト情報（発話内容の認識結果）を他のアプリケーションやサービスに活用してもよい。例えば、端末装置２０は、受信したテキスト情報を用いてメール、チャット等のコミュニケーションアプリへのテキスト連携を行ってもよい。 The terminal device 20 may also use the text information received from the reading device 10 (the recognition result of the utterance content) in other applications and services. For example, the terminal device 20 may pass the received text information to a communication application such as e-mail or chat.

［その他］ [Others]
また、上記の実施形態において読話装置１０は、ユーザの発話内容をテキスト情報に変換したものを端末装置２０へ送信することとしたが、これに限定されない。例えば、読話装置１０はユーザの発話内容を示す音声データを端末装置２０へ送信してもよい。 In the above embodiment, the reading device 10 converts the user's utterance content into text information and transmits it to the terminal device 20, but the invention is not limited to this. For example, the reading device 10 may transmit voice data indicating the user's utterance content to the terminal device 20.

また、上記の実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。例えば、各図に示した各種情報は、図示した情報に限られない。 Among the processes described in the above embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above document and drawings can be changed arbitrarily unless otherwise specified. For example, the various information shown in each figure is not limited to the illustrated information.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。例えば、読話装置１０の機能を端末装置２０に装備してもよい。 Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the one illustrated, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the functions of the reading device 10 may be provided in the terminal device 20.

また、上述してきた実施形態及び変形例は、処理内容を矛盾させない範囲で適宜組み合わせることが可能である。 Further, the above-described embodiments and modifications can be appropriately combined as long as the processing contents do not contradict each other.

以上、本願の実施形態のいくつかを図面に基づいて詳細に説明したが、これらは例示であり、発明の開示の欄に記載の態様を始めとして、当業者の知識に基づいて種々の変形、改良を施した他の形態で本発明を実施することが可能である。 Some embodiments of the present application have been described above in detail with reference to the drawings, but these are examples. The present invention can be practiced in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention.

［プログラム］ [Program]
また、上記の実施形態で述べた読話装置１０の機能を実現するプログラムを所望の情報処理装置（コンピュータ）にインストールすることによって実装できる。例えば、パッケージソフトウェアやオンラインソフトウェアとして提供される上記のプログラムを情報処理装置に実行させることにより、情報処理装置を読話装置１０として機能させることができる。ここで言う情報処理装置には、デスクトップ型またはノート型のパーソナルコンピュータ、ラック搭載型のサーバコンピュータ等が含まれる。また、その他にも、情報処理装置にはスマートフォン、携帯電話機やＰＨＳ（Personal Handyphone System）等の移動体通信端末、さらには、ＰＤＡ（Personal Digital Assistants）等がその範疇に含まれる。また、読話装置１０を、クラウドサーバに実装してもよい。 The reading device 10 described in the above embodiment can be implemented by installing a program that realizes its functions on a desired information processing device (computer). For example, by causing an information processing device to execute the above program, provided as packaged software or online software, the information processing device can be made to function as the reading device 10. The information processing devices referred to here include desktop and notebook personal computers, rack-mounted server computers, and the like. The category also includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) handsets, and PDAs (Personal Digital Assistants). The reading device 10 may also be implemented on a cloud server.

図７を用いて、上記のプログラム（音声認識プログラム）を実行するコンピュータの一例を説明する。図７に示すように、コンピュータ１０００は、例えば、メモリ１０１０と、ＣＰＵ１０２０と、ハードディスクドライブインタフェース１０３０と、ディスクドライブインタフェース１０４０と、シリアルポートインタフェース１０５０と、ビデオアダプタ１０６０と、ネットワークインタフェース１０７０とを有する。これらの各部は、バス１０８０によって接続される。 An example of a computer that executes the above program (speech recognition program) will be described with reference to FIG. 7. As shown in FIG. 7, the computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１およびＲＡＭ（Random Access Memory）１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。ディスクドライブ１１００には、例えば、磁気ディスクや光ディスク等の着脱可能な記憶媒体が挿入される。シリアルポートインタフェース１０５０には、例えば、マウス１１１０およびキーボード１１２０が接続される。ビデオアダプタ１０６０には、例えば、ディスプレイ１１３０が接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. For example, a mouse 1110 and a keyboard 1120 are connected to the serial port interface 1050. A display 1130 is connected to the video adapter 1060, for example.

ここで、図７に示すように、ハードディスクドライブ１０９０は、例えば、ＯＳ１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３およびプログラムデータ１０９４を記憶する。前記した実施形態で説明した各種データや情報は、例えばハードディスクドライブ１０９０やメモリ１０１０に記憶される。 Here, as shown in FIG. 7, the hard disk drive 1090 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. The various data and information described in the above-described embodiment are stored in, for example, the hard disk drive 1090 or the memory 1010.

そして、ＣＰＵ１０２０が、ハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてＲＡＭ１０１２に読み出して、上述した各手順を実行する。 Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as needed, and executes each of the above-described procedures.

なお、上記の音声認識プログラムに係るプログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限られず、例えば、着脱可能な記憶媒体に記憶されて、ディスクドライブ１１００等を介してＣＰＵ１０２０によって読み出されてもよい。あるいは、上記のプログラムに係るプログラムモジュール１０９３やプログラムデータ１０９４は、ＬＡＮやＷＡＮ（Wide Area Network）等のネットワークを介して接続された他のコンピュータに記憶され、ネットワークインタフェース１０７０を介してＣＰＵ１０２０によって読み出されてもよい。 The program module 1093 and the program data 1094 related to the above voice recognition program are not limited to being stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 related to the above program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

１０ 読話装置 10 Reading device
１１，２１ 通信部 11, 21 Communication unit
１２，２２ 記憶部 12, 22 Storage unit
１３，２７ 制御部 13, 27 Control unit
２０ 端末装置 20 Terminal device
２３ マイク 23 Microphone
２４ カメラ 24 Camera
２５ 入力部 25 Input unit
２６ 出力部 26 Output unit
１３１ 読話処理部 131 Reading processing unit
１３２ 音声テキスト変換部 132 Voice-text conversion unit
１３３ 発話情報取得部 133 Utterance information acquisition unit
１３４ 学習部 134 Learning unit
１３５ 出力処理部 135 Output processing unit
１３６ 修正情報取得部 136 Correction information acquisition unit
２７１ 発話受付部 271 Utterance reception unit
２７２ 発話情報送信部 272 Utterance information transmission unit
２７３ テキスト受信部 273 Text receiving unit
２７４ 表示部 274 Display unit
２７５ 修正情報送信部 275 Correction information transmission unit

Claims

A voice recognition device comprising:
a first acquisition unit that acquires utterance information including information indicating the movement of a user's mouth when the user speaks in a whisper and voice information of the user; and
an output unit that, using a model created by learning from the utterance information acquired by the first acquisition unit and the utterance content indicated by that utterance information, takes utterance information to be recognized as input and outputs a recognition result of the utterance content indicated by that utterance information.

The voice recognition device according to claim 1, further comprising a learning unit that creates, by learning from the utterance information acquired by the first acquisition unit and the utterance content indicated by that utterance information, a model that outputs a recognition result of the utterance content indicated by utterance information,
wherein the output unit uses the model, with utterance information to be recognized as input, to output a recognition result of the utterance content indicated by that utterance information.

The voice recognition device according to claim 2, further comprising a text conversion unit that converts the recognition result of the utterance content into text information,
wherein the output unit outputs the text information of the recognition result of the utterance content converted by the text conversion unit.

The voice recognition device according to claim 3, wherein the recognition result of the utterance content is voice data of the utterance content, and
the text conversion unit converts the voice data of the utterance content into text information.

The voice recognition device according to claim 3, further comprising a second acquisition unit that acquires correction information, input by the user, for the text information of the recognition result of the utterance content,
wherein the learning unit modifies the model using the correction information for the text information of the recognition result of the utterance content acquired by the second acquisition unit.

A voice recognition method executed by a voice recognition device, the method comprising:
a step of acquiring utterance information including information indicating the movement of a user's mouth when the user speaks in a whisper and voice information of the user; and
a step of, using a model created by learning from the acquired utterance information and the utterance content indicated by that utterance information, taking utterance information to be recognized as input and outputting a recognition result of the utterance content indicated by that utterance information.

A voice recognition program that causes a computer to execute:
a step of acquiring utterance information including information indicating the movement of a user's mouth when the user speaks in a whisper and voice information of the user; and
a step of, using a model created by learning from the acquired utterance information and the utterance content indicated by that utterance information, taking utterance information to be recognized as input and outputting a recognition result of the utterance content indicated by that utterance information.