JP7400364B2

JP7400364B2 - Speech recognition system and information processing method

Info

Publication number: JP7400364B2
Application number: JP2019203340A
Authority: JP
Inventors: 将樹能勢
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2023-12-19
Anticipated expiration: 2039-11-08
Also published as: JP2021076715A

Description

The present invention relates to a speech recognition system and an information processing method.

Smart speakers and multilingual translation systems that use speech recognition are attracting attention. A smart speaker is a device that operates indoor appliances and provides information such as weather forecasts in response to voice commands. A multilingual translation system is a translation device that uses a smartphone, a dedicated terminal, or the like. In a multilingual translation system, for example, a person's voice is input to a microphone serving as a voice detection unit, the input voice is converted to text by speech recognition, the text is translated into a desired language by translation processing, and the result is output from a speaker. In addition, systems that generate records of conversations with customers at call centers and systems that automatically generate meeting minutes are being put into practical use, and these systems also make use of speech recognition technology.

Patent Document 1 discloses a technique that raises the recognition rate of speech recognition by reducing misrecognition caused by noise other than human speech. The technique disclosed in Patent Document 1 acquires, with a microphone, the sound generated during image capture by a camera, detects an utterance section in which a person is speaking based on information in the image captured by the camera, and raises the sensitivity of speech recognition for the person's voice in that utterance section.

However, in a situation where, for example, a single microphone is installed at the center of a table and people are present around the table, the distance from each person's mouth to the microphone is relatively long. Consequently, indistinct speech with a low S/N ratio is input, and informal utterances that deviate from grammar occur frequently. The conventional technique disclosed in Patent Document 1 does not assume speech recognition under such conditions, so there is room for improvement in raising speech recognition accuracy.

In view of the above problem, the present invention can raise speech recognition accuracy even when the distance from the mouth to the microphone is long.

In view of the above problem, a speech recognition system according to the present invention comprises an audio acquisition device and a server. The audio acquisition device comprises a voice detection unit that detects a plurality of voices, and a synchronization control unit that performs control to synchronize audio data, which is data indicating the content of the plurality of voices. The server performs machine learning of a speech recognition engine on the plurality of synchronized audio data using a shared teacher label, and recognizes speech.
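The idea of sharing one teacher label across synchronized channels can be illustrated with a minimal sketch. The names `TrainingExample` and `build_shared_label_examples` are hypothetical, and treating "synchronized" as "equal length after alignment" is an assumption made only for this illustration; the patent text here does not specify the data layout.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TrainingExample:
    """One training pair: a single channel's samples plus its teacher label."""
    channel: int
    samples: List[float]
    label: str


def build_shared_label_examples(
    channels: Sequence[Sequence[float]], label: str
) -> List[TrainingExample]:
    """Pair every synchronized channel with the same teacher label.

    Assumption: synchronized channels cover the same time span and
    therefore hold the same number of samples.
    """
    if len({len(c) for c in channels}) > 1:
        raise ValueError("channels must be synchronized to the same length")
    return [TrainingExample(i, list(c), label) for i, c in enumerate(channels)]


# Two microphones heard the same utterance; one transcript labels both.
examples = build_shared_label_examples([[0.1, 0.2], [0.0, 0.3]], "hello")
```

Under these assumptions, one transcription effort yields as many labeled examples as there are microphone channels.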

According to the present invention, speech recognition accuracy can be raised even when the distance from the mouth to the microphone is long.

A diagram showing a configuration example of a speech recognition system according to an embodiment of the present invention
An external view of the audio acquisition device
A hardware configuration diagram of the audio acquisition device
A functional block diagram of the audio acquisition device
A hardware configuration diagram of the cloud server
A functional block diagram of the cloud server
A diagram for explaining the operations of the speech recognizer, the machine lip reader, and the integrator
A diagram for explaining the image features used for machine lip reading
A diagram showing a first configuration example of the camera
A diagram showing a second configuration example of the camera
A flowchart for explaining the operation of the speech recognition system
An external view of a housing provided with a mute button
A diagram showing an example of images before and after the mute button is pressed
A diagram schematically showing a state in which a plurality of microphones are arranged
A diagram showing an example of audio data acquired by each of a plurality of microphones
A diagram showing an example of a teacher label
A diagram for explaining the operation of the integrator

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram showing a configuration example of a speech recognition system according to an embodiment of the present invention. FIG. 1 shows a table 110 installed in a conference room 100, a plurality of people (conference attendees 31 to 36) around the table 110, and a speech recognition system 300.

The speech recognition system 300 is configured to raise speech recognition accuracy by acquiring the voices of one or more of the conference attendees 31 to 36 with microphones and using audio data, which is data indicating the content of the acquired voices, for machine learning. The speech recognition system 300 is also configured to raise speech recognition accuracy by imaging one or more of the conference attendees 31 to 36 with cameras and using image data, which is data indicating the content of the captured images, for machine learning. The speech recognition system 300 may instead be configured to collect only audio data to improve speech recognition accuracy. Collecting image data in addition to audio data, however, raises speech recognition accuracy further. The following describes a configuration example in which both audio data and image data are collected.

The speech recognition system 300 includes an audio acquisition device 1 installed at the center of the table 110, a whiteboard 120 installed between a wall of the conference room 100 and the table 110, and a cloud server 200. Audio data acquired by the audio acquisition device 1 is transmitted to the cloud server 200 via the whiteboard 120, and speech recognition processing is performed by a speech recognition engine or the like implemented on the cloud server 200. Text data obtained as a result of speech recognition is sent to the whiteboard 120, where it is displayed as subtitles. Alternatively, the text data is used to compile the utterances into minutes. Techniques for automatically converting remarks made at meetings, lectures, interviews, and the like into text by speech recognition processing, and techniques for creating minutes, are publicly known as disclosed in Non-Patent Document 1, so a detailed description is omitted.

The audio acquisition device 1 acquires the voices of the conference attendees 31 to 36 around the table 110. In addition to audio, the audio acquisition device 1 is configured to acquire images of the conference attendees 31 to 36. A configuration example of the audio acquisition device 1 is described with reference to FIGS. 2A to 2C.

FIG. 2A is an external view of the audio acquisition device. FIG. 2A shows the appearance of the audio acquisition device 1 together with the scene of the conference room 100 imaged by the audio acquisition device 1. The audio acquisition device 1 includes a housing section 2, a microphone 50 serving as a voice detection unit, and a camera 51 serving as an imaging unit. A multi-microphone capable of acquiring multiple channels of audio is used as the microphone 50. A multi-camera capable of acquiring multiple channels of images is used as the camera 51. The multi-camera is, for example, a combination of a plurality of imaging units each having an angle of view of 90° or more.

The housing section 2 includes a disc-shaped pedestal section 1a placed on the table 110, and a columnar extension section 1b that extends vertically from the pedestal section 1a and holds the plurality of microphones 50 and other components at a position away from the table 110. The housing section 2 also includes a disc-shaped unit installation section 1c, provided at the top of the extension section 1b, in which the multi-microphones and multi-cameras are arranged. The shape of the housing section 2 is not limited to the illustrated example; any structure that can hold at least one camera 51 and at least one microphone 50 suffices.

Of the plurality of microphones 50, one is provided on the top of the unit installation section 1c. The remaining microphones 50 are provided somewhere other than the top, for example on the side surface of the unit installation section 1c. The side surface is the part of the outer periphery of the unit installation section 1c that includes, for example, a virtual plane parallel to the horizontal plane orthogonal to the vertical direction. The microphones 50 are installed on the side surface spaced apart from one another in the circumferential direction. With this arrangement, even when the conference attendees 31 to 36 surround the table 110, individual microphones 50 face the respective attendees, so the distance from a microphone 50 to each attendee is shortened and clear speech with a high S/N ratio can be input.

FIG. 2B is a hardware configuration diagram of the audio acquisition device. The audio acquisition device 1 includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an input device 104, a communication interface 105, and a bus 106.

The CPU 101 executes programs to control the audio acquisition device 1 as a whole and to realize the functions described later. The ROM 102 stores various data, including the programs executed by the CPU 101. The RAM 103 provides a work area for the CPU 101. The input device 104 includes, in addition to the microphone 50 and camera 51 described above, a touch panel, a mouse, and the like for inputting information in response to human operation. The communication interface 105 connects the audio acquisition device 1 to a communication network 301, for example via the whiteboard 120, which is an example of an external device. The communication network 301 is a LAN (Local Area Network), the Internet, a mobile terminal network, or the like. The bus 106 interconnects the CPU 101, the ROM 102, the RAM 103, the input device 104, and the communication interface 105.

FIG. 2C is a functional block diagram of the audio acquisition device. The audio acquisition device 1 includes a start/end control unit 10, a synchronization control unit 11, a recording control unit 12, a recording unit 13, a mute control unit 14, and a communication control unit 15.

The start/end control unit 10 controls, for example, the start and end of recording by a plurality of microphones 50-1 to 50-n (n is an integer of 1 or more), and the start and end of imaging by a plurality of cameras 51-1 to 51-n (n is an integer of 1 or more).

The synchronization control unit 11 performs control to synchronize the plurality of audio data acquired by the one or more microphones 50-1 to 50-n, and control to synchronize the one or more image data captured by the one or more cameras 51. Details of the control performed by the synchronization control unit 11 are described later.

The recording control unit 12 controls the recording of the audio data and image data acquired by the microphone 50 and the camera 51 into the recording unit 13. The communication control unit 15 controls communication with external devices such as the whiteboard 120 and the cloud server 200. Communication control is, for example, control that transmits the plurality of audio data and image data controlled by the synchronization control unit 11 to the cloud server 200, either via the whiteboard 120 or directly.

Next, the configuration of the cloud server 200 is described with reference to FIGS. 3A and 3B. FIG. 3A is a hardware configuration diagram of the cloud server. The cloud server 200 includes a processor 210, a memory 220, and an input/output interface 230.

The processor 210 is a computing means composed of a microcomputer, a GPU (General Purpose Graphics Processing Unit), a system LSI (Large Scale Integration), or the like. The memory 220 is a storage means composed of RAM (Random Access Memory), ROM (Read Only Memory), or the like. The input/output interface 230 is an information input/output means through which the processor 210 exchanges information with the audio acquisition device 1. The processor 210, the memory 220, and the input/output interface 230 are connected to a bus 240 and can exchange information with one another via the bus 240. The bus 240 is connected to the communication network 301 shown in FIG. 1.

The cloud server 200 runs virtual machines by, for example, having the processor 210 install virtual machine software (a virtualization application) stored in the memory 220. The virtual machine software emulates separate hardware on a host OS (Operating System) so that separate OSs can be installed. This allows a plurality of virtual machines to run independently of one another on a single system. In this cloud environment, software that collects data from the audio acquisition device 1 (data collection software), software that analyzes the data (analysis software), and the like are built. Using this virtualization technology makes it possible to use resources efficiently, hold down the initial investment cost of hardware, and save power and space.

FIG. 3B is a functional block diagram of the cloud server. The cloud server 200 includes a speech recognition engine 201, a lip reading processing unit 202, and an integrator 203.

The speech recognition engine 201 includes an audio feature extraction unit 201a and a speech recognizer 201b. The lip reading processing unit 202 includes an image feature extraction unit 202a and a machine lip reader 202b.

Next, the operations of the speech recognizer 201b, the machine lip reader 202b, the integrator 203, and related components are described with reference to FIGS. 4 and 5.

FIG. 4 is a diagram for explaining the operations of the speech recognizer, the machine lip reader, and the integrator. The audio feature extraction unit 201a extracts, from the audio data supplied by the audio acquisition device 1, audio features, which are features used as input values for machine learning. For example, the audio feature extraction unit 201a receives the plurality of audio data acquired by the audio acquisition device 1, cuts each audio data into unit-time segments (frames), computes for each frame a spectral feature of the audio signal such as MFCC (Mel-Frequency Cepstrum Coefficients) or mel-cepstrum features, and normalizes the result.
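The frame-by-frame pipeline described above (cut into frames, compute a spectral feature per frame, normalize) can be sketched as follows. This is a simplified stand-in, not the patent's method: it computes a log power spectrum with a naive DFT instead of MFCC (no mel filter bank or cepstral step), and the function names, frame length, and hop size are all assumptions chosen for illustration.

```python
import cmath
import math
from typing import List


def split_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut the signal into fixed-length, possibly overlapping frames."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def log_power_spectrum(frame: List[float]) -> List[float]:
    """Hann-window one frame and return its log power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(acc) ** 2 + 1e-10))
    return spectrum


def normalize(features: List[List[float]]) -> List[List[float]]:
    """Mean/variance-normalize each feature dimension across all frames."""
    count, dims = len(features), len(features[0])
    means = [sum(f[d] for f in features) / count for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / count) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features]


def extract_features(samples: List[float], frame_len: int = 16, hop: int = 8) -> List[List[float]]:
    frames = split_frames(samples, frame_len, hop)
    if not frames:
        raise ValueError("signal is shorter than one frame")
    return normalize([log_power_spectrum(f) for f in frames])


signal = [math.sin(0.3 * i) * (1.0 + 0.05 * i) for i in range(64)]
features = extract_features(signal)
```

A production system would more likely rely on an FFT and an established MFCC implementation; the naive DFT here only keeps the sketch free of external dependencies.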

The speech recognizer 201b performs machine learning using the features extracted by the audio feature extraction unit 201a, and recognizes speech. The speech recognizer 201b is a classifier that classifies audio features; a DNN (Deep Neural Network) is one example of such a classifier. A DNN has an input layer, intermediate layers called hidden layers, and an output layer, and adopts a multilayer structure by increasing the number of intermediate layers. To recognize speech with a DNN, it is most effective to train the DNN by supervised learning using information called teacher labels or training data. Because a DNN requires high computing power, it is desirable to implement it on the cloud server 200; however, when the audio acquisition device 1 is equipped with a GPU or the like having high computing power, the DNN may be implemented on the audio acquisition device 1. Besides a DNN, methods such as SVM (Support Vector Machine) or SIFT (Scale-Invariant Feature Transform) may be used for the classifier.
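The layer structure described above (input layer, hidden layers, output layer) and the role of a teacher label in supervised learning can be sketched as a tiny forward pass. Everything here is an illustrative assumption: the layer sizes, ReLU activations, softmax output, cross-entropy loss, and random untrained weights are not taken from the patent, and no training step (backpropagation) is shown.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Create one fully connected layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(values: List[float]) -> List[float]:
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: List[float], layers: List[Layer]) -> List[float]:
    """Hidden layers use ReLU; the output layer yields class probabilities."""
    activation = list(x)
    for layer in layers[:-1]:
        activation = [max(0.0, v) for v in dense(activation, layer)]
    return softmax(dense(activation, layers[-1]))


def cross_entropy(probs: List[float], teacher_label: int) -> float:
    """Loss that supervised learning would minimize for this teacher label."""
    return -math.log(probs[teacher_label] + 1e-12)


rng = random.Random(0)
# 9 input features -> two hidden layers of 8 units -> 4 output classes
layers = [init_layer(9, 8, rng), init_layer(8, 8, rng), init_layer(8, 4, rng)]
probs = forward([0.1] * 9, layers)
loss = cross_entropy(probs, teacher_label=2)
```

Training would adjust the weights to reduce `loss` over many labeled examples; that step is omitted to keep the sketch short.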

Various DNNs are used for speech recognition, and the End-to-End model has come to prominence in recent years. Unlike the conventional method disclosed in Non-Patent Document 2, the End-to-End model does not divide processing into multiple functions such as an acoustic model, a language model, and a dictionary; it converts input speech directly into characters through a single neural network, and is also called a straight-through model. Because its structure is simple, the End-to-End model has advantages such as ease of implementation and fast response, but it requires a large amount of training data.

The image feature extraction unit 202a extracts, for example, from the image data supplied by the audio acquisition device 1, image features, which are features used as input values for machine learning. FIG. 5 shows an example of the image features used for machine lip reading.

FIG. 5 is a diagram for explaining the image features used for machine lip reading. First, the image feature extraction unit 202a recognizes, for example, the face of a conference attendee in the whole image captured by the camera 51. A general-purpose algorithm may be used for face recognition. Next, the image feature extraction unit 202a extracts the lips from the recognized face. The image feature extraction unit 202a then extracts, as features, the time-series movement of each of the plurality of points plotted on the extracted lip image as shown in FIG. 5. These features describe the mouth (lips) of a conference attendee imaged by the camera 51 for the purpose of machine lip reading. The machine lip reader 202b performs machine learning using these features. For example, in a noisy meeting, the machine lip reader 202b performs machine learning using the mouth features of each of the conference attendees. The method of extracting these features is publicly known as disclosed in Non-Patent Document 3, so a detailed description is omitted.
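One simple way to express "the time-series movement of each plotted point" is the frame-to-frame displacement of every lip landmark. The sketch below assumes that lip landmarks have already been located by some face and lip detector (not shown) and are given as (x, y) pairs per video frame; the function name and the choice of displacement as the feature are illustrative assumptions, not the method of Non-Patent Document 3.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def lip_motion_features(frames: Sequence[Sequence[Point]]) -> List[List[float]]:
    """Return, for each pair of consecutive frames, the flattened
    (dx, dy) displacement of every lip landmark."""
    features = []
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("every frame must contain the same landmarks")
        row: List[float] = []
        for (px, py), (cx, cy) in zip(prev, cur):
            row.extend([cx - px, cy - py])
        features.append(row)
    return features


# Three video frames, two landmarks each (e.g. upper and lower lip).
landmarks = [
    [(0, 0), (1, 0)],
    [(0, 1), (1, -1)],   # mouth opens vertically
    [(0, 1), (2, -1)],   # second landmark shifts sideways
]
motion = lip_motion_features(landmarks)
# motion == [[0, 1, 0, -1], [0, 0, 1, 0]]
```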

The integrator 203 fuses the result of machine lip reading by the machine lip reader 202b with the result of speech recognition by the speech recognizer 201b. A technique that uses moving images of the lips during utterance in addition to the speech recognition result of the speech recognizer 201b is called multimodal speech recognition. In multimodal speech recognition, the input moving images are converted into time-series image features, and these image features are fused with the audio features to generate audio-visual features. Speech recognition is then performed using the audio-visual features. Multimodal speech recognition is a useful means of raising speech recognition accuracy in meetings.
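A common and simple form of feature fusion is to concatenate the audio feature vector of each time step with the image feature vector closest in time. The sketch below assumes this concatenation scheme and a proportional index mapping between the two frame rates; the patent text here does not state how the integrator 203 actually combines the features, so `fuse_features` is purely hypothetical.

```python
from typing import List, Sequence


def fuse_features(
    audio: Sequence[Sequence[float]], image: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Concatenate each audio frame with the image frame at the same
    relative position in time (audio usually has more frames than video)."""
    if not audio or not image:
        raise ValueError("both feature sequences must be non-empty")
    fused = []
    for t, audio_vec in enumerate(audio):
        # Map audio frame index t onto the shorter or longer image sequence.
        j = min(len(image) - 1, t * len(image) // len(audio))
        fused.append(list(audio_vec) + list(image[j]))
    return fused


audio_feats = [[1.0], [2.0], [3.0], [4.0]]   # 4 audio frames
image_feats = [[10.0], [20.0]]               # 2 video frames over the same span
audio_visual = fuse_features(audio_feats, image_feats)
# audio_visual == [[1.0, 10.0], [2.0, 10.0], [3.0, 20.0], [4.0, 20.0]]
```

The fused vectors would then be the input to the recognizer in place of audio-only features.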

Next, configuration examples for improving the recognition accuracy of machine lip reading are described with reference to FIGS. 6A and 6B. FIG. 6A is a diagram showing a first configuration example of the camera. When the audio acquisition device 1 includes, for example, a camera 51 that is detachable from the housing section 2, the camera 51 removed from the housing section 2 can be installed on, for example, the whiteboard 120, as shown in FIG. 6A. As for the installation method, the camera 51 may be provided with a gripping means that clamps the whiteboard 120, or the whiteboard 120 and the camera 51 may each be provided with a fitting that mates with the other, and the camera 51 may be fixed to the whiteboard 120 by fitting them together. This configuration allows the inside of the conference room 100 to be imaged from a location other than the table 110. As a result, even if a conference attendee's face turns in another direction, the person's mouth can still be imaged, which raises the probability that machine lip reading succeeds.

FIG. 6B is a diagram showing a second configuration example of the camera. In FIG. 6B, a camera 51-1, a camera 51-2, and a camera 51-2 constituting the multi-camera are provided in the unit installation section 1c. In this case, the camera 51-1, the camera 51-2, and the camera 51-2 each capture an image in a different direction. Therefore, when a particular person speaks while a plurality of conference attendees are present around the microphone 50, the voice is detected by the microphone 50 and an image of the speaker can be captured by the multi-camera. The image of the person producing the voice can thus be combined with that voice for machine learning.

The audio acquisition device 1 may be configured so that its height is adjustable. For example, the unit installation section 1c of the audio acquisition device 1 is composed of two pipes of different diameters: the thinner pipe (inner tube) is inserted inside the thicker pipe (outer tube), and the outer tube can move vertically relative to the inner tube. When the table 110 is small, for example, the distance from the audio acquisition device 1 to the conference attendees tends to be short, so an attendee's face and lips may not fit within the angle of view of the camera 51. In that case, adjusting the height of the unit installation section 1c so that the attendee's face and lips fall within the angle of view of the camera 51 allows the image of the person producing the voice to be captured accurately and combined with the voice for machine learning.

Next, the machine learning operation of the speech recognition system 300 is described with reference to FIGS. 7 to 10. FIG. 7 is a flowchart for explaining the operation of the speech recognition system. FIG. 8A is an external view of a housing provided with a mute button. FIG. 8B is a diagram showing an example of images before and after the mute button is pressed. FIG. 9A is a diagram schematically showing a state in which a plurality of microphones are arranged. FIG. 9B is a diagram showing an example of audio data acquired by each of a plurality of microphones. FIG. 10 is a diagram showing an example of a teacher label.

When the audio acquisition device 1 starts up and audio recording by the microphone 50 and video recording by the camera 51 begin (step S1), audio and video recording continue until the mute button 20 shown in FIG. 8A is pressed (step S2, No).

The mute button 20 is, for example, a button for pausing the recording of utterances that contain confidential information, or for erasing already recorded utterances that contain confidential information by going back a certain period of time. The mute button 20 may pause or retroactively erase not only audio recordings but also video recordings.

The mute button 20 is provided, for example, on a housing connected to the audio acquisition device 1 via a cable, but it may instead be provided on the audio acquisition device 1 itself. The mute button 20 may have any form that is easy for a person to operate or that makes it easy to tell whether recording is stopped; it may be a push button or a dial. A push-button example is described here. An LED may also be provided beside the mute button 20 so that the LED is lit during audio and video recording and off otherwise, making the data acquisition status easy to see.

When the mute button 20 is pressed (step S2, Yes), audio and video recording are paused (opt-out) (step S3). For example, when the mute button 20 is pressed as a conference attendee begins to speak about confidential information, the mute control unit 14 generates a recording stop command and inputs it to the start/end control unit 10. On receiving the recording stop command, the start/end control unit 10 stops the recording of the confidential information by stopping the transmission of audio data from the microphone 50 to the recording control unit 12. As a result, highly confidential audio data is not recorded, and leakage of confidential information can be effectively prevented.

When the recording stop command is input, the start/end control unit 10 may stop transmitting the imaging data to the recording control unit 12 along with the audio data. With this configuration, highly confidential imaging data is not recorded either, so leakage of confidential information can be prevented even more effectively.

The mute control unit 14 may also be configured as follows. For example, when the mute button 20 is pressed after a conference attendee has started speaking about confidential information, the mute control unit 14 generates an erasure command for erasing the audio data recorded between the time the mute button 20 was pressed and a time a preset predetermined period (for example, several seconds to several tens of seconds) earlier, and inputs the command to the recording control unit 12.

On receiving the erasure command, the recording control unit 12 erases, from the audio data recorded in chronological order in the recording unit 13, the audio data corresponding to the predetermined period. At the same time as it generates the erasure command, the mute control unit 14 generates a recording stop command and inputs it to the start/end control unit 10, thereby stopping the transmission of audio data to the recording control unit 12. Thus, for example, even if highly confidential audio data has been temporarily recorded, the confidential information can be erased on the spot. Even when the recorded audio is not confidential, a portion that is unnecessary for purposes such as automatic minutes creation can be erased, which reduces the processing load on the cloud server 200.

When the erasure command is input, the recording control unit 12 may erase not only the audio data but also the imaging data corresponding to the predetermined period from the recording control unit 12. With this configuration, even if highly confidential audio data and imaging data have been temporarily recorded, the confidential information can be erased on the spot, and leakage of confidential information can be prevented even more effectively. The resources of the recording unit 13 can also be used effectively. In addition, a large amount of audio data and imaging data for machine learning, which contributes most to improving the performance of the speech recognition engine 201, can be acquired while confidentiality is maintained.
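The retroactive erasure described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Chunk` and `RecordingUnit` names, the timestamp-in-seconds representation, and the rule that a chunk is erased when its timestamp falls at or after the cutoff are all assumptions made for illustration. The sketch keeps chunks in chronological order, and a mute press both pauses recording and drops the chunks captured during the look-back period.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Chunk:
    timestamp: float  # capture time in seconds (assumed representation)
    payload: bytes    # audio samples or an image frame


@dataclass
class RecordingUnit:
    """Hypothetical stand-in for recording unit 13: chunks kept in time order."""
    chunks: deque = field(default_factory=deque)
    paused: bool = False

    def record(self, chunk: Chunk) -> bool:
        """Store a chunk unless recording is paused; report whether it was stored."""
        if self.paused:
            return False
        self.chunks.append(chunk)
        return True

    def mute(self, now: float, lookback_s: float = 0.0) -> int:
        """Pause recording and erase chunks from the last lookback_s seconds.

        Returns the number of erased chunks.
        """
        self.paused = True
        cutoff = now - lookback_s
        erased = 0
        # Chunks are in chronological order, so only the tail needs checking.
        while self.chunks and self.chunks[-1].timestamp >= cutoff:
            self.chunks.pop()
            erased += 1
        return erased

    def unmute(self) -> None:
        """Resume recording (second press of the mute button)."""
        self.paused = False
```

Calling `mute(now, lookback_s=0.0)` corresponds to a plain pause, while a positive `lookback_s` corresponds to the variant in which already-recorded data is erased. The same structure could hold imaging data if frames are stored as chunks.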

When the mute button 20 is pressed, the mute control unit 14 may also be configured to hide the in-conference image that was being displayed on the display of a video conference system, as shown for example in FIG. 8B. With this configuration, even while confidential information is being discussed, its content can be prevented from leaking to the outside. When the mute button 20 is pressed again, audio and video recording resume, and the in-conference image is displayed again on the display of the video conference system.

The mute control unit 14 may be configured so that the function of erasing part of the audio data and imaging data can be selected as enabled or disabled (step S4). For example, if the function is disabled (step S4, No), the process of step S6 is executed. If the function is enabled (step S4, Yes), the process of step S5, namely data deletion (data erasure), is executed.

In step S6, the synchronization control unit 11 performs control to synchronize the audio data detected by each of the plurality of voice detection units. The process of step S6 may instead be executed between step S1 and step S2. The synchronization control method used by the synchronization control unit 11 is described in detail with reference to FIGS. 9A and 9B.

FIG. 9A schematically shows a state in which a plurality of microphones are arranged. The reference signs (1) to (6) in FIG. 9A denote the first microphone (1), second microphone (2), third microphone (3), fourth microphone (4), fifth microphone (5), and sixth microphone (6). These microphones differ from one another in position and orientation. Because the microphones are arranged centered on the table in the conference room, the distance from the conference attendees seated around the table to each microphone is relatively long.

FIG. 9B shows an example of the audio data acquired by each of the plurality of microphones. It shows audio data representing a specific person's utterance as detected by the second microphone (2), the third microphone (3), and the fourth microphone (4) among the microphones shown in FIG. 9A. Although these audio data represent the same utterance, their waveforms differ slightly from one another. The first cause is that the microphones differ in position and orientation. The second cause is that, because the distance from the conference attendees to each microphone is relatively long, a voice may reach a microphone directly or only after being reflected off a wall of the conference room 100, so the reverberation of the sound arriving at each microphone differs.

Therefore, for example, the sound pressure level of the feature point in the audio acquired by the second microphone (2) may differ from the sound pressure level of that feature point in the audio acquired by the third microphone (3).

Consequently, even for a voice uttered by the same person, the waveforms of the audio data detected by the microphones differ slightly, as shown in FIG. 9B. The synchronization control unit 11 aligns the acquisition timing of these audio data whose waveforms differ slightly.

The synchronization control unit 11 performs the same processing between the third microphone (3) and the fourth microphone (4). As a result, the timing of a specific feature point can be matched across the microphones, and the audio can be input to the cloud server 200 with the timings at which the feature points were extracted aligned. This efficiently improves the accuracy of speech recognition.
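As a rough illustration of matching a feature point across microphones, the following Python sketch trims each channel so that its feature point falls on the same sample index. The description does not state how the feature point is chosen; taking it to be the largest-magnitude sample is an assumption made here, and the function names `feature_point` and `synchronize` are hypothetical.

```python
from typing import Dict, List


def feature_point(samples: List[float]) -> int:
    """Index of the largest-magnitude sample (assumed definition of the feature point).

    Expects a non-empty list of samples.
    """
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


def synchronize(channels: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Trim every channel so that its feature point lands on the same index.

    Expects at least one channel.
    """
    peaks = {name: feature_point(s) for name, s in channels.items()}
    earliest = min(peaks.values())
    # Drop leading samples so each channel's peak moves to index `earliest`.
    shifted = {name: s[peaks[name] - earliest:] for name, s in channels.items()}
    # Cut all channels to a common length so they can be handled together.
    length = min(len(s) for s in shifted.values())
    return {name: s[:length] for name, s in shifted.items()}
```

A single peak is fragile in a reverberant room, so a practical system would more likely align on several feature points or on a correlation measure; the sketch only shows the idea of shifting channels onto a shared time base before they are sent for training.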

The synchronization control unit 11 may synchronize not only the audio acquired by the plurality of microphones 50 but also the imaging by one or more cameras 51, using the same method. This allows the teacher labels for machine learning of machine lip reading to be shared with those for speech recognition, so that machine learning for both speech recognition and machine lip reading can proceed efficiently and at low cost.

Next, teacher labels are described with reference to FIG. 10. As described above, when the plurality of microphones 50 differ in position or orientation, the waveforms of the audio data and the audio feature quantities corresponding to a specific person's utterance differ from one another. Even so, the content of the utterance is the same. Therefore, by performing machine learning (step S7) in which a single teacher label such as that shown in FIG. 10 is shared across the plurality of audio data corresponding to a specific utterance, the accuracy of speech recognition can be improved more efficiently than when machine learning is performed with one microphone 50 and one teacher label.

A teacher label is, for example, the utterance content (label) 「あらゆる現実をすべて自分のほうへねじ曲げたのだ。」 ("He twisted all of reality toward himself.") whose "utterance No." in FIG. 10 is "0001". FIG. 10 shows several other examples of teacher labels. "Camera ID" is a number identifying each of the plurality of cameras 51. "Speaker ID" is a number that individually identifies the person speaking. Other associated fields include "gender ID", "start time", which indicates when the utterance started, and "end time", which indicates when it ended. The teacher labels shown in FIG. 10 are stored in the memory of the cloud server 200 in association with the "utterance No.", "camera ID", "speaker ID", and so on. The content of the teacher labels is not limited to the illustrated example.
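One way to picture a teacher label record and its sharing across microphones is the Python sketch below. The field names follow the columns named in the text, but the field types, the use of seconds for the times, and the `build_training_pairs` helper are assumptions for illustration only. The source gives only the utterance number and text for this record, so any other field values supplied when constructing a label are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TeacherLabel:
    utterance_no: str
    camera_id: int
    speaker_id: int
    gender_id: int
    start_time: float  # assumed to be seconds from the start of recording
    end_time: float
    text: str          # the utterance content used as the training target


def build_training_pairs(
    label: TeacherLabel, channels: Dict[str, List[float]]
) -> List[Tuple[str, List[float], str]]:
    """Pair every microphone's recording of one utterance with the same label text."""
    return [(mic, samples, label.text) for mic, samples in sorted(channels.items())]
```

Each microphone contributes its own waveform while the target text is stored once, which is the sense in which a single teacher label is shared across the synchronized recordings of one utterance.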

Teacher labels can be obtained in two ways: a person listens to the audio data, transcribes and timestamps it by hand, and the result is used for training; or, from the text output by an existing speech recognition engine 201 (the speech recognition output), results with high confidence are extracted as teacher labels. The former method, in which all teacher labels are created by hand before machine learning, is called supervised learning; the latter, in which high-confidence outputs are used as teacher labels without human intervention, is called semi-supervised learning. When semi-supervised learning is performed in the speech recognition system 300 according to the present embodiment, one possible approach is to regard the confidence as high when the recognition results for the audio data acquired by the plurality of microphones 50 all have the same content, and to use that result as the teacher label.
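The agreement rule for semi-supervised learning can be sketched in a few lines of Python. The function name `agreed_label` and the choice to ignore surrounding whitespace when comparing transcripts are assumptions; the source only states that identical results across microphones are treated as high confidence.

```python
from typing import Dict, Optional


def agreed_label(hypotheses: Dict[str, str]) -> Optional[str]:
    """Return the transcript if every microphone's recognition result matches, else None."""
    texts = {text.strip() for text in hypotheses.values()}
    if len(texts) == 1:
        (text,) = texts
        # An empty transcript carries no information, so it is not used as a label.
        return text or None
    return None
```

An utterance for which this returns `None` would simply be left out of the automatically labeled set, or passed to manual transcription.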

FIG. 11 is a diagram for explaining the operation of the integrator. The vertical axis of FIG. 11 is confidence and the horizontal axis is time. The integrator 203 can combine outputs in various ways; one example is described here. In sections such as those indicated by (1) and (2) in the figure, where the speech cannot be recognized because the confidence of the output of the speech recognizer 201b (for example, text corresponding to the speech) is slightly or significantly below the threshold, the integrator 203 adopts the output of the machine lip reader 202b. When the confidence of speech recognition is at or above the threshold, the integrator 203 adopts the output of the speech recognizer 201b and does not adopt the output of the machine lip reader. This is because machine lip reading is, at present, inherently less accurate than speech recognition.
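The selection rule of this example can be written as a short Python sketch. The `Segment` structure, the per-segment granularity, and the `integrate` function are hypothetical; the source describes only the comparison of the speech recognition confidence with a threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speech_text: str          # output of the speech recognizer for this time section
    speech_confidence: float  # confidence of that output
    lip_text: str             # output of the machine lip reader for the same section


def integrate(segments: List[Segment], threshold: float) -> List[str]:
    """Use the speech recognizer's text at or above the threshold, else the lip reader's."""
    return [
        seg.speech_text if seg.speech_confidence >= threshold else seg.lip_text
        for seg in segments
    ]
```

The lip reader's own confidence is never consulted in this rule, which matches the stated reasoning that speech recognition is preferred whenever it is confident enough.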

As described above, the speech recognition device according to the present embodiment includes a voice detection unit that detects a plurality of voices and a synchronization control unit that performs control to synchronize audio data, which is data indicating the content of the plurality of voices, and is configured so that the plurality of synchronized audio data are used for machine learning of a speech recognition engine. With this configuration, even in settings such as conferences where speech recognition is difficult because the distance from the speaker's mouth to the microphone is long, audio data for machine learning, which contributes most to improving the performance of the speech recognition engine 201, can be acquired in synchronized form.

Beamforming, in which a microphone array detects the speaker and emphasizes that speaker's voice, is known for speech recognition. Converting conference audio into clear audio by beamforming requires complicated signal processing, which makes the speech recognition device very expensive. In addition, the audio is processed so that it no longer matches the conference's original sound pickup environment, so essential machine learning close to that original environment cannot be performed.

In contrast, the speech recognition device according to the present embodiment can synchronize a plurality of audio data and use them for machine learning without beamforming, so complicated signal processing is unnecessary. Speech recognition accuracy can therefore be greatly improved while an increase in the manufacturing cost of the audio acquisition device is suppressed.

Another approach prioritizes high speech recognition accuracy in conferences by having each of the conference attendees wear a headset, pin microphone, or the like. However, some people, women in particular, may dislike the unhygienic aspect of reusing shared headsets and pin microphones.

In contrast, the speech recognition device according to the present embodiment can ensure high speech recognition accuracy in conferences without headsets or the like, which reduces the bother of wearing one. It also avoids the discomfort of wearing a headset or the like.

The conventional technology disclosed in Patent Document 1 uses a humanoid robot housing, whose appearance can prevent conference participants from concentrating on the conference and, particularly in a small conference room, can feel oppressive.

In contrast, the speech recognition device according to the present embodiment has a simple external shape resembling a desk lamp, as shown in FIG. 2A, so it does not feel oppressive to conference participants and does not hinder their concentration on the conference.

The speech recognition device according to the present embodiment may also be configured to include a recording unit that records the plurality of audio data. With this configuration, even when the cloud server cannot receive audio data and the like in real time because of a communication failure or similar, machine learning using the audio data can continue by uploading the audio data stored in the recording unit to the cloud server.

The speech recognition device according to the present embodiment may also be configured to include a communication control unit that communicates the plurality of audio data with an external device. With this configuration, the plurality of audio data can be transmitted through the communication control unit to external devices such as a whiteboard or a cloud server, so machine learning using the audio data can be carried out on the external device without mounting an expensive processor such as a GPU in the speech recognition device. Therefore, even if the number of speech recognition devices produced increases, the rise in cost of the system as a whole can be suppressed, and speech recognition accuracy can be greatly improved by performing machine learning with large amounts of data on an external device such as a cloud server.

The speech recognition device according to the present embodiment may also be configured to include a mute control unit that pauses recording. Statements made in conferences often contain highly confidential information, so there are cases where recording is not permitted; with the mute control unit, recording can be stopped. Attendees can therefore take part in the conference without hesitating to speak, and as a result a large amount of useful audio data can be collected. This advances training on informal utterances that deviate from grammar and improves speech recognition accuracy.

The speech recognition device according to the present embodiment may also be configured so that the plurality of voice detection units are placed at mutually different positions or face in mutually different directions. With this configuration, multiple channels of audio can be acquired simultaneously. In addition, even when multiple conference attendees are seated around a table, the individual voice detection units are arranged to face the respective attendees, so the distance from a voice detection unit to each attendee can be shortened and clear audio with a high S/N ratio can be input.

The speech recognition device according to the present embodiment may also include an imaging unit and be configured so that imaging data, which is data captured by the imaging unit, is used for machine learning of a machine lip reader. With this configuration, the machine learning results of machine lip reading can be used to supplement the machine learning results of the speech recognition engine, further improving speech recognition accuracy in conferences.

The speech recognition device according to the present embodiment may also be configured to include an integrator that adopts or does not adopt the result of machine learning of machine lip reading depending on the result of machine learning of the speech recognition engine. With this configuration, the result of the speech recognition engine is prioritized when speech is recognized correctly, and the output of the machine lip reader can be adopted when it is not, so more accurate speech recognition can be achieved.

In the information processing method according to the present embodiment, the speech recognition device performs control to synchronize audio data, which is data indicating the content of a plurality of voices acquired by the voice detection unit, and the server uses the plurality of synchronized audio data for machine learning of the speech recognition engine.

The information processing program according to the present embodiment causes the speech recognition device to perform control to synchronize audio data, which is data indicating the content of a plurality of voices acquired by the voice detection unit, and causes the server to perform machine learning of the speech recognition engine using the plurality of synchronized audio data.

1: Audio acquisition device
1a: Pedestal section
1b: Extension section
1c: Unit installation section
2: Housing section
10: Start/end control unit
11: Synchronization control unit
12: Recording control unit
13: Recording unit
14: Mute control unit
15: Communication control unit
20: Mute button
31, 32, 33, 34, 35, 36: Conference attendees
50, 50-1, 50-n: Microphone
51, 51-1, 51-2, 51-n: Camera
100: Conference room
101: CPU
102: ROM
103: RAM
104: Input device
105: Communication interface
106: Bus
110: Table
120: Whiteboard
200: Cloud server
201: Speech recognition engine
201a: Speech feature extraction unit
201b: Speech recognizer
202: Lip reading processing unit
202a: Image feature extraction unit
202b: Machine lip reader
203: Integrator
210: Processor
220: Memory
230: Input/output interface
240: Bus
300: Speech recognition system
301: Communication network

Japanese Patent No. 5797009

"Minutes preparation support system" [retrieved October 9, 2019], Internet, URL: https://www.advanced-media.co.jp/products/service/private-enterprise-proceedings-preparation-support-system
"Basics of speech recognition" [retrieved October 9, 2019], Internet, URL: https://www.slideshare.net/akinoriito549/ss-23821600
"Comparison of lip reading performance due to differences in facial regions used for recognition" [retrieved October 9, 2019], Internet, URL: http://www.ii.is.kit.ac.jp/hai2011/proceedings/pdf/II-2B-6.pdf

Claims

1. A speech recognition system comprising an audio acquisition device and a server, wherein
the audio acquisition device includes:
a voice detection unit that detects a plurality of voices; and
a synchronization control unit that performs control to synchronize audio data, which is data indicating the content of the plurality of voices, and
the server performs machine learning of a speech recognition engine on the plurality of synchronized audio data using a shared teacher label, and recognizes speech.

2. The speech recognition system according to claim 1, wherein the audio acquisition device includes a recording unit that records the plurality of audio data.

3. The speech recognition system according to claim 1 or 2, wherein the audio acquisition device includes a communication control unit that communicates the plurality of audio data with an external device.

4. The speech recognition system according to any one of claims 1 to 3, wherein the audio acquisition device includes a mute control unit that temporarily stops recording of the plurality of voices.

5. The speech recognition system according to claim 4, wherein the mute control unit erases the plurality of recorded audio data back to a certain point in time.

6. The speech recognition system according to any one of claims 1 to 5, wherein a plurality of the voice detection units have mutually different arrangement positions or mutually different orientations.

7. The speech recognition system according to any one of claims 1 to 6, wherein the audio acquisition device includes an imaging unit, and imaging data, which is data captured by the imaging unit, is used for machine learning of machine lip reading.

8. The speech recognition system according to claim 7, wherein the server includes an integrator that adopts or does not adopt a result of the machine learning of the machine lip reading depending on a result of the machine learning of the speech recognition engine.

9. The speech recognition system according to any one of claims 1 to 8, further comprising an imaging unit that is detachable from a housing of the audio acquisition device.

10. The speech recognition system according to any one of claims 1 to 9, wherein
the audio acquisition device includes an imaging unit,
the synchronization control unit performs control to synchronize the audio data with imaging data, which is data indicating the content of an image captured by the imaging unit, and
the server performs machine learning of the speech recognition engine on the plurality of synchronized audio data and the imaging data using a shared teacher label, and recognizes speech.

11. The speech recognition system according to any one of claims 1 to 10, wherein the audio acquisition device includes an imaging unit, and machine learning is performed on the audio data and on imaging data, which is data captured by the imaging unit, using the same teacher label to recognize speech.

12. An information processing method performed by a speech recognition system comprising an audio acquisition device and a server, the method comprising:
the audio acquisition device detecting a plurality of voices with a voice detection unit;
the audio acquisition device performing control to synchronize audio data, which is data indicating the content of the plurality of voices acquired by the voice detection unit; and
the server performing machine learning of a speech recognition engine on the plurality of synchronized audio data using a shared teacher label, and recognizing speech.