JP2022120164A

JP2022120164A - Voice recognition system, voice recognition method, and voice processing device

Info

Publication number: JP2022120164A
Application number: JP2022097190A
Authority: JP
Inventors: 将樹能勢; Masaki Nose; 紘之長野; Hiroyuki Nagano; 悠斗後藤; Yuto Goto
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2018-11-21
Filing date: 2022-06-16
Publication date: 2022-08-17
Anticipated expiration: 2038-11-21
Also published as: JP2020086048A; JP7420166B2; JP7095569B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition system and a voice recognition method capable of suppressing an accuracy deterioration in voice recognition even in the case that user's voice is out of a directional range of a microphone.

SOLUTION: A voice recognition system provided by the present invention includes: acquisition means for acquiring voice data of voices uttered by a plurality of users within the same space from a plurality of sound collection devices respectively worn by the plurality of users; determination means for determining whether or not the volumes of the plurality of pieces of acquired voice data are less than a threshold; and voice recognition means for performing voice recognition processing together with prescribed processing that uses the plurality of pieces of voice data in combination when the determination means determines that the volumes of all of the plurality of pieces of voice data are less than the threshold.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition system and a speech recognition method.

In recent years, speech recognition technology has come into wide use, for example to accept voice commands on AI (artificial intelligence) speakers and to record conversations with customers at call centers.

In these applications, the speaking user's mouth is close to the microphone that picks up the voice. The signal-to-noise (S/N) ratio, which indicates the ratio of voice to noise, is therefore high, and speech recognition can be performed accurately.

In other cases, such as preparing the minutes of a meeting, a table microphone is placed at the center of the conference table to pick up the utterances of all participants together. The speaker is then several tens of centimeters or more away from the microphone, so the S/N ratio is low and speech recognition accuracy deteriorates.

Patent Literature 1 discloses an example of a technique that addresses this problem. In that technique, each user participating in a conference wears a microphone that has directivity toward the user's mouth. The voice data picked up by these microphones is then selected as appropriate and subjected to speech recognition processing. Speech recognition can thus be performed without the speaker being far from the microphone.

However, the technique disclosed in Patent Literature 1 does not consider the case where the user speaks in a direction different from that of the microphone, so that the user's voice falls outside the directional range of the microphone. In that case, the user's voice is picked up buried in other users' voices and noise, and speech recognition accuracy deteriorates.

The present invention has been made in view of these circumstances, and its object is to provide a speech recognition system and a speech recognition method capable of suppressing deterioration in speech recognition accuracy even when the user's voice falls outside the directional range of the microphone.

To solve the problem described above and achieve the object, the speech recognition system provided by the present invention includes: acquisition means for acquiring voice data of voices uttered by a plurality of users in the same space from a plurality of sound collection devices respectively worn by the plurality of users; determination means for determining whether or not the volumes of the plurality of acquired voice data are less than a threshold; and speech recognition processing means for performing speech recognition processing together with predetermined processing that uses the plurality of voice data in combination when the determination means determines that the volumes of all of the plurality of voice data are less than the threshold.

According to the present invention, it is possible to provide a speech recognition system and a speech recognition method capable of suppressing deterioration in speech recognition accuracy even when the user's voice falls outside the directional range of the microphone.

FIG. 1 is a schematic overhead view of the overall configuration of a speech recognition system according to an embodiment of the present invention and of the space in which the users of the speech recognition system are present.
FIG. 2 is a schematic diagram showing an example of how the microphone of the embodiment is worn when it is realized as a neck-worn wearable microphone.
FIG. 3 is a timing chart showing an example of the utterances of each user.
FIG. 4 is a timing chart showing an example of the utterances of each user.
FIG. 5 is a block diagram showing the hardware configuration of each microphone, the conference terminal, and the speech recognition server included in the speech recognition system.
FIG. 6 is a functional block diagram showing, among the functional configurations of each microphone, the conference terminal, and the speech recognition server, the functional configuration for executing the multiple voice combination processing.
FIG. 7 is a schematic diagram showing the external configuration of the conference terminal and a display example.
FIG. 8 is a flowchart explaining the flow of the first multiple voice combination processing in the embodiment.
FIG. 9 is a flowchart explaining the flow of the second multiple voice combination processing in the embodiment.
FIG. 10 is a schematic diagram showing the external configuration of the conference terminal and a display example in a first modification.
FIG. 11 is a schematic diagram relating to image analysis in the first modification and a second modification.
FIG. 12 is a schematic diagram showing a display example on the conference terminal in the first modification.
FIG. 13 is a schematic diagram showing a configuration example of the speech recognition unit in the second modification.
FIG. 14 is a flowchart explaining the flow of processing in the second modification.

Embodiments of the present invention are described in detail below with reference to the drawings.

[System configuration]
FIG. 1 is a schematic overhead view of the overall configuration of a speech recognition system S according to this embodiment and of the space (here, a conference room as an example) in which the users of the speech recognition system S are present. As shown in FIG. 1, the speech recognition system S includes a plurality of microphones 10 (here, microphones 10A to 10F as an example), a conference terminal 20, and a speech recognition server 30.

A table is placed at the center of the conference room, which has windows and a door, and a plurality of users U participating in the conference (here, users UA to UF as an example) are positioned around the table. Each user U wears the microphone 10 whose reference sign ends with the same letter as the user's own reference sign. The number of users U, the number of microphones 10, and so on are merely examples and are not particularly limited.

Each microphone 10 functions as a sound collection device that picks up the voice of a user U. The shape of each microphone 10 is not particularly limited; for example, each microphone 10 can be realized as a wearable microphone worn by the user U as a neck-worn type or a badge type. Such a wearable microphone can be worn without concern even by a user U who finds headsets or clip-on (lavalier) microphones bothersome, or by a user U who dislikes reusing a microphone worn by someone else.

An example of how the microphone 10 is worn is described with reference to FIG. 2. FIG. 2 is a schematic diagram showing an example of wearing when the microphone 10 is realized as a neck-worn wearable microphone. In this example, the microphone 10 has directivity toward the mouth so that it intensively picks up sound from above the microphone 10, that is, the voice uttered from the mouth of the user U wearing it.

Therefore, as shown in FIG. 2(A), when the user U speaks while facing forward, the voice uttered by the user U can be picked up properly. On the other hand, as shown in FIG. 2(B), when the user U turns sideways or upward and speaks in a direction different from the directivity direction of the microphone, the voice of the user U falls outside the directional range of the microphone and is picked up buried in the voices of other users U and noise. In this embodiment, "multiple voice combination processing" is performed to suppress deterioration in speech recognition accuracy even when, as in FIG. 2(B), the voice of the user U falls outside the directional range of the microphone. The details of this multiple voice combination processing are described later.

Each microphone 10 creates digital voice data by converting the analog signal corresponding to the picked-up sound with an A/D conversion circuit. Each microphone 10 then transmits the created voice data to the conference terminal 20. The communication method is not particularly limited; for example, it can be realized by wireless communication that allows many-to-many connections, such as Bluetooth (registered trademark). Using a communication method that allows many-to-many connections makes it possible to collect the voices uttered by the users U concurrently.

The conference terminal 20 transmits the voice data received from each microphone 10 to the speech recognition server 30. The communication method is not particularly limited; for example, it can be realized by wired or wireless communication over a network such as the Internet or a LAN (Local Area Network). The conference terminal 20 may be realized by a relay device or the like having a communication function, or by a device such as an electronic information board on which the user U can enter characters and the like by input operations on the display. The following description takes as an example the case where the conference terminal 20 is realized by an electronic information board. An electronic information board is also called an interactive whiteboard (IWB) or an electronic blackboard.

The speech recognition server 30 is a server that performs speech recognition processing on the plurality of voice data received from the microphones 10. In the speech recognition server 30, a plurality of speech recognition engines, one for each of the plurality of voice data received from the microphones 10, operate in parallel. This makes it possible to perform speech recognition processing on the voice data of each user U in parallel and in real time. The result of the speech recognition processing is used, for example, by converting it into text and displaying it in real time on the display unit of the conference terminal 20 or the like, or by printing it on paper after the conference ends. The speech recognition server 30 can be realized, for example, by a cloud server provided on the cloud.
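The parallel operation of one recognition engine per microphone stream can be sketched as follows. This is only an illustration: `recognize` is a hypothetical placeholder for a real engine, and the microphone ids and sample values are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize(mic_id, samples):
    """Hypothetical stand-in for a speech recognition engine.

    A real engine would return recognized text; this placeholder only
    reports how many samples it received.
    """
    return f"{mic_id}: {len(samples)} samples"


def recognize_all(streams):
    """Run one recognizer per microphone stream in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        futures = {
            mic: pool.submit(recognize, mic, data)
            for mic, data in streams.items()
        }
        # Collect results once every engine has finished its stream.
        return {mic: fut.result() for mic, fut in futures.items()}


# Assumed example input: one short block of samples per microphone.
streams = {"10A": [0.1, 0.2, 0.1], "10B": [0.0, 0.05], "10C": [0.3]}
results = recognize_all(streams)
```

A thread pool is used here only because it is the simplest standard-library way to show concurrency; the document does not say how the engines are scheduled on the server.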

[Multiple voice combination processing]
The speech recognition system S configured as above performs the multiple voice combination processing mentioned earlier. Multiple voice combination processing is a series of processes that suppresses deterioration in speech recognition accuracy by using the voice data of a plurality of users U in combination.

Specifically, in the multiple voice combination processing, the speech recognition system S acquires voice data of voices uttered by a plurality of users U in the same space from the plurality of microphones 10 worn by the respective users U. The speech recognition system S also determines whether or not the volumes of the acquired voice data are less than a threshold. The volumes of the acquired voice data are less than the threshold when, for example, the voice of the user U falls outside the directional range of the microphone as shown in FIG. 2(B) described above.

When it is determined that the volumes of all of the plurality of voice data are less than the threshold, the speech recognition system S performs speech recognition processing together with predetermined processing that uses the plurality of voice data in combination. Two examples of this predetermined processing are the first multiple voice combination processing and the second multiple voice combination processing.
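The threshold decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: volume is taken to be the RMS level of a block of samples, and the threshold value, function names, and sample data are all hypothetical, since the document does not specify them.

```python
import math


def rms(samples):
    """Root-mean-square level of one block of samples (used here as 'volume')."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def select_processing(blocks, threshold):
    """Decide how one time block should be recognized.

    Returns ("individual", mics at or above the threshold) when at least
    one microphone is loud enough, otherwise ("combined", all mics),
    meaning the multiple voice combination processing should be applied.
    """
    volumes = {mic: rms(samples) for mic, samples in blocks.items()}
    loud = sorted(mic for mic, v in volumes.items() if v >= threshold)
    if loud:
        return "individual", loud
    return "combined", sorted(volumes)


THRESHOLD = 0.2  # assumed value, for illustration only

# User UA faces forward: microphone 10A is well above the threshold.
facing_front = {"10A": [0.5, -0.5, 0.5, -0.5], "10B": [0.05, -0.05, 0.05, -0.05]}
# User UA looks away: every microphone is below the threshold.
looking_away = {"10A": [0.08, -0.08, 0.08, -0.08], "10B": [0.05, -0.05, 0.05, -0.05]}

front_result = select_processing(facing_front, THRESHOLD)
away_result = select_processing(looking_away, THRESHOLD)
```

In the first case only the voice data of microphone 10A would be recognized; in the second case all voice data would be handed to the combination processing.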

(First multiple voice combination processing)
In the first multiple voice combination processing, the speech recognition system S sums the plurality of voice data and performs speech recognition processing on the summed voice data. Because speech recognition is performed on voice data whose S/N ratio has been improved by the summation, deterioration in speech recognition accuracy can be suppressed.

The first multiple voice combination processing is described with reference to FIGS. 3 and 4. FIGS. 3 and 4 are timing charts showing an example of the utterances of each user U.

In this example, as shown in FIG. 4, three users U (user UA, user UB, and user UC) are assumed to speak in turn at different timings (some of which partially overlap). Specifically, they are assumed to speak in the chronological order "user UA, user UB, user UC, user UB, user UA". However, user UA, who speaks last, is assumed to have turned sideways and fallen outside the directional range of the microphone, as shown in FIG. 2(B) described above.

Each of the microphones 10 worn by these three users U (microphone 10A, microphone 10B, and microphone 10C) picks up the voice of its own wearer the loudest, but also picks up the voices of the other users U at a low level. For example, between T1 and T2 in the figure, microphone 10A picks up the voice of user UA at a high level, while microphones 10B and 10C also pick up the voice of user UA at a low level. Although not shown in the figure, each microphone 10 in practice also picks up a small amount of noise in the conference room (for example, the operating sound of air conditioners and projectors).

In this situation, the present embodiment sets a threshold for the volume as described above, and if the volume of the voice data of at least one microphone 10 is equal to or higher than the threshold, speech recognition is performed on the voice data whose volume is equal to or higher than the threshold. That is, when at least one microphone 10 has properly picked up the voice of the user U wearing it, speech recognition is performed on that voice data. Speech recognition is not performed on voice data with a volume below the threshold picked up by the other microphones 10.

For example, during the period T1 to T2, speech recognition is performed on the voice data of user UA's voice picked up by microphone 10A. During the period T2 to T3, speech recognition is performed on the voice data of user UB's voice picked up by microphone 10B. During the period T3 to T4, speech recognition is performed on the voice data of user UC's voice picked up by microphone 10C. During the period T4 to T5, speech recognition is performed on the voice data of user UB's voice picked up by microphone 10B. All other voice data, whose volume is below the threshold, is excluded from speech recognition.

In the period T1 to T2, the voice of user UA picked up by microphone 10A and the voice of user UB picked up by microphone 10B partially overlap in time, but each is picked up by its own microphone 10 at a volume equal to or higher than the threshold. Therefore, during this overlap, both the voice data of user UA's voice and the voice data of user UB's voice are used for speech recognition.

In the present embodiment, the predetermined processing that uses a plurality of voice data in combination is applied during the period T6 to T7, when user UA speaks while facing sideways. During this period, microphones 10A, 10B, and 10C all pick up only sound with a volume below the threshold. The S/N ratio of the sound picked up by each microphone 10 is therefore low, and even if speech recognition were performed on each microphone's voice data as is, the voice of user UA would not be recognized properly.

In this way, when the user U speaks while facing sideways or upward, the voice of the user U falls outside the directional range of the microphone, and the voice data of every microphone 10 has a volume below the threshold, the predetermined processing that uses a plurality of voice data in combination is performed. Specifically, as shown in FIG. 4, the voice data picked up by the plurality of microphones 10 is summed to improve the S/N ratio. Speech recognition is then performed on the voice data in which the voice has been emphasized by the improved S/N ratio.

As a result, deterioration in speech recognition accuracy can be suppressed even when the voice of the user U falls outside the directional range of the microphone.
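Why summation can raise the S/N ratio is illustrated by the following sketch. It rests on simplifying assumptions that are not stated in the document: every channel is assumed to contain the same voice component at the same level and perfectly time-aligned, and the noise in each channel is assumed to be independent. Under those assumptions, summing N channels multiplies the voice power by N squared and the noise power by about N, so the S/N ratio improves by roughly 10·log10(N) dB. The signal, noise level, and function names are made up for the example.

```python
import math
import random


def snr_db(signal, noisy):
    """S/N ratio in dB of `noisy` relative to its known clean component `signal`."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((n - s) ** 2 for s, n in zip(signal, noisy))
    return 10 * math.log10(signal_power / noise_power)


def sum_channels(channels):
    """Sample-by-sample sum of equally long, time-aligned channels."""
    return [sum(values) for values in zip(*channels)]


random.seed(0)  # fixed seed so the illustration is reproducible
NUM_MICS = 3
NUM_SAMPLES = 4000

# Assumed weak voice component, identical in every channel.
clean = [0.05 * math.sin(2 * math.pi * 5 * i / 1000) for i in range(NUM_SAMPLES)]
# Each microphone adds its own independent noise.
channels = [
    [s + random.gauss(0, 0.05) for s in clean] for _ in range(NUM_MICS)
]

summed = sum_channels(channels)
single_snr = snr_db(clean, channels[0])
summed_snr = snr_db([NUM_MICS * s for s in clean], summed)
gain_db = summed_snr - single_snr  # close to 10*log10(3) under these assumptions
```

Real microphones receive the voice at different levels and delays, so the achievable gain would be smaller than in this idealized case unless the channels are aligned before summation.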

In this case, the user U who spoke may be estimated from a comparison of the volumes of the plurality of voice data. For example, when the volume of the voice data of microphone 10A is relatively higher than the volumes of the voice data of microphones 10B and 10C, it may be estimated that user UA, who corresponds to microphone 10A, has spoken. When estimation is difficult, however, the voice summation unit 230 described later may output the result with the speaker marked as unspecified, as shown in FIG. 4.
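The speaker estimation by volume comparison can be sketched as follows. The document only says that a "relatively" louder microphone indicates the speaker, so the ratio test and the `margin` parameter used here are assumptions chosen for illustration, as are the function name and sample volumes.

```python
def estimate_speaker(volumes, margin=1.5):
    """Estimate which microphone's wearer spoke from per-microphone volumes.

    Returns the microphone id whose volume is at least `margin` times the
    runner-up, or None when no microphone stands out (speaker unspecified).
    """
    if not volumes:
        return None
    ranked = sorted(volumes.items(), key=lambda item: item[1], reverse=True)
    best_mic, best = ranked[0]
    if best <= 0:
        return None  # silence on every channel
    if len(ranked) == 1:
        return best_mic
    second = ranked[1][1]
    if second == 0 or best / second >= margin:
        return best_mic
    return None


# Microphone 10A is clearly the loudest, so user UA is the estimate.
clear_case = estimate_speaker({"10A": 0.09, "10B": 0.04, "10C": 0.03})
# No microphone stands out, so the speaker is left unspecified.
unclear_case = estimate_speaker({"10A": 0.05, "10B": 0.05, "10C": 0.04})
```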

(Second multiple voice combination processing)
The first multiple voice combination processing described above can suppress deterioration in speech recognition accuracy, but summing the voice data requires strict timing control of the voice data acquired from each microphone 10 (for example, strict synchronization on the order of microseconds). A processing unit for performing the addition is also required. The second multiple voice combination processing can therefore be performed as a simpler alternative.

In the second multiple voice combination processing, as in the first, predetermined processing that uses a plurality of voice data in combination is performed. That is, when the user U speaks while facing sideways or upward, as in the period T6 to T7 shown in FIG. 3, so that the voice of the user U falls outside the directional range of the microphone and the voice data of every microphone 10 has a volume below the threshold, the predetermined processing that uses a plurality of voice data in combination is performed.

Specifically, the speech recognition system S performs speech recognition processing on each of the plurality of voice data and corrects the speech recognition result based on a comparison of the individual results. Because the correction takes into account the recognition results of all of the voice data rather than that of a single piece of voice data, deterioration in speech recognition accuracy can be suppressed.

As an example of the correction processing, when the speech recognition results for a plurality of voice data agree in a certain section (for example, a section corresponding to a phrase identified from the speech recognition results), the correction treats the shared result as correct.

In this case, the processing may be based on a so-called majority vote: for example, if two of the speech recognition results for three pieces of voice data are the same, the correction treats that result as correct.

Alternatively, when the results of the plurality of speech recognition processes all differ, the volumes of the plurality of voice data may be compared, and the correction may treat the speech recognition result for the voice data with the highest volume as correct.

As a result, the second multiple voice combination processing can also suppress deterioration in speech recognition accuracy even when the voice of the user U falls outside the directional range of the microphone.
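The correction rules of the second multiple voice combination processing (agreement, majority vote, and fallback to the loudest voice data) can be sketched for a single section as follows. The function name, the data layout, and the handling of tied votes (falling back to the loudest candidate) are assumptions made for the example; the document does not specify them.

```python
from collections import Counter


def correct_segment(candidates):
    """Pick one recognition result for a section.

    `candidates` is a list of (recognized_text, volume) pairs, one per
    microphone. A text shared by at least two microphones and by more
    microphones than any other text wins; otherwise the text from the
    loudest voice data is used.
    """
    if not candidates:
        return None
    counts = Counter(text for text, _ in candidates).most_common()
    top_text, top_count = counts[0]
    # The top text must not be tied with another text for first place.
    unique_top = len(counts) == 1 or counts[1][1] < top_count
    if top_count >= 2 and unique_top:
        return top_text
    # All results differ (or the vote is tied): trust the loudest channel.
    return max(candidates, key=lambda candidate: candidate[1])[0]


# Two of three microphones agree, so the shared result is adopted.
voted = correct_segment([("meeting", 0.05), ("meeting", 0.03), ("meting", 0.08)])
# All results differ, so the result of the loudest voice data is adopted.
loudest = correct_segment([("next", 0.02), ("text", 0.07), ("nest", 0.04)])
```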

[Hardware configuration]
Next, the hardware configuration of each device in this embodiment is described with reference to FIG. 5. FIG. 5 is a block diagram showing the hardware configuration of each microphone 10, the conference terminal 20, and the speech recognition server 30 included in the speech recognition system S.

The microphone 10 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a communication I/F (interface) 14, an operation unit 15, and a sound pickup unit 16. These units are connected by a bus so that they can communicate with one another.

The CPU 11 controls the microphone 10 as a whole. Specifically, the CPU 11 uses the RAM 13 as a work area to perform arithmetic processing based on the firmware, the OS (Operating System), and the various programs stored in the ROM 12 and the like. The CPU 11 then controls each piece of hardware included in the microphone 10 based on the results of this arithmetic processing. The various programs include, for example, a program for transmitting voice data to the conference terminal 20 in the multiple voice combination processing described above.

The ROM 12 stores the firmware, the OS, the various programs, and the various data used by these programs (for example, the voice data used in the multiple voice combination processing described above).

The RAM 13 functions as the work area of the CPU 11, as described above.

The communication I/F 14 is an interface through which the microphone 10 communicates with the other devices included in the speech recognition system S.

The operation unit 15 is realized by, for example, various buttons and receives operations by the user U. For example, the operation unit 15 receives an operation to switch the power of the microphone 10 on or off and an operation instructing the start of sound pickup. The operation unit 15 then outputs the content of the received operation to the CPU 11.

The sound pickup unit 16 includes a device that converts picked-up sound into an analog electrical signal and an A/D conversion circuit that converts this signal into digital form. The sound pickup unit 16 outputs the digitized voice data to the CPU 11.

会議端末２０は、ＣＰＵ２１、ＲＯＭ２２、ＲＡＭ２３、ＨＤＤ２４、通信Ｉ／Ｆ２５、操作部２６、表示部２７、及び撮像部２８を含む。これら各部は、バス接続により相互に通信可能に接続される。 The conference terminal 20 includes a CPU 21 , ROM 22 , RAM 23 , HDD 24 , communication I/F 25 , operation section 26 , display section 27 and imaging section 28 . These units are connected so as to be able to communicate with each other through a bus connection.

ＣＰＵ２１は、会議端末２０全体を制御する。具体的には、ＣＰＵ２１は、ＲＡＭ２３をワークエリアとして、ＲＯＭ２２やＨＤＤ２４等に格納されたファームウェアや、ＯＳや、各種のプログラムに基づいた演算処理を行う。そして、ＣＰＵ２１は、この演算処理の結果に基づいて、会議端末２０に含まれる各ハードウェアを制御する。ここで、各種のプログラムとは、例えば、上述した複数音声併用処理を実現するためのプログラムや、電子情報ボードの機能を実現するプログラムである。 The CPU 21 controls the entire conference terminal 20 . Specifically, the CPU 21 uses the RAM 23 as a work area to perform arithmetic processing based on firmware, OS, and various programs stored in the ROM 22, HDD 24, and the like. Then, the CPU 21 controls each hardware included in the conference terminal 20 based on the result of this arithmetic processing. Here, the various programs are, for example, a program for realizing the above-described multiple voice combined processing and a program for realizing the function of the electronic information board.

ＲＯＭ２２及びＨＤＤ２４は、ファームウェアや、ＯＳや、各種のプログラムや、これらのプログラムにおいて利用する各種のデータ（例えば、上述した複数音声併用処理や、電子情報ボードの機能において利用する各種のデータ）を記憶する。 The ROM 22 and the HDD 24 store firmware, an OS, various programs, and various data used by these programs (for example, various data used in the above-described multiple voice combined processing and in the functions of the electronic information board).

ＲＡＭ２３は、上述したように、ＣＰＵ２１のワークエリアとして機能する。 The RAM 23 functions as a work area for the CPU 21 as described above.

通信Ｉ／Ｆ２５は、会議端末２０が、音声認識システムＳに含まれる他の各装置と通信するためのインターフェースである。 The communication I/F 25 is an interface through which the conference terminal 20 communicates with the other devices included in the speech recognition system S.

操作部２６は、例えば、各種の釦等で実現され、ユーザＵの操作を受け付ける。例えば、操作部２６は、会議端末２０の電源のオンオフの切り替え操作や、収音の開始指示操作や、電子情報ボードの機能に関する操作を受け付ける。そして、操作部２６は、受け付けたユーザＵの操作の内容をＣＰＵ２１に対して出力する。 The operation unit 26 is implemented by, for example, various buttons, and receives operations from the user U. For example, the operation unit 26 receives an operation for switching the power of the conference terminal 20 on and off, an instruction operation for starting sound collection, and operations related to the functions of the electronic information board. The operation unit 26 then outputs the content of the received user U operation to the CPU 21.

表示部２７は、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）や有機ＥＬディスプレイ（Organic Electro Luminescence Display）等で実現され、ＣＰＵ２１から出力された所定の情報をユーザＵに対して表示する。表示部２７は、所定の情報として、例えば、音声認識結果をテキスト化した情報や、各種のユーザインタフェースをユーザＵに対して表示する。 The display unit 27 is implemented by a liquid crystal display (LCD), an organic EL display (organic electro luminescence display), or the like, and displays predetermined information output from the CPU 21 to the user U. The display unit 27 displays to the user U, for example, information obtained by converting speech recognition results into text and various user interfaces as predetermined information.

なお、操作部２６及び表示部２７を、電子情報ボード用のペンやユーザＵの手によるタッチ操作を受付可能な、タッチパネルにより一体として実現してもよい。 Note that the operation unit 26 and the display unit 27 may be realized integrally by a touch panel capable of accepting a touch operation by a pen for an electronic information board or by the user's U hand.

撮像部２８は、カメラを実現するための各種デバイスにより実現され、会議端末２０が設置された場所において、例えば、会議に参加しているユーザＵを撮像する。撮像部２８は、撮像により作成した画像データをＣＰＵ２１に対して出力する。 The image capturing unit 28 is realized by various devices for realizing a camera, and captures an image of, for example, the user U participating in the conference at the place where the conference terminal 20 is installed. The imaging unit 28 outputs image data created by imaging to the CPU 21 .

音声認識サーバ３０は、ＣＰＵ３１、ＲＯＭ３２、ＲＡＭ３３、ＨＤＤ３４、及び通信Ｉ／Ｆ３５を含む。これら各部は、バス接続により相互に通信可能に接続される。 The voice recognition server 30 includes a CPU 31, a ROM 32, a RAM 33, an HDD 34, and a communication I/F 35. These units are connected so as to be able to communicate with each other through a bus connection.

ＣＰＵ３１は、音声認識サーバ３０全体を制御する。具体的には、ＣＰＵ３１は、ＲＡＭ３３をワークエリアとして、ＲＯＭ３２やＨＤＤ３４等に格納されたファームウェアや、ＯＳや、各種のプログラムに基づいた演算処理を行う。そして、ＣＰＵ３１は、この演算処理の結果に基づいて、音声認識サーバ３０に含まれる各ハードウェアを制御する。ここで、各種のプログラムとは、例えば、上述した複数音声併用処理を実現するためのプログラムである。 The CPU 31 controls the voice recognition server 30 as a whole. Specifically, the CPU 31 uses the RAM 33 as a work area to perform arithmetic processing based on firmware, OS, and various programs stored in the ROM 32, HDD 34, and the like. Then, the CPU 31 controls each hardware included in the speech recognition server 30 based on the result of this arithmetic processing. Here, the various programs are, for example, programs for realizing the above-described multiple voice combined processing.

ＲＯＭ３２及びＨＤＤ３４は、ファームウェアや、ＯＳや、各種のプログラムや、これらのプログラムにおいて利用する各種のデータ（例えば、上述した複数音声併用処理において利用する各種のデータ）を記憶する。 The ROM 32 and HDD 34 store firmware, an OS, various programs, and various data used in these programs (for example, various data used in the above-described multiple voice combined processing).

ＲＡＭ３３は、上述したように、ＣＰＵ３１のワークエリアとして機能する。 The RAM 33 functions as a work area for the CPU 31 as described above.

通信Ｉ／Ｆ３５は、音声認識サーバ３０が、音声認識システムＳに含まれる他の各装置と通信するためのインターフェースである。 The communication I/F 35 is an interface through which the voice recognition server 30 communicates with the other devices included in the speech recognition system S.

［機能的構成］ [Functional configuration]
次に、図５を参照して上述した各ハードウェアによって実現される機能的構成について図６を参照して説明をする。図６は、音声認識システムＳに含まれる、各マイク１０、会議端末２０、及び音声認識サーバ３０の機能的構成のうち、複数音声併用処理を実行するための機能的構成を示す機能ブロック図である。 Next, the functional configuration realized by the hardware described above with reference to FIG. 5 will be described with reference to FIG. 6. FIG. 6 is a functional block diagram showing, among the functional configurations of the microphones 10, the conference terminal 20, and the voice recognition server 30 included in the speech recognition system S, the functional configuration for executing the multiple voice combined processing.

なお、これら機能ブロックは、上述した各マイク１０、会議端末２０、及び音声認識サーバ３０に含まれる各ＣＰＵが、複数音声併用処理を実現するためのプログラムに基づいて、各装置に含まれる各ハードウェアを制御することにより実現される。なお、以下で特に言及しない場合も含め、これら機能ブロック間では、複数音声併用処理を実現するために必要なデータを、適切なタイミングで適宜送受信する。 These functional blocks are realized when the CPUs included in the microphones 10, the conference terminal 20, and the voice recognition server 30 described above each control the hardware of their own device based on a program for realizing the multiple voice combined processing. Data necessary for realizing the multiple voice combined processing is transmitted and received between these functional blocks at appropriate timings as needed, including in cases not specifically mentioned below.

また、本実施形態では、各音声データに対して並列的に処理を行なうために、一部の機能ブロックが並列的に複数設けられている。ただし、並列的に複数設けられた同名の機能ブロックの機能はそれぞれ共通している。そのため、以下の説明では、各機能ブロック末尾のアルファベットを省略して説明する。 In this embodiment, some of the functional blocks are provided in plural, in parallel, so that each piece of voice data can be processed in parallel. Functional blocks that share a name and are provided in parallel all have the same function. The following description therefore omits the alphabetical suffix at the end of each functional block's reference sign.

まず、各マイク１０の機能ブロックについて説明をする。 First, functional blocks of each microphone 10 will be described.

複数音声併用処理が実行される場合、図６に示すように、各マイク１０において、音声収音部１１０と、音声送信部１２０とが機能する。 When the multiple voice combined processing is executed, the voice pickup unit 110 and the voice transmission unit 120 function in each microphone 10, as shown in FIG. 6.

音声収音部１１０は、各ユーザＵの音声を収音及びアナログ－デジタル変換することにより、デジタル信号の音声データを作成する。 The voice pickup unit 110 picks up the voice of each user U and performs analog-to-digital conversion on it to create voice data as a digital signal.

音声送信部１２０は、音声収音部１１０が作成した音声データを会議端末２０に対して送信する。 The voice transmission unit 120 transmits the voice data created by the voice pickup unit 110 to the conference terminal 20.

次に、会議端末２０の機能ブロックについて説明をする。 Next, functional blocks of the conference terminal 20 will be described.

複数音声併用処理が実行される場合、図６に示すように、会議端末２０において、音声取得部２１０と、音量判定部２２０と、音声合算部２３０と、文字列表示部２４０とが機能する。 When the multiple voice combined processing is executed, the voice acquisition unit 210, the volume determination unit 220, the voice summation unit 230, and the character string display unit 240 function in the conference terminal 20, as shown in FIG. 6.

音声取得部２１０は、音声送信部１２０が送信した音声データを受信することにより、音声データを取得する。 The voice acquisition unit 210 acquires voice data by receiving the voice data transmitted by the voice transmission unit 120 .

音量判定部２２０は、音声取得部２１０が受信した音声データの音量が閾値未満であるか否かを判定する。この閾値の値は、本実施形態を実装する環境等に応じて、予め設定しておくものとする。また、この閾値の値は、各音声データの音量の平均値等に基づいて適宜変更されてもよい。 Volume determination unit 220 determines whether the volume of the audio data received by audio acquisition unit 210 is less than a threshold. The value of this threshold shall be set in advance according to the environment etc. in which this embodiment is implemented. Also, the threshold value may be appropriately changed based on the average value of the volume of each audio data or the like.

なお、音量判定部２２０は、一時的に（例えば、数秒程度）ユーザＵの発話が途切れる場合も考慮して、この一時的に途切れる期間よりも長い、一定期間における音声データの音量の平均値が、閾値未満であるか否かを判定するとよい。 Because the user U's speech may be interrupted temporarily (for example, for a few seconds), the volume determination unit 220 preferably determines whether the average volume of the voice data over a certain period, one longer than such a temporary interruption, is less than the threshold.
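
The windowed volume check described above can be sketched as follows. This is only an illustrative sketch: the source does not say how volume is measured, so the use of RMS level, the frame layout, and the names `rms_volume` and `is_below_threshold` are all assumptions made for this example.

```python
import math
from typing import Sequence


def rms_volume(samples: Sequence[float]) -> float:
    """RMS level of one frame of samples (assumed normalised to [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_below_threshold(frames: Sequence[Sequence[float]], threshold: float) -> bool:
    """Return True if the mean volume over the whole window is below the threshold.

    `frames` is assumed to cover a period longer than a short pause in speech,
    so that a brief silence alone does not pull the average under the threshold.
    """
    if not frames:
        return True  # no audio at all is treated as "below threshold" (assumption)
    mean_volume = sum(rms_volume(f) for f in frames) / len(frames)
    return mean_volume < threshold
```

In this sketch, a window that contains one loud frame and one silent frame is judged on the average of the two, which is the point of averaging over a period longer than a pause.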

音声合算部２３０は、もっぱら第１複数音声併用処理を行なう場合に機能する。第１複数音声併用処理を行なう場合、音声合算部２３０は、音量判定部２２０により、複数の音声データの音量が何れも閾値未満であると判定された場合に、複数の音声データを合算する。そして、音声合算部２３０は、合算した音声データを音声認識サーバ３０に対して送信する。一方で、音声合算部２３０は、音量判定部２２０により、複数の音声データの内の何れかの音声データの音量が閾値以上であると判定された場合に、この閾値以上であると判定された音声データを音声認識サーバ３０に対して送信し、閾値未満であると判定された音声データは送信しない。 The voice summation unit 230 functions mainly when the first multiple voice combined processing is performed. In that processing, when the volume determination unit 220 determines that the volume of every one of the plurality of pieces of voice data is less than the threshold, the voice summation unit 230 sums the plurality of pieces of voice data and transmits the summed voice data to the voice recognition server 30. On the other hand, when the volume determination unit 220 determines that the volume of any of the plurality of pieces of voice data is equal to or greater than the threshold, the voice summation unit 230 transmits the voice data determined to be at or above the threshold to the voice recognition server 30 and does not transmit the voice data determined to be below the threshold.

なお、第２複数音声併用処理を行なう場合には、音声合算部２３０は、音量判定部２２０の判定結果に関わらず、複数の音声データの全てを音声認識サーバ３０に対して送信する。 When the second multiple voice combined processing is performed, the voice summation unit 230 transmits all of the plurality of pieces of voice data to the voice recognition server 30 regardless of the determination result of the volume determination unit 220.

文字列表示部２４０は、音声認識サーバ３０から受信した、音声認識結果を表示する。音声認識結果は、例えば、テキスト化した文字列として表示される。文字列表示部２４０による表示の一例を図７に示す。図７に示すように会議端末２０は、例えば、電子情報ボードとして実現される。この場合、操作部２６及び表示部２７はタッチパネルとして実現される。そして、表示部２７には処理の表示領域として、例えば、表示領域２７１が設けられる。文字列表示部２４０は、この表示領域２７１に、例えば、ユーザＵが発話した時系列に沿って文字列を表示する。 The character string display section 240 displays the speech recognition result received from the speech recognition server 30 . A speech recognition result is displayed as a text string, for example. An example of display by the character string display unit 240 is shown in FIG. As shown in FIG. 7, the conference terminal 20 is implemented as an electronic information board, for example. In this case, the operation unit 26 and the display unit 27 are implemented as touch panels. For example, a display area 271 is provided in the display unit 27 as a display area for processing. The character string display unit 240 displays character strings in the display area 271 in chronological order of the user U's utterances, for example.

この場合に、ユーザＵ（の装着しているマイク１０）を識別する情報（例えば、予め登録したユーザＵの名前やマイク１０の番号等）を、対応するテキストと共に表示するようにしてもよい。このように表示をする場合には、音声合算部２３０等と同様に、文字列表示部２４０も、複数のマイク１０に対応して複数設けるようにしてもよい。 In this case, information identifying the user U (or the microphone 10 that the user U is wearing), such as a pre-registered name of the user U or the number of the microphone 10, may be displayed together with the corresponding text. For such a display, a plurality of character string display units 240 may be provided, one for each of the plurality of microphones 10, in the same way as the voice summation unit 230 and the like.

このような表示を行うことにより、複数人の発話者が存在する会議シーンにおいて、誰がどのような発言を行ったかという発話履歴が表示される。 With such a display, an utterance history showing who said what is presented in a conference scene with a plurality of speakers.
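
A minimal sketch of how such an utterance history could be assembled for display is shown below. The `Utterance` record, its fields, and the `render_history` helper are hypothetical names introduced for this illustration; the source only states that character strings are shown in chronological order together with information identifying the user U or the microphone 10.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    start_time: float  # seconds since the start of the meeting (assumed unit)
    speaker: str       # pre-registered user name or microphone number
    text: str          # recognised character string


def render_history(utterances: Iterable[Utterance]) -> List[str]:
    """Return display lines in chronological order, each prefixed with its speaker."""
    ordered = sorted(utterances, key=lambda u: u.start_time)
    return [f"{u.speaker}: {u.text}" for u in ordered]
```

Sorting by start time keeps the history readable even if recognition results for different microphones arrive out of order.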

次に、音声認識サーバ３０の機能ブロックについて説明をする。 Next, functional blocks of the speech recognition server 30 will be described.

複数音声併用処理が実行される場合、図６に示すように、音声認識サーバ３０において、音声認識部３１０と、認識結果補正部３２０とが機能する。 When the multiple voice combined processing is executed, the voice recognition unit 310 and the recognition result correction unit 320 function in the voice recognition server 30, as shown in FIG. 6.

音声認識部３１０は、会議端末２０から受信した音声データに対して、音声認識処理を行なう。音声認識処理に用いる音声認識エンジンは特に限定されず、本実施形態特有の音声認識エンジンを利用してもよいし、汎用の音声認識エンジンを利用してもよい。 The voice recognition unit 310 performs voice recognition processing on voice data received from the conference terminal 20 . A speech recognition engine used for speech recognition processing is not particularly limited, and a speech recognition engine unique to this embodiment may be used, or a general-purpose speech recognition engine may be used.

認識結果補正部３２０は、もっぱら第２複数音声併用処理を行なう場合に機能する。第２複数音声併用処理を行なう場合、認識結果補正部３２０は、音量判定部２２０により、複数の音声データの音量が何れも閾値未満であると判定された場合に、複数の音声データの音声認識結果の比較に基づいて、音声認識結果を補正（アンサンブル）する。そして、認識結果補正部３２０は、補正した音声認識結果を会議端末２０に対して送信する。一方で、音声合算部２３０は、音量判定部２２０により、複数の音声データの内の何れかの音声データの音量が閾値以上であると判定された場合に、この閾値以上であると判定された音声データに関する音声認識結果を会議端末２０に対して送信する。音声認識結果は、例えば、テキスト化した文字列として送信される。 The recognition result correction unit 320 functions mainly when the second multiple voice combined processing is performed. In that processing, when the volume determination unit 220 determines that the volume of every one of the plurality of pieces of voice data is less than the threshold, the recognition result correction unit 320 corrects (ensembles) the speech recognition result based on a comparison of the speech recognition results for the plurality of pieces of voice data, and then transmits the corrected speech recognition result to the conference terminal 20. On the other hand, when the volume determination unit 220 determines that the volume of any of the plurality of pieces of voice data is equal to or greater than the threshold, the voice summation unit 230 transmits, to the conference terminal 20, the speech recognition result for the voice data determined to be at or above the threshold. The speech recognition result is transmitted as, for example, a character string converted to text.

なお、第１複数音声併用処理を行なう場合には、音量が閾値以上の音声データや、合算されてＳ／Ｎ比が向上した音声データといった、適切に音声認識できる音声データのみが音声認識の対象となっている。そのため、認識結果補正部３２０は、音量判定部２２０の判定結果に関わらず、音声認識部３１０による音声認識結果の全てを会議端末２０に対して送信する。 When the first multiple voice combined processing is performed, only voice data that can be recognized properly, such as voice data whose volume is equal to or greater than the threshold or voice data whose S/N ratio has been improved by summation, is subjected to speech recognition. The recognition result correction unit 320 therefore transmits all speech recognition results from the voice recognition unit 310 to the conference terminal 20 regardless of the determination result of the volume determination unit 220.

［動作］ [Operation]
次に、本実施形態における複数音声併用処理の流れについて説明をする。なお、下記の説明にて特に言及しない場合であっても、図６を参照して上述した各機能ブロックは、複数音声併用処理に必要となる処理を適宜実行する。なお、第１複数音声併用処理と、第２複数音声併用処理の何れが行われるかは、予めなされた設定や、ユーザＵによる選択操作に応じて決定される。 Next, the flow of the multiple voice combined processing in this embodiment will be described. Even where not specifically mentioned below, each functional block described above with reference to FIG. 6 executes the processing required for the multiple voice combined processing as appropriate. Whether the first or the second multiple voice combined processing is performed is determined by a setting made in advance or by a selection operation by the user U.

（第１複数音声併用処理） (First multiple voice combined processing)
図８は、第１複数音声併用処理の流れを説明するフローチャートである。第１複数音声併用処理は、例えば、マイク１０による収音が開始されて音声データの取得が開始された場合や、ユーザＵによる開始指示操作応じて実行される。 FIG. 8 is a flowchart illustrating the flow of the first multiple voice combined processing. The first multiple voice combined processing is executed, for example, when sound pickup by the microphone 10 starts and acquisition of voice data begins, or in response to a start instruction operation by the user U.

ステップＳ１１において、第１音声併用処理のループ処理が開始される。 In step S11, loop processing of the first voice combined processing is started.

ステップＳ１２において、会議端末２０の各音声取得部２１０は、各マイク１０から音声データを取得する。 In step S12 , each voice acquisition unit 210 of the conference terminal 20 acquires voice data from each microphone 10 .

ステップＳ１３において、会議端末２０の各音量判定部２２０は、一定期間における音声データの音量の平均値が、閾値未満であるか否かを判定する。 In step S13, each volume determining unit 220 of the conference terminal 20 determines whether or not the average value of volume of voice data in a certain period is less than a threshold.

ステップＳ１４において、会議端末２０の音声合算部２３０は、ステップＳ１３における判定結果に基づいて、一定期間における、全ての音声データの音量の平均値が、閾値未満であったか否かを判断する。全ての音声データの音量の平均値が、閾値未満であった場合は、ステップＳ１４においてＹｅｓと判定され、処理はステップＳ１５に進む。一方で、少なくとも何れかの音声データの音量の平均値が、閾値以上であった場合は、ステップＳ１４においてＮｏと判定され、処理はステップＳ１７に進む。 In step S14, the voice summation unit 230 of the conference terminal 20 determines, based on the determination results of step S13, whether the average volume over the certain period was less than the threshold for every piece of voice data. If it was, the determination in step S14 is Yes and the process proceeds to step S15. If the average volume of at least one piece of voice data was equal to or greater than the threshold, the determination in step S14 is No and the process proceeds to step S17.

ステップＳ１５において、会議端末２０の音声合算部２３０は、各マイク１０が収音した各音声データを選択する。この処理は、各音声データに対して並列的に行われる（ここでは、一例としてステップＳ１５Ａ～ステップＳ１５Ｃが行われる）。 In step S15 , the voice combining unit 230 of the conference terminal 20 selects each voice data picked up by each microphone 10 . This processing is performed in parallel for each audio data (here, steps S15A to S15C are performed as an example).

ステップＳ１６において、会議端末２０の音声合算部２３０は、ステップＳ１５において選択された各音声データを合算する。 In step S16, the voice summation unit 230 of the conference terminal 20 sums up each voice data selected in step S15.

一方で、ステップＳ１７において、会議端末２０の音声合算部２３０は、閾値以上の音量の音声データを選択する。 On the other hand, in step S17, the voice combining unit 230 of the conference terminal 20 selects voice data whose volume is equal to or greater than the threshold.

ステップＳ１８において、音声認識サーバ３０の音声認識部３１０は、ステップＳ１６において合算されてＳ／Ｎ比の向上した音声データ、あるいは、ステップＳ１７において選択された閾値以上の音量の音声データに対して音声認識処理を行う。 In step S18, the speech recognition unit 310 of the speech recognition server 30 performs speech recognition on the speech data whose S/N ratio has been improved by summing in step S16, or the speech data whose volume is equal to or greater than the threshold value selected in step S17. Perform recognition processing.

ステップＳ１９において、会議端末２０の文字列表示部２４０は、ステップＳ１８における音声認識結果をテキスト化した文字列を出力する。この場合、出力は、例えば、図７を参照して上述したような表示や、紙媒体への印刷等により行われる。 In step S19, the character string display unit 240 of the conference terminal 20 outputs a character string obtained by converting the speech recognition result in step S18 into text. In this case, the output is performed by, for example, the display as described above with reference to FIG. 7, printing on a paper medium, or the like.

ステップＳ２０において、第１複数音声併用処理のループ処理が終了する条件が満たされていない場合には、ステップＳ１１から上述のループ処理が繰り返される。一方で、第１複数音声併用処理のループ処理が終了する条件が満たされた場合には、ループ処理は終了する。終了条件は、例えば、マイク１０による収音が終了して音声データの取得が終了したことや、ユーザＵによる終了指示操作を受け付けたことである。 In step S20, if the condition for terminating the loop processing of the first plural-sound combination processing is not satisfied, the loop processing described above is repeated from step S11. On the other hand, when the condition for terminating the loop processing of the first multiple-voice combination processing is satisfied, the loop processing is terminated. The termination condition is, for example, that the pickup of sound by the microphone 10 has ended and the acquisition of voice data has ended, or that the end instruction operation by the user U has been received.

以上説明した第１複数音声併用処理により、ユーザＵの音声がマイクの指向範囲から外れた場合であっても、音声認識の精度低下を抑制することが可能となる。 With the first multiple voice combination processing described above, even when the voice of the user U is out of the directional range of the microphone, it is possible to suppress a decrease in accuracy of voice recognition.
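
The branch in steps S14 to S17 can be sketched as below. This is a sketch under stated assumptions, not the implementation of the embodiment: the streams are assumed to be time-aligned sample lists of normalised floats, "volume" is approximated by mean absolute amplitude, summation is a plain sample-by-sample sum, and the names `mean_abs_volume` and `select_for_recognition` are hypothetical.

```python
from typing import Dict, List


def mean_abs_volume(samples: List[float]) -> float:
    """Mean absolute amplitude, used here as a simple stand-in for 'volume'."""
    return sum(abs(s) for s in samples) / len(samples) if samples else 0.0


def select_for_recognition(
    streams: Dict[str, List[float]], threshold: float
) -> List[List[float]]:
    """Return the voice data to send to the recogniser for one window.

    - If at least one stream is at or above the threshold (S14: No), only
      those streams are returned (S17).
    - If every stream is below the threshold (S14: Yes), the streams are
      summed sample by sample into one signal (S15, S16).
    """
    if not streams:
        return []
    loud = [s for s in streams.values() if mean_abs_volume(s) >= threshold]
    if loud:
        return loud
    # All streams are quiet: sum them, truncating to the shortest stream.
    length = min(len(s) for s in streams.values())
    summed = [sum(s[i] for s in streams.values()) for i in range(length)]
    return [summed]
```

A real system would also need to align the microphones in time and guard against clipping after summation; both are outside what this sketch covers.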

（第２複数音声併用処理） (Second multiple voice combined processing)
図９は、第２複数音声併用処理の流れを説明するフローチャートである。第２複数音声併用処理は、例えば、マイク１０による収音が開始されて音声データの取得が開始された場合や、ユーザＵによる開始指示操作応じて実行される。 FIG. 9 is a flowchart illustrating the flow of the second multiple voice combined processing. The second multiple voice combined processing is executed, for example, when sound pickup by the microphone 10 starts and acquisition of voice data begins, or in response to a start instruction operation by the user U.

ステップＳ３１において、第２音声併用処理のループ処理が開始される。 In step S31, loop processing of the second voice combined processing is started.

ステップＳ３２において、会議端末２０の各音声取得部２１０は、各マイク１０から音声データを取得する。 In step S32 , each voice acquisition unit 210 of the conference terminal 20 acquires voice data from each microphone 10 .

ステップＳ３３において、会議端末２０の各音量判定部２２０は、一定期間における音声データの音量の平均値が、閾値未満であるか否かを判定する。 In step S33, each volume determining unit 220 of the conference terminal 20 determines whether or not the average value of volume of voice data in a certain period is less than a threshold.

ステップＳ３４において、会議端末２０の音声合算部２３０は、ステップＳ３３における判定結果に基づいて、一定期間における、全ての音声データの音量の平均値が、閾値未満であったか否かを判断する。全ての音声データの音量の平均値が、閾値未満であった場合は、ステップＳ３４においてＹｅｓと判定され、処理はステップＳ３５に進む。一方で、少なくとも何れかの音声データの音量の平均値が、閾値以上であった場合は、ステップＳ３４においてＮｏと判定され、処理はステップＳ３９に進む。 In step S34, the voice summation unit 230 of the conference terminal 20 determines, based on the determination results of step S33, whether the average volume over the certain period was less than the threshold for every piece of voice data. If it was, the determination in step S34 is Yes and the process proceeds to step S35. If the average volume of at least one piece of voice data was equal to or greater than the threshold, the determination in step S34 is No and the process proceeds to step S39.

ステップＳ３５において、会議端末２０の音声合算部２３０は、各マイク１０が収音した各音声データを選択する。この処理及び以後のステップＳ３６及びステップＳ３７の処理は、各音声データに対して並列的に行われる。ここでは、一例としてステップＳ３５Ａ～ステップＳ３５Ｃ、ステップＳ３６Ａ～ステップＳ３６Ｃ、及びステップＳ３７Ａ～ステップＳ３７Ｃが行われる。 In step S35 , the voice combining unit 230 of the conference terminal 20 selects each voice data picked up by each microphone 10 . This process and subsequent processes in steps S36 and S37 are performed in parallel for each audio data. Here, as an example, steps S35A to S35C, steps S36A to S36C, and steps S37A to S37C are performed.

ステップＳ３６において、音声認識サーバ３０の音声認識部３１０は、ステップＳ３５において選択された各音声データに対して音声認識処理を行う。 In step S36, the speech recognition unit 310 of the speech recognition server 30 performs speech recognition processing on each piece of speech data selected in step S35.

ステップＳ３７において、音声認識サーバ３０の音声認識部３１０は、ステップＳ３６における音声認識処理の結果を、テキスト化した文字列として出力する。 In step S37, the speech recognition unit 310 of the speech recognition server 30 outputs the result of the speech recognition processing in step S36 as a text string.

ステップＳ３８において、音声認識サーバ３０の認識結果補正部３２０は、複数の音声データそれぞれの音声認識結果の比較に基づいて、音声認識結果を補正する。 In step S38, the recognition result correction unit 320 of the speech recognition server 30 corrects the speech recognition result based on the comparison of the speech recognition results of each of the plurality of speech data.

一方で、ステップＳ３９において、会議端末２０の音声合算部２３０は、閾値以上の音量の音声データを選択する。 On the other hand, in step S39, the voice combining unit 230 of the conference terminal 20 selects voice data with a volume equal to or greater than the threshold.

ステップＳ４０において、音声認識サーバ３０の音声認識部３１０は、ステップＳ３９において選択された閾値以上の音量の音声データに対して音声認識処理を行う。 In step S40, the speech recognition unit 310 of the speech recognition server 30 performs speech recognition processing on the speech data whose volume is greater than or equal to the threshold value selected in step S39.

ステップＳ４１において、会議端末２０の文字列表示部２４０は、ステップＳ３８における補正後の文字列、又は、ステップＳ４０における音声認識結果をテキスト化した文字列を出力する。この場合、出力は、例えば、図７を参照して上述したような表示や、紙媒体への印刷等により行われる。 In step S41, the character string display unit 240 of the conference terminal 20 outputs the character string after correction in step S38 or the character string obtained by textualizing the voice recognition result in step S40. In this case, the output is performed by, for example, the display as described above with reference to FIG. 7, printing on a paper medium, or the like.

ステップＳ４２において、第２複数音声併用処理のループ処理が終了する条件が満たされていない場合には、ステップＳ３１から上述のループ処理が繰り返される。一方で、第２複数音声併用処理のループ処理が終了する条件が満たされた場合には、ループ処理は終了する。終了条件は、例えば、マイク１０による収音が終了して音声データの取得が終了したことや、ユーザＵによる終了指示操作を受け付けたことである。 In step S42, if the condition for terminating the loop processing of the second plural-sound combination processing is not satisfied, the loop processing described above is repeated from step S31. On the other hand, when the condition for terminating the loop processing of the second multiple-voice combination processing is satisfied, the loop processing is terminated. The termination condition is, for example, that the pickup of sound by the microphone 10 has ended and the acquisition of voice data has ended, or that the end instruction operation by the user U has been received.

以上説明した第２複数音声併用処理により、ユーザＵの音声がマイクの指向範囲から外れた場合であっても、音声認識の精度低下を抑制することが可能となる。 With the second multiple voice combination processing described above, even when the voice of the user U is out of the directional range of the microphone, it is possible to suppress a decrease in accuracy of voice recognition.
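
The correction in step S38 is described only as being based on a comparison of the recognition results. One possible reading, shown purely as an illustration, is a position-wise majority vote across the per-microphone hypotheses. The voting scheme, the whitespace tokenisation (which would not suit unsegmented Japanese text), the fallback rule, and the name `ensemble` are all assumptions of this sketch.

```python
from collections import Counter
from typing import List


def ensemble(hypotheses: List[str]) -> str:
    """Combine several recognition results for the same utterance by voting."""
    if not hypotheses:
        return ""
    tokenised = [h.split() for h in hypotheses]
    if len({len(t) for t in tokenised}) != 1:
        # Token counts differ, so positions cannot be compared directly:
        # fall back to the most frequent whole hypothesis.
        return Counter(hypotheses).most_common(1)[0][0]
    # Same token count everywhere: vote position by position.
    voted = [Counter(column).most_common(1)[0][0] for column in zip(*tokenised)]
    return " ".join(voted)
```

With three microphones, a word misrecognised by one of them is outvoted by the other two, which is the intuition behind combining results from several quiet recordings.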

［変形例］ [Modifications]
本発明は、上述の実施形態に限定されるものではなく、本発明の目的を達成できる範囲での変形、改良等は本発明に含まれるものである。 The present invention is not limited to the embodiment described above, and includes modifications, improvements, and the like within a scope in which the object of the present invention can be achieved.

［第１の変形例］ [First modification]
上述した実施形態では、音声認識処理に基づいて文字列を表示していた。これに限らず、更に他の情報を表示するようにしてもよい。例えば、文字列に対応する発話を行ったユーザＵの画像を対応付けて表示するようにしてもよい。 In the embodiment described above, character strings are displayed based on the speech recognition processing. The display is not limited to this, and other information may be displayed as well. For example, an image of the user U who made the utterance corresponding to a character string may be displayed in association with that character string.

本変形例について図１０、図１１、及び図１２を参照して説明をする。まず、表示の前提として、図１０に示すように、撮像部２８を、会議に参加している各ユーザＵを撮像可能な位置に配置する。例えば、電子情報ボードとして実現された会議端末２０の正面上部等に撮像部２８を配置する。これにより、会議端末２０に正対した位置で会議を行っている各ユーザＵを撮像することができる。 This modification will be described with reference to FIGS. 10, 11, and 12. First, as a precondition for the display, the imaging unit 28 is placed at a position from which it can capture each user U participating in the conference, as shown in FIG. 10. For example, the imaging unit 28 is placed at the upper front of the conference terminal 20 implemented as an electronic information board. This makes it possible to capture each user U holding the conference while facing the conference terminal 20.

次に、図１１に示すようにして、撮像部２８が撮像することにより作成される各ユーザＵが被写体となった画像（あるいは、動画）に対して、画像解析を行うことにより、各ユーザＵの顔等を検出し、この各ユーザＵの内の誰が発話しているかを特定する。この特定は、一般的に知られているアルゴリズムに基づいた画像解析による、顔検知、あるいは動作検知により実現することができる。これらの画像解析を行う機能は、例えば、会議端末２０の音声合算部２３０に実装する。そして、音声合算部２３０は、このように特定した発話中のユーザＵの顔画像と、同時刻に収音した音声とを紐づけて音声認識サーバ３０に送信する。 Next, as shown in FIG. 11, image analysis is performed on the image (or video) of the users U captured by the imaging unit 28 to detect the face and the like of each user U and to identify which of the users U is speaking. This identification can be achieved by face detection or motion detection through image analysis based on generally known algorithms. The function for performing this image analysis is implemented in, for example, the voice summation unit 230 of the conference terminal 20. The voice summation unit 230 then associates the face image of the speaking user U identified in this way with the voice picked up at the same time, and transmits them to the voice recognition server 30.

そして、音声認識サーバ３０の認識結果補正部３２０は、音声認識処理の結果である文字列と、紐付けられている発話したユーザＵの顔画像とを、会議端末２０の文字列表示部２４０に対して送信する。そして、会議端末２０の文字列表示部２４０は、表示領域２７１に、文字列と、発話したユーザＵの顔画像とを紐づけて表示する。例えば、図１２に示すようにして表示する。これにより、表示を参照したユーザＵの、発話内容の理解や臨場感が向上する。すなわち、音声認識システムＳの利便性が向上する。 Then, the recognition result correcting unit 320 of the speech recognition server 30 displays the character string, which is the result of the speech recognition processing, and the associated facial image of the user U who has spoken, on the character string display unit 240 of the conference terminal 20. send to. Then, the character string display unit 240 of the conference terminal 20 associates and displays the character string and the face image of the user U who has spoken in the display area 271 . For example, it is displayed as shown in FIG. This improves the understanding of the utterance content and the sense of presence of the user U who refers to the display. That is, the convenience of the speech recognition system S is improved.

なお、今回の会議に参加しているユーザＵの、画像や特徴量等のデータを予め登録しておくことにより、より高い精度で、参加しているユーザＵを特定することができる。 In addition, by registering in advance data such as images and feature amounts of the users U participating in the current conference, the participating users U can be identified with higher accuracy.

［第２の変形例］ [Second modification]
上述した第１の変形例のようにして、会議に参加しているユーザＵを特定した場合に、特定したユーザＵ個人に特化した音声認識のモデルに切り替えることで、音声認識の精度を向上させることができる。この場合に、仮にユーザＵ個人までは特定できなくても、男性か女性等の属性が分かれば、それぞれの音声認識のモデルを用いることでも、音声認識の精度を向上させることができる。 When a user U participating in the conference is identified as in the first modification described above, the accuracy of speech recognition can be improved by switching to a speech recognition model specialized for that individual user U. Even if the individual user U cannot be identified, the accuracy of speech recognition can still be improved by using a speech recognition model for each attribute, provided that an attribute such as male or female is known.

本変形例について図１３及び図１４を参照して説明する。本変形例では、音声認識サーバ３０の音声認識部３１０に、複数の機能ブロックを含ませる。具体的には、図１３に示すように、顔認証結果受信部３１１、第１音声認識モデル３１２、第２音声認識モデル３１３、第３音声認識モデル３１４、及び音声認識処理部３１５を含ませる。 This modification will be described with reference to FIGS. 13 and 14. In this modification, the voice recognition unit 310 of the voice recognition server 30 includes a plurality of functional blocks. Specifically, as shown in FIG. 13, it includes a face authentication result receiving unit 311, a first speech recognition model 312, a second speech recognition model 313, a third speech recognition model 314, and a speech recognition processing unit 315.

そして、図８におけるステップＳ１８において、図１４に示す各処理を行う。具体的には、ステップＳ１８１において、顔認証結果受信部３１１が、会議端末２０の音声合算部２３０による顔認証の結果を受信する。そして、受信した顔認証の結果に基づいて、顔認証結果受信部３１１が、論理的なスイッチを切り替えることにより、音声認識のモデルを切り替える。例えば、以下のようにして切り替える。 Then, in step S18 in FIG. 8, the processes shown in FIG. 14 are performed. Specifically, in step S181, the face authentication result receiving unit 311 receives the result of face authentication performed by the voice summation unit 230 of the conference terminal 20. Based on the received face authentication result, the face authentication result receiving unit 311 switches the speech recognition model by switching a logical switch. For example, the switching is performed as follows.

ステップＳ１８２において、顔認証の結果が「男性」であるか否かを判定する。「男性」である場合は、ステップＳ１８２においてＹｅｓと判定され、処理はステップＳ１８３に進む。ステップＳ１８３では、顔認証結果受信部３１１が男性用の音声認識モデルである第１音声認識モデル３１２に切り替えた上で、音声認識処理部３１５が音声認識処理を行う。一方で、「男性」でない場合は、ステップＳ１８２においてＮｏと判定され、処理はステップＳ１８４に進む。 In step S182, it is determined whether or not the face authentication result is "male". If the result is "male", a determination of Yes is made in step S182, and the process proceeds to step S183. In step S183, the face authentication result receiving unit 311 switches to the first speech recognition model 312, which is a speech recognition model for men, and then the speech recognition processing unit 315 performs speech recognition processing. On the other hand, if the result is not "male", a determination of No is made in step S182, and the process proceeds to step S184.

ステップＳ１８４において、顔認証の結果が「女性」であるか否かを判定する。「女性」である場合は、ステップＳ１８４においてＹｅｓと判定され、処理はステップＳ１８５に進む。ステップＳ１８５では、顔認証結果受信部３１１が女性用の音声認識モデルである第２音声認識モデル３１３に切り替えた上で、音声認識処理部３１５が音声認識処理を行う。一方で、「女性」でない場合は、ステップＳ１８４においてＮｏと判定され、処理はステップＳ１８６に進む。ステップＳ１８６では、顔認証結果受信部３１１が汎用の音声認識モデルである第３音声認識モデル３１４に切り替えた上で、音声認識処理部３１５が音声認識処理を行う。 In step S184, it is determined whether or not the face authentication result is "female". If the result is "female", a determination of Yes is made in step S184, and the process proceeds to step S185. In step S185, the face authentication result receiving unit 311 switches to the second speech recognition model 313, which is a speech recognition model for women, and then the speech recognition processing unit 315 performs speech recognition processing. On the other hand, if the result is not "female", a determination of No is made in step S184, and the process proceeds to step S186. In step S186, the face authentication result receiving unit 311 switches to the third speech recognition model 314, which is a general-purpose speech recognition model, and then the speech recognition processing unit 315 performs speech recognition processing.

このように、顔認証の結果に基づいて、適切な音声モデルを利用することにより、音声認識の精度を向上させることができる。 In this way, the accuracy of speech recognition can be improved by using an appropriate speech model based on the result of face recognition.
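The switching flow of steps S181 to S186 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the function names and the string labels "male" and "female" are hypothetical, and the three recognizers are placeholders standing in for the first, second, and third speech recognition models 312 to 314.

```python
from typing import Callable

Recognizer = Callable[[bytes], str]


# Placeholder recognizers (hypothetical); a real system would call trained models.
def recognize_male(audio: bytes) -> str:
    return "transcript from first model 312"


def recognize_female(audio: bytes) -> str:
    return "transcript from second model 313"


def recognize_general(audio: bytes) -> str:
    return "transcript from third model 314"


def select_model(face_result: str) -> Recognizer:
    """Choose a recognizer from the received face authentication result (S181)."""
    if face_result == "male":      # S182: Yes -> S183
        return recognize_male
    if face_result == "female":    # S184: Yes -> S185
        return recognize_female
    return recognize_general       # S184: No -> S186


def recognize(audio: bytes, face_result: str) -> str:
    """Run speech recognition with the model selected for this speaker."""
    return select_model(face_result)(audio)
```

Falling through to the general-purpose model for any unrecognized label mirrors the No branch of step S184, so an unexpected or missing attribute never blocks recognition.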

なお、ステップＳ１８のみならず、図９におけるステップＳ３６やステップＳ４０にも本変形例を適用し、上述したようにして、音声モデルの切り替えを行ってもよい。 Note that this modification may be applied not only to step S18 but also to steps S36 and S40 in FIG. 9 to switch voice models as described above.

［第３の変形例］
上述した実施形態における装置構成や、機能ブロックの切り分けは一例に過ぎず、これに限られない。例えば、会議端末２０に音声認識サーバ３０の機能を実装し、単一の装置として実現してもよい。あるいは、エッジデバイスである会議端末２０を単なる通信中継装置により実現し、音声認識サーバ３０に会議端末２０の機能を実装するようにしてもよい。この場合に、例えば、音声認識処理の結果は、会議端末２０以外の他の装置により表示されてもよい。 [Third Modification]
The device configuration and the division of functional blocks in the above-described embodiment are merely examples, and the invention is not limited to them. For example, the functions of the speech recognition server 30 may be implemented in the conference terminal 20 so that both are realized as a single device. Alternatively, the conference terminal 20, which is an edge device, may be realized as a simple communication relay device, with the functions of the conference terminal 20 implemented in the speech recognition server 30. In this case, for example, the result of speech recognition processing may be displayed by a device other than the conference terminal 20.

あるいは、会議端末２０や音声認識サーバ３０のそれぞれを、複数の装置により実現してもよい。例えば、音声認識サーバ３０を、複数のクラウドサーバが協働することにより実現してもよい。 Alternatively, each of the conference terminal 20 and the speech recognition server 30 may be realized by a plurality of devices. For example, the voice recognition server 30 may be realized by cooperation of a plurality of cloud servers.

つまり、上述した各装置が備える機能ブロック、あるいは代替となる機能ブロックを、音声認識システムＳに含まれる何れかの装置により実現するようにすればよい。換言すると、図６の機能的構成は例示に過ぎず、特に限定されない。すなわち、上述した一連の処理を全体として実行できる機能が音声認識システムＳに含まれる各装置に備えられていれば足り、この機能を実現するためにどのような機能ブロックを用いるのかは特に図６の例に限定されない。 In other words, the functional blocks provided in the respective devices described above, or alternative functional blocks, may be realized by any of the devices included in the speech recognition system S. Put differently, the functional configuration of FIG. 6 is merely an example and is not particularly limiting. That is, it is sufficient if the devices included in the speech recognition system S together have a function capable of executing the above-described series of processes as a whole, and the functional blocks used to realize this function are not limited to the example of FIG. 6.

なお、一例として上述した実施形態における機能的構成で機能ブロックを実現した場合、音声認識システムＳは、本発明における「音声認識システム」に相当する。またこの場合、マイク１０は、本発明における「収音機器」に相当する。更にこの場合、音声取得部２１０は、本発明における「取得手段」に相当する。更にこの場合、音量判定部２２０は、本発明における「判定手段」に相当する。更にこの場合、音声合算部２３０、音声認識部３１０、及び認識結果補正部３２０は、本発明における「音声認識処理手段」や「識別手段」に相当する。 Note that, as an example, when the functional blocks are realized by the functional configuration in the above-described embodiment, the speech recognition system S corresponds to the "speech recognition system" in the present invention. Further, in this case, the microphone 10 corresponds to the "sound pickup device" of the present invention. Furthermore, in this case, the voice acquisition unit 210 corresponds to the "acquisition means" in the present invention. Furthermore, in this case, the volume determination section 220 corresponds to the "determination means" in the present invention. Furthermore, in this case, the speech combining unit 230, the speech recognition unit 310, and the recognition result correction unit 320 correspond to the "speech recognition processing means" and the "identification means" in the present invention.

［他の変形例］
上述した一連の処理は、ハードウェアにより実行させることもできるし、ソフトウェアにより実行させることもできる。また、１つの機能ブロックは、ハードウェア単体で構成してもよいし、ソフトウェア単体で構成してもよいし、それらの組み合わせで構成してもよい。例えば、本実施形態における機能的構成は、演算処理を実行するプロセッサによって実現される。 [Other Modifications]
The series of processes described above can be executed by hardware or by software. Also, one functional block may be composed of hardware alone, software alone, or a combination thereof. For example, the functional configuration in this embodiment is implemented by a processor that executes arithmetic processing.

本実施形態に用いることが可能なプロセッサには、シングルプロセッサ、マルチプロセッサ及びマルチコアプロセッサ等の各種処理装置単体によって構成されるものを含む。また、他にも、これら各種処理装置と、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field‐Programmable Gate Array）等の処理回路とが組み合わせられたものを含む。 Processors that can be used in this embodiment include processors configured by various single processing units such as single processors, multiprocessors, and multicore processors. In addition, it also includes a combination of these various processing devices and a processing circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array).

一連の処理をソフトウェアにより実行させる場合には、そのソフトウェアを構成するプログラムが、コンピュータ等にネットワークや記録媒体からインストールされる。コンピュータは、専用のハードウェアに組み込まれているコンピュータであってもよい。また、コンピュータは、各種のプログラムをインストールすることで、各種の機能を実行することが可能なコンピュータ、例えば汎用のパーソナルコンピュータであってもよい。 When executing a series of processes by software, a program constituting the software is installed in a computer or the like from a network or a recording medium. The computer may be a computer built into dedicated hardware. Also, the computer may be a computer capable of executing various functions by installing various programs, such as a general-purpose personal computer.

このようなプログラムを含む記録媒体は、ユーザにプログラムを提供するために装置本体とは別に配布されるリムーバブルメディアにより構成されるだけでなく、装置本体に予め組み込まれた状態でユーザに提供される記録媒体等で構成される。リムーバブルメディアは、例えば、磁気ディスク（フロッピディスクを含む）、光ディスク、又は光磁気ディスク等により構成される。 A recording medium containing such a program may be a removable medium distributed separately from the device main body in order to provide the program to the user, or a recording medium or the like provided to the user in a state of being pre-installed in the device main body. The removable medium is, for example, a magnetic disk (including a floppy disk), an optical disk, or a magneto-optical disk.

光ディスクは、例えば、ＣＤ－ＲＯＭ（Compact Disk-Read Only Memory），ＤＶＤ（Digital Versatile Disk），Ｂｌｕ－ｒａｙ（登録商標）Ｄｉｓｃ（ブルーレイディスク）等により構成される。光磁気ディスクは、ＭＤ（Mini-Disk）等により構成される。また、装置本体に予め組み込まれた状態でユーザに提供される記録媒体は、例えば、プログラムが記録されている、図５の、ＲＯＭ１２、ＲＯＭ２２、及びＲＯＭ３２、あるいは、ＨＤＤ２４、及びＨＤＤ３４等で構成される。 The optical disk is, for example, a CD-ROM (Compact Disk-Read Only Memory), a DVD (Digital Versatile Disk), or a Blu-ray (registered trademark) Disc. The magneto-optical disk is, for example, an MD (Mini-Disk). Further, the recording medium provided to the user in a state of being pre-installed in the device main body is, for example, the ROM 12, ROM 22, and ROM 32, or the HDD 24 and HDD 34 in FIG. 5, in which the program is recorded.

なお、本明細書において、記録媒体に記録されるプログラムを記述するステップは、その順序に沿って時系列的に行われる処理はもちろん、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理をも含むものである。また、本明細書において、システムの用語は、複数の装置や複数の手段等より構成される全体的な装置を意味するものとする。 In this specification, the steps describing the program recorded on the recording medium include not only processes performed in time series in the described order, but also processes executed in parallel or individually that are not necessarily processed in time series. Further, in this specification, the term "system" means an overall apparatus composed of a plurality of devices, a plurality of means, and the like.

以上、本発明のいくつかの実施形態について説明したが、これらの実施形態は、例示に過ぎず、本発明の技術的範囲を限定するものではない。本発明はその他の様々な実施形態を取ることが可能であり、更に、本発明の要旨を逸脱しない範囲で、省略や置換等種々の変更を行うことができる。これら実施形態やその変形は、本明細書等に記載された発明の範囲や要旨に含まれると共に、特許請求の範囲に記載された発明とその均等の範囲に含まれる。 Although several embodiments of the present invention have been described above, these embodiments are merely examples and do not limit the technical scope of the present invention. The present invention can take various other embodiments, and various modifications such as omissions and substitutions can be made without departing from the gist of the present invention. These embodiments and their modifications are included in the scope and gist of the invention described in this specification and the like, and are included in the scope of the invention described in the claims and equivalents thereof.

Ｓ音声認識システム
１０マイク
２０会議端末
３０音声認識サーバ
１１、２１、３１、４１ＣＰＵ
１２、２２、３２、４２ＲＯＭ
１３、２３、３３、４３ＲＡＭ
１４、２５、３５通信Ｉ／Ｆ
１６収音部
２４、３４ＨＤＤ
２６操作部
２７表示部
２８撮像部
１１０音声収音部
１２０音声送信部
２１０音声取得部
２２０音量判定部
２３０音声合算部
２４０文字列表示部
３１０音声認識部
３１１顔認証結果受信部
３１２第１音声認識モデル
３１３第２音声認識モデル
３１４第３音声認識モデル
３１５音声認識処理部
３２０認識結果補正部
S speech recognition system
10 microphone
20 conference terminal
30 speech recognition server
11, 21, 31, 41 CPU
12, 22, 32, 42 ROM
13, 23, 33, 43 RAM
14, 25, 35 communication I/F
16 sound pickup unit
24, 34 HDD
26 operation unit
27 display unit
28 imaging unit
110 voice pickup unit
120 voice transmission unit
210 voice acquisition unit
220 volume determination unit
230 speech combining unit
240 character string display unit
310 speech recognition unit
311 face authentication result receiving unit
312 first speech recognition model
313 second speech recognition model
314 third speech recognition model
315 speech recognition processing unit
320 recognition result correction unit

特開２０１７－１６７３１８号公報JP 2017-167318 A

本発明は、音声認識システム、音声認識方法、及び音声処理装置に関する。
The present invention relates to a speech recognition system, a speech recognition method, and a speech processing device.

しかしながら、特許文献１に開示の技術では、ユーザが、マイクの方向と異なる方向に発話してしまい、ユーザの音声がマイクの指向範囲から外れる場合を考慮していなかった。ユーザの音声が、他のユーザの音声や雑音に埋もれて収音されてしまう場合、音声認識の精度が低下してしまう。
However, the technique disclosed in Patent Document 1 does not take into consideration the case where the user speaks in a direction different from the direction of the microphone and the user's voice goes out of the directional range of the microphone. If the user's voice is picked up while being buried in other users' voices and noise, the accuracy of voice recognition will be degraded.

本発明は、このような状況に鑑みてなされたものであり、音声認識の精度低下を抑制することが可能な、音声認識システム、音声認識方法及び音声処理装置を提供することを目的とする。
The present invention has been made in view of such circumstances, and an object thereof is to provide a speech recognition system, a speech recognition method, and a speech processing device capable of suppressing deterioration in the accuracy of speech recognition.

上述した課題を解決し、目的を達成するために、本発明は同一空間内にて複数のユーザが発話した音声の音声データを取得する取得手段と、前記取得した複数のユーザの音声データをそれぞれ音声認識する音声認識処理手段と、前記音声認識処理手段は、前記複数の音声データそれぞれの音声認識結果の比較に基づいて、音声認識結果を補正する、音声認識システムを提供する。
In order to solve the above-described problems and achieve the object, the present invention provides a speech recognition system including: acquisition means for acquiring voice data of voices uttered by a plurality of users in the same space; and speech recognition processing means for performing speech recognition on each piece of the acquired voice data of the plurality of users, wherein the speech recognition processing means corrects the speech recognition result based on a comparison of the speech recognition results of the plurality of pieces of voice data.

本発明によれば、音声認識の精度低下を抑制することが可能な、音声認識システム、音声認識方法及び音声処理装置を提供することができる。

According to the present invention, it is possible to provide a speech recognition system, a speech recognition method, and a speech processing device capable of suppressing deterioration in the accuracy of speech recognition.

Claims

A speech recognition system comprising:
acquisition means for acquiring voice data of voices uttered by a plurality of users in the same space from a plurality of sound collecting devices worn by the respective users;
determination means for determining whether or not the volume of each of the plurality of acquired pieces of voice data is less than a threshold; and
speech recognition processing means for performing speech recognition processing together with predetermined processing that uses the plurality of pieces of voice data in combination, when the determination means determines that the volume of each of the plurality of pieces of voice data is less than the threshold.

The speech recognition processing means is
Summing the plurality of audio data as a predetermined process using the plurality of audio data together,
performing speech recognition processing on the combined speech data;
A speech recognition system according to claim 1.
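The determination and summation described in claims 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: volume is measured here as root-mean-square amplitude, the channels are assumed to be time-aligned lists of float samples, and the names `rms` and `combine_if_quiet` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as the volume measure (an assumption)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def combine_if_quiet(
    channels: Sequence[Sequence[float]], threshold: float
) -> Optional[List[float]]:
    """Sum all microphone channels when every channel is below the threshold.

    Returns the summed signal, or None when at least one channel reaches the
    threshold (that case is handled elsewhere and is outside this sketch).
    """
    if not channels:
        return None
    if not all(rms(ch) < threshold for ch in channels):
        return None
    # Sum sample by sample over the shortest channel length.
    length = min(len(ch) for ch in channels)
    return [sum(ch[i] for ch in channels) for i in range(length)]
```

The summed signal would then be passed to speech recognition in place of the individual channels. Sample-wise addition only helps if the channels are synchronized, which this sketch takes for granted.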

The speech recognition processing means is
performing speech recognition processing on each of the plurality of speech data;
correcting a speech recognition result based on a comparison of the speech recognition results of each of the plurality of speech data as a predetermined process using the plurality of speech data together;
A speech recognition system according to claim 1.

The speech recognition processing means corrects the speech recognition result based on the most common speech recognition result when the speech recognition results of the plurality of speech data are different in the comparison.
A speech recognition system according to claim 3.
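The correction by the most common recognition result can be sketched as a simple majority vote. This is an illustration only: it assumes that the comparison is made on whole transcripts, one per microphone, and the function name `correct_by_majority` is hypothetical. The claim does not state the granularity of the comparison.

```python
from collections import Counter
from typing import Sequence


def correct_by_majority(results: Sequence[str]) -> str:
    """Return the transcript produced by the largest number of microphones.

    On a tie, the transcript that appears first in `results` is returned.
    """
    if not results:
        raise ValueError("at least one recognition result is required")
    return Counter(results).most_common(1)[0][0]
```

For example, if three microphones yield "hello", "hallo", and "hello", the corrected result is "hello".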

When the determining means determines that at least one of the volume levels of the plurality of audio data is equal to or greater than a threshold, the voice recognition processing means performs voice recognition processing on the audio data determined to be equal to or greater than the threshold. while performing speech recognition processing for other speech data,
The speech recognition system according to claim 1.

The speech recognition processing means, when it is determined that the volume of each of the plurality of audio data is less than a threshold value, estimates the user who made the utterance based on the comparison result of the volume of each of the plurality of audio data.
A speech recognition system according to any one of claims 1 to 5.
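One way to estimate the speaker from a comparison of volumes is to pick the user whose microphone recorded the loudest signal. The sketch below assumes exactly that rule, which the claim does not spell out, and again uses root-mean-square amplitude as the volume measure. The names `rms` and `estimate_speaker` and the user identifiers are hypothetical.

```python
import math
from typing import Mapping, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as the volume measure (an assumption)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_speaker(channels: Mapping[str, Sequence[float]]) -> str:
    """Return the user whose own microphone recorded the highest volume."""
    if not channels:
        raise ValueError("at least one channel is required")
    return max(channels, key=lambda user: rms(channels[user]))
```

Because each sound collecting device is worn by one user, the wearer's own voice is normally the closest source, which is why the loudest channel is a plausible indicator even when every channel is below the threshold.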

Further comprising the plurality of sound collecting devices,
The plurality of sound collecting devices are neck-mounted or badge-type sound collecting devices,
A speech recognition system according to any one of claims 1 to 6.

Further comprising identification means for identifying the user based on the image in which the plurality of users are subjects,
The speech recognition processing means varies a method of speech recognition processing for each of the users based on the identification result of the user.
A speech recognition system according to any one of claims 1 to 7.

A speech recognition method performed by a speech recognition system, the method comprising:
an acquisition step of acquiring voice data of voices uttered by a plurality of users in the same space from a plurality of sound collecting devices worn by the respective users;
a determination step of determining whether or not the volume of each of the plurality of acquired pieces of voice data is less than a threshold; and
a speech recognition processing step of performing speech recognition processing together with predetermined processing that uses the plurality of pieces of voice data in combination, when it is determined in the determination step that the volume of each of the plurality of pieces of voice data is less than the threshold.