JP2023133782A - Speech recognition text display system, speech recognition text display device, speech recognition text display method and program

Info

Publication number: JP2023133782A
Application number: JP2022038972A
Authority: JP
Inventors: 一也眞浦; Kazuya Maura; 恭佑日根野; Kyosuke Hineno; 卓山村; Taku Yamamura; 直亮住田; Naoaki Sumita; 一博中臺; Kazuhiro Nakadai; 雅樹中塚; Masaki NAKATSUKA; 唯周藤; Yui Shudo
Original assignee: Honda Motor Co Ltd; Honda Sun Co Ltd
Current assignee: Honda Motor Co Ltd; Honda Sun Co Ltd
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2023-09-27

Abstract

To provide a speech recognition text display system, a speech recognition text display device, a speech recognition text display method, and a program that make text information converted from speech signals easier for participants to understand. SOLUTION: A speech recognition text display system that converts speech signals into text information and displays it comprises: an acquisition unit that acquires speech signals; a speech recognition unit that performs speech recognition on the speech signals acquired by the acquisition unit and outputs text information; and a display unit that displays the text information. For a word determined in speech recognition to have multiple homophones sharing the same pronunciation, the speech recognition unit calculates a recognition likelihood for each homophone and, according to the magnitudes of the calculated likelihoods, switches between outputting the word as text information converted into kanji and outputting it as text information in kana characters. SELECTED DRAWING: Figure 1

Description

本発明は、音声認識テキスト表示システム、音声認識テキスト表示装置、音声認識テキスト表示方法およびプログラムに関する。 The present invention relates to a voice recognition text display system, a voice recognition text display device, a voice recognition text display method, and a program.

従来、複数の参加者が会議をする際に、各参加者が発話した内容（音声信号）をテキストに変換して表示する装置が知られている（例えば特許文献１参照）。このような装置は、聴覚障がい者による会議への参加を支援するために用いられることがある。 Conventionally, devices are known that, when a plurality of participants hold a conference, convert the content uttered by each participant (speech signals) into text and display it (see, for example, Patent Document 1). Such devices are sometimes used to help hearing-impaired people participate in conferences.

特開２０１９－１７９４８０号公報 JP2019-179480A

例えば特許文献１に記載の装置において、会議の参加者が発話した内容の中に同音異義語が存在する語が含まれる際に、当該語が、発話者が意図しない漢字に変換されて表示される場合がある。この場合、発話者が発話した内容を、他の参加者が理解しにくくなる可能性がある。このような問題は、特に、聴覚障がい者が会議に参加する場合に顕著となる。聴覚障がい者は、テキストを読むことによって会議の内容を理解するためである。 For example, in the device described in Patent Document 1, when the content uttered by a conference participant includes a word that has homophones, the word may be converted into kanji that the speaker did not intend and displayed. In that case, other participants may have difficulty understanding what the speaker said. This problem is particularly pronounced when hearing-impaired people participate in a conference, because they follow the content of the conference by reading the text.

本発明は、上記の問題点に鑑みてなされたものであって、音声信号から変換されたテキスト情報を参加者が理解しやすくすることができる音声認識テキスト表示システム、音声認識テキスト表示装置、音声認識テキスト表示方法およびプログラムを提供することを目的とする。 The present invention has been made in view of the above problem, and an object thereof is to provide a speech recognition text display system, a speech recognition text display device, a speech recognition text display method, and a program that make text information converted from speech signals easier for participants to understand.

（１）上記目的を達成するため、本発明の一態様に係る音声認識テキスト表示システム（１）は、音声信号をテキスト情報に変換して表示する音声認識テキスト表示システムであって、前記音声信号を取得する取得部（２２２）と、前記取得部にて取得された前記音声信号に対して音声認識を行い、前記テキスト情報を出力する音声認識部（音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、算出部２２６、変換切替部２２７）と、前記テキスト情報を表示する表示部（表示部２０３、表示部３０３、議事録作成部２２８、画像出力部２４１）と、を備え、前記音声認識部は、前記音声認識において発音が共通する複数の同音異義語の存在があると判断された語について、前記複数の同音異議語の各々に対する認識尤度を算出し、前記算出された複数の認識尤度の大きさに応じて、前記判断された語を、漢字に変換した前記テキスト情報で出力するか、仮名文字による前記テキスト情報で出力するかを切り替える。 (1) In order to achieve the above object, a speech recognition text display system (1) according to one aspect of the present invention is a speech recognition text display system that converts a speech signal into text information and displays it, comprising: an acquisition unit (222) that acquires the speech signal; a speech recognition unit (speech recognition unit 223, text conversion unit 224, dependency analysis unit 225, calculation unit 226, conversion switching unit 227) that performs speech recognition on the speech signal acquired by the acquisition unit and outputs the text information; and a display unit (display unit 203, display unit 303, minutes creation unit 228, image output unit 241) that displays the text information, wherein, for a word determined in the speech recognition to have a plurality of homophones sharing the same pronunciation, the speech recognition unit calculates a recognition likelihood for each of the plurality of homophones and, according to the magnitudes of the calculated recognition likelihoods, switches between outputting the determined word as the text information converted into kanji and outputting it as the text information in kana characters.

（２）また、本発明の一態様に係る音声認識テキスト表示システムにおいて、前記音声認識部は、前記算出された複数の認識尤度の最大値が所定値より低い場合に、前記判断された語を漢字に変換せず仮名文字による前記テキスト情報で出力してもよい。 (2) In the speech recognition text display system according to one aspect of the present invention, when the maximum of the calculated recognition likelihoods is lower than a predetermined value, the speech recognition unit may output the determined word as the text information in kana characters without converting it into kanji.

（３）また、本発明の一態様に係る音声認識テキスト表示システムにおいて、前記表示部は、前記テキスト情報を表示する際に、前記判断された語であり、かつ、前記音声認識部によって仮名文字による前記テキスト情報で出力された語を、他の語とは異なる書式で表示してもよい。 (3) In the speech recognition text display system according to one aspect of the present invention, when displaying the text information, the display unit may display a word that is the determined word and that was output by the speech recognition unit as the text information in kana characters in a format different from that of other words.

（４）上記目的を達成するため、本発明の一態様に係る音声認識テキスト表示装置は、音声信号をテキスト情報に変換して表示する音声認識テキスト表示装置であって、前記音声信号を取得する取得部と、前記取得部にて取得された前記音声信号に対して音声認識を行い、前記テキスト情報を出力する音声認識部と、前記テキスト情報を表示する表示部と、を備え、前記音声認識部は、前記音声認識において発音が共通する複数の同音異義語の存在があると判断された語について、前記複数の同音異議語の各々に対する認識尤度を算出し、前記算出された複数の認識尤度の大きさに応じて、前記判断された語を、漢字に変換した前記テキスト情報で出力するか、仮名文字による前記テキスト情報で出力するかを切り替える。 (4) In order to achieve the above object, a speech recognition text display device according to one aspect of the present invention is a speech recognition text display device that converts a speech signal into text information and displays it, comprising: an acquisition unit that acquires the speech signal; a speech recognition unit that performs speech recognition on the speech signal acquired by the acquisition unit and outputs the text information; and a display unit that displays the text information, wherein, for a word determined in the speech recognition to have a plurality of homophones sharing the same pronunciation, the speech recognition unit calculates a recognition likelihood for each of the plurality of homophones and, according to the magnitudes of the calculated recognition likelihoods, switches between outputting the determined word as the text information converted into kanji and outputting it as the text information in kana characters.

（５）上記目的を達成するため、本発明の一態様に係る音声認識テキスト表示方法は、音声信号をテキスト情報に変換して表示する音声認識テキスト表示システムにおける音声認識テキスト表示方法であって、取得部が、音声信号を取得する取得ステップと、音声認識部が、前記取得部にて取得された前記音声信号に対して音声認識を行い、前記テキスト情報を出力する音声認識ステップと、表示部が、前記テキスト情報を表示する表示ステップと、を備え、前記音声認識ステップにおいて、前記音声認識部は、前記音声認識において発音が共通する複数の同音異義語の存在があると判断された語について、前記複数の同音異議語の各々に対する認識尤度を算出し、前記算出された複数の認識尤度の大きさに応じて、前記判断された語を、漢字に変換した前記テキスト情報で出力するか、仮名文字による前記テキスト情報で出力するかを切り替える。 (5) In order to achieve the above object, a speech recognition text display method according to one aspect of the present invention is a speech recognition text display method in a speech recognition text display system that converts a speech signal into text information and displays it, comprising: an acquisition step in which an acquisition unit acquires a speech signal; a speech recognition step in which a speech recognition unit performs speech recognition on the speech signal acquired by the acquisition unit and outputs the text information; and a display step in which a display unit displays the text information, wherein, in the speech recognition step, for a word determined in the speech recognition to have a plurality of homophones sharing the same pronunciation, the speech recognition unit calculates a recognition likelihood for each of the plurality of homophones and, according to the magnitudes of the calculated recognition likelihoods, switches between outputting the determined word as the text information converted into kanji and outputting it as the text information in kana characters.

（６）上記目的を達成するため、本発明の一態様に係るプログラムは、音声信号をテキスト情報に変換して表示する音声認識テキスト表示システムに、音声信号を取得する取得ステップと、前記取得部にて取得された前記音声信号に対して音声認識を行い、前記テキスト情報を出力する音声認識ステップと、前記テキスト情報を表示する表示ステップと、を実行させ、前記音声認識ステップにおいては、前記音声認識において発音が共通する複数の同音異義語の存在があると判断された語について、前記複数の同音異議語の各々に対する認識尤度を算出し、前記算出された複数の認識尤度の大きさに応じて、前記判断された語を、漢字に変換した前記テキスト情報で出力するか、仮名文字による前記テキスト情報で出力するかを切り替える。 (6) In order to achieve the above object, a program according to one aspect of the present invention causes a speech recognition text display system that converts a speech signal into text information and displays it to execute: an acquisition step of acquiring a speech signal; a speech recognition step of performing speech recognition on the speech signal acquired by the acquisition unit and outputting the text information; and a display step of displaying the text information, wherein, in the speech recognition step, for a word determined in the speech recognition to have a plurality of homophones sharing the same pronunciation, a recognition likelihood is calculated for each of the plurality of homophones and, according to the magnitudes of the calculated recognition likelihoods, the output of the determined word is switched between the text information converted into kanji and the text information in kana characters.

上述した（１）、（４）または（５）あるいは（６）によれば、音声信号から変換されたテキスト情報を参加者が理解しやすくなるという効果を奏する。 According to the above-mentioned (1), (4), (5), or (6), there is an effect that the participants can easily understand the text information converted from the audio signal.

上述した（２）によれば、発話者が意図しない漢字に変換されたテキスト情報が表示される可能性を低減できる。 According to (2) above, it is possible to reduce the possibility that text information converted into kanji characters not intended by the speaker will be displayed.
上述した（３）によれば、同音異義語の認識尤度が低かったために仮名文字によるテキスト情報で表示された語を、参加者が判別することができる。 According to (3) above, the participant can identify words that were displayed as text information in kana characters because the recognition likelihood of homophones was low.
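
The effect of (3) can be illustrated with a minimal Python sketch. It is not taken from the source: the `Token` record, its `kana_fallback` flag, and the use of square brackets as the "different format" are all assumptions made for illustration, since the text does not specify which format is used.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Token:
    text: str            # surface form to be displayed
    kana_fallback: bool  # assumed flag: True if the word was left in kana


def render(tokens: Iterable[Token]) -> str:
    """Join tokens, marking kana-fallback words with brackets (assumed format)."""
    return "".join(f"[{t.text}]" if t.kana_fallback else t.text for t in tokens)


line = render([
    Token("部品の", False),
    Token("こうせい", True),   # left in kana because no homophone scored high enough
    Token("を確認する", False),
])
```

A real display unit might instead change color, weight, or underline; the bracket marker only stands in for any visually distinct style.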

本実施形態に係る音声認識テキスト表示装置（音声認識テキスト表示システム）の構成例を示すブロック図である。 FIG. 1 is a block diagram showing a configuration example of a speech recognition text display device (speech recognition text display system) according to the present embodiment.
本実施形態に係る親機の表示部に表示される画像例を示す図である。 A diagram showing an example of an image displayed on the display unit of the base unit according to the present embodiment.
本実施形態に係る子機の表示部に表示される画像例を示す図である。 A diagram showing an example of an image displayed on the display unit of the handset according to the present embodiment.
本実施形態に係る音声認識テキスト表示装置（音声認識テキスト表示システム）が行う処理手順例を示すフローチャートである。 A flowchart showing an example of a processing procedure performed by the speech recognition text display device (speech recognition text display system) according to the present embodiment.

以下、本発明の実施の形態について図面を参照しながら説明する。 Embodiments of the present invention will be described below with reference to the drawings.

まず、本実施形態の音声認識テキスト表示装置（音声認識テキスト表示システム）が使用される状況例を説明する。
本実施形態の音声認識テキスト表示装置（音声認識テキスト表示システム）は、２人以上が参加して行われる会議で用いられる。参加者のうち、発話が不自由な人が会議に参加していてもよい。発話可能な参加者は、参加者毎にマイクロフォンを装着するか、マイクロフォンを備える端末（スマートフォン、タブレット端末、パーソナルコンピュータ等）を用いる。聴覚障がいの参加者は、テキストを入力可能な端末を用いる。音声認識テキスト表示装置は、参加者の発話した音声信号に対して音声認識、テキスト化して、各自の端末にテキスト表示させる。また、音声認識テキスト表示装置は、聴覚障がい者が入力したテキスト情報を各自の端末にテキスト表示させる。 First, an example of a situation in which the voice recognition text display device (voice recognition text display system) of this embodiment is used will be described.
The speech recognition text display device (speech recognition text display system) of this embodiment is used in a conference attended by two or more people. The participants may include a person with a speech impairment. Each participant who can speak either wears a microphone or uses a terminal equipped with a microphone (smartphone, tablet terminal, personal computer, etc.). Hearing-impaired participants use a terminal on which text can be input. The speech recognition text display device performs speech recognition on the speech signals uttered by participants, converts them into text, and displays the text on each participant's terminal. It also displays text information input by hearing-impaired participants on each participant's terminal.

図１は、本実施形態に係る音声認識テキスト表示装置（音声認識テキスト表示システム）１の構成例を示すブロック図である。 FIG. 1 is a block diagram showing a configuration example of a speech recognition text display device (speech recognition text display system) 1 according to the present embodiment.
図１に示すように、音声認識テキスト表示装置（音声認識テキスト表示システム）１は、親機２と、子機３ａ、子機３ｂ、・・・を含んで構成される。なお、子機３ａ、子機３ｂ、・・・のうち１つを特定しない場合は、単に子機３という。 As shown in FIG. 1, the speech recognition text display device (speech recognition text display system) 1 includes a base unit 2 and handsets 3a, 3b, and so on. When none of the handsets 3a, 3b, and so on is singled out, they are simply referred to as the handset 3.
親機２と子機３とは、有線または無線のネットワーク４を介して接続されている。 The base unit 2 and the handset 3 are connected via a wired or wireless network 4.

親機２は、収音部２０１、操作部２０２、表示部２０３、通信部２０４、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、議事録作成部２２８、議事録記憶部２２９、テキスト取得部２３１、および画像出力部２４１を備える。係り受け解析部２２５は、算出部２２６および変換切替部２２７を備える。ただし、算出部２２６または変換切替部２２７は、係り受け解析部２２５とは別に設けられていてもよい。 The base unit 2 includes a sound collection unit 201, an operation unit 202, a display unit 203, a communication unit 204, an authentication unit 211, an acoustic model/dictionary storage unit 221, an acquisition unit 222, a speech recognition unit 223, a text conversion unit 224, a dependency analysis unit 225, a minutes creation unit 228, a minutes storage unit 229, a text acquisition unit 231, and an image output unit 241. The dependency analysis unit 225 includes a calculation unit 226 and a conversion switching unit 227. However, the calculation unit 226 or the conversion switching unit 227 may be provided separately from the dependency analysis unit 225.

子機３は、収音部３０１、操作部３０２、表示部３０３、通信部３０４、および処理部３０５を備える。収音部３０１、操作部３０２、表示部３０３、通信部３０４、および処理部３０５は、バス３０６を介して接続されている。 The handset 3 includes a sound collection section 301, an operation section 302, a display section 303, a communication section 304, and a processing section 305. The sound collection section 301, the operation section 302, the display section 303, the communication section 304, and the processing section 305 are connected via a bus 306.

＜子機３＞ <Handset 3>
まず、子機３について説明する。 First, the handset 3 will be described.
子機３は、例えばスマートフォン、タブレット端末、パーソナルコンピュータ等である。なお、子機３は、音声出力部、モーションセンサー、ＧＰＳ（ＧｌｏｂａｌＰｏｓｉｔｉｏｎｉｎｇＳｙｓｔｅｍ；全地球測位システム）等を備えていてもよい。 The handset 3 is, for example, a smartphone, a tablet terminal, a personal computer, or the like. Note that the handset 3 may include an audio output unit, a motion sensor, a GPS (Global Positioning System), and the like.

収音部３０１は、マイクロフォンである。収音部３０１は、利用者の音声信号を収音し、収音した音声信号をアナログ信号からデジタル信号に変換して、デジタル信号に変換した音声信号を処理部３０５に出力する。 The sound collection unit 301 is a microphone. The sound collection unit 301 collects a user's audio signal, converts the collected audio signal from an analog signal to a digital signal, and outputs the audio signal converted to the digital signal to the processing unit 305.

操作部３０２は、利用者の操作を検出し、検出した結果を処理部３０５に出力する。操作部３０２は、例えば表示部３０３上に設けられたタッチパネル式のセンサー、または優先接続または無線接続のキーボード等である。 The operation unit 302 detects a user's operation and outputs the detection result to the processing unit 305. The operation unit 302 is, for example, a touch panel sensor provided on the display unit 303, or a keyboard with a wired or wireless connection.

処理部３０５は、操作部３０２が操作された操作結果に基づいて設定情報を生成し、生成した設定情報を通信部３０４に出力する。ここで、設定情報には、参加者の識別情報が含まれている。設定情報には、収音部の使用の有無を示す情報、操作部の使用の有無を示す情報が含まれていてもよい。処理部３０５は、操作部３０２が操作された操作結果に基づいてログイン指示を生成し、生成したログイン指示を通信部３０４に出力する。ここで、ログイン指示には、参加者の識別情報、子機３の識別情報が含まれている。処理部３０５は、操作部３０２が操作された操作結果に基づくテキスト情報に識別情報を付加して通信部３０４に出力する。処理部３０５は、収音部３０１が出力する音声信号に識別情報を付加して通信部３０４に出力する。処理部３０５は、通信部３０４が出力する画像データを取得し、取得した画像データを表示部３０３に出力する。処理部３０５は、通信部３０４が出力するログインを許可する情報に基づいて、親機２との通信を確立する。処理部３０５は、親機２から発言制限指示（入力制限指示）を受信した場合、テキスト入力に対して制限を行ってもよい。また、処理部３０５は、親機２から発言制限指示を受信した場合、音声入力に対しても制限を行うようにしてもよい。 The processing unit 305 generates setting information based on the operation result of the operation unit 302 and outputs the generated setting information to the communication unit 304. Here, the setting information includes participant identification information. The setting information may include information indicating whether the sound collection section is used or not, and information indicating whether or not the operation section is used. The processing unit 305 generates a login instruction based on the operation result of the operation unit 302 and outputs the generated login instruction to the communication unit 304. Here, the login instruction includes identification information of the participant and identification information of the handset 3. The processing unit 305 adds identification information to text information based on the operation result of the operation unit 302 and outputs the text information to the communication unit 304 . The processing unit 305 adds identification information to the audio signal output by the sound collection unit 301 and outputs the audio signal to the communication unit 304 . The processing unit 305 acquires the image data output by the communication unit 304 and outputs the acquired image data to the display unit 303. The processing unit 305 establishes communication with the base device 2 based on the information output by the communication unit 304 that permits login. 
When the processing unit 305 receives a speech restriction instruction (input restriction instruction) from the base device 2, the processing unit 305 may restrict text input. Further, when the processing unit 305 receives a speech restriction instruction from the base device 2, the processing unit 305 may also restrict voice input.
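
As a rough illustration of how the processing unit 305 might attach identification information to text and audio before handing them to the communication unit 304, the following Python sketch may help. The `HandsetMessage` layout, the class and method names, and the choice to restrict only text input are assumptions for illustration; the source does not define a message format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandsetMessage:
    """Payload sent from the handset 3 to the base unit 2 (hypothetical layout)."""
    participant_id: str
    handset_id: str
    text: Optional[str] = None     # set for keyboard or touch panel input
    audio: Optional[bytes] = None  # set for digitized microphone input


class ProcessingUnit:
    """Sketch of the tagging and restriction behaviour described for unit 305."""

    def __init__(self, participant_id: str, handset_id: str) -> None:
        self.participant_id = participant_id
        self.handset_id = handset_id
        self.input_restricted = False

    def on_restriction_instruction(self) -> None:
        # Assumed handling of a speech (input) restriction instruction.
        self.input_restricted = True

    def tag_text(self, text: str) -> Optional[HandsetMessage]:
        if self.input_restricted:
            return None  # text input is suppressed while restricted
        return HandsetMessage(self.participant_id, self.handset_id, text=text)

    def tag_audio(self, audio: bytes) -> HandsetMessage:
        # This sketch leaves audio unrestricted; the source allows either choice.
        return HandsetMessage(self.participant_id, self.handset_id, audio=audio)
```

Carrying both identifiers on every message is what lets the base unit attribute each utterance to a speaker when it later builds the minutes.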

表示部３０３は、処理部３０５が出力した画像データを表示する。表示部３０３は、例えば液晶表示装置、有機ＥＬ（エレクトロルミネッセンス）表示装置、電子インク表示装置等である。なお、表示部３０３上に表示される画像については後述する。 The display unit 303 displays the image data output by the processing unit 305. The display unit 303 is, for example, a liquid crystal display device, an organic EL (electroluminescence) display device, an electronic ink display device, or the like. Note that the image displayed on the display unit 303 will be described later.

通信部３０４は、処理部３０５が出力する設定情報を、ネットワーク４を介して親機２へ送信する。通信部３０４は、処理部３０５が出力するログイン指示を、ネットワーク４を介して親機２へ送信する。通信部３０４は、処理部３０５が出力するテキスト情報または音声信号を、ネットワーク４を介して親機２へ送信する。なお、送信するテキスト情報または音声信号には、利用者の識別情報と子機３の識別情報が含まれている。通信部３０４は、親機２が送信した画像データを受信し、受信した画像データを処理部３０５に出力する。通信部３０４は、親機２が送信したログインを許可する情報を受信した場合、受信したログインを許可する情報を処理部３０５に出力する。 The communication unit 304 transmits the setting information output by the processing unit 305 to the base device 2 via the network 4. The communication unit 304 transmits the login instruction output by the processing unit 305 to the base device 2 via the network 4. The communication unit 304 transmits text information or audio signals output by the processing unit 305 to the base unit 2 via the network 4. Note that the text information or audio signal to be transmitted includes the identification information of the user and the identification information of the handset 3. The communication unit 304 receives the image data transmitted by the base device 2 and outputs the received image data to the processing unit 305. When the communication unit 304 receives the information that permits login transmitted from the base device 2, the communication unit 304 outputs the received information that permits login to the processing unit 305.

＜親機２＞ <Base unit 2>
次に親機２について説明する。 Next, the base unit 2 will be described.
親機２は、例えばノートパソコン等である。 The base unit 2 is, for example, a notebook computer.

収音部２０１は、マイクロフォンである。収音部２０１は、利用者の音声信号を収音し、収音した音声信号をアナログ信号からデジタル信号に変換して、デジタル信号に変換した音声信号を取得部２２２に出力する。 The sound collection unit 201 is a microphone. The sound collection unit 201 collects a user's audio signal, converts the collected audio signal from an analog signal to a digital signal, and outputs the audio signal converted to the digital signal to the acquisition unit 222.

操作部２０２は、利用者の操作を検出し、検出した結果をテキスト取得部２３１に出力する。操作部２０２は、例えば表示部２０３上に設けられたタッチパネル式のセンサー、またはキーボードである。操作部２０２は、ログイン処理の際、操作を検出した結果を、認証部２１１に出力する。 The operation unit 202 detects a user's operation and outputs the detected result to the text acquisition unit 231. The operation unit 202 is, for example, a touch panel sensor provided on the display unit 203 or a keyboard. The operation unit 202 outputs the result of detecting the operation to the authentication unit 211 during the login process.

表示部２０３は、例えば液晶表示装置、有機ＥＬ表示装置、電子インク表示装置等である。表示部２０３は、画像出力部２４１が出力する画像データを表示する。なお、表示部２０３上に表示される画像については後述する。 The display unit 203 is, for example, a liquid crystal display device, an organic EL display device, an electronic ink display device, or the like. The display unit 203 displays image data output by the image output unit 241. Note that the image displayed on the display unit 203 will be described later.

通信部２０４は、子機３が送信した音声信号を受信し、受信した音声信号を取得部２２２に出力する。通信部２０４は、子機３が送信したテキスト情報を受信し、受信したテキスト情報をテキスト取得部２３１に出力する。通信部２０４は、子機３が送信したログイン指示を受信し、受信したログイン指示を認証部２１１に出力する。通信部２０４は、画像出力部２４１が出力する画像データを、ネットワーク４を介して子機３へ送信する。通信部２０４は、認証部２１１が出力するログインを許可する情報を、ネットワーク４を介して子機３へ送信する。 The communication unit 204 receives the audio signal transmitted by the handset 3 and outputs the received audio signal to the acquisition unit 222. The communication unit 204 receives the text information transmitted by the handset 3 and outputs the received text information to the text acquisition unit 231. The communication unit 204 receives the login instruction sent by the handset 3 and outputs the received login instruction to the authentication unit 211. The communication unit 204 transmits the image data output by the image output unit 241 to the slave unit 3 via the network 4. The communication unit 204 transmits the information output by the authentication unit 211 to permit login to the slave device 3 via the network 4 .

認証部２１１は、通信部２０４が出力するログイン指示に含まれる参加者の識別情報と子機３の識別情報に基づいて、ログインを許可するか否かを判定する。認証部２１１は、ログインを許可する場合、ログインを許可する情報を通信部２０４に出力する。認証部２１１は、操作部２０２が操作された結果に基づいて、親機２の利用者のログインを許可するか否かを判定する。認証部２１１は、ログインを許可する場合、各機能部にログインを許可する情報を出力し、各機能部の動作を許可する。なお、各機能部とは、通信部２０４、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、算出部２２６、変換切替部２２７、議事録作成部２２８、議事録記憶部２２９、テキスト取得部２３１、および画像出力部２４１である。 The authentication unit 211 determines whether to permit login based on the identification information of the participant and the identification information of the handset 3 included in the login instruction output by the communication unit 204. When permitting login, the authentication section 211 outputs information permitting login to the communication section 204. The authentication unit 211 determines whether or not to permit the user of the base device 2 to log in, based on the result of the operation on the operation unit 202. When permitting login, the authentication section 211 outputs information permitting login to each functional section, and permits the operation of each functional section. Note that each functional unit includes a communication unit 204, an authentication unit 211, an acoustic model/dictionary storage unit 221, an acquisition unit 222, a speech recognition unit 223, a text conversion unit 224, a dependency analysis unit 225, a calculation unit 226, and a conversion switch. section 227 , a minutes creation section 228 , a minutes storage section 229 , a text acquisition section 231 , and an image output section 241 .

音響モデル・辞書記憶部２２１は、例えば音響モデル、言語モデル、単語辞書等を格納している。音響モデルとは、音の特徴量に基づくモデルであり、言語モデルとは、単語とその並び方の情報のモデルである。また、単語辞書とは、多数の語彙による辞書であり、例えば大語彙単語辞書である。なお、親機２は、音響モデル・辞書記憶部２２１に格納されていない単語等を格納して更新するようにしてもよい。なお、音響モデル・辞書記憶部２２１は、例えば会議ごとにＤＢ（データベース）を備えていてもよい。例えば、第１のＤＢが一般の会議用であり、第２のＤＢが発表会用であり、第３のＤＢが国際会議用であってもよい。このように会議に合わせたＤＢを用いることで、同音異義語等の変換を適切に行いやすくなる。 The acoustic model/dictionary storage unit 221 stores, for example, acoustic models, language models, word dictionaries, and the like. The acoustic model is a model based on sound feature amounts, and the language model is a model of information about words and how they are arranged. Further, the word dictionary is a dictionary with a large number of vocabulary words, for example, a large vocabulary word dictionary. Note that the base unit 2 may store and update words and the like that are not stored in the acoustic model/dictionary storage unit 221. Note that the acoustic model/dictionary storage unit 221 may include a DB (database) for each conference, for example. For example, the first DB may be used for general conferences, the second DB may be used for presentations, and the third DB may be used for international conferences. By using a DB tailored to the conference in this way, it becomes easier to appropriately convert homonyms and the like.
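
The idea of keeping one DB per conference type can be sketched in Python as follows. The dictionary contents, the key names, and the fallback to the general DB are hypothetical and serve only to show how the choice of DB narrows the homophone candidates.

```python
from typing import Dict, List

# Hypothetical per-conference databases; the entries are illustrative only.
CONFERENCE_DBS: Dict[str, Dict[str, List[str]]] = {
    "general": {"こうせい": ["構成", "公正", "厚生"]},
    "presentation": {"こうせい": ["構成", "校正"]},
    "international": {"こうせい": ["公正", "構成"]},
}


def candidates_for(reading: str, conference_type: str,
                   dbs: Dict[str, Dict[str, List[str]]] = CONFERENCE_DBS) -> List[str]:
    """Return kanji candidates for a reading from the DB chosen for this conference."""
    db = dbs.get(conference_type, dbs["general"])  # assumed fallback to the general DB
    return db.get(reading, [])
```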

取得部２２２は、収音部２０１が出力する音声信号、または通信部２０４が出力する音声信号を取得し、取得した音声信号を音声認識部２２３に出力する。 The acquisition unit 222 acquires the audio signal output by the sound collection unit 201 or the audio signal output by the communication unit 204, and outputs the acquired audio signal to the voice recognition unit 223.

音声認識部２２３は、取得部２２２が出力する音声信号を取得する。音声認識部２２３は、音声信号から発話区間の音声信号を検出する。発話区間の検出は、例えば所定のしきい値以上の音声信号を発話区間として検出する。なお、音声認識部２２３は、発話区間の検出を周知の他の手法を用いて行ってもよい。音声認識部２２３は、検出した発話区間の音声信号に対して、音響モデル・辞書記憶部２２１を参照して、周知の手法を用いて音声認識を行う。なお、音声認識部２２３は、例えば特開２０１５－６４５５４号公報に開示されている手法等を用いて音声認識を行う。音声認識部２２３は、認識した認識結果と音声信号をテキスト変換部２２４に出力する。なお、音声認識部２２３は、認識結果と音声信号とを、例えば１文毎、または発話区間毎、または発話毎に対応つけて出力する。
なお、音声認識部２２３は、音声信号が同時に入力された場合、例えば時分割処理によって収音部（２０１または３０１）毎に音声認識を行う。また、音声認識部２２３は、マイクロフォンがマイクロフォンアレイの場合、音源分離処理、音源定位処理、音源同定処理等、周知の音声認識処理も行う。 The voice recognition unit 223 acquires the voice signal output by the acquisition unit 222. The speech recognition unit 223 detects the speech signal of the speech section from the speech signal. The speech section is detected by detecting, for example, an audio signal equal to or higher than a predetermined threshold value as the speech section. Note that the speech recognition unit 223 may detect the utterance section using other well-known techniques. The speech recognition unit 223 refers to the acoustic model/dictionary storage unit 221 and performs speech recognition on the speech signal of the detected speech section using a well-known method. Note that the speech recognition unit 223 performs speech recognition using, for example, the method disclosed in Japanese Patent Application Publication No. 2015-64554. The speech recognition section 223 outputs the recognized recognition result and speech signal to the text conversion section 224. Note that the speech recognition unit 223 outputs the recognition result and the speech signal in association with each other, for example, for each sentence, for each utterance section, or for each utterance.
Note that when speech signals are input simultaneously, the speech recognition unit 223 performs speech recognition for each sound collection unit (201 or 301), for example, by time-division processing. Furthermore, when the microphone is a microphone array, the speech recognition unit 223 also performs well-known speech recognition processing such as sound source separation processing, sound source localization processing, and sound source identification processing.
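
The threshold-based detection of utterance sections mentioned above can be sketched in Python as follows. The frame-wise mean absolute amplitude, the frame length, and the function name are assumptions; the source only states that a signal at or above a predetermined threshold is treated as an utterance section.

```python
from typing import List, Sequence, Tuple


def detect_speech_sections(samples: Sequence[float], threshold: float,
                           frame_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames at or above threshold."""
    sections: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)  # mean absolute amplitude
        if level >= threshold:
            if start is None:
                start = i            # a new utterance section begins
        elif start is not None:
            sections.append((start, i))  # the section ended at this quiet frame
            start = None
    if start is not None:
        sections.append((start, len(samples)))  # signal ended mid-utterance
    return sections
```

Practical detectors usually add smoothing or hangover time so that short pauses do not split one utterance; that is omitted here for brevity.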

テキスト変換部２２４は、音声認識部２２３が出力する認識結果に対して、音響モデル・辞書記憶部２２１を参照して、テキストに変換する。なお、テキスト情報は、少なくとも１文字の情報を含む。テキスト変換部２２４は、変換したテキスト情報と、取得した音声信号を係り受け解析部２２５に出力する。なお、テキスト変換部２２４は、発話情報を認識した結果から「あー」、「えーと」、「えー」、「まあ」等の間投詞を削除してテキストに変換するようにしてもよい。 The text conversion unit 224 converts the recognition result output by the speech recognition unit 223 into text by referring to the acoustic model/dictionary storage unit 221. Note that the text information includes information of at least one character. The text conversion unit 224 outputs the converted text information and the acquired audio signal to the dependency analysis unit 225. Note that the text conversion unit 224 may delete interjections such as "ah", "um", "um", and "well" from the result of recognizing the utterance information and convert it into text.
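
The optional removal of interjections can be sketched in Python as below. The sketch assumes the recognition result is already split into word tokens, which the source does not state, so that only whole-token matches are dropped; the list of interjections is the one given in the text.

```python
from typing import Iterable, List

# Interjections named in the description.
INTERJECTIONS = ("あー", "えーと", "えー", "まあ")


def remove_interjections(words: Iterable[str]) -> List[str]:
    """Drop tokens that exactly match a listed interjection."""
    return [w for w in words if w not in INTERJECTIONS]


cleaned = remove_interjections(["えーと", "今日", "は", "まあ", "晴れ", "です"])
```

Matching whole tokens rather than substrings avoids damaging ordinary words that merely contain the same characters, such as 「まあまあ」.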

係り受け解析部２２５は、テキスト変換部２２４が出力したテキスト情報または通信部２０４が出力したテキスト情報に対して、音響モデル・辞書記憶部２２１を参照して、形態素解析と係り受け解析を行う。なお、係り受け解析には、例えば、Ｓｈｉｆｔ－ｒｅｄｕｃｅ法や全域木の手法やチャンク同定の段階適用手法においてＳＶＭ（ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅｓ）を用いる。
係り受け解析部２２５は、形態素解析と係り受け解析を行ったテキスト情報と、解析した結果を議事録作成部２２８に出力する。なお、係り受け解析部２２５は、テキスト変換部２２４が出力する音声信号を取得した場合、取得した音声信号も議事録作成部２２８に出力する。
ここで、係り受け解析部２２５は、上記した情報に加えて（または代えて）、以下で説明する算出部２２６および変換切替部２２７による処理を行ったテキスト情報を議事録作成部２２８に出力する。 The dependency analysis unit 225 performs morphological analysis and dependency analysis on the text information output by the text conversion unit 224 or the text information output by the communication unit 204 with reference to the acoustic model/dictionary storage unit 221. Note that SVM (Support Vector Machines) is used in the dependency analysis, for example, in the shift-reduce method, the spanning tree method, and the stepwise application method of chunk identification.
The dependency analysis unit 225 outputs the text information subjected to morphological analysis and dependency analysis, together with the analysis results, to the minutes creation unit 228. When the dependency analysis unit 225 has acquired the speech signal output by the text conversion unit 224, it also outputs that speech signal to the minutes creation unit 228.
Here, in addition to (or instead of) the above information, the dependency analysis unit 225 outputs to the minutes creation unit 228 the text information processed by the calculation unit 226 and the conversion switching unit 227 described below.

算出部２２６は、音声認識（形態素解析と係り受け解析）において発音が共通する複数の同音異義語の存在があると判断された語について、複数の同音異義語の各々に対する認識尤度を算出する。例えば、「こうせい」という単語には、「構成」「鋼製」「厚生」「公正」・・・といった発音が共通する複数の同音異義語が存在する。算出部２２６は、音響モデル・辞書記憶部２２１、またはテキスト情報における「こうせい」の前後の文脈その他の情報を参照して、「構成」「鋼製」「厚生」「公正」・・・の各々について、認識尤度を算出する。算出部２２６は、当該算出によって得られた複数の認識尤度を、変換切替部２２７に出力する。 For a word determined in speech recognition (morphological analysis and dependency analysis) to have a plurality of homophones sharing the same pronunciation, the calculation unit 226 calculates a recognition likelihood for each of the homophones. For example, the word 「こうせい」 (kousei) has a plurality of homophones with the same pronunciation, such as 「構成」 (composition), 「鋼製」 (made of steel), 「厚生」 (welfare), and 「公正」 (fairness). The calculation unit 226 refers to the acoustic model/dictionary storage unit 221, or to the context before and after 「こうせい」 in the text information and other information, and calculates a recognition likelihood for each of 「構成」, 「鋼製」, 「厚生」, 「公正」, and so on. The calculation unit 226 outputs the plurality of recognition likelihoods thus obtained to the conversion switching unit 227.

変換切替部２２７は、算出部２２６によって出力された複数の認識尤度の大きさに応じて、発音が共通する複数の同音異義語の存在があると判断された語を、係り受け解析部２２５が、漢字に変換したテキスト情報で出力するか、仮名文字によるテキスト情報で出力するか、を切り替える。なお、「仮名文字」には、平仮名および片仮名が含まれる。
例えば、変換切替部２２７は、算出部２２６によって出力された「構成」「鋼製」「厚生」「公正」・・・の各々の認識尤度に応じて、「こうせい」という語を、「こうせい」または「コウセイ」で出力するか、「構成」または「鋼製」等で出力するか、を切り替える。
一例として、変換切替部２２７は、算出部２２６によって出力された複数の認識尤度の最大値が所定値より低い場合に、発音が共通する複数の同音異義語の存在があると判断された語（「こうせい」）を係り受け解析部２２５に漢字変換させず、仮名文字によるテキスト情報で出力させてもよい。 According to the magnitudes of the plurality of recognition likelihoods output by the calculation unit 226, the conversion switching unit 227 switches whether the dependency analysis unit 225 outputs a word determined to have a plurality of homophones with a common pronunciation as text information converted into kanji or as text information in kana characters. Note that "kana characters" include hiragana and katakana.
For example, according to the recognition likelihoods of 構成, 鋼製, 厚生, 公正, and so on output by the calculation unit 226, the conversion switching unit 227 switches whether the word "kousei" is output in kana, as こうせい or コウセイ, or in kanji, as 構成, 鋼製, or the like.
As an example, when the maximum value of the plurality of recognition likelihoods output by the calculation unit 226 is lower than a predetermined value, the conversion switching unit 227 may cause the dependency analysis unit 225 to output the word determined to have a plurality of homophones with a common pronunciation ("kousei") as text information in kana characters, without converting it into kanji.
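The switching rule described above can be illustrated with a minimal Python sketch. This is not the implementation of the embodiment: the function name `choose_rendering`, the `threshold` parameter, and the likelihood values are all hypothetical, and the comparison uses "lower than" as in the example just given.

```python
from typing import Dict


def choose_rendering(reading: str, likelihoods: Dict[str, float], threshold: float) -> str:
    """Return the most likely kanji candidate, or the kana reading when confidence is low.

    reading     -- the word in kana, e.g. "こうせい" (hypothetical input format)
    likelihoods -- recognition likelihood per homophone candidate (hypothetical values)
    threshold   -- the "predetermined value" (hypothetical parameter)
    """
    if not likelihoods:
        # No kanji candidates at all: fall back to the kana reading.
        return reading
    best_word, best_score = max(likelihoods.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        # The best candidate is still unconvincing: keep the word in kana.
        return reading
    return best_word


# Illustrative values only; they are not taken from the publication.
ambiguous = {"構成": 0.35, "鋼製": 0.30, "厚生": 0.20, "公正": 0.15}
confident = {"構成": 0.85, "鋼製": 0.05, "厚生": 0.05, "公正": 0.05}

print(choose_rendering("こうせい", ambiguous, threshold=0.6))
print(choose_rendering("こうせい", confident, threshold=0.6))
```

With these illustrative values, the first call keeps the kana reading because no candidate reaches the threshold, and the second call selects 構成.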

議事録作成部２２８は、係り受け解析部２２５またはテキスト取得部２３１が出力したテキスト情報に基づいて、発話者毎に分けて、議事録を作成する。議事録作成部２２８は、作成した議事録と対応する音声信号を議事録記憶部２２９に記憶させる。また、議事録作成部２２８は、作成した議事録を画像出力部２４１に出力する。なお、議事録作成部２２８は、「あー」、「えーと」、「えー」、「まあ」等の間投詞を削除して議事録を作成するようにしてもよい。 The minutes creation unit 228 creates minutes, separated by speaker, based on the text information output by the dependency analysis unit 225 or the text acquisition unit 231. The minutes creation unit 228 stores the created minutes and the corresponding audio signal in the minutes storage unit 229. The minutes creation unit 228 also outputs the created minutes to the image output unit 241. Note that the minutes creation unit 228 may create the minutes after deleting interjections such as あー ("ah"), えーと ("um"), えー ("er"), and まあ ("well").
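As a rough illustration of per-speaker minutes with interjections removed, the following Python sketch groups utterances by speaker and drops filler tokens. It assumes the utterances have already been split into tokens by morphological analysis; the function name `build_minutes`, the `FILLERS` set, and the input format are hypothetical and are not taken from the publication.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Interjections named in the description; a real system might use a longer list.
FILLERS = {"あー", "えーと", "えー", "まあ"}


def build_minutes(utterances: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Group utterances by speaker, removing filler tokens from each utterance.

    utterances -- (speaker name, list of tokens) pairs in spoken order
                  (hypothetical input format).
    """
    minutes: Dict[str, List[str]] = defaultdict(list)
    for speaker, tokens in utterances:
        kept = [tok for tok in tokens if tok not in FILLERS]
        if kept:
            # Skip utterances that consisted only of interjections.
            minutes[speaker].append("".join(kept))
    return dict(minutes)


sample = [
    ("A", ["えーと", "構成", "を", "確認", "します"]),
    ("B", ["あー"]),
    ("A", ["まあ", "はい"]),
]
print(build_minutes(sample))
```

Matching on whole tokens rather than substrings avoids deleting characters that merely happen to appear inside a longer word.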

議事録記憶部２２９は、議事録と音声信号を対応つけて記憶する。 The minutes storage unit 229 stores minutes and audio signals in association with each other.

テキスト取得部２３１は、操作部２０２が出力する操作結果、または通信部２０４が出力する操作部３０２の操作結果を取得し、取得した結果に基づいてテキスト情報を生成する。テキスト取得部２３１は、生成したテキスト情報を議事録作成部２２８に出力する。 The text acquisition unit 231 acquires the operation result output by the operation unit 202 or the operation result of the operation unit 302 output by the communication unit 204, and generates text information based on the acquired result. The text acquisition unit 231 outputs the generated text information to the minutes creation unit 228.

画像出力部２４１は、議事録作成部２２８が出力する議事録情報を取得する。画像出力部２４１は、議事録情報に基づいて画像データを生成し、生成した画像データを表示部２０３と通信部２０４に出力する。 The image output unit 241 acquires the minutes information output by the minutes creation unit 228. The image output unit 241 generates image data based on the minutes information and outputs the generated image data to the display unit 203 and the communication unit 204.

＜親機２の表示画像＞
次に、親機２の表示部２０３上に表示される画像例を説明する。
図２は、本実施形態に係る親機２の表示部２０３上に表示される画像例を示す図である。
画像ｇ１０が、親機２の表示部２０３上に表示される画像である。 <Display image of base unit 2>
Next, an example of an image displayed on the display unit 203 of the base unit 2 will be described.
FIG. 2 is a diagram showing an example of an image displayed on the display unit 203 of the base unit 2 according to the present embodiment.
Image g10 is the image displayed on the display unit 203 of the base unit 2.

領域ｇ１００は、参加者情報編集を行う領域である。
領域ｇ１０１は、参加者情報の領域である。符号ｇ１０２は、参加者の名前である。符号ｇ１０３は、参加者が親機２の操作部２０２または子機３の操作部３０２によってテキスト入力を行うことを示すアイコンである。符号ｇ１０４は、参加者が親機２の収音部２０１または子機３の収音部３０１によって発話を行うことを示すアイコンである。符号ｇ１０５は、参加者が使用するマイクロフォンの番号（または識別情報）である。 Area g100 is an area where participant information is edited.
Area g101 is an area for participant information. The code g102 is the name of the participant. Reference numeral g103 is an icon indicating that the participant inputs text using the operation unit 202 of the base unit 2 or the operation unit 302 of the slave unit 3. Reference numeral g104 is an icon indicating that the participant speaks using the sound collection unit 201 of the base unit 2 or the sound collection unit 301 of the slave unit 3. The code g105 is the number (or identification information) of the microphone used by the participant.

領域ｇ２００は、議事録を表示する領域である。なお、図２では、ログイン後の状態を示している。符号ｇ２０１は、ログイン／ログアウトのボタン画像である。符号ｇ２０２は、音声認識テキスト表示装置（音声認識テキスト表示システム）１の開始／終了のボタン画像である。符号ｇ２０３は、音声認識テキスト表示装置（音声認識テキスト表示システム）１の使用中に点灯する表示である。符号ｇ２０４は、議事録記憶部２２９が記憶する議事録の表示や音声信号の再生を行うボタン画像である。符号ｇ２０５は、親機２の利用者が収音部２０１の使用有無を選択するボタン画像である。 Area g200 is an area for displaying minutes. Note that FIG. 2 shows the state after login. Symbol g201 is a login/logout button image. Reference numeral g202 is a start/end button image of the voice recognition text display device (voice recognition text display system) 1. Reference numeral g203 is a display that lights up while the voice recognition text display device (voice recognition text display system) 1 is in use. Reference numeral g204 is a button image for displaying the minutes stored in the minutes storage section 229 and reproducing the audio signal. Reference numeral g205 is a button image through which the user of the base unit 2 selects whether or not to use the sound collection section 201.

符号ｇ２１１は、第１の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ２１２は、第１の参加者が操作部（２０２または３０２）を操作して入力した絵文字である。符号ｇ２１３は、第１の参加者がテキスト情報および絵文字を入力した日時を示す情報である。符号ｇ２１４は、第１の参加者の名前である。 The code g211 is text information input by the first participant by operating the operation unit (202 or 302). The symbol g212 is a pictogram input by the first participant by operating the operation unit (202 or 302). The code g213 is information indicating the date and time when the first participant inputted the text information and pictograms. Code g214 is the name of the first participant.

符号ｇ２２１は、第２の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ２２２は、第２の参加者が発話した内容を音声認識したテキスト情報である。符号ｇ２２３は、操作部（２０２または３０２）を操作してテキストを入力したことを示すアイコンである。符号ｇ２２４は、収音部（２０１または３０１）によって発話を入力したことを示すアイコンである。符号ｇ２３１は、第３の参加者が発話した内容を音声認識したテキスト情報である。 The code g221 is text information input by the second participant by operating the operation unit (202 or 302). The code g222 is text information obtained by voice recognition of the content uttered by the second participant. The symbol g223 is an icon indicating that text has been input by operating the operation unit (202 or 302). The symbol g224 is an icon indicating that speech has been input by the sound collection unit (201 or 301). The code g231 is text information obtained by voice recognition of the content uttered by the third participant.

符号ｇ２４１は、第２の参加者が発話した語「こうせい」について、各同音異義語「構成」「鋼製」等の認識尤度に応じて、仮名文字によるテキスト情報で表示する一例である。
図示の例に示すように、表示部２０３は、音声認識において発音が共通する複数の同音異義語の存在があると判断された語（「こうせい」）であり、かつ、係り受け解析部２２５によって仮名文字によるテキスト情報で出力された語を、他の語とは異なる書式で表示させてもよい。書式を異ならせる態様としては、例えば、斜体（イタリック）、太字、下線、マーカー表示、文字色、文字サイズ、またはフォントその他の態様、ならびにこれらの組み合わせ等を用いることができる。図示の例においては、「こうせい」を太字かつ斜体とすることにより、「こうせい」を他のテキスト情報とは異なる書式で表示している。 Symbol g241 is an example of displaying text information in kana characters for the word "Kosei" uttered by the second participant, according to the recognition likelihood of each homophone "structure", "made of steel", etc.
As shown in the illustrated example, the display unit 203 may display a word that has been determined in speech recognition to have multiple homophones with a common pronunciation ("kousei"), and that has been output by the dependency analysis unit 225 as text information in kana characters, in a format different from that of other words. Ways to vary the format include, for example, italics, boldface, underlining, marker display (highlighting), font color, font size, typeface, and other attributes, as well as combinations of these. In the illustrated example, "kousei" is set in bold italics so that it is displayed in a format different from the other text information.

なお、表示部２０３が上記の表示をするために、画像出力部２４１が、当該語を他の語とは異なる書式で表示させた画像データを生成する。 Note that in order for the display unit 203 to display the above, the image output unit 241 generates image data in which the word is displayed in a format different from that of other words.
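One way to picture the distinct formatting is a renderer that marks kana-fallback words before display. The Python sketch below emits HTML-style bold and italic tags; the `Token` class, its `kana_fallback` flag, and the use of HTML markup are assumptions made for illustration, since the publication only states that the image output unit 241 generates image data with the word shown in a different format.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Token:
    text: str
    # True when a homophone-ambiguous word was deliberately left in kana
    # (hypothetical flag, not named in the publication).
    kana_fallback: bool = False


def render_line(tokens: List[Token]) -> str:
    """Render one line of minutes, wrapping kana-fallback words in bold italics."""
    parts = []
    for tok in tokens:
        safe = escape(tok.text)  # escape markup characters in the recognized text
        parts.append(f"<b><i>{safe}</i></b>" if tok.kana_fallback else safe)
    return "".join(parts)


line = [Token("次の"), Token("こうせい", kana_fallback=True), Token("を確認")]
print(render_line(line))
```

Carrying the flag on each token keeps the formatting decision separate from the conversion decision, so the variant in which no special format is applied only requires ignoring the flag.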

なお、図２に示した画像は一例であり、表示部２０３上に表示される画像はこれに限らない。例えば、表示部２０３は、音声認識において発音が共通する複数の同音異義語の存在があると判断された語（「こうせい」）であり、かつ、係り受け解析部２２５によって仮名文字によるテキスト情報で出力された語を、他の語とは異なる書式で表示しなくてもよい。 Note that the image shown in FIG. 2 is an example, and the image displayed on the display unit 203 is not limited to this. For example, the display unit 203 need not use a format different from that of other words when displaying a word that has been determined in speech recognition to have multiple homophones with a common pronunciation ("kousei") and that has been output by the dependency analysis unit 225 as text information in kana characters.

＜子機３の表示画面＞
次に、子機３の表示部３０３上に表示される画像例を説明する。
図３は、本実施形態に係る子機３の表示部３０３上に表示される画像例を示す図である。
画像ｇ３０が、子機３の表示部３０３上に表示される画像である。 <Display screen of slave unit 3>
Next, an example of an image displayed on the display unit 303 of the slave unit 3 will be described.
FIG. 3 is a diagram showing an example of an image displayed on the display unit 303 of the slave unit 3 according to the present embodiment.
Image g30 is the image displayed on the display unit 303 of the slave unit 3.

領域ｇ３００は、議事録を表示する領域である。符号ｇ３１１は、第１の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ３２１は、第２の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ３２２は、第２の参加者が発話した内容を音声認識したテキスト情報である。符号ｇ３３１は、第３の参加者が発話した内容を音声認識したテキスト情報である。領域ｇ３０１は、テキスト入力部の領域である。なお、操作部３０２は、表示部３０３上に表示されるソフトウェアキーボードであってもよく、子機３と有線または無線で接続されていてもよい。 Area g300 is an area for displaying minutes. Reference numeral g311 is text information input by the first participant by operating the operation unit (202 or 302). Reference numeral g321 is text information input by the second participant by operating the operation unit (202 or 302). Reference numeral g322 is text information obtained by voice recognition of the content uttered by the second participant. Reference numeral g331 is text information obtained by voice recognition of the content uttered by the third participant. Area g301 is the area of the text input section. Note that the operation unit 302 may be a software keyboard displayed on the display unit 303, or may be connected to the slave unit 3 by wire or wirelessly.

符号ｇ３４１は、図２おける符号ｇ２４１と同様に、第２の参加者が発話した語「こうせい」を仮名文字によるテキスト情報で表示する一例である。図示の例に示すように、表示部３０３は、表示部２０３と同様に、音声認識において発音が共通する複数の同音異義語の存在があると判断された語（「こうせい」）であり、かつ、係り受け解析部２２５によって仮名文字によるテキスト情報で出力された語を、他の語とは異なる書式で表示させてもよい。なお、表示部３０３が上記の表示をするために、画像出力部２４１が、当該語を他の語とは異なる書式で表示させた画像データを生成する。 Like reference numeral g241 in FIG. 2, reference numeral g341 is an example of displaying the word "kousei" uttered by the second participant as text information in kana characters. As shown in the illustrated example, the display unit 303, like the display unit 203, may display a word that has been determined in speech recognition to have multiple homophones with a common pronunciation ("kousei"), and that has been output by the dependency analysis unit 225 as text information in kana characters, in a format different from that of other words. Note that, for the display unit 303 to produce this display, the image output unit 241 generates image data in which the word is shown in a format different from that of other words.

なお、図３に示した画像は一例であり、表示部３０３上に表示される画像はこれに限らない。例えば、表示部３０３は、音声認識において発音が共通する複数の同音異義語の存在があると判断された語（「こうせい」）であり、かつ、係り受け解析部２２５によって仮名文字によるテキスト情報で出力された語を、他の語とは異なる書式で表示しなくてもよい。 Note that the image shown in FIG. 3 is an example, and the image displayed on the display unit 303 is not limited to this. For example, the display unit 303 need not use a format different from that of other words when displaying a word that has been determined in speech recognition to have multiple homophones with a common pronunciation ("kousei") and that has been output by the dependency analysis unit 225 as text information in kana characters.

＜音声認識テキスト表示装置（音声認識テキスト表示システム）１が行う処理＞
次に、音声認識テキスト表示装置（音声認識テキスト表示システム）１が行う処理手順例を説明する。図４は、本実施形態に係る音声認識テキスト表示装置（音声認識テキスト表示システム）１が行う処理手順例を示すフローチャートである。 <Processing performed by voice recognition text display device (voice recognition text display system) 1>
Next, an example of a processing procedure performed by the voice recognition text display device (voice recognition text display system) 1 will be described. FIG. 4 is a flowchart illustrating an example of a processing procedure performed by the voice recognition text display device (voice recognition text display system) 1 according to the present embodiment.

（ステップＳ１）認証部２１１は、操作部（２０２または３０２）の操作内容に基づいて、ログイン処理を行う。例えば、各利用者が、操作部（２０２または３０２）を操作して、利用者を識別する識別情報（利用者ＩＤ）とパスワードを入力すると、認証部２１１は、入力された識別情報及びパスワードに基づいてログイン処理を行う。 (Step S1) The authentication unit 211 performs login processing based on the operation performed on the operation unit (202 or 302). For example, when each user operates the operation unit (202 or 302) to input identification information that identifies the user (a user ID) and a password, the authentication unit 211 performs login processing based on the input identification information and password.

（ステップＳ２）利用者が入力を収音部（２０１または３０１）によって行う場合、取得部２２２は、収音部２０１または通信部２０４が出力する音声信号を取得し、取得した音声信号を音声認識部２２３に出力する。 (Step S2) When the user performs input using the sound collection unit (201 or 301), the acquisition unit 222 acquires the audio signal output by the sound collection unit 201 or the communication unit 204, and outputs the acquired audio signal to the voice recognition unit 223.

（ステップＳ３）音声認識部２２３は、取得部２２２が出力する音声信号を取得し、取得した音声信号に対して音声認識処理を行う。 (Step S3) The voice recognition unit 223 acquires the voice signal output by the acquisition unit 222, and performs voice recognition processing on the acquired voice signal.

（ステップＳ４）テキスト変換部２２４は、音声認識された結果に対してテキスト変換処理を行う。 (Step S4) The text conversion unit 224 performs text conversion processing on the voice recognition result.

（ステップＳ５）係り受け解析部２２５は、テキスト変換されたテキスト情報に対して、発話者毎に係り受け解析と形態素解析処理を行う。 (Step S5) The dependency analysis unit 225 performs dependency analysis and morphological analysis processing for each speaker on the converted text information.

（ステップＳ６）係り受け解析部２２５は、音響モデル・辞書記憶部２２１を参照して、解析されたテキスト情報に含まれる各語について、発音が共通する複数の同音異義語の存在があるか否か判定する。係り受け解析部２２５が、複数の同音異義語の存在があると判定した場合（ステップＳ６；ＹＥＳ）には、ステップＳ７の処理が行われ、複数の同音異義語の存在はないと判定した場合（ステップＳ６；ＮＯ）には、ステップＳ１１の処理が行われる。 (Step S6) The dependency analysis unit 225 refers to the acoustic model/dictionary storage unit 221 and determines, for each word included in the analyzed text information, whether there are multiple homophones with a common pronunciation. If the dependency analysis unit 225 determines that there are multiple homophones (step S6; YES), the process of step S7 is performed; if it determines that there are not multiple homophones (step S6; NO), the process of step S11 is performed.

（ステップＳ７）算出部２２６は、発音が共通する複数の同音異義語の存在があると判定された語について、各同音異義語に対する認識尤度を算出し、算出結果を変換切替部２２７に出力する。 (Step S7) For a word determined to have multiple homophones with a common pronunciation, the calculation unit 226 calculates the recognition likelihood for each homophone and outputs the calculation results to the conversion switching unit 227.

（ステップＳ８）変換切替部２２７は、複数の認識尤度の算出値の最大値が所定値以下であるかを判定する。変換切替部２２７が、最大値が所定値以下であると判定した場合（ステップＳ８；ＹＥＳ）には、ステップＳ９の処理が行われ、最大値が所定値を超えると判定した場合（ステップＳ８；ＮＯ）には、ステップＳ１０の処理が行われる。 (Step S8) The conversion switching unit 227 determines whether the maximum of the calculated recognition likelihoods is less than or equal to a predetermined value. If the conversion switching unit 227 determines that the maximum value is less than or equal to the predetermined value (step S8; YES), the process of step S9 is performed; if it determines that the maximum value exceeds the predetermined value (step S8; NO), the process of step S10 is performed.

（ステップＳ９）係り受け解析部２２５は、発音が共通する複数の同音異義語の存在があると判定された語を、仮名文字によるテキスト情報として議事録作成部２２８に出力する。 (Step S9) The dependency analysis unit 225 outputs words for which it has been determined that there are multiple homophones with the same pronunciation to the minutes creation unit 228 as text information in kana characters.

（ステップＳ１０）係り受け解析部２２５は、発音が共通する複数の同音異義語の存在があると判定された語を、認識尤度が最も大きい同音異義語（漢字）に変換したテキスト情報として議事録作成部２２８に出力する。 (Step S10) The dependency analysis unit 225 outputs the word determined to have multiple homophones with a common pronunciation to the minutes creation unit 228 as text information converted into the homophone (kanji) with the highest recognition likelihood.

（ステップＳ１１）係り受け解析部２２５は、発音が共通する複数の同音異義語の存在はないと判定された語を、漢字によるテキスト情報として議事録作成部２２８に出力する。なお、漢字に変換することのできない語については、仮名文字によるテキスト情報として議事録作成部２２８に出力する。 (Step S11) The dependency analysis unit 225 outputs words for which it has been determined that there are no homophones with the same pronunciation to the minutes creation unit 228 as text information in Kanji. Note that words that cannot be converted into kanji are output to the minutes creation unit 228 as text information in kana characters.

（ステップＳ１２）利用者が入力を操作部（２０２または３０２）によって行う場合、テキスト取得部２３１は、操作部２０２または通信部２０４が出力する操作結果を取得し、取得した結果に基づきテキスト情報を生成し、議事録作成部２２８に出力する。 (Step S12) When the user performs input using the operation unit (202 or 302), the text acquisition unit 231 acquires the operation result output by the operation unit 202 or the communication unit 204, generates text information based on the acquired result, and outputs the text information to the minutes creation unit 228.

（ステップＳ１３）議事録作成部２２８は、係り受け解析部２２５またはテキスト取得部２３１が出力するテキスト情報に基づいて議事録を作成し、画像出力部２４１に出力する。 (Step S13) The minutes creation unit 228 creates minutes based on the text information output by the dependency analysis unit 225 or the text acquisition unit 231, and outputs the minutes to the image output unit 241.

（ステップＳ１４）画像出力部２４１は、議事録作成部２２８が出力する議事録に基づいて、表示部（２０３または３０３）上に表示する画像を生成し、表示部２０３または通信部２０４に出力する。 (Step S14) The image output unit 241 generates an image to be displayed on the display unit (203 or 303) based on the minutes output by the minutes creation unit 228, and outputs the image to the display unit 203 or the communication unit 204.

（ステップＳ１５）表示部（２０３または３０３）は、画像出力部２４１が出力する画像を表示する。 (Step S15) The display unit (203 or 303) displays the image output by the image output unit 241.

音声認識テキスト表示装置（音声認識テキスト表示システム）１は、以下、ステップＳ２～Ｓ１５の処理を繰り返す。
なお、図４の処理は一例であり、これに限らない。 The voice recognition text display device (voice recognition text display system) 1 repeats the processes of steps S2 to S15.
Note that the process in FIG. 4 is an example, and the process is not limited to this.
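The per-word branch of steps S6 to S11 can be sketched in Python as follows. This is only an illustration under stated assumptions: `convert_word`, the `dictionary` mapping from kana readings to kanji candidates, the `score` callback that stands in for the calculation unit 226, and `threshold` are all hypothetical names, and the comparison follows step S8 ("less than or equal to").

```python
from typing import Callable, Dict, List


def convert_word(
    reading: str,
    dictionary: Dict[str, List[str]],
    score: Callable[[str], float],
    threshold: float,
) -> str:
    """Decide how one recognized word is written, following steps S6 to S11."""
    candidates = dictionary.get(reading, [])
    if len(candidates) >= 2:                                 # S6: multiple homophones exist
        likelihoods = {w: score(w) for w in candidates}      # S7: likelihood per homophone
        best = max(likelihoods, key=lambda w: likelihoods[w])
        if likelihoods[best] <= threshold:                   # S8: maximum <= predetermined value
            return reading                                   # S9: output in kana
        return best                                          # S10: output most likely kanji
    # S11: no competing homophones; use kanji if available, otherwise kana.
    return candidates[0] if candidates else reading


# Hypothetical dictionary and scores, for illustration only.
dictionary = {"こうせい": ["構成", "鋼製"], "かいぎ": ["会議"]}
unsure = {"構成": 0.5, "鋼製": 0.5}
sure = {"構成": 0.9, "鋼製": 0.1}

print(convert_word("こうせい", dictionary, lambda w: unsure[w], 0.6))
print(convert_word("こうせい", dictionary, lambda w: sure[w], 0.6))
print(convert_word("かいぎ", dictionary, lambda w: 1.0, 0.6))
```

Passing the scoring function in as a parameter reflects that the description leaves open how the likelihood is obtained (acoustic model, dictionary, surrounding context, or other information).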

以上、本実施形態では、発音が共通する複数の同音異義語の存在がある語を、各同音異義語の認識尤度に応じて、あえて仮名文字で表示するようにした。
これにより、本実施形態によれば、発話者が意図しない漢字に変換されたテキスト情報が表示される可能性を低減できる。これにより、本実施形態によれば、音声信号から変換されたテキスト情報を参加者が理解しやすくすることができる。 As described above, in this embodiment, a word that has a plurality of homophones with a common pronunciation is intentionally displayed in kana characters according to the recognition likelihood of each homophone.
As a result, according to the present embodiment, it is possible to reduce the possibility that text information converted into kanji that is not intended by the speaker will be displayed. Thereby, according to this embodiment, it is possible to make it easier for participants to understand text information converted from an audio signal.

なお、上述した例では、音声認識テキスト表示装置（音声認識テキスト表示システム）１は操作部（２０２または３０２）によるテキスト入力および収音部（２０１または３０１）による音声認識を用いたテキスト入力の双方を許容していたが、これに限らない。例えば、音声認識テキスト表示装置（音声認識テキスト表示システム）１は収音部（２０１または３０１）による音声認識を用いたテキスト入力のみを許容していてもよい。 In the example described above, the voice recognition text display device (voice recognition text display system) 1 allows both text input using the operation unit (202 or 302) and text input using voice recognition via the sound collection unit (201 or 301), but the present invention is not limited to this. For example, the voice recognition text display device (voice recognition text display system) 1 may allow only text input using voice recognition via the sound collection unit (201 or 301).

また、上述した例では、音声認識テキスト表示装置１が親機２および複数の子機３を備える例を説明したが、これに限らない。例えば、音声認識テキスト表示装置１が備える子機３は１つのみでもよく、あるいは、音声認識テキスト表示装置１は子機３を備えていなくてもよい。 Further, in the example described above, the voice recognition text display device 1 includes the base unit 2 and a plurality of slave units 3, but the present invention is not limited to this. For example, the voice recognition text display device 1 may include only one slave device 3, or the voice recognition text display device 1 may not include any slave device 3.

また、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、算出部２２６、変換切替部２２７、議事録作成部２２８、議事録記憶部２２９、テキスト取得部２３１、および画像出力部２４１の各々は、子機３が備えていてもよい。同様に、処理部３０５は、親機２が備えていてもよい。 In addition, each of the authentication unit 211, the acoustic model/dictionary storage unit 221, the acquisition unit 222, the voice recognition unit 223, the text conversion unit 224, the dependency analysis unit 225, the calculation unit 226, the conversion switching unit 227, the minutes creation unit 228, the minutes storage unit 229, the text acquisition unit 231, and the image output unit 241 may be included in the slave unit 3. Similarly, the processing unit 305 may be included in the base unit 2.

また、音声認識テキスト表示装置１の各機能部は親機２および子機３以外の装置に備えられていてもよい。あるいは、音声認識テキスト表示システム１の各機能部は親機２または子機３その他の物理的装置に備えられていなくてもよく、一つまたは複数のサーバやクラウド上に設けられていてもよい。なお、各機能部とは、通信部２０４、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、算出部２２６、変換切替部２２７、議事録作成部２２８、議事録記憶部２２９、テキスト取得部２３１、画像出力部２４１、通信部３０４、および処理部３０５である。 Further, each functional unit of the voice recognition text display device 1 may be provided in a device other than the base unit 2 and the slave unit 3. Alternatively, the functional units of the voice recognition text display system 1 need not be provided in the base unit 2, the slave unit 3, or any other physical device, and may instead be provided on one or more servers or in the cloud. Here, the functional units are the communication unit 204, the authentication unit 211, the acoustic model/dictionary storage unit 221, the acquisition unit 222, the voice recognition unit 223, the text conversion unit 224, the dependency analysis unit 225, the calculation unit 226, the conversion switching unit 227, the minutes creation unit 228, the minutes storage unit 229, the text acquisition unit 231, the image output unit 241, the communication unit 304, and the processing unit 305.

なお、本発明における音声認識テキスト表示装置（音声認識テキスト表示システム）１の機能の全てまたは一部を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより音声認識テキスト表示装置（音声認識テキスト表示システム）１が行う処理の全てまたは一部を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータシステム」は、ホームページ提供環境（あるいは表示環境）を備えたＷＷＷシステムも含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（ＲＡＭ）のように、一定時間プログラムを保持しているものも含むものとする。 Note that a program for realizing all or part of the functions of the speech recognition text display device (speech recognition text display system) 1 of the present invention is recorded on a computer-readable recording medium, and the program is recorded on this recording medium. All or part of the processing performed by the speech recognition text display device (speech recognition text display system) 1 may be performed by loading a program into a computer system and executing it. Note that the "computer system" herein includes hardware such as an OS and peripheral devices. Furthermore, the term "computer system" includes a WWW system equipped with a home page providing environment (or display environment). Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. Furthermore, "computer-readable recording medium" refers to volatile memory (RAM) inside a computer system that serves as a server or client when a program is transmitted via a network such as the Internet or a communication line such as a telephone line. This also includes programs that are retained for a certain period of time.

また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであってもよい。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であってもよい。 Further, the program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line. Moreover, the above-mentioned program may be for realizing a part of the above-mentioned functions. Furthermore, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何等限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形および置換を加えることができる。 Although the mode for carrying out the present invention has been described above using an embodiment, the present invention is in no way limited to this embodiment, and various modifications and substitutions can be made without departing from the gist of the present invention.

１…音声認識テキスト表示装置（音声認識テキスト表示システム）２０３…表示部２２２…取得部２２４…テキスト変換部２２５…係り受け解析部２２６…算出部２２７…変換切替部２４１…画像出力部３０３…表示部 1: voice recognition text display device (voice recognition text display system); 203: display unit; 222: acquisition unit; 224: text conversion unit; 225: dependency analysis unit; 226: calculation unit; 227: conversion switching unit; 241: image output unit; 303: display unit

Claims

A voice recognition text display system that converts a voice signal into text information and displays the text information, the system comprising:
an acquisition unit that acquires the voice signal;
a voice recognition unit that performs voice recognition on the voice signal acquired by the acquisition unit and outputs the text information; and
a display unit that displays the text information,
wherein the voice recognition unit calculates, for a word determined in the voice recognition to have a plurality of homophones with a common pronunciation, a recognition likelihood for each of the plurality of homophones, and switches, according to the magnitudes of the plurality of calculated recognition likelihoods, whether to output the determined word as the text information converted into kanji or as the text information in kana characters.

The voice recognition text display system according to claim 1, wherein the voice recognition unit outputs the determined word as the text information in kana characters, without converting it into kanji, when the maximum value of the plurality of calculated recognition likelihoods is lower than a predetermined value.

The voice recognition text display system according to claim 1 or 2, wherein, when displaying the text information, the display unit displays a word that is the determined word and that has been output by the voice recognition unit as the text information in kana characters, in a format different from that of other words.

A voice recognition text display device that converts a voice signal into text information and displays the text information, the device comprising:
an acquisition unit that acquires the voice signal;
a voice recognition unit that performs voice recognition on the voice signal acquired by the acquisition unit and outputs the text information; and
a display unit that displays the text information,
wherein the voice recognition unit calculates, for a word determined in the voice recognition to have a plurality of homophones with a common pronunciation, a recognition likelihood for each of the plurality of homophones, and switches, according to the magnitudes of the plurality of calculated recognition likelihoods, whether to output the determined word as the text information converted into kanji or as the text information in kana characters.

A voice recognition text display method in a voice recognition text display system that converts a voice signal into text information and displays the text information, the method comprising:
an acquisition step in which an acquisition unit acquires the voice signal;
a voice recognition step in which a voice recognition unit performs voice recognition on the voice signal acquired by the acquisition unit and outputs the text information; and
a display step of displaying the text information,
wherein, in the voice recognition step, the voice recognition unit calculates, for a word determined in the voice recognition to have a plurality of homophones with a common pronunciation, a recognition likelihood for each of the plurality of homophones, and switches, according to the magnitudes of the plurality of calculated recognition likelihoods, whether to output the determined word as the text information converted into kanji or as the text information in kana characters.

A program that causes a voice recognition text display system, which converts a voice signal into text information and displays the text information, to execute:
an acquisition step of acquiring the voice signal;
a voice recognition step of performing voice recognition on the voice signal acquired in the acquisition step and outputting the text information; and
a display step of displaying the text information,
wherein, in the voice recognition step, for a word determined in the voice recognition to have a plurality of homophones with a common pronunciation, a recognition likelihood is calculated for each of the plurality of homophones, and whether to output the determined word as the text information converted into kanji or as the text information in kana characters is switched according to the magnitudes of the plurality of calculated recognition likelihoods.