JP2023148706A

JP2023148706A - Speech recognition result display system, speech recognition result display device, and speech recognition result display method and non-transitory storage medium storing program

Info

Publication number: JP2023148706A
Application number: JP2022056877A
Authority: JP
Inventors: 直亮住田; Naoaki Sumita; 一博中臺; Kazuhiro Nakadai; 雅樹中塚; Masaki NAKATSUKA; 唯周藤; Yui Shudo; 一也眞浦; Kazuya Maura; 恭佑日根野; Kyosuke Hineno; 健人清水; Kento Shimizu
Original assignee: Honda Motor Co Ltd; Honda Sun Co Ltd
Current assignee: Honda Motor Co Ltd; Honda Sun Co Ltd
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2023-10-13

Abstract

To provide a speech recognition result display system, a speech recognition result display device, a speech recognition result display method and a non-transitory storage medium storing program that can suppress text information from becoming hard to read when one utterance section becomes long.

SOLUTION: A speech recognition result display system comprises: an acquisition part; a speech recognition part which performs speech recognition on an utterance speech acquired by the acquisition part and outputs text information; an image generation part which generates image data based upon the text information; and a display part. The image generation part is configured to: when the number of characters of the text information included in a one-utterance section is less than a predetermined value, generate image data for displaying the text information included in the one-utterance section of a user in one group; and otherwise when the number of characters is equal to or larger than the predetermined value, generate the image data for displaying the text information included in the one-utterance section in a plurality of groups, wherein the numbers of characters of the text information included in the plurality of groups are less than the predetermined value.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a speech recognition result display system, a speech recognition result display device, a speech recognition result display method, and a program.

Conventionally, there has been known a device that, when a plurality of participants hold a conference, converts what each participant says (uttered speech) into text and displays it (see, for example, Patent Document 1). Such a device is sometimes used to help hearing-impaired people take part in conferences.

Patent Document 1: JP 2019-179480 A

For example, in the device described in Patent Document 1, if a conference participant keeps speaking for a long time without a break, the displayed text may likewise run on for a long time without a break. In that case, the text may be hard for the other participants to read. The problem is especially pronounced when hearing-impaired people take part in the conference, because they follow the content of the conference by reading the text.

The present invention has been made in view of the above problem, and an object thereof is to provide a speech recognition result display system, a speech recognition result display device, a speech recognition result display method, and a program that can suppress text information from becoming hard to read when one utterance section becomes long.

(1) In order to achieve the above object, a speech recognition result display system (1) according to one aspect of the present invention includes: an acquisition unit (222) that acquires a user's uttered speech; a speech recognition unit (speech recognition unit 223, text conversion unit 224, dependency analysis unit 225) that performs speech recognition on the uttered speech acquired by the acquisition unit and outputs text information; an image generation unit (minutes creation unit 226, image generation unit 241) that generates image data based on the text information; and a display unit (display unit 203, display unit 303) that displays the image data. The image generation unit performs image data generation processing in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, it generates the image data that displays the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, it generates the image data that displays the text information included in the one utterance section divided into a plurality of groups. The number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

(2) In the speech recognition result display system according to one aspect of the present invention, the image generation unit may perform control so that the one utterance section is not divided into the plurality of groups in the middle of a phrase (bunsetsu).

(3) In the speech recognition result display system according to one aspect of the present invention, the image generation unit may attach a text frame to each of the plurality of groups when generating the image data.

(4) In the speech recognition result display system according to one aspect of the present invention, when the text information included in the one utterance section spans a plurality of the text frames, the image generation unit may attach a serial number display to the plurality of text frames.

(5) In the speech recognition result display system according to one aspect of the present invention, when the text information included in the one utterance section spans a plurality of the text frames, the image generation unit may give the plurality of text frames the same color.

(6) In the speech recognition result display system according to one aspect of the present invention, the image generation unit may generate the image data in which the plurality of groups are distinguished from one another by line breaks.

(7) In the speech recognition result display system according to one aspect of the present invention, the image generation unit may perform the image data generation processing each time the text information is output from the speech recognition unit.

(8) In order to achieve the above object, a speech recognition result display device according to one aspect of the present invention includes: an acquisition unit that acquires a user's uttered speech; a speech recognition unit that performs speech recognition on the uttered speech acquired by the acquisition unit and outputs text information; an image generation unit that generates image data based on the text information; and a display unit that displays the image data. The image generation unit performs image data generation processing in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, it generates the image data that displays the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, it generates the image data that displays the text information included in the one utterance section divided into a plurality of groups. The number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

(9) In order to achieve the above object, a speech recognition result display method according to one aspect of the present invention is a speech recognition result display method in a speech recognition result display system, and includes: an acquisition step in which an acquisition unit acquires a user's uttered speech; a speech recognition step in which a speech recognition unit performs speech recognition on the uttered speech acquired by the acquisition unit and outputs text information; an image generation step in which an image generation unit generates image data based on the text information; and a display step in which a display unit displays the image data. In the image generation step, the image generation unit performs image data generation processing in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, it generates the image data that displays the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, it generates the image data that displays the text information included in the one utterance section divided into a plurality of groups. The number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

(10) In order to achieve the above object, a program according to one aspect of the present invention causes a speech recognition result display system to execute: an acquisition step of acquiring a user's uttered speech; a speech recognition step of performing speech recognition on the uttered speech and outputting text information; an image generation step of generating image data based on the text information; and a display step of displaying the image data. In the image generation step, image data generation processing is performed in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, the image data that displays the text information included in the one utterance section as one group is generated, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, the image data that displays the text information included in the one utterance section divided into a plurality of groups is generated. The number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

According to (1), (8), (9), or (10) above, text information can be kept from becoming hard to read when one utterance section becomes long.

According to (2) above, text information divided into a plurality of groups can be made easier to read.
According to (3) above, each group becomes easier to recognize visually, and the text information can be made still easier to read.
According to (4) or (5) above, the groups that correspond to one utterance section become easier to recognize visually, and the text information can be made easier to read more effectively.
According to (6) above, each group becomes easier to recognize visually, and the text information can be made still easier to read.
According to (7) above, what each participant has said can be reflected more quickly in the display on the display unit.
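
The grouping described in aspects (1) to (5) can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `TextFrame`, `split_utterance`, and `make_frames`, the "i/n" serial number format, and the fallback for a single phrase that is itself too long are all assumptions introduced here. The utterance is assumed to arrive already segmented into phrases (bunsetsu).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TextFrame:
    text: str    # text shown inside this frame
    serial: str  # e.g. "2/3"; empty when the utterance fits in one frame
    color: str   # shared by all frames of the same utterance section


def split_utterance(phrases: list[str], limit: int) -> list[str]:
    """Split one utterance section into groups of fewer than `limit` characters.

    `phrases` is the utterance already segmented into phrases, so a group
    boundary normally never falls inside a phrase.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    text = "".join(phrases)
    if len(text) < limit:
        return [text]  # short utterance: display as one group

    groups: list[str] = []
    current = ""
    for phrase in phrases:
        # Assumed fallback: a single phrase that is too long on its own
        # is cut by character count (the source does not cover this case).
        while len(phrase) >= limit:
            if current:
                groups.append(current)
                current = ""
            groups.append(phrase[: limit - 1])
            phrase = phrase[limit - 1:]
        if len(current) + len(phrase) >= limit:
            groups.append(current)  # close the group before it reaches the limit
            current = phrase
        else:
            current += phrase
    if current:
        groups.append(current)
    return groups


def make_frames(phrases: list[str], limit: int, color: str) -> list[TextFrame]:
    """Give every group a frame, a serial number, and one shared color."""
    groups = split_utterance(phrases, limit)
    total = len(groups)
    return [
        TextFrame(group, f"{i}/{total}" if total > 1 else "", color)
        for i, group in enumerate(groups, start=1)
    ]
```

For example, calling `make_frames(["Today ", "we will ", "review ", "the plan."], 16, "blue")` yields three frames, each holding fewer than 16 characters and none of them cut inside a phrase.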

FIG. 1 is a block diagram showing a configuration example of a speech recognition result display device (speech recognition result display system) according to the present embodiment.
FIG. 2 is a diagram showing an example of an image displayed on the display unit of the base unit according to the present embodiment.
FIG. 3 is a diagram showing another example of an image displayed on the display unit of the base unit according to the present embodiment.
FIG. 4 is a diagram showing an example of an image displayed on the display unit of the handset according to the present embodiment.
FIG. 5 is a flowchart showing an example of a processing procedure performed by the speech recognition result display device (speech recognition result display system) according to the present embodiment.

First, an example of a situation in which the speech recognition result display device (speech recognition result display system) of the present embodiment is used will be described.
The speech recognition result display device (speech recognition result display system) of the present embodiment is used in a conference attended by two or more people. The participants may include a person who has difficulty speaking. Each participant who can speak either wears a microphone or uses a terminal equipped with a microphone (a smartphone, a tablet terminal, a personal computer, or the like). Hearing-impaired participants use a terminal on which text can be entered. The speech recognition result display device performs speech recognition on the speech uttered by the participants, converts it into text, and displays the text on each participant's terminal. The speech recognition result display device also displays the text information entered by a hearing-impaired participant as text on each participant's terminal.

FIG. 1 is a block diagram showing a configuration example of a speech recognition result display device (speech recognition result display system) 1 according to the present embodiment.
As shown in FIG. 1, the speech recognition result display device (speech recognition result display system) 1 includes a base unit 2 and handsets 3a, 3b, and so on. When no particular one of the handsets 3a, 3b, and so on is meant, they are simply referred to as the handset 3.
The base unit 2 and the handset 3 are connected via a wired or wireless network 4.

The base unit 2 includes a sound collection unit 201, an operation unit 202, a display unit 203, a communication unit 204, an authentication unit 211, an acoustic model/dictionary storage unit 221, an acquisition unit 222, a speech recognition unit 223, a text conversion unit 224, a dependency analysis unit 225, a minutes creation unit 226, a minutes storage unit 227, a text acquisition unit 231, and an image generation unit 241.

The handset 3 includes a sound collection unit 301, an operation unit 302, a display unit 303, a communication unit 304, and a processing unit 305. The sound collection unit 301, the operation unit 302, the display unit 303, the communication unit 304, and the processing unit 305 are connected via a bus 306.

<Handset 3>
First, the handset 3 will be described.
The handset 3 is, for example, a smartphone, a tablet terminal, or a personal computer. The handset 3 may also include an audio output unit, a motion sensor, a GPS (Global Positioning System), and the like.

The sound collection unit 301 is a microphone. The sound collection unit 301 collects the user's uttered speech, converts the collected speech from an analog signal to a digital signal, and outputs the digitized speech to the processing unit 305.

The operation unit 302 detects a user operation and outputs the detection result to the processing unit 305. The operation unit 302 is, for example, a touch panel sensor provided on the display unit 303, or a keyboard connected by wire or wirelessly.

The processing unit 305 generates setting information based on the result of an operation on the operation unit 302 and outputs the generated setting information to the communication unit 304. The setting information includes identification information of the participant. The setting information may also include information indicating whether the sound collection unit is used and information indicating whether the operation unit is used. The processing unit 305 generates a login instruction based on the result of an operation on the operation unit 302 and outputs the generated login instruction to the communication unit 304. The login instruction includes the identification information of the participant and identification information of the handset 3. The processing unit 305 adds the identification information to text information based on the result of an operation on the operation unit 302 and outputs it to the communication unit 304. The processing unit 305 adds the identification information to the uttered speech output by the sound collection unit 301 and outputs it to the communication unit 304. The processing unit 305 acquires the image data output by the communication unit 304 and outputs the acquired image data to the display unit 303. The processing unit 305 establishes communication with the base unit 2 based on the login permission information output by the communication unit 304. When the processing unit 305 receives a speech restriction instruction (input restriction instruction) from the base unit 2, it may restrict text input. When the processing unit 305 receives a speech restriction instruction from the base unit 2, it may also restrict voice input.

The display unit 303 displays the image data output by the processing unit 305. The display unit 303 is, for example, a liquid crystal display device, an organic EL (electroluminescence) display device, or an electronic ink display device. The images displayed on the display unit 303 will be described later.

The communication unit 304 transmits the setting information output by the processing unit 305 to the base unit 2 via the network 4. The communication unit 304 transmits the login instruction output by the processing unit 305 to the base unit 2 via the network 4. The communication unit 304 transmits the text information or the uttered speech output by the processing unit 305 to the base unit 2 via the network 4. The transmitted text information or uttered speech includes the identification information of the user and the identification information of the handset 3. The communication unit 304 receives the image data transmitted by the base unit 2 and outputs the received image data to the processing unit 305. When the communication unit 304 receives the login permission information transmitted by the base unit 2, it outputs the received login permission information to the processing unit 305.
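
As a rough illustration of how text information or uttered speech could be tagged with the identification information of the user and of the handset 3 before transmission, consider the sketch below. The message layout, the field names, and the use of JSON are assumptions made for illustration; the source does not specify a transport format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HandsetMessage:
    user_id: str     # identification information of the participant
    handset_id: str  # identification information of the handset 3
    kind: str        # "text" or "speech" (hypothetical labels)
    payload: str     # the text, or the speech data encoded as a string


def encode_message(message: HandsetMessage) -> bytes:
    """Serialize a message for sending to the base unit over the network."""
    return json.dumps(asdict(message), ensure_ascii=False).encode("utf-8")


def decode_message(raw: bytes) -> HandsetMessage:
    """Restore a message on the base unit side."""
    return HandsetMessage(**json.loads(raw.decode("utf-8")))
```

Because both identifiers travel with every message, the receiving side can attribute each utterance or text entry to a participant without keeping per-connection state.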

<Base unit 2>
Next, the base unit 2 will be described.
The base unit 2 is, for example, a notebook computer.

The sound collection unit 201 is a microphone. The sound collection unit 201 collects the user's uttered speech, converts the collected speech from an analog signal to a digital signal, and outputs the digitized speech to the acquisition unit 222.

The operation unit 202 detects a user operation and outputs the detection result to the text acquisition unit 231. The operation unit 202 is, for example, a touch panel sensor provided on the display unit 203, or a keyboard. During login processing, the operation unit 202 outputs the result of the detected operation to the authentication unit 211.

The display unit 203 is, for example, a liquid crystal display device, an organic EL display device, or an electronic ink display device. The display unit 203 displays the image data output by the image generation unit 241. The images displayed on the display unit 203 will be described later.

The communication unit 204 receives the uttered speech transmitted by the handset 3 and outputs the received speech to the acquisition unit 222. The communication unit 204 receives the text information transmitted by the handset 3 and outputs the received text information to the text acquisition unit 231. The communication unit 204 receives the login instruction transmitted by the handset 3 and outputs the received login instruction to the authentication unit 211. The communication unit 204 transmits the image data output by the image generation unit 241 to the handset 3 via the network 4. The communication unit 204 transmits the login permission information output by the authentication unit 211 to the handset 3 via the network 4.

The authentication unit 211 determines whether to permit a login based on the identification information of the participant and the identification information of the handset 3 included in the login instruction output by the communication unit 204. When permitting the login, the authentication unit 211 outputs login permission information to the communication unit 204. The authentication unit 211 also determines whether to permit the user of the base unit 2 to log in, based on the result of an operation on the operation unit 202. When permitting that login, the authentication unit 211 outputs login permission information to each functional unit and permits each functional unit to operate. Here, the functional units are the communication unit 204, the authentication unit 211, the acoustic model/dictionary storage unit 221, the acquisition unit 222, the speech recognition unit 223, the text conversion unit 224, the dependency analysis unit 225, the minutes creation unit 226, the minutes storage unit 227, the text acquisition unit 231, and the image generation unit 241.

The acoustic model/dictionary storage unit 221 stores, for example, an acoustic model, a language model, and a word dictionary. The acoustic model is a model based on acoustic feature quantities, and the language model is a model of information about words and how they are ordered. The word dictionary is a dictionary containing a large vocabulary, for example a large-vocabulary word dictionary. The base unit 2 may update the acoustic model/dictionary storage unit 221 by storing words and the like that it does not yet contain. The acoustic model/dictionary storage unit 221 may also hold a separate DB (database) for each kind of conference. For example, a first DB may be for general conferences, a second DB for presentations, and a third DB for international conferences. Using a DB matched to the conference in this way makes it easier to convert homophones and the like correctly.

The acquisition unit 222 acquires the uttered speech output by the sound collection unit 201 or the uttered speech output by the communication unit 204, and outputs the acquired speech to the speech recognition unit 223.

The speech recognition unit 223 acquires the uttered speech output by the acquisition unit 222. The speech recognition unit 223 detects the audio signal of an utterance section from the uttered speech. An utterance section is detected, for example, as a span in which the audio signal is at or above a predetermined threshold. The speech recognition unit 223 may instead detect utterance sections using another well-known technique. The speech recognition unit 223 performs speech recognition on the audio signal of the detected utterance section using a well-known technique, with reference to the acoustic model/dictionary storage unit 221. For example, the speech recognition unit 223 performs speech recognition using the technique disclosed in JP 2015-64554 A. The speech recognition unit 223 outputs the recognition result and the audio signal to the text conversion unit 224. The speech recognition unit 223 outputs the recognition result and the audio signal in association with each other, for example sentence by sentence, utterance section by utterance section, or utterance by utterance.
When uttered speech is input from several sources at the same time, the speech recognition unit 223 performs speech recognition for each sound collection unit (201 or 301), for example by time-division processing. When the microphone is a microphone array, the speech recognition unit 223 also performs well-known speech recognition processing such as sound source separation, sound source localization, and sound source identification.
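
The threshold-based detection of utterance sections mentioned above can be sketched as follows. This is a minimal illustration only: the frame-wise mean absolute amplitude, the frame length, and the function name `detect_utterance_sections` are assumptions, and a practical detector would typically add smoothing or a hangover period.

```python
def detect_utterance_sections(samples, frame_len, threshold):
    """Return (start, end) sample indices of utterance sections.

    A frame counts as speech when its mean absolute amplitude is at or
    above `threshold`; consecutive speech frames form one section.
    """
    sections = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level >= threshold:
            if start is None:
                start = i  # a new utterance section begins here
        elif start is not None:
            sections.append((start, i))  # the section ended before this frame
            start = None
    if start is not None:
        sections.append((start, len(samples)))  # section runs to the end
    return sections
```

Each returned pair marks where one utterance section begins and ends, which is the unit whose character count is later compared with the predetermined value.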

The text conversion unit 224 converts the recognition result output by the speech recognition unit 223 into text with reference to the acoustic model/dictionary storage unit 221. The text information contains at least one character. The text conversion unit 224 outputs the converted text information and the acquired audio signal to the dependency analysis unit 225. When converting the recognized utterance into text, the text conversion unit 224 may delete interjections such as "あー" (ah), "えーと" (um), "えー" (er), and "まあ" (well).
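
Deleting interjections during text conversion can be sketched as below. The sketch assumes the recognition result is already available as a list of word tokens, which the source does not state; matching whole tokens rather than substrings avoids deleting characters from inside ordinary words. The names `FILLERS` and `remove_interjections` are hypothetical.

```python
# Interjections listed in the description; the set can be extended as needed.
FILLERS = {"あー", "えーと", "えー", "まあ"}


def remove_interjections(tokens):
    """Join the tokens into text, skipping tokens that are interjections."""
    return "".join(token for token in tokens if token not in FILLERS)
```

For example, `remove_interjections(["えーと", "今日は", "まあ", "晴れです"])` returns `"今日は晴れです"`.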

The dependency analysis unit 225 performs morphological analysis and dependency analysis on the text information output by the text conversion unit 224 or the text information output by the communication unit 204, with reference to the acoustic model/dictionary storage unit 221. For the dependency analysis, SVMs (support vector machines) are used, for example, within a shift-reduce method, a spanning tree method, or a method that applies chunk identification in stages.
The dependency analysis unit 225 outputs the text information that has undergone morphological analysis and dependency analysis, together with the analysis results, to the minutes creation unit 226. When the dependency analysis unit 225 has acquired the audio signal output by the text conversion unit 224, it also outputs that audio signal to the minutes creation unit 226.

議事録作成部２２６は、係り受け解析部２２５またはテキスト取得部２３１が出力したテキスト情報に基づいて、発話者毎に分けて、議事録を作成する。議事録作成部２２６は、作成した議事録と対応する音声信号を議事録記憶部２２７に記憶させる。また、議事録作成部２２６は、作成した議事録を画像生成部２４１に出力する。なお、議事録作成部２２６は、「あー」、「えーと」、「えー」、「まあ」等の間投詞を削除して議事録を作成するようにしてもよい。 The minutes creation unit 226 creates minutes for each speaker based on the text information output by the dependency analysis unit 225 or the text acquisition unit 231. The minutes creation unit 226 causes the minutes storage unit 227 to store the audio signal corresponding to the created minutes. Further, the minutes creation unit 226 outputs the created minutes to the image generation unit 241. Note that the minutes creation unit 226 may create the minutes by deleting interjections such as "ah," "um," "um," and "well."

議事録記憶部２２７は、議事録と音声信号を対応つけて記憶する。 The minutes storage unit 227 stores minutes and audio signals in association with each other.

テキスト取得部２３１は、操作部２０２が出力する操作結果、または通信部２０４が出力する操作部３０２の操作結果を取得し、取得した結果に基づいてテキスト情報を生成する。テキスト取得部２３１は、生成したテキスト情報を議事録作成部２２６に出力する。 The text acquisition unit 231 acquires the operation result output by the operation unit 202 or the operation result of the operation unit 302 output by the communication unit 204, and generates text information based on the acquired result. The text acquisition unit 231 outputs the generated text information to the minutes creation unit 226.

画像生成部２４１は、議事録作成部２２６が出力する議事録情報を取得する。画像生成部２４１は、議事録情報に基づいて画像データを生成する。
画像生成部２４１は、ユーザの発話音声に基づくテキスト情報（議事録情報）について、ユーザの一発話区間に含まれるテキスト情報の文字数に応じた画像データを生成する。つまり、一発話区間に含まれる文字数が所定値未満である場合には、画像生成部２４１は、一発話区間に含まれるテキスト情報を一つのまとまりとして表示する画像データを生成する。一方、一発話区間に含まれるテキスト情報の文字数が所定値以上である場合には、画像生成部２４１は、一発話区間に含まれるテキスト情報を複数のまとまりに分けて表示する画像データを生成する。この場合において、画像生成部２４１は、複数のまとまりの各々に含まれるテキスト情報の文字数が所定値未満となるように、画像データを生成する。なお、「一発話区間」とは、あるユーザが、所定時間以上の発話の中断を挟むことなく、発話を継続し続けた区間（時間）を意味する。「所定値」は、例えば５００文字程度であってもよい。
画像生成部２４１は、上記に基づき生成した画像データを表示部２０３と通信部２０４に出力する。 The image generation unit 241 acquires the minutes information output by the minutes creation unit 226. The image generation unit 241 generates image data based on the minutes information.
The image generation unit 241 generates, for text information (minutes information) based on the user's uttered speech, image data that depends on the number of characters of the text information included in one utterance section of the user. That is, when the number of characters included in one utterance section is less than a predetermined value, the image generation unit 241 generates image data that displays the text information included in the one utterance section as one group. On the other hand, when the number of characters of the text information included in one utterance section is equal to or greater than the predetermined value, the image generation unit 241 generates image data that displays the text information included in the one utterance section divided into a plurality of groups. In this case, the image generation unit 241 generates the image data such that the number of characters of the text information included in each of the plurality of groups is less than the predetermined value. Note that "one utterance section" means a section (time) during which a given user keeps speaking without a pause in speech lasting a predetermined time or longer. The "predetermined value" may be, for example, about 500 characters.
The image generation unit 241 outputs the image data generated based on the above to the display unit 203 and the communication unit 204.
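The grouping rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_into_groups` is hypothetical, the split is done purely by character count, and the small limit in the example stands in for the "about 500 characters" mentioned as one possible predetermined value.

```python
from typing import List


def split_into_groups(text: str, limit: int) -> List[str]:
    """Group the text of one utterance section.

    Fewer than `limit` characters: one group.
    Otherwise: several groups, each with fewer than `limit` characters.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    if len(text) < limit:
        return [text]
    size = limit - 1  # largest group size that is still below the limit
    return [text[i:i + size] for i in range(0, len(text), size)]


print(split_into_groups("abc", 5))         # ['abc']
print(split_into_groups("abcdefghij", 5))  # ['abcd', 'efgh', 'ij']
```

Cutting at fixed offsets is the simplest way to satisfy the character-count condition; it ignores linguistic boundaries.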

＜親機２の表示画像＞
次に、親機２の表示部２０３上に表示される画像例を説明する。
図２は、本実施形態に係る親機２の表示部２０３上に表示される画像例を示す図である。
画像ｇ１０が、親機２の表示部２０３上に表示される画像である。 <Display image of base unit 2>
Next, an example of an image displayed on the display unit 203 of the base device 2 will be explained.
FIG. 2 is a diagram showing an example of an image displayed on the display unit 203 of the base device 2 according to the present embodiment.
Image g10 is an image displayed on display unit 203 of base device 2.

領域ｇ１００は、参加者情報編集を行う領域である。
領域ｇ１０１は、参加者情報の領域である。符号ｇ１０２は、参加者の名前である。符号ｇ１０３は、参加者が親機２の操作部２０２または子機３の操作部３０２によってテキスト入力を行うことを示すアイコンである。符号ｇ１０４は、参加者が親機２の収音部２０１または子機３の収音部３０１によって発話を行うことを示すアイコンである。符号ｇ１０５は、参加者が使用するマイクロフォンの番号（または識別情報）である。 Area g100 is an area where participant information is edited.
Area g101 is an area for participant information. The code g102 is the name of the participant. Reference numeral g103 is an icon indicating that the participant inputs text using the operation unit 202 of the base unit 2 or the operation unit 302 of the slave unit 3. Reference numeral g104 is an icon indicating that the participant speaks using the sound collection unit 201 of the base unit 2 or the sound collection unit 301 of the slave unit 3. The code g105 is the number (or identification information) of the microphone used by the participant.

領域ｇ２００は、議事録を表示する領域である。なお、図２では、ログイン後の状態を示している。符号ｇ２０１は、ログイン／ログアウトのボタン画像である。符号ｇ２０２は、音声認識結果表示装置（音声認識結果表示システム）１の開始／終了のボタン画像である。符号ｇ２０３は、音声認識結果表示装置（音声認識結果表示システム）１の使用中に点灯する表示である。符号ｇ２０４は、議事録記憶部２２７が記憶する議事録の表示や音声信号の再生を行うボタン画像である。符号ｇ２０５は、親機２の利用者が収音部２０１の使用有無を選択するボタン画像である。 Area g200 is an area for displaying minutes. Note that FIG. 2 shows the state after login. Symbol g201 is a login/logout button image. Reference numeral g202 is a start/end button image of the speech recognition result display device (speech recognition result display system) 1. Reference numeral g203 is a display that lights up while the speech recognition result display device (speech recognition result display system) 1 is in use. Reference numeral g204 is a button image for displaying the minutes stored in the minutes storage unit 227 and reproducing the audio signal. Reference numeral g205 is a button image through which the user of the base unit 2 selects whether or not to use the sound collection section 201.

符号ｇ２１１は、第１の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。図２の例に示すように、表示部２０３が表示する画像ｇ１０において、テキスト情報には、当該テキスト情報を囲む矩形のテキスト枠が付与される。以下で説明するテキスト情報についても同様にテキスト枠が付与されている。符号ｇ２１２は、第１の参加者が操作部（２０２または３０２）を操作して入力した絵文字である。符号ｇ２１３は、第１の参加者がテキスト情報および絵文字を入力した日時を示す情報である。符号ｇ２１４は、第１の参加者の名前である。 The code g211 is text information input by the first participant by operating the operation unit (202 or 302). As shown in the example of FIG. 2, in the image g10 displayed by the display unit 203, text information is provided with a rectangular text frame surrounding the text information. A text frame is similarly provided for the text information described below. The symbol g212 is a pictogram input by the first participant by operating the operation unit (202 or 302). The code g213 is information indicating the date and time when the first participant inputted the text information and pictograms. Code g214 is the name of the first participant.

符号ｇ２２１は、第２の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ２２２および符号ｇ２２３は、第２の参加者が発話した内容を音声認識したテキスト情報である。符号ｇ２２４は、操作部（２０２または３０２）を操作してテキストを入力したことを示すアイコンである。符号ｇ２２５は、収音部（２０１または３０１）によって発話を入力したことを示すアイコンである。符号ｇ２３１は、第３の参加者が発話した内容を音声認識したテキスト情報である。 The code g221 is text information input by the second participant by operating the operation unit (202 or 302). Code g222 and code g223 are text information obtained by voice recognition of the content uttered by the second participant. The symbol g224 is an icon indicating that text has been input by operating the operation unit (202 or 302). The symbol g225 is an icon indicating that speech has been input by the sound collection unit (201 or 301). The code g231 is text information obtained by voice recognition of the content uttered by the third participant.

ここで、符号ｇ２２２に係るテキスト情報「今日の議題は、今度の展示会についてです。」は、ユーザ（藤沢）の一発話区間に含まれており、かつ、その文字数が所定値未満であったために、１つのまとまり（テキスト枠）として表示されている。一方、符号ｇ２２３に係るテキスト情報「前回の会議では、・・・お聞かせください。」は、ユーザ（藤沢）の一発話区間に含まれているものの、その文字数が所定値以上であったために、符号ｇ２２３ａ～ｇ２２３ｄで示す複数のまとまり（テキスト枠）に分けて表示されている。また、符号ｇ２２３ａ～ｇ２２３ｄで示す各テキスト枠に含まれる文字数は、所定値未満となっている。 Here, the text information denoted by g222, "Today's topic is the upcoming exhibition.", is included in one utterance section of the user (Fujisawa) and its number of characters is less than the predetermined value, so it is displayed as one group (text frame). On the other hand, the text information denoted by g223, "At the last meeting, ... please let us know.", is also included in one utterance section of the user (Fujisawa), but its number of characters is equal to or greater than the predetermined value, so it is displayed divided into a plurality of groups (text frames) indicated by g223a to g223d. The number of characters included in each of the text frames indicated by g223a to g223d is less than the predetermined value.

図２の例に示すように、表示部２０３は、符号ｇ２２３ａ～ｇ２２３ｄで示す各テキスト枠に符号ｇ２４１～ｇ２４４で示す連番表示を付与した画像データを表示してもよい。また、表示部２０３は、符号ｇ２２３ａ～ｇ２２３ｄで示す各テキスト枠の色が統一された画像データを表示してもよい。 As shown in the example of FIG. 2, the display unit 203 may display image data in which each text frame indicated by symbols g223a to g223d is given a serial number display indicated by symbols g241 to g244. Further, the display unit 203 may display image data in which each text frame indicated by symbols g223a to g223d has a uniform color.

なお、表示部２０３が上記の表示をするために、画像生成部２４１は、各まとまりにテキスト枠を付与した画像データを生成する。画像生成部２４１は、各テキスト枠に連番表示を付与したり、各テキスト枠の色が統一されたりした画像データを生成してもよい。 Note that in order for the display unit 203 to display the above, the image generation unit 241 generates image data in which each group is given a text frame. The image generation unit 241 may generate image data in which sequential numbers are assigned to each text frame or the colors of each text frame are unified.
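One possible data representation for the text frames, sequential number display, and unified frame color is sketched below. The `TextFrame` class, the `make_frames` function, and the color string are hypothetical names introduced for illustration; the description does not define how the image data is structured internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextFrame:
    text: str
    color: str               # frame color shared by all frames of one utterance section
    sequence: Optional[int]  # sequential number, or None for a single frame


def make_frames(groups: List[str], color: str) -> List[TextFrame]:
    """Build one frame per group; number the frames only when there are several."""
    numbered = len(groups) > 1
    return [
        TextFrame(text=group, color=color, sequence=index if numbered else None)
        for index, group in enumerate(groups, start=1)
    ]


for frame in make_frames(["first part,", "second part,", "third part."], "#cce5ff"):
    print(frame.sequence, frame.color, frame.text)
```

Sharing one color and a running number across the frames lets a reader see that they belong to the same utterance section.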

また、画像生成部２４１は、一発話区間に対応するテキスト情報が文節の途中で複数のまとまり（テキスト枠）に分割されないように画像データの生成を制御してもよい。あるいは、画像生成部２４１は、一発話区間に対応するテキスト情報を複数のまとまり（テキスト枠）に分割する際に、句点または読点が存在する位置でテキスト情報を分割するように画像データの生成を制御してもよい。 Furthermore, the image generation unit 241 may control the generation of image data so that the text information corresponding to one utterance section is not divided into a plurality of groups (text frames) in the middle of a clause. Alternatively, when dividing the text information corresponding to one utterance section into a plurality of groups (text frames), the image generation unit 241 may control the generation of image data so that the text information is divided at a position where a period or comma is present.
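Dividing at a period or comma can be sketched with a greedy packing of punctuation-delimited segments. This is an assumption-laden illustration: the function name `split_at_punctuation`, the greedy strategy, and the character-count fallback for a segment that is too long on its own are not taken from the description, and the sample sentence is invented.

```python
import re
from typing import List


def split_at_punctuation(text: str, limit: int) -> List[str]:
    """Split text into groups shorter than `limit`, cutting after "、" or "。"."""
    if limit < 2:
        raise ValueError("limit must be at least 2")
    if len(text) < limit:
        return [text]
    # Each segment ends right after a comma or period; a trailing remainder is kept.
    segments = re.findall(r"[^、。]*[、。]|[^、。]+", text)
    groups: List[str] = []
    current = ""
    for segment in segments:
        if len(current) + len(segment) < limit:
            current += segment
            continue
        if current:
            groups.append(current)
        # Fallback: a segment that is too long by itself is cut by character count.
        while len(segment) >= limit:
            groups.append(segment[:limit - 1])
            segment = segment[limit - 1:]
        current = segment
    if current:
        groups.append(current)
    return groups


sample = "前回の会議では、展示の内容を決めました。今回は、担当を決めます。"
for group in split_at_punctuation(sample, limit=15):
    print(group)
```

Avoiding a cut in the middle of a clause would instead need clause boundaries, for example from the dependency analysis unit 225, which this sketch does not model.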

なお、図２に示した画像は一例であり、表示部２０３上に表示される画像はこれに限らない。例えば、表示部２０３は、図３に示す画像ｇ１０ａのように、改行を行うことによって複数のまとまり（符号ｇ２２３Ａ～ｇ２２３Ｄ）の各々を区別する画像データを表示してもよい。この場合、画像生成部２４１は、改行を行うことによって複数のまとまり（符号ｇ２２３Ａ～ｇ２２３Ｄ）の各々を区別する画像データを生成する。 Note that the image shown in FIG. 2 is an example, and the image displayed on the display unit 203 is not limited to this. For example, the display unit 203 may display image data such as image g10a shown in FIG. 3, which distinguishes each of a plurality of groups (signs g223A to g223D) by performing a line break. In this case, the image generation unit 241 generates image data that distinguishes each of the plurality of groups (symbols g223A to g223D) by performing line breaks.

また、テキスト枠の形状は矩形に限られず、例えば多角形、円形、楕円形その他の形状であってもよい。あるいは、各まとまりにテキスト枠が付与されなくてもよい。 Furthermore, the shape of the text frame is not limited to a rectangle, and may be, for example, a polygon, circle, oval, or other shape. Alternatively, each group may not be provided with a text frame.

＜子機３の表示画面＞
次に、子機３の表示部３０３上に表示される画像例を説明する。
図４は、本実施形態に係る子機３の表示部３０３上に表示される画像例を示す図である。
画像ｇ３０が、子機３の表示部３０３上に表示される画像である。 <Display screen of handset 3>
Next, an example of an image displayed on the display unit 303 of the handset 3 will be described.
FIG. 4 is a diagram showing an example of an image displayed on the display unit 303 of the handset 3 according to the present embodiment.
Image g30 is an image displayed on display unit 303 of handset 3.

領域ｇ３００は、議事録を表示する領域である。符号ｇ３１１は、第１の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ３２１は、第２の参加者が操作部（２０２または３０２）を操作して入力したテキスト情報である。符号ｇ３２２および符号ｇ３２３は、第２の参加者が発話した内容を音声認識したテキスト情報である。符号ｇ３３１は、第３の参加者が発話した内容を音声認識したテキスト情報である。領域ｇ３０１は、テキスト入力部の領域である。なお、操作部３０２は、表示部３０３上に表示されるソフトウェアキーボードであってもよく、子機３と有線または無線で接続されていてもよい。 Area g300 is an area for displaying minutes. The code g311 is text information input by the first participant by operating the operation unit (202 or 302). The code g321 is text information input by the second participant by operating the operation unit (202 or 302). Code g322 and code g323 are text information obtained by voice recognition of the content uttered by the second participant. The code g331 is text information obtained by voice recognition of the content uttered by the third participant. Area g301 is an area for a text input section. Note that the operation unit 302 may be a software keyboard displayed on the display unit 303, or may be connected to the handset 3 by wire or wirelessly.

ここで、符号ｇ３２２に係るテキスト情報「今日の議題は、今度の展示会についてです。」は、ユーザ（藤沢）の一発話区間に含まれており、かつ、その文字数が所定値未満であったために、１つのまとまり（テキスト枠）として表示されている。一方、符号ｇ３２３に係るテキスト情報「前回の会議では、・・・お聞かせください。」は、ユーザ（藤沢）の一発話区間に含まれているものの、その文字数が所定値以上であったために、符号ｇ３２３ａ～ｇ３２３ｃで示す複数のまとまり（テキスト枠）に分けて表示されている。また、符号ｇ３２３ａ～ｇ３２３ｃで示す各テキスト枠に含まれる文字数は、所定値未満となっている。 Here, the text information denoted by g322, "Today's topic is the upcoming exhibition.", is included in one utterance section of the user (Fujisawa) and its number of characters is less than the predetermined value, so it is displayed as one group (text frame). On the other hand, the text information denoted by g323, "At the last meeting, ... please let us know.", is also included in one utterance section of the user (Fujisawa), but its number of characters is equal to or greater than the predetermined value, so it is displayed divided into a plurality of groups (text frames) indicated by g323a to g323c. The number of characters included in each of the text frames indicated by g323a to g323c is less than the predetermined value.

図４の例に示すように、表示部３０３は、符号ｇ３２３ａ～ｇ３２３ｃで示す各テキスト枠に符号ｇ３４１～ｇ３４３で示す連番表示を付与した画像データを表示してもよい。また、表示部３０３は、符号ｇ３２３ａ～ｇ３２３ｃで示す各テキスト枠の色が統一された画像データを表示してもよい。 As shown in the example of FIG. 4, the display unit 303 may display image data in which each text frame indicated by symbols g323a to g323c is given a serial number display indicated by symbols g341 to g343. Further, the display unit 303 may display image data in which each text frame indicated by symbols g323a to g323c has a uniform color.

なお、表示部３０３が上記の表示をするために、画像生成部２４１は、各まとまりにテキスト枠を付与した画像データを生成する。画像生成部２４１は、各テキスト枠に連番表示を付与したり、各テキスト枠の色が統一されたりした画像データを生成してもよい。 Note that in order for the display unit 303 to display the above, the image generation unit 241 generates image data in which each group is given a text frame. The image generation unit 241 may generate image data in which sequential numbers are assigned to each text frame or the colors of each text frame are unified.

また、画像生成部２４１は、一発話区間に対応するテキスト情報が文節の途中で複数のまとまり（テキスト枠）に分割されないように画像データの生成を制御してもよい。あるいは、画像生成部２４１は、一発話区間に対応するテキスト情報を複数のまとまり（テキスト枠）に分割する際に、句点または読点が存在する位置でテキスト情報を分割するように画像データの生成を制御してもよい。 Furthermore, the image generation unit 241 may control the generation of image data so that the text information corresponding to one utterance section is not divided into a plurality of groups (text frames) in the middle of a clause. Alternatively, when dividing the text information corresponding to one utterance section into a plurality of groups (text frames), the image generation unit 241 may control the generation of image data so that the text information is divided at a position where a period or comma is present.

図２および図４に示すように、親機２の表示部２０３上に表示される画像と子機３の表示部３０３上に表示される画像とで、一発話区間に対応するテキスト情報を複数のまとまり（テキスト枠）に分割する位置は異なっていてもよい。例えば、画像生成部２４１が一発話区間に対応するテキスト情報を複数のまとまりに分けるか否かの基準となる「所定値」（先述）は、親機２と子機３とで互いに異なっていてもよい。同様に、テキスト情報を複数のまとまりに分割する位置および上記「所定値」は子機３ａ、子機３ｂ・・・毎に異なっていてもよい。 As shown in FIGS. 2 and 4, the positions at which the text information corresponding to one utterance section is divided into a plurality of groups (text frames) may differ between the image displayed on the display unit 203 of the base unit 2 and the image displayed on the display unit 303 of the handset 3. For example, the "predetermined value" (described earlier), which is the criterion by which the image generation unit 241 decides whether to divide the text information corresponding to one utterance section into a plurality of groups, may differ between the base unit 2 and the handset 3. Similarly, the positions at which the text information is divided into a plurality of groups and the above "predetermined value" may differ for each of the handsets 3a, 3b, and so on.
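A device-specific predetermined value can be sketched as a lookup table consulted when grouping the same text for each display. The device keys, the limit values, and the function names are hypothetical; the description only states that the value may differ between the base unit 2 and the handsets.

```python
from typing import Dict, List


def split_into_groups(text: str, limit: int) -> List[str]:
    """One group when len(text) < limit, otherwise groups shorter than limit."""
    if len(text) < limit:
        return [text]
    size = limit - 1
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical per-device limits; a smaller screen gets a smaller value.
DEVICE_LIMITS: Dict[str, int] = {"base_unit_2": 12, "handset_3a": 8, "handset_3b": 6}


def layouts_per_device(text: str, limits: Dict[str, int]) -> Dict[str, List[str]]:
    """Group the same utterance text separately for every display."""
    return {device: split_into_groups(text, limit) for device, limit in limits.items()}


for device, groups in layouts_per_device("abcdefghij", DEVICE_LIMITS).items():
    print(device, groups)
```

With these sample limits the base unit shows the text as one group while each handset shows two, which mirrors the differing frame counts between FIG. 2 and FIG. 4.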

なお、図４に示した画像は一例であり、表示部３０３上に表示される画像はこれに限らない。例えば、表示部３０３は、改行を行うことによって複数のまとまりの各々を区別する画像データを表示してもよい。この場合、画像生成部２４１は、改行を行うことによって複数のまとまりの各々を区別する画像データを生成する。 Note that the image shown in FIG. 4 is an example, and the image displayed on the display unit 303 is not limited to this. For example, the display unit 303 may display image data that distinguishes each of a plurality of groups by performing a line break. In this case, the image generation unit 241 generates image data that distinguishes each of the plurality of groups by performing line breaks.

＜音声認識結果表示装置（音声認識結果表示システム）１が行う処理＞
次に、音声認識結果表示装置（音声認識結果表示システム）１が行う処理手順例を説明する。図５は、本実施形態に係る音声認識結果表示装置（音声認識結果表示システム）１が行う処理手順例を示すフローチャートである。なお、以下では、ユーザがテキスト情報の入力を収音部（２０１または３０１）によって行う場合を説明する。 <Processing performed by speech recognition result display device (speech recognition result display system) 1>
Next, an example of a processing procedure performed by the speech recognition result display device (speech recognition result display system) 1 will be described. FIG. 5 is a flowchart showing an example of a processing procedure performed by the speech recognition result display device (speech recognition result display system) 1 according to the present embodiment. In the following, a case will be described in which the user inputs text information using the sound collection unit (201 or 301).

（ステップＳ１）認証部２１１は、操作部（２０２または３０２）の操作内容に基づいて、ログイン処理を行う。例えば、各利用者が、操作部（２０２または３０２）を操作して、利用者を識別する識別情報（利用者ＩＤ）とパスワードを入力すると、認証部２１１は、入力された識別情報及びパスワードに基づいてログイン処理を行う。 (Step S1) The authentication unit 211 performs login processing based on the operation content of the operation unit (202 or 302). For example, when each user operates the operation unit (202 or 302) and inputs identification information (user ID) that identifies the user and a password, the authentication unit 211 performs login processing based on the input identification information and password.

（ステップＳ２）取得部２２２は、収音部２０１または通信部２０４が出力するユーザの発話音声を取得し、取得した発話音声を音声認識部２２３に出力する。 (Step S2) The acquisition unit 222 acquires the user's utterance that is output by the sound collection unit 201 or the communication unit 204, and outputs the acquired utterance to the voice recognition unit 223.

（ステップＳ３）音声認識部２２３は、取得部２２２が出力する発話音声を取得し、取得した発話音声に対して音声認識処理を行う。 (Step S3) The speech recognition unit 223 acquires the uttered voice outputted by the acquisition unit 222, and performs voice recognition processing on the acquired uttered voice.

（ステップＳ４）テキスト変換部２２４は、音声認識された結果に対してテキスト変換処理を行う。 (Step S4) The text conversion unit 224 performs text conversion processing on the voice recognition result.

（ステップＳ５）係り受け解析部２２５は、テキスト変換されたテキスト情報に対して、発話者毎に係り受け解析と形態素解析処理を行う。 (Step S5) The dependency analysis unit 225 performs dependency analysis and morphological analysis processing for each speaker on the converted text information.

（ステップＳ６）係り受け解析部２２５は、係り受け解析および形態素解析処理を行ったテキスト情報を議事録作成部２２６に出力する。 (Step S6) The dependency analysis unit 225 outputs the text information subjected to the dependency analysis and morphological analysis processing to the minutes creation unit 226.

（ステップＳ７）議事録作成部２２６は、係り受け解析部２２５が出力するテキスト情報に基づいて議事録を作成し、画像生成部２４１に出力する。 (Step S7) The minutes creation unit 226 creates minutes based on the text information output by the dependency analysis unit 225, and outputs the minutes to the image generation unit 241.

（ステップＳ８）画像生成部２４１は、一発話区間に含まれるテキスト情報の文字数が所定値以上であるかを判定する。画像生成部２４１が、一発話区間に含まれるテキスト情報の文字数が所定値以上であると判定した場合（ステップＳ８；ＹＥＳ）には、ステップＳ９の処理が行われ、一発話区間に含まれるテキスト情報の文字数が所定値未満であると判定した場合（ステップＳ８；ＮＯ）には、ステップＳ１０の処理が行われる。なお、上記の判断は、議事録作成部２２６が行ってもよい。 (Step S8) The image generation unit 241 determines whether the number of characters of the text information included in one utterance section is equal to or greater than a predetermined value. If the image generation unit 241 determines that the number of characters of the text information included in one utterance section is equal to or greater than the predetermined value (step S8; YES), the process of step S9 is performed. If it determines that the number of characters of the text information included in one utterance section is less than the predetermined value (step S8; NO), the process of step S10 is performed. Note that the above determination may be made by the minutes creation unit 226.

（ステップＳ９）画像生成部２４１は、一発話区間に含まれるテキスト情報を複数のまとまりに分けて表示する画像データを生成し、表示部２０３または通信部２０４に出力する。ステップＳ９において、画像生成部２４１は、複数のまとまりの各々に含まれるテキスト情報の文字数が所定値未満となるように、画像データを生成する。 (Step S9) The image generation unit 241 generates image data that displays text information included in one utterance section divided into a plurality of groups, and outputs it to the display unit 203 or the communication unit 204. In step S9, the image generation unit 241 generates image data such that the number of characters of text information included in each of the plurality of groups is less than a predetermined value.

（ステップＳ１０）画像生成部２４１は、一発話区間に含まれるテキスト情報を一つのまとまりとして表示する画像データを生成し、表示部２０３または通信部２０４に出力する。 (Step S10) The image generation unit 241 generates image data that displays text information included in one utterance section as one group, and outputs it to the display unit 203 or the communication unit 204.

（ステップＳ１１）表示部（２０３または３０３）は、画像生成部２４１が出力する画像を表示する。 (Step S11) The display unit (203 or 303) displays the image output by the image generation unit 241.

音声認識結果表示装置（音声認識結果表示システム）１は、以下、ステップＳ２～Ｓ１１の処理を繰り返す。なお、ステップＳ７～Ｓ９の処理（画像データ生成処理）は、係り受け解析部２２５がテキスト情報を出力する度に行われてもよい。言い換えれば、ステップＳ２～Ｓ１１の処理は、一発話区間の途中でリアルタイムに繰り返されてもよい。つまり、画像生成部２４１による画像の生成はリアルタイムに繰り返され、表示部（２０３または３０３）によって表示される画像はリアルタイムに更新され続けてもよい。
なお、図５の処理は一例であり、これに限らない。 The speech recognition result display device (speech recognition result display system) 1 repeats the processing of steps S2 to S11. Note that the processing in steps S7 to S9 (image data generation processing) may be performed every time the dependency analysis unit 225 outputs text information. In other words, the processes of steps S2 to S11 may be repeated in real time during one speech section. That is, image generation by the image generation unit 241 may be repeated in real time, and the image displayed by the display unit (203 or 303) may continue to be updated in real time.
Note that the process in FIG. 5 is an example, and the process is not limited to this.
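The real-time repetition described above can be sketched as regrouping the accumulated text every time a new piece of recognized text arrives within one utterance section. The generator name `render_updates`, the fragment list, and the small limit are hypothetical and serve only to show how the displayed groups may change while the user is still speaking.

```python
from typing import Iterable, Iterator, List


def split_into_groups(text: str, limit: int) -> List[str]:
    """One group when len(text) < limit, otherwise groups shorter than limit."""
    if len(text) < limit:
        return [text]
    size = limit - 1
    return [text[i:i + size] for i in range(0, len(text), size)]


def render_updates(fragments: Iterable[str], limit: int) -> Iterator[List[str]]:
    """Yield the grouping to display after each newly recognized fragment."""
    text_so_far = ""
    for fragment in fragments:
        text_so_far += fragment  # the utterance section is still continuing
        yield split_into_groups(text_so_far, limit)


for groups in render_updates(["ab", "cd", "efg"], limit=4):
    print(groups)
# ['ab']
# ['abc', 'd']
# ['abc', 'def', 'g']
```

Recomputing the grouping from the full accumulated text keeps each update independent of the previous one, at the cost of repeating work on every fragment.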

以上、本実施形態では、一発話区間が長くなった際に当該一発話区間に含まれるテキスト情報を複数のまとまりに分割して表示するようにした。
これにより、本実施形態によれば、一発話区間が長くなった際にテキスト情報が読みにくくなることを抑制できる。 As described above, in this embodiment, when one utterance section becomes long, the text information included in the one utterance section is divided into a plurality of groups and displayed.
As a result, according to the present embodiment, it is possible to suppress text information from becoming difficult to read when one utterance section becomes long.

なお、上述した例では、音声認識結果表示装置（音声認識結果表示システム）１は操作部（２０２または３０２）によるテキスト入力および収音部（２０１または３０１）による音声認識を用いたテキスト入力の双方を許容していたが、これに限らない。例えば、音声認識結果表示装置（音声認識結果表示システム）１は収音部（２０１または３０１）による音声認識を用いたテキスト入力のみを許容していてもよい。 In the example described above, the speech recognition result display device (speech recognition result display system) 1 allows both text input via the operation unit (202 or 302) and text input using speech recognition via the sound collection unit (201 or 301), but the present invention is not limited to this. For example, the speech recognition result display device (speech recognition result display system) 1 may allow only text input using speech recognition via the sound collection unit (201 or 301).

また、上述した例では、音声認識結果表示装置１が親機２および複数の子機３を備える例を説明したが、これに限らない。例えば、音声認識結果表示装置１が備える子機３は１つのみでもよく、あるいは、音声認識結果表示装置１は子機３を備えていなくてもよい。 Further, in the example described above, the speech recognition result display device 1 includes the base unit 2 and a plurality of handsets 3, but the present invention is not limited to this. For example, the speech recognition result display device 1 may include only one handset 3, or may include no handset 3 at all.

また、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、議事録作成部２２６、議事録記憶部２２７、テキスト取得部２３１、および画像生成部２４１の各々は、子機３が備えていてもよい。同様に、処理部３０５は、親機２が備えていてもよい。 Additionally, each of the authentication unit 211, the acoustic model/dictionary storage unit 221, the acquisition unit 222, the speech recognition unit 223, the text conversion unit 224, the dependency analysis unit 225, the minutes creation unit 226, the minutes storage unit 227, the text acquisition unit 231, and the image generation unit 241 may be included in the handset 3. Similarly, the processing unit 305 may be included in the base unit 2.

また、音声認識結果表示装置１の各機能部は親機２および子機３以外の装置に備えられていてもよい。あるいは、音声認識結果表示システム１の各機能部は親機２または子機３その他の物理的装置に備えられていなくてもよく、一つまたは複数のサーバやクラウド上に設けられていてもよい。なお、各機能部とは、通信部２０４、認証部２１１、音響モデル・辞書記憶部２２１、取得部２２２、音声認識部２２３、テキスト変換部２２４、係り受け解析部２２５、議事録作成部２２６、議事録記憶部２２７、テキスト取得部２３１、画像生成部２４１、通信部３０４、および処理部３０５である。 Further, each functional unit of the speech recognition result display device 1 may be provided in a device other than the base unit 2 and the handset 3. Alternatively, each functional unit of the speech recognition result display system 1 need not be provided in the base unit 2, the handset 3, or any other physical device, and may instead be provided on one or more servers or in the cloud. Note that the functional units are the communication unit 204, the authentication unit 211, the acoustic model/dictionary storage unit 221, the acquisition unit 222, the speech recognition unit 223, the text conversion unit 224, the dependency analysis unit 225, the minutes creation unit 226, the minutes storage unit 227, the text acquisition unit 231, the image generation unit 241, the communication unit 304, and the processing unit 305.

なお、本発明における音声認識結果表示装置（音声認識結果表示システム）１の機能の全てまたは一部を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより音声認識結果表示装置（音声認識結果表示システム）１が行う処理の全てまたは一部を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータシステム」は、ホームページ提供環境（あるいは表示環境）を備えたＷＷＷシステムも含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムが送信された場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリ（ＲＡＭ）のように、一定時間プログラムを保持しているものも含むものとする。 Note that a program for realizing all or part of the functions of the speech recognition result display device (speech recognition result display system) 1 of the present invention is recorded on a computer-readable recording medium, and the program is recorded on this recording medium. All or part of the processing performed by the speech recognition result display device (speech recognition result display system) 1 may be performed by loading the program into the computer system and executing it. Note that the "computer system" herein includes hardware such as an OS and peripheral devices. Furthermore, the term "computer system" includes a WWW system equipped with a home page providing environment (or display environment). Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. Furthermore, "computer-readable recording medium" refers to volatile memory (RAM) inside a computer system that serves as a server or client when a program is transmitted via a network such as the Internet or a communication line such as a telephone line. This also includes programs that are retained for a certain period of time.

また、上記プログラムは、このプログラムを記憶装置等に格納したコンピュータシステムから、伝送媒体を介して、あるいは、伝送媒体中の伝送波により他のコンピュータシステムに伝送されてもよい。ここで、プログラムを伝送する「伝送媒体」は、インターネット等のネットワーク（通信網）や電話回線等の通信回線（通信線）のように情報を伝送する機能を有する媒体のことをいう。また、上記プログラムは、前述した機能の一部を実現するためのものであってもよい。さらに、前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるもの、いわゆる差分ファイル（差分プログラム）であってもよい。 Further, the program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line. Moreover, the above-mentioned program may be for realizing a part of the above-mentioned functions. Furthermore, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何等限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形および置換を加えることができる。 Although the mode for carrying out the present invention has been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the gist of the present invention.

１…音声認識結果表示装置（音声認識結果表示システム）２０３…表示部２２２…取得部２２３…音声認識部２２４…テキスト変換部２２５…係り受け解析部２２６…議事録作成部２４１…画像生成部３０３…表示部 1... Speech recognition result display device (speech recognition result display system), 203... Display unit, 222... Acquisition unit, 223... Speech recognition unit, 224... Text conversion unit, 225... Dependency analysis unit, 226... Minutes creation unit, 241... Image generation unit, 303... Display unit

Claims

an acquisition unit that acquires the user's uttered audio;
a voice recognition unit that performs voice recognition on the uttered voice acquired by the acquisition unit and outputs text information;
an image generation unit that generates image data based on the text information;
a display unit that displays the image data;
wherein the image generation unit performs image data generation processing that, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, generates the image data displaying the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, generates the image data displaying the text information included in the one utterance section divided into a plurality of groups, and
the number of characters of the text information included in each of the plurality of groups is less than the predetermined value;
Speech recognition result display system.

The image generation unit controls the one utterance section so that it is not divided into the plurality of groups in the middle of a clause.
The speech recognition result display system according to claim 1.

The image generation unit adds a text frame to each of the plurality of groups when generating the image data.
The speech recognition result display system according to claim 1 or 2.

The speech recognition result display system according to claim 3, wherein, when the text information included in the one utterance section spans a plurality of the text frames, the image generation unit assigns a sequential number display to the plurality of text frames.

The speech recognition result display system according to claim 3 or 4, wherein, when the text information included in the one utterance section spans the plurality of text frames, the image generation unit unifies the color of the plurality of text frames.
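The text frames, sequential number display, and unified color can be modeled together as in the following sketch. The `TextFrame` type, the `build_frames` function, the "1/3" label format, and the color string are all hypothetical; the claims do not specify how the numbering is rendered or how a color is chosen.

```python
from dataclasses import dataclass


@dataclass
class TextFrame:
    text: str    # text of one group
    label: str   # sequential number display; empty when there is one frame
    color: str   # frame color, shared by all frames of one utterance section


def build_frames(groups: list[str], color: str) -> list[TextFrame]:
    """Wrap each group of one utterance section in a text frame."""
    total = len(groups)
    frames = []
    for index, text in enumerate(groups, start=1):
        # Number the frames only when the section spans several frames.
        # The "index/total" format is an assumption for illustration.
        label = f"{index}/{total}" if total > 1 else ""
        frames.append(TextFrame(text=text, label=label, color=color))
    return frames
```

Passing the same `color` to every frame of one utterance section is what lets a reader see at a glance that the frames belong together.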

The speech recognition result display system according to claim 1 or 2, wherein the image generation unit generates the image data in which the plurality of groups are distinguished from one another by line breaks.

The speech recognition result display system according to any one of claims 1 to 6, wherein the image generation unit performs the image data generation processing every time the text information is output from the speech recognition unit.
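Regenerating the display on every recognizer output, with groups separated by line breaks, might look like the sketch below. It assumes the recognizer delivers text as appended fragments of the current utterance section, which the claims do not state; the class `UtteranceDisplay`, the method `on_recognized`, and the fixed-width cut are likewise hypothetical.

```python
class UtteranceDisplay:
    """Rebuild the grouped text each time new recognized text arrives."""

    def __init__(self, limit: int) -> None:
        if limit < 2:
            raise ValueError("limit must be at least 2")
        self.limit = limit  # hypothetical stand-in for the predetermined value
        self.text = ""      # text accumulated for the current utterance section

    def on_recognized(self, fragment: str) -> str:
        # Assumption: each recognizer output is a fragment to append.
        self.text += fragment
        if len(self.text) < self.limit:
            groups = [self.text]
        else:
            step = self.limit - 1  # keeps every group below the limit
            groups = [self.text[i:i + step]
                      for i in range(0, len(self.text), step)]
        # Distinguish the groups from one another by line breaks.
        return "\n".join(groups)
```

Because the grouping is recomputed from the full accumulated text, a section that starts as one group is split as soon as it reaches the limit.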

A speech recognition result display device comprising:
an acquisition unit that acquires a user's uttered speech;
a speech recognition unit that performs speech recognition on the uttered speech acquired by the acquisition unit and outputs text information;
an image generation unit that generates image data based on the text information; and
a display unit that displays the image data,
wherein the image generation unit performs image data generation processing in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, the image data is generated so as to display the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, the image data is generated so as to display the text information included in the one utterance section divided into a plurality of groups, and
the number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

A speech recognition result display method in a speech recognition result display system, the method comprising:
an acquisition step in which an acquisition unit acquires a user's uttered speech;
a speech recognition step in which a speech recognition unit performs speech recognition on the uttered speech acquired by the acquisition unit and outputs text information;
an image generation step in which an image generation unit generates image data based on the text information; and
a display step of displaying the image data,
wherein, in the image generation step, the image generation unit performs image data generation processing in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, the image data is generated so as to display the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, the image data is generated so as to display the text information included in the one utterance section divided into a plurality of groups, and
the number of characters of the text information included in each of the plurality of groups is less than the predetermined value.

A program that causes a speech recognition result display system to execute:
an acquisition step of acquiring a user's uttered speech;
a speech recognition step of performing speech recognition on the uttered speech and outputting text information;
an image generation step of generating image data based on the text information; and
a display step of displaying the image data,
wherein, in the image generation step, image data generation processing is performed in which, when the number of characters of the text information included in one utterance section of the user is less than a predetermined value, the image data is generated so as to display the text information included in the one utterance section as one group, and, when the number of characters of the text information included in the one utterance section is equal to or greater than the predetermined value, the image data is generated so as to display the text information included in the one utterance section divided into a plurality of groups, and
the number of characters of the text information included in each of the plurality of groups is less than the predetermined value.