JP2019050482A

JP2019050482A - Information acquisition device, display method, and program

Info

Publication number: JP2019050482A
Application number: JP2017173163A
Authority: JP
Inventors: 貴大中代; Takahiro Nakadai
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2017-09-08
Filing date: 2017-09-08
Publication date: 2019-03-28

Abstract

To provide an information acquisition device, a display method, and a program that allow a user to intuitively grasp the position of a speaker during recording. SOLUTION: An information acquisition device 2 includes: a display unit 23 capable of displaying an image; a first voice collection section 20 and a second voice collection section 21 that are installed at mutually different positions and that collect voice emitted from each of a plurality of sound sources to generate voice data; a sound source position estimation section 295 that estimates the positions of the plurality of sound sources on the basis of the voice data generated by each of the first voice collection section 20 and the second voice collection section 21; and a display control section 303 that displays, on the display unit 23, sound source position information on the position of each of the plurality of sound sources on the basis of the estimation result of the sound source position estimation section 295. SELECTED DRAWING: Figure 2

Description

The present invention relates to an information acquisition device, a display method, and a program that estimate and display the position of each sound source based on audio data.

In recent years, a technique for identifying the position of a sound source using a plurality of microphone arrays has become known (see Patent Document 1). In this technique, based on each of a plurality of sound source signals obtained from the outputs of the plurality of microphone arrays and on the positional relationship among the microphone arrays, MUSIC power is calculated at predetermined time intervals for each of a plurality of directions defined in a space centered on a point determined in relation to the positions of the microphone arrays. The peak of the MUSIC power is identified as the sound source position, and the audio signal at that position is then separated from the output signals of the microphone arrays.

JP 2012-211768 A (Japanese Unexamined Patent Application Publication No. 2012-211768)

When a user prepares a summary of a meeting or the like from audio data, the user sometimes wants to know where each speaker was during recording. However, Patent Document 1 described above only identifies the sound source position, so a user preparing such a summary from audio data cannot intuitively grasp the positions of the speakers at the time of recording.

The present invention has been made in view of the above, and an object of the present invention is to provide an information acquisition device, a display method, and a program that allow the positions of speakers at the time of recording to be grasped intuitively.

To solve the problem described above and achieve the object, an information acquisition device according to the present invention includes: a display unit capable of displaying an image; a plurality of sound collection units that are provided at mutually different positions and that collect sound emitted from each of a plurality of sound sources to generate audio data; a sound source position estimation unit that estimates the positions of the plurality of sound sources based on the audio data generated by each of the plurality of sound collection units; and a display control unit that causes the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result of the sound source position estimation unit.

A display method according to the present invention is executed by an information acquisition device that includes a display unit capable of displaying an image and a plurality of sound collection units that are provided at mutually different positions and that collect sound emitted from each of a plurality of sound sources to generate audio data. The display method includes: a sound source position estimation step of estimating the positions of the plurality of sound sources based on the audio data generated by each of the plurality of sound collection units; and a display control step of causing the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result obtained in the sound source position estimation step.

A program according to the present invention causes an information acquisition device, which includes a display unit capable of displaying an image and a plurality of sound collection units that are provided at mutually different positions and that collect sound emitted from each of a plurality of sound sources to generate audio data, to execute: a sound source position estimation step of estimating the positions of the plurality of sound sources based on the audio data generated by each of the plurality of sound collection units; and a display control step of causing the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result obtained in the sound source position estimation step.

According to the present invention, the positions of speakers at the time of recording can be grasped intuitively.

FIG. 1 is a diagram showing a schematic configuration of a transcriber system according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a functional configuration of the transcriber system according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing an outline of processing executed by the information acquisition device according to Embodiment 1 of the present invention.
FIG. 4 is a diagram showing a usage scene of the information acquisition device according to Embodiment 1 of the present invention.
FIG. 5 is a bird's-eye view schematically showing the situation of FIG. 4.
FIG. 6 is a diagram schematically showing the positions of sound sources estimated by the sound source position estimation unit according to Embodiment 1 of the present invention.
FIG. 7 is a diagram schematically showing how the sound source position estimation unit according to Embodiment 1 of the present invention calculates the arrival time difference for one sound source.
FIG. 8 is a diagram schematically showing an example of a method by which the sound source position estimation unit according to Embodiment 1 of the present invention calculates the arrival time difference.
FIG. 9 is a flowchart showing an outline of the each-sound-source position display determination process of FIG. 3.
FIG. 10 is a flowchart showing an outline of the icon determination and generation process of FIG. 9.
FIG. 11 is a diagram schematically showing an example of an icon generated by the sound source information generation unit according to Embodiment 1 of the present invention.
FIG. 12 is a diagram schematically showing another example of an icon generated by the sound source information generation unit according to Embodiment 1 of the present invention.
FIG. 13 is a diagram schematically showing another example of an icon generated by the sound source information generation unit according to Embodiment 1 of the present invention.
FIG. 14 is a diagram schematically showing an example of sound source position information displayed by the display control unit according to Embodiment 1 of the present invention.
FIG. 15 is a flowchart showing an outline of processing executed by the information processing device according to Embodiment 1 of the present invention.
FIG. 16 is a diagram schematically showing an example of a document creation screen according to Embodiment 1 of the present invention.
FIG. 17 is a flowchart showing an outline of the keyword determination process of FIG. 15.
FIG. 18 is a diagram schematically showing another example of the document creation screen according to Embodiment 1 of the present invention.
FIG. 19 is a diagram schematically showing another example of the document creation screen according to Embodiment 1 of the present invention.
FIG. 20 is a diagram schematically showing another example of the document creation screen according to Embodiment 1 of the present invention.
FIG. 21 is a perspective view showing a schematic configuration of the information acquisition device according to Embodiment 2 of the present invention before an external microphone is attached.
FIG. 22 is a perspective view showing a schematic configuration of the information acquisition device according to Embodiment 2 of the present invention after the external microphone is attached.
FIG. 23 is a diagram schematically showing the positions of sound sources estimated by the sound source position estimation unit according to Embodiment 2 of the present invention.
FIG. 24 is a diagram schematically showing how the sound source position estimation unit according to Embodiment 2 of the present invention calculates the arrival time difference for one sound source.

Hereinafter, modes for carrying out the present invention (hereinafter referred to as "embodiments") will be described in detail with reference to the drawings. The present invention is not limited by the following embodiments. The drawings referred to in the following description show shapes, sizes, and positional relationships only schematically, to the extent needed for the content of the present invention to be understood. That is, the present invention is not limited to the shapes, sizes, and positional relationships illustrated in the drawings.

(Embodiment 1)
[Configuration of Transcriber System]
FIG. 1 is a diagram showing a schematic configuration of a transcriber system according to Embodiment 1 of the present invention. FIG. 2 is a block diagram showing a functional configuration of the transcriber system according to Embodiment 1 of the present invention.

The transcriber system 1 shown in FIGS. 1 and 2 includes an information acquisition device 2 and an information processing device 3. The information acquisition device 2 functions as a recording device, such as an IC recorder or a mobile phone, that takes in sound through a microphone or the like and records audio data. The information processing device 3, such as a personal computer, acquires audio data from the information acquisition device 2 via a communication cable 4 and performs transcription of the audio data and various other processing. In Embodiment 1, the information acquisition device 2 and the information processing device 3 communicate bidirectionally via the communication cable 4, but they are not limited to this and may instead be connected so as to communicate bidirectionally by wireless. In that case, the wireless communication standard is, for example, IEEE 802.11a, IEEE 802.11b, IEEE 802.11n, IEEE 802.11g, IEEE 802.11ac, Bluetooth (registered trademark), or an infrared communication standard.

[Configuration of Information Acquisition Device]
First, the configuration of the information acquisition device 2 will be described.
The information acquisition device 2 includes a first sound collection unit 20, a second sound collection unit 21, an external input detection unit 22, a display unit 23, a clock 24, an input unit 25, a recording unit 26, a communication unit 27, an output unit 28, and a device control unit 29.

The first sound collection unit 20 is provided on the left side of the top surface of the information acquisition device 2 (see FIG. 1). The first sound collection unit 20 collects the sound emitted from each of a plurality of sound sources, converts it into an analog audio signal (electric signal), performs A/D conversion and gain adjustment on this audio signal to generate digital audio data (first audio data), and outputs the data to the device control unit 29. The first sound collection unit 20 is configured using one microphone chosen from a unidirectional microphone, an omnidirectional microphone, and a bidirectional microphone, together with an A/D conversion circuit, a signal processing circuit, and the like.

The second sound collection unit 21 is provided at a position different from that of the first sound collection unit 20. Specifically, it is provided on the right side of the top surface of the information acquisition device 2, separated from the first sound collection unit 20 by a predetermined distance d (see FIG. 1). The second sound collection unit 21 collects the sound emitted from each of the plurality of sound sources, converts it into an analog audio signal (electric signal), performs A/D conversion and gain adjustment on this audio signal to generate digital audio data (second audio data), and outputs the data to the device control unit 29. The second sound collection unit 21 has the same configuration as the first sound collection unit 20 and is configured using one microphone chosen from a unidirectional microphone, an omnidirectional microphone, and a bidirectional microphone, together with an A/D conversion circuit, a signal processing circuit, and the like. In Embodiment 1, the first sound collection unit 20 and the second sound collection unit 21 together form a stereo microphone.

The external input detection unit 22 accepts insertion and removal of the plug of an external microphone from outside the information acquisition device 2, detects insertion of the external microphone, and outputs the detection result to the device control unit 29. The external input detection unit 22 also receives the analog audio signal (electric signal) that the external microphone generates by collecting the sound emitted from each of the plurality of sound sources, performs A/D conversion and gain adjustment on the received audio signal to generate digital audio data (including at least third audio data), and outputs the data to the device control unit 29. When the plug of the external microphone is inserted, the external input detection unit 22 outputs to the device control unit 29 a signal indicating that the external microphone has been connected to the information acquisition device 2, together with the audio data generated by the external microphone. The external input detection unit 22 is configured using a microphone jack, an A/D conversion circuit, a signal processing circuit, and the like. The external microphone is configured using any one of a unidirectional microphone, an omnidirectional microphone, a bidirectional microphone, a stereo microphone capable of collecting left and right sound, and the like. When a stereo microphone is used as the external microphone, the external input detection unit 22 generates two sets of audio data (third audio data and fourth audio data), one for the sound collected by each of the left and right microphones, and outputs them to the device control unit 29.

The display unit 23 displays various information on the information acquisition device 2 under the control of the device control unit 29. The display unit 23 is configured using organic EL (electroluminescence), liquid crystal, or the like.

In addition to its timekeeping function, the clock 24 generates date and time information on the date and time of the audio data generated by each of the first sound collection unit 20, the second sound collection unit 21, and the external microphone, and outputs this date and time information to the device control unit 29.

The input unit 25 receives input of various information on the information acquisition device 2. The input unit 25 is configured using buttons, switches, and the like. The input unit 25 also has a touch panel 251 that is superimposed on the display area of the display unit 23, detects contact by an external object, and receives input of an operation signal corresponding to the detected position.

The recording unit 26 is configured using a volatile memory, a nonvolatile memory, a recording medium, and the like, and records audio files storing audio data as well as various programs executed by the information acquisition device 2. The recording unit 26 has a program recording unit 261 that records the various programs executed by the information acquisition device 2 and an audio file recording unit 262 that records the audio files storing audio data. The recording unit 26 may be a recording medium, such as a memory card, that can be attached and detached from outside.

The communication unit 27 transmits data, including audio files storing audio data, to the information processing device 3 in accordance with a predetermined communication standard, and receives various information and data from the information processing device 3.

The output unit 28 performs D/A conversion on the digital audio data input from the device control unit 29 to convert it into an analog audio signal, and outputs the signal to the outside. The output unit 28 is configured using a speaker, a D/A conversion circuit, and the like.

The device control unit 29 comprehensively controls the units constituting the information acquisition device 2. The device control unit 29 is configured using a CPU (Central Processing Unit), an FPGA (Field Programmable Gate Array), or the like. The device control unit 29 has a signal processing unit 291, a text conversion unit 292, a text conversion control unit 293, a voice determination unit 294, a sound source position estimation unit 295, a display position determination unit 296, a voiceprint determination unit 297, a sound source information generation unit 298, a voice identification unit 299, a movement determination unit 300, an index adding unit 301, an audio file generation unit 302, and a display control unit 303.

The signal processing unit 291 performs audio level adjustment, noise reduction, gain adjustment, and the like on the audio data generated by the first sound collection unit 20 and the second sound collection unit 21.

The text conversion unit 292 generates audio text data composed of a plurality of characters by performing speech recognition processing on the audio data. The details of the speech recognition processing will be described later.

When the text conversion control unit 293 receives, from the input unit 25, an instruction signal that causes the text conversion unit 292 to generate audio text data, it causes the text conversion unit 292 to generate audio text data for a predetermined time from the point at which the instruction signal was received.

The voice determination unit 294 determines whether the audio data, on which the signal processing unit 291 has sequentially performed automatic level adjustment, contains a silent period. Specifically, the voice determination unit 294 determines whether the audio level of the audio data is below a predetermined threshold and treats any period in which the level is below the threshold as a silent period.
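The threshold test described for the voice determination unit 294 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_silent_periods`, the per-frame level list, and the threshold value are all assumptions introduced here, since the text only states that periods below a predetermined threshold are treated as silent.

```python
from typing import List, Tuple


def find_silent_periods(levels: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, where level < threshold.

    `levels` is a hypothetical list of per-frame audio levels; the frame
    length and the threshold are assumptions, not values from the source.
    """
    periods: List[Tuple[int, int]] = []
    start = None  # index where the current silent run began, if any
    for i, level in enumerate(levels):
        if level < threshold:
            if start is None:
                start = i
        elif start is not None:
            periods.append((start, i))
            start = None
    if start is not None:  # the recording ended while still silent
        periods.append((start, len(levels)))
    return periods
```

For example, with levels `[0.5, 0.01, 0.02, 0.6, 0.0]` and a threshold of `0.1`, frames 1 to 2 and frame 4 are reported as silent periods.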

The sound source position estimation unit 295 estimates the positions of the plurality of sound sources based on the audio data generated by each of the first sound collection unit 20 and the second sound collection unit 21. Specifically, based on that audio data, the sound source position estimation unit 295 calculates the difference between the arrival times at which the audio signal emitted from each of the plurality of sound sources reaches the first sound collection unit 20 and the second sound collection unit 21, and, based on this calculation result, estimates the position of each of the plurality of sound sources as seen with the information acquisition device 2 at the center.
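One common way to obtain an arrival time difference between two microphones is to search for the lag that maximizes their cross-correlation, and then to convert that delay into a direction using the microphone spacing d. The sketch below assumes this approach; the passage itself only says that the arrival time difference is calculated. The function names, the speed-of-sound constant, the far-field (plane wave) conversion, and the sign convention are all assumptions made for illustration.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def estimate_delay_samples(first: Sequence[float], second: Sequence[float], max_lag: int) -> int:
    """Lag in samples by which `second` trails `first` (negative if it leads).

    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    n = min(len(first), len(second))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += first[i] * second[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(delay_s: float, mic_distance_m: float) -> float:
    """Far-field direction of arrival, measured from the broadside of the mic pair.

    Assumes a plane wave, so sin(angle) = c * delay / d; the ratio is clamped
    to [-1, 1] to tolerate measurement noise.
    """
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A delay in samples is converted to seconds by dividing by the sampling rate before calling `arrival_angle_deg`. With only two microphones this yields a direction rather than a full position, so any distance estimate would need additional information.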

The display position determination unit 296 determines the display position of each of the plurality of sound sources in the display area of the display unit 23 based on the shape of the display area and the estimation result of the sound source position estimation unit 295. Specifically, the display position determination unit 296 determines the display position of each sound source with the center of the display area of the display unit 23 taken as the information acquisition device 2. For example, the display position determination unit 296 divides the display area of the display unit 23 into four quadrants and determines the display position of each of the plurality of sound sources with the information acquisition device 2 placed at the center of the display area.
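The mapping from an estimated direction to a position in the display area, with the device at the center and the area split into four quadrants, might look like the sketch below. The angle convention (0 degrees pointing to the top of the screen, clockwise positive), the fixed drawing radius, and both function names are assumptions for illustration; the text does not specify how the coordinates are computed.

```python
import math
from typing import Tuple


def display_position(angle_deg: float, width: int, height: int,
                     radius_ratio: float = 0.4) -> Tuple[int, int]:
    """Pixel position of a source, with the device drawn at the display center.

    `radius_ratio` (assumed) scales the drawing radius to the shorter side so
    that icons stay inside a non-square display area.
    """
    cx, cy = width / 2, height / 2
    r = radius_ratio * min(width, height)
    theta = math.radians(angle_deg)
    x = cx + r * math.sin(theta)
    y = cy - r * math.cos(theta)  # screen y grows downward
    return round(x), round(y)


def quadrant(x: int, y: int, width: int, height: int) -> int:
    """Quadrant 1 to 4, numbered counterclockwise from the upper right."""
    right = x >= width / 2
    upper = y < height / 2
    if right and upper:
        return 1
    if not right and upper:
        return 2
    if not right and not upper:
        return 3
    return 4
```

On a 200 by 100 pixel area, a source straight ahead lands at the top center, and a source at 90 degrees lands to the right of center at the same height as the device.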

The voiceprint determination unit 297 determines the voiceprint of each of the plurality of sound sources based on the audio data. Specifically, the voiceprint determination unit 297 determines the voiceprint (speaker) of each of the plurality of sound sources contained in the audio data. For example, the voiceprint determination unit 297 determines the voiceprint (speaker) of each sound source contained in the audio data based on speaker identification templates in which features of the voices of the speakers participating in the meeting were registered before the meeting was recorded with the information acquisition device 2. In addition to features based on the voice uttered by a speaker, the voiceprint determination unit 297 determines, based on the audio data, the frequency (pitch of the voice), intonation, volume (intensity), histogram, and the like. The voiceprint determination unit 297 may also determine gender based on the audio data. Furthermore, the voiceprint determination unit 297 may determine, based on the audio data, the volume (intensity), the frequency (pitch of the voice), and the intonation of each of the plurality of speakers.

The sound source information generation unit 298 generates a plurality of pieces of sound source information, one for each of the plurality of sound sources, based on the determination result of the voiceprint determination unit 297. Specifically, the sound source information generation unit 298 generates voice information for each speaker identified by the voiceprint determination unit 297. For example, the sound source information generation unit 298 generates, as the voice information, an icon that schematically represents the speaker based on the frequency (pitch) of the speaker's voice. The sound source information generation unit 298 may also change the type of voice information according to the gender determined by the voiceprint determination unit 297, generating, for example, a female icon, a male icon, or an icon of a dog, a cat, or the like. Here, the sound source information generation unit 298 may hold voice data of specific pitches as a database and determine the icon by comparing that data with the acquired voice signal, or it may determine the icon by comparing, for example, the frequencies (pitches) of the voices of the detected speakers with one another. The sound source information generation unit 298 may also hold a database of vocabulary and phrasing organized by gender, language, age, and so on, and determine the icon by comparing the voice pattern against it. A further question is whether an icon should be displayed even for a person who merely made a slightly off-topic remark or merely gave a brief back-channel response. Such remarks seldom need to be replayed later and merely supplement the main remarks, so there is little point in the sound source information generation unit 298 generating icons for them; doing so may even hinder intuitive judgment. Therefore, when an utterance is shorter than a specific length of time, or when its nouns (such as the subject and object), verbs, adjectives, auxiliary verbs, and the like are not clear, the sound source information generation unit 298 may treat it as an ambiguous utterance rather than an important one and vary the visibility of the icon, for example by not generating an icon for the speaker, by making the icon fainter, dotted, or smaller, or by breaking the lines that make up the icon. In other words, the sound source information generation unit 298 may be given a function that determines the utterance content by speech recognition, identifies the words used, and grammatically verifies the completeness of the utterance, so that it can determine whether the object and subject are appropriate to the agenda. Whether words relate to the agenda may be determined by detecting the words used by the person who speaks most (the chairperson or facilitator) and comparing them with the words uttered by each speaker. If they do not match, the utterance may be treated as unclear; a quiet voice or a short utterance can be judged in the same way. By generating the icon that schematically represents each speaker with a visibility that depends on the length and clarity of that speaker's speech, the intuitive searchability of utterances is improved. The sound source information generation unit 298 may also generate, as the sound source information, an icon schematically representing each of the plurality of speakers based on a comparison of the volumes of the speakers determined by the voiceprint determination unit 297. Of course, the sound source information generation unit 298 may generate sound source information in which the icons schematically representing the speakers differ from one another based on the length and volume of each speaker's speech in the audio data.
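The visibility rule described above (short, grammatically incomplete, or off-topic utterances get a less visible icon) can be illustrated with a minimal sketch. The names `Utterance`, `is_on_topic`, and `icon_style`, the 2-second threshold, and the two style labels are assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str
    duration_s: float            # length of the utterance in seconds
    words: list                  # words returned by speech recognition
    has_subject_and_verb: bool   # hypothetical result of a grammar check


def is_on_topic(words, facilitator_words):
    """Treat an utterance as on-topic if it shares a word with the facilitator."""
    return bool(set(words) & set(facilitator_words))


def icon_style(utterance, facilitator_words, min_duration_s=2.0):
    """Return 'solid' for an important utterance, 'faint' for an ambiguous one."""
    important = (
        utterance.duration_s >= min_duration_s
        and utterance.has_subject_and_verb
        and is_on_topic(utterance.words, facilitator_words)
    )
    return "solid" if important else "faint"
```

A real implementation would replace the shared-word test with whatever similarity measure the speech recognizer supports; the point here is only that length, grammatical completeness, and topical overlap jointly decide the icon's visibility.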

The voice identification unit 299 identifies the appearance position (appearance time) at which each of the plurality of voiceprints determined by the voiceprint determination unit 297 appears in the audio data.

The movement determination unit 300 determines whether each of the plurality of sound sources is moving, based on the estimation result of the sound source position estimation unit 295 and the determination result of the voiceprint determination unit 297.

The index adding unit 301 adds, to the audio data, an index that distinguishes a silent period determined by the voice determination unit 294 from other periods, at at least one of the beginning and the end of the silent period.

The voice file generation unit 302 generates an audio file that associates the following with one another, and records it in the audio file recording unit 262: the audio data processed by the signal processing unit 291; the sound source position information estimated by the sound source position estimation unit 295; the plurality of pieces of sound source information generated by the sound source information generation unit 298; the appearance positions identified by the voice identification unit 299; position information on the position of the index added by the index adding unit 301, or time information on the time in the audio data at which the index was added; and the audio text data generated by the textification unit 292. The voice file generation unit 302 may also generate an audio file that associates the audio data processed by the signal processing unit 291 with candidate timing information, and record it in the audio file recording unit 262, which functions as a recording medium. The candidate timing information designates, as a candidate timing, a predetermined time measured from the point at which the input unit 25 received an instruction signal, for which the textification unit 292 is to generate audio text data.
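As a rough illustration of the associations listed above, the record below groups the items that the audio file ties together. The class name `AudioFileRecord`, its field names, and the chosen types are hypothetical; the patent does not define a file layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AudioFileRecord:
    audio_data: bytes                            # signal-processed audio
    source_azimuth_deg: Dict[str, float]         # speaker id -> estimated direction
    source_icons: Dict[str, str]                 # speaker id -> icon identifier
    appearance_times_s: Dict[str, List[float]]   # speaker id -> times the voiceprint appears
    index_times_s: List[float] = field(default_factory=list)        # silent-period indexes
    text: Optional[str] = None                                      # recognized text, if any
    candidate_timings_s: List[float] = field(default_factory=list)  # user-marked moments
```

Keeping the candidate timings as plain metadata means the text field can stay empty on the recorder and be filled in later by the information processing device.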

The display control unit 303 controls the display mode of the display unit 23. Specifically, the display control unit 303 causes the display unit 23 to display various information related to the information acquisition device 2. For example, the display control unit 303 causes the display unit 23 to display the audio level of the audio data adjusted by the signal processing unit 291. The display control unit 303 also causes the display unit 23 to display sound source position information on the position of each of the plurality of sound sources, based on the estimation result of the sound source position estimation unit 295. Specifically, the display control unit 303 causes the display unit 23 to display the sound source position information based on the determination result of the display position determination unit 296. More specifically, the display control unit 303 causes the display unit 23 to display, as the sound source position information, the plurality of pieces of sound source information generated by the sound source information generation unit 298.

[Configuration of the information processing device]
Next, the configuration of the information processing device 3 will be described.
The information processing device 3 includes a communication unit 31, an input unit 32, a recording unit 33, an audio reproduction unit 34, a display unit 35, and an information processing control unit 36.

The communication unit 31 transmits data to the information acquisition device 2 in accordance with a predetermined communication standard, and receives from the information acquisition device 2 data that includes at least an audio file storing audio data.

The input unit 32 receives input of various information related to the information processing device 3. The input unit 32 is configured using buttons, switches, a keyboard, a touch panel, and the like. For example, when the user performs documentation work, the input unit 32 receives input of text data.

The recording unit 33 is configured using a volatile memory, a non-volatile memory, a recording medium, and the like, and records audio files storing audio data and the various programs executed by the information processing device 3. The recording unit 33 has a program recording unit 331 that records the various programs executed by the information processing device 3, and a voice-to-text dictionary data recording unit 332 that records the dictionary data used to convert audio data into text data. This dictionary data is preferably a database in which not only the correspondence between speech and text but also near-synonyms and the like can be looked up. Here, near-synonyms are two or more words in the same language that differ in form but are very similar in meaning and can, in some cases, be substituted for one another. Of course, a thesaurus and exact synonyms may also be included.

The audio reproduction unit 34 performs D/A conversion on the digital audio data input from the information processing control unit 36, converts it into an analog audio signal, and outputs the signal to the outside. The audio reproduction unit 34 is configured using a speaker, a D/A conversion circuit, and the like.

Under the control of the information processing control unit 36, the display unit 35 displays various information related to the information processing device 3 and a time bar corresponding to the recording time of the audio data. The display unit 35 is configured using an organic EL panel, a liquid crystal panel, or the like.

The information processing control unit 36 centrally controls the units that constitute the information processing device 3. The information processing control unit 36 is configured using a CPU or the like. The information processing control unit 36 includes a text conversion unit 361, a specifying unit 362, a keyword determination unit 363, a keyword setting unit 364, a voice control unit 365, a display control unit 366, and a document generation unit 367.

The text conversion unit 361 generates audio text data composed of a plurality of characters by performing speech recognition processing on the audio data. The details of the speech recognition processing will be described later.

The specifying unit 362 specifies the appearance position (appearance time) in the audio data at which a character string in the audio text data matches the character string of the keyword. Of course, the match need not be exact; for example, the specifying unit 362 may specify an appearance position (appearance time) in the audio data at which the similarity between the keyword's character string and a character string in the audio text data is high (for example, 80% or more).
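The similarity-based matching described above can be sketched with Python's standard `difflib.SequenceMatcher`. The patent does not say how similarity is measured, so the use of `SequenceMatcher.ratio()`, the function name `find_keyword`, and the `(start_time_s, text)` segment format are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def find_keyword(segments, keyword, threshold=0.8):
    """Return the start times of segments whose text contains a near match.

    segments: list of (start_time_s, text) pairs from speech recognition.
    A sliding window the length of the keyword is compared with the keyword,
    and a segment counts as a hit when any window reaches the threshold.
    """
    hits = []
    width = len(keyword)
    for start_s, text in segments:
        for i in range(max(1, len(text) - width + 1)):
            window = text[i:i + width]
            if SequenceMatcher(None, keyword, window).ratio() >= threshold:
                hits.append(start_s)
                break  # one hit per segment is enough
    return hits
```

With a threshold of 0.8, a recognition error such as "importent" still matches the keyword "important", while unrelated text does not.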

The keyword determination unit 363 determines whether the audio file acquired by the communication unit 31 from the information acquisition device 2 contains a keyword candidate. Specifically, the keyword determination unit 363 determines whether audio text data is stored in the audio file acquired by the communication unit 31 from the information acquisition device 2.

When the keyword determination unit 363 determines that the audio file acquired from the information acquisition device 2 via the communication unit 31 contains a keyword candidate, the keyword setting unit 364 sets the keyword candidate stored in the audio file as the keyword used to search for the position at which it appears in the audio data. Specifically, the keyword setting unit 364 sets the audio text data stored in the audio file acquired by the communication unit 31 from the information acquisition device 2 as the keyword used to search for the position at which it appears in the audio data. Once a meeting is over, people often remember the general sense of a word but forget the exact word itself. The keyword setting unit 364 may therefore look up near-synonyms in a database such as the voice-to-text dictionary data recording unit 332 (for example, for the word "important", similar words such as "point" or "vital") and search for keywords with similar meanings.
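The near-synonym lookup can be pictured as a simple expansion of the search keyword. The dictionary contents and the function name `expand_keyword` below are hypothetical stand-ins for the dictionary data held in the voice-to-text dictionary data recording unit 332.

```python
# Hypothetical near-synonym dictionary; a real one would come from the
# dictionary data recorded in the information processing device.
NEAR_SYNONYMS = {
    "important": ["point", "vital"],
}


def expand_keyword(keyword, near_synonyms=NEAR_SYNONYMS):
    """Return the keyword followed by any near-synonyms found for it."""
    return [keyword] + list(near_synonyms.get(keyword, []))
```

Each word in the returned list can then be searched for in the audio text data in the same way as the original keyword.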

The voice control unit 365 controls the driving of the audio reproduction unit 34. Specifically, the voice control unit 365 causes the audio reproduction unit 34 to reproduce the audio data stored in the audio file.

The display control unit 366 controls the display mode of the display unit 35. The display control unit 366 causes the display unit 35 to display, on the time bar, position information on the appearance position at which the keyword appears.

[Processing of the information acquisition device]
Next, the processing executed by the information acquisition device 2 will be described. FIG. 3 is a flowchart showing an outline of the processing executed by the information acquisition device 2. FIG. 4 is a diagram showing a usage scene of the information acquisition device 2. FIG. 5 is a bird's-eye view schematically showing the situation of FIG. 4.

As shown in FIG. 3, when the input unit 25 is operated and an instruction signal instructing recording is input from the input unit 25 (step S101: Yes), the device control unit 29 drives the first sound collection unit 20 and the second sound collection unit 21 and starts recording, in which audio data is sequentially stored in an audio file as sound is input and is recorded in the recording unit 26 (step S102).

Subsequently, the signal processing unit 291 performs automatic level adjustment, which automatically adjusts the level of the audio data generated by each of the first sound collection unit 20 and the second sound collection unit 21 (step S103).

After that, the display control unit 303 causes the display unit 23 to display the level of the automatic level adjustment that the signal processing unit 291 is applying to the audio data (step S104).

Subsequently, the voice determination unit 294 determines whether the audio data on which the signal processing unit 291 has sequentially performed automatic level adjustment contains a silent period (step S105). Specifically, the voice determination unit 294 determines, for each predetermined frame period of that audio data, whether the volume level is below a predetermined threshold, and thereby determines whether a silent period is contained. More specifically, when the volume level of the audio data stays below the predetermined threshold for a predetermined period (for example, 10 seconds), the voice determination unit 294 determines that the audio data contains a silent period. The user can set the predetermined period as appropriate using the input unit 25. When the voice determination unit 294 determines that the audio data contains a silent period (step S105: Yes), the information acquisition device 2 proceeds to step S106 described later. On the other hand, when the voice determination unit 294 determines that the audio data does not contain a silent period (step S105: No), the information acquisition device 2 proceeds to step S107 described later.
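The silent-period test in step S105 amounts to finding runs of frames whose level stays below a threshold for at least the predetermined period. The following is a minimal sketch under that reading; the function name, the per-frame level list, and the return format are assumptions for illustration.

```python
def find_silent_periods(levels, frame_s, threshold, min_silence_s=10.0):
    """Return (start_s, end_s) pairs for runs of quiet frames.

    levels: volume level of each frame, in recording order.
    frame_s: duration of one frame in seconds.
    A run counts as a silent period only if it lasts at least min_silence_s.
    """
    periods = []
    run_start = None
    for i, level in enumerate(levels):
        if level < threshold:
            if run_start is None:
                run_start = i          # a quiet run begins here
        elif run_start is not None:
            if (i - run_start) * frame_s >= min_silence_s:
                periods.append((run_start * frame_s, i * frame_s))
            run_start = None
    # Handle a quiet run that lasts until the end of the data.
    if run_start is not None and (len(levels) - run_start) * frame_s >= min_silence_s:
        periods.append((run_start * frame_s, len(levels) * frame_s))
    return periods
```

The start and end times returned here are the two places at which an index could be attached to the silent period.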

In step S106, the index adding unit 301 adds, to the audio data, an index that distinguishes the silent period determined by the voice determination unit 294 from other periods, at at least one of the beginning and the end of the silent period. After step S106, the information acquisition device 2 proceeds to step S107 described later.

In step S107, when the input unit 25 is operated and an instruction signal instructing the setting of a keyword candidate for adding an index is input from the input unit 25 (step S107: Yes), the information acquisition device 2 proceeds to step S108 described later. On the other hand, when no instruction signal instructing the setting of a keyword candidate for adding an index is input from the input unit 25 (step S107: No), the information acquisition device 2 proceeds to step S109 described later. This step corresponds to the case where, during recording (for example, in the middle of a meeting), the user gives some instruction, much like writing a memo or attaching a sticky note, at the moment an important topic that the user will want to replay later begins. Although a specific switch operation (for example, an input made by operating the input unit 25) is described here, the same input may be made when a spoken phrase such as "this is important" is detected. That is, the index adding unit 301 may add an index based on the text data that the textification unit 292 produces from the audio data input through the first sound collection unit 20 and the second sound collection unit 21.

At such a timing, it is highly likely that a discussion using words that are important keywords in the meeting has just begun. Therefore, in step S108, the textification control unit 293 causes the textification unit 292 to execute the speech recognition processing described later on audio data going back a predetermined time (for example, 3 seconds; if the conversation has not paused, the processing may go back further) from the point at which the instruction signal instructing the setting of a keyword candidate was input from the input unit 25, thereby generating audio text data. This makes it easier to identify, in real time at the recording site, keywords that the user will want to listen to again later. Once a meeting is over, people often remember the general sense of a word but forget the exact word itself. In this way, the timing that should be searched carefully later is made easy to find. This can be called a candidate timing, and at this timing it is highly likely that the discussion uses important keywords, their near-synonyms, or words with similar nuances. Because making the audio data at this timing preferentially visible as text helps the user grasp the discussion as a whole, the textification control unit 293 has the textification executed to generate the audio text data. Note that in step S108 the index adding unit 301 does not necessarily have to go as far as textification; it may simply record the candidate timing in association with the audio data, for example as a number of minutes and seconds from the start of recording, as the timing to be searched intensively. One way to do this is to record the candidate timing information as metadata when the audio file is created. After step S108, the information acquisition device 2 proceeds to step S109 described later.
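The retroactive window of step S108 can be sketched as follows. The function name `retro_window` and the parameter `speech_start_s` are hypothetical, and extending the window back to the start of the current uninterrupted stretch of speech is only one possible reading of "the processing may go back further".

```python
def retro_window(press_time_s, lookback_s=3.0, speech_start_s=None):
    """Return (start_s, end_s) of the audio to transcribe for a user mark.

    press_time_s: time at which the instruction signal was input.
    lookback_s: the predetermined time to go back (3 seconds in the example).
    speech_start_s: assumed input giving the time at which the current
        uninterrupted stretch of speech began; when it lies further back
        than the lookback, the window is extended to it.
    """
    start_s = max(0.0, press_time_s - lookback_s)
    if speech_start_s is not None and speech_start_s < start_s:
        start_s = max(0.0, speech_start_s)
    return (start_s, press_time_s)
```

If only the candidate timing is to be stored, the same pair can be written into the audio file's metadata instead of being passed to the speech recognizer.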

In step S109, the sound source position estimation unit 295 estimates the positions of the plurality of sound sources based on the audio data generated by each of the first sound collection unit 20 and the second sound collection unit 21. After step S109, the information acquisition device 2 proceeds to step S110 described later.

FIG. 6 is a diagram schematically showing the positions of the sound sources estimated by the sound source position estimation unit 295. As shown in FIG. 6, based on the audio data generated by each of the first sound collection unit 20 and the second sound collection unit 21, the sound source position estimation unit 295 calculates the arrival time difference with which the voice uttered by each of the speaker who is sound source A1 and the speaker who is sound source A2 reaches the first sound collection unit 20 and the second sound collection unit 21, and estimates the sound source direction by forming a focal point on the sound source using the calculated arrival time difference.

FIG. 7 is a diagram schematically showing how the sound source position estimation unit 295 calculates the arrival time difference for one sound source. FIG. 8 is a diagram schematically showing an example of the method by which the sound source position estimation unit 295 calculates the arrival time difference.

As shown in FIGS. 7 and 8, where d is the distance between the first sound collection unit 20 and the second sound collection unit 21, θ is the sound source azimuth of the speaker who is sound source A1, and V is the speed of sound, the arrival time difference T is given by the following equation (1).
T = (d × cos θ) / V ... (1)
In this case, the sound source position estimation unit 295 can obtain the arrival time difference T from the degree of coincidence of the frequencies contained in the two sets of audio data generated by the first sound collection unit 20 and the second sound collection unit 21, and it calculates T in this way. The sound source position estimation unit 295 then estimates the azimuth of the sound source by calculating the sound source azimuth θ from the arrival time difference T and equation (1). Specifically, the sound source position estimation unit 295 estimates the azimuth of sound source A1 by calculating θ with the following equation (2), which is equation (1) solved for θ.
θ = cos⁻¹((T × V) / d) ... (2)
In this way, the sound source position estimation unit 295 can estimate the azimuth of each sound source.
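Equations (1) and (2) can be checked numerically with a short sketch. The speed of sound of 343 m/s and the function names are assumptions. The patent obtains T from the degree of coincidence of frequencies in the two signals; the brute-force cross-correlation in `estimate_delay_samples` is a common stand-in shown only to make the example self-contained, not the method the patent describes.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def arrival_time_difference(d_m, theta_rad, v=SPEED_OF_SOUND_M_S):
    """Equation (1): T = d * cos(theta) / V."""
    return d_m * math.cos(theta_rad) / v


def source_azimuth(t_s, d_m, v=SPEED_OF_SOUND_M_S):
    """Equation (2): theta = arccos(T * V / d), clamped against noise."""
    ratio = max(-1.0, min(1.0, t_s * v / d_m))
    return math.acos(ratio)


def estimate_delay_samples(a, b, max_lag):
    """Return the lag (in samples) at which b best lines up with a."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            a[n] * b[n + lag]
            for n in range(len(a))
            if 0 <= n + lag < len(b)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Dividing the lag returned by `estimate_delay_samples` by the sampling rate gives T in seconds, which `source_azimuth` converts to θ. Note that two microphones alone yield θ only between 0 and π, so sources in front of and behind the microphone axis are not distinguished by this calculation.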

Returning to FIG. 3, the description continues from step S110.
In step S110, the information acquisition device 2 executes a sound source position display determination process that determines, based on the estimation result of the sound source position estimation unit 295, the positions on the display area of the display unit 23 at which the sound source position information for each of the plurality of sound sources is to be displayed.

[Sound source position display determination process]
FIG. 9 is a flowchart showing an outline of the sound source position display determination process in step S110 of FIG. 3.

As shown in FIG. 9, the voiceprint determination unit 297 determines the type of each of the plurality of sound sources based on the audio data (step S201). Specifically, using a well-known voiceprint authentication technique, the voiceprint determination unit 297 analyzes the audio data, separates out the voice emitted by each of the plurality of sound sources estimated by the sound source position estimation unit 295, and determines the type of each sound source. For example, the voiceprint determination unit 297 determines the voiceprint (speaker) of each of the plurality of sound sources contained in the audio data based on speaker identification templates in which features of the voices of the speakers participating in the meeting are registered.

Based on the shape of the display area of the display unit 23 and the position of each of the plurality of sound sources estimated by the sound source position estimation unit 295, the display position determination unit 296 determines in which of the first to fourth quadrants of the display area each sound source is located (step S202). Specifically, the display position determination unit 296 determines the display position of each sound source with the center of the display area of the display unit 23 regarded as the information acquisition device 2. For example, the display position determination unit 296 determines in which of the first to fourth quadrants each of the plurality of sound sources estimated by the sound source position estimation unit 295 is located. In this case, the display position determination unit 296 divides the display area of the display unit 23 into four quadrants, the first to the fourth, by two straight lines that pass through the center of the display area and are orthogonal to each other on the plane. Although the display area is divided into four quadrants in the present embodiment, the division is not limited to this; the display area may be divided into two regions, or the number of regions may be selected as appropriate according to the number of sound collection units provided in the information acquisition device 2.
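The quadrant assignment of step S202 can be sketched as a mapping from an azimuth to a quadrant number. The convention used here (angle measured counterclockwise from the positive horizontal axis of the display, with the device at the center) and the function name are assumptions; the patent does not fix a coordinate convention.

```python
def quadrant(azimuth_deg):
    """Return the quadrant (1 to 4) of the display area for a given azimuth.

    Angles on a boundary are assigned to the quadrant that starts there,
    so 0 degrees falls in quadrant 1 and 90 degrees in quadrant 2.
    """
    wrapped = azimuth_deg % 360.0   # map any angle into [0, 360)
    return int(wrapped // 90.0) + 1
```

Dividing the display into two regions instead would only change the divisor from 90 to 180.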

Subsequently, the display position determination unit 296 determines whether a plurality of sound sources are located in the same quadrant (step S203). When the display position determination unit 296 determines that a plurality of sound sources are located in the same quadrant (step S203: Yes), the information acquisition device 2 proceeds to step S204 described later. On the other hand, when the display position determination unit 296 determines that no quadrant contains a plurality of sound sources (step S203: No), the information acquisition device 2 proceeds to step S205 described later.

In step S204, the display position determination unit 296 determines whether the sound sources located in the same quadrant differ in distance. When the display position determination unit 296 determines that the sound sources located in the same quadrant differ in distance (step S204: Yes), the information acquisition device 2 proceeds to step S206 described later. On the other hand, when the display position determination unit 296 determines that the sound sources located in the same quadrant do not differ in distance (step S204: No), the information acquisition device 2 proceeds to step S205 described later.

In step S205, the display position determination unit 296 determines the display positions at which the icons are displayed, based on the sound source in each quadrant. After step S205, the information acquisition device 2 proceeds to step S207 described later.

In step S206, the display position determination unit 296 determines the display positions at which the icons are displayed, based on the relative distance of each of the plurality of sound sources located in the same quadrant. After step S206, the information acquisition device 2 proceeds to step S207 described later.
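The branch structure of steps S203 to S206 can be sketched as follows. The function name `icon_positions`, the `(azimuth_deg, distance_m)` input format, and the rule of spreading icons along the radius by distance rank are all assumptions for illustration; the patent only states that relative distance is taken into account when sources share a quadrant.

```python
import math
from collections import defaultdict


def icon_positions(sources):
    """Map each source id to an (x, y) icon position, device at the origin.

    sources: dict of id -> (azimuth_deg, distance_m); how the distance is
    estimated is outside the scope of this sketch.
    """
    groups = defaultdict(list)
    for source_id, (azimuth_deg, _) in sources.items():
        groups[int((azimuth_deg % 360.0) // 90.0) + 1].append(source_id)

    positions = {}
    for ids in groups.values():
        distances = {sources[s][1] for s in ids}
        if len(ids) > 1 and len(distances) > 1:
            # S203: Yes and S204: Yes -> S206, order icons by distance.
            ordered = sorted(ids, key=lambda s: sources[s][1])
            radii = {s: (rank + 1) / len(ordered) for rank, s in enumerate(ordered)}
        else:
            # S205: one icon radius for every source in the quadrant.
            radii = {s: 1.0 for s in ids}
        for s in ids:
            angle = math.radians(sources[s][0])
            positions[s] = (radii[s] * math.cos(angle), radii[s] * math.sin(angle))
    return positions
```

The returned coordinates are in a unit square centered on the device and would still need scaling to the actual shape of the display area.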

[Icon determination and generation process]
FIG. 10 is a flowchart showing an outline of the icon determination and generation process in step S207 of FIG. 9.

図１０に示すように、まず、音源情報生成部２９８は、声紋判定部２９７によって判定された複数の声紋のうち声の高い順に順位を決定する（ステップＳ３０１）。 As shown in FIG. 10, first, the sound source information generation unit 298 determines the order in descending order of voice among a plurality of voiceprints determined by the voiceprint determination unit 297 (step S301).

続いて、音源情報生成部２９８は、声紋判定部２９７によって判定された複数の声紋のうち一番高い声の話者（音源）を細い顔、長髪アイコンとして生成する（ステップＳ３０２）。具体的には、図１１に示すように、音源情報生成部２９８は、声紋判定部２９７によって判定された複数の声紋のうち一番高い声の話者（音源）を細い顔で長髪のアイコンＯ１（女性をイメージとするアイコン）として生成する。 Subsequently, the sound source information generation unit 298 generates a speaker (sound source) of the highest voice among the plurality of voiceprints determined by the voiceprint determination unit 297 as a thin face long hair icon (step S302). Specifically, as shown in FIG. 11, the sound source information generation unit 298 sets the speaker (sound source) of the highest voice among the plurality of voiceprints determined by the voiceprint determination unit 297 to a thin-faced long hair icon O1. Generate as (an icon that takes a woman as an image).

Thereafter, the sound source information generation unit 298 generates a round-faced, short-haired icon for the speaker (sound source) with the lowest voice among the plurality of voiceprints determined by the voiceprint determination unit 297 (step S303). Specifically, as shown in FIG. 12, the sound source information generation unit 298 generates, for the speaker with the lowest voice, a round-faced, short-haired icon O2 (an icon evoking a man).

Next, the sound source information generation unit 298 generates icons in order of the pitch of the plurality of voiceprints determined by the voiceprint determination unit 297 (step S304). Specifically, following that order, the sound source information generation unit 298 generates icons whose face shape changes step by step from a thin face to a round face and whose hair changes step by step from long to short. A business setting is assumed here, but children may also hold meetings, so when a voice has the characteristics of a child's voice, the sound source information generation unit 298 uses a different icon generation method. For example, when children are present together with adults, the sound source information generation unit 298 may determine this from differences in voice quality and improve distinguishability by drawing the children with smaller icons or, when children are in the majority, by drawing the adults larger. Because children are still growing, the aspect ratio of a child's face is generally closer to 1:1 than an adult's, so the icon may also be widened to emphasize its width. That is, the sound source information generation unit 298 may generate an icon with an emphasized width when creating icons.
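As an illustrative sketch only, and not part of the disclosed embodiment, the ranking and shape interpolation of steps S301 to S304 could be expressed as follows. The use of a single pitch value in hertz per speaker, the `Icon` class, the function name `generate_icons`, and the normalized `face_width` and `hair_length` values are all assumptions introduced for this example; the description only states that icons vary from a thin face with long hair to a round face with short hair in order of voice pitch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Icon:
    """Hypothetical icon parameters; not defined in the description."""
    speaker_id: str
    face_width: float   # 0.0 = thinnest face, 1.0 = roundest face
    hair_length: float  # 1.0 = longest hair, 0.0 = shortest hair


def generate_icons(pitch_hz: dict[str, float]) -> list[Icon]:
    """Rank speakers from highest to lowest pitch and interpolate icon shape."""
    # Step S301: order the speakers from the highest voice to the lowest.
    ranked = sorted(pitch_hz, key=pitch_hz.get, reverse=True)
    # Avoid division by zero when there is only one speaker.
    last = max(len(ranked) - 1, 1)
    icons = []
    for rank, speaker in enumerate(ranked):
        # t is 0.0 for the highest voice (S302) and 1.0 for the lowest (S303);
        # intermediate speakers fall in between (S304).
        t = rank / last
        icons.append(Icon(speaker, face_width=t, hair_length=1.0 - t))
    return icons
```

With three speakers, the highest voice receives the thinnest face and longest hair, the lowest voice the roundest face and shortest hair, and the middle speaker an intermediate shape.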

Thereafter, based on the positions of the plurality of sound sources estimated by the sound source position estimation unit 295 and the plurality of voiceprints determined by the voiceprint determination unit 297, the movement determination unit 300 determines whether any of those voiceprints belongs to a moving sound source that moves across two or more of the first to fourth quadrants (step S305). Specifically, the movement determination unit 300 determines whether the sound source of each quadrant determined by the display position determination unit 296 and the position of each sound source estimated by the sound source position estimation unit 295 differ as time passes, and if they do, determines that there is a moving sound source. If the movement determination unit 300 determines that there is a sound source that moves between quadrants (step S305: Yes), the information acquisition device 2 proceeds to step S306, described later. If the movement determination unit 300 determines that there is no such sound source (step S305: No), the information acquisition device 2 returns to the subroutine of FIG. 9 described above.

In step S306, the voice identification unit 299 identifies the icon corresponding to the sound source determined by the movement determination unit 300. Specifically, the voice identification unit 299 identifies the icon of the sound source that the movement determination unit 300 determined to move across two or more of the first to fourth quadrants.

Next, the sound source information generation unit 298 adds movement information to the icon of the sound source identified by the voice identification unit 299 (step S307). Specifically, as shown in FIG. 13, the sound source information generation unit 298 adds a movement icon U1 (movement information) to the icon O2 of the sound source identified by the voice identification unit 299. Although the sound source information generation unit 298 adds the movement icon U1 to the icon O2 of the sound source that moved, it may instead change, for example, the color or the shape of the icon O2. The sound source information generation unit 298 may also add characters or figures to the icon O2, or add the length of time the sound source was moving, the timing at which it moved, and the like. After step S307, the information acquisition device 2 returns to the subroutine of FIG. 9 described above.
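The movement determination of step S305 can be pictured with the following minimal sketch, which is an assumption-laden illustration rather than the disclosed implementation. It assumes that each speaker has a time-ordered list of estimated (x, y) positions with the information acquisition device at the origin, and that quadrants follow the usual mathematical numbering; the actual layout of the quadrants H1 to H4 in the figures is not restated here. The names `quadrant` and `moving_speakers` are hypothetical.

```python
from __future__ import annotations


def quadrant(x: float, y: float) -> int:
    """Return the quadrant (1 to 4) of a point, with the device at the origin.

    Points on an axis are assigned to an adjacent quadrant; this tie-breaking
    rule is an assumption made only for this example.
    """
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4


def moving_speakers(tracks: dict[str, list[tuple[float, float]]]) -> set[str]:
    """Return the speakers whose estimated positions span two or more quadrants."""
    return {
        speaker
        for speaker, points in tracks.items()
        if len({quadrant(x, y) for x, y in points}) >= 2
    }
```

A speaker returned by `moving_speakers` would be the one whose icon receives movement information such as the movement icon U1 in step S307.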

Returning to FIG. 9, the description continues from step S208.
In step S208, if the determination has been completed for all quadrants (step S208: Yes), the information acquisition device 2 returns to the main routine of FIG. 3. If the determination has not been completed for all quadrants (step S208: No), the information acquisition device 2 returns to step S203 described above.

Returning to FIG. 3, the description continues from step S111.
In step S111, the display control unit 303 causes the display unit 23 to display the plurality of pieces of sound source position information generated in step S110 described above. Specifically, as shown in FIG. 14, the display control unit 303 superimposes each of the icons O1 to O3 on the corresponding one of the first quadrant H1 to the fourth quadrant H4 in the display area of the display unit 23. This lets the user intuitively grasp, even during recording, where each speaker (sound source) is located as seen with the information acquisition device 2 at the center. Furthermore, because the movement icon U1 is superimposed on the icon O2, the user can intuitively identify the speaker who moved during recording.

In step S112, if an instruction signal to end recording is input from the input unit 25 (step S112: Yes), the information acquisition device 2 proceeds to step S113, described later. If no instruction signal to end recording is input from the input unit 25 (step S112: No), the information acquisition device 2 returns to step S103 described above.

In step S113, an audio file is generated and recorded in the audio file recording unit 262. The audio file associates the following with one another: the audio data processed by the signal processing unit 291; the sound source position information estimated by the sound source position estimation unit 295; the plurality of pieces of sound source information generated by the sound source information generation unit 298; the appearance position identified by the voice identification unit 299; position information on the position of the index added by the index adding unit 301, or time information on the time in the audio data at which the index was added; and the speech text data generated by the text conversion unit 292. After step S113, the information acquisition device 2 proceeds to step S114, described later. Note that the audio file generation unit 302 may generate an audio file that associates the audio data processed by the signal processing unit 291 with candidate timing information, and have it recorded in the audio file recording unit 262. The candidate timing information designates a predetermined period, starting when the input unit 25 accepts the input of an instruction signal, as a candidate timing for having the text conversion unit 292 generate speech text data. In other words, the audio file generation unit 302 may generate the audio file by associating the audio data with candidate timing information that marks a predetermined period from the moment the input unit 25 accepts an instruction signal as a candidate timing, and have it recorded in the audio file recording unit 262.
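To make the association in step S113 concrete, the following sketch shows one hypothetical way the items could be bundled as metadata alongside the audio data. The description does not specify a file layout, so the class name `AudioFileMetadata`, every field name, the units, and the use of JSON are assumptions made purely for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class AudioFileMetadata:
    """Hypothetical container for the items associated in step S113."""
    audio_path: str                                   # processed audio data
    source_positions: dict[str, tuple[float, float]]  # estimated position per sound source
    source_info: dict[str, str]                       # e.g. an icon description per sound source
    index_times_s: list[float] = field(default_factory=list)  # times at which an index was added
    candidate_timings_s: list[tuple[float, float]] = field(default_factory=list)  # (start, end)
    transcript: str = ""                              # speech text data


def to_json(meta: AudioFileMetadata) -> str:
    """Serialize the metadata; tuples become JSON arrays."""
    return json.dumps(asdict(meta), ensure_ascii=False)
```

Storing candidate timings as (start, end) pairs in seconds is one way to represent "a predetermined period from the moment an instruction signal is accepted" without running text conversion at recording time.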

Next, if an instruction signal to turn off the power is input from the input unit 25 (step S114: Yes), the information acquisition device 2 ends this process. If no instruction signal to turn off the power is input from the input unit 25 (step S114: No), the information acquisition device 2 returns to step S101 described above.

In step S101, if no instruction signal to start recording is input from the input unit 25 (step S101: No), the information acquisition device 2 proceeds to step S115.

Next, if an instruction signal to play back an audio file is input from the input unit 25 (step S115: Yes), the information acquisition device 2 proceeds to step S116, described later. If no instruction signal to play back an audio file is input from the input unit 25 (step S115: No), the information acquisition device 2 proceeds to step S122.

In step S116, if an audio file is selected by operating the input unit 25 (step S116: Yes), the information acquisition device 2 proceeds to step S117, described later. If the input unit 25 is not operated and no audio file is selected (step S116: No), the information acquisition device 2 proceeds to step S114.

In step S117, the display control unit 303 causes the display unit 23 to display the plurality of pieces of sound source position information stored in the audio file selected via the input unit 25.

Next, if one of the icons in the plurality of pieces of audio position information displayed by the display unit 23 is touched via the touch panel 251 (step S118: Yes), the output unit 28 plays back and outputs the audio data corresponding to that icon (step S119).

Thereafter, if an instruction signal to end playback of the audio file is input from the input unit 25 (step S120: Yes), the information acquisition device 2 proceeds to step S114. If no instruction signal to end playback of the audio file is input from the input unit 25 (step S120: No), the information acquisition device 2 returns to step S117 described above.

In step S118, if none of the icons in the plurality of pieces of audio position information displayed by the display unit 23 is touched via the touch panel 251 (step S118: No), the output unit 28 plays back the audio data (step S121). After step S121, the information acquisition device 2 proceeds to step S120.
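The branch between steps S119 and S121 can be sketched as a simple selection over the recorded audio. This sketch assumes, hypothetically, that the audio data has already been divided into time segments each labeled with the icon of its sound source; the description does not state how the audio corresponding to an icon is stored, and the function name `segments_to_play` is an invented example.

```python
from __future__ import annotations


def segments_to_play(
    segments: list[tuple[float, float, str]],
    touched_icon: str | None = None,
) -> list[tuple[float, float]]:
    """Select the (start_s, end_s) ranges to play back.

    segments holds (start_s, end_s, icon_id) entries in time order.
    """
    if touched_icon is None:
        # Step S121: no icon was touched, so play back all of the audio data.
        return [(start, end) for start, end, _ in segments]
    # Step S119: play back only the audio corresponding to the touched icon.
    return [(start, end) for start, end, icon in segments if icon == touched_icon]
```

Touching the icon O2, for example, would then yield only the ranges attributed to that speaker.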

In step S122, if an instruction signal to transmit an audio file is input by operating the input unit 25 (step S122: Yes), the communication unit 27 transmits the audio file to the information processing device 3 in accordance with a predetermined communication standard (step S123). After step S123, the information acquisition device 2 proceeds to step S114.

In step S122, if no instruction signal to transmit an audio file is input by operating the input unit 25 (step S122: No), the information acquisition device 2 proceeds to step S114.

[Processing of the information processing device]
Next, the processing executed by the information processing device 3 will be described. FIG. 15 is a flowchart showing an outline of the processing executed by the information processing device 3.

As shown in FIG. 15, when the user performs documentation work, that is, creating a summary while playing back audio data (step S401: Yes), the communication unit 31 first acquires an audio file from the information acquisition device 2 connected to the information processing device 3 (step S402).

Next, the display control unit 366 causes the display unit 35 to display a document creation screen (step S403). Specifically, as shown in FIG. 16, the display control unit 366 causes the display unit 35 to display the document creation screen W1. The document creation screen W1 includes a display area R1, a display area R2, and a display area R3. The display area R1 displays the text corresponding to the text data that the user transcribes from the playback of the audio data by operating the input unit 32. The display area R2 has a time bar T1 corresponding to the audio data stored in the audio file, a display area K1 that displays a keyword input through operation of the input unit 32, and a plurality of icons O1 to O3 indicating sound source information on the sound sources at the time of recording. The display area R3 has a time bar T2 corresponding to the audio data stored in the audio file and a display area K2 that displays the positions at which the keyword appears.

Thereafter, if a playback operation for the audio data is performed via the input unit 32 (step S404: Yes), the audio control unit 365 causes the audio reproduction unit 34 to play back the audio data stored in the audio file (step S405).

Next, the keyword determination unit 363 determines whether the audio file contains a keyword candidate (step S406). Specifically, the keyword determination unit 363 determines whether one or more pieces of speech text data serving as keywords are stored in the audio file. If the keyword determination unit 363 determines that the audio file contains a keyword candidate (step S406: Yes), the keyword setting unit 364 sets the keyword candidate stored in the audio file as the keyword used to search for the positions at which it appears in the audio data (step S407). Specifically, the keyword setting unit 364 sets the one or more pieces of speech text data stored in the audio file as keywords for searching for their appearance positions in the audio data. After step S407, the information processing device 3 proceeds to step S410, described later. If the keyword determination unit 363 determines that the audio file contains no keyword candidate (step S406: No), the information processing device 3 proceeds to step S408, described later. Once a meeting is over, people often remember the general idea of a keyword but forget the exact word itself, so the keyword determination unit 363 may search for synonyms, for example by using a dictionary that records words with similar meanings.

In step S408, if the input unit 32 has been operated (step S408: Yes) and a search for a specific keyword appearing in the audio data is requested via the input unit 32 (step S409: Yes), the information processing device 3 proceeds to step S410, described later. If the input unit 32 has been operated (step S408: Yes) but no search for a specific keyword appearing in the audio data is requested via the input unit 32 (step S409: No), the information processing device 3 proceeds to step S416, described later.

In step S408, if the input unit 32 has not been operated (step S408: No), the information processing device 3 proceeds to step S414, described later.

In step S410, the information processing control unit 36 executes a keyword determination process that determines the time at which the keyword appears in the audio data.

[Keyword determination process]
FIG. 17 is a flowchart showing an outline of the keyword determination process in step S410 of FIG. 15 described above.

As shown in FIG. 17, if the automatic mode, which automatically detects a specific keyword appearing in the audio data, is set (step S501: Yes), the information processing device 3 proceeds to step S502, described later. If the automatic mode is not set (step S501: No), the information processing device 3 proceeds to step S513, described later.

The text conversion unit 361 decomposes the audio data into speech waveforms (step S502) and generates speech text data by applying a Fourier transform to the decomposed speech waveforms (step S503).

Next, the keyword determination unit 363 determines whether the speech text data that the text conversion unit 361 obtained through the Fourier transform matches any of the plurality of phonemes included in the phoneme dictionary data recorded by the speech-to-text dictionary data recording unit 332 (step S504). Specifically, the keyword determination unit 363 determines whether the result of the Fourier transform performed by the text conversion unit 361 matches the waveform of any of those phonemes. Because pronunciation has individual habits and differences, the match need not be exact; the keyword determination unit 363 may instead judge whether the similarity is high. Also, because different people express the same thing in different ways, a search using synonyms may be performed as needed. If the keyword determination unit 363 determines that the result of the Fourier transform matches (has high similarity to) one of the phonemes included in the phoneme dictionary data (step S504: Yes), the information processing device 3 proceeds to step S506, described later. If the keyword determination unit 363 determines that the result matches none of the phonemes (has low similarity) (step S504: No), the information processing device 3 proceeds to step S505, described later.

In step S505, the text conversion unit 361 changes the waveform width over which the Fourier transform is applied to the decomposed speech waveform. After step S505, the information processing device 3 returns to step S503.

In step S506, the text conversion unit 361 performs phonemization, adopting the phoneme that the keyword determination unit 363 determined to match as the phoneme for the Fourier transform result.

Next, the text conversion unit 361 creates a phoneme set made up of a plurality of phonemes (step S507).

Thereafter, the keyword determination unit 363 determines whether the phoneme set created by the text conversion unit 361 matches (has high similarity to) any of the plurality of words included in the speech-to-text dictionary data recorded by the speech-to-text dictionary data recording unit 332 (step S508). If the keyword determination unit 363 determines that the phoneme set matches (has high similarity to) one of those words (step S508: Yes), the information processing device 3 proceeds to step S510, described later. If the keyword determination unit 363 determines that the phoneme set matches none of those words (has low similarity) (step S508: No), the information processing device 3 proceeds to step S509, described later.

In step S509, the text conversion unit 361 changes the phoneme set made up of a plurality of phonemes. For example, the text conversion unit 361 changes the phoneme set by decreasing or increasing the number of phonemes. After step S509, the information processing device 3 returns to step S508 described above. Processing that includes steps S502 to S509 corresponds to the voice recognition process described above.
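The loop between steps S507, S508, and S509 can be illustrated with a toy sketch that operates on phonemes that have already been recognized; it does not model the Fourier transform or the waveform-width adjustment of steps S502 to S505. The lexicon format, the similarity threshold, the maximum set size, and the function names are assumptions chosen for this example, and the similarity measure is the `SequenceMatcher.ratio()` from Python's standard `difflib` module rather than anything named in the description.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: list[str], b: list[str]) -> float:
    """Similarity between two phoneme sequences, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def phonemes_to_words(
    phonemes: list[str],
    lexicon: dict[str, list[str]],
    threshold: float = 0.8,
    max_set_size: int = 6,
) -> list[str]:
    """Group phonemes into dictionary words by varying the phoneme set size."""
    words: list[str] = []
    i = 0
    while i < len(phonemes):
        best: tuple[float, str, int] | None = None
        # S507/S509: try phoneme sets of different sizes starting at position i.
        for n in range(min(max_set_size, len(phonemes) - i), 0, -1):
            chunk = phonemes[i:i + n]
            # S508: compare the set with every word in the dictionary.
            for word, pronunciation in lexicon.items():
                score = similarity(chunk, pronunciation)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, word, n)
        if best is None:
            i += 1  # no sufficiently similar word; skip this phoneme
        else:
            words.append(best[1])
            i += best[2]
    return words
```

Choosing the highest-scoring set size, rather than the first size that clears the threshold, keeps a slightly too long set from swallowing the first phoneme of the next word.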

In step S510, the identification unit 362 determines whether the character string of the keyword input via the input unit 32 matches (has high similarity to) the character string of the speech text data generated by the text conversion unit 361. In this case, the identification unit 362 may instead determine whether the character string of the keyword set by the keyword setting unit 364 matches (has high similarity to) the character string of the speech text data generated by the text conversion unit 361. If the identification unit 362 determines that the keyword character string matches (has high similarity to) the character string of the speech text data (step S510: Yes), the information processing device 3 proceeds to step S511, described later. If the identification unit 362 determines that the keyword character string does not match (has low similarity to) the character string of the speech text data (step S510: No), the information processing device 3 proceeds to step S512, described later.

In step S511, the identification unit 362 identifies the time at which the keyword appears in the audio data. Specifically, the identification unit 362 identifies the period during which the character string of the keyword input via the input unit 32 matches (has high similarity to) the character string of the speech text data generated by the text conversion unit 361 as the appearance position (appearance time) of the keyword in the audio data. Because pronunciation has individual habits and differences, the match need not be exact; the identification unit 362 may instead judge whether the similarity is high. Also, because different people express the same thing in different ways, the identification unit 362 may perform a search using synonyms as needed. This makes it easier to determine, in real time during playback, the keywords that the user will want to listen to again later. Once playback has finished, the user often remembers the general idea of a word but has forgotten the exact word itself. Identifying these positions therefore makes it clear which timings deserve a closer look in a later search. Such a timing can be called a candidate timing: at this timing, the discussion is likely to have used an important keyword, a synonym, or a word with a similar nuance. Making the audio data at these timings visible as text with priority helps in grasping the discussion as a whole, so the identification unit 362 may have the text conversion unit 361 perform text conversion to generate speech text data. Note that in step S511 the identification unit 362 does not necessarily have to go as far as text conversion; it may simply record the candidate timing in association with the audio data as a timing to search intensively, for example as the number of minutes and seconds from the start of recording. One way to do this is to record the candidate timing information as metadata when the audio file is created.
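The similarity-based matching of steps S510 and S511, including the optional synonym search, could look roughly like the following. This is a minimal sketch under stated assumptions: the recognized text is assumed to be available as a list of words with start and end times, the similarity threshold of 0.8 is arbitrary, the synonym table is supplied by the caller, and `find_keyword_times` is a hypothetical name. Character-level similarity is computed with `difflib.SequenceMatcher` from the Python standard library, which is a stand-in for whatever measure an actual implementation would use.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def find_keyword_times(
    timed_words: list[tuple[float, float, str]],
    keyword: str,
    synonyms: dict[str, list[str]] | None = None,
    threshold: float = 0.8,
) -> list[tuple[float, float]]:
    """Return the (start_s, end_s) periods in which the keyword appears.

    A word counts as a hit when it is sufficiently similar to the keyword
    or to one of its synonyms; an exact match is not required.
    """
    targets = [keyword] + list((synonyms or {}).get(keyword, []))
    hits = []
    for start, end, text in timed_words:
        if any(SequenceMatcher(None, text, t).ratio() >= threshold for t in targets):
            hits.append((start, end))
    return hits
```

The returned periods correspond to the appearance positions that could be recorded as candidate timing metadata, for example as offsets in seconds from the start of recording.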

Next, the document generation unit 367 adds the keyword appearance position identified by the identification unit 362 to the audio data and records it (step S512). After step S512, the information processing device 3 returns to the main routine of FIG. 15 described above.

ステップＳ５１３において、音声データで出現する特定のキーワードをユーザが手動で検出する手動モードが設定されている場合（ステップＳ５１３：Ｙｅｓ）、音声再生部３４は、特定フレーズまで音声データを再生する（ステップＳ５１４）。 In step S513, when the manual mode in which the user manually detects a specific keyword appearing in audio data is set (step S513: Yes), the audio reproduction unit 34 reproduces audio data up to a specific phrase (step S513). S514).

続いて、入力部３２から特定フレームまで繰り返し操作を指示する指示信号が入力された場合（ステップＳ５１５：Ｙｅｓ）、情報処理装置３は、上述したステップＳ５１４に戻る。これに対して、入力部３２から特定フレームまで繰り返し操作を指示する指示信号が入力されていない場合（ステップＳ５１５：Ｎｏ）、情報処理装置３は、後述するステップＳ５１６へ移行する。 Subsequently, when an instruction signal instructing repeat operation to the specific frame is input from the input unit 32 (step S515: Yes), the information processing device 3 returns to step S514 described above. On the other hand, when the instruction signal instructing the repeat operation to the specific frame is not input from the input unit 32 (step S515: No), the information processing device 3 proceeds to step S516 described later.

ステップＳ５１３において、音声データで出現する特定のキーワードをユーザが手動で検出する手動モードが設定されていない場合（ステップＳ５１３：Ｎｏ）、情報処理装置３は、ステップＳ５１２へ移行する。 In step S513, when the manual mode in which the user manually detects a specific keyword appearing in the voice data is not set (step S513: No), the information processing device 3 proceeds to step S512.

ステップＳ５１６において、入力部３２を介してキーワードを入力する操作があった場合（ステップＳ５１６：Ｙｅｓ）、テキスト化部３６１は、入力部３２の操作に応じてキーワードの単語化を行う（ステップＳ５１７）。 In step S516, when there is an operation of inputting a keyword through the input unit 32 (step S516: Yes), the text conversion unit 361 transcribes the keyword according to the operation of the input unit 32 (step S517). .

続いて、ドキュメント生成部３６７は、入力部３２を介してキーワードの入力があった時刻の音声データ上にインデックスを付加して記録する（ステップＳ５１８）。ステップＳ５１８の後、情報処理装置３は、ステップＳ５１２へ移行する。 Subsequently, the document generation unit 367 adds an index to the voice data at the time when the keyword is input through the input unit 32 and records the index (step S518). After step S518, the information processing device 3 proceeds to step S512.

ステップＳ５１６において、入力部３２を介してキーワードを入力する操作がなかった場合（ステップＳ５１６：Ｎｏ）、情報処理装置３は、ステップＳ５１２へ移行する。 In step S516, when there is no operation to input a keyword via the input unit 32 (step S516: No), the information processing device 3 proceeds to step S512.

Returning to FIG. 15, the description continues from step S411.
In step S411, the display control unit 366 adds an index at the position, on the time bar displayed by the display unit 35, where the keyword identified by the identification unit 362 appears, and causes the display unit 35 to display it. Specifically, as shown in FIG. 18, the display control unit 366 adds an index B1 at the position on the time bar T2 displayed by the display unit 35 where the keyword identified by the identification unit 362, for example "confirm", appears, and causes the display unit 35 to display it. More specifically, the display control unit 366 adds (1) as the index B1 in the vicinity of the time bar T2 at the position where the keyword appears. This allows the user to intuitively grasp where the desired keyword appears. The display control unit 366 may instead superimpose (1) as the index B1 on the time bar T2 at the appearance position, may superimpose a figure or text data as the index B1, or may display the time bar T2 at the appearance position in a color distinguishable from the other regions. Of course, the display control unit 366 may also cause the display unit 35 to display the time of the position where the keyword identified by the identification unit 362 appears.
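As a rough illustration of step S411, the following sketch converts keyword appearance times into marker positions along a time bar. The function name `index_markers`, the pixel-based layout parameters, and the rule of labelling keywords (1), (2), ... in the order they were set are assumptions made for this example only.

```python
def index_markers(times_by_keyword, duration, bar_x, bar_width):
    """Return (label, x_pixel) markers for every keyword appearance.

    times_by_keyword: dict mapping each keyword to its appearance times
    in seconds. duration is the length of the voice data in seconds.
    bar_x and bar_width describe the time bar in pixels (hypothetical).
    """
    markers = []
    for number, (keyword, times) in enumerate(times_by_keyword.items(), start=1):
        for t in times:
            # Clamp so that out-of-range times still land on the bar.
            fraction = min(max(t / duration, 0.0), 1.0)
            markers.append((f"({number})", bar_x + round(fraction * bar_width)))
    # Sort left to right along the bar.
    return sorted(markers, key=lambda marker: marker[1])


markers = index_markers({"confirm": [30.0, 90.0]}, duration=120.0,
                        bar_x=10, bar_width=400)
```

A renderer could draw each label next to or on top of the bar at the returned x coordinate, or recolor that part of the bar instead.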

Further, as shown in FIG. 19, when the user sets three keywords, for example "confirm", "AB company" and "delivery date", the display control unit 366 adds (1), (2) and (3) as an index B1, an index B2 and an index B3 at the positions on the time bar T2 displayed by the display unit 35 where each of the three keywords identified by the identification unit 362 appears, and causes the display unit 35 to display them. In this case, the display control unit 366 may add an index at a position on the time bar T2 where all three keywords identified by the identification unit 362 appear within a predetermined time (for example, within 10 seconds), and cause the display unit 35 to display it. At this time, the display control unit 366 may add the index at the position on the time bar T2 where the first of the keywords appears. This allows the user to intuitively grasp where the plurality of desired keywords appear in the voice data.
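The co-occurrence condition for multiple keywords can be sketched as follows, assuming the appearance times of each keyword are already known. The function name `cooccurrence_positions` and the default window of 10 seconds are illustrative assumptions; the returned value is the time of the first keyword in each qualifying window, which is where the index would be placed.

```python
def cooccurrence_positions(times_by_keyword, window=10.0):
    """Return start times at which every keyword appears within `window` seconds.

    times_by_keyword: dict mapping each keyword to its appearance times
    in seconds (hypothetical input format).
    """
    events = sorted((t, kw) for kw, times in times_by_keyword.items() for t in times)
    needed = set(times_by_keyword)
    results = []
    for i, (start, _) in enumerate(events):
        seen = set()
        # Collect the keywords heard from this event until the window closes.
        for t, kw in events[i:]:
            if t - start > window:
                break
            seen.add(kw)
        if seen == needed:
            results.append(start)
    return results


times = {
    "confirm": [5.0, 100.0],
    "AB company": [8.0, 200.0],
    "delivery date": [12.0, 300.0],
}
positions = cooccurrence_positions(times)
```

With these sample times only the cluster starting at 5.0 seconds contains all three keywords within 10 seconds, so a single index would be shown there.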

Subsequently, when either an index on the time bar or one of the sound sources (icons) is designated via the input unit 32 (step S412: Yes), the voice control unit 365 skips the voice data to the time corresponding to the designated index on the time bar, or to the time corresponding to the designated sound source, and causes the audio reproduction unit 34 to reproduce it (step S413). Specifically, as shown in FIG. 20, when the user designates the index (1) indicated by arrow A via the input unit 32, the voice control unit 365 skips the voice data to the time corresponding to the designated index on the time bar T2 and causes the audio reproduction unit 34 to reproduce it. This allows the user to intuitively grasp where the desired keyword appears and to transcribe the desired portion.

Thereafter, when an operation to end documentation is performed via the input unit 32 (step S414: Yes), the document generation unit 367 generates a document file in which the document input by the user via the input unit 32, the voice data, and the appearance positions identified by the identification unit 362 are associated with one another, and records it in the recording unit 33 (step S415). After step S415, the information processing device 3 ends this process. In contrast, when no operation to end documentation is performed via the input unit 32 (step S414: No), the information processing device 3 returns to step S408 described above.

In step S412, when no index on the time bar is designated via the input unit 32 (step S412: No), the information processing device 3 proceeds to step S414.

In step S416, the text conversion unit 361 converts the text data into a document in accordance with the operation of the input unit 32. After step S416, the information processing device 3 proceeds to step S412.

In step S404, when no reproduction operation for reproducing the voice data is performed via the input unit 32 (step S404: No), the information processing device 3 ends this process.

In step S401, when the user does not perform the documentation work of creating a summary while reproducing the voice data (step S401: No), the information processing device 3 executes processing corresponding to a mode other than the documentation work (step S417). After step S417, the information processing device 3 ends this process.

According to the first embodiment of the present invention described above, the display control unit 303 causes the display unit 23 to display sound source position information on the position of each of the plurality of sound sources based on the estimation result estimated by the sound source position estimation unit 295, so that the positions of the speakers at the time of recording can be intuitively grasped.

Further, according to the first embodiment of the present invention, the display control unit 303 causes the display unit 23 to display the sound source position information based on the determination result determined by the display position determination unit 296, so that the positions of the speakers, adapted to the shape of the display unit 23, can be intuitively grasped.

Further, according to the first embodiment of the present invention, the display position determination unit 296 determines the display position of each of the plurality of sound sources with the information acquisition device 2 taken as the center of the display area of the display unit 23, so that the positions of the speakers relative to the information acquisition device 2 can be intuitively grasped.
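One way to picture the display position determination is a mapping from estimated source offsets to screen coordinates with the device at the center of the display area. The sketch below is only an assumption about how such a mapping might look: the function name `display_positions`, the metre-based input, the margin, and the per-axis scaling used to fit the shape of the display area are all hypothetical.

```python
def display_positions(sources, width, height, margin=10):
    """Map source (x, y) offsets in metres to pixel coordinates.

    The device sits at the centre of a width x height display area, and
    each axis is scaled independently so the farthest source stays
    `margin` pixels inside the edge (an illustrative choice).
    """
    cx, cy = width / 2, height / 2
    max_x = max((abs(x) for x, _ in sources), default=0.0) or 1.0
    max_y = max((abs(y) for _, y in sources), default=0.0) or 1.0
    scale_x = (cx - margin) / max_x
    scale_y = (cy - margin) / max_y
    # Screen y grows downward, so the y offset is subtracted.
    return [(round(cx + x * scale_x), round(cy - y * scale_y)) for x, y in sources]


icons = display_positions([(1.0, 2.0), (-1.0, 0.0)], width=200, height=100)
```

A wide display therefore spreads the icons horizontally while a tall one spreads them vertically, with the device icon fixed at the center.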

Further, according to the first embodiment of the present invention, the display control unit 303 causes the display unit 23 to display, as the sound source position information, the plurality of pieces of sound source information generated by the sound source information generation unit 298, so that the gender and number of the speakers who took part in the recording can be intuitively grasped.

Further, according to the first embodiment of the present invention, the audio file generation unit 302 generates an audio file in which the following are associated with one another, and records it in the audio file recording unit 262: the voice data; the sound source position information estimated by the sound source position estimation unit 295; the plurality of pieces of sound source information generated by the sound source information generation unit 298; the appearance positions identified by the voice identification unit 299; the position information on the positions of the indexes added by the index addition unit 301, or the time information on the times in the voice data at which the indexes were added; and the voice text data generated by the text conversion unit 292. The creator of a summary can therefore locate the desired position when creating the summary on the information processing device 3.
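The association recorded in the audio file can be pictured as a small metadata record stored alongside the voice data. The sketch below is a hypothetical layout, not the file format of the specification: the class names `SoundSource` and `AudioFileMetadata`, their field names, and the JSON serialization are all assumptions for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SoundSource:
    label: str            # e.g. an icon or speaker label (hypothetical)
    position: tuple       # estimated (x, y) position relative to the device
    moved: bool = False   # set when the source was judged to have moved


@dataclass
class AudioFileMetadata:
    audio_path: str                                        # the voice data
    sources: list = field(default_factory=list)            # source info and positions
    appearance_times: dict = field(default_factory=dict)   # label -> [seconds]
    index_times: list = field(default_factory=list)        # index times in seconds
    transcript: str = ""                                   # voice text data

    def to_json(self):
        # asdict() converts nested dataclasses; tuples become JSON arrays.
        return json.dumps(asdict(self), ensure_ascii=False)


meta = AudioFileMetadata(
    audio_path="rec.wav",
    sources=[SoundSource("A1", (0.5, 1.0))],
    appearance_times={"A1": [3.2]},
    index_times=[12.5],
    transcript="hello",
)
serialized = meta.to_json()
```

Keeping every item in one record means the information processing device can jump from an index or a speaker icon to the matching time without re-analyzing the recording.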

Further, according to the first embodiment of the present invention, the sound source information generation unit 298 adds information indicating movement to the sound source information of a sound source determined by the movement determination unit 300 to be moving, so that a speaker who moved during recording can be intuitively grasped.

(Second Embodiment)
Next, a second embodiment of the present invention will be described. In the first embodiment described above, the positions of the plurality of sound sources are estimated based on the two pieces of voice data generated by the first sound collection unit 20 and the second sound collection unit 21, and are displayed on the display unit 23. In the second embodiment, voice data generated by an external microphone inserted into the external input detection unit 22 is additionally used to estimate and display the positions of the plurality of sound sources. The configuration of the information acquisition device according to the second embodiment is described below. Components identical to those of the information acquisition device 2 according to the first embodiment are denoted by the same reference signs, and their description is omitted.

FIG. 21 is a perspective view showing a schematic configuration of the information acquisition device according to the second embodiment of the present invention before an external microphone is attached. FIG. 22 is a perspective view showing a schematic configuration of the information acquisition device according to the second embodiment of the present invention after the external microphone is attached.

As shown in FIGS. 21 and 22, the plug 101 of the external microphone 100 is inserted into the external input detection unit 22 of the information acquisition device 2. The external microphone 100 is a rectangular parallelepiped, and a third sound collection unit 102 and a fourth sound collection unit 103 are provided on its side surfaces. The external microphone 100 is inserted in a direction orthogonal to the longitudinal direction of the upper surface of each of the first sound collection unit 20 and the second sound collection unit 21 provided in the information acquisition device 2, and is detachably attached to the information acquisition device 2. The external microphone 100 is configured using any of a unidirectional microphone, an omnidirectional microphone, a bidirectional microphone, a stereo microphone capable of collecting left and right sound, or the like. The second embodiment describes the case where a stereo microphone made up of the third sound collection unit 102 and the fourth sound collection unit 103 is used as the external microphone 100. The external microphone 100 may also be one that lets the user choose frequency characteristics different from, or performance better than, those of the built-in microphones (the first sound collection unit 20 and the second sound collection unit 21), or one that can be placed away from the information acquisition device 2 on an extension cable or attached to the user's collar.

The external input detection unit 22 generates two pieces of voice data (third voice data and fourth voice data) collected by the left and right microphones, respectively, and outputs them to the device control unit 29.

In this way, in the information acquisition device 2 into which the external microphone 100 is inserted, the sound source position estimation unit 295 estimates the positions of the plurality of sound sources using the plurality of pieces of voice data generated by the first sound collection unit 20, the second sound collection unit 21 and the external microphone 100. Specifically, as shown in FIG. 23, when the sound sources A1 and A2 are located at the same distance D1, the arrival time differences with which the audio signals reach the first sound collection unit 20 and the second sound collection unit 21 become equal, which makes it difficult for the sound source position estimation unit 295 to estimate the positions of the sound sources A1 and A2. When the external microphone 100 is inserted in a direction orthogonal to the longitudinal direction of the upper surface of each of the first sound collection unit 20 and the second sound collection unit 21 provided in the information acquisition device 2, however, a difference arises, as shown in FIG. 24, between the arrival time at which the audio signal reaches each of the first sound collection unit 20 and the second sound collection unit 21 and the arrival time at which it reaches the external microphone 100. The sound source position estimation unit 295 can therefore estimate the position of each sound source in the depth direction using the plurality of pieces of voice data generated by the first sound collection unit 20, the second sound collection unit 21 and the external microphone 100.
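The ambiguity and its resolution can be checked numerically with a small sketch. The layout below is an assumption for illustration, not the geometry of FIGS. 23 and 24: the built-in pair lies on the x axis, the external microphone is offset along y (taken here as the depth direction), and the two sources are mirror images in front of and behind the device. All coordinates and the name `tdoa` are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def tdoa(source, mic_a, mic_b):
    """Arrival time at mic_a minus arrival time at mic_b, in seconds."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / SPEED_OF_SOUND


# Hypothetical layout in metres.
MIC1, MIC2 = (-0.02, 0.0), (0.02, 0.0)   # built-in pair on the x axis
EXT = (0.0, 0.05)                        # external microphone, offset in depth

front = (0.5, 1.0)    # source in front of the device
back = (0.5, -1.0)    # mirror-image source behind the device

# The built-in pair alone cannot tell the two sources apart.
pair_front = tdoa(front, MIC1, MIC2)
pair_back = tdoa(back, MIC1, MIC2)

# The external microphone hears the front source earlier than MIC1 does,
# and the back source later, so the sign separates the two cases.
ext_front = tdoa(front, EXT, MIC1)
ext_back = tdoa(back, EXT, MIC1)
```

In this layout `pair_front` and `pair_back` are equal, while `ext_front` is negative and `ext_back` is positive, which is the extra arrival-time difference that makes a depth estimate possible.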

According to the second embodiment of the present invention described above, the positions of the speakers at the time of recording can be intuitively grasped.

Further, according to the second embodiment of the present invention, inserting the external microphone 100 in a direction orthogonal to the longitudinal direction of the upper surface of each of the first sound collection unit 20 and the second sound collection unit 21 provided in the information acquisition device 2 produces a difference between the arrival time at which the audio signal reaches each of the first sound collection unit 20 and the second sound collection unit 21 and the arrival time at which it reaches the external microphone 100. The sound source position estimation unit 295 can therefore estimate the position of each sound source in the depth direction using the plurality of pieces of voice data generated by the first sound collection unit 20, the second sound collection unit 21 and the external microphone 100.

Further, although the information acquisition device and the information processing device according to the present invention transmit and receive data bidirectionally via a communication cable, the present invention is not limited to this. The information processing device may acquire, via a server or the like, an audio file storing the voice data generated by the information acquisition device, and the information acquisition device may transmit an audio file storing the voice data to a server on a network.

Further, although the information processing device according to the present invention acquires the voice data by receiving an audio file storing the voice data from the information acquisition device, the present invention is not limited to this, and the voice data may be acquired via an external microphone or the like.

In the description of the flowcharts in this specification, expressions such as "first", "thereafter" and "subsequently" are used to indicate the order of processing between steps, but the order of processing required to carry out the present invention is not uniquely determined by those expressions. That is, the order of processing in the flowcharts described in this specification may be changed as long as no contradiction arises. Moreover, the processing is not limited to a program consisting of such simple branches; branching may be decided by comprehensively evaluating a larger number of determination items. In that case, artificial intelligence techniques may be used in combination, such as machine learning that learns as the user is prompted to perform manual operations and the learning is repeated. Deep learning may also be performed by learning the operation patterns of many experts and incorporating more complex conditions.

Thus, the present invention may include various embodiments not described here, and various design changes and the like can be made within the scope of the technical idea specified by the claims.

1: transcriber system; 2: information acquisition device; 3: information processing device; 4: communication cable; 20: first sound collection unit; 21: second sound collection unit; 22: external input detection unit; 23: display unit; 24: clock; 25: input unit; 26: recording unit; 27: communication unit; 28: output unit; 29: device control unit; 31: communication unit; 32: input unit; 33: recording unit; 34: audio reproduction unit; 35: display unit; 36: information processing control unit; 100: external microphone; 101: plug; 102: third sound collection unit; 103: fourth sound collection unit; 251: touch panel; 261: program recording unit; 262: audio file recording unit; 291: signal processing unit; 292: text conversion unit; 293: text conversion control unit; 294: voice determination unit; 295: sound source position estimation unit; 296: display position determination unit; 297: voiceprint determination unit; 298: sound source information generation unit; 299: voice identification unit; 300: movement determination unit; 301: index addition unit; 302: audio file generation unit; 303: display control unit; 331: program recording unit; 332: speech-to-text dictionary data recording unit; 361: text conversion unit; 362: identification unit; 363: keyword determination unit; 364: keyword setting unit; 365: voice control unit; 366: display control unit; 367: document generation unit

Claims

An information acquisition device comprising:
a display unit capable of displaying an image;
a plurality of sound collection units provided at mutually different positions, each collecting sound emitted from each of a plurality of sound sources to generate voice data;
a sound source position estimation unit that estimates the position of each of the plurality of sound sources based on the voice data generated by each of the plurality of sound collection units; and
a display control unit that causes the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result estimated by the sound source position estimation unit.

The information acquisition device according to claim 1, further comprising a display position determination unit that determines the display position of each of the plurality of sound sources in the display area of the display unit based on the shape of the display area of the display unit and the estimation result estimated by the sound source position estimation unit,
wherein the display control unit causes the display unit to display the sound source position information based on the determination result determined by the display position determination unit.

The information acquisition device according to claim 2, further comprising:
a voiceprint determination unit that determines, based on the voice data, the volume of the voice uttered by each of a plurality of speakers; and
a sound source information generation unit that generates, as sound source information, icons schematically representing each of the plurality of speakers based on a comparison of the volumes of the plurality of speakers determined by the voiceprint determination unit.

The information acquisition device according to claim 2, further comprising a sound source information generation unit that generates sound source information in which icons schematically representing the speakers are made different from one another based on the voice data and on the length and volume of the voice uttered by each of a plurality of speakers.

The information acquisition device according to claim 2, wherein the display position determination unit determines the display position when the center of the display area of the display unit is the information acquisition device.

The information acquisition device according to claim 5, further comprising:
a voiceprint determination unit that determines a voiceprint of each of the plurality of sound sources based on the voice data; and
a sound source information generation unit that generates a plurality of pieces of sound source information on each of the plurality of sound sources based on the determination result determined by the voiceprint determination unit,
wherein the display control unit causes the display unit to display the plurality of pieces of sound source information as the sound source position information.

The information acquisition device according to claim 6, further comprising:
a voice identification unit that identifies an appearance position at which each voiceprint determined by the voiceprint determination unit appears in the voice data; and
an audio file generation unit that generates an audio file in which the voice data, the sound source position information, the plurality of pieces of sound source information, and the appearance positions are associated with one another, and records the audio file in a recording medium.

The information acquisition device according to claim 7, further comprising a movement determination unit that determines whether each of the plurality of sound sources is moving based on the estimation result estimated by the sound source position estimation unit and the determination result determined by the voiceprint determination unit,
wherein the sound source information generation unit adds information indicating movement to the sound source information of a sound source determined by the movement determination unit to be moving.

The information acquisition device according to any one of claims 1 to 8, wherein the plurality of sound collection units are detachable with respect to the information acquisition device.

The information acquisition device according to any one of claims 1 to 9, further comprising an external microphone that is detachably attachable to the information acquisition device.

A display method executed by an information acquisition device comprising a display unit capable of displaying an image and a plurality of sound collection units provided at mutually different positions, each collecting sound emitted from each of a plurality of sound sources to generate voice data, the display method comprising:
a sound source position estimation step of estimating the positions of the plurality of sound sources based on the voice data generated by each of the plurality of sound collection units; and
a display control step of causing the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result estimated in the sound source position estimation step.

A program causing an information acquisition device comprising a display unit capable of displaying an image and a plurality of sound collection units provided at mutually different positions, each collecting sound emitted from each of a plurality of sound sources to generate voice data, to execute:
a sound source position estimation step of estimating the positions of the plurality of sound sources based on the voice data generated by each of the plurality of sound collection units; and
a display control step of causing the display unit to display sound source position information on the position of each of the plurality of sound sources based on the estimation result estimated in the sound source position estimation step.