JP6392578B2

JP6392578B2 - Audio processing apparatus, audio processing method, and audio processing program

Info

Publication number: JP6392578B2
Application number: JP2014163742A
Authority: JP
Inventors: 浩次酒井
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2014-08-11
Filing date: 2014-08-11
Publication date: 2018-09-19
Anticipated expiration: 2034-08-11
Also published as: JP2016038546A

Description

The present invention relates to an audio processing apparatus, an audio processing method, and an audio processing program.

In recent years, IC recorders capable of recording and playing back audio, for example during meetings, have been put into practical use (see, for example, Patent Document 1).
Specifically, the IC recorder described in Patent Document 1 converts audio input via a microphone into audio data (digital data) and records the audio data in a memory. The IC recorder also converts the audio data recorded in the memory into an audio signal (analog signal) and outputs (plays back) audio based on that signal via a speaker.

During audio playback, such an IC recorder generally displays a playback screen like the following.
Specifically, the playback screen contains a time bar that has a time scale corresponding to the time from the start to the end of the recording and a slider that is placed on the time scale and indicates the playback position.
That is, by checking the playback screen (time bar) during playback, the user of the IC recorder can tell the playback position within the recorded audio data.

Patent Document 1: JP 2012-205086 A

However, a conventional playback screen contains nothing but the time bar. Consequently, unless the user actually listens to the played-back audio, the user cannot grasp the circumstances at the time of recording, such as who the speaker was and what state the speaker's tension was in.
There is therefore a demand for a technique that lets the user grasp the circumstances at the time of recording from the playback screen and thereby improves convenience.

The present invention has been made in view of the above, and an object thereof is to provide an audio processing apparatus, an audio processing method, and an audio processing program that improve convenience.

To solve the problem described above and achieve the object, an audio processing apparatus according to the present invention includes: an audio data acquisition unit that acquires audio data; an audio data analysis unit that analyzes the audio data and determines a high-tension component of the audio contained in the audio data; and a reference information generation unit that associates the high-tension component with the time at which the high-tension component occurs in the audio data and generates reference information used in generating a playback screen for the audio data. The audio data analysis unit analyzes the audio data and, for each predetermined time range in the audio data, identifies the speaker who uttered the audio contained in the audio data and determines the high-tension component. When the speaker cannot be identified, the audio data analysis unit estimates that the audio in the time range for which the speaker cannot be identified was uttered by the speaker identified in the time range immediately before or immediately after that time range.

An audio processing method according to the present invention is an audio processing method performed by an audio processing apparatus and includes: an audio data acquisition step of acquiring audio data; an audio data analysis step of analyzing the audio data and determining a high-tension component of the audio contained in the audio data; and a reference information generation step of associating the high-tension component with the time at which the high-tension component occurs in the audio data and generating reference information used in generating a playback screen for the audio data. In the audio data analysis step, the audio data is analyzed and, for each predetermined time range in the audio data, the speaker who uttered the audio contained in the audio data is identified and the high-tension component is determined. When the speaker cannot be identified, the audio in the time range for which the speaker cannot be identified is estimated to have been uttered by the speaker identified in the time range immediately before or immediately after that time range.
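The fallback for unidentified speakers can be illustrated with a minimal sketch. The function name `fill_unknown_speakers`, the use of `None` for an unidentified speaker, and the choice to try the preceding time range before the following one are assumptions made for illustration; the text above only states that either neighboring range may be used.

```python
from typing import List, Optional


def fill_unknown_speakers(speakers: List[Optional[str]]) -> List[Optional[str]]:
    """Estimate the speaker of each unidentified time range from a neighbor.

    `speakers[i]` is the speaker identified for time range i, or None when
    identification failed. Only speakers that were actually identified are
    used as the source of an estimate (an assumption of this sketch).
    """
    filled = list(speakers)
    for i, speaker in enumerate(speakers):
        if speaker is not None:
            continue
        if i > 0 and speakers[i - 1] is not None:
            # Use the speaker identified in the immediately preceding range.
            filled[i] = speakers[i - 1]
        elif i + 1 < len(speakers) and speakers[i + 1] is not None:
            # Otherwise use the speaker identified in the immediately following range.
            filled[i] = speakers[i + 1]
    return filled
```

A range whose two neighbors are both unidentified stays `None` in this sketch; how such a case should be handled is not specified in the text above.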

An audio processing program according to the present invention causes an audio processing apparatus to execute the audio processing method described above.

With the audio processing apparatus according to the present invention, an image based on the level of the speaker's tension is displayed on the playback screen, so the user can grasp the speaker's circumstances at the time of recording from the playback screen without actually listening to the played-back audio.

FIG. 1 is a block diagram showing the configuration of an electronic device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the electronic device shown in FIG. 1.
FIG. 3 is a flowchart showing the speaker display process (step S111) shown in FIG. 2.
FIG. 4 is a diagram showing an example of a situation in which the first and second audio data subject to the speaker display process (step S111) shown in FIGS. 2 and 3 are generated (recorded).
FIG. 5 is a diagram showing an example of the reference information generated when the speaker display process (step S111) is executed on the first and second audio data generated in the situation of FIG. 4.
FIG. 6 is a diagram showing an example of a speaker display playback screen generated based on the reference information shown in FIG. 5.
FIG. 7 is a diagram showing a modification of Embodiment 1 of the present invention.
FIG. 8 is a block diagram showing the configuration of an audio processing system according to Embodiment 2 of the present invention.
FIG. 9 is a flowchart showing the speaker display process (step S111) according to Embodiment 2 of the present invention.
FIG. 10 is a flowchart showing the operation of the server shown in FIG. 8.
FIG. 11A is a diagram showing a modification of the speaker display playback screen described in Embodiments 1 and 2 of the present invention.
FIG. 11B is a diagram showing a modification of the speaker display playback screen described in Embodiments 1 and 2 of the present invention.
FIG. 11C is a diagram showing a modification of the speaker display playback screen described in Embodiments 1 and 2 of the present invention.
FIG. 12 is a diagram showing a modification of the reference information described in Embodiments 1 and 2 of the present invention.
FIG. 13 is a diagram showing an example of a speaker display playback screen generated based on the reference information shown in FIG. 12.

Modes for carrying out the present invention (hereinafter referred to as embodiments) are described below with reference to the drawings. The present invention is not limited to the embodiments described below. In the drawings, identical parts are denoted by identical reference numerals.

(Embodiment 1)
[Configuration of the electronic device]
FIG. 1 is a block diagram showing the configuration of an electronic device 1 according to Embodiment 1 of the present invention.
The electronic device 1 is configured as an IC recorder, a digital camera, a digital video camera, a mobile phone, a tablet-type mobile device, or the like. The electronic device 1 analyzes audio data containing audio uttered by a speaker to determine a feature component of the audio (the speaker's tension), and displays a playback screen that shows, together with a time bar, the time at which the feature component occurred.
The following description of the configuration of the electronic device 1 focuses on the main parts of the present invention. As shown in FIG. 1, the electronic device 1 includes a first audio data generation unit 11, a second audio data generation unit 12, an operation unit 13, a display unit 14, a clock unit 15, a memory unit 16, a recording unit 17, an audio output unit 18, and a device-side control unit 19.

Under the control of the device-side control unit 19, the first audio data generation unit 11 generates first audio data based on input audio. As shown in FIG. 1, the first audio data generation unit 11 includes a first microphone 111, a first amplifier 112, and a first A/D conversion unit 113.
The first microphone 111 receives audio and converts it into an electrical signal. The first microphone 111 is located on the upper left side of the electronic device 1 as viewed from the front (see FIG. 4).
The first amplifier 112 receives the electrical signal from the first microphone 111, applies predetermined analog processing to it (such as noise reduction processing that reduces noise components and gain processing that increases the gain to maintain a constant output level), and outputs the result to the first A/D conversion unit 113.
The first A/D conversion unit 113 receives the electrical signal from the first amplifier 112, converts it into a digital signal (the first audio data) by A/D conversion, and outputs the digital signal to the device-side control unit 19.

Like the first audio data generation unit 11, the second audio data generation unit 12 generates second audio data based on input audio under the control of the device-side control unit 19. As shown in FIG. 1, the second audio data generation unit 12 includes a second microphone 121, a second amplifier 122, and a second A/D conversion unit 123, which are similar to the first microphone 111, the first amplifier 112, and the first A/D conversion unit 113 of the first audio data generation unit 11, respectively.
The second microphone 121 is located on the upper right side of the electronic device 1 as viewed from the front, that is, on the side opposite the first microphone 111 (see FIG. 4).

The operation unit 13 is configured using buttons, switches, a touch panel, or the like that accept user operations, and outputs an instruction signal corresponding to each user operation to the device-side control unit 19.
The operation unit 13 functions as the operation reception unit according to the present invention.
The display unit 14 is configured using a display panel such as a liquid crystal or organic EL (electroluminescence) panel. Under the control of the device-side control unit 19, the display unit 14 displays images such as the speaker display playback screen.
In addition to a timekeeping function, the clock unit 15 has a function of generating date and time information (hereinafter referred to as a time stamp) about the date and time at which audio data is generated by the first and second audio data generation units 11 and 12. The time stamp generated by the clock unit 15 is output to the device-side control unit 19.

The memory unit 16 temporarily stores the first and second audio data generated by the first and second audio data generation units 11 and 12, respectively, as well as information being processed by the device-side control unit 19.
The recording unit 17 records the various programs executed by the device-side control unit 19 (including the audio processing program) and the first and second audio data generated by the first and second audio data generation units 11 and 12, respectively. Under the control of the device-side control unit 19, the recording unit 17 also records the reference information generated by the device-side control unit 19 in association with the corresponding first and second audio data.

Under the control of the device-side control unit 19, the audio output unit 18 outputs audio based on the first and second audio data recorded in the recording unit 17. As shown in FIG. 1, the audio output unit 18 includes a D/A conversion unit 181, an amplifier 182, and a speaker 183.
The D/A conversion unit 181 converts each of the first and second audio data recorded in the recording unit 17 into an analog signal by D/A conversion and outputs the sum of the two analog signals to the amplifier 182.
The amplifier 182 receives the audio signal (sum signal) from the D/A conversion unit 181, applies predetermined analog processing such as amplification to it, and outputs the result to the speaker 183.
The speaker 183 receives the audio signal from the amplifier 182 and outputs audio based on that signal.
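The sum signal produced by the D/A conversion unit 181 is formed in the analog domain, but the idea can be illustrated digitally. The sketch below is only an analogy under stated assumptions: the function name `mix_to_sum_signal`, samples normalized to the range -1.0 to 1.0, and the clamping of the sum are all hypothetical and are not described in the text above.

```python
from typing import List, Sequence


def mix_to_sum_signal(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Add two channels sample by sample, as a digital stand-in for the sum signal.

    Samples are assumed to be normalized to [-1.0, 1.0]. The sum is clamped to
    that range so that it stays a valid sample (an assumption of this sketch).
    """
    if len(first) != len(second):
        raise ValueError("both channels must have the same number of samples")
    return [max(-1.0, min(1.0, a + b)) for a, b in zip(first, second)]
```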

The device-side control unit 19 is configured using a CPU (Central Processing Unit) or the like. In response to instruction signals from the operation unit 13 and the like, it issues instructions and transfers data to the units that make up the electronic device 1, thereby controlling the overall operation of the electronic device 1. As shown in FIG. 1, the device-side control unit 19 includes an audio data acquisition unit 191, an audio data analysis unit 192, a reference information generation unit 193, a playback screen generation unit 194, a display control unit 195, and an audio control unit 196.

The audio data acquisition unit 191 performs the following functions when the electronic device 1 has been set to the recording mode by a user operation on the operation unit 13 (such as operating a mode switch).
In response to a recording start operation on the operation unit 13 (such as pressing a record switch), the audio data acquisition unit 191 causes the first and second audio data generation units 11 and 12 to generate the first and second audio data and acquires that data. The audio data acquisition unit 191 then associates the time stamp generated by the clock unit 15 (date and time information about when the first and second audio data were generated) with the first and second audio data and stores them sequentially in the memory unit 16. In response to a recording end operation on the operation unit 13 (such as pressing a stop switch), the audio data acquisition unit 191 causes the first and second audio data generation units 11 and 12 to stop generating the first and second audio data and records the first and second audio data (including the time stamps) stored in the memory unit 16 in the recording unit 17.

The audio data analysis unit 192 analyzes the first and second audio data recorded in the recording unit 17 when the electronic device 1 has been set to the playback mode (the mode for playing back the first and second audio data) by a user operation on the operation unit 13 (such as operating a mode switch). As shown in FIG. 1, the audio data analysis unit 192 includes an object specifying unit 1921 and a feature component determination unit 1922.
The object specifying unit 1921 analyzes the first and second audio data to identify the speaker who uttered the audio contained in the first and second audio data.
The feature component determination unit 1922 analyzes the first and second audio data to determine the feature component (the speaker's tension) of the audio contained in the first and second audio data.

The reference information generation unit 193 performs the following functions when the electronic device 1 is set to the playback mode.
The reference information generation unit 193 associates the speaker identified by the object specifying unit 1921, the speaker's tension determined by the feature component determination unit 1922, the time stamp (generated by the clock unit 15) indicating the date and time at which the speaker's voice occurs, and the like with one another, and generates reference information used in generating the playback screen for the first and second audio data. Here, the speaker's tension (the feature portion of the audio) is assumed to indicate emotional excitement, but it may instead reflect how focused the conversation is (for example, one speaker explains while the others listen quietly). In that case, the speaker's tension may be determined by detecting the relative loudness of the detected voices of multiple people, the steadiness of the speaking pace (speaking as if patiently explaining), the speaking speed (talking rapidly without pause), and the like. In other words, the speaker's tension is determined by detecting a relative change in one speaker's voice over time, by evaluating absolute numerical data, or by detecting relative differences among the voices of multiple people.
The audio data acquisition unit 191, the audio data analysis unit 192, and the reference information generation unit 193 described above function as the audio processing apparatus according to the present invention.
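One of the criteria named above, a relative change in a single speaker's voice over time, can be sketched as follows. This is a hypothetical illustration, not the method of the embodiment: the loudness measure (RMS), the threshold `ratio`, and the names `ReferenceEntry`, `rms`, and `build_reference_info` are all assumptions. Each time range is marked as high tension when its level exceeds `ratio` times that speaker's mean level.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class ReferenceEntry:
    start_s: float       # start of the time range in seconds (stands in for the time stamp)
    speaker: str         # speaker identified for the time range
    high_tension: bool   # whether the range was judged to be high tension


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one time range (0.0 for an empty range)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def build_reference_info(
    ranges: List[Tuple[float, str, Sequence[float]]], ratio: float = 1.5
) -> List[ReferenceEntry]:
    """Associate time, speaker, and a tension flag for every time range.

    `ranges` holds (start time, speaker, samples) per time range. A range is
    flagged when its RMS level exceeds `ratio` times the speaker's mean level.
    """
    levels = [(start, speaker, rms(samples)) for start, speaker, samples in ranges]
    per_speaker: Dict[str, List[float]] = {}
    for _, speaker, level in levels:
        per_speaker.setdefault(speaker, []).append(level)
    mean = {spk: sum(vals) / len(vals) for spk, vals in per_speaker.items()}
    return [
        ReferenceEntry(start, speaker, level > ratio * mean[speaker])
        for start, speaker, level in levels
    ]
```

The other criteria mentioned above (absolute values, differences among speakers, speaking pace) would need different features and are not covered by this sketch.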

The playback screen generation unit 194 generates the speaker display playback screen when the speaker display flag has been turned on by a user operation on the operation unit 13. The speaker display flag is stored in the memory unit 16.
Specifically, the playback screen generation unit 194 generates a speaker display playback screen in which a time bar indicating the playback position is placed and in which, based on the reference information generated by the reference information generation unit 193, an identification image for identifying the speaker associated with each time (time stamp) and that speaker's tension is placed at the corresponding time on the time bar. Here, the speaker's tension (the feature portion of the audio) is assumed to indicate emotional excitement, but it may instead reflect how focused the conversation is (for example, one speaker explains while the others listen quietly). In that case, the speaker's tension may be determined by detecting the relative loudness of the detected voices of multiple people, the steadiness of the speaking pace (speaking as if patiently explaining), the speaking speed (talking rapidly without pause), and the like. In other words, the speaker's tension is determined by detecting a relative change in one speaker's voice over time, by evaluating absolute numerical data, or by detecting relative differences among the voices of multiple people.
When the speaker display flag has been turned off by a user operation on the operation unit 13, the playback screen generation unit 194 generates a normal playback screen that contains only the time bar indicating the playback position (without the identification images described above).
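Placing identification images at the corresponding times on the time bar amounts to mapping each time to a horizontal position. The following is a minimal sketch under assumptions: the time bar is a horizontal strip `bar_width_px` pixels wide, times are seconds from the start of the recording, and the names `layout_markers` and `build_playback_screen` are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[float, str, bool]   # (time in seconds, speaker, high tension?)
Marker = Tuple[int, str, bool]    # (x position in pixels, speaker, high tension?)


def layout_markers(entries: List[Entry], duration_s: float, bar_width_px: int) -> List[Marker]:
    """Map each reference entry to an x position on the time bar."""
    if duration_s <= 0 or bar_width_px < 1:
        raise ValueError("duration and bar width must be positive")
    markers = []
    for time_s, speaker, high in entries:
        clamped = min(max(time_s, 0.0), duration_s)   # keep the marker on the bar
        x = round(clamped / duration_s * (bar_width_px - 1))
        markers.append((x, speaker, high))
    return markers


def build_playback_screen(
    entries: List[Entry], duration_s: float, bar_width_px: int, speaker_display_on: bool
) -> Dict[str, object]:
    """Return a screen description: markers only when the display flag is on."""
    markers = layout_markers(entries, duration_s, bar_width_px) if speaker_display_on else []
    return {"bar_width_px": bar_width_px, "markers": markers}
```

With the flag off, the result corresponds to the normal playback screen, which carries the time bar alone.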

The display control unit 195 causes the display unit 14 to display a selection screen that lets the user select the first and second audio data to be played back, as well as the speaker display playback screen, the normal playback screen, and other screens generated by the playback screen generation unit 194.
The audio control unit 196 performs the following functions when the electronic device 1 is set to the playback mode.
In response to a playback start operation on the operation unit 13 (such as pressing a play switch), the audio control unit 196 controls the audio output unit 18 to start outputting audio based on the first and second audio data recorded in the recording unit 17. In response to a playback end operation on the operation unit 13 (such as pressing a stop switch), the audio control unit 196 causes the audio output unit 18 to stop outputting audio.

[Operation of the electronic device]
Next, the operation of the electronic device 1 described above is explained.
FIG. 2 is a flowchart showing the operation of the electronic device 1.
When the electronic device 1 is powered on by a user operation on the operation unit 13 (step S101: Yes), the device-side control unit 19 determines whether the electronic device 1 is set to the recording mode (step S102).

When it is determined that the recording mode is not set (step S102: No), the electronic device 1 proceeds to step S107.
When it is determined that the recording mode is set (step S102: Yes), the device-side control unit 19 determines whether the user has performed a recording start operation on the operation unit 13 (step S103).

When it is determined that no recording start operation has been performed (step S103: No), the electronic device 1 returns to step S101.
When it is determined that a recording start operation has been performed (step S103: Yes), the audio data acquisition unit 191 causes the first and second audio data generation units 11 and 12 to start generating (recording) the first and second audio data, and the clock unit 15 starts generating time stamps (timekeeping). The audio data acquisition unit 191 then associates the time stamps with the first and second audio data and stores them sequentially in the memory unit 16 (step S104).

Next, the device-side control unit 19 determines whether the user has performed a recording end operation on the operation unit 13 (step S105).
When it is determined that no recording end operation has been performed (step S105: No), the electronic device 1 continues recording and timekeeping.
When it is determined that a recording end operation has been performed (step S105: Yes), the audio data acquisition unit 191 causes the first and second audio data generation units 11 and 12 to stop generating the first and second audio data, and the clock unit 15 stops generating time stamps. The audio data acquisition unit 191 then records the first and second audio data (including the time stamps) stored in the memory unit 16 in the recording unit 17 (step S106). The electronic device 1 then returns to step S101.
Steps S103 to S106 described above correspond to the audio data acquisition step according to the present invention.
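Steps S103 to S106 can be sketched as a small buffer-then-commit recorder. The class and method names (`Frame`, `Recorder`, `start`, `on_frame`, `stop`) and the use of plain lists in place of the memory unit 16 and the recording unit 17 are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Frame:
    timestamp: float       # stands in for the time stamp from the clock unit
    first: List[float]     # samples of the first audio data
    second: List[float]    # samples of the second audio data


@dataclass
class Recorder:
    memory: List[Frame] = field(default_factory=list)            # temporary buffer
    recordings: List[List[Frame]] = field(default_factory=list)  # persistent store
    recording: bool = False

    def start(self) -> None:
        """Recording start operation (step S103: Yes)."""
        self.memory.clear()
        self.recording = True

    def on_frame(self, timestamp: float, first: Sequence[float], second: Sequence[float]) -> None:
        """Buffer one time-stamped frame while recording (step S104)."""
        if self.recording:
            self.memory.append(Frame(timestamp, list(first), list(second)))

    def stop(self) -> None:
        """Recording end operation: commit the buffer to the store (steps S105 and S106)."""
        if self.recording:
            self.recordings.append(list(self.memory))
            self.memory.clear()
            self.recording = False
```

Frames that arrive while no recording is in progress are ignored in this sketch, mirroring the fact that generation only starts after the recording start operation.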

When it is determined in step S102 that the recording mode is not set (step S102: No), the device-side control unit 19 determines whether the electronic device 1 is set to the playback mode (step S107).
When it is determined that the playback mode is not set (step S107: No), the electronic device 1 proceeds to step S118.
When it is determined that the playback mode is set (step S107: Yes), the display control unit 195 causes the display unit 14 to display the selection screen (step S108).
The selection screen lets the user select from among the multiple sets of first and second audio data recorded in the recording unit 17. For example, it lists the dates and times based on the time stamps associated with the respective sets of first and second audio data.

Next, the device-side control unit 19 determines whether the user has performed a selection operation on the operation unit 13 (an operation that selects one of the sets of first and second audio data on the selection screen) (step S109).
When it is determined that no selection operation has been performed (step S109: No), the electronic device 1 continues to display the selection screen.
When it is determined that a selection operation has been performed (step S109: Yes), the device-side control unit 19 determines whether the speaker display flag is on (step S110).

When it is determined that the speaker display flag is on (step S110: Yes), the electronic device 1 executes the speaker display process, which generates and displays the speaker display playback screen (step S111).
The speaker display process is described in detail later.
When it is determined that the speaker display flag is off (step S110: No), the playback screen generation unit 194 generates the normal playback screen, and the display control unit 195 causes the display unit 14 to display it (step S112).

ステップＳ１１１またはステップＳ１１２の後、機器側制御部１９は、ユーザによる操作部１３への再生開始操作があったか否かを判断する（ステップＳ１１３）。
再生開始操作がないと判断された場合（ステップＳ１１３：Ｎｏ）には、電子機器１は、ステップＳ１１７に移行する。
一方、再生開始操作があったと判断された場合（ステップＳ１１３：Ｙｅｓ）には、音声制御部１９６は、ユーザによる選択操作（ステップＳ１０９）により選択された第１，第２音声データを記録部１７から読み出す。そして、音声制御部１９６は、音声出力部１８に当該第１，第２音声データに基づく音声の出力（再生）を開始させる（ステップＳ１１４）。 After step S111 or step S112, the device-side control unit 19 determines whether or not the user has performed a reproduction start operation on the operation unit 13 (step S113).
If it is determined that there is no reproduction start operation (step S113: No), the electronic device 1 proceeds to step S117.
On the other hand, when it is determined that there has been a reproduction start operation (step S113: Yes), the audio control unit 196 reads out, from the recording unit 17, the first and second audio data selected by the user's selection operation (step S109). Then, the audio control unit 196 causes the audio output unit 18 to start outputting (reproducing) audio based on the first and second audio data (step S114).

続いて、機器側制御部１９は、ユーザによる操作部１３への再生終了操作があったか否かを判断する（ステップＳ１１５）。
再生終了操作がないと判断された場合（ステップＳ１１５：Ｎｏ）には、電子機器１は、再生を継続する。
一方、再生終了操作があったと判断された場合（ステップＳ１１５：Ｙｅｓ）には、音声制御部１９６は、音声出力部１８に音声の出力（再生）を終了させる（ステップＳ１１６）。なお、ステップＳ１１５で再生を継続した結果、第１，第２音声データを全て再生し終えた場合にも、ステップＳ１１６に移行するものである。 Subsequently, the device-side control unit 19 determines whether or not the user has performed a reproduction end operation on the operation unit 13 (step S115).
When it is determined that there is no reproduction end operation (step S115: No), the electronic device 1 continues the reproduction.
On the other hand, when it is determined that the reproduction end operation has been performed (step S115: Yes), the audio control unit 196 causes the audio output unit 18 to end the audio output (reproduction) (step S116). Note that the process also proceeds to step S116 when, as a result of continuing the reproduction in step S115, all of the first and second audio data have been reproduced.

ステップＳ１１６の後、または、ステップＳ１１３で再生開始操作がないと判断された場合（ステップＳ１１３：Ｎｏ）には、機器側制御部１９は、ユーザによる操作部１３への再生対象（第１，第２音声データ）の変更操作があったか否かを判断する（ステップＳ１１７）。
再生対象の変更操作がないと判断された場合（ステップＳ１１７：Ｎｏ）には、電子機器１は、ステップＳ１１３に戻る。
一方、再生対象の変更操作があったと判断された場合（ステップＳ１１７：Ｙｅｓ）には、電子機器１は、ステップＳ１０１に戻り、ステップＳ１０１，Ｓ１０２，Ｓ１０７を経た後、ステップＳ１０８において、再度、選択画面を表示する。 After step S116, or when it is determined in step S113 that there is no reproduction start operation (step S113: No), the device-side control unit 19 determines whether or not the user has performed, on the operation unit 13, an operation to change the reproduction target (the first and second audio data) (step S117).
When it is determined that there is no reproduction target change operation (step S117: No), the electronic device 1 returns to step S113.
On the other hand, if it is determined that there has been an operation to change the playback target (step S117: Yes), the electronic device 1 returns to step S101, goes through steps S101, S102, and S107, and then displays the selection screen again in step S108.

ステップＳ１０７で再生モードに設定されていないと判断された場合（ステップＳ１０７：Ｎｏ）には、電子機器１は、上述した処理とは異なる他の処理を実行する（ステップＳ１１８）。この後、電子機器１は、ステップＳ１０１に戻る。 If it is determined in step S107 that the playback mode is not set (step S107: No), the electronic device 1 executes another process different from the process described above (step S118). Thereafter, the electronic device 1 returns to step S101.
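
As an illustrative aid only, the playback-mode branch of the main routine (steps S107 to S118) can be written as a transition table. The names `TRANSITIONS`, `next_step`, and `trace` are hypothetical and do not appear in the specification, and treating the "No" branches of steps S109 and S115 as re-checks of the same step is an interpretation of "continues to display the selection screen" and "continues the reproduction".

```python
# Transition table for the playback-mode branch (steps S107 to S118).
# Key: (current step, answer). The answer is True/False for decision steps
# and None for processing steps that have a single successor.
TRANSITIONS = {
    ("S107", True): "S108",   # playback mode set: show the selection screen
    ("S107", False): "S118",  # other processing
    ("S108", None): "S109",
    ("S109", True): "S110",   # a recording was selected
    ("S109", False): "S109",  # keep showing the selection screen (interpretation)
    ("S110", True): "S111",   # speaker display flag on: speaker display processing
    ("S110", False): "S112",  # normal playback screen
    ("S111", None): "S113",
    ("S112", None): "S113",
    ("S113", True): "S114",   # start playback
    ("S113", False): "S117",
    ("S114", None): "S115",
    ("S115", True): "S116",   # end playback
    ("S115", False): "S115",  # keep playing (interpretation)
    ("S116", None): "S117",
    ("S117", True): "S101",   # target changed: back to the top of the main routine
    ("S117", False): "S113",
    ("S118", None): "S101",
}


def next_step(step, answer=None):
    """Return the step that follows `step` for the given decision result."""
    return TRANSITIONS[(step, answer)]


def trace(start, answers):
    """Follow the table from `start` to S101, using one answer per decision step."""
    answers = iter(answers)
    path = [start]
    step = start
    while step != "S101":
        if (step, None) in TRANSITIONS:
            step = next_step(step)
        else:
            step = next_step(step, next(answers))
        path.append(step)
    return path
```

Calling `trace("S107", [True, True, False, True, True, True])` walks the path in which a recording is selected, the normal playback screen is shown, playback is started and ended, and the playback target is then changed.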

〔話者表示処理〕
次に、上述した話者表示処理（ステップＳ１１１）について説明する。
図３は、話者表示処理（ステップＳ１１１）を示すフローチャートである。
機器側制御部１９は、話者表示処理の対象となる第１，第２音声データ（ステップＳ１０９で選択された第１，第２音声データ）の参照情報を既に生成しているか否かを判断する（ステップＳ１１１Ａ）。すなわち、機器側制御部１９は、ステップＳ１１１Ａにおいて、記録部１７に記録された当該第１，第２音声データに参照情報が関連付けられているか否かを判断している。
参照情報を生成済みであると判断された場合（ステップＳ１１１Ａ：Ｙｅｓ）には、電子機器１は、ステップＳ１１１Ｐに移行する。
一方、参照情報を未だ生成していないと判断された場合（ステップＳ１１１Ａ：Ｎｏ）には、対象物特定部１９２１は、話者表示処理の対象となる第１，第２音声データにおける一期間（例えば、５秒間）に相当するデータをそれぞれ読み出す（ステップＳ１１１Ｂ）。
以下、第１音声データにおける一期間に相当するデータを第１データ要素と記載し、第２音声データにおける一期間に相当するデータを第２データ要素と記載する。 [Speaker display processing]
Next, the speaker display process (step S111) described above will be described.
FIG. 3 is a flowchart showing the speaker display process (step S111).
The device-side control unit 19 determines whether or not reference information of the first and second audio data (the first and second audio data selected in step S109) to be subjected to speaker display processing has already been generated (step S111A). That is, in step S111A, the device-side control unit 19 determines whether or not reference information is associated with the first and second audio data recorded in the recording unit 17.
When it is determined that the reference information has been generated (step S111A: Yes), the electronic device 1 proceeds to step S111P.
On the other hand, when it is determined that the reference information has not yet been generated (step S111A: No), the object specifying unit 1921 reads out, from each of the first and second audio data to be subjected to the speaker display process, the data corresponding to one period (for example, 5 seconds) (step S111B).
Hereinafter, data corresponding to one period in the first audio data is referred to as a first data element, and data corresponding to one period in the second audio data is referred to as a second data element.
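
The period-by-period reading of step S111B can be sketched as follows. This is a minimal illustration that assumes each audio data is available as a list of samples at a known sampling rate; the function name `iter_periods` and its parameters are hypothetical.

```python
def iter_periods(first_audio, second_audio, sample_rate, period_seconds=5):
    """Yield (start_second, first_data_element, second_data_element) per period.

    The final period may be shorter than `period_seconds` when the recording
    length is not a multiple of the period.
    """
    step = sample_rate * period_seconds
    length = min(len(first_audio), len(second_audio))
    for start in range(0, length, step):
        end = min(start + step, length)
        yield (start // sample_rate,
               first_audio[start:end],
               second_audio[start:end])
```

With a 5-second period, the generator yields the elements for 0 to 5 seconds, 5 to 10 seconds, and so on, which matches the periods used in the FIG. 5 example.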

続いて、対象物特定部１９２１は、ステップＳ１１１Ｂで読み出した一期間（以下、該当期間）に相当する第１，第２データ要素を解析することで、当該第１，第２データ要素に含まれる音声を発した話者を特定する（ステップＳ１１１Ｃ）。
具体的に、対象物特定部１９２１は、該当期間に相当する第１，第２データ要素に含まれる各音声の音量を比較することで、電子機器１に対する話者の方向を特定する。また、対象物特定部１９２１は、当該第１，第２データ要素に含まれる音声の周波数に基づいて、話者の性別を特定する。母音などの発音の周波数は、女性が男性より高めであるため性別の判定に用いることができる。また、使われる言葉や内容、イントネーション等でも性別を判定することができる。男女別の話者がいる場合はこれらの音声を比較して性別を判定してもよく、特定周波数より高いか低いかで性別を判定してもよい。さらに、使われる単語やセンテンスや語尾の特徴でも性別判定が可能である。また、男女それぞれのモデル音声との類似度に基づいて性別判定してもよい。また、同様の考え方で年齢の高低も判定が可能であることは言うまでもない。登場する頻度が高い話者であれば、あらかじめ登録したデータベースとの音声照合で特定する方法もある。 Subsequently, the object specifying unit 1921 analyzes the first and second data elements corresponding to the one period read out in step S111B (hereinafter referred to as the corresponding period), thereby specifying the speaker who uttered the voice included in the first and second data elements (step S111C).
Specifically, the object specifying unit 1921 specifies the direction of the speaker relative to the electronic device 1 by comparing the volumes of the voices included in the first and second data elements corresponding to the period. In addition, the object specifying unit 1921 specifies the gender of the speaker based on the frequency of the voice included in the first and second data elements. Because the frequency of sounds such as vowels tends to be higher for women than for men, it can be used to determine gender. Gender can also be determined from the words used, the content, the intonation, and the like. When both male and female speakers are present, their voices may be compared to determine gender, or gender may be determined according to whether the frequency is higher or lower than a specific frequency. Furthermore, gender can be determined from characteristics of the words, sentences, and sentence endings used. Gender may also be determined based on the degree of similarity to model voices for each of men and women. Needless to say, whether a speaker is older or younger can be determined in the same way. For speakers who appear frequently, there is also a method of identifying them by voice matching against a database registered in advance.
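
A rough sketch of the two measurements named above (volume comparison for direction, a frequency threshold for gender) might look like the following. It is not the patented implementation: it only decides left, right, or center rather than an angle, it uses a zero-crossing count as a crude stand-in for the voice frequency, and the names, the 1.2 margin, and the 165 Hz threshold are all assumptions chosen for illustration. The first data element is taken to carry the sound from the left of the axis Ax and the second the sound from the right, as in the FIG. 5 example.

```python
import math


def rms(samples):
    """Root-mean-square level of one data element (0.0 for an empty one)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_side(first_element, second_element, margin=1.2):
    """Compare channel levels: first = left-side sound, second = right-side sound."""
    left, right = rms(first_element), rms(second_element)
    if right > left * margin:
        return "right"
    if left > right * margin:
        return "left"
    return "center"


def estimate_frequency(samples, sample_rate):
    """Crude frequency estimate: a periodic tone crosses zero twice per cycle."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    duration = len(samples) / sample_rate
    return crossings / (2 * duration) if duration else 0.0


def estimate_gender(samples, sample_rate, threshold_hz=165.0):
    """Return "female" above the threshold, "male" below it, None for silence."""
    frequency = estimate_frequency(samples, sample_rate)
    if frequency == 0:
        return None
    return "female" if frequency > threshold_hz else "male"
```

Real speech would need a proper pitch estimator and a calibrated microphone arrangement; the sketch only shows how a level comparison and a single threshold lead to the direction and gender labels used in the reference information.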

なお、ステップＳ１１１Ｃにおいて、話者の特定については、上述したような話者の方向や性別を特定する方法に限られず、以下のように話者を特定しても構わない。
複数のユーザを識別するための識別データ（ユーザ名等）と当該ユーザの声紋に関する声紋データとを関連付け、当該関連付けた情報を記録部１７に予め記録しておく。そして、対象物特定部１９２１は、記録部１７に記録された情報を参照し、第１，第２データ要素に含まれる音声の声紋に一致する声紋データを特定することで、話者（当該声紋データに関連付けられた識別データ（ユーザ名等））を特定する。 In step S111C, the speaker identification is not limited to the method for identifying the speaker direction and gender as described above, and the speaker may be identified as follows.
Identification data (such as user names) for identifying a plurality of users is associated with voiceprint data relating to each user's voiceprint, and the associated information is recorded in the recording unit 17 in advance. Then, the object specifying unit 1921 refers to the information recorded in the recording unit 17 and specifies the voiceprint data that matches the voiceprint of the voice included in the first and second data elements, thereby specifying the speaker (the identification data, such as a user name, associated with that voiceprint data).

ステップＳ１１１Ｃの後、対象物特定部１９２１は、ステップＳ１１１Ｃで話者を特定することができた（話者の方向及び性別の双方を特定することができた）か否かを判断する（ステップＳ１１１Ｄ）。
話者を特定することができなかった（話者の方向及び性別の少なくともいずれか一方を特定することができなかった）と判断された場合（ステップＳ１１１Ｄ：Ｎｏ）には、電子機器１は、ステップＳ１１１Ｇに移行する。 After step S111C, the object specifying unit 1921 determines whether or not the speaker could be specified in step S111C (both the direction and gender of the speaker could be specified) (step S111D).
If it is determined that the speaker could not be specified (at least one of the speaker direction and gender could not be specified) (step S111D: No), the electronic device 1 proceeds to step S111G.

一方、話者を特定することができた（話者の方向及び性別の双方を特定することができた）と判断された場合（ステップＳ１１１Ｄ：Ｙｅｓ）には、特徴成分判別部１９２２は、該当期間に相当する第１，第２データ要素を解析することで、当該第１，第２データ要素に含まれる音声を発した話者のテンションを話者毎に判別する（ステップＳ１１１Ｅ：音声データ解析ステップ）。
具体的に、特徴成分判別部１９２２は、該当期間に相当する第１，第２データ要素に含まれる各音声の音量に基づいて、話者のテンションを判別する。すなわち、特徴成分判別部１９２２は、話者の音声の音量が直前の期間の音量と比較して所定の第１閾値以上に大きくなった場合に話者のテンションを「ハイテンション」と判別し、その他の場合に話者のテンションを「通常」と判別する。感情によって音声の韻律的特徴が変化するが、これは声の高・低、強・弱、リズム・テンポや、基本周波数、パワー、持続時間などで分析が可能である。感情を表す音声に含まれる感情の程度と基本周波数パターンには関係があると言われており、ピッチ周波数・振幅の変化パターンなどでも分析が可能である。また、アクセントや含まれる単語、感嘆詞などを検出してもよく、これらの検出結果を合わせて、またはそのいずれかを活用して、「ハイテンション」を判定することが可能である。後述するように、笑い声やうなり声などを分析してもよい。これは声（声紋データ）のパターンマッチングなどでも判定可能である。話者のテンション（音声の特徴部分）は、喜怒哀楽のような激しい感情的な高ぶりに限る必要はなく、話の集中具合（例えば、一人の話者が説明し、それを他の人物が静かに聴くなど）を反映してもよい。この場合、検出された複数の人物の声の相対的な大きさの関係や、言葉のペースの一定度（説き聞かせるように語る）やスピード（まくしたてる）などを検出して、話者のテンションを判定してもよい。つまり、話者のテンションは、一人の話者の声の時間の経過に伴う相対的な変化を検出したり、絶対的な数値データで判定したり、複数の人物の声の相対的な差異を検出したりして判定されるものである。 On the other hand, when it is determined that the speaker could be specified (both the direction and gender of the speaker could be specified) (step S111D: Yes), the feature component determination unit 1922 analyzes the first and second data elements corresponding to the period, thereby determining, for each speaker, the tension of the speaker who uttered the voice included in the first and second data elements (step S111E: audio data analysis step).
Specifically, the feature component determination unit 1922 determines the speaker's tension based on the volume of each voice included in the first and second data elements corresponding to the period. That is, the feature component determination unit 1922 determines the speaker's tension to be “high tension” when the volume of the speaker's voice has become larger than the volume in the immediately preceding period by a predetermined first threshold or more, and determines it to be “normal” otherwise. The prosodic features of speech change with emotion, and these can be analyzed in terms of voice pitch (high or low), intensity (strong or weak), rhythm and tempo, fundamental frequency, power, duration, and the like. The degree of emotion contained in emotional speech is said to be related to the fundamental frequency pattern, so analysis based on, for example, the change patterns of pitch frequency and amplitude is also possible. In addition, accents, the words used, interjections, and the like may be detected, and “high tension” can be determined by combining these detection results or by using any one of them. As will be described later, laughter, groans, and the like may also be analyzed; this can be determined by, for example, pattern matching of the voice (voiceprint data). The speaker's tension (the characteristic portion of the speech) need not be limited to intense emotional excitement such as joy, anger, sorrow, or pleasure, and may reflect how focused the conversation is (for example, one speaker explaining while the other persons listen quietly). In this case, the speaker's tension may be determined by detecting the relative loudness of the detected voices of the plurality of persons, the steadiness of the speaking pace (speaking as if to persuade), the speed (speaking in a rapid stream), and the like. In other words, the speaker's tension is determined by detecting the relative change in one speaker's voice over time, by evaluating absolute numerical data, or by detecting relative differences among the voices of a plurality of persons.

なお、ステップＳ１１１Ｅにおいて、話者のテンションの判別については、上述した音量に基づいて判別する方法に限られず、以下のようにテンションを判別しても構わない。
例えば、特徴成分判別部１９２２は、第１，第２データ要素に含まれる音声の周波数に基づいて、話者のテンションを判別する。具体的に、特徴成分判別部１９２２は、話者の音声の周波数が直前の期間の音声の周波数と比較して所定の第２閾値以上に高くなった場合に話者のテンションを「ハイテンション」と判別し、その他の場合に話者のテンションを「通常」と判別する。「ハイテンション」は、喜怒哀楽等の話者の感情の高ぶりのみならず、話の集中具合を反映してもよい。この場合、検出された複数の人物の声の相対的な大きさの関係や、言葉のペースの一定度（説き聞かせるように語る）やスピード（まくしたてる）などを検出してテンションが上がっているという判定をしてもよい。つまり、一人の話者の声の時間の経過に伴う相対的な変化を検出したり、絶対的な数値データで判定したり、複数の人物の声の相対的な差異を検出して所定の特徴的な結果が得られた場合、「ハイテンション」と判定してもよい。
また、例えば、特徴成分判別部１９２２は、第１，第２データ要素に含まれる音声の音素成分の時間密度に基づいて、話者のテンションを判別する。具体的に、特徴成分判別部１９２２は、話者の音声の音素成分の時間密度が直前の期間の音声の音素成分の時間密度と比較して所定の第３閾値以上に大きくなった場合に話者のテンションを「ハイテンション」と判別し、その他の場合に話者のテンションを「通常」と判別する。
さらに、例えば、笑い声や怒った声等の声紋に関する声紋データを記録部１７に予め記録しておく。そして、特徴成分判別部１９２２は、記録部１７に記録された当該声紋データを参照し、第１，第２データ要素に含まれる音声に当該声紋データに基づく笑い声や怒った声等の声紋に一致する声紋があった場合に話者のテンションを「ハイテンション」と判別し、その他の場合に話者のテンションを「通常」と判別する。 Note that the determination of the speaker's tension in step S111E is not limited to the method of determining based on the volume described above, and the tension may be determined as follows.
For example, the feature component determination unit 1922 determines the speaker's tension based on the frequency of the voice included in the first and second data elements. Specifically, the feature component determination unit 1922 determines the speaker's tension to be “high tension” when the frequency of the speaker's voice has become higher than the frequency of the voice in the immediately preceding period by a predetermined second threshold or more, and determines it to be “normal” otherwise. “High tension” may reflect not only heightened emotion of the speaker, such as joy, anger, sorrow, or pleasure, but also how focused the conversation is. In this case, a rise in tension may be determined by detecting the relative loudness of the detected voices of the plurality of persons, the steadiness of the speaking pace (speaking as if to persuade), the speed (speaking in a rapid stream), and the like. In other words, “high tension” may be determined when a predetermined characteristic result is obtained by detecting the relative change in one speaker's voice over time, by evaluating absolute numerical data, or by detecting relative differences among the voices of a plurality of persons.
Further, for example, the feature component determination unit 1922 determines the speaker's tension based on the time density of the phoneme components of the voice included in the first and second data elements. Specifically, the feature component determination unit 1922 determines the speaker's tension to be “high tension” when the time density of the phoneme components of the speaker's voice has become larger than that of the voice in the immediately preceding period by a predetermined third threshold or more, and determines it to be “normal” otherwise.
Furthermore, for example, voiceprint data relating to voiceprints of laughter, angry voices, and the like is recorded in the recording unit 17 in advance. Then, the feature component determination unit 1922 refers to the voiceprint data recorded in the recording unit 17 and determines the speaker's tension to be “high tension” when the voice included in the first and second data elements contains a voiceprint that matches the voiceprint of laughter, an angry voice, or the like based on that voiceprint data, and determines it to be “normal” otherwise.
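
The threshold comparisons described for step S111E (volume against the first threshold, frequency against the second, phoneme time density against the third) share one shape, which the following sketch captures. It assumes that "larger by a threshold or more" means the difference between the current and the preceding value; a ratio would be an equally plausible reading. The function names and the dictionary keys are hypothetical.

```python
def judge_tension(current, previous, threshold):
    """Return "high tension" if the feature rose by at least `threshold`.

    `previous` is None when there is no value for the immediately preceding
    period (the first period, or a speaker not identified in that period);
    the result is then "normal".
    """
    if previous is not None and current - previous >= threshold:
        return "high tension"
    return "normal"


def judge_tension_any(current, previous, thresholds):
    """Check several features; a sufficient rise in any one gives high tension.

    `current` and `previous` map feature names to values, and `thresholds`
    maps the same names to their thresholds. `previous` may be None.
    """
    for name, threshold in thresholds.items():
        prev_value = None if previous is None else previous.get(name)
        if judge_tension(current[name], prev_value, threshold) == "high tension":
            return "high tension"
    return "normal"
```

For example, `judge_tension(70.0, 60.0, 6.0)` gives "high tension", while the same call with `previous=None` gives "normal", which corresponds to the handling of a period with no immediately preceding period.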

ステップＳ１１１Ｅの後、参照情報生成部１９３は、該当期間の参照情報として、ステップＳ１１１Ｃで特定された話者（方向及び性別）と、ステップＳ１１１Ｅで判別された話者のテンションと、該当期間に相当するタイムスタンプ（時計部１５にて生成）等を関連付けた参照情報（後述する「複数話者期間」フラグ及び「ざわざわ期間」フラグはオフ状態）を生成する（ステップＳ１１１Ｆ：参照情報生成ステップ）。そして、参照情報生成部１９３は、生成した参照情報をメモリ部１６に記憶する。この後、電子機器１は、ステップＳ１１１Ｊに移行する。 After step S111E, the reference information generation unit 193 generates, as the reference information for the corresponding period, reference information that associates the speaker (direction and gender) specified in step S111C, the speaker's tension determined in step S111E, a time stamp corresponding to the period (generated by the clock unit 15), and the like, with the “multi-speaker period” flag and the “noisy period” flag described later both in the off state (step S111F: reference information generation step). Then, the reference information generation unit 193 stores the generated reference information in the memory unit 16. Thereafter, the electronic device 1 proceeds to step S111J.

ステップＳ１１１Ｄで話者を特定することができなかった（話者の方向及び性別の少なくともいずれか一方を特定することができなかった）と判断した場合（ステップＳ１１１Ｄ：Ｎｏ）には、対象物特定部１９２１は、特定することができた話者の方向または性別に基づいて、話者が複数であるか否かを判断する（ステップＳ１１１Ｇ）。
話者が複数であると判断された場合（ステップＳ１１１Ｇ：Ｙｅｓ）には、参照情報生成部１９３は、該当期間の参照情報として、ステップＳ１１１Ｃで特定することができた話者の方向または性別と、該当期間に相当するタイムスタンプ等を関連付けるとともに、「複数話者期間」フラグをオン状態とした参照情報を生成する（ステップＳ１１１Ｈ）。そして、参照情報生成部１９３は、生成した参照情報をメモリ部１６に記憶する。この後、電子機器１は、ステップＳ１１１Ｊに移行する。
ここで、「複数話者期間」フラグ（オン状態）は、該当期間の話者を特定することができていないこと、及び該当期間の話者が複数であることを示すフラグである。 If it is determined in step S111D that the speaker could not be specified (at least one of the speaker direction and gender could not be specified) (step S111D: No), the object specifying unit 1921 determines whether or not there are a plurality of speakers based on the speaker direction or gender that could be specified (step S111G).
When it is determined that there are a plurality of speakers (step S111G: Yes), the reference information generation unit 193 generates, as the reference information for the corresponding period, reference information that associates the speaker direction or gender that could be specified in step S111C with a time stamp corresponding to the period and the like, and in which the “multi-speaker period” flag is turned on (step S111H). Then, the reference information generation unit 193 stores the generated reference information in the memory unit 16. Thereafter, the electronic device 1 proceeds to step S111J.
Here, the “multi-speaker period” flag (ON state) is a flag indicating that the speaker in the corresponding period has not been specified and that there are a plurality of speakers in the corresponding period.

一方、話者が複数ではないと判断された場合（ステップＳ１１１Ｇ：Ｎｏ）には、参照情報生成部１９３は、該当期間の参照情報として、ステップＳ１１１Ｃで特定することができた話者の方向または性別と、該当期間に相当するタイムスタンプ等を関連付けるとともに、「ざわざわ期間」フラグをオン状態とした参照情報を生成する（ステップＳ１１１Ｉ）。そして、参照情報生成部１９３は、生成した参照情報をメモリ部１６に記憶する。この後、電子機器１は、ステップＳ１１１Ｊに移行する。
ここで、「ざわざわ期間」フラグは、該当期間の話者を特定することができていないこと、及び該当期間の話者が複数でないことを示すフラグである。
なお、ステップＳ１１１Ｇで話者が複数ではないと判断された場合（ステップＳ１１１Ｇ：Ｎｏ）とは、話者が一人であると判断された場合の他、ステップＳ１１１Ｃで話者の方向及び性別の双方を特定することができず、話者が複数であるか、または、一人であるかの判断が全くできない場合も含むものである。 On the other hand, when it is determined that there are not a plurality of speakers (step S111G: No), the reference information generation unit 193 generates, as the reference information for the corresponding period, reference information that associates the speaker direction or gender that could be specified in step S111C with a time stamp corresponding to the period and the like, and in which the “noisy period” flag is turned on (step S111I). Then, the reference information generation unit 193 stores the generated reference information in the memory unit 16. Thereafter, the electronic device 1 proceeds to step S111J.
Here, the “noisy period” flag is a flag indicating that the speaker in the corresponding period has not been specified and that there are not a plurality of speakers in the corresponding period.
Note that the case where it is determined in step S111G that there are not a plurality of speakers (step S111G: No) includes not only the case where it is determined that there is one speaker, but also the case where neither the direction nor the gender of the speaker could be specified in step S111C, so that it cannot be determined at all whether there are a plurality of speakers or only one.
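
The branching of steps S111D to S111I can be summarized in a short sketch. The `Speaker` and `ReferenceInfo` classes and the function names are hypothetical, and counting the partially specified speaker entries is only one possible way to make the step S111G decision, which the specification bases on the direction or gender that could be specified.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Speaker:
    direction: Optional[str] = None  # e.g. "right 120"; None if not specified
    gender: Optional[str] = None     # "male" or "female"; None if not specified
    tension: Optional[str] = None    # set only for fully specified speakers


@dataclass
class ReferenceInfo:
    timestamp: str
    speakers: List[Speaker] = field(default_factory=list)
    multi_speaker_period: bool = False
    noisy_period: bool = False


def is_specified(speaker):
    """A speaker counts as specified only when direction and gender are both known."""
    return speaker.direction is not None and speaker.gender is not None


def build_reference_info(timestamp, speakers):
    """Build the reference information for one period (steps S111D to S111I)."""
    info = ReferenceInfo(timestamp=timestamp, speakers=list(speakers))
    if speakers and all(is_specified(s) for s in speakers):
        return info                       # S111F: both flags stay off
    if len(speakers) > 1:
        info.multi_speaker_period = True  # S111H: unspecified, plural speakers
    else:
        info.noisy_period = True          # S111I: one speaker, or nothing known
    return info
```

A period in which only the genders of two speakers are known therefore gets the “multi-speaker period” flag, and a period in which nothing could be specified gets the “noisy period” flag.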

ステップＳ１１１Ｆ、ステップＳ１１１Ｈ、またはステップＳ１１１Ｉの後、機器側制御部１９は、話者表示処理の対象となる第１，第２音声データにおける全ての期間で参照情報を生成したか否かを判断する（ステップＳ１１１Ｊ）。
全ての期間で参照情報を生成していないと判断された場合（ステップＳ１１１Ｊ：Ｎｏ）には、電子機器１は、ステップＳ１１１Ｂに戻り、第１，第２音声データにおける他の期間に相当する第１，第２データ要素を読み出し、当該他の期間の参照情報を生成する。 After step S111F, step S111H, or step S111I, the device-side control unit 19 determines whether or not reference information has been generated in all periods in the first and second audio data to be subjected to speaker display processing. (Step S111J).
If it is determined that the reference information has not been generated for all periods (step S111J: No), the electronic device 1 returns to step S111B, reads out the first and second data elements corresponding to another period in the first and second audio data, and generates reference information for that other period.

一方、全ての期間で参照情報を生成したと判断された場合（ステップＳ１１１Ｊ：Ｙｅｓ）には、対象物特定部１９２１は、以下の処理を実行する（ステップＳ１１１Ｋ）。
対象物特定部１９２１は、ステップＳ１１１Ｋにおいて、メモリ部１６に記憶された各期間の参照情報のうち、「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報（ステップＳ１１１ＨまたはステップＳ１１１Ｉで生成された参照情報）があるか否かを判断する。
「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報がないと判断された場合（ステップＳ１１１Ｋ：Ｎｏ）には、電子機器１は、ステップＳ１１１Ｏに移行する。 On the other hand, when it is determined that the reference information has been generated in all periods (step S111J: Yes), the object specifying unit 1921 executes the following process (step S111K).
In step S111K, the object specifying unit 1921 determines whether or not, among the reference information for each period stored in the memory unit 16, there is reference information in which the “multi-speaker period” flag or the “noisy period” flag is on (reference information generated in step S111H or step S111I).
When it is determined that there is no reference information in which the “multi-speaker period” flag or the “noisy period” flag is on (step S111K: No), the electronic device 1 proceeds to step S111O.

一方、「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報があると判断した場合（ステップＳ１１１Ｋ：Ｙｅｓ）には、対象物特定部１９２１は、以下の処理を実行する（ステップＳ１１１Ｌ）。
対象物特定部１９２１は、ステップＳ１１１Ｌにおいて、メモリ部１６に記憶された各期間の参照情報のうち、当該参照情報の直前の期間の参照情報の「複数話者期間」フラグ及び「ざわざわ期間」フラグがオフ状態となっているか否かを判断する。すなわち、対象物特定部１９２１は、当該参照情報の直前の期間で話者が特定されている（話者の方向及び性別の双方を特定することができている）か否かを判断している。
直前の期間で話者が特定されていないと判断された場合（ステップＳ１１１Ｌ：Ｎｏ）には、電子機器１は、ステップＳ１１１Ｏに移行する。 On the other hand, if it is determined that there is reference information in which the “multi-speaker period” flag or the “noisy period” flag is on (step S111K: Yes), the object specifying unit 1921 executes the following processing (step S111L).
In step S111L, the object specifying unit 1921 determines whether or not, among the reference information for each period stored in the memory unit 16, the “multi-speaker period” flag and the “noisy period” flag of the reference information for the period immediately preceding that reference information are both in the off state. That is, the object specifying unit 1921 determines whether or not the speaker was specified in the period immediately preceding that reference information (both the direction and gender of the speaker could be specified).
If it is determined that the speaker has not been specified in the immediately preceding period (step S111L: No), the electronic device 1 proceeds to step S111O.

一方、直前の期間で話者が特定されていると判断した場合（ステップＳ１１１Ｌ：Ｙｅｓ）には、対象物特定部１９２１は、「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報の話者を、当該直前の期間で特定された話者（話者の方向及び性別）と推定する（ステップＳ１１１Ｍ）。
続いて、参照情報生成部１９３は、「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報の話者をステップＳ１１１Ｍで推定された話者とし、当該参照情報を更新する（ステップＳ１１１Ｎ）。 On the other hand, when it is determined that the speaker was specified in the immediately preceding period (step S111L: Yes), the object specifying unit 1921 estimates that the speaker of the reference information in which the “multi-speaker period” flag or the “noisy period” flag is on is the speaker (speaker direction and gender) specified in the immediately preceding period (step S111M).
Subsequently, the reference information generation unit 193 sets the speaker of the reference information in which the “multi-speaker period” flag or the “noisy period” flag is on to the speaker estimated in step S111M, and updates that reference information (step S111N).
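
Steps S111K to S111N can be sketched as a single pass over the per-period reference information. The dictionary layout, the function name `fill_from_previous`, and the rule of matching a partially specified speaker to a preceding-period speaker with the same known gender or direction are assumptions made for this illustration. The flags are left on after the update, so an estimated period is not itself used as a basis for the next period.

```python
def fill_from_previous(records):
    """Estimate speakers of flagged periods from the immediately preceding period.

    `records` is a time-ordered list of dicts with boolean "multi" and "noisy"
    flags and a "speakers" list of dicts holding "direction" and "gender"
    (None where not specified). The list is updated in place and returned.
    """
    for prev, cur in zip(records, records[1:]):
        flagged = cur["multi"] or cur["noisy"]                  # S111K
        prev_specified = not (prev["multi"] or prev["noisy"])   # S111L
        if not (flagged and prev_specified):
            continue
        for speaker in cur["speakers"]:                         # S111M
            for candidate in prev["speakers"]:
                gender_ok = speaker["gender"] in (None, candidate["gender"])
                direction_ok = speaker["direction"] in (None, candidate["direction"])
                if gender_ok and direction_ok:
                    # S111N: adopt the preceding period's direction and gender.
                    speaker["direction"] = candidate["direction"]
                    speaker["gender"] = candidate["gender"]
                    break
    return records
```

Applied to a period in which only "male" and "female" were specified, preceded by a period with a male at "right 120" and a female at "right 90", the pass fills in those two directions.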

ステップＳ１１１Ｎの後、ステップＳ１１１Ｋで「複数話者期間」フラグまたは「ざわざわ期間」フラグがオン状態となっている参照情報がないと判断された場合（ステップＳ１１１Ｋ：Ｎｏ）、または、ステップＳ１１１Ｌで直前の期間で話者が特定されていないと判断された場合（ステップＳ１１１Ｌ：Ｎｏ）には、参照情報生成部１９３は、以下の処理を実行する（ステップＳ１１１Ｏ）。
参照情報生成部１９３は、ステップＳ１１１Ｏにおいて、メモリ部１６に記憶され、ステップＳ１１１Ｆ，Ｓ１１１Ｈ，Ｓ１１１Ｉで生成された各期間の参照情報（ステップＳ１１１Ｎで更新された場合には更新後の参照情報）を、話者表示処理の対象とした第１，第２音声データに関連付けて、記録部１７に記録する。 After step S111N, if it is determined in step S111K that there is no reference information in which the “multi-speaker period” flag or the “noisy period” flag is on (step S111K: No), or immediately before in step S111L When it is determined that the speaker has not been identified during the period (step S111L: No), the reference information generation unit 193 executes the following processing (step S111O).
In step S111O, the reference information generation unit 193 records, in the recording unit 17, the reference information for each period that is stored in the memory unit 16 and was generated in steps S111F, S111H, and S111I (the updated reference information if it was updated in step S111N), in association with the first and second audio data subjected to the speaker display processing.

ステップＳ１１１Ｏの後、または、ステップＳ１１１Ａで参照情報を生成済みであると判断された場合（ステップＳ１１１Ａ：Ｙｅｓ）には、再生画面生成部１９４は、以下の処理を実行する（ステップＳ１１１Ｐ）。
再生画面生成部１９４は、ステップＳ１１１Ｐにおいて、記録部１７に記録された第１，第２音声データのうち、話者表示処理の対象となる第１，第２音声データに関連付けられた参照情報に基づいて、話者表示再生画面を生成する。
続いて、表示制御部１９５は、ステップＳ１１１Ｐで生成された話者表示再生画面を表示部１４に表示させる（ステップＳ１１１Ｑ）。この後、電子機器１は、図２に示したメインルーチンに戻る。 After step S111O or when it is determined in step S111A that reference information has been generated (step S111A: Yes), the playback screen generation unit 194 executes the following processing (step S111P).
In step S111P, the reproduction screen generation unit 194 uses the reference information associated with the first and second audio data to be subjected to the speaker display process among the first and second audio data recorded in the recording unit 17. Based on this, a speaker display reproduction screen is generated.
Subsequently, the display control unit 195 causes the display unit 14 to display the speaker display reproduction screen generated in step S111P (step S111Q). Thereafter, the electronic device 1 returns to the main routine shown in FIG. 2.

〔参照情報の具体例〕
次に、上述した話者表示処理（ステップＳ１１１）で生成される参照情報の具体例について説明する。
図４は、話者表示処理（ステップＳ１１１）の対象となる第１，第２音声データが生成（録音）される状況の一例を示す図である。図５は、図４の状況で生成された第１，第２音声データを対象として話者表示処理（ステップＳ１１１）を実行した場合に生成される参照情報の一例を示す図である。
具体的に、図４では、男性Ｍと女性Ｌ１，Ｌ２の３人がテーブルを囲んで打合せをし、当該打合せをテーブルの上に置いた電子機器１にて録音している状況を示している。ここで、電子機器１の上端から当該電子機器１の中心線を延長させた軸Ａｘを基準とした場合に、男性Ｍは、軸Ａｘに対して「右（電子機器１を正面から見て（図４中、上側から見て）右に１２０°」の方向に座っているものとする。また、女性Ｌ１は、軸Ａｘに対して「右に９０°」の方向に座っているものとする。さらに、女性Ｌ２は、軸Ａｘに対して「左に１０°」の方向に座っているものとする。
また、図５では、ステップＳ１１１Ｂで第１，第２データ要素を読み出す一期間を５秒間としている。このため、以下では、「０〜５秒」、「５〜１０秒」、「１０〜１５秒」、「１５〜２０秒」、「２０〜２５秒」の各期間について順に説明する。 [Specific examples of reference information]
Next, a specific example of the reference information generated in the speaker display process (step S111) described above will be described.
FIG. 4 is a diagram illustrating an example of a situation where the first and second audio data to be subjected to speaker display processing (step S111) are generated (recorded). FIG. 5 is a diagram illustrating an example of reference information generated when the speaker display process (step S111) is executed on the first and second audio data generated in the situation of FIG.
Specifically, FIG. 4 shows a situation in which three persons, a male M and females L1 and L2, are holding a meeting around a table, and the meeting is being recorded by the electronic device 1 placed on the table. Here, with reference to an axis Ax obtained by extending the center line of the electronic device 1 from its upper end, the male M is assumed to be sitting in the direction “120° to the right” of the axis Ax (right when the electronic device 1 is viewed from the front, that is, from the upper side in FIG. 4). The female L1 is assumed to be sitting in the direction “90° to the right” of the axis Ax. Further, the female L2 is assumed to be sitting in the direction “10° to the left” of the axis Ax.
In FIG. 5, one period for reading the first and second data elements in step S111B is set to 5 seconds. Therefore, in the following, each period of “0 to 5 seconds”, “5 to 10 seconds”, “10 to 15 seconds”, “15 to 20 seconds”, and “20 to 25 seconds” will be described in order.

〔０〜５秒の期間〕
この期間では、男性Ｍのみが声を発したものである。すなわち、当該期間では、第２データ要素に含まれる音声（軸Ａｘに対して右側からの音声）の音量は、第１データ要素に含まれる音声（軸Ａｘに対して左側からの音声）の音量よりも大きくなっている。また、男性Ｍの声であるため、当該音声は、比較的に低い周波数となっている。このため、ステップＳ１１１Ｃでは、当該期間の第１，第２データ要素に含まれる各音声の音量のバランスにより、話者が「右に１２０°」の方向であると特定される。また、当該第１，第２データ要素に含まれる音声が比較的に低い周波数であるため、話者が「男性」であると特定される。 [0-5 seconds duration]
During this period, only the male M spoke. That is, in this period, the volume of the voice included in the second data element (the voice from the right side of the axis Ax) is larger than the volume of the voice included in the first data element (the voice from the left side of the axis Ax). In addition, since it is the voice of the male M, the voice has a relatively low frequency. For this reason, in step S111C, the speaker is specified to be in the direction of “120° to the right” from the balance of the volumes of the voices included in the first and second data elements in the period. Further, since the voice included in the first and second data elements has a relatively low frequency, the speaker is specified as “male”.

また、当該期間は、最初の期間であり、直前の期間がない。このため、ステップＳ１１１Ｅでは、話者のテンションが「通常」と判別される。
そして、ステップＳ１１１Ｆでは、当該期間の参照情報として、図５に示すように、特定された話者（「右に１２０°」の方向の「男性」）と、判別された話者のテンション（「通常」）と、当該期間に相当するタイムスタンプ（「9/15 11:21:10」）と、声の数（「１」）とが関連付けられた参照情報が生成される。 In addition, this period is the first period and there is no immediately preceding period. For this reason, in step S111E, it is determined that the speaker's tension is “normal”.
In step S111F, as the reference information for the period, as shown in FIG. 5, reference information is generated in which the specified speaker (“male” in the direction of “120° to the right”), the determined tension of the speaker (“normal”), the time stamp corresponding to the period (“9/15 11:21:10”), and the number of voices (“1”) are associated with one another.

〔５〜１０秒の期間〕
この期間では、男性Ｍ及び女性Ｌ１がそれぞれ声を発したものである。そして、ステップＳ１１１Ｃでは、当該期間の第１，第２データ要素に含まれる各音声の音量のバランス及び音声の周波数（男性の声は周波数が比較的に低く、女性の声は周波数が比較的に高い）により、一人目の話者が「右に１２０°」の方向の「男性」であり、二人目の話者が「右に９０°」の方向の「女性」であると特定される。
また、当該期間では、男性Ｍが当該期間の直前の「０〜５秒」の期間よりも大きな声を発している。このため、ステップＳ１１１Ｅでは、一人目の話者（「右に１２０°」の方向の「男性」）の音声の音量が直前の期間での当該話者の音声の音量と比較して第１閾値以上になったことが認識され、当該話者のテンションが「ハイテンション」と判別される。また、二人目の話者（「右に９０°」の方向の「女性」）については、直前の「０〜５秒」の期間では当該話者が特定されていないため、ステップＳ１１１Ｅでは、当該話者のテンションが「通常」と判別される。 [5-10 seconds period]
During this period, male M and female L1 each uttered a voice. In step S111C, the balance of the volume of each voice included in the first and second data elements in the period and the frequency of the voice (male voice has a relatively low frequency and female voice has a relatively low frequency. High) identifies the first speaker as “male” in the direction of “120 ° to the right” and the second speaker as “female” in the direction of “90 ° to the right”.
In this period, the male M speaks more loudly than in the immediately preceding “0 to 5 seconds” period. Therefore, in step S111E, it is recognized that the volume of the voice of the first speaker (“male” in the direction of “120° to the right”) has become larger than the volume of that speaker's voice in the immediately preceding period by the first threshold or more, and the tension of the speaker is determined as “high tension”. For the second speaker (“female” in the direction of “90° to the right”), since the speaker was not specified in the immediately preceding “0 to 5 seconds” period, the speaker's tension is determined as “normal” in step S111E.

そして、ステップＳ１１１Ｆでは、当該期間の参照情報として、図５に示すように、特定された一人目の話者（「右に１２０°」の方向の「男性」）及び判別された当該話者のテンション（「ハイテンション」）と、特定された二人目の話者（「右に９０°」の方向の「女性」）及び判別された当該話者のテンション（「通常」）と、当該期間に相当するタイムスタンプ（「9/15 11:21:15」）と、声の数（「２」）とが関連付けられた参照情報が生成される。 In step S111F, as the reference information of the period, as shown in FIG. 5, the identified first speaker (“male” in the direction of “120 ° to the right”) and the identified speaker are identified. The tension (“high tension”), the identified second speaker (“female” in the direction of “90 ° to the right”), the determined tension of the speaker (“normal”), and the period Reference information in which a corresponding time stamp (“9/15 11:21:15”) is associated with the number of voices (“2”) is generated.
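The tension determination of step S111E, as walked through for the first two periods, can be sketched roughly as follows. This is only an illustration: the text does not say whether the comparison against the first threshold is a difference or a ratio, so a simple difference is assumed here, and the function name, volume values, and threshold value are all hypothetical.

```python
def determine_tension(current_volume, previous_volume, first_threshold):
    """Return "high tension" or "normal" for one speaker in one period.

    previous_volume is None when the speaker was not identified in the
    immediately preceding period (or when there is no preceding period).
    """
    if previous_volume is None:
        return "normal"
    # Assumption: the comparison is a plain difference of volume levels.
    if current_volume - previous_volume >= first_threshold:
        return "high tension"
    return "normal"


# Hypothetical volume levels (arbitrary units) and threshold.
FIRST_THRESHOLD = 10.0
male_0_5, male_5_10, female_5_10 = 50.0, 62.0, 48.0

print(determine_tension(male_0_5, None, FIRST_THRESHOLD))        # first period
print(determine_tension(male_5_10, male_0_5, FIRST_THRESHOLD))   # louder than before
print(determine_tension(female_5_10, None, FIRST_THRESHOLD))     # newly identified
```

Passing `None` for the previous volume covers both cases described in the text in which the result is “normal” by default: the first period, and a speaker who was not identified in the immediately preceding period.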

[10 to 15 second period]
In this period, the male M and the female L1 each speak. The example in FIG. 5 illustrates a case in which step S111C was able to identify the first speaker as “male” and the second speaker as “female” but could not identify the direction of either speaker. That is, the speakers have not been identified (identification requires both the direction and the gender of a speaker), but it has been established that there are multiple (two) speakers (step S111G: Yes). In step S111H, reference information for the period is therefore generated that associates the identified first speaker (“male”) and second speaker (“female”), the time stamp corresponding to the period (“9/15 11:21:20”), and the number of voices (“2”), with the “multiple-speaker period” flag turned on.

Here, the speakers were identified in the immediately preceding “5 to 10 second” period. In step S111M, the identified first speaker (“male”) is therefore estimated to be the speaker of the same gender identified in the immediately preceding period (the “male” in the “120° to the right” direction). Likewise, the identified second speaker (“female”) is estimated to be the speaker of the same gender identified in the immediately preceding period (the “female” in the “90° to the right” direction).
In step S111N, the reference information generated in step S111H is updated, as shown in FIG. 5, to reference information that associates the first speaker (the “male” in the “120° to the right” direction) and that speaker's tension (“normal”), the second speaker (the “female” in the “90° to the right” direction) and that speaker's tension (“normal”), the time stamp (“9/15 11:21:20”), and the number of voices (“2”), with the “multiple-speaker period” flag turned on. When the “multiple-speaker period” flag is on, the speaker's tension is set to “normal” when the reference information is updated. The same applies when the “noisy period” flag is on.
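The estimation in steps S111M and S111N can be sketched as a same-gender lookup against the immediately preceding period. The dictionary layout and the function name below are hypothetical, and the sketch assumes that when the preceding period contains several speakers of the same gender, the first unused one is taken; the text does not say how such ties are resolved.

```python
def estimate_from_previous(current, previous):
    """Fill in unknown directions from same-gender speakers of the previous period.

    current / previous: lists of dicts with "gender" and "direction"
    (direction is None when it could not be identified).
    Tension is set to "normal" for every entry, as described for periods
    whose "multiple-speaker period" flag is on.
    """
    candidates = [p for p in previous if p["direction"] is not None]
    updated = []
    for speaker in current:
        direction = speaker["direction"]
        if direction is None:
            match = next(
                (p for p in candidates if p["gender"] == speaker["gender"]), None
            )
            if match is not None:
                candidates.remove(match)  # each previous speaker is used once
                direction = match["direction"]
        updated.append(
            {"gender": speaker["gender"], "direction": direction, "tension": "normal"}
        )
    return updated


previous_period = [
    {"gender": "male", "direction": "120° to the right"},
    {"gender": "female", "direction": "90° to the right"},
]
current_period = [
    {"gender": "male", "direction": None},
    {"gender": "female", "direction": None},
]
for entry in estimate_from_previous(current_period, previous_period):
    print(entry)
```

If no same-gender speaker exists in the preceding period, the direction simply stays unknown in this sketch.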

[15 to 20 second period]
In this period, the female L1 and the female L2 each speak. In step S111C, based on the balance of the volumes of the voices contained in the first and second data elements of the period and on the frequencies of the voices (the difference between the voice frequencies of the females L1 and L2), the first speaker is identified as the “female” in the “90° to the right” direction and the second speaker as the “female” in the “10° to the left” direction.
Because the speakers were not identified in the immediately preceding “10 to 15 second” period, step S111E determines the tension of the first speaker (the “female” in the “90° to the right” direction) and of the second speaker (the “female” in the “10° to the left” direction) each to be “normal”.

In step S111F, reference information for the period is generated, as shown in FIG. 5, that associates the identified first speaker (the “female” in the “90° to the right” direction) and that speaker's determined tension (“normal”), the identified second speaker (the “female” in the “10° to the left” direction) and that speaker's determined tension (“normal”), the time stamp corresponding to the period (“9/15 11:21:25”), and the number of voices (“2”).

[20 to 25 second period]
In this period, only the female L2 speaks. In step S111C, based on the balance of the volumes of the voices contained in the first and second data elements of the period and on the frequency of the voice (a female voice has a relatively high frequency), the speaker is identified as the “female” in the “10° to the left” direction.
In this period, the female L2 also speaks more loudly than in the immediately preceding “15 to 20 second” period. In step S111E, it is therefore recognized that the voice volume of the speaker (the “female” in the “10° to the left” direction) has risen by the first threshold or more relative to that speaker's voice volume in the immediately preceding period, and the speaker's tension is determined to be “high tension”.

In step S111F, reference information for the period is generated, as shown in FIG. 5, that associates the identified speaker (the “female” in the “10° to the left” direction), the determined tension of that speaker (“high tension”), the time stamp corresponding to the period (“9/15 11:21:30”), and the number of voices (“1”).
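For reference, the five records walked through above can be written out as plain data. The values are the ones stated in the text for FIG. 5 (with the 10 to 15 second record in its state after the update in step S111N); the class and field names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class SpeakerEntry:
    direction: str
    gender: str
    tension: str


@dataclass
class ReferenceInfo:
    timestamp: str
    voice_count: int
    speakers: list
    multiple_speaker_period: bool = False


MALE = ("120° to the right", "male")
FEMALE_1 = ("90° to the right", "female")
FEMALE_2 = ("10° to the left", "female")

REFERENCE_INFO = [
    ReferenceInfo("9/15 11:21:10", 1, [SpeakerEntry(*MALE, "normal")]),
    ReferenceInfo("9/15 11:21:15", 2, [SpeakerEntry(*MALE, "high tension"),
                                       SpeakerEntry(*FEMALE_1, "normal")]),
    ReferenceInfo("9/15 11:21:20", 2, [SpeakerEntry(*MALE, "normal"),
                                       SpeakerEntry(*FEMALE_1, "normal")],
                  multiple_speaker_period=True),
    ReferenceInfo("9/15 11:21:25", 2, [SpeakerEntry(*FEMALE_1, "normal"),
                                       SpeakerEntry(*FEMALE_2, "normal")]),
    ReferenceInfo("9/15 11:21:30", 1, [SpeakerEntry(*FEMALE_2, "high tension")]),
]

# List the periods in which some speaker was judged to be in high tension.
high_tension = [
    (record.timestamp, entry.gender, entry.direction)
    for record in REFERENCE_INFO
    for entry in record.speakers
    if entry.tension == "high tension"
]
print(high_tension)
```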

[Specific example of the speaker display reproduction screen]
Next, a specific example of the speaker display reproduction screen generated by the speaker display process described above (step S111) is described.
FIG. 6 shows an example of the speaker display reproduction screen W100 generated from the reference information shown in FIG. 5.
If the reference information recorded in the recording unit 17 in step S111O is the reference information shown in FIG. 5, the speaker display reproduction screen W100 shown in FIG. 6 is generated in step S111P.
As shown in FIG. 6, the speaker display reproduction screen W100 is a screen on which a time bar TB and first to third identification images I1 to I3 are arranged.

As shown in FIG. 6, the time bar TB includes a time scale SC corresponding to the time from the start to the end of the voice recording, and a slider SL that is provided on the time scale SC and indicates the reproduction position corresponding in time to the time stamp of the audio data being reproduced (steps S113 to S116).
The first to third identification images I1 to I3 are identification images for identifying each speaker and that speaker's tension. The reference information shown in FIG. 5 contains three speakers (the “male” in the “120° to the right” direction, the “female” in the “90° to the right” direction, and the “female” in the “10° to the left” direction), so three identification images, the first to third identification images I1 to I3, are arranged on the speaker display reproduction screen W100.

The first identification image I1 corresponds to the first speaker, the “male” in the “120° to the right” direction. In the reference information shown in FIG. 5, this speaker is identified continuously over the “0 to 5 second”, “5 to 10 second”, and “10 to 15 second” periods. The first identification image I1 is therefore arranged so as to extend along the time bar TB over the corresponding “0 to 15 second” span of the time bar TB.
In the reference information shown in FIG. 5, this speaker is identified as “male”. As shown in FIG. 6, a male image MF identifying the speaker as “male” is therefore added to the first identification image I1.
In the reference information shown in FIG. 5, this speaker's tension is determined to be “high tension” in the “5 to 10 second” period. As shown in FIG. 6, the first identification image I1 is therefore wider in that period than in the other periods. That is, the width of the first identification image I1 indicates how high the speaker's tension is. The same applies to the other identification images. The width of the first identification image I1 may be varied continuously (in an analog manner) or in steps according to the level of the speaker's tension. The width may also be limited so that the image does not overlap the adjacent time bar TB. Alternatively, provided the appearance is not impaired, the width may be allowed to overlap the adjacent time bar TB, which naturally gives a greater sense of presence. The width of the first identification image I1 may also be kept constant, with the level of tension represented by the size of the added male image MF.

The second identification image I2 corresponds to the second speaker, the “female” in the “90° to the right” direction. In the reference information shown in FIG. 5, this speaker is identified continuously over the “5 to 10 second”, “10 to 15 second”, and “15 to 20 second” periods. The second identification image I2 is therefore arranged so as to extend along the time bar TB over the corresponding “5 to 20 second” span of the time bar TB.
In the reference information shown in FIG. 5, this speaker is identified as “female”. As shown in FIG. 6, a female image LF1 identifying the speaker as “female” is therefore added to the second identification image I2.
In the reference information shown in FIG. 5, this speaker's tension is determined to be “normal” in every period. As shown in FIG. 6, the second identification image I2 therefore has the same width in all periods.

The third identification image I3 corresponds to the third speaker, the “female” in the “10° to the left” direction. In the reference information shown in FIG. 5, this speaker is identified continuously over the “15 to 20 second” and “20 to 25 second” periods. The third identification image I3 is therefore arranged so as to extend along the time bar TB over the corresponding “15 to 25 second” span of the time bar TB.
In the reference information shown in FIG. 5, this speaker is identified as a “female” different from the second speaker (the “female” in the “90° to the right” direction). As shown in FIG. 6, a female image LF2, which identifies the speaker as “female” and differs from the female image LF1, is therefore added to the third identification image I3.
In the reference information shown in FIG. 5, this speaker's tension is determined to be “high tension” in the “20 to 25 second” period. As shown in FIG. 6, the third identification image I3 is therefore wider in that period than in the other periods.
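The way the three identification images follow from the per-period reference information can be sketched as below. The sketch assumes that each speaker appears in contiguous periods, as in the FIG. 5 example; the speaker keys, the relative width values, and the function name are hypothetical.

```python
# Hypothetical relative widths for the two tension levels.
WIDTH = {"normal": 1, "high tension": 2}

# (start_s, end_s, [(speaker, tension), ...]) for each 5-second period of FIG. 5.
PERIODS = [
    (0, 5, [("male 120R", "normal")]),
    (5, 10, [("male 120R", "high tension"), ("female 90R", "normal")]),
    (10, 15, [("male 120R", "normal"), ("female 90R", "normal")]),
    (15, 20, [("female 90R", "normal"), ("female 10L", "normal")]),
    (20, 25, [("female 10L", "high tension")]),
]


def build_identification_images(periods):
    """Return, per speaker, the span along the time bar and per-period widths."""
    images = {}
    for start, end, speakers in periods:
        for speaker, tension in speakers:
            image = images.setdefault(
                speaker, {"start": start, "end": end, "segments": []}
            )
            image["end"] = end  # extend the image to cover this period
            image["segments"].append((start, end, WIDTH[tension]))
    return images


for speaker, image in build_identification_images(PERIODS).items():
    print(speaker, image["start"], image["end"], image["segments"])
```

A speaker who stops and later resumes would need one image per contiguous run; that case is not handled here.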

The electronic device 1 according to the first embodiment described above analyzes the first and second audio data to determine a feature component of the voice contained in the audio data (the speaker's tension), and generates reference information by associating the feature component with the time at which it occurs (the time stamp). Based on the reference information, the electronic device 1 then displays a speaker display reproduction screen in which, together with a time bar, identification images for identifying the feature component associated with each time are arranged at the corresponding times on the time bar.
In particular, the electronic device 1 varies the identification image according to the speaker's tension (in the example shown in FIG. 6, the speaker's tension is expressed by the width of the identification images I1 to I3).
The user can therefore grasp at a glance, from the speaker display reproduction screen, the situation at the time of recording (what state the speaker's tension was in). The electronic device 1 according to the first embodiment thus has the effect of improving convenience.

The electronic device 1 according to the first embodiment also analyzes the first and second audio data to identify the speakers who uttered the voices contained in the first and second audio data, and determines feature information (the speaker's tension) for each identified speaker.
The user can therefore grasp at a glance, from the speaker display reproduction screen, both who the speaker was and what state that speaker's tension was in at the time of recording, which further improves convenience.
In particular, the electronic device 1 is provided with a pair of first and second microphones 111 and 121, and identifies the direction of a speaker based on the first and second audio data derived from the voices input through the first and second microphones 111 and 121. The electronic device 1 also identifies the gender of a speaker based on the frequency of each voice, and determines the speaker's tension based on the volume of each voice. Speaker identification (identification of the speaker's direction and gender) and determination of the speaker's tension can therefore be carried out with simple analysis processing.
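As a rough illustration of how simple these two cues can be, the sketch below computes a normalized volume balance between the two microphone signals and classifies gender by comparing a voice frequency with a boundary. The text does not give the mapping from volume balance to an angle, the sign convention, or any frequency boundary, so the formula, the 165 Hz default, and the function names are assumptions made for this example only.

```python
def volume_balance(first_mic_level, second_mic_level):
    """Normalized balance in [-1, 1]; positive means the second mic is louder.

    Assumption: levels are non-negative loudness measures (for example RMS).
    How the balance maps to a direction such as "120° to the right" is not
    specified in the text and is not modeled here.
    """
    total = first_mic_level + second_mic_level
    if total == 0:
        return 0.0
    return (second_mic_level - first_mic_level) / total


def classify_gender(voice_frequency_hz, boundary_hz=165.0):
    """Classify by frequency: relatively low is "male", relatively high is "female".

    boundary_hz is a hypothetical value chosen for illustration.
    """
    return "male" if voice_frequency_hz < boundary_hz else "female"


print(volume_balance(1.0, 3.0))
print(classify_gender(120.0), classify_gender(210.0))
```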

When the electronic device 1 according to the first embodiment cannot identify a speaker, it estimates that the voice in the period in which the speaker could not be identified was uttered by the speaker identified in the immediately preceding period.
Even when a speaker cannot be identified, the device can thus assume that the same person has continued speaking from one period to the next and estimate a plausible speaker.

(Modification of Embodiment 1)
In the first embodiment described above, the first and second audio data are analyzed and the reference information is generated when the electronic device 1 is set to the reproduction mode (step S111), but this is not a limitation.
For example, at least part of the analysis of the first and second audio data and the generation of the reference information (steps S111B to S111N) may be performed in parallel with the generation of the first and second audio data (steps S103 to S105).

In the first embodiment described above, if analysis of the first and second data elements recognizes laughter or an angry voice, an image of a laughing face or an angry face may be added on the speaker display reproduction screen at the position corresponding to the period in which it was recognized.

In the first embodiment described above, the speaker's tension is determined in two levels, “high tension” and “normal”, but this is not a limitation, and it may be determined in three or more levels.

In the first embodiment described above, the speaker's tension is determined by comparing the volume of the speaker's voice in the period concerned with that in the immediately preceding period, but this is not a limitation; the tension may instead be determined by comparing the volume of the speaker's voice in the period concerned with a predetermined threshold. The tension may also be determined from the change in the volume of the speaker's voice within the period concerned. The speaker's tension (the characteristic portion of the voice) has been assumed here to indicate emotional excitement, but it may instead reflect how focused the conversation is (for example, one speaker explaining while the others listen quietly). In that case, the tension may be determined by detecting the relative loudness of the detected voices of multiple people, the steadiness of the speaking pace (speaking as if patiently explaining), the speaking speed (rattling on), and the like. In other words, the speaker's tension is determined by detecting a relative change in one speaker's voice over time (for example, the speaker beginning to concentrate) or by detecting relative differences among the voices of multiple people.
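The variations of comparing the volume with predetermined thresholds and of using three or more levels can be combined in a small sketch. The threshold values, the label for the intermediate level, and the function name are hypothetical; the text only states that such variations are possible.

```python
from bisect import bisect_right


def tension_level(volume, thresholds, labels):
    """Map a volume to one of len(thresholds) + 1 tension labels.

    thresholds must be sorted in ascending order. A volume equal to a
    threshold counts as having reached it.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    return labels[bisect_right(thresholds, volume)]


# Hypothetical absolute thresholds (arbitrary units) for three levels.
THRESHOLDS = [60.0, 75.0]
LABELS = ["normal", "raised", "high tension"]

for volume in (50.0, 60.0, 80.0):
    print(volume, tension_level(volume, THRESHOLDS, LABELS))
```

With a single threshold and two labels, the same function reduces to the two-level determination against a predetermined threshold.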

In the first embodiment described above, when a speaker cannot be identified, the voice in the period in which the speaker could not be identified is estimated to have been uttered by the speaker identified in the period “immediately before” that period (step S111M), but this is not a limitation; it may instead be estimated to have been uttered by the speaker identified in the period “immediately after” that period.

FIG. 7 shows a modification of the first embodiment of the present invention.
An electronic device 1A with an added imaging function may be employed in place of the electronic device 1 described in the first embodiment.
Specifically, as shown in FIG. 7, the electronic device 1A adds an imaging unit 10 to the electronic device 1 described in the first embodiment and, in place of the device-side control unit 19, employs a device-side control unit 19A in which an imaging control unit 197 is added to the device-side control unit 19.
The imaging unit 10 captures an image of a subject and generates image data under the control of the imaging control unit 197. The imaging unit 10 is configured using an optical system (not shown) that forms a subject image, an image sensor such as a CCD (Charge Coupled Device) that receives the subject image formed by the optical system and converts it into an electrical signal, a signal processing unit that generates digital image data by performing signal processing (A/D conversion and the like) on the electrical signal (analog signal) from the image sensor, and so on. Under the control of the imaging control unit 197, the image data generated by the imaging unit 10 is given a time stamp generated by the clock unit 15 (a time stamp indicating the date and time at which the image data was generated) and is recorded in the recording unit 17.
The imaging control unit 197 causes the imaging unit 10 to capture the subject in response to a shooting operation by the user on the operation unit 13, and records the image data (including the time stamp) generated by the imaging unit 10 in the recording unit 17.
If the electronic device 1A is given an imaging function in this way and, for example, the meeting shown in FIG. 4 or the male M and the females L1 and L2 are each photographed, then on the speaker display reproduction screen W100 shown in FIG. 6 it becomes possible to arrange an image of the meeting shown in FIG. 4, or images of the male M and the females L1 and L2 in place of the male image MF and the female images LF1 and LF2.

(Embodiment 2)
Next, a second embodiment of the present invention is described.
In the following description, configurations and steps that are the same as in the first embodiment described above are given the same reference numerals, and their detailed description is omitted or simplified.
FIG. 8 is a block diagram showing the configuration of the audio processing system 100 according to the second embodiment of the present invention.
As shown in FIG. 8, the audio processing system 100 according to the second embodiment is a system in which the function of the electronic device 1 described in the first embodiment that “analyzes audio data and generates reference information” is given to a server 2, and an electronic device 1B that generates and reproduces audio data communicates with the server 2 via the Internet N.

[Configuration of the audio processing system]
The configurations of the electronic device 1B and the server 2 that make up the audio processing system 100 according to the second embodiment are described below in order.

[Configuration of the electronic device]
As shown in FIG. 8, the electronic device 1B according to the second embodiment adds a device-side communication unit 20 to the electronic device 1 described in the first embodiment (FIG. 1) and changes some of the functions of the device-side control unit 19.
The device-side communication unit 20 is a communication interface for wirelessly communicating various data, including the signals needed for communication, with the server 2 under the control of the device-side control unit 19B.

As shown in FIG. 8, the device-side control unit 19B according to the second embodiment omits the audio data analysis unit 192 and the reference information generation unit 193 and adds a device-side communication control unit 198.
The device-side communication control unit 198 performs the following processing when the speaker display process is executed.
Specifically, based on the location information (URL (Uniform Resource Locator)) of the server 2 recorded in the recording unit 17, the device-side communication control unit 198 transmits an access signal (a reference image transmission request that includes identification information identifying the electronic device 1B itself) via the device-side communication unit 20 to the server 2 connected to the Internet N, and establishes a communication connection with the server 2. The device-side communication control unit 198 then transmits the first and second audio data (including time stamps) subject to the speaker display process to the server 2 and receives the reference information from the server 2.

[Configuration of the server]
The server 2 analyzes the first and second audio data transmitted from the electronic device 1B together with the reference image transmission request, generates reference information, and transmits the reference information to the electronic device 1B.
The following description of the configuration of the server 2 focuses on the main parts of the present invention.
As shown in FIG. 8, the server 2 includes a server-side communication unit 21, an audio database 22, and a server-side control unit 23.

The server-side communication unit 21 is a communication interface for wirelessly communicating various data, including the signals needed for communication, with the electronic device 1B under the control of the server-side control unit 23.
Under the control of the server-side control unit 23, the audio database 22 records the first and second audio data (including time stamps) received from the electronic device 1B via the server-side communication unit 21. Also under the control of the server-side control unit 23, the audio database 22 records the reference information in association with the first and second audio data used to generate it.

The server-side control unit 23 is configured using a CPU or the like and controls the overall operation of the server 2. As shown in FIG. 8, the server-side control unit 23 includes a server-side communication control unit 231, a terminal determination unit 232, an audio data recording control unit 233, an audio data analysis unit 234, and a reference information generation unit 235.
In response to an access signal (a reference information transmission request that includes the identification information of the electronic device 1B) transmitted from the electronic device 1B via the Internet N and the server-side communication unit 21, the server-side communication control unit 231 controls the operation of the server-side communication unit 21 and establishes a communication connection with the electronic device 1B. The server-side communication control unit 231 then receives from the electronic device 1B the first and second audio data (including time stamps) to be recorded in the audio database 22, and transmits the reference information generated by the reference information generation unit 235 to the electronic device 1B.
The server-side communication control unit 231 functions as the audio data acquisition unit according to the present invention.

Based on the access signal transmitted from the electronic device 1B via the Internet N, the terminal determination unit 232 determines (identifies) the electronic device 1B that made the access.
The audio data recording control unit 233 records the first and second audio data (including time stamps) received from the electronic device 1B via the server-side communication unit 21 in the audio database 22.

The audio data analysis unit 234 (an object specifying unit 2341 and a feature component determination unit 2342) has the same functions as the audio data analysis unit 192 (the object specifying unit 1921 and the feature component determination unit 1922) described in the first embodiment, and analyzes the first and second audio data received from the electronic device 1B via the server-side communication unit 21 and recorded in the audio database 22.
The reference information generation unit 235 has the same function as the reference information generation unit 193 described in the first embodiment, and generates reference information based on the analysis result of the audio data analysis unit 234. The reference information generation unit 235 then records the reference information in the audio database 22 in association with the first and second audio data used to generate it.
The server-side communication control unit 231, the audio data analysis unit 234, and the reference information generation unit 235 together function as the audio processing apparatus according to the present invention.

[Operation of the audio processing system]
Next, the operation of the audio processing system 100 described above will be described.
The operation of the electronic device 1B and the operation of the server 2 are described below, in that order, as the operation of the audio processing system 100.

[Operation of the electronic device]
The operation of the electronic device 1B differs from the operation of the electronic device 1B described in the first embodiment (FIGS. 2 and 3) only in the speaker display process (step S111). Therefore, only the speaker display process (step S111) according to the second embodiment is described below.
FIG. 9 is a flowchart showing the speaker display process (step S111) according to the second embodiment of the present invention.
As shown in FIG. 9, the speaker display process according to the second embodiment differs from the speaker display process described in the first embodiment (FIG. 3) only in that steps S111A to S111O are omitted and steps S111R and S111S are added. Therefore, only steps S111R and S111S are described below.

Step S111R is the first step executed in the speaker display process (step S111).
Specifically, the device-side communication control unit 198 transmits, via the device-side communication unit 20, an access signal (a reference information transmission request, including identification information of its own electronic device 1B) to the server 2 connected to the Internet network N, and establishes a communication connection with the server 2. The device-side communication control unit 198 then transmits to the server 2 the first and second audio data to be subjected to the speaker display process (the first and second audio data, including time stamps, selected in step S109).

Subsequently, the device-side communication control unit 198 receives the reference information from the server 2 via the device-side communication unit 20 and stores it in the memory unit 16 (step S111S).
The electronic device 1B then generates a speaker display reproduction screen based on the reference information stored in the memory unit 16 (step S111P), and displays the speaker display reproduction screen on the display unit 14 (step S111Q).
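The device-side sequence above (steps S111R, S111S, S111P, and S111Q) can be sketched roughly as follows. This is only an illustrative outline: the `FakeServer` class, the `speaker_display_process` function, the dictionary layout of the reference information, and the text-line "screen" are all assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for the server 2; returns canned reference information."""
    received: list = field(default_factory=list)

    def request_reference_info(self, device_id, first_audio, second_audio):
        # Remember what the device sent so the exchange can be inspected.
        self.received.append((device_id, first_audio, second_audio))
        return [{"period": (0, 5), "speaker": "M", "tension": "high"}]


def speaker_display_process(device_id, first_audio, second_audio, server, memory):
    # Step S111R: send the request (with the device identifier) and the audio data.
    # Step S111S: store the returned reference information in the memory unit.
    memory["reference_info"] = server.request_reference_info(
        device_id, first_audio, second_audio
    )
    # Step S111P: build the screen from the stored reference information.
    # Here the "screen" is just a list of text lines (an assumption).
    screen = [
        f"{r['period'][0]}-{r['period'][1]}s {r['speaker']} ({r['tension']})"
        for r in memory["reference_info"]
    ]
    # Step S111Q would pass the screen to the display unit; here it is returned.
    return screen
```

Calling `speaker_display_process("1B", b"a", b"b", FakeServer(), {})` returns one text line per period of reference information, which shows that the device itself performs no analysis in this embodiment.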

[Operation of the server]
FIG. 10 is a flowchart showing the operation of the server 2.
The server-side communication control unit 231 determines whether an access signal (a reference information transmission request, including identification information of the electronic device 1B) has been received from the electronic device 1B via the server-side communication unit 21 and the Internet network N (step S201).
When it is determined that no reference information transmission request has been received (step S201: No), the server 2 proceeds to step S210.
On the other hand, when it is determined that a reference information transmission request has been received (step S201: Yes), the terminal determination unit 232 identifies, based on the transmission request, the source electronic device 1B that has accessed the server (step S202).

Subsequently, the server-side communication control unit 231 receives the first and second audio data (including time stamps) from the electronic device 1B via the server-side communication unit 21 and the Internet network N (step S203: audio data acquisition step). The audio data recording control unit 233 then refers to the first and second audio data already recorded in the audio database 22 and, if first and second audio data identical to those received in step S203 have not yet been recorded, records the received first and second audio data in the audio database 22.

Subsequently, the server-side control unit 23 determines whether reference information for the first and second audio data received in step S203 has already been generated (step S204). That is, in step S204 the server-side control unit 23 determines whether reference information is associated with the first and second audio data recorded in the audio database 22.
When it is determined that the reference information has already been generated (step S204: No), the server 2 proceeds to step S219.
On the other hand, when it is determined that the reference information has not yet been generated (step S204: Yes), the server 2 analyzes the first and second audio data received in step S203 and generates reference information (steps S205 to S217), in the same manner as steps S111B to S111N described in the first embodiment.
Step S208 corresponds to the audio data analysis step according to the present invention, and step S209 corresponds to the reference information generation step according to the present invention.

After step S217, or when it is determined in step S215 that no speaker was identified in the immediately preceding period (step S215: No), the server 2 records the reference information for each period generated in steps S209, S211, and S212 (or, if it was updated in step S217, the updated reference information) in the audio database 22 in association with the first and second audio data used to generate it (step S218).
After step S218, or when it is determined in step S204 that the reference information has already been generated (step S204: No), the server-side communication control unit 231 transmits, via the server-side communication unit 21 and the Internet network N, the reference information recorded in step S218 in association with the first and second audio data received in step S203 to the electronic device 1B identified in step S202 (step S219). The server 2 then returns to step S201.
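The record-once and generate-once behaviour of steps S203, S204, S218, and S219 can be illustrated with the minimal sketch below. The class name `ReferenceServer`, the use of SHA-256 digests to decide whether two sets of audio data are "identical", and the injected `analyze` callable (standing in for steps S205 to S217) are assumptions made for illustration; the specification does not state how identity is checked or how the database is organized.

```python
import hashlib


class ReferenceServer:
    """Hypothetical sketch of the caching behaviour of the server 2."""

    def __init__(self, analyze):
        self.analyze = analyze   # assumed stand-in for steps S205 to S217
        self.audio_db = {}       # key -> (first audio data, second audio data)
        self.reference_db = {}   # key -> reference information

    @staticmethod
    def _key(first_audio, second_audio):
        # Assumption: identical audio data is detected by comparing digests.
        return (
            hashlib.sha256(first_audio).hexdigest(),
            hashlib.sha256(second_audio).hexdigest(),
        )

    def handle_request(self, device_id, first_audio, second_audio):
        key = self._key(first_audio, second_audio)
        # Step S203: record the audio data only if it is not recorded yet.
        if key not in self.audio_db:
            self.audio_db[key] = (first_audio, second_audio)
        # Step S204: generate reference information only if none is associated.
        if key not in self.reference_db:
            # Steps S205 to S218: analyze, generate, and record.
            self.reference_db[key] = self.analyze(first_audio, second_audio)
        # Step S219: return the reference information to the identified device.
        return device_id, self.reference_db[key]
```

With this structure, a second request carrying the same audio data skips the analysis and goes straight to the transmission in step S219.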

When it is determined in step S201 that no reference information transmission request has been received (step S201: No), the server 2 executes other processing different from the processing described above (step S210). The server 2 then returns to step S201.

According to the second embodiment described above, the same effects as in the first embodiment can be obtained, and the configuration of the electronic device 1B can also be simplified.

(Other embodiments)
Modes for carrying out the present invention have been described so far, but the present invention should not be limited only to the first and second embodiments described above.
FIGS. 11A to 11C are diagrams showing modifications of the speaker display reproduction screen described in the first and second embodiments.
In the speaker display reproduction screen W100 illustrated in the first and second embodiments, the first to third identification images I1 to I3 express the speaker's tension by the thickness of their width. The expression is not limited to this, and the tension may instead be expressed as in, for example, the speaker display reproduction screens W101 to W103 shown in FIGS. 11A to 11C.
Specifically, in the speaker display reproduction screen W101 shown in FIG. 11A, the first to third identification images I1 to I3 express changes in the speaker's tension as waveforms. That is, in the speaker display reproduction screen W101 shown in FIG. 11A, the vertical direction indicates the level of tension.
The speaker display reproduction screen W102 shown in FIG. 11B is a 3D rendering of the speaker display reproduction screen W101 shown in FIG. 11A.
Further, in the speaker display reproduction screen W103 shown in FIG. 11C, the first to third identification images I1 to I3 express the speaker's tension by pixel value. That is, in the speaker display reproduction screen W103 shown in FIG. 11C, portions with a high pixel value (bright portions) indicate times at which the tension is high. Here, the speaker's tension indicates emotional excitement, but it may also reflect how focused the conversation is (for example, one speaker explains while the other persons listen quietly). In that case, the speaker's tension may be determined by detecting the relative loudness of the detected voices of multiple persons, the steadiness of the speaking pace (speaking in an explanatory, persuasive manner), the speaking speed (rapid-fire talk), and so on. Such voice characteristics (and changes in them) also make it possible, for example, to perform a search for determining the situation.
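As a rough illustration of two of the display variants described above (width for screen W100 and pixel value for screen W103), the sketch below maps a normalized tension score to a bar width and to a grayscale value. The 0.0 to 1.0 score range, the function names, and the specific pixel ranges are assumptions chosen for this example; the specification does not define numeric scales.

```python
def _clamp(tension, max_tension):
    # Keep the score inside the assumed range [0, max_tension].
    return min(max(tension, 0.0), max_tension)


def tension_to_width(tension, min_px=2, max_px=20, max_tension=1.0):
    """Width-based expression (as in W100): higher tension gives a wider bar."""
    ratio = _clamp(tension, max_tension) / max_tension
    return round(min_px + (max_px - min_px) * ratio)


def tension_to_pixel(tension, max_tension=1.0):
    """Pixel-value expression (as in W103): higher tension gives a brighter pixel."""
    ratio = _clamp(tension, max_tension) / max_tension
    return round(255 * ratio)
```

A waveform display such as W101 could plot the same score on the vertical axis against time instead of converting it to a width or brightness.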

FIG. 12 is a diagram showing a modification of the reference information described in the first and second embodiments.
In the first and second embodiments, the reference information is not limited to the reference information described there (for example, FIG. 5); the reference information shown in FIG. 12, for example, may be adopted instead.
For example, in the first embodiment, a specific keyword is recorded in advance in the recording unit 17. In FIG. 12, only one specific keyword is used for convenience of explanation, but a plurality of keywords may be used. The audio data analysis unit 192 analyzes the first and second data elements and determines whether they contain the specific keyword recorded in the recording unit 17. When the audio data analysis unit 192 determines that the specific keyword is contained, the reference information generation unit 193 generates, as the reference information for the corresponding period (the "5 to 10 seconds" period in the example of FIG. 12), reference information in which the "keyword" flag is set to on. In the second embodiment, the server 2 may execute the processing described above to generate the reference information shown in FIG. 12.
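The keyword flag described above can be sketched as follows. The example assumes that a text transcript is already available for each period, which is a simplification introduced here: the specification only says that the data elements are analyzed for the keyword and does not describe how. The function name `add_keyword_flags` and the dictionary fields are likewise hypothetical.

```python
def add_keyword_flags(periods, registered_keywords):
    """Return reference information with a "keyword" flag added per period.

    periods: list of dicts with "period", "speaker", and "text" entries
             ("text" is an assumed per-period transcript).
    registered_keywords: keywords recorded in advance (one or more).
    """
    flagged = []
    for entry in periods:
        # The flag is on when any registered keyword occurs in the period.
        has_keyword = any(k in entry["text"] for k in registered_keywords)
        flagged.append({**entry, "keyword": has_keyword})
    return flagged
```

For instance, with the registered keyword "budget" and a "5 to 10 seconds" period whose transcript contains that word, only that period's entry has its flag set to on.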

FIG. 13 is a diagram showing an example of the speaker display reproduction screen W104 generated based on the reference information shown in FIG. 12.
For example, in the first and second embodiments, when the electronic device 1 or the server 2 has generated the reference information shown in FIG. 12, the electronic devices 1 and 1B generate, for example, the speaker display reproduction screen W104 shown in FIG. 13.
Specifically, the speaker display reproduction screen W104 shown in FIG. 13 is the speaker display reproduction screen W100 shown in FIG. 6 with a keyword input part KW added.
The keyword input part KW is a part into which a keyword is entered through a user operation on the operation unit 13.
When a keyword identical to the specific keyword recorded in the recording unit 17 is entered through a user operation on the operation unit 13, the reproduction screen generation unit 194 changes the speaker display reproduction screen W104 between the state before the entry (FIG. 13(a)) and the state after the entry (FIG. 13(b)) as follows.
That is, the reproduction screen generation unit 194 refers to the reference information shown in FIG. 12 and generates a speaker display reproduction screen W104 (FIG. 13(b)) in which the luminance of the identification image corresponding to the speaker in the period whose "keyword" flag is on (the second identification image I2 in the example of FIGS. 12 and 13(b)) is increased.
As long as the display mode of the identification image corresponding to the speaker in the period whose "keyword" flag is on differs from the previous display mode of that identification image, the method is not limited to the luminance increase described above, and other methods may be adopted.
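The screen change between FIG. 13(a) and FIG. 13(b) amounts to selecting which identification images to emphasize once the entered keyword matches a registered one. A minimal sketch, assuming the reference information is a list of dictionaries with "speaker" and "keyword" fields (a hypothetical layout) and a hypothetical function name `speakers_to_highlight`:

```python
def speakers_to_highlight(reference_info, registered_keywords, entered_keyword):
    """Return the speakers whose identification images should be emphasized."""
    # Before a matching keyword is entered (FIG. 13(a)), nothing is emphasized.
    if entered_keyword not in registered_keywords:
        return set()
    # After a matching entry (FIG. 13(b)), emphasize every speaker that has
    # at least one period with the "keyword" flag on.
    return {e["speaker"] for e in reference_info if e.get("keyword")}
```

How the emphasis is rendered (higher luminance, a different color, an outline, and so on) is left open here, in line with the statement above that any display mode differing from the previous one may be used.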

The processing flow is not limited to the order of processing in the flowcharts described in the first and second embodiments, and may be changed as long as no contradiction arises.
Furthermore, the processing algorithms described in this specification using flowcharts can be written as programs. Such a program may be recorded in a recording unit inside a computer or on a computer-readable recording medium. The program may be recorded in the recording unit or on the recording medium when the computer or the recording medium is shipped as a product, or by downloading it via a communication network.

For ease of understanding, the embodiments above were described using search technology for IC recorders, sound recorders, and the like as an example. However, in a moving-image capturing system that works with a video camera or the like, audio and images are associated with each other, so the invention can also be applied to cameras with an audio recording function and similar devices. Captured images can be recorded, searched, and tagged based on the speaker's tension.

The invention is not limited to general cameras. In-vehicle cameras have uses such as capturing or searching for images based on conversation inside the vehicle, which makes it possible to identify the driver or to perform hands-free shooting while driving. Surveillance cameras, inspection cameras, and medical cameras can search for images related to the conversation of a specific person. Even when the invention is applied to an inspection apparatus, important scenes can be checked based on conversation, not only in images of the inspection target but also in footage of the inspection being carried out, so effects unique to the present invention can be expected.

DESCRIPTION OF SYMBOLS 1, 1A, 1B ... electronic device; 2 ... server; 10 ... imaging unit; 11 ... first audio data generation unit; 12 ... second audio data generation unit; 13 ... operation unit; 14 ... display unit; 15 ... clock unit; 16 ... memory unit; 17 ... recording unit; 18 ... audio output unit; 19, 19A, 19B ... device-side control unit; 20 ... device-side communication unit; 21 ... server-side communication unit; 22 ... audio database; 23 ... server-side control unit; 100 ... audio processing system; 111 ... first microphone; 112 ... first amplification unit; 113 ... first A/D conversion unit; 121 ... second microphone; 122 ... second amplification unit; 123 ... second A/D conversion unit; 181 ... D/A conversion unit; 182 ... amplifier; 183 ... speaker; 191 ... audio data acquisition unit; 192 ... audio data analysis unit; 193 ... reference information generation unit; 194 ... reproduction screen generation unit; 195 ... display control unit; 196 ... audio control unit; 197 ... imaging control unit; 198 ... device-side communication control unit; 231 ... server-side communication control unit; 232 ... terminal determination unit; 233 ... audio data recording control unit; 234 ... audio data analysis unit; 235 ... reference information generation unit; 1921 ... object specifying unit; 1922 ... feature component determination unit; 2341 ... object specifying unit; 2342 ... feature component determination unit; Ax ... axis; I1 to I3 ... first to third identification images; KW ... keyword input part; L1, L2 ... woman; LF1, LF2 ... female image; M ... man; MF ... male image; N ... Internet network; SC ... time scale; SL ... slider; TB ... time bar; W100 to W104 ... speaker display reproduction screen

Claims

An audio processing apparatus comprising:
an audio data acquisition unit that acquires audio data;
an audio data analysis unit that analyzes the audio data and discriminates a high-tension component of the voice included in the audio data; and
a reference information generation unit that associates the high-tension component with the time at which the high-tension component is included in the audio data, and generates reference information used when generating a reproduction screen for the audio data,
wherein the audio data analysis unit
analyzes the audio data and, for each predetermined time range in the audio data, identifies the speaker who uttered the voice included in the audio data and discriminates the high-tension component, and, when the speaker cannot be identified, estimates that the voice in the time range for which the speaker could not be identified was uttered by the speaker identified in the time range immediately before or immediately after that time range.

The audio processing apparatus according to claim 1, wherein the audio data analysis unit
analyzes the audio data to identify the speakers who uttered the voices included in the audio data, and determines the presence or absence of the high-tension component for each identified speaker.

The audio processing apparatus according to claim 1, wherein the audio data analysis unit
determines the speaker's tension by detecting the high-tension component based on the speaker's emotional excitement, the relative loudness of the voices of multiple persons, the steadiness of the pace of the words spoken by the speaker, the speaker's speaking speed, the volume of the voice, the frequency of the voice, the time density of the phoneme components of the voice, or a specific voice component.

The audio processing apparatus according to any one of claims 1 to 3, further comprising a reproduction screen generation unit that generates, based on the reference information generated by the reference information generation unit, a speaker display reproduction screen in which an identification image for identifying the tension is arranged in association with the time.

The audio processing apparatus according to claim 4, wherein the reproduction screen generation unit
generates, in the identification image for identifying the tension, an icon displayed to identify the speaker.

The audio processing apparatus according to claim 5, wherein the identification image is
generated as data in which the voice included in the audio data is displayed in correspondence with the time and a period indicating the speaker's tension is displayed wider than other periods.

The audio processing apparatus according to claim 5, wherein the identification image is
generated as data in which the voice included in the audio data is displayed in correspondence with the time and a period indicating the speaker's tension is displayed wider than other periods, in an analog or stepwise manner.

The audio processing apparatus according to claim 4 or 5, further comprising an operation reception unit that accepts a keyword input operation,
wherein the reproduction screen generation unit,
when a specific sound based on the high-tension component matches the keyword accepted by the operation reception unit, makes the identification image for identifying the high-tension component different from the previous identification image.

An audio processing method performed by an audio processing apparatus, the method comprising:
an audio data acquisition step of acquiring audio data;
an audio data analysis step of analyzing the audio data and discriminating a high-tension component of the voice included in the audio data; and
a reference information generation step of associating the high-tension component with the time at which the high-tension component is included in the audio data, and generating reference information used when generating a reproduction screen for the audio data,
wherein, in the audio data analysis step,
the audio data is analyzed and, for each predetermined time range in the audio data, the speaker who uttered the voice included in the audio data is identified and the high-tension component is discriminated, and, when the speaker cannot be identified, the voice in the time range for which the speaker could not be identified is estimated to have been uttered by the speaker identified in the time range immediately before or immediately after that time range.

An audio processing program for causing an audio processing apparatus to execute the audio processing method according to claim 9.