JP5383056B2

JP5383056B2 - Sound data recording/reproducing apparatus and sound data recording/reproducing method

Info

Publication number: JP5383056B2
Application number: JP2008025678A
Authority: JP
Inventors: 一博中臺; 雄二長谷川; 広司辻野; 博奥乃
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2007-02-14
Filing date: 2008-02-05
Publication date: 2014-01-08
Anticipated expiration: 2028-02-05
Also published as: JP2008197650A

Description

The present invention relates to a sound data recording/reproducing apparatus and a sound data recording/reproducing method that make it easy to reproduce desired sound data from recorded sound data.

Reproducing desired sound data from recorded sound data is harder than reproducing desired image data from recorded image data.

The first reason is that sound data is difficult to survey along the time axis. With image data, the desired data can be located and reproduced relatively easily by fast-forwarding, or by sampling frames at fixed time intervals and displaying them together. Fast-forwarding does not alter properties such as the color tone of an image, and reducing the number of frames does not cause serious recognition errors. Sound data, by contrast, becomes hard to recognize when fast-forwarded, and short excerpts sampled from it cannot be recognized even when played back.

The second reason is that sound sources are difficult to discriminate. With image data, even when two objects overlap on the screen, the object in front can be recognized correctly and the two objects can be told apart. With sound data, by contrast, it is difficult to understand what is being said when, for example, the utterances of several speakers overlap.

Systems have been developed that, for example, estimate the position of a sound source and display the estimated position on an image (see, for example, Patent Document 1).

However, no sound data recording/reproducing apparatus or sound data recording/reproducing method has been developed that both discriminates sound sources and allows the sound data of each sound source to be surveyed over time.
Patent Document 1: Japanese Patent Laid-Open No. 2003-111183

Accordingly, there is a need for a sound data recording/reproducing apparatus and a sound data recording/reproducing method configured so that the sound data of each sound source can be surveyed over time, allowing desired sound data to be easily reproduced from recorded sound data.

The sound data recording/reproducing apparatus according to the present invention includes a sound data acquisition unit that acquires sound data, a sound source localization unit that identifies the direction in which each sound source exists, and a sound source separation unit that separates the sound data of each sound source. The apparatus further includes a time-series data storage unit that stores time-series sound data for each sound source; a stream data storage unit that stores sound stream data indicating the direction of a predetermined sound source over a predetermined period of time; and a data processing unit that is connected to the time-series data storage unit and the stream data storage unit and processes the data. The apparatus further includes a sound data reproduction unit that reproduces sound data and a display unit that displays stream data, and is configured so that, when stream data displayed by the display unit is selected, the sound data reproduction unit reproduces the sound data associated with the selected stream data.

The method of reproducing sound data with the sound data recording/reproducing apparatus according to the present invention acquires sound data, identifies the direction in which each sound source exists, and separates the sound data of each sound source. The method then stores the time-series sound data of each sound source in the time-series data storage unit, creates sound stream data indicating the direction of a predetermined sound source over a predetermined period of time, and stores the stream data in the stream data storage unit. Finally, the method displays the stream data and, when displayed stream data is selected, reproduces the sound data associated with the selected stream data.

According to the present invention, sound stream data indicating the direction of a predetermined sound source over a predetermined period of time is displayed, so the sound data of each sound source can be surveyed over time and desired sound data can be easily reproduced from the recorded sound data.

According to an embodiment of the present invention, speech is recognized from the sound data, text information for the speech is generated, and the text information is displayed.

According to this embodiment, even a hearing-impaired person, for example, can easily reproduce and use desired sound data (speech) from the recorded sound data.

According to an embodiment of the present invention, image data is acquired, time-series image data is stored in the time-series data storage unit, and the data stored in the stream data storage unit further includes image stream data indicating the direction of a predetermined target over a predetermined period of time.

According to this embodiment, the sound data of a sound source (what a speaker says) and the image data (the speaker's facial expression) can also be associated with each other and reproduced together.

FIG. 1 is a diagram showing the configuration of a sound data recording/reproducing apparatus according to an embodiment of the present invention.

In the sound data recording/reproducing apparatus, the sound data acquisition unit 101 acquires sound data. The acquired sound data is sent to the sound source localization unit 103, which identifies the direction in which each sound source exists. The acquired sound data is also sent to the sound source separation unit 105, which separates the sound of each sound source. In addition, the image data acquisition unit 107 acquires image data.

The sound data recording/reproducing apparatus has three kinds of memory: a time-series data storage unit 109, a stream data storage unit 111, and a stream list storage unit 113. The data structures of these memories are described later.

The sound data recording/reproducing apparatus also includes a data processing unit 115, which is connected to the time-series data storage unit 109, the stream data storage unit 111, and the stream list storage unit 113, and a speech recognition unit 121, which is connected to the time-series data storage unit 109 and the data processing unit 115. The data processing unit 115 is further connected to a display/input unit 117 and a sound data reproduction unit 119. The data processing unit 115 processes the data stored in the three memories so that the display/input unit 117 displays an overview of the sound data and the sound data reproduction unit 119 reproduces the desired sound data in response to operator input from the display/input unit 117. On instruction from the data processing unit 115, the speech recognition unit 121 recognizes speech in the sound data stored in the time-series data storage unit 109, generates text information for the speech, and stores it in the time-series data storage unit 109. The display/input unit 117 may consist of a display unit and an input unit in separate housings.

As an example, the sound data acquisition unit 101 includes eight microphones and an acoustic signal processor that processes the sound data they collect. The processor can, for example, sample 16 channels of sound data at a predetermined frequency. The eight microphones consist of seven ordinary microphones and one surround microphone. The seven ordinary microphones are mounted on a spherical base designed so that each microphone readily picks up sound arriving from the direction in which it is installed.

The sound data recording/reproducing apparatus according to the present invention may be used, for example, to acquire and reproduce sound data at a fixed location such as a conference room. In that case, the sound data acquisition unit 101 is placed at that location. In another embodiment, the apparatus may be used to acquire and reproduce sound data on a moving body. In that case, the sound data acquisition unit 101 is attached to the moving body. Examples of moving bodies include vehicles, surveillance robots, and people. In combination with GPS (Global Positioning System), for example, the direction of a sound source can be recognized at the position of the moving body.

To identify the direction in which a sound source exists, the sound source localization unit 103 performs, for example, localization with a steered beamformer and improves its accuracy with a Kalman filter (Masamitsu Murase, Shun'ichi Yamamoto, Jean-Marc Valin, Kazuhiro Nakadai, Kentaro Yamada, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno: Multiple Moving Speaker Tracking by Microphone Array on Mobile Robot, Proceedings of the Ninth European Conference on Speech Communication and Technology (Interspeech-2005), 249-252, Lisboa, Sep. 2005 p.10). Steered-beamformer localization is based on the cross-correlation between microphone pairs and proceeds in the following steps.

1) Consider a sphere centered at the origin of the coordinate system in which the microphone coordinates are defined, and place 2562 points at equal intervals on its surface.
2) For each point, compute the sum of the cross-correlations over all microphone pairs. The direction of the point at which this sum is largest is taken as the estimated direction of a sound source.
3) Set all cross-correlation values belonging to the estimated direction to 0.
4) Repeat 2) and 3) until all sound source directions have been estimated.

The estimated sound source directions are then scanned along the time axis, and estimates whose directions are close to one another are labeled as a single sound source.
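Steps 1) and 2) above can be illustrated with a minimal sketch. It is a simplification made under stated assumptions, not the implementation of the cited work: candidate directions lie on a circle of azimuths rather than on 2562 points of a sphere, cross-correlation is evaluated in the time domain at integer sample lags, a far-field source is assumed, and only a single source is found (steps 3) and 4) are omitted). All function names and the value of `SPEED_OF_SOUND` are hypothetical.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed


def expected_lag(mic_a, mic_b, azimuth, sample_rate):
    """Delay in samples of mic_b relative to mic_a for a far-field source at `azimuth` (radians)."""
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    # A plane wave from direction u reaches position p earlier than the origin by (p . u) / c.
    arrival_a = -(mic_a[0] * ux + mic_a[1] * uy) / SPEED_OF_SOUND
    arrival_b = -(mic_b[0] * ux + mic_b[1] * uy) / SPEED_OF_SOUND
    return round((arrival_b - arrival_a) * sample_rate)


def cross_correlation(x, y, lag):
    """Sum of x[n] * y[n + lag] over the samples where both signals are defined."""
    start = max(0, -lag)
    stop = min(len(x), len(y) - lag)
    return sum(x[n] * y[n + lag] for n in range(start, stop))


def steered_response(signals, mics, azimuth, sample_rate):
    """Step 2: sum the pairwise cross-correlations for one candidate direction."""
    return sum(
        cross_correlation(signals[a], signals[b],
                          expected_lag(mics[a], mics[b], azimuth, sample_rate))
        for a, b in combinations(range(len(mics)), 2)
    )


def localize(signals, mics, sample_rate, n_directions=72):
    """Return the candidate azimuth (radians) with the largest steered response."""
    candidates = [2 * math.pi * i / n_directions for i in range(n_directions)]
    return max(candidates,
               key=lambda az: steered_response(signals, mics, az, sample_rate))
```

Here `signals` is a list of equally long sample lists, one per microphone, and `mics` holds the matching (x, y) positions in meters. Finding several sources would require repeating the search after suppressing the contribution of each detected direction, as steps 3) and 4) describe.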

A Kalman filter predicts the current and future internal state of a dynamically changing system, such as a moving sound source, from past observations. For example, it estimates a speaker's current state from that speaker's past states. As a result, when a single speaker moves, the same label can be reliably assigned to that speaker throughout.
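The following is a minimal sketch of how such a filter can keep a label attached to a moving speaker. It assumes a scalar random-walk model over azimuth in degrees, ignores wrap-around at 0/360 degrees, and uses a simple nearest-track rule with a fixed gate. The class `AzimuthTrack`, the function `assign_label`, and all noise values are hypothetical choices for illustration, not those of the cited work.

```python
from dataclasses import dataclass


@dataclass
class AzimuthTrack:
    """Scalar Kalman filter over one source's azimuth (degrees)."""
    label: str
    azimuth: float                   # current state estimate
    variance: float = 100.0          # uncertainty of the estimate
    process_noise: float = 1.0       # assumed drift variance per frame
    measurement_noise: float = 4.0   # assumed localization error variance

    def predict(self):
        # Random-walk model: the azimuth is expected to stay put,
        # but the uncertainty grows with every frame.
        self.variance += self.process_noise
        return self.azimuth

    def update(self, measured_azimuth):
        gain = self.variance / (self.variance + self.measurement_noise)
        self.azimuth += gain * (measured_azimuth - self.azimuth)
        self.variance *= (1.0 - gain)


def assign_label(tracks, measured_azimuth, gate=15.0):
    """Attach a localization result to the nearest predicted track, or return None."""
    best = min(tracks, key=lambda t: abs(t.predict() - measured_azimuth), default=None)
    if best is None or abs(best.azimuth - measured_azimuth) > gate:
        return None
    best.update(measured_azimuth)
    return best.label
```

A measurement that falls outside the gate of every track returns `None`; a caller could treat that as the start of a new sound source.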

To separate the sound data of each sound source, the sound source separation unit 105 performs, for example, sound source separation by Geometric Source Separation (GSS) followed by noise suppression with a post-filter (J.-M. Valin, J. Rouat, F. Michaud: Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter, Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2123-2128, 2004).

Assuming that, at frequency k, the conversion from the acoustic signal to the observed signal x(k) is linear, the transfer function from the sound source signal to the observed signal is expressed by Equation (1).

Here, A(k) is the matrix representing the conversion, and n(k) is noise. The sound source signal y(k) estimated from this is expressed by Equation (2).

Because x(k) is the observed signal, finding the sound source signal reduces to the problem of finding the transformation matrix W(k). Equation (3) follows from assuming that the signals of different sound sources are independent, and Equation (4) follows from the geometric constraints imposed by the sound sources and microphones.

Because the constraints of Equations (3) and (4) are strict, an approximate solution for W is obtained by the stochastic gradient method.
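Equations (1) to (4) are not reproduced in this text. A plausible reconstruction, following the usual formulation of Geometric Source Separation, is given below. The symbol s(k) for the sound source signal and the output correlation matrix R_yy(k) = E[y(k) y(k)^H] are assumptions introduced here; the original equations may use different notation or an equivalent form.

```latex
\begin{align}
x(k) &= A(k)\, s(k) + n(k) \tag{1} \\
y(k) &= W(k)\, x(k) \tag{2} \\
R_{yy}(k) - \operatorname{diag}\!\left[ R_{yy}(k) \right] &= 0 \tag{3} \\
W(k)\, A(k) &= I \tag{4}
\end{align}
```

Under this reading, Equation (3) requires the separated outputs to be mutually uncorrelated, and Equation (4) requires the separation matrix to invert the transfer matrix determined by the source and microphone geometry.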

Noise suppression by the post-filter removes noise from the sound separated by GSS. The method estimates and removes stationary noise and non-stationary noise separately. Stationary noise is computed by MCRA (Minima Controlled Recursive Averaging). Non-stationary noise is assumed to have leaked from other channels during the GSS process, and the interference component is estimated adaptively.

The data structures of the time-series data storage unit 109, the stream data storage unit 111, and the stream list storage unit 113 are described below.

FIG. 2 is a diagram showing the data structures for sound data in the time-series data storage unit 109, the stream data storage unit 111, and the stream list storage unit 113.

The time-series data storage unit 109 stores the sound waveform data for each time. The waveform data includes both unseparated waveform data and waveform data separated for each sound source. Each item of separated data is accompanied by data on the direction of its sound source. The data for the individual times are linked in the time direction, so that, for example, the data at time t = 1 or time t = 3 can be reached from the data at time t = 2.

The stream data storage unit 111 stores sound stream data. A sound stream is the sound of a particular sound source that continues for a certain period of time. Sound stream data indicates, for such a sound, the direction (position) of the sound source at each time. Specifically, the stream data contains a start time, an end time, an identifier of the sound source (for example, a person's name), and data on the direction of the sound source at each time within the period. The direction of the sound source is expressed, for example, as angles in the horizontal and vertical planes. The stream data also contains links to the sound waveform data, so that, for example, the waveform data for a given time can be reached from the stream data at that time.
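One way to picture the two structures for sound data is the following sketch. The class and field names (`SoundFrame`, `SoundStream`, `link_frames`, and so on) are hypothetical; the source describes only what information is stored and how it is linked, not a concrete layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Direction = Tuple[float, float]  # (horizontal angle, vertical angle) in degrees


@dataclass
class SoundFrame:
    """Entry of the time-series data storage for one time step."""
    time: int
    mixed_waveform: List[float]                       # unseparated waveform
    separated: Dict[str, List[float]] = field(default_factory=dict)
    directions: Dict[str, Direction] = field(default_factory=dict)
    # Links in the time direction; excluded from repr/compare to avoid cycles.
    prev: Optional["SoundFrame"] = field(default=None, repr=False, compare=False)
    next: Optional["SoundFrame"] = field(default=None, repr=False, compare=False)


@dataclass
class SoundStream:
    """Entry of the stream data storage for one continuous sound source."""
    source_id: str                                    # e.g. a person's name
    start_time: int
    end_time: int
    directions: Dict[int, Direction] = field(default_factory=dict)
    frames: Dict[int, SoundFrame] = field(default_factory=dict)  # links to waveforms

    def waveform_at(self, time):
        """Follow the link from the stream to the separated waveform at `time`."""
        return self.frames[time].separated[self.source_id]


def link_frames(frames):
    """Connect consecutive frames so each can reach its neighbors in time."""
    for earlier, later in zip(frames, frames[1:]):
        earlier.next, later.prev = later, earlier
    return frames
```

For example, after `link_frames` has been applied to the frames for t = 1, 2, 3, the frame for t = 2 reaches the others through `prev` and `next`, mirroring the time-direction links described above.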

FIG. 3 is a diagram showing the data structures for image data in the time-series data storage unit 109, the stream data storage unit 111, and the stream list storage unit 113.

The time-series data storage unit 109 stores the image data (video information) for each time. The image data is accompanied by data on the direction of each target (for example, a person). People are identified using a general face recognition technique (for example, Japanese Patent Laid-Open No. 2002-216129). The data for the individual times are linked in the time direction, so that, for example, the data at time t = 1 or time t = 3 can be reached from the data at time t = 2.

The stream data storage unit 111 stores image stream data. An image stream is the image of a particular target that continues for a certain period of time. Image stream data indicates, for such an image, the direction (position) of the target at each time. Specifically, the stream data contains a start time, an end time, an identifier of the target (for example, a person's name), and data on the direction of the target at each time. The direction of the target is expressed, for example, as angles in the horizontal and vertical planes. The stream data also contains links to the image data (video information), so that, for example, the image data for a given time can be reached from the stream data at that time.

The stream data storage unit 111 thus holds both sound stream data and image stream data. When the direction of the sound source in an item of sound stream data and the direction of the target in an item of image stream data coincide for at least a predetermined length of time, the two items may be merged into integrated stream data. With integrated stream data, for example, the sound data of a sound source (what a speaker says) and the image data (the speaker's facial expression) can be associated with each other and reproduced together.
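The integration rule can be sketched as follows. The sketch assumes that directions are horizontal angles in degrees sampled at integer time steps, and that two directions "coincide" when they differ by no more than a tolerance; the tolerance, the function names, and the dictionary layout of the result are assumptions, since the source does not specify them.

```python
def coinciding_duration(sound_dirs, image_dirs, tolerance_deg=10.0):
    """Longest run of consecutive time steps in which both directions agree.

    Each argument maps a time step to an azimuth in degrees.
    """
    longest = current = 0
    previous_time = None
    for time in sorted(set(sound_dirs) & set(image_dirs)):
        difference = abs(sound_dirs[time] - image_dirs[time]) % 360.0
        difference = min(difference, 360.0 - difference)  # wrap-around at 0/360
        consecutive = previous_time is not None and time == previous_time + 1
        if difference <= tolerance_deg:
            # Extend the run only if the previous time step also matched.
            current = current + 1 if consecutive and current else 1
            longest = max(longest, current)
        else:
            current = 0
        previous_time = time
    return longest


def integrate(sound_dirs, image_dirs, min_duration=3, tolerance_deg=10.0):
    """Return an integrated stream if the two streams coincide long enough, else None."""
    if coinciding_duration(sound_dirs, image_dirs, tolerance_deg) < min_duration:
        return None
    return {"type": "integrated", "sound": sound_dirs, "image": image_dirs}
```

With `min_duration=3`, a speaker whose voice and face are found within 10 degrees of each other for three consecutive time steps would be merged into one integrated stream.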

The data in the stream list storage unit 113 describes lists of stream data. A stream data list is used for the overview display and contains, for each of a thinned-out set of times, the total number of streams, the direction of the sound source of each stream, the type of each stream (sound stream, image stream, or integrated stream), and a link to each item of stream data. For example, the stream data for a given time can be reached from the stream data list at that time. As shown in FIGS. 2 and 3, each stream can be displayed on the display/input unit 117, based on the data in the stream list storage unit 113, either as the identifier (person's name) of the sound source or target plotted against time or as the direction (angle) of the sound source or target plotted against time. In FIGS. 2 and 3, three items of stream data are drawn with a solid line, a dotted line, and a dash-dot line.
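A stream data list of this kind could be built as in the sketch below. The dictionary keys ("id", "type", "start", "end", "directions", "link") and the `step` parameter that controls the thinning are hypothetical names chosen for illustration.

```python
def build_stream_list(streams, step):
    """Summarize the streams at every `step`-th time for the overview display.

    `streams` is a list of dicts with the keys "id", "type", "start", "end"
    and "directions" (a mapping from time to azimuth in degrees).
    """
    if not streams:
        return []
    first = min(s["start"] for s in streams)
    last = max(s["end"] for s in streams)
    entries = []
    for time in range(first, last + 1, step):
        active = [s for s in streams
                  if s["start"] <= time <= s["end"] and time in s["directions"]]
        entries.append({
            "time": time,
            "total_streams": len(active),
            "streams": [
                # "link" refers back to the full stream data for zooming in.
                {"direction": s["directions"][time], "type": s["type"], "link": s}
                for s in active
            ],
        })
    return entries
```

Because each entry keeps a link to the underlying stream, a display built on this list can move from the overview to the detailed stream data without searching.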

FIG. 4 is a diagram conceptually showing the interface functions of the sound data recording/reproducing apparatus according to this embodiment. So that sound data can be reproduced in an easily understood way, the interface provides functions for an overview display (Overview first), zooming (Zoom), filtering (Filter), and detailed display on request (Details on demand).

FIG. 4(a) shows the content of the overview display. The vertical axis represents the identifier of the sound source (person's name), and the horizontal axis represents time. This display can be produced from the data in the stream list storage unit 113. The overview display of FIG. 4(a) lets the user survey the sound data over time.

FIG. 4(b) shows the zoomed content. Specifically, it shows the sound sources and their directions at the time that has been zoomed in on. These data are contained in the stream data in the stream data storage unit 111. By specifying a particular time on the horizontal axis of the overview display of FIG. 4(a), the user can easily retrieve (zoom in on) the data for that time. Because the stream data list contains a link to each item of stream data, this retrieval is straightforward.

FIG. 4(c) shows the content of the filtering process. Specifically, it shows a BGM (background music) sound source being removed by specifying the range of sound sources to keep.
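A filter of this kind amounts to keeping only the sources whose direction falls inside a selected angular range. The sketch below assumes directions are horizontal angles in degrees and that the range runs counter-clockwise from `low` to `high`, possibly crossing 0 degrees; the function names and the example source names are hypothetical.

```python
def in_range(azimuth, low, high):
    """True if `azimuth` lies on the arc from `low` to `high` (degrees, counter-clockwise)."""
    azimuth, low, high = azimuth % 360.0, low % 360.0, high % 360.0
    if low <= high:
        return low <= azimuth <= high
    return azimuth >= low or azimuth <= high  # the arc crosses 0 degrees


def filter_sources(sources, low, high):
    """Keep only the sources whose direction falls inside the selected range."""
    return {name: az for name, az in sources.items() if in_range(az, low, high)}
```

For instance, with two speakers at 20 and 340 degrees and background music at 180 degrees, selecting the range from 300 to 60 degrees keeps both speakers and drops the music.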

FIG. 4(d) shows the content of the detailed display. For example, the user can pick one of the sound sources (speakers) shown for a given time in FIG. 4(b) and reproduce only that source's sound data (what that speaker said).

FIG. 5 is a flowchart showing a sound data recording/reproducing method according to an embodiment of the present invention.

In step S010 of FIG. 5, the sound data acquisition unit 101 acquires sound data and the image data acquisition unit 107 acquires image data.

In step S020 of FIG. 5, the sound source localization unit 103 identifies the direction in which each sound source exists.

In step S030 of FIG. 5, the sound source separation unit 105 separates the sound of each sound source.

In step S040 of FIG. 5, the sound data and the image data are stored in the time-series data storage unit 109. The time-series data storage unit may also store the speech text information generated by the speech recognition unit 121.

In step S050 of FIG. 5, the data processing unit 115 creates stream data from the sound data and image data stored in the time-series data storage unit 109 and stores it in the stream data storage unit 111.

In step S060 of FIG. 5, the data processing unit 115 creates a list of stream data from the stream data stored in the stream data storage unit 111 and stores it in the stream list storage unit 113.

In step S070 of FIG. 5, the display/input unit 117 displays the stream data together with a time axis (FIG. 4(a)).

In step S080 of FIG. 5, the display/input unit 117 displays the sound sources and their directions at the time the user has selected on the displayed time axis (FIG. 4(b)).

In step S090 of FIG. 5, the sound data reproduction unit 119 reproduces the sound data of the sound source selected by the user (FIG. 4(d)).

Instead of steps S070, S080, and S090, the display/input unit 117 may be configured to display the sound sources and their directions as they change over time, following the stream data, on a screen such as that shown in FIG. 4(b), either in real time or at a speed equal to, faster than, or slower than real time.

FIG. 6 shows an example of the display screen of the display/input unit 117. The display screen consists of, for example, an operation panel section, a sound source direction display section, and a stream display section. The user enters instructions to the sound data recording/reproducing apparatus through the operation panel section. The display unit and the input unit may be combined in a single screen of a single housing, as shown in FIG. 6. Alternatively, they may be implemented as two screens in one housing, or as two housings that may or may not each use a screen.

The display screen may further include a text information display section (not shown) that displays the speech text information generated by the speech recognition unit 121. When the sound data reproduction unit 119 reproduces the sound data (speech) of the sound source selected by the user (FIG. 4(d)), the display/input unit 117 displays the text information for that speech in the text information display section. In the text information display section, text with a high recognition likelihood may be emphasized by displaying it in a dark color, and text with a low recognition likelihood may be displayed in a light color. Displaying the text information allows, for example, a hearing-impaired person to easily reproduce and use desired sound data (speech) from the recorded sound data.
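The likelihood-dependent shading could be realized as in the following sketch. It assumes that the recognizer reports a likelihood between 0 and 1 for each word and that text is drawn in a gray level given as a hex color; both assumptions, the function names, and the lightest gray level of 200 are illustrative choices, not details from the source.

```python
def shade_for(likelihood):
    """Map a recognition likelihood in [0, 1] to a gray level (0 = black)."""
    clamped = min(1.0, max(0.0, likelihood))
    # High likelihood gives dark text; low likelihood gives light text.
    # The lightest level is capped at 200 so the text stays visible on white.
    return round((1.0 - clamped) * 200)


def render(words):
    """Return (word, '#rrggbb') pairs for a list of (word, likelihood) results."""
    colored = []
    for word, likelihood in words:
        level = shade_for(likelihood)
        colored.append((word, "#{0:02x}{0:02x}{0:02x}".format(level)))
    return colored
```

A display component can then draw each word in its returned color, so that reliably recognized words stand out and doubtful ones recede.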

FIG. 7 shows the stream display section of the display screen in detail. The stream display section displays stream data based on the data in the stream list storage unit 113 or the stream data storage unit 111. The horizontal axis represents time, and the vertical axis represents the angle in the horizontal plane. The content of the stream display section corresponds to the display of FIG. 4(a). FIG. 7 shows the stream data of stream A, drawn with a solid line, and stream B, drawn with a dotted line. The user can specify the playback time by specifying a time on the time axis (horizontal axis). The waveform data of the recorded sound may also be displayed in the stream display section, based on the data in the time-series data storage unit 109.

FIG. 8 shows the sound source direction display section of the display screen in detail. The sound source direction display section displays the sound sources and their directions at the specified playback time, based on the data in the stream data storage unit 111. The display shows, for example, the horizontal plane, with the direction of each sound source drawn relative to the microphone position at the center. In FIG. 8, the line labeled A indicates the direction of the sound source of stream A, and the line labeled B indicates the direction of the sound source of stream B. This display corresponds to the display of FIG. 4(b). On the sound source direction display section, the angular range may be limited by an operation such as clicking, so that only the sound data of the sound sources within that range is reproduced. This operation corresponds to the operation of FIG. 4(c). Furthermore, clicking the line labeled A or the line labeled B in FIG. 8 may designate a sound source so that only the sound data of that source is reproduced. This operation corresponds to the operation of FIG. 4(d).
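Selecting a source by clicking near its line can be sketched as a nearest-direction lookup. The sketch assumes the click has already been converted to a horizontal angle in degrees; the function names and the 10-degree pick distance are hypothetical.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees."""
    difference = abs(a - b) % 360.0
    return min(difference, 360.0 - difference)


def pick_source(sources, clicked_azimuth, max_distance=10.0):
    """Return the name of the source nearest to the click, or None if none is close enough.

    `sources` maps a source name to its azimuth in degrees at the playback time.
    """
    if not sources:
        return None
    name = min(sources, key=lambda n: angular_distance(sources[n], clicked_azimuth))
    if angular_distance(sources[name], clicked_azimuth) > max_distance:
        return None
    return name
```

The returned name can then be used to look up the separated sound data of that source for playback.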

Features of the embodiments of the present invention are described below.

According to an embodiment of the present invention, the stream data includes data relating to a link to the time-series sound data.

According to this embodiment, the link to the time-series sound data makes it easy to reproduce the sound data at a selected predetermined time.
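One way to picture such a link is as an index into the stored time-series samples. The following is a minimal sketch under that assumption. The names `StreamEntry`, `sample_link`, `time_series` and `sound_from`, the sample rate, and the data values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamEntry:
    time_s: float     # time of this entry
    angle_deg: float  # direction of the sound source at that time
    sample_link: int  # hypothetical link: index of the matching sample


# Hypothetical time-series data store holding separated samples per source.
SAMPLE_RATE = 4  # samples per second, deliberately tiny for illustration
time_series = {"A": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]}

# Hypothetical stream data for stream A: one entry per second.
stream_a = [
    StreamEntry(0.0, 30.0, 0),
    StreamEntry(1.0, 35.0, 4),
]


def sound_from(stream_id, entry, duration_s):
    """Follow the link of a stream entry and return the samples to reproduce."""
    start = entry.sample_link
    stop = start + int(duration_s * SAMPLE_RATE)
    return time_series[stream_id][start:stop]


# Reproduce half a second of stream A starting at the entry for t = 1.0 s.
print(sound_from("A", stream_a[1], 0.5))  # [0.4, 0.5]
```

Because each entry carries its own link, no search through the time-series data is needed once an entry has been selected.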

According to an embodiment of the present invention, a list of stream data is stored in the stream list storage unit.

According to this embodiment, it becomes easy to display the sound data of each sound source so that it can be overviewed along the time axis.

According to an embodiment of the present invention, a time axis is displayed together with the stream data, and when a point on the displayed time axis is selected, the corresponding stream data is retrieved from the stream data storage unit, and the sound sources and their directions at the selected predetermined time are displayed based on the corresponding stream data.

According to this embodiment, the user can use the stream data to overview the recorded sound data along the time axis. By selecting a point on the time axis displayed together with the stream data, the user can easily retrieve the data on the sound sources and their directions at the selected predetermined time.
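The retrieval triggered by selecting a point on the time axis could be sketched as a lookup through a list that associates times with links to stream data. This is only a sketch under stated assumptions: the names `frame_times`, `stream_list`, `stream_data` and `sources_at`, the dictionary layout, and the rule of using the latest stored time at or before the selected time are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical stream list: sorted times, each associated with links (here,
# stream ids) to the stream data of the sound sources present at that time.
frame_times = [0.0, 0.5, 1.0, 1.5]
stream_list = {
    0.0: ["A"],
    0.5: ["A", "B"],
    1.0: ["A", "B"],
    1.5: ["B"],
}

# Hypothetical stream data store: (stream id, time) -> direction in degrees.
stream_data = {
    ("A", 0.0): 30.0, ("A", 0.5): 32.0, ("A", 1.0): 35.0,
    ("B", 0.5): 300.0, ("B", 1.0): 298.0, ("B", 1.5): 295.0,
}


def sources_at(selected_time):
    """Return {stream id: direction} for the latest stored time that is
    at or before the selected time, or an empty dict if there is none."""
    i = bisect_right(frame_times, selected_time) - 1
    if i < 0:
        return {}
    t = frame_times[i]
    return {sid: stream_data[(sid, t)] for sid in stream_list[t]}


print(sources_at(0.7))  # {'A': 32.0, 'B': 300.0}
```

The returned mapping contains what a direction display needs for the selected time, namely each sound source and its direction.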

According to an embodiment of the present invention, when a predetermined sound source is selected from the sound sources at a predetermined time displayed on the display unit, the sound data of that sound source is retrieved from the time-series data storage unit, and the corresponding sound data is reproduced.

According to this embodiment, the user can easily retrieve the sound data of a predetermined sound source by selecting that sound source from the sound sources displayed for the predetermined time.

FIG. 1 is a diagram showing the configuration of a sound data recording / reproducing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the data structure relating to sound data in the time-series data storage unit, the stream data storage unit and the stream list storage unit.
FIG. 3 is a diagram showing the data structure relating to image data in the time-series data storage unit, the stream data storage unit and the stream list storage unit.
FIG. 4 is a diagram conceptually showing the interface function of the sound data recording / reproducing apparatus according to the present embodiment.
FIG. 5 is a flowchart showing a sound data recording / reproducing method according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of the display screen of the display / input unit.
FIG. 7 is a diagram showing details of the stream display unit of the display screen.
FIG. 8 is a diagram showing details of the sound source direction display unit of the display screen.

Explanation of symbols

101 ... sound data acquisition unit, 103 ... sound source localization unit, 105 ... sound source separation unit, 109 ... time-series data storage unit, 111 ... stream data storage unit, 113 ... stream list storage unit, 115 ... data processing unit

Claims

A sound data acquisition unit for acquiring sound data;
A sound source localization unit that identifies the direction in which the sound source exists,
A sound source separation unit for separating sound data for each sound source;
A time series data storage unit for storing time series sound data for each sound source;
A stream data storage unit for storing stream data relating to sound indicating the direction of a predetermined sound source at a predetermined time;
A data processing unit connected to the time-series data storage unit and the stream data storage unit for processing data;
A stream list storage unit for storing a list associating a time with a link to the stream data regarding the sound source at the time;
A sound data playback unit for playing back sound data;
A display unit for displaying stream data;
With
The display unit displays on the time axis a first display indicating a direction of a predetermined sound source at a predetermined time included in the stream data;
In response to selection of one of the first displays displayed on the time axis by the display unit, the sound data reproduction unit reproduces sound data of the predetermined sound source at the predetermined time corresponding to the selected first display,
A sound data recording / reproducing device,
In response to the selection of one time on the time axis,
The data processing unit retrieves the stream data at the selected time from the stream data storage unit,
The display unit displays a second display indicating a sound source and its direction that existed at the selected time based on the extracted corresponding stream data,
A microphone is displayed at the center of the second display, and the direction of the sound source that existed at the selected time is shown as the direction relative to the position of the displayed microphone.
By selecting one of the sound sources displayed by the second display, one of the first displays is selected.
Sound data recording / reproducing device.

The stream data includes data relating to a link to the time-series sound data,
In response to selection of one of the sound sources displayed in the second display, the data processing unit refers via the link to the stream data of the selected sound source and, when the direction of the selected sound source is within a predetermined angle range, extracts the sound data of the selected sound source from the time-series data storage unit,
The sound data reproduction unit reproduces the extracted sound data of the predetermined sound source at the predetermined time;
The sound data recording / reproducing apparatus according to claim 1.

The sound data recording / reproducing apparatus according to claim 1, further comprising a speech recognition unit that recognizes speech from the sound data and generates speech text information, and wherein the display unit further displays the speech text information.

The sound data recording / reproducing apparatus according to any one of claims 1 to 3, further comprising an image data acquisition unit for acquiring image data, wherein the time-series data storage unit further stores time-series image data, and the data stored in the stream data storage unit further includes stream data relating to an image indicating the direction of a predetermined target at a predetermined time.

A method of reproducing sound data by a sound data recording / reproducing apparatus,
Get sound data,
Identify the direction in which the sound source exists,
Separate sound data for each sound source,
Store time-series sound data for each sound source in the time-series data storage,
Create stream data relating to sound indicating the direction of a predetermined sound source at a predetermined time, and store it in a stream data storage unit,
A list that associates the time and a link to the stream data about the sound source at the time is stored in the stream list storage unit,
Displaying on the time axis a first display indicating a direction of a predetermined sound source at a predetermined time included in the stream data;
In response to selection of one of the first displays displayed on the time axis, sound data of a predetermined sound source at a predetermined time corresponding to the selected first display is reproduced.
A method for recording and reproducing sound data,
In response to the selection of one time on the time axis,
The stream data at the selected time is extracted from the stream data storage unit,
Based on the retrieved corresponding stream data, a second display showing the sound source and its direction that existed at the selected time is displayed,
A microphone is displayed at the center of the second display, and the direction of the sound source that existed at the selected time is shown as the direction relative to the position of the displayed microphone.
By selecting one of the sound sources displayed by the second display, one of the first displays is selected.
Sound data recording and playback method.

The stream data includes data relating to a link to the time-series sound data,
In response to selection of one of the sound sources displayed in the second display, the stream data of the selected sound source is referred to via the link, and when the direction of the selected sound source is within a predetermined angular range, the sound data of the selected sound source is extracted from the time-series data storage unit,
The extracted sound data of the predetermined sound source at the predetermined time is reproduced;
The sound data recording / reproducing method according to claim 5.

The sound data recording / reproducing method according to claim 5 or 6, wherein speech is recognized from the sound data, speech text information is generated, and the speech text information is displayed.

The sound data recording / reproducing method according to any one of claims 5 to 7, further comprising acquiring image data and storing time-series image data in the time-series data storage unit, wherein the data stored in the stream data storage unit further includes stream data relating to an image indicating the direction of a predetermined target at a predetermined time.