JP5279028B2

JP5279028B2 - Audio processing apparatus, audio processing method, and program

Info

Publication number: JP5279028B2
Application number: JP2009127059A
Authority: JP
Inventors: 裕二山本
Original assignee: NEC Casio Mobile Communications Ltd
Current assignee: NEC Casio Mobile Communications Ltd
Priority date: 2009-05-26
Filing date: 2009-05-26
Publication date: 2013-09-04
Anticipated expiration: 2029-05-26
Also published as: JP2010276728A

Description

The present invention relates to an audio processing apparatus that has a plurality of microphones (an array microphone), extracts and records the voice of a subject to be recorded (the target sound), and reproduces the recorded target sound; to an audio processing method; and to a program that causes a computer to function as the audio processing apparatus.

In the fields of audio recording and video shooting, there are techniques in which multiple sounds are input from individual microphones, the sound input to each microphone is recorded together with time information, and the multiple recorded sounds are used at playback to improve usability for the device user. As one example, a technique has been disclosed that provides a first audio input unit for recording the subject's voice and a second audio input unit for recording the photographer's voice, and records the photographer's speech and the subject's voice separately. An audio analysis unit converts the photographer's speech into text and manages it as metadata together with time information, and at playback the photographer's speech corresponding to the video is displayed as a caption (telop) alongside the video (see Patent Document 1). Patent Document 1 also discloses a technique by which the device user freely selects the subject's voice or the photographer's speech for playback.

Regarding the recording of a subject's voice, many techniques exist that, when recording a distant target sound or a specific target sound in a meeting, improve the clarity of the target sound by processing multiple input sounds so as to suppress ambient noise other than the target sound. For example, Patent Document 2 discloses a sound source separation technique that uses an array microphone, determines the direction of each sound source from the phase differences with which the same source sound arrives at each microphone, and extracts only the target sound coming from a specific direction, such as the subject's voice. These techniques make it possible to extract from ambient noise only the sound of the specific source that the device user expects, such as the subject's voice.
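
The relationship between inter-microphone delay and source direction can be illustrated with a small sketch. It is not taken from Patent Document 2, whose details are not given here; it assumes a far-field source, a single pair of microphones, and a speed of sound of 343 m/s, and the function name `arrival_angle_deg` is hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly room temperature


def arrival_angle_deg(delay_s, mic_spacing_m, c=SPEED_OF_SOUND_M_S):
    """Far-field arrival angle, measured from the broadside of a microphone pair.

    delay_s is the time by which the sound reaches the second microphone
    later than the first; a phase difference at a known frequency can be
    converted to such a delay beforehand.
    """
    ratio = c * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push asin outside its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    # Zero delay means the source lies straight ahead of the pair.
    print(arrival_angle_deg(0.0, 0.05))
```

A delay of zero maps to broadside, and the largest physically possible delay (spacing divided by the speed of sound) maps to 90 degrees, that is, a source on the line through both microphones.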

Patent Document 1: JP 2007-104405 A
Patent Document 2: JP 2006-227328 A

However, with the configurations disclosed in Patent Documents 1 and 2, using the recorded sound at playback requires that ambient noise and other sounds captured at the same time through the audio input unit be removed as noise, so that the clarity of the recorded source sound is preserved. In addition, constantly recording multiple input sounds during shooting requires the codecs for those sounds to run constantly. Furthermore, when no sound source separation technique is used to extract the source sound and its clarity is not ensured, conversion to text may become difficult.

The present invention has been made in view of the above problems. It extracts, with high clarity, a target sound such as a subject's voice emitted from a specific direction, together with the photographer's voice, from the sounds input through a plurality of audio input units. A further object is to provide an audio processing apparatus and the like that limits the recording period of the photographer's voice by a recording period determination means, records chapter information (scene break information) that includes the photographer's voice in the determined recording period and its time information, and uses that chapter information at playback.

To achieve the above object, an audio processing apparatus according to a first aspect of the present invention comprises:
an audio acquisition unit that acquires a plurality of sounds;
an imaging unit that captures a moving image together with the plurality of sounds;
an audio extraction unit that extracts, from the plurality of sounds acquired by the audio acquisition unit, the voice of a subject and the voice of a photographer emitted from a predetermined direction;
a determination unit that determines breaks in each of the subject's voice and the photographer's voice based on the volume of the subject's voice and the photographer's voice extracted by the audio extraction unit;
a generation unit that generates, based on the breaks determined by the determination unit, information indicating a break associated with each of the subject's voice and the photographer's voice;
a display unit that displays the information indicating the break generated by the generation unit;
a timekeeping unit that measures the time at which the plurality of sounds are acquired; and
a character conversion unit that converts the subject's voice and the photographer's voice into character strings corresponding to those voices,
wherein the generation unit generates, as the information indicating the break, a predetermined frame of the moving image associated with the break, and
the display unit displays the time measured by the timekeeping unit in association with the information indicating the break, and displays the character strings in association with the information indicating the break.

The apparatus may further comprise a movement detection unit that detects the amount of movement of the audio processing apparatus.
The determination unit may determine the breaks in each of the subject's voice and the photographer's voice when the amount of movement detected by the movement detection unit is equal to or greater than a predetermined threshold.

To achieve the above object, an audio processing method according to another aspect of the present invention is executed by an audio processing apparatus having an audio acquisition unit, an imaging unit, an audio extraction unit, a determination unit, a generation unit, a display unit, a timekeeping unit, and a character conversion unit, and comprises:
an audio acquisition step in which the audio acquisition unit acquires a plurality of sounds;
an imaging step in which the imaging unit captures a moving image together with the plurality of sounds;
an audio extraction step in which the audio extraction unit extracts, from the plurality of sounds acquired in the audio acquisition step, the voice of a subject and the voice of a photographer emitted from a predetermined direction;
a determination step in which the determination unit determines breaks in each of the subject's voice and the photographer's voice based on the volume of the subject's voice and the photographer's voice extracted in the audio extraction step;
a generation step in which the generation unit generates, based on the breaks determined in the determination step, information indicating a break associated with each of the subject's voice and the photographer's voice;
a display step in which the display unit displays the information indicating the break generated in the generation step;
a timekeeping step in which the timekeeping unit measures the time at which the plurality of sounds are acquired; and
a character conversion step in which the character conversion unit converts the subject's voice and the photographer's voice into character strings corresponding to those voices,
wherein, in the generation step, a predetermined frame of the moving image associated with the break is generated as the information indicating the break, and
in the display step, the time measured in the timekeeping step is displayed in association with the information indicating the break, and the character strings are displayed in association with the information indicating the break.

To achieve the above object, a program according to another aspect of the present invention causes a computer to function as:
an audio acquisition unit that acquires a plurality of sounds;
an imaging unit that captures a moving image together with the plurality of sounds;
an audio extraction unit that extracts, from the plurality of sounds acquired by the audio acquisition unit, the voice of a subject and the voice of a photographer emitted from a predetermined direction;
a determination unit that determines breaks in each of the subject's voice and the photographer's voice based on the volume of the subject's voice and the photographer's voice extracted by the audio extraction unit;
a generation unit that generates, based on the breaks determined by the determination unit, information indicating a break associated with each of the subject's voice and the photographer's voice;
a display unit that displays the information indicating the break generated by the generation unit;
a timekeeping unit that measures the time at which the plurality of sounds are acquired; and
a character conversion unit that converts the subject's voice and the photographer's voice into character strings corresponding to those voices,
wherein the generation unit generates, as the information indicating the break, a predetermined frame of the moving image associated with the break, and
the display unit displays the time measured by the timekeeping unit in association with the information indicating the break, and displays the character strings in association with the information indicating the break.

According to the present invention, chapters that mark the breaks between recorded scenes can be created, allowing the user to search for scenes conveniently. The power consumption of the audio processing apparatus can also be reduced.

FIG. 1 is a diagram showing the block configuration of the moving image shooting apparatus in Embodiment 1.
FIG. 2 is a diagram showing an example of a block configuration that includes a level measurement unit.
FIG. 3 is a diagram showing an example of a block configuration that includes a device movement detection unit.
FIG. 4 is a diagram showing an example of the chapter information in Embodiment 1.
FIG. 5 is a diagram showing an example of the chapter information table.
FIG. 6 is a diagram showing an example list of the scenes shot in a single shooting session.
FIG. 7 is a diagram showing an example of the menu screen for playing back audio in Embodiment 1.
FIG. 8 is a diagram showing the block configuration of the moving image shooting apparatus in Embodiment 2.
FIG. 9 is a diagram showing an example of the chapter information in Embodiment 2.
FIG. 10 is a diagram showing an example of the menu screen for playing back audio in Embodiment 2.

(Embodiment 1)
The present invention is applied to a device that has an audio recording and playback function and carries a plurality of microphones (an array microphone) as its audio input unit. Embodiment 1 is described with reference to FIGS. 1 to 7. FIG. 1 is a block diagram of a terminal device in which the audio processing apparatus according to the present invention is applied to a moving image shooting apparatus 100.

In the following embodiments, because the invention is described as the moving image shooting apparatus 100, the term "chapter" is used to denote a break in the video or a group of images.

Note that the present invention is not limited to moving image shooting and is also applicable, for example, to a micro recorder that has an audio-only recording and playback function. It is likewise applicable to mobile phones, digital still cameras, and other devices with equivalent functions.

When the invention is used in an audio recording and playback device such as a micro recorder, no video-related information exists, so a chapter can be used by treating it as scene break information instead.

The moving image shooting apparatus 100 comprises a video/audio processing unit 20 that handles the overall processing of video and audio, a control unit 21 that controls each unit, an operation unit 22 that accepts operations from the user, and a timer unit 23 that measures fixed time intervals using a periodic clock.

The video/audio processing unit 20 comprises an array microphone 1, an ADC 2, an audio extraction unit 3, a recording period determination unit 4, a camera 5, an image processing unit 6, a codec unit 7, a chapter information generation unit 8, a recording unit 9, a chapter information reading unit 10, an OSD creation unit 11, an OSD composition unit 12, a display unit 13, a DAC 14, and a speaker 15.

The array microphone 1 accepts the input of a plurality of sounds. It is made up of a plurality of individual, interconnected microphones arranged in an array. The microphones are arranged, for example, in one, two, or three dimensions.

The ADC 2 converts the analog audio signals input from the array microphone 1 into digital audio signals.

The audio extraction unit 3 individually extracts, from the plurality of sounds input from the array microphone 1, a plurality of sounds emitted from specific directions (for example, the subject's voice and the photographer's voice).
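
The text does not say which algorithm the audio extraction unit 3 uses. As one commonly used possibility, the sketch below shows a delay-and-sum beamformer, which is a swapped-in illustration rather than the patent's method. It assumes whole-sample delays that are already known for the direction of interest; the name `delay_and_sum` and the sample data are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Steer a microphone array toward one direction.

    channels: one list of samples per microphone.
    delays:   for each microphone, the number of samples by which sound
              from the chosen direction arrives late at that microphone.
    Each channel is advanced by its delay so that sound from the chosen
    direction lines up, then the channels are averaged. Sound from other
    directions stays misaligned and is partly cancelled.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]


if __name__ == "__main__":
    source = [0, 1, 0, -1, 0, 1, 0, -1]
    mic0 = source               # reaches microphone 0 first
    mic1 = [0] + source[:-1]    # reaches microphone 1 one sample later
    print(delay_and_sum([mic0, mic1], [0, 1]))  # steered at the source
    print(delay_and_sum([mic0, mic1], [1, 0]))  # steered elsewhere
```

Running the same input through two sets of steering delays, one for the subject's direction and one for the photographer's, would yield two separate output signals, which is the role the text assigns to the audio extraction unit 3.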

The recording period determination unit 4 determines the recording period of the photographer's voice. It may determine the recording period by any method.

The camera 5 shoots video. It can capture any kind of image, such as a moving image or a still image.

The image processing unit 6 applies signal processing such as image quality adjustment and resizing to the output signal from the camera 5.

During shooting, the codec unit 7 compresses the audio signal output from the audio extraction unit 3 and the video signal output from the image processing unit 6.

In this embodiment, the codec unit 7 itself has the function of switching between the signals used during shooting (the video signal output from the image processing unit 6 and the audio signal output from the audio extraction unit 3) and the signals used during playback (the audio and video signals decompressed within the codec unit 7). This switching function may instead be provided outside the codec unit 7.

The chapter information generation unit 8 generates chapter information indicating breaks in the video.

The recording unit 9 records the video/audio data compressed by the codec unit 7 and the chapter information created by the chapter information generation unit 8.

The chapter information reading unit 10 reads the chapter information recorded in the recording unit 9 at video playback time.

The OSD creation unit 11 creates an OSD (On Screen Display), such as a menu or an information display, based on a user operation accepted by the operation unit 22, on timing information from the timer unit 23, or on chapter information read by the chapter information reading unit 10.

The OSD composition unit 12 combines the OSD created by the OSD creation unit 11 with the video signal output from the codec unit 7.

The display unit 13 displays the video signal output from the OSD composition unit 12.

The DAC 14 converts the digital audio signal output from the codec unit 7 into an analog audio signal.

The speaker 15 outputs the analog audio signal converted by the DAC 14.

The flow of processing during shooting is described with reference to the block diagram of FIG. 1.

The plurality of audio signals captured by the array microphone 1 are digitized by the ADC 2, after which the audio extraction unit 3 extracts the subject's voice and the photographer's voice, which are input to the codec unit 7.

The video signal captured by the camera 5 undergoes signal processing such as image quality adjustment and resizing in the image processing unit 6 and is then input to the codec unit 7.

The subject's voice output from the audio extraction unit 3 is the sound that is output from the speaker 15 at playback. During shooting and recording, it is compressed by the codec unit 7 together with the video signal output from the image processing unit 6 and recorded in the recording unit 9.

The photographer's voice output from the audio extraction unit 3 is intended for use as chapter information. During shooting and recording, only the photographer's voice at the same playback position as a chapter is output to the codec unit 7 and recorded in the recording unit 9.

Here, chapter information marks a break in the moving image (a scene break) and is used so that, at playback, the user can easily jump to the intended scene.

The recording period of the photographer's voice is determined by the recording period determination unit 4. The control unit 21 controls the audio extraction unit 3 and the codec unit 7 according to the determined period, so that only the photographer's voice within the recording period is recorded in the recording unit 9.
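
The gating described above can be sketched as a filter over timestamped audio frames. This is only an illustration under assumed data shapes: frames are represented by their start times in seconds, recording periods by half-open `(start, end)` pairs, and the name `select_frames_in_periods` is hypothetical.

```python
def select_frames_in_periods(frame_times_s, periods):
    """Keep only the frames whose start time lies inside a recording period.

    frame_times_s: start time of each photographer-voice frame, in seconds.
    periods:       (start_s, end_s) pairs, with the end excluded.
    Frames outside every period are dropped before they reach the codec,
    so the codec has nothing to compress for them.
    """
    return [
        t for t in frame_times_s
        if any(start <= t < end for start, end in periods)
    ]


if __name__ == "__main__":
    frames = [0, 1, 2, 3, 4, 5]
    print(select_frames_in_periods(frames, [(1, 3)]))
```

Dropping frames before compression is what lets the photographer-voice codec path stay idle outside the recording period.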

Next, the methods by which the recording period determination unit 4 determines the recording period are described with reference to FIG. 1.

As a first example, the recording period determination unit 4 determines the recording period in step with the shooting start operation from the photographer (user) accepted by the operation unit 22. Based on that determination, the chapter information generation unit 8 takes in the photographer's voice for a fixed period from the start of shooting and generates chapter information. The fixed period is measured by the timer unit 23.

As a second example, the recording period determination unit 4 determines the recording period for capturing the photographer's voice from the photographer's (user's) operation of a chapter button provided on the operation unit 22. The chapter information generation unit 8 generates chapter information based on that determination. How the photographer's operation sets the recording period may be defined in any way: for example, by a toggle button, by ON/OFF switching, by the time during which the button is held down, or by a fixed period after the button is pressed. Both the first and second examples can be realized through the photographer's button operations without adding any new processing unit to the block diagram of FIG. 1.
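
The button-driven variants listed above can be compared in a short sketch. The event format, the mode names `"hold"`, `"toggle"`, and `"fixed"`, the default window of 5 seconds, and the function name `periods_from_button` are all assumptions made for illustration; the text leaves these details open.

```python
def periods_from_button(events, mode, fixed_s=5.0):
    """Turn chapter-button events into recording periods.

    events: time-ordered (time_s, kind) pairs, kind being "down" or "up".
    mode:   "hold"   records while the button is held down,
            "toggle" starts on one press and stops on the next press,
            "fixed"  records for fixed_s seconds after each press.
    Returns a list of (start_s, end_s) pairs.
    """
    if mode not in ("hold", "toggle", "fixed"):
        raise ValueError(f"unknown mode: {mode}")
    periods = []
    start = None  # start time of the period currently open, if any
    for t, kind in events:
        if mode == "fixed":
            if kind == "down":
                periods.append((t, t + fixed_s))
        elif mode == "hold":
            if kind == "down" and start is None:
                start = t
            elif kind == "up" and start is not None:
                periods.append((start, t))
                start = None
        elif kind == "down":  # toggle mode reacts to presses only
            if start is None:
                start = t
            else:
                periods.append((start, t))
                start = None
    return periods


if __name__ == "__main__":
    presses = [(1.0, "down"), (2.0, "up"), (6.0, "down"), (7.5, "up")]
    for mode in ("hold", "toggle", "fixed"):
        print(mode, periods_from_button(presses, mode))
```

The first example in the text (a fixed period from the start of shooting) corresponds to the `"fixed"` mode with the shooting start operation taking the place of the chapter button press.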

Next, another method by which the recording period determination unit 4 determines the recording period is described with reference to FIG. 2.

FIG. 2 is a block diagram for determining the recording period based on the measurement result of a level measurement unit. It is the configuration of FIG. 1 with the addition of a level measurement unit 30 that measures the level of the photographer's voice from the audio extraction unit 3. FIG. 2 illustrates a third example of how the recording period determination unit 4 determines the recording period. Below, processing units that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted.

The level measurement unit 30 notifies the recording period determination unit 4 of whether the level of the photographer's voice has reached a preset specific level. The recording period determination unit 4 determines the period during which the specific level is reached to be the recording period. Based on that determination result, the chapter information generation unit 8 generates chapter information that includes the position of the photographer's voice relative to the shot video, and records the chapter information in the recording unit 9. The audio extraction unit 3 outputs to the codec unit 7 the photographer's voice associated with each piece of chapter information generated from that determination. The codec unit 7 compresses the photographer's voice output from the audio extraction unit 3 and records it in the recording unit 9.
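
The level-based determination of the third example can be sketched as a scan over per-frame levels. The frame-based representation, the units, and the name `periods_above_level` are assumptions for illustration; the text only states that the period during which the specific level is reached becomes the recording period.

```python
def periods_above_level(levels_db, threshold_db, frame_s):
    """Find the periods during which the voice level reaches a threshold.

    levels_db:    one measured level per audio frame, in dB.
    threshold_db: the preset specific level.
    frame_s:      duration of one frame, in seconds.
    Returns (start_s, end_s) pairs covering each run of consecutive
    frames whose level is at or above the threshold.
    """
    periods = []
    run_start = None  # index of the first frame of the current run
    for i, level in enumerate(levels_db):
        if level >= threshold_db and run_start is None:
            run_start = i
        elif level < threshold_db and run_start is not None:
            periods.append((run_start * frame_s, i * frame_s))
            run_start = None
    if run_start is not None:  # the last run extends to the end of the data
        periods.append((run_start * frame_s, len(levels_db) * frame_s))
    return periods


if __name__ == "__main__":
    levels = [40, 65, 70, 50, 45, 72, 68]
    print(periods_above_level(levels, threshold_db=60, frame_s=0.5))
```

Each returned pair would correspond to one piece of chapter information, with the start of the pair serving as the position of the photographer's voice relative to the video.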

The specific level may be any audio level: for example, a level at or above a certain volume, a level at or below a certain volume, or a level within the range of 60 dB to 80 dB.
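
The three kinds of specific level mentioned above can each be written as a small test on a measured level. The helper names `at_or_above`, `at_or_below`, and `within_range` are hypothetical, and treating the range bounds as inclusive is an assumption.

```python
def at_or_above(min_db):
    """Condition met when the level is at or above min_db."""
    return lambda level_db: level_db >= min_db


def at_or_below(max_db):
    """Condition met when the level is at or below max_db."""
    return lambda level_db: level_db <= max_db


def within_range(low_db, high_db):
    """Condition met when the level lies in [low_db, high_db] (inclusive)."""
    return lambda level_db: low_db <= level_db <= high_db


if __name__ == "__main__":
    speech_band = within_range(60, 80)
    for level in (55, 70, 85):
        print(level, speech_band(level))
```

Any of these conditions could replace a plain threshold comparison when deciding which frames belong to a recording period.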

Next, another method by which the recording period determination unit 4 determines the recording period is described with reference to FIG. 3. FIG. 3 is a block diagram for determining the recording period based on the detection result of a movement detection unit 40. It is the configuration of FIG. 1 with the addition of the movement detection unit 40, which detects the movement state of the device. FIG. 3 illustrates a fourth example of how the recording period determination unit 4 determines the recording period. Below, processing units that are the same as in FIG. 1 are given the same reference numerals, and their description is omitted.

The movement detection unit 40 includes, for example, a general-purpose movement sensor that can detect the movement of an object and the amount of that movement. It detects when the device starts and stops moving during shooting and notifies the recording period determination unit 4. The recording period determination unit 4 determines a fixed period starting from the movement detection notification sent by the movement detection unit 40 to be the recording period. In accordance with the recording period determined by the recording period determination unit 4, the control unit 21 causes the chapter information generation unit 8 to generate chapter information and instructs the codec unit 7 to compress the photographer's voice. The chapter information generation unit 8 generates chapter information that includes the position of the photographer's voice relative to the shot video, and records it in the recording unit 9. The codec unit 7 records the compressed photographer's voice in the recording unit 9.
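
The fourth example can be sketched as opening a fixed window after each movement notification. Merging windows that overlap is an assumption added here so that the result contains no duplicated time; the text does not say how overlapping notifications are handled. The name `periods_after_movement` is hypothetical.

```python
def periods_after_movement(notification_times_s, window_s):
    """Open a recording period of window_s seconds after each notification.

    notification_times_s: times at which movement start or stop was reported.
    Windows that overlap or touch are merged into a single period
    (an assumption; the source does not specify this case).
    """
    periods = []
    for t in sorted(notification_times_s):
        end = t + window_s
        if periods and t <= periods[-1][1]:
            # Extend the previous period instead of opening a new one.
            periods[-1] = (periods[-1][0], max(periods[-1][1], end))
        else:
            periods.append((t, end))
    return periods


if __name__ == "__main__":
    print(periods_after_movement([10.0, 12.0, 30.0], window_s=5.0))
```

In the example call, the notifications at 10 s and 12 s fall within one window of each other and therefore produce a single period.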

Examples of the movement detection unit 40 include GPS (Global Positioning System) and acceleration sensors, but it is not limited to these. The movement detection unit 40 may be any device that has a function for acquiring position information and can make use of movement information or the like calculated from that position information.

The first to fourth examples described above do not limit the method of determining the recording period of the photographer's voice, and the examples may also be used in combination.
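
One simple way to combine several determination methods is to take the union of the periods each method yields. This is only one possible reading of "in combination"; an intersection or a priority order would be equally consistent with the text. The name `merge_periods` is hypothetical.

```python
def merge_periods(*period_lists):
    """Union of recording periods produced by several determination methods.

    Each argument is a list of (start_s, end_s) pairs. Overlapping or
    touching periods are merged, and the result is sorted by start time.
    """
    all_periods = sorted(p for plist in period_lists for p in plist)
    merged = []
    for start, end in all_periods:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    from_button = [(0.0, 5.0)]
    from_level = [(4.0, 6.0), (10.0, 12.0)]
    print(merge_periods(from_button, from_level))
```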

Next, FIG. 4 shows a concrete example of chapter information, and the items it contains are described.

The chapter information 400 has, for each chapter, at least one each of two items: the photographer voice storage location and the time information. The photographer voice storage location indicates where the photographer's voice recorded in the recording unit 9 is stored. For example, when the photographer stores the photographer's voice at a zoo, the photographer voice storage location is the zoo. The photographer's voice is used as scene-related information for judging the content of a scene. The time information indicates the time at which the photographer's utterance recorded in the recording unit 9 began. For example, if the photographer began speaking at 10:12:12, the time information is 10:12:12. The time information also includes date information and chapter position information. Because it serves to indicate the playback position within the shot footage, it may also be expressed as a frame number.
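
The two items of chapter information can be pictured as a small record type. The class name `ChapterInfo`, its field names, and the sample date are hypothetical; only the two items themselves and the optional frame number come from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ChapterInfo:
    """One chapter: where the photographer's voice is stored and when it began."""

    voice_location: str                 # photographer voice storage location
    start_time: datetime                # date and time the utterance began
    frame_number: Optional[int] = None  # alternative playback position


if __name__ == "__main__":
    # The date is a placeholder; the 10:12:12 time follows the example in the text.
    chapter = ChapterInfo("zoo", datetime(2009, 5, 26, 10, 12, 12))
    print(chapter.voice_location, chapter.start_time.strftime("%H:%M:%S"))
```

Keeping the frame number optional reflects the statement that the time information may, but need not, be expressed as a frame number.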

Next, FIG. 5 shows an example of the chapter information table. The chapter information table 500 records at least one piece of chapter information for each shot video (from one start of shooting to its stop). For example, as shown in FIG. 5, captured image 1, captured image 2, and captured image 3 have four, three, and five pieces of chapter information, respectively. For each chapter, a chapter name, a photographer voice storage location, and time information are recorded in association with one another. The chapter information table 500 is recorded in the recording unit 9.
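
The table can be pictured as a mapping from each shot video to its list of chapter rows. The row type `ChapterRow`, the helper names, and every cell value below are placeholders; only the three columns and the chapter counts of four, three, and five come from the description of FIG. 5.

```python
from collections import namedtuple

# One row of the table: the three items recorded for each chapter.
ChapterRow = namedtuple("ChapterRow", "chapter_name voice_location time_info")


def make_placeholder_rows(count):
    """Build rows with placeholder values; the real cell contents are not given."""
    return [
        ChapterRow(f"chapter {i}", f"location {i}", f"time {i}")
        for i in range(1, count + 1)
    ]


# Chapter counts per shot video as described for FIG. 5.
chapter_table = {
    "captured image 1": make_placeholder_rows(4),
    "captured image 2": make_placeholder_rows(3),
    "captured image 3": make_placeholder_rows(5),
}


def chapter_counts(table):
    """Number of chapters recorded for each shot video."""
    return {name: len(rows) for name, rows in table.items()}


if __name__ == "__main__":
    print(chapter_counts(chapter_table))
```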

Next, how chapter information is used during playback, and how each processing unit operates at that time, is described with reference to FIGS. 1, 6, and 7. Taking as an example a case in which various animals are filmed at a zoo, the following describes how to use a video in which the photographer has recorded the type of each animal as a photographer's utterance.

FIG. 6 schematically shows a list of the scenes shot in a single recording. In this example, the zoo entrance, an elephant, a lion, and a monkey are shot in one recording. A recorded scene is one scene of the captured video or a single still image. The subject sound is the sound emitted by the subject (for example, an animal's cry) and picked up by the array microphone 1 while the video is shot. The photographer's utterance is the photographer's voice picked up by the array microphone 1 while the video is shot. The time information indicates the time at which the photographer's utterance started.

The chapter information generation unit 8 generates chapter information for each scene, such as the zoo entrance, and records it in the recording unit 9. The chapter information generation unit 8 can also use part of a recorded scene as chapter information.

When a menu screen for playing back chapter information is displayed, the chapter information reading unit 10 reads the chapter information recorded in the recording unit 9 and passes it to the OSD creation unit 11. The OSD creation unit 11 creates an OSD (On Screen Display) based on the chapter information and passes it to the OSD composition unit 12. The OSD composition unit 12 combines the OSD with the video signal, and the result is displayed on the display unit 13.

FIG. 7 shows an example of a menu screen for playing back the audio corresponding to chapter information. The menu image 700 displays selection buttons 701, one for each captured video recorded in the recording unit 9. When a selection button 701 is selected, the audio associated with that button is played back. For example, the captured video containing the four recorded scenes shown in FIG. 6 corresponds to captured video 1 in FIG. 7. Therefore, when the selection button 701 for captured video 1 is selected, the four scenes shown in FIG. 6 are played back.

The menu image 700 also displays a progress bar 702 that indicates the temporal positions of the chapter information. The progress bar 702 shows the time information corresponding to the captured video (recorded scenes).

When the photographer (user) selects the selection button 701 for captured video 1, that button is, for example, highlighted, and time information taken from the chapter information for captured video 1, such as "(1) 10:12:12", is displayed on the progress bar 702. The photographer's utterances associated with the individual pieces of chapter information are then played back in order, and the time information corresponding to the utterance currently being played is highlighted.

The audio played back at this point consists of the photographer's utterances, such as "XX Zoo", "Elephant", "Lion", and "Monkey". While the photographer's utterance for a recorded scene is playing, the photographer (user) can play back the desired video scene by instructing playback with, for example, a playback button on the operation unit 22. For example, if a predetermined playback button is operated while the utterance "Elephant" is playing, the recorded scene is played back from 10:15:31, the time at which that utterance was recorded.
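
The interaction described above can be sketched as a loop over the chapters that returns a seek position as soon as the playback button is pressed. This is a minimal simulation, not the device's control logic: `preview_chapters` and its `pressed_during` parameter are hypothetical, and the times for "Lion" and "Monkey" are made up, since only 10:12:12 and 10:15:31 appear in the description.

```python
from typing import Optional, Sequence, Tuple

Chapter = Tuple[str, str]  # (spoken label, time at which the utterance started)


def preview_chapters(chapters: Sequence[Chapter],
                     pressed_during: Optional[str]) -> Optional[str]:
    """Step through the utterances in order; return the time to seek to, if any.

    `pressed_during` names the utterance that is playing when the user
    presses the playback button (None if the button is never pressed).
    """
    for label, start_time in chapters:
        # A real device would play the voice clip and highlight start_time here.
        if label == pressed_during:
            return start_time
    return None


chapters = [
    ("XX Zoo", "10:12:12"),
    ("Elephant", "10:15:31"),
    ("Lion", "10:20:00"),    # hypothetical time
    ("Monkey", "10:25:00"),  # hypothetical time
]
print(preview_chapters(chapters, "Elephant"))  # 10:15:31
```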

When all of the selection buttons 701 for captured videos 1 to 3 are selected, or when an "All" selection button 701 for playing back captured videos 1 to 3 is selected, the audio corresponding to the chapter information for captured videos 1 to 3 is played back in order. Highlighting the selection button 701 of the captured video currently being played, for example, lets the photographer see which captured video is playing.

Although the selection buttons 701 in FIG. 7 are labeled "captured video 1" to "captured video 3", time information including the date, obtained from the first chapter information of each captured video, may be displayed in the selection buttons 701 instead. In addition, a characteristic subject sound can be played back in the menu image 700 when a captured video is played back.

As described above, according to this embodiment, when the audio recorded with a video is used for playback, chapter information can be created at the scenes the photographer intended, and during playback the recorded chapter information makes chapter skipping, scene search, and similar operations easy.

Furthermore, because the recording period determination unit limits the period over which the photographer's voice is recorded, the encoding load is reduced. This contributes to the low power consumption required in particular of portable devices with limited battery capacity.

Moreover, limiting the recording period minimizes the likelihood that, when chapter information is used for scene search or the like, audio that the user does not regard as chapter audio causes erroneous search results.

(Embodiment 2)
Embodiment 2 of the invention is described with reference to FIGS. 8 to 10. FIG. 8 is a block diagram of a terminal device in which the audio processing apparatus according to the present invention is applied to the video shooting apparatus 100. Compared with the block diagram of FIG. 1, FIG. 8 adds a speech analysis and text conversion unit 50 between the sound extraction unit 3 and the chapter information generation unit 8. In the following, processing units identical to those in FIG. 1 are given the same reference numerals, and their description is omitted.

The speech analysis and text conversion unit 50 converts the photographer's utterance output from the sound extraction unit 3 into text and passes the text to the chapter information generation unit 8. The speech analysis and text conversion unit 50 can also convert all or part of a photographer's utterance containing predetermined content into text, based on the volume of the utterance, an instruction from the photographer, the shooting time, or the like.
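
To illustrate the volume-based selection mentioned above, the Python sketch below transcribes only those utterances whose level reaches a threshold. The `Utterance` type, the decibel threshold, and the `recognize` callable are assumptions for this example; the stand-in recognizer simply decodes bytes and performs no actual speech recognition.

```python
from typing import Callable, List, NamedTuple


class Utterance(NamedTuple):
    samples: bytes    # raw audio of one photographer utterance
    volume_db: float  # measured level of the utterance


def transcribe_loud(utterances: List[Utterance],
                    recognize: Callable[[bytes], str],
                    min_volume_db: float) -> List[str]:
    """Convert to text only the utterances at or above the volume threshold."""
    return [recognize(u.samples) for u in utterances if u.volume_db >= min_volume_db]


def fake_recognize(audio: bytes) -> str:
    """Stand-in for a speech recognizer: decodes the bytes as UTF-8 text."""
    return audio.decode("utf-8")


utterances = [Utterance(b"Elephant", -12.0), Utterance(b"(mumble)", -40.0)]
print(transcribe_loud(utterances, fake_recognize, min_volume_db=-20.0))  # ['Elephant']
```

Passing the recognizer in as a parameter keeps the selection rule separate from whichever speech-to-text method a device actually uses.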

The chapter information generation unit 8 records the text of the photographer's utterance in the recording unit 9 together with the chapter information.

FIG. 9 shows a specific example of the chapter information according to this embodiment that is recorded in the recording unit 9. As in Embodiment 1, the photographer voice storage location and the time information are recorded as chapter information. In this embodiment, the chapter information additionally includes the content of the photographer's utterance converted into text.

When a menu screen for playing back chapter information is displayed, as in Embodiment 1, the chapter information reading unit 10 reads the chapter information recorded in the recording unit 9 and passes it to the OSD creation unit 11. The OSD creation unit 11 creates an OSD based on the chapter information and passes it to the OSD composition unit 12. The OSD composition unit 12 combines the OSD with the video signal, and the result is displayed on the display unit 13.

FIG. 10 shows an example of a menu screen for playing back the audio corresponding to chapter information. In the menu image 1000, the text of the photographer's utterance for each chapter of the captured videos recorded in the recording unit 9 is displayed as a selection button 1001. For example, if the video containing the scenes shown in FIG. 6 is captured video 1, the photographer's utterances are recorded as "XX Zoo", "Elephant", "Lion", and "Monkey". The spoken "XX Zoo" is therefore converted into text, and the text "XX Zoo" is displayed on a selection button 1001. When the photographer selects the selection button 1001 labeled "Lion" for captured video 1, playback starts from the lion scene of captured video 1.
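
A text-labeled menu of this kind amounts to a lookup from (captured video, label) to a start time. The Python sketch below builds such a lookup; `build_buttons` is a hypothetical helper, and the times for "Lion" and "Monkey" are invented for the example.

```python
from typing import Dict, List, Tuple

# captured video name -> [(transcribed label, start time)]
chapters: Dict[str, List[Tuple[str, str]]] = {
    "captured video 1": [
        ("XX Zoo", "10:12:12"),
        ("Elephant", "10:15:31"),
        ("Lion", "10:20:00"),    # hypothetical time
        ("Monkey", "10:25:00"),  # hypothetical time
    ],
}


def build_buttons(table: Dict[str, List[Tuple[str, str]]]) -> Dict[Tuple[str, str], str]:
    """Map each (video, label) button to the time from which playback starts."""
    return {
        (video, label): start
        for video, entries in table.items()
        for label, start in entries
    }


buttons = build_buttons(chapters)
print(buttons[("captured video 1", "Lion")])  # 10:20:00
```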

Although the captured videos are labeled "captured video 1", "captured video 2", and so on in FIG. 10, time information including the date, obtained from the first chapter information of each captured video, may be displayed instead.

As described above, according to this embodiment, the text displayed on the selection buttons of the menu screen showing the chapter information makes it easy to tell which scene will be played from each chapter onward. This makes playback more convenient, for example by allowing playback to start from a desired scene.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible.

The chapter information generation unit 8 can also generate a plurality of scenes for a single captured video. The display unit 13 can display the generated scenes so that they appear consecutively. For example, the recorded scenes shown in FIG. 6 are displayed as a series of consecutive scenes or as a moving image.

Description of symbols: 100: video shooting apparatus, 1: array microphone, 2: ADC, 3: sound extraction unit, 4: recording period determination unit, 5: camera, 6: image processing unit, 7: codec unit, 8: chapter information generation unit, 9: recording unit, 10: chapter information reading unit, 11: OSD creation unit, 12: OSD composition unit, 13: display unit, 14: DAC, 15: speaker, 20: video/audio processing unit, 21: control unit, 22: operation unit, 23: timer unit, 30: level measurement unit, 40: movement detection unit

Claims

An audio processing apparatus comprising:
a sound acquisition unit that acquires a plurality of sounds;
an imaging unit that captures a moving image together with the plurality of sounds;
a sound extraction unit that extracts, from the plurality of sounds acquired by the sound acquisition unit, a sound of a subject and a sound of a photographer emitted from a predetermined direction;
a determination unit that determines a break between the sound of the subject and the sound of the photographer based on the volume of the sound of the subject and the sound of the photographer extracted by the sound extraction unit;
a generation unit that generates, based on the break determined by the determination unit, information indicating the break associated with each of the sound of the subject and the sound of the photographer;
a display unit that displays the information indicating the break generated by the generation unit;
a timekeeping unit that measures the time at which the plurality of sounds are acquired; and
a character conversion unit that converts the sound of the subject and the sound of the photographer into a character string corresponding to the sound,
wherein the generation unit generates, as the information indicating the break, a predetermined frame constituting the moving image associated with the break, and
the display unit displays the time measured by the timekeeping unit in association with the information indicating the break, and displays the character string in association with the information indicating the break.

The audio processing apparatus according to claim 1, further comprising:
a movement detection unit that detects an amount of movement of the audio processing apparatus,
wherein the determination unit determines the break between the sound of the subject and the sound of the photographer when the amount of movement detected by the movement detection unit is equal to or greater than a predetermined threshold.

An audio processing method executed by an audio processing apparatus having a sound acquisition unit, an imaging unit, a sound extraction unit, a determination unit, a generation unit, a display unit, a timekeeping unit, and a character conversion unit, the method comprising:
a sound acquisition step in which the sound acquisition unit acquires a plurality of sounds;
an imaging step in which the imaging unit captures a moving image together with the plurality of sounds;
a sound extraction step in which the sound extraction unit extracts, from the plurality of sounds acquired in the sound acquisition step, a sound of a subject and a sound of a photographer emitted from a predetermined direction;
a determination step in which the determination unit determines a break between the sound of the subject and the sound of the photographer based on the volume of the sound of the subject and the sound of the photographer extracted in the sound extraction step;
a generation step in which the generation unit generates, based on the break determined in the determination step, information indicating the break associated with each of the sound of the subject and the sound of the photographer;
a display step in which the display unit displays the information indicating the break generated in the generation step;
a timekeeping step in which the timekeeping unit measures the time at which the plurality of sounds are acquired; and
a character conversion step in which the character conversion unit converts the sound of the subject and the sound of the photographer into a character string corresponding to the sound,
wherein, in the generation step, a predetermined frame constituting the moving image associated with the break is generated as the information indicating the break, and
in the display step, the time measured in the timekeeping step is displayed in association with the information indicating the break, and the character string is displayed in association with the information indicating the break.

A program that causes a computer to function as:
a sound acquisition unit that acquires a plurality of sounds;
an imaging unit that captures a moving image together with the plurality of sounds;
a sound extraction unit that extracts, from the plurality of sounds acquired by the sound acquisition unit, a sound of a subject and a sound of a photographer emitted from a predetermined direction;
a determination unit that determines a break between the sound of the subject and the sound of the photographer based on the volume of the sound of the subject and the sound of the photographer extracted by the sound extraction unit;
a generation unit that generates, based on the break determined by the determination unit, information indicating the break associated with each of the sound of the subject and the sound of the photographer;
a display unit that displays the information indicating the break generated by the generation unit;
a timekeeping unit that measures the time at which the plurality of sounds are acquired; and
a character conversion unit that converts the sound of the subject and the sound of the photographer into a character string corresponding to the sound,
wherein the generation unit generates, as the information indicating the break, a predetermined frame constituting the moving image associated with the break, and
the display unit displays the time measured by the timekeeping unit in association with the information indicating the break, and displays the character string in association with the information indicating the break.