JP2011055386A

JP2011055386A - Audio signal processor, and electronic apparatus

Info

Publication number: JP2011055386A
Application number: JP2009204315A
Authority: JP
Inventors: Tomoki Oku; 智岐奥
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2009-09-04
Filing date: 2009-09-04
Publication date: 2011-03-17

Abstract

PROBLEM TO BE SOLVED: To reproduce, during slow reproduction of a video, an audio signal that has been extended in a manner suited to the type of its sound source.

SOLUTION: An object audio signal of α seconds is collected while an object moving image is captured for α seconds at 600 fps. When the object moving image is reproduced at 60 fps (that is, when slow reproduction takes (10×α) seconds), an extended audio signal obtained by temporally extending the object audio signal by a factor of 10 is reproduced in synchronization with the reproduced video. When the object audio signal contains both a human voice (a cheer such as "wahooo") and an impulse sound (a sound such as "whack"), the two are separated and extracted from the object audio signal; a pitch-keeping extension process, which extends the signal length without changing the pitch, is applied to the former, and an echo process, which repeats reproduction while gradually reducing the volume, is applied to the latter. A signal synthesized from the processed audio signals is reproduced as the extended audio signal along with the slow-reproduction video.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to an acoustic signal processing device that performs signal processing on an acoustic signal. The present invention also relates to electronic apparatuses, such as recording devices and playback devices, that use such an acoustic signal processing device.

With recent advances in imaging technology, imaging devices that can shoot and record video at higher speeds than usual have been put into practical use. This high-speed shooting function used to be available only in special-purpose imaging devices, but it is now also found in consumer imaging devices.

An imaging device of this kind can shoot moving images in either a normal shooting mode or a high-speed shooting mode. In the normal shooting mode, as in ordinary moving image shooting, 60 or 30 frames of video are shot and recorded per second. That is, moving images are shot at a frame rate of 60 fps (frames per second) or 30 fps. When a moving image recorded in the normal shooting mode is played back at the same frame rate as at the time of shooting (that is, 60 fps or 30 fps), normal-speed playback video is obtained (see FIG. 5).

In the high-speed shooting mode, by contrast, moving images are shot at a high frame rate such as 300 fps or 600 fps. When a moving image shot in the high-speed shooting mode is played back at the normal frame rate of 60 fps, smooth slow playback at 1/5 or 1/10 speed can be achieved (see FIG. 6).

For example, shooting a moving image for just 1 second at a frame rate of 600 fps records a moving image consisting of 600 frames, and playing this moving image back at a frame rate of 60 fps takes 10 seconds. In other words, 1 second of recorded moving image is played back slowly over 10 seconds (slow playback at 1/10 speed).
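The arithmetic in this example can be checked with a short sketch. The function name `slow_playback` and its parameters are illustrative assumptions and do not appear in the specification; the numbers are the ones given in the text.

```python
from fractions import Fraction


def slow_playback(shoot_fps: int, play_fps: int, shoot_seconds: Fraction):
    """Return (frame_count, playback_seconds, speed_ratio) for one clip.

    Hypothetical helper for illustration only.
    """
    frames = shoot_fps * shoot_seconds           # frames recorded while shooting
    playback_seconds = frames / play_fps         # time needed to display them all
    speed_ratio = Fraction(play_fps, shoot_fps)  # 1/10 means 1/10-speed playback
    return frames, playback_seconds, speed_ratio


frames, seconds, ratio = slow_playback(600, 60, Fraction(1))
print(frames, seconds, ratio)  # 600 10 1/10
```

Calling `slow_playback(300, 60, Fraction(1))` gives 300 frames, 5 seconds, and a ratio of 1/5, which is the other case mentioned in the background.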

Imaging devices capable of slow playback based on high-speed shooting have been put into practical use, but in practice no acoustic signal is recorded in the high-speed shooting mode. The reason is as follows. Suppose 1 second of acoustic signal is picked up and recorded while 1 second of moving image is shot, and the recorded 1 second of acoustic signal is then played back slowly in synchronization with the 10 seconds of slow-playback moving image. The pitch of the acoustic signal then changes, and a drawn-out sound is reproduced.

Meanwhile, techniques for slow playback of acoustic signals are disclosed in Patent Documents 1 to 3 below. In each of the methods described in these documents, an extension process is applied to the acoustic signal in accordance with the recording or playback frame rate. An extension process for an acoustic signal is a process that increases the signal length of the acoustic signal by stretching it in the time direction. The signal length of an acoustic signal is the length of time of the section in which the acoustic signal exists.

A commonly known extension method stretches the acoustic signal while maintaining its pitch (in other words, the pitch is the same before and after the extension process). This method is used in speech-rate conversion, which speeds up or slows down speech without changing the pitch of the voice. However, simply applying this method to slow playback of video is not desirable. Pitch-maintaining extension is basically suited to stretching a human voice; if it is applied when the acoustic signal recorded with the moving image is, for example, music, an unnatural sound is reproduced. A similar problem can arise when the acoustic signal recorded with the moving image comes from a sound source other than a human voice or music.

Patent Document 1: Japanese Re-publication No. 2007-29832 (WO2007/029832)
Patent Document 2: JP 2001-298710 A
Patent Document 3: JP 2008-219857 A

Accordingly, an object of the present invention is to provide an acoustic signal processing device and an electronic apparatus capable of generating an acoustic signal suited to slow playback of video.

An acoustic signal processing device according to the present invention includes an output acoustic signal generation unit that generates, from an input acoustic signal picked up while a target moving image is being shot at a first frame rate, an output acoustic signal having a longer signal length than the input acoustic signal. The output acoustic signal is an acoustic signal to be played back as sound together with the target moving image when the target moving image is played back at a second frame rate lower than the first frame rate. The output acoustic signal generation unit generates the output acoustic signal from the input acoustic signal in accordance with the type of sound source of the input acoustic signal.

This makes it possible to generate an acoustic signal for slow playback of video that is adapted to the type of sound source.

Specifically, for example, the output acoustic signal generation unit includes a sound source type analysis unit that analyzes the type of sound source of the input acoustic signal based on the input acoustic signal, and generates the output acoustic signal from the input acoustic signal in accordance with the type of sound source determined by the sound source type analysis unit.

Also, for example, the sound source type analysis unit determines, based on the input acoustic signal, whether the sound sources of the input acoustic signal include a human voice, and the output acoustic signal generation unit changes the method of generating the output acoustic signal from the input acoustic signal depending on whether a human voice is included.

More specifically, for example, when the input acoustic signal contains acoustic signals from a plurality of sound sources of different types, the output acoustic signal generation unit uses the sound source type analysis unit to extract the acoustic signals from the respective sound sources individually from the input acoustic signal as a plurality of separated acoustic signals, and to analyze the type of sound source of each separated acoustic signal. It then applies to each separated acoustic signal an extension process suited to the type of its sound source, and synthesizes the plurality of processed separated acoustic signals to generate the output acoustic signal.
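The separate, extend per type, then synthesize flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing defined by the specification: source separation is assumed to have been done already, the type labels `"voice"` and `"impulse"`, the function names, the block length, and the decay factor are all hypothetical, and the two processors are deliberately simplified stand-ins for the pitch-keeping extension and the echo process named in the abstract.

```python
def pitch_keeping_stretch(samples, factor, block_len=4):
    """Repeat each block `factor` times (simplified pitch-keeping extension)."""
    out = []
    for start in range(0, len(samples), block_len):
        out.extend(samples[start:start + block_len] * factor)
    return out


def echo_stretch(samples, factor, decay=0.5):
    """Replay the whole signal `factor` times, each copy quieter than the last."""
    out = []
    gain = 1.0
    for _ in range(factor):
        out.extend(s * gain for s in samples)
        gain *= decay
    return out


# Hypothetical mapping from sound source type to extension process.
PROCESSORS = {"voice": pitch_keeping_stretch, "impulse": echo_stretch}


def build_extended_signal(separated, factor):
    """separated: list of (source_type, samples); all sample lists equally long."""
    stretched = [PROCESSORS[kind](samples, factor) for kind, samples in separated]
    # Synthesize by adding the extended signals sample by sample.
    return [sum(column) for column in zip(*stretched)]


separated = [
    ("voice", [0.25, 0.5, 0.25, 0.0]),
    ("impulse", [1.0, 0.0, 0.0, 0.0]),
]
extended = build_extended_signal(separated, factor=3)
print(len(extended))  # 12
print(extended)
```

A lookup table keyed by source type keeps the dispatch open to further types (for example music) without changing `build_extended_signal`.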

This allows an extension process adapted to the type of sound source to be applied to each of the acoustic signals from the plurality of sound sources that the input acoustic signal may contain.

Also, for example, the output acoustic signal generation unit generates the output acoustic signal from the input acoustic signal based not only on the analysis result of the sound source type analysis unit but also on the result of analyzing the video signal of the target moving image.

This makes it possible to generate and play back an acoustic signal that is also adapted to the video content.

An electronic apparatus according to the present invention includes the acoustic signal processing device described above. Either, while the target moving image is being shot at the first frame rate, the apparatus generates the output acoustic signal from the input acoustic signal and records the output acoustic signal on a recording medium; or the apparatus records the input acoustic signal on the recording medium and, when the target moving image is played back at the second frame rate, generates the output acoustic signal from the recorded input acoustic signal and plays it back together with the target moving image.

According to the present invention, it is possible to provide an acoustic signal processing device and an electronic apparatus capable of generating an acoustic signal suited to slow playback of video.

The significance and effects of the present invention will become clearer from the following description of embodiments. However, the following embodiments are merely embodiments of the present invention, and the meanings of the terms used for the present invention and its constituent elements are not limited to those described in the following embodiments.

A block diagram showing the overall configuration of the imaging device according to the first embodiment of the present invention.
An internal block diagram of the operation unit of FIG. 1.
An internal block diagram of the microphone unit of FIG. 1.
An external perspective view of the imaging device provided with two microphones.
A conceptual diagram of playback of a moving image shot in the normal shooting mode, according to the first embodiment of the present invention.
A conceptual diagram of playback of a moving image shot in the high-speed shooting mode, according to the first embodiment of the present invention.
A block diagram of the parts involved in generating the extended acoustic signal, according to the first embodiment of the present invention.
A diagram showing the temporal relationship between the target moving image and the extended acoustic signal during playback.
A diagram showing the reference block and the evaluation block set for the target acoustic signal in a specific section.
A diagram showing how the autocorrelation value between the reference block and the evaluation block periodically takes local maxima as the positional difference (p) between the two blocks changes.
A diagram showing an example of the relationship between the target acoustic signal and the extended acoustic signal.
A diagram showing how the extended acoustic signal is generated from the target acoustic signal by way of separated acoustic signals.
A conceptual diagram of the simple extension process, which is one kind of extension process.
A conceptual diagram of the pitch-maintaining extension process, which is one kind of extension process.
A conceptual diagram of the echo process, which is one kind of extension process.
For the first extension example according to the present invention, (a) a conceptual diagram of normal playback of the target acoustic signal and the target moving image, and (b) a conceptual diagram of slow playback of the target moving image accompanied by playback of the extended acoustic signal.
For the first extension example according to the present invention, a diagram showing how the entire section of the target acoustic signal is divided into three sections.
For the first extension example according to the present invention, graphs showing (a) the frequency spectrum of a target acoustic signal containing a cheer and a hitting sound, and (b) the Fourier transform of that frequency spectrum.
For the second extension example according to the present invention, (a) a conceptual diagram of normal playback of the target acoustic signal and the target moving image, and (b) a conceptual diagram of slow playback of the target moving image accompanied by playback of the extended acoustic signal.
For the second extension example according to the present invention, a diagram for explaining the positional relationship between the two microphones and a sound source.
For the third extension example according to the present invention, (a) a conceptual diagram of normal playback of the target acoustic signal and the target moving image, and (b) a conceptual diagram of slow playback of the target moving image accompanied by playback of the extended acoustic signal.
For the third extension example according to the present invention, a diagram showing how the entire section of the target acoustic signal is divided into three sections.
For the third extension example according to the present invention, a graph showing the frequency spectra of the acoustic signals of a goal call, a cheer, and BGM.
For the third extension example according to the present invention, a graph showing the Fourier transform of the frequency spectrum of the target acoustic signal.
A block diagram of the parts involved in generating the extended acoustic signal, according to the second embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. In the drawings referred to, the same parts are given the same reference numerals, and duplicate descriptions of the same parts are omitted in principle.

<<First Embodiment>>
A first embodiment of the present invention will be described. FIG. 1 is a block diagram showing the overall configuration of an imaging device 1 according to the first embodiment of the present invention. The imaging device 1 includes the parts referred to by reference numerals 11 to 18. The imaging device 1 is a digital video camera capable of shooting still images and moving images. The display unit 16 and/or the speaker 17 may also be regarded as being provided in a playback device separate from the imaging device 1.

The imaging unit 11 shoots a subject using an image sensor and, in cooperation with the video signal processing unit 12, acquires a video signal of the image of the subject. Specifically, the imaging unit 11 has an optical system and an aperture (neither shown) and an image sensor such as a CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) image sensor. The image sensor photoelectrically converts the optical image of the subject that enters through the optical system and the aperture, and outputs the resulting analog electrical signal. An AFE (Analog Front End), not shown, amplifies the analog signal output from the image sensor and converts it into a digital signal.

The resulting digital signal is sent to the video signal processing unit 12, which generates a video signal of the image of the subject from it. A video signal expressed in digital form is also called image data, and in this specification image data is sometimes referred to simply as an image. The video signal processing unit 12 can apply various kinds of image processing (demosaicing, edge enhancement, noise reduction, image compression, and so on) to the image data of the image of the subject.

The microphone unit 13 consists of one or more microphones; it picks up sound from sound sources located around the imaging device 1 and converts it into an electrical signal. The resulting electrical signal is sent to the acoustic signal processing unit 14 as an acoustic signal. The acoustic signal processing unit 14 can apply various kinds of acoustic signal processing to this acoustic signal, as described in detail later.

The recording medium 15 is a nonvolatile memory such as a semiconductor memory or a magnetic disk, and can record the video signal generated by the video signal processing unit 12 and the acoustic signal generated by the acoustic signal processing unit 14. The display unit 16 is a liquid crystal display or the like, and displays images obtained by shooting with the imaging unit 11, images recorded on the recording medium 15, and so on. The speaker 17 plays back, as sound, the acoustic signal generated by the acoustic signal processing unit 14 and the acoustic signal recorded on the recording medium 15.

The operation unit 18 is the part through which the user performs various operations on the imaging device 1. As shown in FIG. 2, the operation unit 18 includes a shutter button 18a for instructing the shooting of a still image and a record button 18b for instructing the start and end of moving image shooting. The main control unit 19 controls the operation of each part of the imaging device 1 in an integrated manner, in accordance with the operations performed on the operation unit 18.

The microphone unit 13 may consist of one microphone, or of three or more, but in this embodiment it is assumed that the microphone unit 13 consists of two microphones, namely microphones 13L and 13R, as shown in FIG. 3. FIG. 4 is an external perspective view of the imaging device 1 provided with the microphones 13L and 13R.

The microphones 13L and 13R are arranged at different positions on the housing of the imaging device 1. As seen by a photographer directly facing the subject of the imaging device 1, the microphone 13L is arranged toward the left and the microphone 13R toward the right. As shown in FIG. 4, the direction from the imaging device 1 toward a subject within the shooting range of the imaging unit 11 is defined as forward, and the opposite direction as backward. The microphones 13L and 13R are omnidirectional microphones with no directivity, although directional microphones may also be used as the microphones 13L and 13R.

The microphone 13L converts the sound it picks up into an electrical signal and outputs a detection signal representing that sound. The microphone 13R does the same for the sound it picks up. These detection signals are analog acoustic signals. The analog acoustic signals that are the detection signals of the microphones 13L and 13R are each converted into digital acoustic signals by A/D converters (not shown).

The microphone 13L is associated with the left channel and the microphone 13R with the right channel. When the acoustic signal based on the detection signal of the microphone 13L needs to be distinguished from the acoustic signal based on the detection signal of the microphone 13R, the former is called the left-channel acoustic signal and the latter the right-channel acoustic signal. A digital acoustic signal obtained by digitally converting the detection signal of the microphone 13L and/or 13R is called an original acoustic signal. An acoustic signal obtained by applying predetermined signal processing (such as signal level adjustment by automatic level control) to such a digital acoustic signal may also be regarded as an original acoustic signal. The original acoustic signal is assumed to be a signal on the time axis. Unless otherwise stated, any acoustic signal in this embodiment and in the other embodiments described later can be taken to be an acoustic signal on the time axis (an acoustic signal expressed in the time domain).

In the imaging device 1, the frame rate used when shooting a moving image is variable, and so is the frame rate used when playing it back. Through the operation unit 18, the user can set the shooting mode to the normal shooting mode or the high-speed shooting mode. Hereinafter, the frame rate at the time of shooting a moving image is also called the shooting rate, and the frame rate at the time of playing back a moving image is also called the playback rate.

In the normal shooting mode, a moving image is shot at 60 fps (frames per second), as shown in FIG. 5. A moving image shot at 60 fps can be played back at the same frame rate (that is, 60 fps). In this case, the shot moving image is displayed on the display unit 16 at normal playback speed. That is, the 60 frames shot over 1 second are displayed on the display unit 16 over 1 second.

In the high-speed shooting mode, a moving image is shot at 600 fps, as shown in FIG. 6. A moving image shot at 600 fps can be played back at 60 fps. In this case, the 600 frames shot over 1 second are displayed on the display unit 16 over 10 seconds, which in effect achieves slow playback. The specific values of the shooting rate and the playback rate are of course only examples: the shooting rate in the normal shooting mode may be other than 60 fps (for example, 30 fps), and the shooting rate in the high-speed shooting mode may be other than 600 fps (for example, 300 fps). The specific value of the playback rate may be changed as the shooting rate is changed.

The following description assumes that the target moving image is shot at 600 fps in the high-speed shooting mode and played back at 60 fps. When the target moving image is shot over α seconds, α seconds of original acoustic signal are picked up during the shooting section (α is an arbitrary positive number). If the original acoustic signal is also simply played back slowly when the target moving image shot over α seconds is played back slowly over (10×α) seconds, the pitch of the acoustic signal changes and a drawn-out sound is reproduced. Pitch is the fundamental frequency of an acoustic signal; when the sound source is a human voice, pitch is the fundamental frequency of the acoustic signal produced by vibration of the person's vocal cords.
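The pitch change caused by simple slow playback can be illustrated numerically. The sketch below is only an illustration under assumed values (a 440 Hz test tone, a 48 kHz sampling rate, and a zero-crossing frequency estimate); none of these come from the specification. Playing the same samples at one tenth of the original sampling rate lowers the estimated frequency by a factor of 10.

```python
import math


def sine(freq_hz, sample_rate, seconds):
    """Generate a sine tone as a list of samples."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_frequency(samples, playback_rate):
    """Rough frequency estimate: rising zero crossings per second of playback."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / playback_rate
    return crossings / duration


SAMPLE_RATE = 48000                                   # assumed recording rate
tone = sine(440.0, SAMPLE_RATE, 1.0)                  # 1 second of a 440 Hz tone

normal = estimate_frequency(tone, SAMPLE_RATE)        # played at recording speed
slow = estimate_frequency(tone, SAMPLE_RATE / 10)     # same samples over 10 s

print(round(normal), round(slow))  # roughly 440 and 44
```

The zero-crossing count slightly underestimates the true frequency because the tone is finite, but the factor of 10 between the two estimates does not depend on that.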

A method of stretching an acoustic signal while maintaining its pitch (in other words, a method in which the pitch is the same before and after the extension process) is also known, but such a method is not always appropriate. A pitch-maintaining extension method basically stretches the acoustic signal by dividing it into a plurality of blocks and playing back the same block repeatedly. For this reason, it is relatively well suited to the acoustic signal of a human voice (the pitch does not change and each individual sound is simply lengthened), but applying it to music, in which many frequencies are mixed, often produces an unnatural sound. Also, when a scene such as a baseball batter's swing is played back slowly, applying an echo process to the sound at the moment the bat hits the ball is considered to match the playback video better.
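The block-repetition idea can be sketched as follows. This is a simplified illustration, not the method of the specification: it assumes the pitch period is already known and equal to the block length (100 samples here), uses a synthetic decaying tone as a stand-in for a voiced sound, and joins blocks without any cross-fading. The names `repeat_blocks` and `rising_crossing_gaps` are hypothetical.

```python
import math


def repeat_blocks(samples, block_len, factor):
    """Split the signal into blocks and emit every block `factor` times."""
    out = []
    for start in range(0, len(samples), block_len):
        block = samples[start:start + block_len]
        out.extend(block * factor)
    return out


def rising_crossing_gaps(samples):
    """Set of distances (in samples) between consecutive rising zero crossings."""
    idx = [i for i in range(1, len(samples)) if samples[i - 1] < 0 <= samples[i]]
    return {b - a for a, b in zip(idx, idx[1:])}


PERIOD = 100  # assumed pitch period in samples
# Synthetic "voiced" signal: a tone with period 100 and a slowly decaying level.
voiced = [(1 - i / 2000) * math.sin(2 * math.pi * (i + 0.5) / PERIOD)
          for i in range(1000)]

stretched = repeat_blocks(voiced, PERIOD, 10)

print(len(voiced), len(stretched))      # 1000 10000
print(rising_crossing_gaps(voiced))     # {100}
print(rising_crossing_gaps(stretched))  # {100}
```

The signal becomes 10 times longer while the spacing between rising zero crossings, and hence the pitch period, stays at 100 samples. With a block length that does not match the pitch period, the joins would introduce discontinuities, which is one reason practical methods choose the block boundaries carefully.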

In view of the above, the imaging device 1 is provided with a function for generating, from the original acoustic signal, an acoustic signal adapted to playback of the target moving image. FIG. 7 is a block diagram of the parts particularly involved in this function. The sound source type analysis unit 31, the acoustic signal extension unit 32, and the acoustic signal encoding unit 33 shown in FIG. 7 can be provided in the acoustic signal processing unit 14 of FIG. 1, and the video signal analysis unit 34 can be provided in the video signal processing unit 12 of FIG. 1.

The target acoustic signal is input to the sound source type analysis unit 31 (hereinafter sometimes abbreviated as the analysis unit 31) and to the acoustic signal expansion unit 32 (hereinafter sometimes abbreviated as the expansion unit 32). The target acoustic signal is the original acoustic signal picked up by the microphone unit 13 while the target moving image was being shot.

The analysis unit 31 analyzes, based on the target acoustic signal, the types of the sound sources of the signal components included in the target acoustic signal. In other words, it analyzes from what types of sound sources the acoustic signals contained in the target acoustic signal originate. For example, it analyzes whether the sound source of a signal component included in the target acoustic signal is a human voice (in other words, human vocal cords), music, an impulse-like sound (hereinafter referred to as an impulse sound), or an animal's cry. Information representing the analysis result of the analysis unit 31 is sent to the expansion unit 32 as sound source type information.

Meanwhile, the video signal analysis unit 34 analyzes objects and the like included in the target moving image based on the target video signal, which is the video signal of the target moving image. For example, face detection processing can be used to analyze whether a human face is present in the target moving image. As another example, whether the target moving image shows a sports scene can be analyzed from the magnitude of the speed of object motion in the target moving image. Information representing the analysis result of the video signal analysis unit 34 is sent to the expansion unit 32 as video analysis information.

The expansion unit 32 generates an expanded acoustic signal by temporally expanding the target acoustic signal in accordance with frame rate information. The frame rate information specifies the shooting rate and the reproduction rate of the target moving image. In this example, as described above, the shooting rate of the target moving image is 600 fps and its reproduction rate is 60 fps, so an acoustic signal of (10 × α) seconds is generated as the expanded acoustic signal from the target acoustic signal of α seconds.
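The relationship between the two rates and the length of the expanded acoustic signal can be illustrated by the following minimal sketch. The function names are hypothetical and are not part of the disclosed apparatus; the sketch only assumes that the expansion factor equals the shooting rate divided by the reproduction rate.

```python
from fractions import Fraction


def expansion_factor(shooting_fps: int, playback_fps: int) -> Fraction:
    """Factor by which the audio must be lengthened to stay in sync with slow playback."""
    if shooting_fps <= 0 or playback_fps <= 0:
        raise ValueError("frame rates must be positive")
    return Fraction(shooting_fps, playback_fps)


def expanded_duration(alpha_seconds: float, shooting_fps: int, playback_fps: int) -> float:
    """Length in seconds of the expanded acoustic signal for alpha seconds of source audio."""
    return alpha_seconds * float(expansion_factor(shooting_fps, playback_fps))


if __name__ == "__main__":
    # 600 fps shooting and 60 fps reproduction, as in the example of the text.
    print(expansion_factor(600, 60), expanded_duration(2.0, 600, 60))
```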

The method of generating the expanded acoustic signal from the target acoustic signal is determined mainly according to the sound source type information, and can additionally be made dependent on the video analysis information and scene setting information. The scene setting information indicates the shooting scene that has been set; the user can set the shooting scene to a desired one using the operation unit 18. For example, when shooting a sports scene the user can set the shooting scene to "sports", and when shooting a subject close to the imaging apparatus 1 the user can set the shooting scene to "macro". When the shooting scene is set to "sports", the imaging apparatus 1 shoots the target moving image under shooting conditions suited to sports scenes; when the shooting scene is set to "macro", the imaging apparatus 1 shoots the target moving image under shooting conditions suited to a nearby subject.

The method of generating the expanded acoustic signal according to the sound source type information, the video analysis information, and the scene setting information is described in detail later. Note that the expanded acoustic signal can be generated from the target acoustic signal on a per-channel basis. That is, the expansion unit 32 can generate a left-channel expanded acoustic signal by temporally expanding the left-channel target acoustic signal, and a right-channel expanded acoustic signal by temporally expanding the right-channel target acoustic signal. In the following, the channels are not distinguished unless there is a particular need to do so.

The acoustic signal encoding unit 33 generates an encoded acoustic signal by encoding the expanded acoustic signal generated by the expansion unit 32 with a predetermined encoding scheme (for example, AAC (Advanced Audio Coding)). Meanwhile, in the video signal processing unit 12 of FIG. 1, the video signal of the target moving image is encoded to generate an encoded video signal. The encoded acoustic signal is recorded on the recording medium 15 together with the encoded video signal of the target moving image, while being temporally associated with that encoded video signal.

At the time of reproduction, the encoded video signal and the encoded acoustic signal of the target moving image are read from the recording medium 15 and decoded in the video signal processing unit 12 and the acoustic signal processing unit 14, yielding the video signal of the target moving image and the expanded acoustic signal. The decoded video signal is sent to the display unit 16 at 60 fps so that the target moving image is reproduced and displayed at 60 fps, and the expanded acoustic signal is sent to the speaker 17 so that it is reproduced as sound in synchronization with the reproduced video of the target moving image.

FIG. 8 shows the temporal relationship between the target moving image and the expanded acoustic signal during reproduction. The target moving image, shot at 600 fps over α seconds, is reproduced at 60 fps over (10 × α) seconds. Meanwhile, the expanded acoustic signal of (10 × α) seconds, generated from the α seconds of original acoustic signal picked up while the target moving image was shot, is reproduced by the speaker 17 over (10 × α) seconds in synchronization with the 60 fps reproduction of the target moving image.

[Sound source type analysis method]
The method by which the analysis unit 31 analyzes the type of a sound source will now be described. Attention is paid to a specific section within the entire section in which the target acoustic signal exists, and a method of determining whether the target acoustic signal in the specific section includes an acoustic signal from a specific type of sound source is described. Of the left-channel and right-channel target acoustic signals in the specific section, the analysis unit 31 can use only the left-channel signal, or only the right-channel signal, to determine whether the left-channel and right-channel target acoustic signals in the specific section include an acoustic signal from the specific type of sound source. Alternatively, the determination can be made based on both the left-channel and right-channel target acoustic signals in the specific section.

Whether the target acoustic signal in the specific section includes an acoustic signal of a human voice can be detected using a known utterance section detection method used in speech recognition processing and the like (for example, the method disclosed in Japanese Patent Laid-Open No. 10-257596). Specifically, for example, a method based on pitch extraction using autocorrelation processing can be used for this detection. A section that includes an acoustic signal of a human voice is also specifically called an utterance section.

An utterance section detection method that can be adopted by the analysis unit 31 is described for the case where the specific section contains 1024 samples of a digital acoustic signal. Among the 1024 samples forming the target acoustic signal of the specific section, the signal value of the t-th sample is denoted x(t), where t is an integer from 1 to 1024.

As shown in FIG. 9, the analysis unit 31 calculates the autocorrelation using the block consisting of the 1st to 128th samples as a reference block. That is, an evaluation block consisting of 128 consecutive samples is defined within the specific section, and the correlation between the reference block and the evaluation block is obtained while the temporal position of the evaluation block is shifted step by step. More specifically, the autocorrelation value S(p) is calculated in accordance with formula (1). The autocorrelation value S(p) is a function of the variable p, which determines the position of the evaluation block, and p takes each integer satisfying 0 ≤ p ≤ (1024 − 128).
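Formula (1) itself is not reproduced in this text. One form consistent with the definitions above is the unnormalized sum of products S(p) = Σ_{t=1}^{128} x(t)·x(t + p), for which t + p never exceeds 1024; this form is an assumption, and the actual formula (1) may, for example, include a normalization term. Under that assumption, the calculation can be sketched as follows (0-based indexing; `autocorrelation` is a hypothetical name).

```python
def autocorrelation(x, block_len=128):
    """Return [S(0), ..., S(len(x) - block_len)], where
    S(p) = sum over t of x[t] * x[t + p] for t = 0 .. block_len - 1.

    The reference block is x[0:block_len]; the evaluation block is the
    same-length window shifted by p samples. The unnormalized form used
    here is an assumption about formula (1).
    """
    if len(x) < block_len:
        raise ValueError("specific section is shorter than one block")
    reference = x[:block_len]
    max_shift = len(x) - block_len  # 1024 - 128 = 896 in the text's example
    return [
        sum(r * x[t + p] for t, r in enumerate(reference))
        for p in range(max_shift + 1)
    ]
```

For a signal with a period of four samples, such as `[1, 0, -1, 0]` repeated 256 times, S(p) returns to its maximum at every multiple of four, which is the periodic behavior the analysis relies on.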

FIG. 10 shows the dependence of the calculated autocorrelation value S(p) on the variable p. In FIG. 10, the horizontal axis represents the variable p. FIG. 10 corresponds to the case where the target acoustic signal in the specific section includes an acoustic signal of a human voice. When the target acoustic signal includes a pitch due to human vocal cord vibration, the autocorrelation value S(p) takes large values periodically. When the autocorrelation value S(p) periodically exceeds a predetermined threshold TH_A and the fundamental frequency, which is the reciprocal of that period, falls within a predetermined frequency range R_VOICE, the analysis unit 31 can determine that the target acoustic signal in the specific section includes an acoustic signal of a human voice (that is, that the specific section is an utterance section); otherwise, it can determine that the target acoustic signal in the specific section does not include an acoustic signal of a human voice. For example, when the values of p satisfying the inequality "S(p) > TH_A" are spaced at constant (or substantially constant) intervals, the autocorrelation value S(p) is determined to exceed the threshold TH_A periodically. Since the pitch (fundamental frequency) due to human vocal cord vibration lies roughly in the band of 80 to 270 Hz, the lower and upper limit frequencies of the frequency range R_VOICE are set to, for example, 50 Hz and 300 Hz, respectively.
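The decision rule described above can be sketched as follows. This is only one possible reading: the sampling rate is not stated in this text and is passed in as a parameter, the tolerance used to judge "substantially constant" intervals is an assumed value, and all function names are hypothetical.

```python
def peak_positions(s, th_a):
    """Shift p of the largest S(p) inside each contiguous run where S(p) > TH_A."""
    peaks, run = [], []
    for p, value in enumerate(s):
        if value > th_a:
            run.append(p)
        elif run:
            peaks.append(max(run, key=lambda q: s[q]))
            run = []
    if run:
        peaks.append(max(run, key=lambda q: s[q]))
    return peaks


def fundamental_frequency(s, th_a, sample_rate, tolerance=0.1):
    """Fundamental frequency in Hz, or None if S(p) does not exceed TH_A periodically."""
    peaks = peak_positions(s, th_a)
    if len(peaks) < 2:
        return None
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    mean_gap = sum(gaps) / len(gaps)
    # Assumed criterion: every interval lies within +/- tolerance of the mean.
    if any(abs(g - mean_gap) > tolerance * mean_gap for g in gaps):
        return None
    return sample_rate / mean_gap  # reciprocal of the period in seconds


def is_speech_section(s, th_a, sample_rate, r_voice=(50.0, 300.0)):
    """True if the fundamental frequency falls inside R_VOICE (50 to 300 Hz by default)."""
    f0 = fundamental_frequency(s, th_a, sample_rate)
    return f0 is not None and r_voice[0] <= f0 <= r_voice[1]
```

With an assumed sampling rate of 8000 Hz, peaks spaced 80 samples apart correspond to a fundamental frequency of 100 Hz, which lies inside R_VOICE.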

Whether the target acoustic signal in the specific section includes an acoustic signal of music can be detected by a method similar to the utterance section detection method described above, because an acoustic signal of music also has a certain periodicity. In general, however, the fundamental frequency of a music acoustic signal is higher than that of an acoustic signal due to human vocal cord vibration. Accordingly, when the autocorrelation value S(p) periodically exceeds the predetermined threshold TH_A and the fundamental frequency, which is the reciprocal of that period, exceeds the upper limit frequency of the predetermined frequency range R_VOICE, the analysis unit 31 can determine that the target acoustic signal in the specific section includes an acoustic signal of music.

Even if the fundamental frequency of a music acoustic signal happens to be comparable to that of a human voice, the two can still be distinguished by determining whether a spectral envelope peculiar to the human voice is observed in the target acoustic signal. Owing to resonance, the frequency spectrum of a human-voice acoustic signal tends to have peaks at specific frequencies, whereas no such tendency is seen in music acoustic signals. Accordingly, when the autocorrelation value S(p) for the target acoustic signal in the specific section periodically exceeds the predetermined threshold TH_A, so that the target acoustic signal is determined to include an acoustic signal of either a human voice or music, the two cases may be distinguished by checking whether the above tendency is present in that target acoustic signal.

The analysis unit 31 can also determine whether the target acoustic signal includes an acoustic signal of an impulse sound, based on the magnitude of the change in the signal value or power of the target acoustic signal on the time axis. Specifically, for example, when the specific section contains a section in which the amount of change per unit time of the signal value or power of the target acoustic signal exceeds a predetermined threshold TH_B, it can be determined that an impulse sound is present in that section and that the target acoustic signal in the specific section includes an acoustic signal of an impulse sound. Examples of impulse sounds include the sound of a baseball bat hitting a ball and the sound of a drum being struck.
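A minimal sketch of such a threshold test is given below. It assumes that the "unit time" is a fixed-length, non-overlapping frame and that only increases in power (onsets) are counted; both choices, as well as the function names, are assumptions made for illustration.

```python
def short_time_power(x, frame_len):
    """Mean power of consecutive, non-overlapping frames of frame_len samples."""
    return [
        sum(v * v for v in x[i:i + frame_len]) / frame_len
        for i in range(0, len(x) - frame_len + 1, frame_len)
    ]


def impulse_frames(x, frame_len, th_b):
    """Indices of frames whose power rose by more than TH_B over the previous frame."""
    power = short_time_power(x, frame_len)
    return [k for k in range(1, len(power)) if power[k] - power[k - 1] > th_b]


def contains_impulse(x, frame_len, th_b):
    """True if at least one frame of the specific section shows such a rise."""
    return bool(impulse_frames(x, frame_len, th_b))
```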

The analysis unit 31 can also determine, based on the target acoustic signal in the specific section, whether that signal includes an acoustic signal of an animal's cry. Just as an utterance section is detected based on the characteristics of the human voice, this determination can be made by detecting a section in which an animal's cry is present based on the characteristics of that cry.

The animal's cry is, specifically, the cry of a dog or a cat. In the case of a dog, a database of dog cries is created in advance by learning various dog cries, and whether the target acoustic signal in the specific section includes an acoustic signal of a dog's cry can be determined by comparing that target acoustic signal with the database. This determination may also take the target video signal into account. For example, the video signal analysis unit 34 may analyze, based on the target video signal in the specific section, whether an image of a dog is included in the target moving image in that section, and the determination as to whether the target acoustic signal in the specific section includes an acoustic signal of a dog's cry may take that analysis result into account.

[Method for generating the expanded acoustic signal]
Next, the method by which the expansion unit 32 generates the expanded acoustic signal is described. The expansion unit 32 generates the expanded acoustic signal by applying to the target acoustic signal an expansion process adapted to the sound source type information and the like. An expansion process for an acoustic signal is a process that increases the signal length of the acoustic signal by stretching it in the time direction. The signal length of an acoustic signal is the time length of the section in which that acoustic signal exists. The time length of the specific section before the expansion process is assumed to be β seconds (β is an arbitrary positive number). In this example the reproduction rate is 1/10 of the shooting rate, so the time length of the specific section after the expansion process is (10 × β) seconds; the β seconds of target acoustic signal in the specific section is stretched tenfold by the expansion process to generate an expanded acoustic signal with a signal length of (10 × β) seconds. Of course, the expansion time (the time by which the signal is stretched by the expansion process) is changed according to the reproduction rate; for example, it is made longer as the reproduction rate becomes slower.

However, expanding the acoustic signal by exactly the amount corresponding to the reproduction rate may result in an unnatural sound, so the expansion time need not necessarily match the time corresponding to the difference between the shooting rate and the reproduction rate. For example, when the reproduction rate is 1/10 of the shooting rate, the β seconds of target acoustic signal may be stretched sixfold on the time axis to generate (6 × β) seconds of acoustic signal, and (4 × β) seconds of silence signal may be appended to it to generate (10 × β) seconds of expanded acoustic signal, as shown in FIG. 11. A silence signal is an acoustic signal whose signal level and power are zero (or substantially zero).
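The split between the sounding part and the silent part can be illustrated by the following sketch. The helper names are hypothetical; the stretched signal is taken as given, and silence is represented by zero-valued samples.

```python
def split_expansion(beta, total_factor=10.0, stretch_factor=6.0):
    """Return (sounding seconds, silent seconds) for a section of beta seconds."""
    if not 0.0 < stretch_factor <= total_factor:
        raise ValueError("stretch_factor must lie in (0, total_factor]")
    return stretch_factor * beta, (total_factor - stretch_factor) * beta


def pad_with_silence(stretched, target_len):
    """Append zero-valued samples so that the result has exactly target_len samples."""
    if len(stretched) > target_len:
        raise ValueError("stretched signal is already longer than the target")
    return list(stretched) + [0.0] * (target_len - len(stretched))
```

For β = 0.5 seconds and the factors of the example, `split_expansion(0.5)` gives 3.0 seconds of stretched sound followed by 2.0 seconds of silence, for a total of 5.0 seconds.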

As expansion processes that can be adopted by the expansion unit 32, simple expansion processing, pitch-maintaining expansion processing, echo processing, and repeat processing are described below.

Detailed examples are given later, but when the target acoustic signal includes acoustic signals from a plurality of sound sources of different types (for example, when an acoustic signal of a human voice and an acoustic signal of music are mixed in the target acoustic signal), the expansion unit 32 operates as shown in FIG. 12: it extracts the acoustic signals from the plurality of sound sources individually from the target acoustic signal as a plurality of separated acoustic signals while analyzing the sound source type of each separated acoustic signal, applies to each separated acoustic signal an expansion process corresponding to its sound source type, and then synthesizes the processed separated acoustic signals to generate the expanded acoustic signal.

Accordingly, the simple expansion processing, the pitch-maintaining expansion processing, and so on are executed individually for each separated acoustic signal. These expansion processes are therefore described below on the assumption that they are applied to a separated acoustic signal. When the target acoustic signal includes only an acoustic signal from a single sound source, the separated acoustic signal based on the target acoustic signal is the target acoustic signal itself. FIG. 12 is a conceptual diagram of the process of generating the expanded acoustic signal when the target acoustic signal includes acoustic signals from two types of sound sources (it is merely a conceptual diagram, and the waveforms and the like in FIG. 12 are not meant to be accurate).

--Simple expansion processing--
The simple expansion processing is described first. The separated acoustic signal of the specific section to which the simple expansion processing is to be applied is called acoustic signal A₁, and the acoustic signal obtained by applying the simple expansion processing to acoustic signal A₁ is called acoustic signal B₁. In this example, the length of the section in which acoustic signal B₁ exists is 10 times that of acoustic signal A₁. FIG. 13 is a conceptual diagram of the simple expansion processing. Acoustic signal B₁ is obtained by simply stretching acoustic signal A₁ tenfold on the time axis. Accordingly, a signal component of frequency f contained in acoustic signal A₁ is converted into a signal component of frequency (f/10) in acoustic signal B₁. Naturally, the simple expansion processing changes the pitch, so the musical interval is altered.
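One way to realize such a stretch on sampled data is to resample the signal onto a time axis that is `factor` times longer. The sketch below uses linear interpolation between neighboring samples, which is an assumption; the text does not specify an interpolation method, and `simple_expand` is a hypothetical name.

```python
def simple_expand(a1, factor):
    """Stretch a1 on the time axis by `factor` using linear interpolation.

    Output sample j is read from input position j / factor, so a component
    with a period of P samples comes out with a period of factor * P samples,
    i.e. its frequency is divided by `factor`.
    """
    n_out = int(round(len(a1) * factor))
    out = []
    for j in range(n_out):
        pos = j / factor                 # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = a1[min(i + 1, len(a1) - 1)]  # hold the last sample at the end
        out.append(a1[i] * (1.0 - frac) + nxt * frac)
    return out
```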

As described with reference to FIG. 11, acoustic signal B₁ of (10 × β) seconds may instead be generated by appending (4 × β) seconds of silence signal to the (6 × β) seconds of acoustic signal obtained by simply stretching acoustic signal A₁ sixfold.

--Pitch-maintaining expansion processing--
The pitch-maintaining expansion processing is described next. The separated acoustic signal of the specific section to which the pitch-maintaining expansion processing is to be applied is called acoustic signal A₂, and the acoustic signal obtained by applying the pitch-maintaining expansion processing to acoustic signal A₂ is called acoustic signal B₂. In this example, the length of the section in which acoustic signal B₂ exists is 10 times that of acoustic signal A₂.

In the pitch-maintaining expansion processing, the acoustic signal is expanded so that the pitch does not change between acoustic signals A₂ and B₂. A known speech rate conversion method can be used for this expansion. FIG. 14 is a conceptual diagram of the pitch-maintaining expansion processing. In the simplest case, for example, the specific section is divided into first to N-th blocks with a block length corresponding to the pitch of acoustic signal A₂ (N is an integer of 2 or more), and acoustic signal B₂ is generated by connecting, in this order, a signal in which acoustic signal A₂ of the first block is repeated 10 times, a signal in which acoustic signal A₂ of the second block is repeated 10 times, ..., a signal in which acoustic signal A₂ of the (N−1)-th block is repeated 10 times, and a signal in which acoustic signal A₂ of the N-th block is repeated 10 times.
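The simple block-repetition scheme described above can be sketched as follows. The block length is taken as a given parameter (the text derives it from the pitch of acoustic signal A₂), and blocks are joined without any smoothing; practical speech rate conversion methods typically cross-fade overlapping blocks to avoid discontinuities, which this sketch omits. The function name is hypothetical.

```python
def pitch_maintaining_expand(a2, block_len, repeats=10):
    """Split a2 into blocks of block_len samples and repeat each block in place."""
    if block_len < 1 or repeats < 1:
        raise ValueError("block_len and repeats must be positive")
    out = []
    for start in range(0, len(a2), block_len):
        block = list(a2[start:start + block_len])  # the last block may be shorter
        out.extend(block * repeats)
    return out
```

Because each block is replayed at its original sample rate, the waveform inside a block, and hence its pitch, is unchanged; only the duration of each sound grows.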

As described with reference to FIG. 11, acoustic signal B₂ may instead be generated by connecting, in this order, a signal in which acoustic signal A₂ of the first block is repeated 6 times, a signal in which acoustic signal A₂ of the second block is repeated 6 times, ..., a signal in which acoustic signal A₂ of the (N−1)-th block is repeated 6 times, a signal in which acoustic signal A₂ of the N-th block is repeated 6 times, and (4 × β) seconds of silence signal. With this method, however, the silence is concentrated in the latter part of acoustic signal B₂. To avoid this imbalance, acoustic signal B₂ may be generated by connecting, in this order, a signal in which acoustic signal A₂ of the first block is repeated 6 times, (4 × B_L[1]) seconds of silence signal, a signal in which acoustic signal A₂ of the second block is repeated 6 times, (4 × B_L[2]) seconds of silence signal, ..., a signal in which acoustic signal A₂ of the (N−1)-th block is repeated 6 times, (4 × B_L[N−1]) seconds of silence signal, a signal in which acoustic signal A₂ of the N-th block is repeated 6 times, and (4 × B_L[N]) seconds of silence signal. Here, B_L[i] denotes the block length of the i-th block (that is, the time length of the i-th block), where i is an integer.
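The variant with silence distributed after every block can be sketched in the same style. Lengths are counted in samples rather than seconds, silence is represented by zero-valued samples, and the function name is hypothetical.

```python
def pitch_maintaining_expand_with_gaps(a2, block_len, repeats=6, total_factor=10):
    """Repeat each block `repeats` times, then append silence of
    (total_factor - repeats) block lengths, so that every block occupies
    total_factor times its original length."""
    if block_len < 1 or not 1 <= repeats <= total_factor:
        raise ValueError("need block_len >= 1 and 1 <= repeats <= total_factor")
    gap_factor = total_factor - repeats
    out = []
    for start in range(0, len(a2), block_len):
        block = list(a2[start:start + block_len])
        out.extend(block * repeats)
        out.extend([0.0] * (gap_factor * len(block)))  # 4 x B_L[i] in the example
    return out
```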

--Echo processing--
The echo processing is described next. The separated acoustic signal of the specific section to which the echo processing is to be applied is called acoustic signal A₃, and the acoustic signal obtained by applying the echo processing to acoustic signal A₃ is called acoustic signal B₃. In this example, the length of the section in which acoustic signal B₃ exists is 10 times that of acoustic signal A₃.

In the echo processing, the same acoustic signal as acoustic signal A₃ is repeated a plurality of times while its signal level is gradually reduced. FIG. 15 is a conceptual diagram of the echo processing. Acoustic signal B₃ is the signal obtained by connecting echo signals A₃[1], A₃[2], A₃[3], A₃[4], A₃[5], A₃[6], A₃[7], A₃[8], A₃[9], and A₃[10] in this order. Here, the signal waveform of echo signal A₃[i] is similar to that of acoustic signal A₃, and the signal level and power of echo signal A₃[i+1] are smaller than those of echo signal A₃[i] (i is an integer). Accordingly, when acoustic signal B₃ is reproduced, acoustic signal A₃ is reproduced repeatedly while the volume gradually decreases. For example, if acoustic signal A₃ is the "whack" of a bat hitting a ball, reproduction after the echo processing repeats the "whack" 10 times while its volume is gradually reduced.
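A minimal sketch of the echo processing follows. It assumes that the first echo signal is A₃ at its original level and that each later echo is scaled by a constant factor `decay` relative to the previous one; the text only requires each echo to be weaker than the one before, so the geometric law, the default value 0.7, and the function name are assumptions.

```python
def echo_expand(a3, repeats=10, decay=0.7):
    """Concatenate `repeats` copies of a3, each scaled by `decay` relative to the previous copy."""
    if repeats < 1 or not 0.0 < decay < 1.0:
        raise ValueError("need repeats >= 1 and 0 < decay < 1")
    out = []
    gain = 1.0
    for _ in range(repeats):
        out.extend(v * gain for v in a3)  # echo signal A3[i]: same waveform, lower level
        gain *= decay
    return out
```

Changing `repeats` or `decay` corresponds to varying the number of echoes or the attenuation rate of the echo signal.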

As described with reference to FIG. 11, a signal obtained by connecting echo signals A₃[1], A₃[2], A₃[3], A₃[4], A₃[5], and A₃[6] and then appending (4 × β) seconds of silence signal may instead be generated as acoustic signal B₃. Further, the number of echoes (that is, the number of times echo signal A₃[i] is repeated), the echo duration (that is, the time over which echo signal A₃[i] is repeated), and/or the attenuation rate of the echo signal (that is, the attenuation of the signal level of echo signal A₃[i+1] relative to that of echo signal A₃[i]) may be changed according to the reproduction rate.

――リピート処理―― -Repeat processing-
リピート処理について説明する。リピート処理が施されるべき、特定区間の分離音響信号を音響信号Ａ₄と呼び、音響信号Ａ₄にリピート処理を施して得た音響信号を音響信号Ｂ₄と呼ぶ。本例において、音響信号Ｂ₄の存在する区間長さは、音響信号Ａ₄のそれの１０倍である。 The repeat processing will be described. The separated acoustic signal of the specific section to be subjected to the repeat processing is referred to as the acoustic signal A₄, and the acoustic signal obtained by applying the repeat processing to the acoustic signal A₄ is referred to as the acoustic signal B₄. In this example, the length of the section in which the acoustic signal B₄ exists is 10 times that of the acoustic signal A₄.

リピート処理では、音響信号Ａ₄と同じ音響信号を単純に複数回繰り返す。つまり、音響信号Ｂ₄は、リピート信号Ａ_4[1]、Ａ_4[2]、Ａ_4[3]、Ａ_4[4]、Ａ_4[5]、Ａ_4[6]、Ａ_4[7]、Ａ_4[8]、Ａ_4[9]及びＡ_4[10]をこの順番で接続した信号であり、リピート信号Ａ_4[1]〜Ａ_4[10]の夫々は、信号レベルも含め、音響信号Ａ₄と同じものである。従って例えば、音響信号Ａ₄が或る音楽の音響信号である場合、リピート処理を経て得られた音響信号Ｂ₄の再生時には、その音楽が音程の変質等を伴うことなく、（１０×β）秒分の特定区間において通常の再生速度で繰り返し再生される。 In the repeat processing, the same acoustic signal as the acoustic signal A₄ is simply repeated a plurality of times. That is, the acoustic signal B₄ is a signal in which the repeat signals A_4[1], A_4[2], A_4[3], A_4[4], A_4[5], A_4[6], A_4[7], A_4[8], A_4[9], and A_4[10] are connected in this order, and each of the repeat signals A_4[1] to A_4[10] is identical to the acoustic signal A₄, including its signal level. Therefore, for example, when the acoustic signal A₄ is the acoustic signal of a piece of music, reproducing the acoustic signal B₄ obtained through the repeat processing plays that music repeatedly at the normal playback speed over the specific section of (10 × β) seconds, without any alteration of its pitch.
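As a rough illustration of the repeat processing, the sketch below tiles the input without changing its level. The function name `repeat_extend` and the `factor` parameter are hypothetical names chosen for this example.

```python
import numpy as np

def repeat_extend(a4, factor=10):
    """Build B4 by concatenating `factor` unmodified copies of A4."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # Every repeat signal A4[i] equals A4, signal level included.
    return np.tile(np.asarray(a4, dtype=float), factor)
```

Because no resampling is involved, the samples play back at their original rate, which is why the pitch of the music is unchanged.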

伸張部３２は、音源種類情報等に応じて分離音響信号に対して成すべき伸張処理の内容を変更する。例えば、注目した分離音響信号の音源の種類が人の声であると判断される場合においては、その注目した分離音響信号に対してピッチ維持伸張処理を行い、注目した分離音響信号の音源の種類がインパルス音であると判断される場合においては、その注目した分離音響信号に対してエコー処理を行うことができる。 The extension unit 32 changes the content of the extension processing to be applied to a separated acoustic signal according to the sound source type information and the like. For example, when the sound source type of the separated acoustic signal of interest is determined to be a human voice, the pitch-keeping extension processing is applied to that signal; when the sound source type is determined to be an impulse sound, the echo processing can be applied to that signal.

また例えば、注目した分離音響信号の音源の種類が音楽であると判断される場合においては、その注目した分離音響信号に対してリピート処理を行うことができる、或いは、その注目した分離音響信号を削除するようにしても良い（つまり、音楽の信号成分を伸張音響信号から除外するようにしても良い）、更に或いは、その注目した分離音響信号の信号レベルを低減するようにしても良い。或る特定の音響信号を削除するとは、その特定の音響信号の信号成分が伸張音響信号に含まれなくなるように、その特定の音響信号の信号成分を伸張処理の過程で対象音響信号から削除する操作を指す。このように、分離音響信号の音源の種類が人の声であるのか否かに応じて伸張処理の方法を変更することができる。また、映像解析情報にも応じて伸張処理の内容を決定するようにしても良い（映像解析情報の利用例は、後述の第２の伸張具体例にて詳説）。 Also, for example, when the sound source type of the separated acoustic signal of interest is determined to be music, the repeat processing can be applied to that signal; alternatively, that signal may be deleted (that is, the music signal component may be excluded from the extended acoustic signal), or its signal level may be reduced. Deleting a specific acoustic signal refers to an operation of removing the signal component of that specific acoustic signal from the target acoustic signal in the course of the extension processing so that the component is not included in the extended acoustic signal. In this way, the method of extension processing can be changed depending on whether or not the sound source type of the separated acoustic signal is a human voice. The content of the extension processing may also be determined according to the video analysis information (a usage example of the video analysis information is detailed in the second extension specific example described later).
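The selection rule in the two paragraphs above can be summarized as a small lookup. The enum names, the function `choose_process`, and the `music_policy` parameter are hypothetical and exist only for this sketch; the source does not say how the choice among the three music options is made.

```python
from enum import Enum

class SourceType(Enum):
    VOICE = "voice"
    IMPULSE = "impulse"
    MUSIC = "music"

class Process(Enum):
    PITCH_KEEP = "pitch-keeping extension"
    ECHO = "echo"
    REPEAT = "repeat"
    DELETE = "delete"
    ATTENUATE = "attenuate"

# The three options the text allows for music.
MUSIC_OPTIONS = {Process.REPEAT, Process.DELETE, Process.ATTENUATE}

def choose_process(source_type, music_policy=Process.REPEAT):
    """Map a sound source type to the processing named in the text."""
    if source_type is SourceType.VOICE:
        return Process.PITCH_KEEP
    if source_type is SourceType.IMPULSE:
        return Process.ECHO
    if source_type is SourceType.MUSIC:
        if music_policy not in MUSIC_OPTIONS:
            raise ValueError("music may only be repeated, deleted, or attenuated")
        return music_policy
    raise ValueError(f"unknown source type: {source_type!r}")
```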

次に、音源種類情報等に基づく伸張処理の、様々な状況に応じた具体例として、第１〜第４の伸張具体例を説明する。 Next, first to fourth expansion specific examples will be described as specific examples according to various situations of the expansion processing based on the sound source type information and the like.

［第１の伸張具体例］ [First extension specific example]
第１の伸張具体例を説明する。第１の伸張具体例では、野球の試合においてバッターがバットでボールを打撃する様子が対象動画像として撮影されたことを想定する。そして、対象音響信号には、バットでボールを打撃する時に生じる打撃音の音響信号に加え、野球の出場選手を応援している人の歓声の音響信号が含まれているものとする。 A first extension specific example will be described. In the first extension specific example, it is assumed that a batter hitting a ball with a bat in a baseball game is captured as the target moving image. The target acoustic signal is assumed to include, in addition to the acoustic signal of the hitting sound generated when the bat strikes the ball, the acoustic signal of cheers from people supporting the players.

解析部３１及び伸張部３２は、対象音響信号を解析することで対象音響信号から打撃音の音響信号と歓声の音響信号を別々に分離音響信号として抽出し、打撃音の分離音響信号に対してはエコー処理を施す一方で歓声の分離音響信号に対してはピッチ維持伸張処理を施す。そして、エコー処理後の打撃音の分離音響信号とピッチ維持伸張処理後の歓声の分離音響信号を合成することで伸張音響信号を生成する。 The analysis unit 31 and the extension unit 32 analyze the target acoustic signal to extract the acoustic signal of the hitting sound and the acoustic signal of the cheers from it as separate separated acoustic signals, apply the echo processing to the separated acoustic signal of the hitting sound, and apply the pitch-keeping extension processing to the separated acoustic signal of the cheers. The extended acoustic signal is then generated by combining the separated acoustic signal of the hitting sound after the echo processing with the separated acoustic signal of the cheers after the pitch-keeping extension processing.

図１６（ａ）は、第１の伸張具体例の想定下における対象音響信号及び対象動画像の通常再生のイメージ図であり、図１６（ｂ）は、第１の伸張具体例に係る、伸張音響信号の再生を伴う対象動画像のスロー再生のイメージ図である。対象動画像のスロー再生時には、歓声がピッチを維持した状態でスロー再生される一方で打撃の瞬間が表示される周辺区間においては打撃音である「カキーン」という音が音量の漸次低減を伴いながら繰り返し出力される。尚、このシーンでは、打撃の瞬間が最も重要なタイミングであるため、打撃の瞬間を含む区間においては、歓声の音量をなるだけ低減させることが望ましい。 FIG. 16A is a conceptual diagram of normal reproduction of the target acoustic signal and the target moving image under the assumptions of the first extension specific example, and FIG. 16B is a conceptual diagram of slow reproduction of the target moving image accompanied by reproduction of the extended acoustic signal according to the first extension specific example. During slow reproduction of the target moving image, the cheers are reproduced slowly with their pitch maintained, while in the section around the displayed moment of impact the hitting sound (the "whack") is output repeatedly with gradually decreasing volume. In this scene, the moment of impact is the most important timing, so it is desirable to reduce the volume of the cheers as much as possible in the section that includes the moment of impact.

第１の伸張具体例における分離音響信号及び伸張音響信号の生成方法を、より具体的に説明する。図１７に示す如く、対象音響信号の全区間が３つの区間Ｐ_1A、Ｐ_1B及びＰ_1Cに分類され、区間Ｐ_1A及びＰ_1Cには歓声の音響信号のみが存在し、区間Ｐ_1Bには打撃音と歓声の音響信号が存在する場合を想定する。 A method for generating the separated acoustic signal and the extended acoustic signal in the first extension specific example will be described more specifically. As shown in FIG. 17, all the sections of the target acoustic signal are classified into three sections P _1A , P _1B and P _1C , and only the cheering sound signal exists in the sections P _1A and P _1C , and in the section P _1B Assume that there is a sound signal of hitting sound and cheers.

区間Ｐ_1A及びＰ_1Cにおける対象音響信号には歓声（即ち、人の声）の音響信号のみが含まれているため、解析部３１は、上述した方法によって、区間Ｐ_1A及びＰ_1Cにおける対象音響信号に人の声による音響信号が含まれていることを容易に知ることができる。更に、解析部３１は、区間Ｐ_1Bを特定区間とみなした上で、特定区間の対象音響信号にインパルス音による音響信号が含まれているか否かを判断する上述の方法を用いることで、区間Ｐ_1Bにおける対象音響信号にインパルス音による音響信号が含まれていることを知ることができる。 Since the target acoustic signal in the sections P_1A and P_1C contains only the acoustic signal of cheers (that is, human voices), the analysis unit 31 can easily determine, by the method described above, that the target acoustic signal in the sections P_1A and P_1C contains an acoustic signal of a human voice. Furthermore, by regarding the section P_1B as the specific section and using the above-described method for determining whether the target acoustic signal in the specific section contains an acoustic signal of an impulse sound, the analysis unit 31 can determine that the target acoustic signal in the section P_1B contains an acoustic signal of an impulse sound.

区間Ｐ_1A及びＰ_1Cにおける対象音響信号に人の声の音響信号が含まれているため、解析部３１又は伸張部３２は、区間Ｐ_1Bにおける対象音響信号にも人の声の音響信号が含まれていると推測することができる。伸張部３２は、区間Ｐ_1Bにおける対象音響信号から人の声の音響信号とインパルス音（今の例において打撃音）の音響信号を分離抽出すべく、区間Ｐ_1Bにおける時間軸上の対象音響信号に対してフーリエ変換を行うことで区間Ｐ_1Bにおける周波数軸上の対象音響信号、即ち、区間Ｐ_1Bにおける対象音響信号の周波数スペクトルを生成する。フーリエ変換として、離散フーリエ変換が用いられる。 Since the target acoustic signal in the sections P_1A and P_1C contains the acoustic signal of a human voice, the analysis unit 31 or the extension unit 32 can infer that the target acoustic signal in the section P_1B also contains the acoustic signal of a human voice. To separate and extract the acoustic signal of the human voice and the acoustic signal of the impulse sound (the hitting sound in this example) from the target acoustic signal in the section P_1B, the extension unit 32 applies a Fourier transform to the time-axis target acoustic signal in the section P_1B, thereby generating the frequency-axis target acoustic signal in the section P_1B, that is, the frequency spectrum of the target acoustic signal in the section P_1B. A discrete Fourier transform is used as the Fourier transform.

図１８（ａ）におけるグラフには、区間Ｐ_1Bにおける対象音響信号の周波数スペクトル３１０の各スペクトル成分が示されている。周波数スペクトル３１０は、実線３１１で表される人の声のスペクトル成分と破線３１２で表されるインパルス音のスペクトル成分とを足し合わせたものとなる。人の声のスペクトル成分３１１は周波数の変化に対して周期的に変動する一方、広範な周波数成分の足し合わせに相当するインパルス音のスペクトル成分３１２は周波数の変化に対して周期的に変動するような性質を有さない。 The graph in FIG. 18A shows the spectral components of the frequency spectrum 310 of the target acoustic signal in the section P_1B. The frequency spectrum 310 is the sum of the spectral component of the human voice, represented by the solid line 311, and the spectral component of the impulse sound, represented by the broken line 312. The spectral component 311 of the human voice varies periodically with frequency, whereas the spectral component 312 of the impulse sound, which corresponds to the sum of a wide range of frequency components, does not vary periodically with frequency.

このような性質に注目し、伸張部３２は、周波数スペクトル３１０に対して、もう一度、フーリエ変換を施す。周波数軸上の音響信号にフーリエ変換を施すことで、音響信号がＦ軸上の音響信号に変換されるものとする。図１８（ｂ）におけるグラフは、区間Ｐ_1BにおけるＦ軸上の対象音響信号３２０を表している。Ｆ軸上の対象音響信号３２０は、実線３２１で表される人の声の信号成分と破線３２２で表されるインパルス音の信号成分とを足し合わせたものとなる。上述したような性質から、Ｆ軸上では、人の声の信号成分とインパルス音の信号成分とが分離して存在することとなる。周波数軸上の或る注目音響信号が周波数の変化に対して周期的に変動している場合において、その変動の周期が短くなると、Ｆ軸上における注目音響信号はより高域側にシフトするものとする。 Focusing on this property, the extension unit 32 applies a Fourier transform once more, this time to the frequency spectrum 310. Applying a Fourier transform to an acoustic signal on the frequency axis is assumed to convert it into an acoustic signal on the F axis. The graph in FIG. 18B shows the target acoustic signal 320 on the F axis in the section P_1B. The target acoustic signal 320 on the F axis is the sum of the signal component of the human voice, represented by the solid line 321, and the signal component of the impulse sound, represented by the broken line 322. Because of the property described above, the signal component of the human voice and that of the impulse sound exist separately on the F axis. When an acoustic signal of interest on the frequency axis varies periodically with frequency, a shorter period of that variation shifts the signal of interest toward the higher side on the F axis.

伸張部３２は、信号成分３２１の、Ｆ軸上の周波数が所定の音声周波数範囲に収まっている場合、信号成分３２１は人の声の信号成分であると判断することができ、そうでない場合、信号成分３２１は人の声の信号成分ではないと判断することができる。今、信号成分３２１の、Ｆ軸上の周波数が所定の音声周波数範囲に収まっているものとする。 When the frequency of the signal component 321 on the F axis falls within a predetermined voice frequency range, the extension unit 32 can determine that the signal component 321 is a signal component of a human voice; otherwise, it can determine that the signal component 321 is not a signal component of a human voice. Here, it is assumed that the frequency of the signal component 321 on the F axis falls within the predetermined voice frequency range.

伸張部３２は、Ｆ軸上の対象音響信号３２０の内、Ｆ軸の高域側に位置している信号成分（即ち、信号成分３２１）が人の声の信号成分であって且つＦ軸の低域側に位置している信号成分（即ち、信号成分３２２）がインパルス音の信号成分であるとみなし、前者の信号成分（即ち、信号成分３２１）と後者の信号成分（即ち、信号成分３２２）に対して個別に２回、逆フーリエ変換を施す。逆フーリエ変換として、離散逆フーリエ変換が用いられる。これにより、信号成分３２１から、区間Ｐ_1Bにおける人の声による時間軸上の分離音響信号が生成され、信号成分３２２から、区間Ｐ_1Bにおけるインパルス音による時間軸上の分離音響信号が生成される。尚、区間Ｐ_1Aにおける人の声による時間軸上の分離音響信号（即ち、区間Ｐ_1Aにおける対象音響信号）及び／又は区間Ｐ_1Cにおける人の声による時間軸上の分離音響信号（即ち、区間Ｐ_1Cにおける対象音響信号）から、区間Ｐ_1Bにおける人の声による時間軸上の分離音響信号を推定するようにしても良い。 Of the target acoustic signal 320 on the F axis, the extension unit 32 regards the signal component located on the high side of the F axis (that is, the signal component 321) as the signal component of the human voice and the signal component located on the low side of the F axis (that is, the signal component 322) as the signal component of the impulse sound, and applies an inverse Fourier transform twice to each of the former (the signal component 321) and the latter (the signal component 322) individually. A discrete inverse Fourier transform is used as the inverse Fourier transform. As a result, the time-axis separated acoustic signal of the human voice in the section P_1B is generated from the signal component 321, and the time-axis separated acoustic signal of the impulse sound in the section P_1B is generated from the signal component 322. Alternatively, the time-axis separated acoustic signal of the human voice in the section P_1B may be estimated from the time-axis separated acoustic signal of the human voice in the section P_1A (that is, the target acoustic signal in the section P_1A) and/or that in the section P_1C (that is, the target acoustic signal in the section P_1C).
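The transform, split, and inverse-transform sequence above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the text does not say whether the second transform is applied to the complex spectrum or to a magnitude, so the sketch keeps the complex values, which lets two inverse transforms return exactly to the time axis. The function name `split_on_f_axis` and the `cutoff` index separating the low and high sides of the F axis are hypothetical.

```python
import numpy as np

def split_on_f_axis(x, cutoff):
    """Split x into (high-side, low-side) parts on the F axis.

    Steps: DFT (time -> frequency), DFT again (frequency -> "F axis"),
    mask the low and high sides, then two inverse DFTs per part.
    `cutoff` is an assumed boundary index, not a value from the source.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)        # frequency-axis signal
    f_axis = np.fft.fft(spectrum)   # F-axis signal
    k = np.arange(n)
    # Distance from index 0, taking DFT wrap-around into account.
    dist = np.minimum(k, n - k)
    low_mask = dist < cutoff
    low = np.where(low_mask, f_axis, 0.0)    # treated as impulse sound
    high = np.where(low_mask, 0.0, f_axis)   # treated as human voice

    def back_to_time(component):
        # Two inverse DFTs: F axis -> frequency axis -> time axis.
        return np.fft.ifft(np.fft.ifft(component)).real

    return back_to_time(high), back_to_time(low)
```

Because every step is linear, the two returned parts add up to the original signal, so nothing is lost by the split. How well the parts correspond to voice and impulse sound depends on the choice of `cutoff` and on the details the text leaves open.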

逆フーリエ変換を介して得た、区間Ｐ_1Bにおける人の声及びインパルス音の分離音響信号に対して、互いに異なる伸張処理が施される。一方、区間Ｐ_1A及びＰ_1Cにおける対象音響信号には人の声の音響信号しか含まれていないため、区間Ｐ_1A及びＰ_1Cに対しては対象音響信号そのものにピッチ維持伸張処理が施される。つまり、伸張部３２は、区間Ｐ_1Aにおける対象音響信号、区間Ｐ_1Bにおける人の声の分離音響信号及び区間Ｐ_1Cにおける対象音響信号にピッチ維持伸張処理を施して時間的に接続することで伸張音響信号の第１成分を生成し、一方で、区間Ｐ_1Bにおけるインパルス音の分離音響信号に対してエコー処理を施すことで伸張音響信号の第２成分を生成する。ここで、伸張音響信号の第１成分は全区間における音響信号を含むが、伸張音響信号の第２成分は区間Ｐ_1Bにおける音響信号しか含まない。伸張部３２は、伸張音響信号の第１成分及び第２成分を合成することで、最終的な伸張音響信号を生成する。 Mutually different extension processing is applied to the separated acoustic signals of the human voice and of the impulse sound in the section P_1B obtained through the inverse Fourier transform. On the other hand, since the target acoustic signal in the sections P_1A and P_1C contains only the acoustic signal of the human voice, the pitch-keeping extension processing is applied to the target acoustic signal itself for the sections P_1A and P_1C. That is, the extension unit 32 generates the first component of the extended acoustic signal by applying the pitch-keeping extension processing to the target acoustic signal in the section P_1A, the separated acoustic signal of the human voice in the section P_1B, and the target acoustic signal in the section P_1C and connecting the results in time, and generates the second component of the extended acoustic signal by applying the echo processing to the separated acoustic signal of the impulse sound in the section P_1B. Here, the first component of the extended acoustic signal contains the acoustic signal over all sections, whereas the second component contains only the acoustic signal in the section P_1B. The extension unit 32 generates the final extended acoustic signal by combining the first and second components of the extended acoustic signal.
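The final assembly step can be pictured as follows. The sketch takes already-processed arrays as input and assumes that combining the two components means sample-wise addition, which the text does not state explicitly; the function name `assemble_extended` and the `offset` parameter (the sample index where the extended section P_1B starts) are hypothetical.

```python
import numpy as np

def assemble_extended(first_parts, echo_b, offset):
    """Combine the first and second components of the extended signal.

    first_parts: pitch-keeping-extended signals for P_1A, P_1B (voice), P_1C.
    echo_b: echo-processed impulse sound for P_1B.
    offset: assumed start index of the extended P_1B section.
    """
    # First component: covers all sections, connected in time.
    first = np.concatenate([np.asarray(p, dtype=float) for p in first_parts])
    echo_b = np.asarray(echo_b, dtype=float)
    if offset < 0 or offset + len(echo_b) > len(first):
        raise ValueError("echo does not fit inside the extended signal")
    # Second component: silent except within the P_1B section.
    second = np.zeros_like(first)
    second[offset:offset + len(echo_b)] = echo_b
    # Assumed mixing rule: plain addition of the two components.
    return first + second
```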

上述のようにして得られる伸張音響信号を映像のスロー再生と共に再生することで、野球の打撃シーンを迫力のあるシーンとして再生することができる。 By playing the extended acoustic signal obtained as described above together with the slow playback of the video, the baseball batting scene can be played as a powerful scene.

［第２の伸張具体例］ [Second extension specific example]
第２の伸張具体例を説明する。第２の伸張具体例では、公園などにおいて子供の遊んでいる様子が対象動画像として撮影されたことを想定する。撮影対象となる子供を、特に注目人物と呼ぶ。そして、対象音響信号には、注目人物の声の音響信号に加え、公園内にいる他の人（以下、非注目人物という）の声の音響信号が含まれていることを想定する。 A second extension specific example will be described. In the second extension specific example, it is assumed that a child playing in a park or the like is captured as the target moving image. The child being photographed is specifically referred to as the person of interest. The target acoustic signal is assumed to include, in addition to the acoustic signal of the voice of the person of interest, the acoustic signals of the voices of other people in the park (hereinafter referred to as non-attention persons).

図１９（ａ）は、第２の伸張具体例の想定下における対象音響信号及び対象動画像の通常再生のイメージ図であり、図１９（ｂ）は、第２の伸張具体例に係る、伸張音響信号の再生を伴う対象動画像のスロー再生のイメージ図である。対象動画像のスロー再生時には、注目人物の声がピッチを維持した状態でスロー再生される。 FIG. 19A is an image diagram of normal reproduction of the target sound signal and target moving image under the assumption of the second extension specific example, and FIG. 19B is an extension sound according to the second extension specific example. It is an image figure of slow reproduction of the object moving picture accompanied with reproduction of a signal. At the time of slow playback of the target moving image, the voice of the person of interest is played back slowly while maintaining the pitch.

第２の伸張具体例では、伸張音響信号の生成に当たり、対象音響信号の解析結果に加えて対象映像信号の解析結果もが利用される。具体的には、以下のように処理される。 In the second extension specific example, the analysis result of the target video signal is used in addition to the analysis result of the target sound signal in generating the extension sound signal. Specifically, the processing is as follows.

映像信号解析部３４は、対象映像信号に基づき、基準顔サイズ以上の大きさを有する人の顔が対象動画像上に含まれているか否かを判断する。今、対象動画像上に注目人物の顔が存在しており、対象動画像上における注目人物の顔の大きさが所定の基準顔サイズ以上であったとする。そうすると、映像信号解析部３４は、基準顔サイズ以上の大きさを有する顔（人の顔）が対象動画像上に含まれていると判断し、その判断結果を含む映像解析情報を伸張部３２に送る。このような映像解析情報が送られてくると、伸張部３２は、その映像解析情報と解析部３１から音源種類情報に基づき、対象音響信号に対してピッチ維持伸張処理だけでなく正面音強調処理を施し、それらの処理後の対象音響信号を伸張音響信号として出力する。尚、対象動画像上に基準顔サイズ以上の大きさを有する人の顔が含まれていない場合、対象音響信号に対して正面音強調処理は成されない。 Based on the target video signal, the video signal analysis unit 34 determines whether or not the target moving image contains a human face whose size is equal to or larger than a reference face size. Suppose now that the face of the person of interest is present in the target moving image and that its size in the target moving image is equal to or larger than the predetermined reference face size. The video signal analysis unit 34 then determines that a face (a human face) whose size is equal to or larger than the reference face size is contained in the target moving image, and sends video analysis information including that determination result to the extension unit 32. On receiving such video analysis information, the extension unit 32 applies not only the pitch-keeping extension processing but also the front sound enhancement processing to the target acoustic signal, based on the video analysis information and the sound source type information from the analysis unit 31, and outputs the processed target acoustic signal as the extended acoustic signal. When the target moving image does not contain a human face whose size is equal to or larger than the reference face size, the front sound enhancement processing is not applied to the target acoustic signal.

正面音強調処理は、対象音響信号の内、撮像装置１の正面方向から到来した音（以下、正面音という）の信号成分を強調する処理、または、それ以外の方向から到来した音（以下、非正面音）の信号成分を低減する処理である。或いは、前者の処理と後者の処理を共に正面音強調処理において実行するようにしても良い。 The front sound enhancement process is a process of enhancing a signal component of a sound that has arrived from the front direction of the imaging device 1 (hereinafter referred to as a front sound) in the target sound signal, or a sound that has arrived from another direction (hereinafter, referred to as a front sound). This is processing for reducing signal components of non-frontal sound. Alternatively, both the former process and the latter process may be executed in the front sound enhancement process.

例えば、図２０に示す如く、左チャンネルのマイクロホン１３Ｌの振動板中心と右チャンネルのマイクロホン１３Ｒの振動板中心との中点を原点Ｏとし、両振動板中心を結ぶ直線をＸ軸とし、Ｘ軸と直交し且つ原点Ｏを通る直線をＹ軸と定義する。ＸＹ座標面は、Ｘ軸及びＹ軸を座標軸として持つ座標面である。更に、マイクロホン１３Ｌからマイクロホン１３Ｒに向かう方向がＸ軸の正の方向であって、原点ＯからＹ軸の正側に向かう方向が撮像装置１にとっての前方であると定義する（図４も参照）。図２０において、線分３３１及び３３２は、原点Ｏを通り且つＹ軸と３０°の角度を以って交差する線分である。但し、線分３３１は原点ＯからＸＹ座標面上の第１象限に向かって伸び、線分３３２は原点ＯからＸＹ座標面上の第２象限に向かって伸びる。Ｙ軸は、撮像部１１の光軸と略平行であり、線分３３１から線分３３２に向かう時に横切る、６０°の範囲内に位置する物体が概ね撮像部１１の撮像対象となる。説明の簡略化上、Ｘ軸及びＹ軸の夫々に直交するＺ軸方向の存在を無視するが、実際には、撮像部１１の撮影範囲はＺ軸方向にも広がっている。 For example, as shown in FIG. 20, the midpoint between the diaphragm center of the left-channel microphone 13L and the diaphragm center of the right-channel microphone 13R is taken as the origin O, the straight line connecting the two diaphragm centers is defined as the X axis, and the straight line orthogonal to the X axis and passing through the origin O is defined as the Y axis. The XY coordinate plane is the coordinate plane having the X axis and the Y axis as its coordinate axes. Furthermore, the direction from the microphone 13L toward the microphone 13R is defined as the positive direction of the X axis, and the direction from the origin O toward the positive side of the Y axis is defined as the front of the imaging apparatus 1 (see also FIG. 4). In FIG. 20, the line segments 331 and 332 pass through the origin O and intersect the Y axis at an angle of 30°. The line segment 331 extends from the origin O into the first quadrant of the XY coordinate plane, and the line segment 332 extends from the origin O into the second quadrant. The Y axis is substantially parallel to the optical axis of the imaging unit 11, and objects located within the 60° range swept when moving from the line segment 331 to the line segment 332 are, in general, the imaging targets of the imaging unit 11. For simplicity of explanation, the Z-axis direction orthogonal to both the X axis and the Y axis is ignored, but in practice the imaging range of the imaging unit 11 also extends in the Z-axis direction.

ＸＹ座標面の第１象限内であって且つ線分３３１よりもＹ軸側に位置する音源から到来する音及びＸＹ座標面の第２象限内であって且つ線分３３２よりもＹ軸側に位置する音源から到来する音を正面音とみなし、それら以外の音源からの音を非正面音とみなす。正面音強調処理では、左チャンネルの対象音響信号及び右チャンネルの対象音響信号の位相差に基づき、左チャンネル及び右チャンネルの対象音響信号の内、正面音の音響信号成分を強調する、及び／又は、非正面音の音響信号成分を低減する（非正面音の音響信号成分を完全に削除するようにしても良い）。尚、位相差情報に基づき特定方向から到来した音の信号成分を強調又は低減する方法として、公知の方法を含む任意の方法を用いることができる。 Sound coming from a sound source located in the first quadrant of the XY coordinate plane and closer to the Y axis than the line segment 331 and sound coming from the sound source located in the second quadrant of the XY coordinate plane and closer to the Y axis than the line segment 332 Sounds coming from the sound sources located are considered as front sounds, and sounds from other sound sources are considered as non-front sounds. In the front sound enhancement process, based on the phase difference between the target acoustic signal of the left channel and the target acoustic signal of the right channel, the acoustic signal component of the front sound is enhanced among the target acoustic signals of the left channel and the right channel, and / or The acoustic signal component of the non-frontal sound is reduced (the acoustic signal component of the non-frontal sound may be completely deleted). As a method for enhancing or reducing the signal component of the sound coming from a specific direction based on the phase difference information, any method including a known method can be used.
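Since the text leaves the enhancement method open, the following is just one simple possibility, shown as a sketch. For each frequency bin it converts the inter-channel phase difference into an arrival angle measured from the Y axis and attenuates bins that fall outside ±30°. The function name `enhance_front`, the parameters `mic_distance`, `side_gain`, and `c` (speed of sound), and the single-frame FFT are assumptions made for this example. The sketch also ignores phase wrapping, so it is only meaningful at frequencies where the true phase difference stays within ±π.

```python
import numpy as np

def enhance_front(left, right, fs, mic_distance,
                  max_angle_deg=30.0, side_gain=0.1, c=343.0):
    """Attenuate frequency bins whose estimated arrival angle is non-frontal.

    Model assumed here: a plane wave arriving at angle theta from the Y axis
    reaches the two microphones with delay mic_distance * sin(theta) / c,
    i.e. a phase difference of 2*pi*f*delay at frequency f.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    spec_l = np.fft.rfft(left)
    spec_r = np.fft.rfft(right)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Per-bin phase difference between the channels, in (-pi, pi].
    phase_diff = np.angle(spec_l * np.conj(spec_r))
    # Avoid division by zero at DC; the DC bin is treated as frontal.
    safe_f = np.where(freqs > 0, freqs, np.inf)
    sin_theta = phase_diff * c / (2.0 * np.pi * safe_f * mic_distance)
    is_front = np.abs(sin_theta) <= np.sin(np.radians(max_angle_deg))
    gain = np.where(is_front, 1.0, side_gain)
    # Same gain on both channels so the stereo image is preserved.
    return np.fft.irfft(spec_l * gain, n), np.fft.irfft(spec_r * gain, n)
```

Setting `side_gain=0.0` corresponds to removing the non-frontal components completely, which the text also allows. A practical version would process short overlapping frames rather than one long block.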

上述のようなピッチ維持伸張処理及び正面音強調処理を介して得られる伸張音響信号を再生すると、注目人物の声のピッチが維持された状態で、注目人物の声の音量が非注目人物のそれに対して大きくなり、注目人物の声が聴きとりやすくなる。 When the extended acoustic signal obtained through the pitch maintaining / extending process and the front sound emphasizing process as described above is reproduced, the volume of the voice of the person of interest is that of the non-person of interest while the pitch of the voice of the person of interest is maintained. On the other hand, it becomes louder and it becomes easier to hear the voice of the person of interest.

尚、対象動画像から登録人物の顔が検出された場合にのみ、上述の正面音強調処理を行うようにしても良い。つまり、注目人物となるべき登録人物の顔画像を予め撮像装置１に登録しておき、映像信号解析部３４にて、対象映像信号に基づき該顔画像と対象動画像の各部の画像とを対比することで対象動画像上に登録人物の顔が存在しているか否かを検出する。そして、対象動画像上に登録人物の顔が存在していると判断された場合にのみ、上述の正面音強調処理を行うようにしても良い。 Note that the above-described front sound enhancement processing may be performed only when a registered person's face is detected from the target moving image. That is, a face image of a registered person to be a person of interest is registered in the imaging device 1 in advance, and the video signal analysis unit 34 compares the face image with images of each part of the target moving image based on the target video signal. By doing so, it is detected whether or not the face of the registered person exists on the target moving image. The front sound enhancement process described above may be performed only when it is determined that the face of the registered person exists on the target moving image.

［第３の伸張具体例］ [Third extension specific example]
第３の伸張具体例を説明する。第３の伸張具体例では、運動会の徒競走において注目人物がゴール地点を走り抜ける様子が対象動画像として撮影されたことを想定する。そして、対象音響信号には、徒競走の審判による「ゴール」という掛け声（以下、ゴール発声という）の音響信号、周辺で応援している人の歓声による音響信号、及び、周辺で鳴っている音楽（以下、ＢＧＭという）の音響信号が含まれているものとする。また、対象音響信号において、ゴール発声の音響信号の信号レベルは、歓声のそれよりも十分に大きいものとする。 A third extension specific example will be described. In the third extension specific example, it is assumed that the person of interest running through the finish line in a footrace at a school sports day is captured as the target moving image. The target acoustic signal is assumed to include the acoustic signal of a shout of "Goal" by the footrace referee (hereinafter referred to as the goal utterance), the acoustic signal of cheers from people cheering nearby, and the acoustic signal of music playing nearby (hereinafter referred to as BGM). It is also assumed that, in the target acoustic signal, the signal level of the acoustic signal of the goal utterance is sufficiently higher than that of the cheers.

解析部３１及び伸張部３２は、対象音響信号を解析することで対象音響信号からゴール発声の音響信号、歓声による音響信号及びＢＧＭの音響信号を別々に分離音響信号として抽出し、ゴール発声の分離音響信号に対してはピッチ維持伸張処理（又はエコー処理）を施し、歓声の分離音響信号に対しては音量を低減しつつピッチ維持伸張処理を施し、ＢＧＭの分離音響信号に対してはリピート処理を施す。そして、それらの処理後の分離音響信号を合成することで伸張音響信号を生成する。尚、ＢＧＭの分離音響信号の音量を低減させた上でリピート処理を行うようにしても良いし、ＢＧＭの分離音響信号を削除するようにしても良い。 The analyzing unit 31 and the decompressing unit 32 analyze the target acoustic signal, and separately extract the goal utterance acoustic signal, the cheering acoustic signal, and the BGM acoustic signal from the target acoustic signal as separated acoustic signals, thereby separating the goal utterance. Pitch maintenance expansion processing (or echo processing) is applied to the acoustic signal, pitch maintenance expansion processing is applied to the cheering separation acoustic signal while reducing the volume, and repeat processing is performed on the BGM separation acoustic signal. Apply. Then, an extended acoustic signal is generated by synthesizing the separated acoustic signals after the processing. The repeat processing may be performed after the volume of the BGM separated acoustic signal is reduced, or the BGM separated acoustic signal may be deleted.

図２１（ａ）は、第３の伸張具体例の想定下における対象音響信号及び対象動画像の通常再生のイメージ図であり、図２１（ｂ）は、第３の伸張具体例に係る、伸張音響信号の再生を伴う対象動画像のスロー再生のイメージ図である。 FIG. 21A is an image diagram of normal reproduction of the target sound signal and the target moving image under the assumption of the third extension example, and FIG. 21B is an extension sound according to the third extension example. It is an image figure of slow reproduction of the object moving picture accompanied with reproduction of a signal.

第３の伸張具体例における分離音響信号及び伸張音響信号の生成方法を、より具体的に説明する。図２２に示す如く、対象音響信号の全区間が３つの区間Ｐ_2A、Ｐ_2B及びＰ_2Cに分類され、区間Ｐ_2A及びＰ_2Cには歓声及びＢＧＭの音響信号のみが存在し、区間Ｐ_2Bには歓声及びＢＧＭの音響信号に加え、ゴール発声の音響信号が存在する場合を想定する。 A method for generating the separated acoustic signal and the extended acoustic signal in the third extension specific example will be described more specifically. As shown in FIG. 22, all sections of the target acoustic signal are classified into three sections P _2A , P _2B and P _2C , and only the cheering and BGM acoustic signals exist in the sections P _2A and P _2C , and the section P _2B Suppose that there is an acoustic signal of goal utterance in addition to the cheering and BGM acoustic signals.

まず、区間Ｐ_2Bに対する伸張方法について説明する。解析部３２は、区間Ｐ_2Bを特定区間とみなした上で、上述した方法を用いることにより、区間Ｐ_2Bの対象音響信号に人の声による音響信号が含まれているか否か、及び、区間Ｐ_2Bの対象音響信号に音楽による音響信号が含まれているか否かを検出することができる。第３の伸張具体例における想定下では、区間Ｐ_2Bの対象音響信号に人の声及び音楽による音響信号が含まれていると検出される。 First, the extension method for the section P_2B will be described. By regarding the section P_2B as the specific section and using the method described above, the analysis unit 32 can detect whether or not the target acoustic signal in the section P_2B contains an acoustic signal of a human voice, and whether or not it contains an acoustic signal of music. Under the assumptions of the third extension specific example, the target acoustic signal in the section P_2B is detected as containing acoustic signals of both a human voice and music.

伸張部３２は、区間Ｐ_2Bにおける対象音響信号から人の声の音響信号と音楽（今の例においてＢＧＭ）の音響信号を分離抽出すべく、区間Ｐ_2Bにおける時間軸上の対象音響信号に対してフーリエ変換を行うことで区間Ｐ_2Bにおける周波数軸上の対象音響信号、即ち、区間Ｐ_2Bにおける対象音響信号の周波数スペクトルを生成する。 To separate and extract the acoustic signal of the human voice and the acoustic signal of the music (BGM in this example) from the target acoustic signal in the section P_2B, the extension unit 32 applies a Fourier transform to the time-axis target acoustic signal in the section P_2B, thereby generating the frequency-axis target acoustic signal in the section P_2B, that is, the frequency spectrum of the target acoustic signal in the section P_2B.

図２３（ａ）、（ｂ）及び（ｃ）のグラフに示される周波数スペクトル３６１、３６２及び３６３は、夫々、ゴール発声による音響信号の周波数スペクトル、歓声による音響信号の周波数スペクトル及びＢＧＭによる音響信号の周波数スペクトルである。実際には、スペクトル３６１〜３６３を足し合わせたものが区間Ｐ_2Bの対象音響信号の周波数スペクトルとして生成されるため、周波数軸上においてスペクトル３６１〜３６３を分離することはできない。 The frequency spectrums 361, 362, and 363 shown in the graphs of FIGS. 23A, 23B, and 23C are the frequency spectrum of the acoustic signal due to the goal utterance, the frequency spectrum of the acoustic signal due to the cheer, and the acoustic signal according to BGM, respectively. Is the frequency spectrum. Actually, the sum of the spectra 361 to 363 is generated as the frequency spectrum of the target acoustic signal in the section P _2B , so the spectra 361 to 363 cannot be separated on the frequency axis.

但し、対象音響信号においてゴール発声の信号レベルが歓声のそれよりも十分に大きく、且つ、人の声の基本周波数は音楽のそれよりも随分低い。これを考慮し、伸張部３２は、スペクトル３６１〜３６３の合成スペクトルである、区間Ｐ_2Bの対象音響信号の周波数スペクトルに対して、もう一度、フーリエ変換を施す。図２４（ａ）におけるグラフは、区間Ｐ_2BにおけるＦ軸上の対象音響信号３７０を表している。Ｆ軸上の対象音響信号３７０は、曲線３７１で表される人の声の信号成分と曲線３７２で表される音楽の信号成分（即ち、ＢＧＭの信号成分）とを足し合わせたものとなる。人の声の基本周波数は音楽のそれよりも随分低いという性質から、Ｆ軸上では、人の声の信号成分と音楽の信号成分とが分離して存在している。 However, in the target acoustic signal, the signal level of the goal utterance is sufficiently higher than that of the cheers, and the fundamental frequency of the human voice is much lower than that of the music. Considering this, the decompression unit 32 performs Fourier transform once again on the frequency spectrum of the target acoustic signal in the section P _2B , which is a combined spectrum of the spectra 361 to 363. The graph in FIG. 24A represents the target acoustic signal 370 on the F axis in the section P _2B . The target acoustic signal 370 on the F-axis is a sum of the human voice signal component represented by the curve 371 and the music signal component represented by the curve 372 (that is, the BGM signal component). Since the fundamental frequency of human voice is much lower than that of music, the signal component of human voice and the signal component of music exist separately on the F axis.

曲線３７１で表される人の声の信号成分には、信号レベルの比較的大きいゴール発声による信号成分と信号レベルの比較的小さい歓声による信号成分とが混在している。図２４（ｂ）の破線３８１内は前者の信号成分を表し、図２４（ｃ）の破線３８２及び３８３内は後者の信号成分を表している。尚、ゴール発声が一人の人の声によって形成されているのに対して、歓声は複数人の声によって形成されているため、Ｆ軸上において歓声の信号成分の広がりはゴール発声のそれよりも大きくなっている。 The signal component of the human voice represented by the curve 371 is a mixture of the signal component of the goal utterance, whose signal level is relatively high, and the signal component of the cheers, whose signal level is relatively low. The area inside the broken line 381 in FIG. 24B represents the former signal component, and the areas inside the broken lines 382 and 383 in FIG. 24C represent the latter. Note that, whereas the goal utterance is formed by the voice of a single person, the cheers are formed by the voices of multiple people, so the signal component of the cheers spreads more widely on the F axis than that of the goal utterance.

The regions on the F axis in which the signal components inside broken lines 381, 382, and 383 exist are denoted by reference numerals 391, 392, and 393, respectively (see FIGS. 24(b) and 24(c)). On the F axis, regions 391 to 393 do not overlap one another; region 393 lies on the higher side of region 391, and region 391 lies on the higher side of region 392.

On the F axis, the extension unit 32 can judge that a signal component is a human-voice signal component if its frequency falls within a predetermined voice frequency range, and that it is not a human-voice signal component otherwise. Assume here that signal component 371 falls within the voice frequency range while signal component 372 does not. Furthermore, when the maximum level of signal component 371 is higher than a predetermined reference level and the spread of signal component 371 on the F axis is larger than a predetermined reference spread, it can be judged that signal component 371 contains a mixture of an acoustic signal of a main voice and an acoustic signal of a non-main voice. Assume that such a mixture has been judged to be present. The main voice corresponds to the goal utterance, and the non-main voice corresponds to the cheers. Assume also that, within signal component 371, the part whose signal level is at or above the reference level is the signal component in region 391, and the part whose signal level is below the reference level is the signal component in regions 392 and 393.

In this case, the extension unit 32 regards signal component 372 as a music signal component (or some signal component other than a human voice) and applies an inverse Fourier transform to it twice, thereby generating a separated acoustic signal on the time axis for the BGM in section P_2B. Likewise, it applies an inverse Fourier transform twice to the part of signal component 371 whose signal level is at or above the reference level (that is, the signal component in region 391) to generate a separated acoustic signal on the time axis for the goal utterance in section P_2B, and applies an inverse Fourier transform twice to the part of signal component 371 whose signal level is below the reference level (that is, the signal components in regions 392 and 393) to generate a separated acoustic signal on the time axis for the cheers in section P_2B. Note, however, that the signal component in region 391 on the F axis also contains cheer components, so the separated acoustic signal generated here for the goal utterance in fact also contains cheer components.
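The range and level rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Component` type, the function name `split_components`, and any numeric values are hypothetical and do not appear in the specification; the F-axis signal is modeled as a list of (position, level) pairs; and the mixture of main and non-main voice is assumed to have been judged present already. The forward and inverse transforms themselves are not shown.

```python
from dataclasses import dataclass


@dataclass
class Component:
    position: float  # location on the F axis (hypothetical units)
    level: float     # signal level at that location


def split_components(components, voice_range, reference_level):
    """Sort F-axis components into goal utterance, cheers, and BGM.

    Assumes the voice component has already been judged to contain both
    a main voice and a non-main voice.
    """
    low, high = voice_range
    goal, cheer, bgm = [], [], []
    for c in components:
        if not (low <= c.position <= high):
            bgm.append(c)    # outside the voice range: treated as music
        elif c.level >= reference_level:
            goal.append(c)   # at or above the reference level: main voice
        else:
            cheer.append(c)  # below the reference level: non-main voice
    return goal, cheer, bgm
```

Each of the three returned groups would then be converted back to the time axis separately to obtain the corresponding separated acoustic signal.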

In sections P_2A and P_2C, on the other hand, the target acoustic signal contains only the acoustic signal of the cheers and the acoustic signal of the BGM, so separating them is easy. Specifically, the target acoustic signal on the time axis in section P_2A is Fourier-transformed twice to convert it into a signal on the F axis. Then, on the premise that the target acoustic signal in section P_2A contains acoustic signals of a human voice and of music, the signal component of the F-axis target acoustic signal in section P_2A that falls within the voice frequency range is regarded as a human-voice (that is, cheer) signal component, while the signal component that falls outside the voice frequency range is regarded as a music (that is, BGM) signal component, and an inverse Fourier transform is applied twice to each of them separately. As a result, for section P_2A, a separated acoustic signal on the time axis for the human voice is generated from the human-voice signal component on the F axis, and a separated acoustic signal on the time axis for the BGM is generated from the music signal component on the F axis. The same applies to section P_2C.

After generating the separated acoustic signals on the time axis for each section, the extension unit 32 applies the pitch-keeping extension process to the separated acoustic signal of the cheers in section P_2A and the repeat process to the separated acoustic signal of the BGM in section P_2A, and adds the processed signals together to generate the extended acoustic signal for section P_2A. As described above, however, the separated acoustic signal of the BGM in section P_2A may be attenuated or removed in the course of the extension process. The same applies to the separated acoustic signals of the BGM in sections P_2B and P_2C.
Next, the extension unit 32 applies the pitch-keeping extension process to the separated acoustic signals of the goal utterance and the cheers in section P_2B and the repeat process to the separated acoustic signal of the BGM in section P_2B, and adds the processed signals together to generate the extended acoustic signal for section P_2B. As described above, however, the volume of the separated acoustic signal of the cheers in section P_2B may be reduced in the course of the extension process.
Further, the extension unit 32 applies the pitch-keeping extension process to the separated acoustic signal of the cheers in section P_2C and the repeat process to the separated acoustic signal of the BGM in section P_2C, and adds the processed signals together to generate the extended acoustic signal for section P_2C.
Finally, the extension unit 32 completes the extended acoustic signal for the entire period by joining the extended acoustic signals for sections P_2A, P_2B, and P_2C in this order.
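The per-section assembly described above can be sketched in Python as follows. This is a minimal sketch rather than the implementation in the specification: all function names are hypothetical, signals are plain lists of samples, the repeat process is assumed to mean playing the segment back to back an integer number of times, and the pitch-keeping extension process is not implemented here but passed in as a caller-supplied function `pitch_keeping_extend(signal, factor)`.

```python
def repeat_process(signal, factor):
    """Assumed form of the repeat process: play the segment `factor` times in a row."""
    return list(signal) * factor


def mix(a, b):
    """Add two equal-length signals sample by sample."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return [x + y for x, y in zip(a, b)]


def extend_section(voice_signals, bgm, factor, pitch_keeping_extend):
    """Extend one section: repeat the BGM, extend each voice signal, and sum them."""
    out = repeat_process(bgm, factor)
    for voice in voice_signals:
        out = mix(out, pitch_keeping_extend(voice, factor))
    return out


def extend_all(sections, factor, pitch_keeping_extend):
    """Join the extended signals of all sections in order.

    `sections` is a list of (voice_signals, bgm) pairs, one per section.
    """
    result = []
    for voice_signals, bgm in sections:
        result.extend(extend_section(voice_signals, bgm, factor, pitch_keeping_extend))
    return result
```

For the example in the text, `sections` would hold three entries (P_2A, P_2B, P_2C), with the P_2B entry carrying two voice signals (goal utterance and cheers). Attenuating or removing a separated signal, which the text allows, would be a scaling step applied before `mix`.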

Reproducing the extended acoustic signal obtained in this way emphasizes the goal utterance, which deserves attention, while keeping the pitch of the goal utterance and the cheers unchanged, and thereby achieves reproduction with a sense of presence. The BGM is also reproduced without sounding unnatural.

[Fourth Specific Extension Example]
The content of the extension process performed by the extension unit 32 may also be changed according to the scene setting information. For example, when the shooting scene indicated by the scene setting information is "sports", an extension process (for example, the pitch-keeping extension process) is applied to the acoustic signal of human voices presumed to be surrounding cheers so that the extended acoustic signal includes the cheers, whereas when the shooting scene indicated by the scene setting information is "macro", the acoustic signals of ambient sounds, including human voices, may be excluded from the extended acoustic signal as far as possible.

The shooting scene may also be determined from the target video signal without referring to the scene setting information. For example, the optical flow of the target moving image may be derived from the target video signal, the magnitude of the motion of objects in the target moving image detected from that optical flow, and, if the magnitude is relatively large, the target moving image judged to be footage of a sports scene. When such a judgment is made, the same extension process can be performed as when the shooting scene is set to "sports".
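The motion-based scene judgment can be illustrated with a short Python sketch. The specification does not say how the magnitude of motion is computed or what counts as "relatively large", so the mean vector length, the threshold parameter, and the function names below are all assumptions made for illustration; the optical flow itself is taken as given, as a list of (dx, dy) vectors.

```python
import math


def mean_motion_magnitude(flow_vectors):
    """Average length of the optical-flow vectors (0.0 if there are none)."""
    if not flow_vectors:
        return 0.0
    total = sum(math.hypot(dx, dy) for dx, dy in flow_vectors)
    return total / len(flow_vectors)


def infer_scene(flow_vectors, motion_threshold):
    """Return "sports" when the motion is larger than the (assumed) threshold.

    Returns None when no scene can be inferred from motion alone.
    """
    if mean_motion_magnitude(flow_vectors) > motion_threshold:
        return "sports"
    return None
```

A result of "sports" would then select the same extension process as an explicit "sports" scene setting.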

As another example, when the video signal analysis unit 34 analyzes the target video signal and finds that a person and a baseball bat appear in the target moving image, the target moving image can be judged to be footage of a baseball batting scene. When such a judgment is made, the volume of an impulse sound presumed to be the hitting sound may be increased in the course of the extension process, so that the hitting sound is reproduced more loudly and with greater impact.

<<Second Embodiment>>
A second embodiment of the present invention will now be described. In the first embodiment described above, the extension process is applied to the acoustic signal in the course of picking up the acoustic signal and recording it on the recording medium 15, but the extension process may instead be executed at the reproduction stage. The second embodiment describes an imaging device that executes the extension process at the reproduction stage. Because the overall configuration of the imaging device according to the second embodiment is the same as that shown in FIG. 1, the imaging device according to the second embodiment is also referred to as the imaging device 1. The matters described for the first embodiment also apply to this embodiment as long as no contradiction arises.

In the second embodiment, the signal obtained by encoding the video signal of the target moving image and the signal obtained by encoding the unmodified target acoustic signal, that is, the original acoustic signal, are first recorded on the recording medium 15 in association with each other. Later, in response to an operation instructing reproduction of the target moving image, the signal obtained by encoding the video signal of the target moving image is read from the recording medium 15 as a video signal stream, and the signal obtained by encoding the unmodified target acoustic signal is read as an acoustic signal stream.

FIG. 25 is a block diagram of the parts involved in generating the extended acoustic signal in the second embodiment. The sound source type analysis unit 31, the acoustic signal extension unit 32, and the video signal analysis unit 34 are the same as those in FIG. 7. As mentioned above, the sound source type analysis unit 31 and the acoustic signal extension unit 32 may be abbreviated as the analysis unit 31 and the extension unit 32, respectively. The analysis unit 31, the extension unit 32, and the acoustic signal decoding unit 35 can be provided in the acoustic signal processing unit 14 of FIG. 1, and the video signal analysis unit 34 and the video signal decoding unit 36 can be provided in the video signal processing unit 12 of FIG. 1.

The acoustic signal stream and the video signal stream read from the recording medium 15 are decoded by the acoustic signal decoding unit 35 and the video signal decoding unit 36, respectively, to produce the target acoustic signal and the target video signal. The target acoustic signal from the acoustic signal decoding unit 35 is sent to the analysis unit 31 and the extension unit 32, and the target video signal from the video signal decoding unit 36 is sent to the video signal analysis unit 34. As in the first embodiment, the analysis unit 31 and the video signal analysis unit 34 generate sound source type information and video analysis information from the target acoustic signal and the target video signal, respectively, and send that information to the extension unit 32.

When the scene setting information described in the first embodiment is recorded on the recording medium 15, it is sent from the recording medium 15 to the extension unit 32. When the user enters scene setting information at the time of reproduction, that scene setting information may be given to the extension unit 32. The extension unit 32 is also given reproduction speed information, which indicates the ratio between the shooting rate and the reproduction rate of the target moving image and may be the same as the frame rate information described in the first embodiment.

Assume now, as in the first embodiment, that the shooting rate of the target moving image is 600 fps and its reproduction rate is 60 fps. In that case, following the reproduction speed information and based on all or some of the target acoustic signal, the sound source type information, the video analysis information, and the scene setting information, the extension unit 32 generates, in the same way as in the first embodiment, an acoustic signal lasting (10 × α) seconds from the α seconds of target acoustic signal as the extended acoustic signal.
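The relationship between the two frame rates and the length of the extended acoustic signal can be checked with a few lines of Python. The function names are hypothetical, and the sketch assumes that the extension factor is exactly the ratio of the shooting rate to the reproduction rate, as in the 600 fps and 60 fps example.

```python
from fractions import Fraction


def extension_factor(shooting_fps, playback_fps):
    """Ratio by which the acoustic signal must be lengthened for slow reproduction."""
    if playback_fps <= 0 or shooting_fps <= playback_fps:
        raise ValueError("expected shooting_fps > playback_fps > 0")
    return Fraction(shooting_fps, playback_fps)


def extended_duration(alpha_seconds, shooting_fps, playback_fps):
    """Duration of the extended acoustic signal for alpha seconds of recorded sound."""
    return alpha_seconds * extension_factor(shooting_fps, playback_fps)
```

Using `Fraction` keeps the factor exact when the two rates do not divide evenly.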

The target video signal obtained by decoding in the video signal decoding unit 36 is sent to the display unit 16 at 60 fps, so that the target moving image is reproduced and displayed at 60 fps, and the extended acoustic signal is sent to the speaker 17 in synchronization with the reproduction of the target video signal, so that the extended acoustic signal is reproduced as sound over (10 × α) seconds in synchronization with the reproduced video of the target moving image. For convenience of explanation, the case where the shooting rate and the reproduction rate are 600 fps and 60 fps, respectively, has been described, but this is of course only an example. In practice, the shooting rate and the reproduction rate might be, for example, 60 fps and 30 fps, respectively.

The method of the extension process may also be changed according to a user instruction given at the time of reproduction. For example, the user can instruct that the simple extension process be applied to the target acoustic signal, and the content of that instruction can be included in the scene setting information given to the extension unit 32. When such an instruction is given, the extension unit 32 generates the extended acoustic signal by applying the simple extension process to the target acoustic signal from the acoustic signal decoding unit 35, regardless of the sound source type information and the video analysis information.

<<Modifications, etc.>>
The specific numerical values given in the above description are merely examples and can naturally be changed to various other values. Notes 1 to 3 below are given as modifications of, or remarks on, the embodiments described above. The contents of the notes may be combined in any way as long as no contradiction arises.

[Note 1]
A reproduction device (not shown) equipped with the analysis unit 31, the extension unit 32, the video signal analysis unit 34, the acoustic signal decoding unit 35, and the video signal decoding unit 36 of FIG. 25, together with a display unit and a speaker equivalent to the display unit 16 and the speaker 17 of FIG. 1, may be configured separately from the imaging device 1. If the acoustic signal stream and the video signal stream from the recording medium 15 are supplied to such a reproduction device, the same reproduction as on the imaging device 1 according to the second embodiment is achieved on that reproduction device.

The imaging device 1 according to the first embodiment functions as a recording device that records video signals and acoustic signals, and the imaging device 1 according to the second embodiment functions as a reproduction device that reproduces video signals and acoustic signals. An imaging device is a kind of electronic apparatus, and so is a recording device or a reproduction device.

[Note 2]
The imaging device 1 of FIG. 1 or the electronic apparatus described above can be implemented in hardware or in a combination of hardware and software. When the imaging device 1 or the electronic apparatus is implemented using software, a block diagram of a part realized in software represents a functional block diagram of that part. A function realized using software may be written as a program, and the function may be realized by running that program on a program execution device (for example, a computer).

[Note 3]
For example, the following view can be taken. An output acoustic signal generation unit, which generates the extended acoustic signal as an output acoustic signal from the target acoustic signal as an input acoustic signal picked up when the target moving image is shot, is formed to include the analysis unit 31 and the extension unit 32 (see FIG. 7 or FIG. 25). An acoustic signal processing device that includes the output acoustic signal generation unit can be regarded as corresponding to the acoustic signal processing unit 14, as residing within the acoustic signal processing unit 14, or as including the acoustic signal processing unit 14.

DESCRIPTION OF SYMBOLS
1 Imaging device
11 Imaging unit
12 Video signal processing unit
13 Microphone unit
14 Acoustic signal processing unit
31 Sound source type analysis unit
32 Acoustic signal extension unit
33 Acoustic signal encoding unit
34 Video signal analysis unit

Claims

An acoustic signal processing device comprising an output acoustic signal generation unit that generates, from an input acoustic signal picked up when a target moving image is captured at a first frame rate, an output acoustic signal having a signal length longer than that of the input acoustic signal, wherein
the output acoustic signal is an acoustic signal to be reproduced as sound together with the target moving image when the target moving image is reproduced at a second frame rate lower than the first frame rate, and
the output acoustic signal generation unit generates the output acoustic signal from the input acoustic signal according to a type of a sound source of the input acoustic signal.

The output sound signal generation unit includes a sound source type analysis unit that analyzes a type of a sound source of the input sound signal based on the input sound signal, and the sound source of the input sound signal analyzed by the sound source type analysis unit The acoustic signal processing apparatus according to claim 1, wherein the output acoustic signal is generated from the input acoustic signal according to a type.

The sound source type analysis unit determines whether or not a human voice is included in the sound source of the input sound signal based on the input sound signal,
The output acoustic signal generation unit changes a method of generating the output acoustic signal from the input acoustic signal according to whether or not a human voice is included in a sound source of the input acoustic signal. The acoustic signal processing apparatus according to claim 2.

When the input acoustic signal includes acoustic signals from a plurality of different sound sources, the output acoustic signal generation unit extracts the acoustic signals from the plurality of sound sources individually from the input acoustic signal as separated acoustic signals, analyzes the type of the sound source of each separated acoustic signal using the sound source type analysis unit, applies to each separated acoustic signal an extension process according to the type of its sound source, and generates the output acoustic signal by synthesizing the plurality of separated acoustic signals. The acoustic signal processing apparatus according to claim 2.

The output sound signal generation unit generates the output sound signal from the input sound signal based not only on the analysis result by the sound source type analysis unit but also on the analysis result on the video signal of the target moving image. The acoustic signal processing device according to any one of claims 2 to 4.

An electronic apparatus comprising the acoustic signal processing device according to any one of claims 1 to 5, wherein the electronic apparatus
generates the output acoustic signal from the input acoustic signal and records the output acoustic signal on a recording medium when the target moving image is shot at the first frame rate, or
records the input acoustic signal on the recording medium and, when the target moving image is reproduced at the second frame rate, generates the output acoustic signal from the recorded input acoustic signal and reproduces the target moving image together with the output acoustic signal.