JP2016045456A - Voice processing device, voice processing method and program

Info

Publication number: JP2016045456A
Application number: JP2014171649A
Authority: JP
Inventors: 文裕梶村; Fumihiro Kajimura
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2014-08-26
Filing date: 2014-08-26
Publication date: 2016-04-04
Anticipated expiration: 2034-08-26
Also published as: JP6381366B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique capable of reducing noise included in voice with high accuracy.
SOLUTION: A voice processing device comprises: first acquisition means for acquiring a first voice signal; first setting means for setting a reference period; second setting means for setting a plurality of comparison periods; second acquisition means for acquiring a second voice signal by applying, to the first voice signal, attenuation processing that attenuates the voice signal in frequency bands other than an attention band; detection means for detecting a plurality of similar periods from among the plurality of comparison periods by comparing the second voice signal in the reference period with the second voice signal in each of the comparison periods; generation means for generating a replacement signal on the basis of the first voice signal in the reference period and the first voice signal in each of the plurality of similar periods; and replacement means for replacing the first voice signal in the reference period with the replacement signal.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice processing device, a voice processing method, and a program.

In an imaging apparatus that can record audio together with a captured moving image, the recorded audio may include noise generated by driving the optical system.
A conventional technique for addressing this problem is disclosed in, for example, Patent Document 1.
In the technique disclosed in Patent Document 1, when a motor of the imaging apparatus (an iris motor, a shutter motor, or the like) is driven, the audio in the period during which noise occurs is corrected using the audio immediately before the motor was driven.

Patent Document 1: JP 2006-203376 A

However, because the technique of Patent Document 1 requires a circular buffer, the length of the period over which it can reduce noise is bounded by physical constraints.
Consequently, even with the technique of Patent Document 1, noise cannot always be reduced with high accuracy.

An object of the present invention is to provide a technique that can reduce noise contained in audio with high accuracy.

A first aspect of the present invention is a voice processing device comprising:
first acquisition means for acquiring a first audio signal representing audio;
first setting means for setting a reference period;
second setting means for setting a plurality of comparison periods, each of which has the same time width as the reference period and differs from the reference period;
second acquisition means for acquiring a second audio signal by applying, to the first audio signal, attenuation processing that attenuates the audio signal in frequency bands other than a band of interest, the band of interest being the frequency band to be focused on when processing the first audio signal;
detection means for detecting, from among the plurality of comparison periods, a plurality of similar periods whose second audio signal is similar to the second audio signal in the reference period, by comparing the second audio signal in the reference period with the second audio signal in each comparison period;
generation means for generating a replacement signal, which is the audio signal to be set as the audio signal in the reference period, based on the first audio signal in the reference period and the first audio signal in each of the plurality of similar periods; and
replacement means for replacing the first audio signal in the reference period with the replacement signal.

A second aspect of the present invention is a voice processing method comprising:
a first acquisition step of acquiring a first audio signal representing audio;
a first setting step of setting a reference period;
a second setting step of setting a plurality of comparison periods, each of which has the same time width as the reference period and differs from the reference period;
a second acquisition step of acquiring a second audio signal by applying, to the first audio signal, attenuation processing that attenuates the audio signal in frequency bands other than a band of interest, the band of interest being the frequency band to be focused on when processing the first audio signal;
a detection step of detecting, from among the plurality of comparison periods, a plurality of similar periods whose second audio signal is similar to the second audio signal in the reference period, by comparing the second audio signal in the reference period with the second audio signal in each comparison period;
a generation step of generating a replacement signal, which is the audio signal to be set as the audio signal in the reference period, based on the first audio signal in the reference period and the first audio signal in each of the plurality of similar periods; and
a replacement step of replacing the first audio signal in the reference period with the replacement signal.

A third aspect of the present invention is a program that causes a computer to execute each step of the voice processing method described above.

According to the present invention, noise contained in audio can be reduced with high accuracy.

Block diagram showing an example of the functional configuration of the audio processing unit according to the present embodiment
Diagram showing an example of the appearance and functional configuration of the imaging apparatus according to the present embodiment
Diagram showing an example of the noise reduction processing according to the present embodiment
Diagram showing an example of the audio signals and the characteristics of the attenuation processing according to the present embodiment
Flowchart showing an example of the flow of the noise reduction processing according to the present embodiment
Block diagram showing an example of the functional configuration of the audio processing unit according to the present embodiment
Diagram showing an example of conventional noise reduction processing
Diagram showing an example of a problem that arises in conventional noise reduction processing

Hereinafter, an audio processing device, an imaging apparatus, and an audio processing method according to embodiments of the present invention are described in detail with reference to the drawings.
The following embodiment is merely an example, and the present invention is not limited to it.

(Configuration of the imaging apparatus)
A camera 1 is described below as an example of the audio processing device according to the present embodiment.
FIG. 2(a) is a perspective view showing an example of the appearance of the camera 1. FIG. 2(b) is a block diagram showing an example of the configuration of the camera 1.
As shown in FIG. 2(b), the camera 1 includes a camera system control unit 10, an imaging lens 11, a microphone 12, an image sensor 13, an image processing unit 14, a lens driving unit 15, an audio processing unit 16, a memory unit 17, an operation unit 18, an image display unit 19, and the like.

Light that has passed through the imaging lens 11 forms an image near the image sensor 13, and the image sensor 13 is exposed for an appropriate time.
The image sensor 13 photoelectrically converts the received light into an electrical signal (analog signal).
The image processing unit 14 includes processing units (processing circuits) such as an A/D converter, a white balance circuit, a gamma correction circuit, and an interpolation circuit. Using these, the image processing unit 14 applies various processes to the analog signal generated by the image sensor 13 to generate captured image data, which is a digital signal. The generated captured image data is recorded in the memory unit 17 via the camera system control unit 10.
The lens driving unit 15 adjusts the optical state of the imaging lens 11 by driving it in accordance with instructions (commands) from the camera system control unit 10. Specifically, in response to those instructions, the lens driving unit 15 drives the focus lens group, the diaphragm mechanism, the image stabilization mechanism, and the like of the imaging lens 11.
A moving image can be captured by continuously exposing the image sensor 13 and, at a constant frame rate, reading the analog signal from the image sensor 13 and generating captured image data.

Audio is input to the microphone 12. The microphone 12 generates an audio signal (analog or digital) representing the input audio. In this embodiment, the microphone 12 generates an audio signal representing the audio input during moving image capture, which includes at least the audio of the subject.
The audio processing unit 16 acquires the audio signal generated by the microphone 12 and applies various processes to it to generate a digital audio signal (output audio signal). The processes performed by the audio processing unit 16 include A/D conversion, noise reduction processing, and the like. Noise reduction processing reduces the noise contained in the audio represented by the audio signal generated by the microphone 12; its details are described later. The generated output audio signal is recorded in the memory unit 17 via the camera system control unit 10, for example in association with the captured image data of the moving image.

The operation unit 18 accepts user operations on the camera 1. In this embodiment, the operation unit 18 has one or more buttons, including the shutter release button 18a shown in FIG. 2(a).
The camera system control unit 10 controls each functional unit of the camera 1 in accordance with operation signals (timing signals) generated in response to user operations. For example, when a press of the shutter release button 18a is detected, it controls the driving of the image sensor 13, the operation of the image processing unit 14, the operation of the audio processing unit 16, the compression of data and signals to be recorded in the memory unit 17, and the like. The camera system control unit 10 also controls the display of images and information on the image display unit 19.

(Configuration of the audio processing unit 16)
The configuration of the audio processing unit 16 in FIG. 2(b) is described next.
Ideally, the audio represented by the audio signal that the microphone 12 generates during moving image capture would contain only the audio of the subject. In practice, noise may be superimposed on it, such as lens driving noise caused by driving the imaging lens 11 and dark noise, which is white noise arising from the performance of the microphone 12. The audio represented by the audio signal generated by the microphone 12 may therefore contain noise.
The audio processing unit 16 reduces this noise by performing noise reduction processing.

FIG. 1 is a block diagram showing an example of the functional configuration of the audio processing unit 16.
To make data (signals) easy to distinguish from functional units, FIG. 1 draws functional units as rectangles with square corners and data (signals) as rectangles with rounded corners.
As shown in FIG. 1, the audio processing unit 16 includes an audio signal attenuation unit 31, a similar period detection unit 32, a replacement signal generation unit 33, a reference period setting unit 34, an audio signal replacement unit 35, and the like.
In FIG. 1, the input audio signal 21 is an audio signal such as a digital signal generated by the microphone 12 or a digital signal obtained by A/D conversion of an analog signal generated by the microphone 12.

The audio signal attenuation unit 31 acquires the input audio signal 21 (first audio signal) (first acquisition process).
The audio signal attenuation unit 31 also acquires (generates) an attenuated audio signal (second audio signal) by applying, to the input audio signal 21, attenuation processing that attenuates the audio signal in the non-interest band, that is, the frequency bands other than the band of interest (second acquisition process). The band of interest is the frequency band to be focused on in the noise reduction processing of the input audio signal 21. The attenuation processing can equally be described as extraction processing that extracts the audio signal in the band of interest.
The audio signal attenuation unit 31 outputs the attenuated audio signal to the similar period detection unit 32.

The method of attenuation processing (extraction processing) is not particularly limited. For example, it may be filter processing using a filter that passes the audio signal in the band of interest (a band-pass filter, BPF).
The band of interest and the non-interest band are not particularly limited either. They may be predetermined frequency bands or may be settable by the user. For example, at least one of the two may be determined according to the imaging target, the operation mode of the camera 1, a user operation, or the like.

When human speech is the expected input to the microphone 12, the band of interest preferably includes the frequency band of the human voice. Specifically, the band of interest preferably includes the frequency band corresponding to the first formant of the human voice. The first formant of an adult voice is generally said to lie between 500 Hz and 1500 Hz, so the band of interest preferably includes the band from 500 Hz to 1500 Hz. The second formant of an adult voice is said to lie between 1500 Hz and 3000 Hz. An adult voice is also said to have third and fourth formants at frequencies above the second formant.
This embodiment describes an example in which the band of interest contains both the first-formant and second-formant bands, specifically the band from 500 Hz to 3000 Hz.
Note that the first acquisition process may be executed by a functional unit other than the audio signal attenuation unit 31.
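
The attenuation processing described above can be illustrated with a minimal sketch. The text leaves the filtering method open, so the approach here (zeroing discrete Fourier transform bins outside the band) is only one possible choice and is not taken from the source. The function name `attenuate_outside_band`, the 8000 Hz sample rate, and the test tones are assumptions made for this example; only the 500 Hz to 3000 Hz band comes from the embodiment.

```python
import cmath
import math


def attenuate_outside_band(signal, sample_rate, low_hz=500.0, high_hz=3000.0):
    """Return a copy of `signal` with DFT bins outside [low_hz, high_hz] zeroed."""
    n = len(signal)
    # Forward DFT (naive O(n^2); adequate for a short illustration).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins k and n - k represent the same physical frequency.
        freq_hz = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq_hz <= high_hz):
            spectrum[k] = 0.0
    # Inverse DFT; the imaginary part is only rounding error for real input.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


SAMPLE_RATE = 8000  # Hz, assumed for the example
N_SAMPLES = 160     # 50 Hz bin spacing, so 100 Hz and 1000 Hz fall on exact bins
hum = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(N_SAMPLES)]
tone = [0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N_SAMPLES)]
first_signal = [h + v for h, v in zip(hum, tone)]                   # input audio signal
second_signal = attenuate_outside_band(first_signal, SAMPLE_RATE)  # attenuated audio signal
```

In this example the 100 Hz component lies outside the band of interest and is removed, while the 1000 Hz component passes through. A real device would more likely use a streaming band-pass filter than a block transform.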

The reference period setting unit 34 sets a reference period for the attenuated audio signal output from the audio signal attenuation unit 31 (first setting process). The reference period is the period targeted by the noise reduction processing. In this embodiment, a period with a predetermined time width is set as the reference period. The reference period setting unit 34 notifies the similar period detection unit 32 of the reference period.
The time width of the reference period need not be predetermined. For example, it may be determined according to the imaging target, the operation mode of the camera 1, a user operation, or the like.

The similar period detection unit 32 sets a plurality of comparison periods for the attenuated audio signal output from the audio signal attenuation unit 31 (second setting process). Each comparison period has the same time width as the reference period and differs from the reference period.
The similar period detection unit 32 also compares the attenuated audio signal in the reference period with the attenuated audio signal in each comparison period. Based on the comparison results, it detects, from among the comparison periods, a plurality of similar periods, that is, periods whose attenuated audio signal is similar to the attenuated audio signal in the reference period (detection process). For example, the N comparison periods (N is an integer of 2 or more) whose attenuated audio signal is most similar to that of the reference period are detected as the similar periods.
The similar period detection unit 32 then outputs a similar period signal 22 that indicates at least each similar period. In this embodiment, the similar period signal 22 additionally indicates the relative magnitudes of the signal similarities corresponding to the respective similar periods. The signal similarity is the similarity between the attenuated audio signal in the reference period and the attenuated audio signal in a similar period.

Note that the second setting process may be executed by a functional unit other than the similar period detection unit 32.
The method of detecting similar periods is not limited to the one above. For example, among the comparison periods whose signal similarity is at or above a threshold, the N comparison periods closest in time to the reference period may be detected as the similar periods.
The value of N may be a predetermined fixed value or may be set by the user. For example, N may be determined according to the imaging target, the operation mode of the camera 1, a user operation, or the like.
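
The alternative detection rule (keep only comparison periods whose similarity reaches a threshold, then take the N closest in time to the reference period) might be sketched as follows. The function name, the representation of each comparison period as a `(start_index, similarity)` pair, and the sample values are all hypothetical; the text does not specify how ties in temporal distance are resolved, so this sketch simply keeps the candidates' input order.

```python
def detect_nearest_similar_periods(candidates, reference_start, threshold, n):
    """Pick up to n comparison periods that pass the threshold, nearest in time first.

    candidates: list of (start_index, similarity) pairs, one per comparison period.
    Returns the start indices of the selected similar periods.
    """
    eligible = [start for start, similarity in candidates if similarity >= threshold]
    # sorted() is stable, so equally distant periods keep their input order.
    eligible = sorted(eligible, key=lambda start: abs(start - reference_start))
    return eligible[:n]


candidates = [(0, 0.9), (40, 0.2), (80, 0.8), (160, 0.95), (200, 0.7)]
similar_starts = detect_nearest_similar_periods(
    candidates, reference_start=120, threshold=0.5, n=2
)
print(similar_starts)  # [80, 160]
```

Compared with taking the N most similar periods overall, this rule favors periods near the reference period, where the audio is more likely to share the same local character.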

The replacement signal generation unit 33 generates the replacement signal 23 based on the input audio signal in the reference period and the input audio signal in each of the similar periods. The replacement signal 23 is the audio signal to be set as the output audio signal in the reference period. The input audio signal in the reference period is the input audio signal that corresponds to the attenuated audio signal in the reference period; likewise, the input audio signal in a similar period is the input audio signal that corresponds to the attenuated audio signal in that similar period.

The audio signal replacement unit 35 generates the output audio signal 24 by replacing the input audio signal in the reference period with the replacement signal 23.
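
As a rough illustration of the replacement step, the sketch below overwrites the samples of the reference period with a replacement signal and leaves the rest of the input untouched. Signals are modelled as plain Python lists, and the function name and bounds check are assumptions of this example rather than details from the source.

```python
def replace_reference_period(input_signal, reference_start, replacement):
    """Return an output signal equal to the input except in the reference period."""
    end = reference_start + len(replacement)
    if reference_start < 0 or end > len(input_signal):
        raise ValueError("reference period lies outside the input signal")
    output_signal = list(input_signal)  # copy so the input is not modified
    output_signal[reference_start:end] = replacement
    return output_signal


input_signal = [1, 2, 3, 4, 5]
output_signal = replace_reference_period(input_signal, 1, [9, 9])
print(output_signal)  # [1, 9, 9, 4, 5]
```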

(Conventional noise reduction processing)
An example of conventional noise reduction processing is described first. As detailed below, conventional noise reduction processing does not generate an attenuated audio signal.
FIGS. 7(a) to 7(e) are schematic diagrams showing an example of conventional noise reduction processing. The upper part of FIG. 7(a) shows an example of an input audio signal in which white noise (dark noise) is superimposed on the subject's audio. The lower part of FIG. 7(a) shows the input audio signal in the reference period and in each similar period, separated from the input audio signal in the other periods. FIG. 7(b) shows an example of the replacement signal. FIG. 7(c) shows an example of the output audio signal obtained by replacing the input audio signal in the reference period with the replacement signal. FIG. 7(d) shows another example of the output audio signal. FIG. 7(e) shows an example of an input audio signal in which lens driving noise is temporarily superimposed on the subject's audio. In FIGS. 7(a) to 7(e), the horizontal axis indicates time position and the vertical axis indicates the audio signal level. The upper part of FIG. 7(a) and FIGS. 7(c), 7(d), and 7(e) are enlarged views of a portion of the input audio signal 21 or the output audio signal 24, each covering about 0.2 seconds. Observed locally, the audio signal in the upper part of FIG. 7(a) is highly repetitive. The conventional noise reduction processing described below exploits this strong short-term repetitiveness of audio signals, and the present embodiment does as well.

First, as shown in FIG. 7(a), a reference period 100 is set for the input audio signal. The length of the reference period is preferably set so that the input audio signal in the reference period contains at least one cycle of the first formant frequency; that is, the reference period is preferably at least one period of the first formant frequency long. For example, because the first formant of an adult voice is said to lie between 500 Hz and 1000 Hz, the reference period is preferably at least 2 msec long (1 / 500 Hz = 0.002 sec).
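
The arithmetic behind the 2 msec figure can be checked with a short sketch that also converts the minimum length to a sample count. The 48 kHz sample rate is an assumption chosen for illustration (the source does not state one), and the function name is hypothetical.

```python
import math


def min_reference_length(sample_rate_hz, lowest_formant_hz=500.0):
    """Return (seconds, samples) for one full cycle of the lowest formant frequency."""
    seconds = 1.0 / lowest_formant_hz
    # Round up so the window always covers at least one complete cycle.
    samples = math.ceil(sample_rate_hz / lowest_formant_hz)
    return seconds, samples


seconds, samples = min_reference_length(48000)
print(seconds, samples)  # 0.002 96
```

The lower edge of the formant range sets the bound because the lowest frequency has the longest cycle.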

Next, a plurality of comparison periods are set. For example, a plurality of periods including at least one of a period earlier than the reference period and a period later than the reference period are set as the comparison periods. As noted above, the time width of each comparison period equals the time width of the reference period.
The time difference between a comparison period and its adjacent period (the reference period or another comparison period) is not particularly limited. It is determined in consideration of, for example, the processing load and the expected audio frequency. The time difference preferably corresponds to one sample of the audio signal level (one sampling interval). A comparison period may partially overlap its adjacent period, or may be separated from it.
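
One way to enumerate comparison periods is to slide a window of the reference length across the signal with a fixed hop. The sketch below assumes periods are identified by their starting sample index and that only the window coinciding exactly with the reference period is excluded; overlapping windows are allowed, as the text permits. The function name and the `hop` parameter are hypothetical.

```python
def comparison_period_starts(signal_length, reference_start, period_length, hop=1):
    """List start indices of all windows of `period_length` samples, stepping by `hop`.

    The window that coincides with the reference period is skipped, because a
    comparison period must differ from the reference period.
    """
    last_start = signal_length - period_length
    return [s for s in range(0, last_start + 1, hop) if s != reference_start]


starts = comparison_period_starts(signal_length=10, reference_start=4,
                                  period_length=4, hop=2)
print(starts)  # [0, 2, 6]
```

A hop of one sample gives the finest search at the highest processing cost; a larger hop trades alignment accuracy for speed.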

Then, by comparing the input audio signal in the reference period 100 with the input audio signal in each comparison period, a plurality of similar periods are detected from among the comparison periods. In the example of FIG. 7(a), three similar periods 101a, 101b, and 101c are detected.

An example of a method for detecting similar periods is described below.
The method for detecting similar periods is not limited to this example.

First, for each comparison period, the similarity between the input audio signal in the reference period and the input audio signal in the comparison period is calculated. The similarity is calculated using, for example, Equation 1 below.

$$D = \sum_{i=1}^{M} \left| S_C(i) - S_R(i) \right| \qquad \text{(Equation 1)}$$

The reference period and each comparison period contain M discrete time positions (M is an integer of 2 or more). M can be obtained by dividing the length of the reference period (or comparison period) by the sampling interval of the audio signal level. In Equation 1, $S_C(i)$ is the input signal level (the signal level of the input audio signal) at the i-th discrete time position of the comparison period (i is an integer from 1 to M), and $S_R(i)$ is the input signal level at the i-th discrete time position of the reference period. $D$ is the dissimilarity. The similarity is, for example, the reciprocal of the dissimilarity $D$.

In Equation 1, the dissimilarity D is the sum, over all discrete time positions, of the level difference (the absolute value of the difference between the input signal level in the reference period and the input signal level in the comparison period). The closer the input audio signal in the comparison period is to the input audio signal in the reference period, the smaller D becomes. When the two signals match exactly, D is 0.
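
The dissimilarity described for Equation 1 is a sum of absolute differences, which can be sketched in a few lines. The function names are hypothetical. The text defines the similarity only as "for example, the reciprocal" of D and does not say what happens when D is 0, so returning infinity in that case is an assumption of this sketch.

```python
def dissimilarity(reference, comparison):
    """Sum of absolute level differences between two equal-length periods."""
    if len(reference) != len(comparison):
        raise ValueError("periods must have the same number of samples")
    return sum(abs(c - r) for c, r in zip(comparison, reference))


def similarity(reference, comparison):
    """Reciprocal of the dissimilarity; infinite for identical periods (assumed)."""
    d = dissimilarity(reference, comparison)
    return float("inf") if d == 0 else 1.0 / d


reference = [1, 2, 3]
print(dissimilarity(reference, [1, 2, 3]))  # 0
print(dissimilarity(reference, [2, 0, 3]))  # 3
```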

Next, the N comparison periods with the highest similarity (N is an integer of 3 or more) are each detected as a similar period. Specifically, the N comparison periods with the smallest dissimilarity D are each detected as a similar period.
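The detection step can be sketched as follows. The function names, the list-based signal representation, and the way candidate comparison periods are passed in as start indices are assumptions made for illustration; the text does not prescribe an implementation.

```python
def dissimilarity(reference, comparison):
    """Sum of absolute level differences, as Equation 1 is described in the text."""
    if len(reference) != len(comparison):
        raise ValueError("periods must contain the same number of samples")
    return sum(abs(r - c) for r, c in zip(reference, comparison))


def detect_similar_periods(signal, ref_start, length, candidate_starts, n):
    """Return the start indices of the n comparison periods with the smallest D."""
    reference = signal[ref_start:ref_start + length]
    scored = []
    for start in candidate_starts:
        comparison = signal[start:start + length]
        scored.append((dissimilarity(reference, comparison), start))
    scored.sort()  # ascending D, i.e. descending similarity
    return [start for _, start in scored[:n]]


# Toy signal: a period-4 pattern with a disturbance in one repetition.
pattern = [0.0, 1.0, 0.0, -1.0]
signal = pattern * 5
signal[9] += 0.5  # disturb the repetition that starts at index 8
starts = [4, 8, 12, 16]
similar = detect_similar_periods(signal, 0, 4, starts, 3)
print(similar)
```

In this toy case the disturbed repetition has the largest D and is the one left out when N = 3.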

After the similar periods are detected, a replacement signal is generated using the input audio signal in the reference period and the input audio signal in each similar period. The replacement signal is calculated using, for example, Equation 2.

In Equation 2, i and M are the same as in Equation 1. N is the total number of similar periods, and K is the index of a similar period; K is an integer from 1 to N. S_O(i) is the replacement signal level (the signal level of the replacement signal) at the i-th discrete time position, and S_R(i) is the input signal level at the i-th discrete time position in the reference period. S_CK(i) is the input signal level at the i-th discrete time position in the similar period with index K. w_R is the weight of the input audio signal in the reference period, and w_K is the weight of the input audio signal in the similar period with index K. In Equation 2, the replacement signal is generated by weighted addition of the input audio signal in the reference period and the input audio signal in each similar period. As the weight w_K, for example, a larger weight is used for a similar period whose audio signal is more similar to the audio signal in the reference period. That is, the smaller the dissimilarity D, the larger the weight w_K.
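One form of Equation 2 that is consistent with this description is given below. The division by the sum of the weights is an assumption, chosen so that setting every weight to 1 yields the plain average of the N + 1 signals; the original typeset formula may be normalized differently.

```latex
S_O(i) = \frac{ w_R \, S_R(i) + \sum_{K=1}^{N} w_K \, S_{CK}(i) }
              { w_R + \sum_{K=1}^{N} w_K },
\qquad i = 1, \dots, M
```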

The audio signal 102 in FIG. 7(b) is the replacement signal generated using the input audio signal in the reference period 100 and the input audio signals in the similar periods 101a, 101b, and 101c. FIG. 7(b) shows that an audio signal with reduced noise is generated as the replacement signal 102.
The method of generating the replacement signal is not limited to the above. For example, 1 may be used for the weights w_R and w_K, so that the average of the input audio signal in the reference period and the input audio signals in the similar periods is generated as the replacement signal. Alternatively, a larger weight w_K may be used for a similar period that is closer in time to the reference period.
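The weighted addition can be sketched as below. Dividing by the total weight is an assumption (it makes unit weights reduce to the plain average mentioned above), and `weights_from_dissimilarity` is a hypothetical weighting rule that merely satisfies "smaller D, larger weight"; neither is specified in the text.

```python
def make_replacement(reference, similar_periods, w_ref=1.0, weights=None):
    """Weighted addition of the reference period and each similar period."""
    if weights is None:
        weights = [1.0] * len(similar_periods)  # unit weights: plain average
    total = w_ref + sum(weights)
    replacement = []
    for i, ref_level in enumerate(reference):
        acc = w_ref * ref_level
        for w, period in zip(weights, similar_periods):
            acc += w * period[i]
        replacement.append(acc / total)  # normalization is an assumption
    return replacement


def weights_from_dissimilarity(ds, eps=1e-9):
    """Hypothetical rule: weight is the reciprocal of D (eps avoids division by 0)."""
    return [1.0 / (d + eps) for d in ds]


reference = [1.0, 3.0]
similar = [[1.0, 1.0], [1.0, 2.0], [1.0, 2.0]]
averaged = make_replacement(reference, similar)
print(averaged)
```

With unit weights, the outlying level 3.0 in the reference period is pulled toward the levels that recur in the similar periods.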

Next, the input audio signal in the reference period 100 is replaced with the replacement signal 102. This produces the output audio signal of FIG. 7(c), in which the background noise in the reference period 100 is reduced.
In the example of FIG. 7(a), background noise is superimposed over the entire duration of the input audio signal. By repeating the above processing while shifting the time position of the reference period little by little, the output audio signal of FIG. 7(d) can be generated. In the output audio signal of FIG. 7(d), the background noise is reduced over the entire duration of the input audio signal.
Noise other than background noise can also be reduced by the above processing. For example, the noise superimposed on the input audio signal in FIG. 7(e) (lens driving noise superimposed on the partial period 103) can also be reduced. Specifically, by setting a plurality of reference periods, including the reference period 104a and the reference period 104b, one after another and performing the above processing for each, all of the lens driving noise superimposed on the input audio signal in FIG. 7(e) can be reduced.

However, the conventional noise reduction processing described above may fail to reduce noise with high accuracy. The problems that arise in the conventional noise reduction processing are described below with reference to FIGS. 8(a) to 8(d).

FIG. 8(a) is a diagram showing an example of an audio signal representing the sound of the subject (a subject audio signal, that is, an audio signal on which no noise is superimposed). FIG. 8(b) is a diagram showing an example of an audio signal representing noise (a noise signal). FIG. 8(c) is a diagram showing an audio signal in which the noise signal of FIG. 8(b) is superimposed on the subject audio signal of FIG. 8(a). FIG. 8(d) is a diagram showing an example of an audio signal in which wind noise and background noise are superimposed on the subject audio signal of FIG. 8(a). In the following, for simplicity, the frequency of the audio signal representing the sound of the subject is assumed to be Fb [Hz].

The frequency [Hz] and power (magnitude) [dB] of the noise shown in FIG. 8(b) are greater than those of the background noise superimposed on the input audio signal in FIG. 7(a). Therefore, when the audio signal shown in FIG. 8(c) is the input audio signal, the noise strongly affects the input audio signal, so the similarity of comparison periods that should be detected as similar periods decreases, and the detection accuracy of similar periods decreases. Specifically, comparison periods that correspond to the repetition unit of the subject audio signal become difficult to detect as similar periods. Thus, when the frequency and power of the noise superimposed on the input audio signal are high, the detection accuracy of similar periods decreases. As a result, the accuracy of the noise reduction processing decreases.
Noise with high frequency and high power is, for example, the driving noise of an image stabilization mechanism.

When the audio signal shown in FIG. 8(d) (an audio signal on which wind noise is superimposed) is the input audio signal, the detection accuracy of similar periods also decreases, and the accuracy of the noise reduction processing decreases. Wind noise contains many low-frequency components. In general, wind noise is said to have strong power in the frequency band of 400 Hz or less. When noise containing many low-frequency components is superimposed on the input audio signal, it is likewise difficult to detect similar periods with high accuracy. As a result, the detection accuracy of similar periods decreases, and the accuracy of the noise reduction processing decreases.

(Noise reduction processing according to this embodiment)
In this embodiment, therefore, attenuation processing is applied to the input audio signal to acquire (generate) an attenuated audio signal in which noise containing many low-frequency components, noise with high frequency and high power, and the like are reduced. The similar periods are then detected using the attenuated audio signal instead of the input audio signal. After that, as in the conventional noise reduction processing described above, the replacement signal and the output audio signal are generated using the input audio signal. Using the attenuated audio signal allows the similar periods to be detected with high accuracy. As a result, the noise contained in the audio can be reduced with high accuracy.
An example of the noise reduction processing according to this embodiment is described below.

FIG. 3(a) is a diagram showing an example of the subject audio signal, and FIG. 3(b) is a diagram showing an example of a noise signal representing noise with high frequency and high power. FIG. 3(c) is a diagram showing an example of the input audio signal, and FIG. 3(d) is a diagram showing an example of the attenuated audio signal. The lower part of FIG. 3(c) shows an example of an input audio signal in which the noise signal of FIG. 3(b) is superimposed on the subject audio signal of FIG. 3(a). The upper part of FIG. 3(c) shows the input audio signal in the reference period and in each similar period, separated from the input audio signal in the other periods. The attenuated audio signal in FIG. 3(d) is obtained by applying the attenuation processing to the input audio signal in FIG. 3(c).

FIGS. 4(a) and 4(b) are diagrams showing an example of the frequency characteristics of the audio signals and of the processing characteristic (filter characteristic) of the attenuation processing.
In FIGS. 4(a) and 4(b), the horizontal axis represents frequency and the vertical axis represents power.
In FIG. 4(a), the solid line 61 represents the frequency characteristic of the subject audio signal in FIG. 3(a), and the broken line 62 represents the frequency characteristic of the noise signal in FIG. 3(b). In FIGS. 4(a) and 4(b), the thick solid line 63 represents the frequency characteristic of the input audio signal in FIG. 3(c). In FIG. 4(b), the dash-dot line 64 represents the filter characteristic of the attenuation processing, and the solid line 65 represents the frequency characteristic of the attenuated audio signal in FIG. 3(d).

The frequency characteristic 61 of the subject audio signal has peaks in the frequency bands F1, F2, F3, and F4. The frequency band F1 is the band of the first formant, F2 is the band of the second formant, F3 is the band of the third formant, and F4 is the band of the fourth formant.
The frequency characteristic 62 of the noise signal has a component in the frequency band F4, on the high-frequency side, that is stronger than in the other frequency bands. Such a component lowers the detection accuracy of similar periods.
In this embodiment, attenuation processing (filter processing) using a filter with the filter characteristic 64 extracts, from the input audio signal, the audio signal in a frequency band that includes the first-formant band F1 and the second-formant band F2.
Consequently, in the frequency characteristic 65 of the attenuated audio signal, the components at frequencies higher than the frequency band F2 are reduced relative to the frequency characteristic 63 of the input audio signal.
In this way, the attenuation processing in this embodiment yields an attenuated audio signal in which the components that lower the detection accuracy of similar periods are reduced.
Although attenuation processing that reduces noise with high frequency and high power has been described with reference to FIGS. 3(a)-3(b) and FIGS. 4(a) and 4(b), other noise (such as noise containing many low-frequency components) can be reduced in the same manner.
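The attenuation processing can be illustrated with a brick-wall band-pass filter built on a naive discrete Fourier transform. This is only a sketch: the filter characteristic 64 in FIG. 4(b) is not specified numerically here, so the 500 Hz to 3000 Hz limits, the 8000 Hz sampling rate, the test tones, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for short examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, keeping the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def attenuate_outside_band(x, fs, f_low, f_high):
    """Zero every DFT bin whose frequency lies outside [f_low, f_high]."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        freq = min(k, n - k) * fs / n  # bin frequency, folding the mirrored half
        if not (f_low <= freq <= f_high):
            spectrum[k] = 0
    return idft_real(spectrum)


fs, n = 8000, 200  # 40 Hz bin spacing, so both tones fall exactly on a bin
voice = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n)]
noise = [0.8 * math.sin(2 * math.pi * 3600 * t / fs) for t in range(n)]
mixed = [v + z for v, z in zip(voice, noise)]
attenuated = attenuate_outside_band(mixed, fs, 500, 3000)
```

Here the 3600 Hz tone stands in for the high-frequency noise and is removed, while the 1000 Hz tone inside the band passes through. A practical device would more likely use a time-domain filter; the DFT version is used only because it is short and easy to check.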

FIG. 5 is a flowchart showing an example of the flow of the noise reduction processing according to this embodiment.
An example of the flow of the noise reduction processing according to this embodiment is described below.

First, the audio processing unit 16 acquires the input audio signal from the microphone 12 and records it in the memory unit 17 (S110). For example, the input audio signal of FIG. 3(c) is acquired.
Next, the audio signal attenuation unit 31 generates the attenuated audio signal by applying the attenuation processing to the input audio signal acquired in S110 (S111). For example, the attenuated audio signal of FIG. 3(d) is generated.

Then, the reference period setting unit 34 sets a reference period for the attenuated audio signal generated in S111 (S112). Information on the reference period is output to the similar period detection unit 32 and the replacement signal generation unit 33. For example, the reference period 51 in FIG. 3(d) is set.
Next, the similar period detection unit 32 detects a plurality of similar periods using the attenuated audio signal generated in S111 (S113). Specifically, the similar periods are detected by performing the same processing as the conventional processing, but using the attenuated audio signal instead of the input audio signal. For example, the three similar periods 52a, 52b, and 52c in FIG. 3(d) are detected. The similar period detection unit 32 outputs a similar period signal representing each detected similar period to the replacement signal generation unit 33. For example, information representing the times t1, t2, and t3 in FIG. 3(d) is output as the similar period signal.

Then, the replacement signal generation unit 33 extracts, from the input audio signal acquired in S110, the input audio signal in the reference period set in S112 and the input audio signals in the similar periods detected in S113 (S114). For example, as shown in the upper part of FIG. 3(c), the input audio signal 41 in the reference period 51, the input audio signal 42a in the similar period 52a, the input audio signal 42b in the similar period 52b, and the input audio signal 42c in the similar period 52c are extracted.

Next, the replacement signal generation unit 33 generates the replacement signal using the input audio signals extracted in S114 (S115). The replacement signal is generated by the same processing as the conventional processing. The replacement signal generation unit 33 outputs the generated replacement signal to the audio signal replacement unit 35.
Then, the audio signal replacement unit 35 generates or updates the output audio signal by replacing the input audio signal in the reference period set in S112 with the replacement signal generated in S115 (S116). In the first pass, part of the input audio signal acquired in S110 is replaced with the replacement signal generated in S115, which generates the output audio signal. In the second and subsequent passes, part of the output audio signal generated in the previous S116 is replaced with the replacement signal generated in S115, which updates the output audio signal.
Next, the replacement signal generation unit 33 records the output audio signal obtained in S116 in the memory unit 17 (S117). In the first pass, the output audio signal obtained in S116 is newly stored in the memory unit 17. In the second and subsequent passes, the output audio signal recorded in the memory unit 17 is updated to the output audio signal obtained in S116.

Then, the reference period setting unit 34 determines whether there is an unprocessed period, that is, a period in which noise should be reduced and which has not yet been set as a reference period (S118). If there is an unprocessed period, the processing returns to S112, where a reference period that includes at least part of the unprocessed period is set. The processing of S113 to S118 is then performed. The processing of S112 to S118 is repeated until no unprocessed period remains, at which point this flow ends.
The method of setting the plurality of reference periods is not particularly limited. For example, the reference periods are set one after another while the time position is shifted little by little. Part of a reference period may overlap part of an adjacent reference period, or a reference period may be separated from its adjacent reference period. The reference periods may also be set so that the end time position of one reference period coincides with the start time position of the adjacent reference period. An adjacent reference period is a reference period that neighbors the reference period in question.
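The flow of S110 to S118 can be summarized in a short sketch. It makes several simplifying assumptions that are not in the text: reference and comparison periods are non-overlapping blocks of equal length, the replacement signal is the plain average (all weights 1), and `attenuate` is any caller-supplied function implementing the attenuation processing. All names are hypothetical.

```python
def reduce_noise(input_signal, attenuate, period_len, n_similar):
    """Detect similar periods on the attenuated signal, replace from the input."""
    attenuated = attenuate(input_signal)                       # S111
    output = list(input_signal)
    starts = list(range(0, len(input_signal) - period_len + 1, period_len))
    for ref in starts:                                         # S112, looped by S118
        ref_att = attenuated[ref:ref + period_len]
        scored = []
        for cmp_start in starts:                               # S113
            if cmp_start == ref:
                continue
            cmp_att = attenuated[cmp_start:cmp_start + period_len]
            d = sum(abs(a - b) for a, b in zip(ref_att, cmp_att))
            scored.append((d, cmp_start))
        scored.sort()
        similar = [start for _, start in scored[:n_similar]]
        # S114: take the periods from the unattenuated input signal.
        segments = [input_signal[ref:ref + period_len]]
        segments += [input_signal[s:s + period_len] for s in similar]
        # S115: plain average, i.e. all weights equal to 1.
        replacement = [sum(column) / len(segments) for column in zip(*segments)]
        output[ref:ref + period_len] = replacement             # S116
    return output                                              # S117 would store this


pattern = [0.0, 2.0, 0.0, -2.0]
offsets = [0.4, -0.4, 0.4, -0.4]  # toy "noise": one offset per repetition
noisy = [p + e for e in offsets for p in pattern]
cleaned = reduce_noise(noisy, lambda x: list(x), period_len=4, n_similar=3)
```

The identity function passed as `attenuate` is only a placeholder for this toy input; the point of the structure is that similarity is scored on `attenuated` while every replacement is built from `input_signal`.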

In S111, an attenuated audio signal is obtained in which the components that lower the detection accuracy of similar periods are reduced. When the filter characteristic of the attenuation processing is the filter characteristic 64 of FIG. 4(b), the attenuated audio signal represents the audio signal (subject audio signal and noise signal) in a band of interest that includes the first-formant band and the second-formant band. In other words, an attenuated audio signal is obtained in which the audio signal (subject audio signal and noise signal) on the low-frequency side and on the high-frequency side is attenuated. In S113, the similar periods are detected using this attenuated audio signal, which allows them to be detected with high accuracy. Specifically, because the detection focuses on the audio signal in the band of interest, the similar periods can be detected with high accuracy.

In the attenuated audio signal, as the frequency characteristic 65 in FIG. 4(b) shows, not only the noise signal but also the subject audio signal is attenuated in the bands outside the band of interest. Consequently, the low-frequency and high-frequency parts of the subject audio signal in FIG. 3(a) are not contained in the attenuated audio signal in FIG. 3(d). If the replacement signal were generated from the attenuated audio signal in FIG. 3(d) (the attenuated audio signal in the reference period 51 and in each of the three similar periods 52a, 52b, and 52c), the resulting replacement signal would contain a degraded version of the subject's sound. Specifically, the replacement signal would lack the low-frequency and high-frequency parts of the subject audio signal. As a result, an output audio signal in which the subject's sound is degraded would be generated.

In this embodiment, the replacement signal is generated in S115 using the unattenuated input audio signal (the audio signal over all frequency bands). This makes it possible to generate a replacement signal in which the subject's sound is not degraded and noise is reduced with high accuracy. As a result, an output audio signal in which noise is reduced with high accuracy can be generated.
Specifically, highly random noise can be reduced by the processing of S115 (for example, weighted combination of the input audio signal in the reference period and the input audio signal in each similar period). Wind noise, for example, is highly random and can therefore be reduced by the processing of S115. Because the similar periods are detected with high accuracy, the processing of S115 can reduce the noise with high accuracy.
A highly repetitive subject audio signal, on the other hand, is emphasized rather than reduced by the processing of S115. Because the input audio signal is not attenuated in any frequency band, the degradation of the subject's sound described above can be suppressed.

As described above, according to this embodiment, the similar periods are detected using the attenuated audio signal, which allows them to be detected with high accuracy. The replacement signal is then generated using the input audio signal (the input audio signal in the reference period and in each of the similar periods). This makes it possible to generate a replacement signal and an output audio signal in which noise is reduced with high accuracy and which represent the subject's sound well.

The band of interest is not limited to the frequency band from 500 Hz to 3000 Hz. When the noise that affects the detection accuracy of similar periods is small, a wider band of interest allows similar periods to be detected with higher accuracy. It is therefore preferable to determine the band of interest based on the frequency of the noise that is expected to affect the detection accuracy of similar periods (the assumed noise). For example, when the assumed noise is lens driving noise accompanying the driving of the imaging lens 11 and that noise has a strong component near 8000 Hz, the frequency band from 500 Hz to 7000 Hz may be set as the band of interest. Alternatively, the frequency band of 7000 Hz or less may be set as the band of interest. When the assumed noise is wind noise, the frequency band of 500 Hz or more may be set as the band of interest.

The band of interest does not have to be fixed.
For example, the audio processing device or the imaging device may have a plurality of operation modes, and a plurality of frequency bands corresponding to the respective operation modes may be determined in advance. The audio processing device may then include a selection unit that selects, from among the plurality of frequency bands, the frequency band corresponding to the currently set operation mode as the band of interest.
Specifically, the operation modes include an indoor imaging mode to be set for indoor imaging, an outdoor imaging mode to be set for outdoor imaging, and so on. When the indoor imaging mode is set, it is determined that no wind noise is superimposed, and the frequency band of 3000 Hz or less is set as the band of interest. When the outdoor imaging mode is set, it is determined that wind noise is superimposed, and the frequency band of 500 Hz or more is set as the band of interest.
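A selection unit of this kind amounts to a lookup from operation mode to band limits. In the sketch below, the mode names, the use of infinity for an open upper limit, and the 500 Hz to 3000 Hz fallback for unlisted modes are assumptions; only the indoor and outdoor limits come from the text.

```python
import math

# Hypothetical mode names; the band limits follow the examples in the text.
BANDS_BY_MODE = {
    "indoor": (0.0, 3000.0),       # no wind noise assumed: 3000 Hz or less
    "outdoor": (500.0, math.inf),  # wind noise assumed: 500 Hz or more
}
DEFAULT_BAND = (500.0, 3000.0)     # assumed fallback for unlisted modes


def select_band_of_interest(mode):
    """Return (f_low, f_high) in Hz for the currently set operation mode."""
    return BANDS_BY_MODE.get(mode, DEFAULT_BAND)


print(select_band_of_interest("outdoor"))
```

The same table-driven approach would extend to other predetermined conditions by adding entries.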

A plurality of frequency bands corresponding to the respective driving states of the optical lens of the imaging device may also be determined in advance. The audio processing device may then include a selection unit that selects, from among the plurality of frequency bands, the frequency band corresponding to the current driving state of the optical lens as the band of interest.
When lens driving noise is superimposed on the input audio signal, the period during which the optical lens of the imaging device is driven may be set as the reference period. Specifically, the period during which the lens driving unit 15 drives the imaging lens 11 in response to a driving command from the camera system control unit 10 may be set as the reference period.

The audio processing device may also include a determination unit that determines the band of interest (or the bands outside it) based on the input audio signal. For example, the determination unit detects the frequency of the first formant in the input audio signal and determines a frequency band that includes the detected frequency as the band of interest.
The method of determining the band of interest based on the input audio signal is not particularly limited. The band of interest can be determined, for example, based on the result of a frequency analysis of the input audio signal.

Specifically, as shown in FIG. 6, the audio processing unit 16 may further include a frequency analysis unit 37 and a band-of-interest determination unit 36. FIG. 6 is a block diagram showing an example of the functional configuration of the audio processing unit 16. In FIG. 6, functional units that are the same as in FIG. 1 are given the same reference numerals as in FIG. 1, and their description is omitted.
The frequency analysis unit 37 performs a frequency analysis of the input audio signal 21 to detect the frequency of the first formant in the input audio signal 21 (in the subject audio signal contained in the input audio signal 21). For example, the frequency analysis unit 37 applies a Fourier transform to the input audio signal 21 and detects the frequency of the first formant based on the result.
A plurality of frequencies may be detected, including other characteristic frequencies of the input audio signal 21 (of the subject audio signal contained in the input audio signal 21).
The band-of-interest determination unit 36 determines, as the band of interest, a frequency band that includes the one or more frequencies detected by the frequency analysis unit 37 (the detected frequencies). The detected frequencies include the frequency of the first formant.
In general, the first formant of an adult voice is said to lie in the frequency band from 500 Hz to 1500 Hz. With the configuration of FIG. 6, an appropriate band for detection can be set, and appropriate noise reduction processing can be performed, even when the first-formant frequency of the subject audio signal lies outside the band from 500 Hz to 1500 Hz.
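A rough sketch of determining the band from a Fourier analysis follows. It takes the strongest spectral peak as a stand-in for the first formant, which is a simplification (formant estimation normally works on the spectral envelope), and the 500 Hz half-width, the sampling rate, the test tones, and the function names are all assumptions.

```python
import cmath
import math


def magnitude_spectrum(x, fs):
    """Return (frequency_hz, magnitude) pairs for bins 1 .. n/2 (naive DFT)."""
    n = len(x)
    pairs = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        pairs.append((k * fs / n, abs(coeff)))
    return pairs


def band_around_strongest_peak(x, fs, half_width_hz=500.0):
    """Centre a band on the strongest spectral peak (stand-in for the first formant)."""
    peak_freq, _ = max(magnitude_spectrum(x, fs), key=lambda pair: pair[1])
    return max(0.0, peak_freq - half_width_hz), peak_freq + half_width_hz


fs, n = 8000, 200  # 40 Hz bin spacing, so both tones fall exactly on a bin
x = [math.sin(2 * math.pi * 1800 * t / fs)
     + 0.3 * math.sin(2 * math.pi * 3200 * t / fs) for t in range(n)]
band = band_around_strongest_peak(x, fs)
print(band)
```

In this toy input the dominant component sits at 1800 Hz, outside the usual 500 Hz to 1500 Hz range, and the band is placed around it rather than at a fixed position.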

In the present embodiment, an imaging apparatus such as the camera 1 has been given as an example of the audio processing apparatus, and an example in which the imaging apparatus executes the noise reduction processing described above has been described; however, the present invention is not limited to this. An electronic device other than an imaging apparatus may execute the noise reduction processing described above.

<Other embodiments>
The present invention can also be implemented by a computer (or a device such as a CPU or an MPU) of a system or apparatus that realizes the functions of the above-described embodiments by reading and executing a program recorded in a storage device. The present invention can also be implemented, for example, by a method consisting of steps executed by such a computer. For this purpose, the program is provided to the computer, for example, through a network or from various types of recording media that can serve as the storage device (that is, computer-readable recording media that hold data non-transitorily). Therefore, the computer (including devices such as a CPU and an MPU), the method, the program (including program code and a program product), and the computer-readable recording medium that holds the program non-transitorily all fall within the scope of the present invention.

DESCRIPTION OF SYMBOLS
1: Camera
11: Imaging lens
15: Lens drive unit
16: Audio processing unit
31: Audio signal attenuation unit
32: Similar period detection unit
33: Replacement signal generation unit
34: Reference period setting unit
35: Audio signal replacement unit

Claims

An audio processing apparatus comprising:
first acquisition means for acquiring a first audio signal representing audio;
first setting means for setting a reference period;
second setting means for setting a plurality of comparison periods, each of which has the same time width as the reference period and differs from the reference period;
second acquisition means for acquiring a second audio signal by applying, to the first audio signal, attenuation processing that attenuates the audio signal in frequency bands other than a band of interest, the band of interest being the frequency band of interest in the processing of the first audio signal;
detection means for detecting, from among the plurality of comparison periods, a plurality of similar periods in which the second audio signal is similar to the second audio signal in the reference period, by comparing the second audio signal in the reference period with the second audio signal in each comparison period;
generation means for generating a replacement signal, which is the audio signal to be set as the audio signal in the reference period, based on the first audio signal in the reference period and the first audio signal in each of the plurality of similar periods; and
replacement means for replacing the first audio signal in the reference period with the replacement signal.
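For illustration only, the sequence of means recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the attenuation is done by zeroing DFT bins outside the band, the comparison uses normalized correlation, the comparison periods are non-overlapping windows, and the replacement signal is an equal-weight average. The names `band_limit`, `similarity`, and `reduce_noise` are hypothetical.

```python
import cmath
import math


def band_limit(samples, sample_rate, band):
    """Second signal: zero every DFT bin outside `band` (Hz), then invert."""
    n = len(samples)
    low, high = band
    spectrum = [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples)) for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold negative-frequency bins
        if not (low <= freq <= high):
            spectrum[k] = 0
    return [sum(c * cmath.exp(2j * math.pi * k * i / n)
                for k, c in enumerate(spectrum)).real / n for i in range(n)]


def similarity(a, b):
    """Normalized correlation of two equal-length segments (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_noise(first, sample_rate, ref_start, width, band, n_similar=2):
    """Replace the reference period using the most similar comparison periods."""
    second = band_limit(first, sample_rate, band)
    ref = second[ref_start:ref_start + width]
    # Comparison periods: same width as the reference, not overlapping it.
    starts = [s for s in range(0, len(first) - width + 1, width)
              if abs(s - ref_start) >= width]
    # Rank comparison periods by similarity of the band-limited signal.
    ranked = sorted(starts,
                    key=lambda s: similarity(ref, second[s:s + width]),
                    reverse=True)
    similar = ranked[:n_similar]
    # Replacement signal: equal-weight average taken on the FIRST signal.
    segments = [first[ref_start:ref_start + width]]
    segments += [first[s:s + width] for s in similar]
    replacement = [sum(col) / len(segments) for col in zip(*segments)]
    out = list(first)
    out[ref_start:ref_start + width] = replacement
    return out, similar
```

The comparison is made on the band-limited signal so that noise outside the band of interest does not disturb the search for similar periods, while the replacement itself is built from the unfiltered first signal.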

The audio processing apparatus according to claim 1, wherein the attenuation process is an extraction process for extracting an audio signal in the band of interest.

The audio processing apparatus according to claim 1, wherein the attenuation process is a filter process using a filter that passes the audio signal in the band of interest.

The audio processing apparatus according to claim 1, wherein the band of interest includes a frequency band of 500 Hz or more and 1500 Hz or less.

The audio processing apparatus according to any one of the preceding claims, wherein a plurality of frequency bands respectively corresponding to a plurality of operation modes are predetermined, and the audio processing apparatus further comprises selection means for selecting, as the band of interest, the frequency band corresponding to the set operation mode from among the plurality of frequency bands.

The audio processing apparatus according to any one of claims 1 to 5, further comprising determining means for determining the band of interest based on the first audio signal.

The audio processing apparatus according to claim 6, wherein the determining means determines the band of interest based on a result of frequency analysis using the first audio signal.

The audio processing apparatus according to claim 6 or 7, wherein the determining means determines, as the band of interest, a frequency band including the frequency of the first formant in the first audio signal.

The audio processing apparatus according to any one of claims 1 to 8, wherein the generation means generates the replacement signal by weighting and adding the first audio signal in the reference period and the first audio signal in each of the plurality of similar periods.

The audio processing apparatus according to claim 1, wherein the generation means uses, as the weight of the first audio signal in a similar period, a weight that is larger as the degree of similarity between the second audio signal in that similar period and the second audio signal in the reference period is higher.
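As a hedged illustration of similarity-dependent weighting, the sketch below uses each similar period's similarity score directly as its weight and normalizes the weights so that they sum to one. The claims only require that a higher similarity gives a larger weight; the proportional rule, the normalization, and the names `weighted_replacement` and `reference_weight` are assumptions made for this example.

```python
def weighted_replacement(reference, similar_segments, similarities,
                         reference_weight=1.0):
    """Weighted sum of the reference segment and the similar segments.

    Assumptions: similarity scores are non-negative and are used directly
    as weights; weights are normalized so the output keeps the input scale.
    """
    weights = [reference_weight] + list(similarities)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    segments = [reference] + list(similar_segments)
    return [sum(w * seg[i] for w, seg in zip(weights, segments)) / total
            for i in range(len(reference))]
```

For example, with `reference_weight=0.0` and similarities `[3.0, 1.0]`, the first similar period contributes three times as much to the replacement signal as the second.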

The audio processing apparatus according to claim 1, wherein the detection means detects, as the similar periods, N comparison periods (N is an integer of 2 or more) taken in descending order of the similarity between the second audio signal in the comparison period and the second audio signal in the reference period.

The audio processing apparatus according to claim 1, further comprising:
an optical lens; and
driving means for driving the optical lens,
wherein the first setting means sets a driving period of the optical lens as the reference period.

A voice processing method comprising:
a first acquisition step of acquiring a first audio signal representing audio;
a first setting step of setting a reference period;
a second setting step of setting a plurality of comparison periods, each of which has the same time width as the reference period and differs from the reference period;
a second acquisition step of acquiring a second audio signal by applying, to the first audio signal, attenuation processing that attenuates the audio signal in frequency bands other than a band of interest, the band of interest being the frequency band of interest in the processing of the first audio signal;
a detection step of detecting, from among the plurality of comparison periods, a plurality of similar periods in which the second audio signal is similar to the second audio signal in the reference period, by comparing the second audio signal in the reference period with the second audio signal in each comparison period;
a generation step of generating a replacement signal, which is the audio signal to be set as the audio signal in the reference period, based on the first audio signal in the reference period and the first audio signal in each of the plurality of similar periods; and
a replacement step of replacing the first audio signal in the reference period with the replacement signal.

A program causing a computer to execute each step of the voice processing method according to claim 13.