JP2008197577A

JP2008197577A - Voice processing device, voice processing method and program

Info

Publication number: JP2008197577A
Application number: JP2007035410A
Authority: JP
Inventors: Ryuichi Nanba; 隆一難波; Mototsugu Abe; 素嗣安部; Akira Inoue; 晃井上; Shigesuke Higashiyama; 恵祐東山; Hidesuke Takahashi; 秀介高橋; Masayuki Nishiguchi; 正之西口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-02-15
Filing date: 2007-02-15
Publication date: 2008-08-28
Anticipated expiration: 2027-02-15
Also published as: JP4449987B2; CN101246690A; US8422695B2; US20130182857A1; CN101246690B; US9762193B2; US20080199152A1

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing device, a voice processing method and a program.

SOLUTION: The voice processing device comprises: a voice determination section for determining, based on position information of a sound source, whether or not a first voice generated from a specific sound source is included in an input voice; a voice separating section which, when the voice determination section determines that the first voice is included in the input voice, separates the input voice into the first voice and a second voice generated by a sound source other than the specific sound source; and a voice mixing section 150 for mixing the first voice and the second voice separated by the voice separating section at an arbitrary sound volume ratio.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a sound processing apparatus, a sound processing method, and a program.

In recent years, video/audio recording apparatuses capable of recording video of a subject together with the sound emitted by the subject have come into wide use. The operator of such an apparatus can adjust its imaging direction, or operate the controls provided on it to enlarge or reduce the image of the subject.

Sound volume decreases with distance from the sound source. Consequently, such a video/audio recording apparatus sometimes records sound originating from the operator, such as the operator's voice or the operating noise of the controls, at a higher volume than the sound emitted by the subject.

Patent Document 1 discloses a sound processing apparatus for recording sound in which the volume of such operator-originated sound is suppressed. Specifically, that apparatus has a total of five directional microphones: front-left, front-right, rear-left, rear-right, and a detachable microphone. The voice of an operator positioned at the rear center is therefore picked up by almost none of the front-left, front-right, rear-left, and rear-right microphones, and can instead be picked up by the detachable microphone as need or purpose requires.

Patent Document 2 discloses a technique for separating the signals from one or more sound sources out of a mixed sound containing sounds from a plurality of sources, using a BSS (Blind Source Separation) scheme based on ICA (Independent Component Analysis).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2005-341073
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2006-154314

However, the conventional sound processing apparatus requires a large number of microphones, which increases its hardware scale. In addition, because the conventional apparatus relies on microphone directivity to single out the operator's voice, it constrains where the operator can be positioned.

The present invention has been made in view of the above problems. Its object is to provide a new and improved sound processing apparatus, sound processing method, and program capable of adjusting, before recording, the proportion of the overall volume occupied by sound emitted from a specific sound source.

To solve the above problems, according to one aspect of the present invention, there is provided a sound processing apparatus comprising: a sound determination unit that determines, based on position information of a sound source, whether an input sound includes a first sound emitted from a specific sound source; a sound separation unit that, when the sound determination unit determines that the input sound includes the first sound, separates the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source; and a sound mixing unit that mixes the first sound and the second sound separated by the sound separation unit at an arbitrary volume ratio.

In this configuration, the sound separation unit separates out the first sound emitted from the specific sound source, and the sound mixing unit mixes, for example, the first sound with the second sound (the other sound contained in the input sound) so that the volume ratio occupied by the first sound in the mix is lower than its ratio in the input sound. Thus, when the first sound emitted from the specific sound source is unnecessarily loud in the input sound, the sound mixing unit can produce a mixed sound in which the volume ratio of the second sound is higher than its ratio in the input sound. As a result, the sound processing apparatus can prevent the second sound from being buried by the first sound.
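The mixing step described above can be sketched as a per-sample weighted sum. This is a minimal illustration rather than the patented implementation: the function name `mix`, the gain parameters, and the sample values are all assumptions introduced here, and the two signals are assumed to have already been separated.

```python
def mix(first, second, first_gain, second_gain):
    """Mix two already-separated signals sample by sample with independent gains."""
    if len(first) != len(second):
        raise ValueError("signals must have the same length")
    return [first_gain * a + second_gain * b for a, b in zip(first, second)]


# Hypothetical separated frames: a loud first sound (operator) and a quiet
# second sound (subject). The values are made up for illustration.
operator = [0.8, -0.6, 0.4, -0.2]
subject = [0.1, -0.1, 0.05, -0.05]

# Attenuate the first sound and boost the second so that the first sound
# occupies a smaller share of the mix than it did in the input.
mixed = mix(operator, subject, first_gain=0.25, second_gain=2.0)
```

Choosing `first_gain` greater than 1 instead would emphasize the first sound, which corresponds to the case where the recorder's own voice is the one that is wanted.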

The sound mixing unit may also mix the first sound, emitted for example from nearby, with the second sound (the other sound contained in the input sound) so that the volume ratio occupied by the first sound is higher than its ratio in the input sound. With this configuration, when the person making the recording wants their own voice recorded, the first sound they utter can be emphasized. When the sound determination unit determines that the input sound does not include the first sound, the sound separation unit need not separate the input sound.

The specific sound source may be located within a set distance range from the recording position of the input sound. That is, the first sound may be a sound emitted from within the set distance range of the recording position. Because sound volume decreases with distance, a sound emitted from a source closer to the recording position is often recorded at a higher volume in the input sound. The sound mixing unit can therefore suppress the volume ratio of the first sound, which originates close to the recording position, and correct the unbalanced volume relationship caused by the difference in source distances from the recording position.

The first sound may include sound originating from the operator of the apparatus used to collect the input sound, and the second sound may include sound emitted from the sound collection target. With this configuration, the volume ratio of the first sound, emitted by the operator handling the apparatus at close range, can be suppressed, and the second sound emitted from the sound collection target can be prevented from being buried by the first sound.

The sound determination unit may determine whether the input sound includes the first sound based on at least one of the volume and the sound quality of the input sound. Here, the sound determination unit may estimate, from the volume or phase of the input sound, position information of the source of the input sound, or position information of the source of each sound emitted from the one or more sound sources contained in the input sound.

The sound processing apparatus may further include an imaging unit that captures video, and the sound determination unit may include a position information calculation unit that calculates position information of a sound source based on at least one of the volume and the phase of the sound emitted from the one or more sound sources contained in the input sound. When the position information calculation unit calculates that the source of the input sound is located behind the imaging direction of the imaging unit, and the input sound has a sound quality that matches or approximates a human voice, the sound determination unit may determine that the input sound includes the first sound emitted from the specific sound source. The operator usually operates the sound processing apparatus from behind the imaging direction of the imaging unit. Under these two conditions, therefore, the sound determination unit can determine that the operator's voice is dominantly included in the input sound as the first sound. As a result, the sound mixing unit can produce a mixed sound in which the volume ratio of the operator's voice is reduced.
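The two-condition test for the operator's voice can be written as a small predicate. This is only a sketch under stated assumptions: the azimuth convention (0 degrees is the imaging direction, 180 degrees is directly behind), the width of the "rear" sector, the 0-to-1 voice-likeness score, and both thresholds are hypothetical, since the text gives no numeric values.

```python
def is_operator_voice(azimuth_deg, voice_likeness,
                      rear_half_width_deg=60.0, voice_threshold=0.5):
    """Return True when the source lies behind the imaging direction AND
    the sound quality resembles a human voice.

    azimuth_deg: estimated source direction; 0 = imaging direction,
                 180 = directly behind (assumed convention).
    voice_likeness: score in [0, 1] from a sound quality detector (assumed).
    """
    deviation_from_rear = abs(180.0 - (azimuth_deg % 360.0))
    behind = deviation_from_rear <= rear_half_width_deg
    voice_like = voice_likeness >= voice_threshold
    return behind and voice_like
```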

When the source of the input sound is within a set distance of the sound collection position, the input sound contains an impulse sound, and the input sound is loud compared with the past average volume, the sound determination unit may determine that the input sound includes the first sound emitted from the specific sound source. When the operator of the apparatus recording the input sound presses a button on it or changes their grip, impulse sounds such as clicks and thumps are often generated. Because such an impulse sound is generated at the apparatus itself, it is likely to be picked up at a relatively high volume. Under these three conditions, therefore, the sound determination unit can determine that noise caused by the operator's actions is dominantly included in the input sound as the first sound. As a result, the sound mixing unit can produce a mixed sound in which the volume ratio of that noise is reduced.
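The three conditions for operation noise can likewise be combined in a predicate. The sketch below assumes an impulsiveness score and a distance estimate are already available; every threshold (`max_distance_m`, `impulse_threshold`, `loudness_factor`) is a made-up placeholder, because the text only says "set distance" and "loud compared with the past average".

```python
def is_operation_noise(distance_m, impulsiveness, frame_volume, past_average_volume,
                       max_distance_m=0.5, impulse_threshold=0.5, loudness_factor=2.0):
    """Return True when the sound is near, impulsive, and louder than usual."""
    near = distance_m <= max_distance_m                          # within the set distance
    impulsive = impulsiveness >= impulse_threshold               # contains an impulse sound
    loud = frame_volume > loudness_factor * past_average_volume  # loud vs. past average
    return near and impulsive and loud
```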

The sound processing apparatus may include a plurality of sound collection units that collect the input sound, and a recording unit that records the mixed sound produced by the sound mixing unit on a storage medium. In this configuration, the recording unit records on the storage medium a mixed sound in which the volume ratio occupied by the first sound is lower than its ratio in the input sound. A playback apparatus can therefore reproduce a mixed sound with an adjusted first-sound volume ratio without having to implement any special volume correction function.

The sound processing apparatus may include a storage medium storing the input sound, and a playback unit that reproduces the input sound stored on the storage medium and outputs it to at least one of the position information calculation unit, the sound determination unit, and the sound separation unit. In this configuration, the position information calculation unit, the sound determination unit, and the sound separation unit can generate a mixed sound from the input sound supplied by the playback unit and output the mixed sound as the reproduced sound. A mixed sound with an adjusted first-sound volume ratio can therefore be reproduced without implementing any special volume correction function in the recording apparatus that records the input sound on the storage medium.

The sound processing apparatus may include a volume correction unit that, when the volume of the input sound has been corrected, applies to the volume of the second sound separated by the sound separation unit an inverse correction corresponding to the degree of that correction. For example, if the volume of the input sound as a whole was reduced because the first sound was excessively loud, the volume of the second sound was reduced along with it. In such a case, the volume correction unit can raise the volume of the second sound according to the degree to which the input volume was reduced, preventing the second sound from becoming too quiet.
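One way to picture the inverse correction is as the reciprocal of the gain that was applied to the input. The sketch below assumes the earlier correction was a single known linear gain (for example from an automatic level control); the function name, the `applied_gain` parameter, and the `max_boost` safety cap are assumptions added for illustration and do not come from the text.

```python
def inverse_correct(second, applied_gain, max_boost=8.0):
    """Undo a known input attenuation on the separated second sound.

    applied_gain: linear gain previously applied to the whole input
                  (0.5 means the input was halved). Assumed to be known.
    max_boost: hypothetical cap so that a tiny gain cannot cause a huge boost.
    """
    if applied_gain <= 0:
        raise ValueError("applied_gain must be positive")
    boost = min(1.0 / applied_gain, max_boost)
    return [boost * s for s in second]


# The input was halved because the first sound was too loud, so the
# second sound is doubled to restore its original level.
restored = inverse_correct([0.1, -0.2], applied_gain=0.5)
```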

To solve the above problems, according to another aspect of the present invention, there is provided a sound processing apparatus comprising: a sound separation unit that separates an input sound; a sound determination unit that determines whether the sounds separated by the sound separation unit include a first sound emitted from a specific sound source; and a sound mixing unit that mixes the first sound separated by the sound separation unit and a second sound emitted from a sound source other than the specific sound source at an arbitrary volume ratio.

To solve the above problems, according to another aspect of the present invention, there is provided a program for causing a computer to function as a sound processing apparatus comprising: a sound determination unit that determines, based on position information of a sound source, whether an input sound includes a first sound emitted from a specific sound source; a sound separation unit that, when the sound determination unit determines that the input sound includes the first sound, separates the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source; and a sound mixing unit that mixes the first sound and the second sound separated by the sound separation unit at an arbitrary volume ratio.

Such a program can cause computer hardware resources including, for example, a CPU, ROM, or RAM to execute the functions of the position information calculation unit, the sound determination unit, and the sound separation unit described above. In other words, a computer using the program can be made to function as the sound processing apparatus described above.

The sound determination unit may determine whether the input sound includes the first sound based on at least one of the position information of the sound source, the volume of the input sound, and the sound quality of the input sound.

The program may further provide an imaging unit that captures video, and the sound determination unit may include a position information calculation unit that calculates position information of a sound source based on at least one of the volume and the phase of the sound emitted from the one or more sound sources contained in the input sound. When the position information calculation unit calculates that the source of the input sound is located behind the imaging direction of the imaging unit, and the input sound has a sound quality that matches or approximates a human voice, the sound determination unit may determine that the input sound includes the first sound emitted from the specific sound source.

When the source of the input sound is within a set distance of the sound collection position, the input sound contains an impulse sound, and the input sound is loud compared with the past average volume, the sound determination unit may determine that the input sound includes the first sound emitted from the specific sound source.

To solve the above problems, according to another aspect of the present invention, there is provided a sound processing method comprising the steps of: determining, based on position information of a sound source, whether an input sound includes a first sound emitted from a specific sound source; when it is determined that the input sound includes the first sound, separating the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source; and mixing the separated first sound and second sound at an arbitrary volume ratio.

As described above, the sound processing apparatus, sound processing method, and program according to the present invention make it possible to arbitrarily adjust the proportion of the overall volume occupied by sound emitted from a specific sound source, and to output or record the result.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant description is omitted.

(First Embodiment)
First, the audio recording apparatus 10 according to the first embodiment of the present invention will be described. An example of a scene in which the audio recording apparatus 10 is used is described with reference to FIGS. 1 and 2, and the configuration and operation of the audio recording apparatus 10 are then described with reference to FIGS. 3 to 10.

FIG. 1 is an explanatory diagram showing an example of a scene in which the audio recording apparatus 10 according to the present embodiment is used. In the example shown in FIG. 1, a child, the subject, is standing in front of the gate of Shinagawa Municipal Ichiban Elementary School, and an operator holding the audio recording apparatus 10, which has a video imaging function, is pointing it at the subject.

The subject answers "Haai" ("Yes") to the operator's call of "Ooi" ("Hey"). The audio recording apparatus 10, with its video imaging function, records the operator's call and the subject's reply together with the video of the subject. The sound recorded by an ordinary audio recording method is described next with reference to FIG. 2.

FIG. 2 is an explanatory diagram showing the time-domain amplitude of sound recorded by an ordinary audio recording method. Assuming the sound source is a point source, the collected volume is inversely proportional to the square of the distance between the source and the sound collection position. That is, the farther the sound collection position is from the source, the lower the collected volume. The call "Ooi" from the operator, who is close to the sound collection position, is therefore collected as a sound with the amplitude shown in FIG. 2(a).
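To give a feel for the size of this effect, the inverse-square relationship can be turned into a level difference in decibels. The sketch below reads "volume" as sound intensity from an ideal point source in free field, which is an interpretation; the distances used are hypothetical and do not come from the patent.

```python
import math


def level_difference_db(near_distance_m, far_distance_m):
    """Level advantage (dB) of a near point source over an equally loud far one.

    Intensity falls as 1 / d**2, so the intensity ratio is (far / near)**2.
    """
    intensity_ratio = (far_distance_m / near_distance_m) ** 2
    return 10.0 * math.log10(intensity_ratio)


# Hypothetical scene: operator 0.5 m from the microphones, subject 5 m away.
advantage = level_difference_db(0.5, 5.0)  # roughly 20 dB in favour of the operator
```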

On the other hand, the reply "Haai" from the subject, who is farther from the sound collection position than the operator, is collected as a sound with a smaller amplitude than the operator's voice, as shown in FIG. 2(b). With an ordinary audio recording method, the recorded sound is then simply the superposition of the operator's call and the subject's reply, as shown in FIG. 2(c).

In the sound shown in FIG. 2(c), however, the operator's call "Ooi" dominates and the subject's reply "Haai" is buried. Similarly, operation noise caused by the operator is recorded at a relatively high level compared with the sound emitted by the subject. The sound emitted by the subject is thus masked by operator-originated sound, and in many cases it cannot be recorded at the volume balance the operator intended.

The audio recording apparatus 10 according to the present embodiment was devised with this problem in mind. It can suppress the volume ratio of operator-originated sound and record the sound emitted by the subject and the operator-originated sound at an appropriate volume balance. The detailed configuration and operation of the audio recording apparatus 10 are described below.

FIG. 3 is a functional block diagram showing the configuration of the audio recording apparatus 10 as an example of the sound processing apparatus according to the present embodiment. The audio recording apparatus 10 includes a sound collection unit 110, a sound determination unit 120, a sound source separation unit 140, a sound mixing unit 150, a recording unit 160, and a storage unit 170. Although FIG. 1 shows a video camera as the audio recording apparatus 10, the apparatus is not limited to a video camera and may be an information processing apparatus such as a PC (Personal Computer), a mobile phone, a PHS (Personal Handyphone System) terminal, a portable audio processing device, a portable video processing device, a PDA (Personal Digital Assistant), a home game console, or a portable game device.

The sound collection unit 110 collects sound and discretely quantizes it. The sound collection unit 110 includes two or more physically separate sound collectors (for example, microphones). In the example shown in FIG. 3, it includes two: one that collects the left sound L and one that collects the right sound R. The sound collection unit 110 outputs the discretely quantized left sound L and right sound R as the input sound to the sound determination unit 120 and the sound source separation unit 140.

The sound determination unit 120 determines whether the input sound supplied by the sound collection unit 110 includes a nearby sound (the first sound) emitted from the vicinity of the audio recording apparatus 10, such as the operator's voice or noise caused by the operator's actions. The detailed configuration of the sound determination unit 120 is described with reference to FIG. 4.

FIG. 4 is a functional block diagram showing the configuration of the sound determination unit 120. The sound determination unit 120 includes a volume detection unit 122 consisting of a volume detector 124, an average volume detector 126, and a maximum volume detector 128; a sound quality detection unit 130 consisting of a spectrum detector 132 and a sound quality detector 134; a distance/direction estimator 136; and an operator voice estimator 138. For clarity of the drawing, FIG. 4 shows the left sound L and the right sound R together as the input sound.

The volume detector 124 detects the volume value sequence (amplitude) of the input sound, which is supplied in frames of a predetermined length (for example, several tens of milliseconds), and outputs the detected sequence to the average volume detector 126, the maximum volume detector 128, the sound quality detector 134, and the distance/direction estimator 136.

The average volume detector 126 detects the average volume of the input sound, for example for each frame, from the per-frame volume value sequence supplied by the volume detector 124. It outputs the detected average volume to the sound quality detector 134 and the operator voice estimator 138.

The maximum volume detector 128 detects the maximum volume of the input sound, for example for each frame, from the per-frame volume value sequence supplied by the volume detector 124. It outputs the detected maximum volume to the sound quality detector 134 and the operator voice estimator 138.
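The three volume detectors can be pictured as simple per-frame statistics. In the sketch below, "volume value" is taken to be the absolute sample amplitude and the frame length is given in samples; both are assumptions, as are the function names and the sample data.

```python
def frame_volumes(samples, frame_len):
    """Split the input into whole frames and return each frame's volume
    value sequence (absolute amplitudes). A trailing partial frame is dropped."""
    return [[abs(s) for s in samples[i:i + frame_len]]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def average_volume(frame):
    """Per-frame average of the volume value sequence."""
    return sum(frame) / len(frame)


def max_volume(frame):
    """Per-frame maximum of the volume value sequence."""
    return max(frame)


# Made-up input: two frames of four samples each.
samples = [0.1, -0.3, 0.2, -0.2, 0.5, -0.5, 0.0, 0.4]
frames = frame_volumes(samples, frame_len=4)
averages = [average_volume(f) for f in frames]
maxima = [max_volume(f) for f in frames]
```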

The spectrum detector 132 applies, for example, FFT (Fast Fourier Transform) processing to the input sound and detects each spectral component of the input sound in the frequency domain. It outputs the detected spectrum to the sound quality detector 134 and the distance/direction estimator 136.
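As an illustration of what the spectrum detector produces, the sketch below computes a magnitude spectrum with a direct discrete Fourier transform. A direct DFT is used only to keep the example dependency-free; it yields the same values as an FFT but runs in O(n^2) time. The sample rate, frame length, and test tone are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. n/2 for one real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical frame: 64 samples of a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
n = 64
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]

spec = magnitude_spectrum(frame)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
peak_hz = peak_bin * sample_rate / n  # bin index converted to a frequency in Hz
```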

The sound quality detector 134 receives the input sound, the average volume, the maximum volume, and the spectrum. From these inputs it detects properties of the input sound such as human-voice likeness, music likeness, stationarity, and impulsiveness, and outputs them to the operator voice estimator 138. Human-voice likeness may be information indicating whether part or all of the input sound matches a human voice, or how closely it approximates one. Music likeness may be information indicating whether part or all of the input sound is music, or how closely it approximates music.

Stationarity refers to the property that the statistical characteristics of a sound change little over time, as with air-conditioning noise. Impulsiveness refers to a strongly noise-like property in which energy is concentrated in a short time, as with a striking sound or a bursting sound.

For example, the sound quality detector 134 can detect human-voice likeness based on the degree of match between the spectral distribution of the input sound and the spectral distribution of a human voice. The sound quality detector 134 may also compare the maximum volume values of the frames and judge that a frame's impulsiveness is higher the larger its maximum volume value is relative to those of other frames.
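One way to turn the frame-to-frame comparison of maximum volume values into a number is sketched below. The score (a frame's maximum divided by the median of the other frames' maxima) and the name `impulsiveness` are assumptions made for illustration; the text only says that a larger maximum relative to other frames indicates higher impulsiveness.

```python
from statistics import median


def impulsiveness(frame_maxima, index):
    """Ratio of one frame's maximum volume to the median maximum of the other frames."""
    others = frame_maxima[:index] + frame_maxima[index + 1:]
    if not others:
        return 0.0
    baseline = median(others)
    if baseline <= 0:
        # Any sound against a silent background is treated as maximally impulsive.
        return float("inf") if frame_maxima[index] > 0 else 0.0
    return frame_maxima[index] / baseline
```

A score near 1 means the frame is about as loud as its neighbours, while a much larger score suggests a short burst of energy.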

Note that the sound quality detector 134 may analyze the sound quality of the input sound using signal processing techniques such as the zero-crossing method or LPC (Linear Predictive Coding) analysis. Because the zero-crossing method yields the fundamental period of the input sound, the sound quality detector 134 may detect human-voice likeness based on whether that fundamental period falls within the range of fundamental periods of the human voice (for example, the range corresponding to 100 to 200 Hz).
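A minimal sketch of the zero-crossing check follows. It assumes a clean, roughly periodic signal with two sign changes per cycle; the function names and the default 100 to 200 Hz bounds (taken from the example range in the text) are illustrative, not the detector's actual implementation.

```python
import math


def zero_crossing_frequency(samples, sample_rate):
    """Estimate the fundamental frequency from the number of sign changes."""
    crossings = sum(
        1
        for a, b in zip(samples, samples[1:])
        if (a < 0 <= b) or (b < 0 <= a)
    )
    duration = (len(samples) - 1) / sample_rate
    # A periodic waveform crosses zero twice per cycle.
    return crossings / (2.0 * duration) if duration > 0 else 0.0


def in_voice_pitch_range(freq_hz, low_hz=100.0, high_hz=200.0):
    """True if the estimated frequency lies in the assumed voice pitch range."""
    return low_hz <= freq_hz <= high_hz
```

For example, a 150 Hz sine sampled at 8 kHz for one second yields an estimate close to 150 Hz, which falls inside the range; noisy or harmonically rich signals would generally need filtering before such a count is meaningful.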

The distance direction estimator 136 receives the input sound, the volume value sequence of the input sound, the spectrum of the input sound, and the like. It functions as a position information calculation unit that, based on these inputs, estimates position information, such as direction information and distance information, of the sound source of the input sound or of the sound source that emitted the dominant sound contained in the input sound. By combining methods that estimate the position of the sound source from the phase, volume, volume value sequence, past average volume value, maximum volume value, and so on of the input sound, the distance direction estimator 136 can estimate the sound source position comprehensively even when reverberation or reflection of sound by the main body of the video recording apparatus 10 has a large effect. An example of how the distance direction estimator 136 estimates direction information and distance information is described with reference to FIGS. 5 to 8.

FIG. 5 is an explanatory diagram showing how the sound source position of the input sound is estimated based on the phase difference between two input sounds. Assuming that the sound source is a point source, the phase of each input sound reaching the microphone M1 and the microphone M2, which constitute the sound collection unit 110, and the phase difference between those input sounds can be measured. Furthermore, from the phase difference, the frequency f of the input sound, and the speed of sound c, the difference between the distance from the microphone M1 to the sound source and the distance from the microphone M2 to the sound source can be calculated. The sound source lies on the set of points for which this distance difference is constant. Such a set of points is known to form a hyperbola.
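The conversion from phase difference to distance difference can be sketched as below, using the standard relation that one full cycle of phase corresponds to one wavelength c / f. The function name and the default speed of sound (343 m/s, roughly air at room temperature) are assumptions for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def path_difference(phase_diff_rad, freq_hz, c=SPEED_OF_SOUND_M_S):
    """Distance difference (m) implied by a phase difference at frequency freq_hz."""
    wavelength = c / freq_hz
    return wavelength * phase_diff_rad / (2.0 * math.pi)
```

For instance, a phase difference of pi radians at 1 kHz with c = 340 m/s corresponds to half a wavelength, that is, 0.17 m.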

For example, assume that the microphone M1 is located at (x1, 0) and the microphone M2 is located at (x2, 0) (this assumption does not lose generality). If a point on the set of candidate sound source positions is denoted (x, y) and the above distance difference is denoted d, the following Formula 1 holds.

Furthermore, Formula 1 can be expanded as in Formula 2, and rearranging Formula 2 yields Formula 3, which represents a hyperbola.
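The equation images for Formulas 1 to 3 are not reproduced in this text. As a hedged reconstruction from the definitions above (microphones at (x1, 0) and (x2, 0), distance difference d with 0 < d < |x2 - x1|), the constant-difference condition and the standard hyperbola it leads to can be written as follows. The auxiliary symbols x0, a, and b are introduced here for illustration and may not match the notation of the original formulas.

```latex
\begin{align}
  \left| \sqrt{(x - x_1)^2 + y^2} - \sqrt{(x - x_2)^2 + y^2} \right| &= d
    && \text{(constant distance difference)} \\
  \frac{(x - x_0)^2}{a^2} - \frac{y^2}{b^2} &= 1
    && \text{(hyperbola with foci at the microphones)} \\
  x_0 = \frac{x_1 + x_2}{2}, \quad a = \frac{d}{2}, \quad
  b^2 &= \left( \frac{x_2 - x_1}{2} \right)^2 - \left( \frac{d}{2} \right)^2
\end{align}
```

The second line follows from the first by isolating one square root, squaring twice, and collecting terms, which is the usual derivation of a hyperbola from its two-focus definition.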

In addition, the distance direction estimator 136 can determine which of the microphone M1 and the microphone M2 the sound source is closer to, based on the volume difference between the input sounds collected by the microphone M1 and the microphone M2. It can therefore determine, for example, that the sound source lies on the hyperbola 1 near the microphone M2, as shown in FIG. 5.

なお、位相差算出に用いる入力音声の周波数ｆは、マイクロホンＭ１およびマイクロホンＭ２間の距離に対して下記の数式４の条件を満たす必要がある。

The frequency f of the input sound used for the phase difference calculation needs to satisfy the condition of the following formula 4 with respect to the distance between the microphone M1 and the microphone M2.
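Formula 4 itself is not reproduced in this text. A condition commonly imposed so that the inter-microphone phase difference remains unambiguous is that the microphone spacing be less than half a wavelength. Assuming that this is the intended condition, and writing D (a symbol introduced here for illustration) for the distance between the microphone M1 and the microphone M2, it would read:

```latex
f < \frac{c}{2D}
```

Under this assumption, a spacing of D = 2 cm with c = 343 m/s would limit usable frequencies to below roughly 8.6 kHz.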

FIG. 6 is an explanatory diagram showing how the sound source position of the input sound is estimated based on the phase differences among three input sounds. Assume the arrangement of the microphone M3, the microphone M4, and the microphone M5 constituting the sound collection unit 110 shown in FIG. 6. If the phase of the input sound reaching the microphone M5 lags behind the phases of the input sound reaching the microphone M3 and the microphone M4, the distance direction estimator 136 can determine that the sound source is located on the opposite side from the microphone M5 with respect to the straight line 1 connecting the microphone M4 and the microphone M5 (front/back determination).

Furthermore, the distance direction estimator 136 can calculate a hyperbola 2 on which the sound source may lie based on the phase difference between the input sounds reaching the microphone M3 and the microphone M4, and a hyperbola 3 on which the sound source may lie based on the phase difference between the input sounds reaching the microphone M4 and the microphone M5. As a result, the distance direction estimator 136 can estimate the intersection P1 of the hyperbola 2 and the hyperbola 3 as the sound source position.

図７は、２つの入力音声の音量に基づいて入力音声の音源位置を推定する様子を示した説明図である。音源が点音源であると仮定すると、逆二乗則よりある点で観測される音量は距離の二乗に反比例する。図７に示したような音声収音部１１０を構成するマイクロホンＭ６およびマイクロホンＭ７を想定した場合、マイクロホンＭ６およびマイクロホンＭ７に到達する音量比が一定となる点の集合は円となる。距離方向推定器１３６は、音量検出器１２４から入力される音量の値から音量比を求め、音源の存在する円の半径及び中心位置を算出できる。 FIG. 7 is an explanatory diagram showing a state in which the sound source position of the input sound is estimated based on the volumes of the two input sounds. Assuming that the sound source is a point sound source, the sound volume observed at a certain point is inversely proportional to the square of the distance according to the inverse square law. Assuming the microphone M6 and the microphone M7 constituting the sound pickup unit 110 as shown in FIG. 7, the set of points at which the volume ratio reaching the microphone M6 and the microphone M7 is constant is a circle. The distance direction estimator 136 calculates the volume ratio from the volume value input from the volume detector 124, and can calculate the radius and center position of the circle where the sound source exists.

As shown in FIG. 7, when the microphone M6 is located at (x3, 0) and the microphone M7 is located at (x4, 0) (this assumption does not lose generality), and a point on the set of candidate sound source positions is denoted (x, y), the distances r1 and r2 from the respective microphones to the sound source can be expressed as in the following Formula 5.

ここで、逆二乗則より以下の数式６が成り立つ。

Here, the following formula 6 is established from the inverse square law.

数式６は正の定数ｄ（例えば４）を用いて数式７にように変形される。

Formula 6 is transformed into Formula 7 using a positive constant d (for example, 4).

数式７をｒ１およびｒ２に代入し、整理すると以下の数式８が導かれる。

Substituting Equation 7 into r1 and r2 and rearranging it leads to Equation 8 below.

数式８より、距離方向推定器１３６は、図７に示したように、中心の座標が数式９で表され半径が数式１０で表される円１上に音源が存在すると推定できる。

From Equation 8, the distance / direction estimator 136 can estimate that the sound source exists on the circle 1 whose center coordinates are expressed by Equation 9 and whose radius is expressed by Equation 10, as shown in FIG.
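The equation images for Formulas 5 to 10 are not reproduced in this text. The derivation can be sketched as follows under one assumption: that the positive constant d denotes the ratio of squared distances r2^2 / r1^2, with d not equal to 1 (the text does not state which way the ratio is taken). V1 and V2, the volumes observed at M6 and M7, are symbols introduced here, and the correspondence to the original formula numbers is tentative.

```latex
\begin{align}
  r_1 &= \sqrt{(x - x_3)^2 + y^2}, \qquad r_2 = \sqrt{(x - x_4)^2 + y^2}
    && \text{(cf. Formula 5)} \\
  \frac{V_1}{V_2} &= \frac{r_2^2}{r_1^2}
    && \text{(inverse square law, cf. Formula 6)} \\
  r_2^2 &= d \, r_1^2
    && \text{(cf. Formula 7)} \\
  \left( x - \frac{d x_3 - x_4}{d - 1} \right)^2 + y^2
    &= \frac{d \, (x_4 - x_3)^2}{(d - 1)^2}
    && \text{(cf. Formula 8)} \\
  \text{center} &= \left( \frac{d x_3 - x_4}{d - 1},\; 0 \right), \qquad
  \text{radius} = \frac{\sqrt{d}\, \lvert x_4 - x_3 \rvert}{\lvert d - 1 \rvert}
    && \text{(cf. Formulas 9 and 10)}
\end{align}
```

The fourth line is obtained by substituting the expressions for r1 and r2 into the third line, dividing by (1 - d), and completing the square in x. When d = 1 the locus degenerates into the perpendicular bisector of the two microphones rather than a circle.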

FIG. 8 is an explanatory diagram showing how the sound source position of the input sound is estimated based on the volumes of three input sounds. Assume the arrangement of the microphone M3, the microphone M4, and the microphone M5 constituting the sound collection unit 110 shown in FIG. 8. If the phase of the input sound reaching the microphone M5 lags behind the phases of the input sound reaching the microphone M3 and the microphone M4, the distance direction estimator 136 can determine that the sound source is located on the opposite side from the microphone M5 with respect to the straight line 2 connecting the microphone M4 and the microphone M5 (front/back determination).

さらに、距離方向推定器１３６は、マイクロホンＭ３およびマイクロホンＭ４の各々に到達する入力音声の音量比に基づいて音源が存在し得る円２を算出し、マイクロホンＭ４およびマイクロホンＭ５の各々に到達する入力音声の音量比に基づいて音源が存在し得る円３を算出することができる。その結果、距離方向推定器１３６は、円２および円３の交点Ｐ２を音源位置として推定することができる。なお、４つ以上のマイクロホンを使用した場合には、距離方向推定器１３６は、空間的な音源の配置を含め、より精度の高い推定が可能となる。 Further, the distance direction estimator 136 calculates a circle 2 in which a sound source can exist based on the volume ratio of the input sound that reaches each of the microphone M3 and the microphone M4, and the input sound that reaches each of the microphone M4 and the microphone M5. The circle 3 where the sound source can exist can be calculated based on the volume ratio. As a result, the distance direction estimator 136 can estimate the intersection P2 of the circles 2 and 3 as the sound source position. When four or more microphones are used, the distance / direction estimator 136 can perform estimation with higher accuracy including spatial arrangement of sound sources.
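The two-circle estimate can be sketched as follows. The sketch assumes ideal inverse-square attenuation and defines each ratio d as the squared distance to the second microphone divided by the squared distance to the first; the function names and the microphone coordinates in the usage note are hypothetical.

```python
import math


def ratio_circle(mic_a, mic_b, d):
    """Circle of points p with |p - mic_b|^2 = d * |p - mic_a|^2 (requires d != 1)."""
    cx = (d * mic_a[0] - mic_b[0]) / (d - 1.0)
    cy = (d * mic_a[1] - mic_b[1]) / (d - 1.0)
    spacing = math.hypot(mic_b[0] - mic_a[0], mic_b[1] - mic_a[1])
    radius = math.sqrt(d) * spacing / abs(d - 1.0)
    return (cx, cy), radius


def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles (empty list if they do not intersect)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > r1 + r2 or dist < abs(r1 - r2):
        return []
    # Distance from c1 to the chord joining the intersection points.
    a = (r1 * r1 - r2 * r2 + dist * dist) / (2.0 * dist)
    h = math.sqrt(max(r1 * r1 - a * a, 0.0))
    mx, my = c1[0] + a * dx / dist, c1[1] + a * dy / dist
    return [
        (mx + h * dy / dist, my - h * dx / dist),
        (mx - h * dy / dist, my + h * dx / dist),
    ]
```

With hypothetical microphones at (0, 0), (1, 0), and (0, 1) and a source at (1, 2), the ratios are 0.8 for the first pair and 0.5 for the second, and one of the two returned intersections coincides with the source. The other intersection is the kind of ambiguity that a front/back determination or additional microphones would resolve.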

As described above, the distance direction estimator 136 estimates the position of the sound source of the input sound based on the phase differences and volume ratios of the input sounds, and outputs the estimated direction information and distance information of the sound source to the operator voice estimator 138. Table 1 below summarizes the inputs and outputs of the components of the volume detection unit 122, the sound quality detection unit 130, and the distance direction estimator 136 described above.

Note that when sounds emitted from a plurality of sound sources are superimposed in the input sound, it is difficult for the distance direction estimator 136 to accurately estimate the position of the source of the dominant sound in the input sound. It can, however, estimate a position close to that source. Moreover, because the estimated sound source position may be used as an initial value for sound separation in the sound source separation unit 140, the audio recording apparatus 10 can operate as desired even if the position estimated by the distance direction estimator 136 contains an error.

Returning to the description of the configuration of the voice determination unit 120 with reference to FIG. 4, the operator voice estimator 138 comprehensively determines, based on at least one of the volume, sound quality, and position information of the input sound, whether the input sound contains a nearby sound emitted from a specific sound source in the vicinity of the audio recording apparatus 10, such as the operator's voice or noise caused by the operator's actions. The operator voice estimator 138 also functions as a voice determination unit: when it determines that the input sound contains a nearby sound, it outputs to the sound source separation unit 140 an indication that the input sound contains a nearby sound (operator voice presence information), the position information estimated by the distance direction estimator 136, and the like.

Specifically, the operator voice estimator 138 may determine that the input sound contains a nearby sound when the distance direction estimator 136 estimates that the sound source of the input sound is located behind the imaging direction of the imaging unit (not shown) that captures video, and the input sound has a sound quality that matches or approximates a human voice. Here, as shown in FIG. 9, the operator often operates the audio recording apparatus 10 from behind the imaging direction of the imaging unit, that is, from the rear left of the viewfinder (in normal shooting by a right-handed operator, excluding self-portrait shooting).

Therefore, when the sound source of the input sound is located behind the imaging direction of the imaging unit and the input sound has a sound quality that matches or approximates a human voice, the operator voice estimator 138 can determine that the input sound dominantly contains the operator's voice as a nearby sound. As a result, the sound mixing unit 150 described later can produce a mixed sound in which the volume ratio of the operator's voice is reduced.

The operator voice estimator 138 may also determine that the input sound contains a nearby sound emitted from a specific sound source when the sound source of the input sound is located within a set distance of the sound collection position (for example, in the vicinity of the audio recording apparatus 10, such as within 1 m of it), the input sound contains an impulse sound, and the input sound is louder than the past average volume. When the operator of the audio recording apparatus 10 presses a button on the apparatus or changes grip on it, impulse sounds such as a "click" or a "bang" often occur. Moreover, because such an impulse sound originates at the audio recording apparatus 10 itself, it is likely to be collected at a relatively high volume.

Therefore, when the sound source of the input sound is located within the set distance of the sound collection position, the input sound contains an impulse sound, and the input sound is louder than the past average volume, the operator voice estimator 138 can determine that the input sound dominantly contains noise caused by the operator's actions as a nearby sound. As a result, the sound mixing unit 150 described later can produce a mixed sound in which the volume ratio of the noise caused by the operator's actions is reduced.
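The two determination rules described above can be summarized in a small sketch. The feature container, the field names, the 0.5 voice-likeness threshold, and the return labels are all hypothetical; only the 1 m example distance comes from the text, and the actual estimator may weigh these inputs differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameFeatures:
    behind_imaging_direction: bool  # from the estimated direction information
    distance_m: float               # from the estimated distance information
    voice_likeness: float           # 0.0 to 1.0, from the sound quality detector
    impulsive: bool                 # whether an impulse sound was detected
    volume: float                   # current volume
    past_average_volume: float      # average volume over past frames


def classify_nearby_sound(
    f: FrameFeatures,
    voice_threshold: float = 0.5,   # assumed threshold
    near_distance_m: float = 1.0,   # example distance given in the text
) -> Optional[str]:
    """Return a label for the detected nearby sound, or None if none is detected."""
    if f.behind_imaging_direction and f.voice_likeness >= voice_threshold:
        return "operator_voice"
    if (
        f.distance_m <= near_distance_m
        and f.impulsive
        and f.volume > f.past_average_volume
    ):
        return "operator_noise"
    return None
```

A non-None result would correspond to outputting the operator voice presence information to the sound source separation unit 140.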

その他、操作者音声推定器１３８に入力される情報と、入力される情報に基づく操作者音声推定器１３８の判定結果の一例を以下の表２にまとめた。なお、近接センサー、温度センサーなどを組み合わせて用いて操作者音声推定器１３８における判定の精度をあげることも可能である。

In addition, Table 2 below summarizes examples of information input to the operator speech estimator 138 and determination results of the operator speech estimator 138 based on the input information. Note that it is possible to increase the accuracy of determination in the operator speech estimator 138 using a combination of a proximity sensor, a temperature sensor, and the like.

Returning to the description of the configuration of the audio recording apparatus 10 with reference to FIG. 3, when the operator voice presence information is input from the voice determination unit 120, the sound source separation unit 140 separates the input sound supplied by the sound collection unit 110 into a nearby sound, such as the operator's voice, and a sound collection target sound (second sound), such as the voice of a subject, that is, sound other than the nearby sound. The separation is based on the position information of the sound source input from the voice determination unit 120. As a result, the sound source separation unit 140 outputs twice as many sounds as it receives. FIG. 3 shows the sound source separation unit 140 receiving a left sound L and a right sound R as input sounds, outputting a left nearby sound L and a right nearby sound R as nearby sounds, and outputting a left sound collection target sound L and a right sound collection target sound R as sound collection target sounds.

Specifically, the sound source separation unit 140 functions as a sound separation unit that separates sound according to its source, using a technique based on independent component analysis (ICA), a technique that exploits the small overlap between the time-frequency components of sounds, or the like.

The sound mixing unit 150 mixes the nearby sound and the sound collection target sound input from the sound source separation unit 140 so that the volume ratio of the nearby sound in the mixed sound is lower than the volume ratio of the nearby sound in the input sound. With this configuration, when the volume of the nearby sound emitted from the specific sound source is unnecessarily high in the input sound, the sound mixing unit 150 can produce a mixed sound in which the volume ratio of the sound collection target sound is higher than its volume ratio in the input sound. As a result, the audio recording apparatus 10 can prevent the sound collection target sound from being buried in the nearby sound.

なお、音声混合部１５０は、入力される左近傍音声Ｌおよび左収音対象音声Ｌを混合して混合左音声Ｌを生成し、入力される右近傍音声Ｒおよび右収音対象音声Ｒを混合して混合右音声Ｒを生成し、混合左音声Ｌおよび混合右音声Ｒを混合音声として記録部１６０に出力する。 The sound mixing unit 150 generates a mixed left sound L by mixing the input left vicinity sound L and the left sound collection target sound L, and mixes the input right vicinity sound R and the right sound collection target sound R. Then, the mixed right sound R is generated, and the mixed left sound L and the mixed right sound R are output to the recording unit 160 as mixed sound.

The sound mixing unit 150 may also calculate an appropriate mixing ratio from the average volume ratio of the nearby sound and the sound collection target sound separated by the sound source separation unit 140, and mix the nearby sound and the sound collection target sound at the calculated mixing ratio. In addition, the sound mixing unit 150 may change the mixing ratio it applies only within a range in which the difference from the mixing ratio applied to the previous frame does not exceed a predetermined upper limit. The mixing ratio may also be set by the user.
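The mixing step and the per-frame limit on how fast the mixing ratio may change can be sketched as follows. Here the mixing ratio is modeled as a single gain applied to the nearby sound before it is added to the sound collection target sound; that representation and the function names are assumptions for illustration.

```python
def limit_ratio_change(previous_gain, target_gain, max_step):
    """Move from previous_gain toward target_gain by at most max_step per frame."""
    step = max(-max_step, min(max_step, target_gain - previous_gain))
    return previous_gain + step


def mix_frame(nearby, target_sound, nearby_gain):
    """Sample-wise mix in which only the nearby sound is scaled by nearby_gain."""
    return [nearby_gain * n + t for n, t in zip(nearby, target_sound)]
```

Limiting the step size avoids abrupt jumps in the level of the nearby sound between consecutive frames, which would otherwise be audible as pumping. The same pair of functions can be applied separately to the left and right channels.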

The recording unit 160 records the mixed sound input from the sound mixing unit 150 in the storage unit 170. The storage unit 170 may be a storage medium such as a nonvolatile memory, for example an EEPROM (Electrically Erasable Programmable Read-Only Memory) or an EPROM (Erasable Programmable Read-Only Memory); a magnetic disk, for example a hard disk or a disc-shaped magnetic disk; an optical disc, for example a CD-R (Compact Disc Recordable)/RW (ReWritable), DVD-R (Digital Versatile Disc Recordable)/RW/+R/+RW/RAM (Random Access Memory), or BD (Blu-ray Disc (registered trademark))-R/BD-RE; or an MO (Magneto-Optical) disk. The storage unit 170 can also store video data of the subject.

Thus, in the audio recording apparatus 10 according to the present embodiment, the recording unit 160 records in the storage unit 170 a mixed sound in which the volume ratio of the nearby sound is lower than the volume ratio of the nearby sound in the input sound. A playback apparatus that reproduces the mixed sound can therefore reproduce a mixed sound in which the volume ratio of the nearby sound has been adjusted, without needing a special volume correction function.

The configuration of the audio recording apparatus 10 according to the present embodiment has been described above. Next, the sound processing method executed in the audio recording apparatus 10 according to the present embodiment is described with reference to FIG. 10.

図１０は、本実施形態にかかる音声記録装置１０において実行される音声処理方法の流れを示したフローチャートである。まず、音声記録装置１０の音声収音部１１０は音声を収音する（Ｓ２１０）。入力音声が無かった場合には処理を終了し、入力音声があった場合には距離方向推定器１３６が入力音声の全体または一部が発せられた音源の距離や方向などの位置情報を推定する（Ｓ２３０）。 FIG. 10 is a flowchart showing the flow of the sound processing method executed in the sound recording apparatus 10 according to the present embodiment. First, the sound collection unit 110 of the sound recording apparatus 10 collects sound (S210). If there is no input voice, the process ends. If there is input voice, the distance / direction estimator 136 estimates position information such as the distance and direction of the sound source from which the whole or part of the input voice is emitted. (S230).

その後、操作者音声推定器１３８は入力音声に操作者の発した音声、または操作者の動作に起因するノイズなどの近傍音声が含まれているか否かを判定する（Ｓ２４０）。操作者音声推定器１３８により入力音声に近傍音声が含まれていると判定された場合、音源分離部１４０は、入力音声を近傍音声とそれ以外の収音対象音声とに分離する（Ｓ２５０）。 Thereafter, the operator voice estimator 138 determines whether or not the input voice includes a voice uttered by the operator or a nearby voice such as noise caused by the operation of the operator (S240). When the operator voice estimator 138 determines that the input voice includes the near voice, the sound source separation unit 140 separates the input voice into the near voice and the other voices to be collected (S250).

Subsequently, the sound mixing unit 150 mixes the nearby sound and the sound collection target sound separated by the sound source separation unit 140 at an arbitrary ratio to generate a mixed sound (S260). After S260, or when it is determined in S240 that the input sound contains no nearby sound such as the operator's voice or noise caused by the operator's actions, the recording unit 160 records the mixed sound or the input sound, respectively, in the storage unit 170 (S270).
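The flow of FIG. 10 as described in the text can be sketched for a single block of input as below. The processing stages are passed in as callables because their internals are described elsewhere; the function name `process_input` and the callable signatures are hypothetical.

```python
def process_input(input_sound, estimate_position, contains_nearby,
                  separate, mix, record):
    """Run one pass of the described flow; return False if there was no input."""
    if not input_sound:
        # No input sound after collection (S210): the processing ends.
        return False
    position = estimate_position(input_sound)               # S230
    if contains_nearby(input_sound, position):               # S240
        nearby, target = separate(input_sound, position)     # S250
        output = mix(nearby, target)                         # S260
    else:
        output = input_sound  # recorded unchanged when no nearby sound is found
    record(output)                                           # S270
    return True
```

The branch makes explicit that separation and mixing are skipped when no nearby sound is detected, so that the input sound is recorded as is.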

As described above, in the audio recording apparatus 10 according to the present embodiment, the sound source separation unit 140 separates the nearby sound emitted from the specific sound source and contained in the input sound, based on the position information of the sound source of the input sound estimated by the distance direction estimator 136. The sound mixing unit 150 then mixes the nearby sound with the sound collection target sound, which is the other sound contained in the input sound, so that the volume ratio of the nearby sound in the mixed sound is lower than the volume ratio of the nearby sound in the input sound.

Therefore, when the volume of the nearby sound emitted from the specific sound source is unnecessarily high in the input sound, the sound mixing unit 150 can produce a mixed sound in which the volume ratio of the sound collection target sound is higher than its volume ratio in the input sound. As a result, the audio recording apparatus 10 can relatively suppress the nearby sound and prevent the sound collection target sound from being buried in it. It can also record a high-quality mixed sound in which the influence of nearby sounds contained in the input sound, such as the operator's voice and noise, has been reduced or removed.

また、音声記録装置１０は、記憶部１７０に近傍音声が占める音量比率が入力音声に占める近傍音声の音量比率より低減された混合音声を記録できる。したがって、該混合音声を再生する再生装置に特殊な音量補正機能を実装することなく、該再生装置において近傍音声の占める音量比率が調整された混合音声を再生することが可能となる。 In addition, the audio recording device 10 can record the mixed sound in which the volume ratio occupied by the near voice in the storage unit 170 is lower than the volume ratio of the near voice occupied in the input sound. Therefore, it is possible to reproduce the mixed sound in which the volume ratio of the neighboring sound is adjusted in the reproducing apparatus without implementing a special volume correction function in the reproducing apparatus that reproduces the mixed sound.

In addition, because the audio recording apparatus 10 according to the present embodiment processes the input sound in software and can record a mixed sound in which the volume ratio between the nearby sound and the sound collection target sound has been adjusted, the hardware scale, such as the number of microphones, can be reduced.

(Second Embodiment)
Next, the audio reproduction device 11 according to the second embodiment of the present invention is described. The audio reproduction device 11 according to the present embodiment can reproduce a mixed sound in which the volume ratio of the nearby sound contained in already stored sound has been adjusted. The configuration of the audio reproduction device 11 is described below with reference to FIG. 11.

図１１は、本実施形態にかかる音声再生装置１１の構成を示した機能ブロック図である。本実施形態にかかる音声再生装置１１は、音声判定部１２０と、音源分離部１４０と、音声混合部１５０と、記憶部１７２と、再生部１７４と、音声出力部１８０と、を備える。
なお、本実施形態の説明においては、第１の実施形態で説明した内容と実質的に同一である構成については説明を省略し、第１の実施形態と異なる構成に重きをおいて説明する。 FIG. 11 is a functional block diagram showing the configuration of the audio playback device 11 according to the present embodiment. The audio reproduction device 11 according to the present embodiment includes an audio determination unit 120, a sound source separation unit 140, an audio mixing unit 150, a storage unit 172, a reproduction unit 174, and an audio output unit 180.
In the description of the present embodiment, the description of the configuration that is substantially the same as the content described in the first embodiment will be omitted, and the description will be given with a focus on the configuration that is different from the first embodiment.

記憶部１７２は、音声の記録機能を有する任意の装置において記録された音声を記憶している。再生部１７４は、記憶部１７２が記憶している音声を読み出し、必要に応じてデコードを行なう。そして、再生部１７４は、記憶部１７２が記憶している音声を音声判定部１２０および音源分離部１４０に出力する。音声判定部１２０および音源分離部１４０は、再生部１７４からの出力を入力音声として扱い、第１の実施形態で説明した内容と実質的に同一な処理を行う。 The storage unit 172 stores voice recorded in any device having a voice recording function. The reproduction unit 174 reads the sound stored in the storage unit 172 and decodes it as necessary. Then, the reproduction unit 174 outputs the audio stored in the storage unit 172 to the audio determination unit 120 and the sound source separation unit 140. The sound determination unit 120 and the sound source separation unit 140 treat the output from the reproduction unit 174 as input sound, and perform processing substantially the same as the content described in the first embodiment.

The sound output unit 180 outputs the mixed sound produced by the sound mixing unit 150. The sound output unit 180 may be, for example, a speaker or an earphone. Like the storage unit 170 in the first embodiment, the storage unit 172 according to the present embodiment may be a storage medium such as a nonvolatile memory, for example an EEPROM or an EPROM; a magnetic disk, for example a hard disk or a disc-shaped magnetic disk; an optical disc, for example a CD-R/RW, DVD-R/RW/+R/+RW/RAM, or BD (Blu-ray Disc (registered trademark))-R/BD-RE; or an MO disk.

As described above, in the audio reproduction device 11 according to the present embodiment, the audio determination unit 120, the sound source separation unit 140, and the audio mixing unit 150 generate mixed audio from the input audio supplied by the reproduction unit 174, and the mixed audio can be output as reproduced audio. It is therefore possible to reproduce mixed audio in which the volume ratio of the nearby sound has been adjusted, without implementing a special volume correction function in the recording device that records the input audio in the storage unit 172. It is also possible to output high-quality mixed audio in which the influence of nearby sound, such as the operator's voice and noise, is reduced or removed.
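The playback-side flow of the second embodiment can be sketched as follows. This is only an illustration under stated assumptions: the function name `reproduce_mixed` and the three callables (`contains_near`, `separate`, `mix`) are hypothetical stand-ins for the audio determination unit 120, the sound source separation unit 140, and the audio mixing unit 150, and passing a frame through unchanged when no nearby sound is detected is an assumption, not something this section specifies.

```python
def reproduce_mixed(stored_frames, contains_near, separate, mix):
    """Sketch of the playback flow: determine -> separate -> mix.

    stored_frames: frames read (and decoded) from storage.
    contains_near: hypothetical predicate standing in for the determination unit.
    separate:      hypothetical function returning (near, target) for a frame.
    mix:           hypothetical function combining (target, near) into one frame.
    """
    output = []
    for frame in stored_frames:
        if contains_near(frame):
            # Nearby sound detected: split the frame, then remix it.
            near, target = separate(frame)
            output.append(mix(target, near))
        else:
            # Assumption: frames without nearby sound pass through unchanged.
            output.append(frame)
    return output
```

Because all processing happens after the reproduction unit, the recording device needs no special support, which is the advantage the paragraph above points out.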

(Third embodiment)
Next, an audio reproduction device 12 according to the third embodiment of the present invention will be described. When AGC (Auto Gain Control) has been applied to the input audio, the audio reproduction device 12 according to the present embodiment can reverse-correct the volume of the target sound (the sound to be collected) contained in the input audio and thereby emphasize (boost) the target sound. The configuration and operation of the audio reproduction device 12 are described below with reference to FIGS. 12 and 13.

FIG. 12 is a functional block diagram showing the configuration of the audio reproduction device 12 according to the present embodiment. The audio reproduction device 12 includes an audio determination unit 120, a sound source separation unit 140, an audio mixing unit 150, a storage unit 172, a reproduction unit 174, an audio output unit 180, and a volume correction unit 190.
In the description of this embodiment, configurations that are substantially the same as those described in the second embodiment are omitted, and the description focuses on the configurations that differ from the second embodiment.

The storage unit 172 according to the present embodiment stores audio to which AGC (volume correction) has been applied in part or in whole. Here, AGC is a compressor mechanism that automatically lowers the volume level in response to an excessively loud input, one purpose of which is to prevent clipping distortion. The volume of audio subjected to such AGC is described with reference to FIG. 13.

FIG. 13 is an explanatory diagram contrasting the volume of audio before AGC is applied (the original sound) with the volume after AGC is applied. When the volume of the audio before AGC exceeds a threshold th, the AGC compresses the volume to a predetermined ratio within the time set as the attack time. The example in FIG. 13 shows a case in which the pre-AGC volume is compressed to roughly 1/2 to 2/3 within the attack time. Thereafter, when the pre-AGC volume falls below the threshold th, the AGC is released within the time set as the release time.
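The AGC behaviour described for FIG. 13 can be illustrated with a minimal sketch. The function name `apply_agc`, the per-frame representation of volume, and the linear gain ramp are all assumptions made for illustration; the text only states that compression reaches a predetermined ratio within the attack time and is released within the release time.

```python
def apply_agc(levels, threshold, ratio, attack_frames, release_frames):
    """Return per-frame output levels after a simple AGC-style compressor.

    levels:         per-frame input volume (linear, non-negative).
    threshold:      level `th` above which compression engages.
    ratio:          gain reached while compressing (0.5 halves the volume).
    attack_frames:  frames taken to ramp the gain from 1.0 down to `ratio`.
    release_frames: frames taken to ramp the gain back up to 1.0.
    """
    gain = 1.0
    attack_step = (1.0 - ratio) / max(attack_frames, 1)
    release_step = (1.0 - ratio) / max(release_frames, 1)
    out = []
    for level in levels:
        if level > threshold:
            # Attack: ramp the gain down, but never below the target ratio.
            gain = max(ratio, gain - attack_step)
        else:
            # Release: ramp the gain back up, but never above unity.
            gain = min(1.0, gain + release_step)
        out.append(level * gain)
    return out
```

Note that the gain is applied to the whole frame, so any weak component mixed into a loud frame is attenuated by the same factor.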

Here, when the volume exceeds the threshold th and the AGC operates, it is in most cases because an excessively loud nearby sound was input from the vicinity of the recording device. In other words, the AGC is rarely triggered by the target sound from a distant sound source. However, because the AGC compresses the input audio as a whole, there is a problem that not only the nearby sound contained in the input audio but also the target sound, which is weak to begin with, is compressed further.

The audio reproduction device 12 according to the present embodiment was created with the above problem in view. Using the function of the volume correction unit 190, the audio reproduction device 12 can boost the target sound even when AGC has been applied to the input audio.

The volume correction unit 190 detects, from changes in the volume of the nearby sound separated by the sound source separation unit 140, the attack time at which AGC is presumed to have been applied, and scans the section corresponding to that attack time in the target sound separated by the sound source separation unit 140. The target sound may contain background environmental sound, sound emitted by the subject, and so on; when it contains only background environmental sound, its volume level can be approximated as nearly constant. The volume correction unit 190 can therefore determine that AGC was applied to any section in which the volume of the target sound changes by a predetermined level or more.
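The detection step can be sketched as two small helpers. Both names (`attack_onsets`, `find_agc_sections`) and their parameters are hypothetical, and the sketch assumes the background-only case in which the target sound has a known, nearly constant baseline level.

```python
def attack_onsets(near_levels, threshold):
    """Frames at which the nearby sound first rises above `threshold`.

    These mark where AGC is presumed to have started acting.
    """
    return [
        i for i in range(len(near_levels))
        if near_levels[i] > threshold
        and (i == 0 or near_levels[i - 1] <= threshold)
    ]


def find_agc_sections(target_levels, baseline, min_change):
    """Return (start, end) frame ranges, end exclusive, in which the target
    level deviates from `baseline` by at least `min_change`."""
    sections = []
    start = None
    for i, level in enumerate(target_levels):
        changed = abs(level - baseline) >= min_change
        if changed and start is None:
            start = i                      # a deviating section begins
        elif not changed and start is not None:
            sections.append((start, i))    # the section ended at frame i
            start = None
    if start is not None:
        sections.append((start, len(target_levels)))
    return sections
```

In practice the scan would be limited to the neighbourhood of the onsets reported by `attack_onsets`; the sketch scans the whole sequence for brevity.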

The volume correction unit 190 can then perform reverse correction, adjusting the volume of that section of the target sound so that it is approximately equal to the volume of the sections immediately before and after it, and thereby boost the target sound.
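One simple way to realize such a reverse correction is to scale the affected section by a single gain so that its mean level matches that of the neighbouring frames. The sketch below assumes this uniform-gain approach; the function name `reverse_correct` and the `context` parameter are hypothetical, and a real implementation might instead apply a time-varying gain that follows the attack and release ramps.

```python
def reverse_correct(target_levels, start, end, context=3):
    """Scale frames [start, end) so that their mean level matches the mean of
    up to `context` frames immediately before and after the section."""
    levels = list(target_levels)
    neighbors = levels[max(0, start - context):start] + levels[end:end + context]
    section = levels[start:end]
    if not neighbors or not section:
        return levels                      # nothing to compare against
    section_mean = sum(section) / len(section)
    if section_mean == 0:
        return levels                      # silence cannot be rescaled
    # Gain that lifts the section to the level of its surroundings.
    gain = (sum(neighbors) / len(neighbors)) / section_mean
    return levels[:start] + [x * gain for x in section] + levels[end:]
```

For example, a section that was halved by AGC while its surroundings stayed at level 2.0 is restored to 2.0 by a gain of 2.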

Note that retaining the estimated attack time and release time, together with the degree of reverse correction performed by the volume correction unit 190, is useful when the target sound contains sound emitted by the subject. That is, even in that case, the volume correction unit 190 detects the attack time from the nearby sound and scans the volume values of the target sound before and after the section corresponding to that attack time. If the scan shows that the volume value changes over a time width that matches the attack time or the release time, the volume correction unit 190 can determine that the AGC operated and perform reverse correction.

The audio mixing unit 150 can generate mixed audio by mixing the target sound whose volume has been reverse-corrected by the volume correction unit 190 with the nearby sound separated by the sound source separation unit 140, at a volume ratio that suppresses the share of the nearby sound in the whole.
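Mixing at a chosen volume ratio can be sketched as a weighted sum of the two separated signals. The function name `mix` and the `near_ratio` parameter are hypothetical; the text does not prescribe how the ratio is represented, so a single weight between 0 and 1 is assumed here.

```python
def mix(target, near, near_ratio):
    """Mix two equal-length sample sequences.

    near_ratio: share of the nearby sound in the result, between 0 and 1.
                A small value suppresses the nearby sound.
    """
    if not 0.0 <= near_ratio <= 1.0:
        raise ValueError("near_ratio must be between 0 and 1")
    if len(target) != len(near):
        raise ValueError("target and near must have the same length")
    # Weighted sum: the target sound gets the remaining share.
    return [(1.0 - near_ratio) * t + near_ratio * n for t, n in zip(target, near)]
```

Setting `near_ratio` to 0 removes the nearby sound entirely, while a small non-zero value keeps it audible but subdued.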

As described above, when the volume of the input audio as a whole has been suppressed because the nearby sound was excessively loud, so that the volume of the target sound has also been suppressed, the audio reproduction device 12 according to the third embodiment of the present invention can increase the volume of the target sound according to the degree to which the input audio was suppressed, and thus prevent the target sound from becoming too quiet.

Although the present embodiment has been described for the case in which the volume correction unit 190 is provided in the audio reproduction device 12, the volume correction unit 190 may instead be provided in the audio recording device 10 described in the first embodiment. In that case, even if AGC has been applied to the input audio, mixed audio containing the target sound boosted according to the degree of AGC can be recorded in the storage unit 170.

Preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but it goes without saying that the present invention is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes or modifications within the scope of the claims, and these are naturally understood to belong to the technical scope of the present invention.

For example, the steps in the processing of the audio recording device 10 in this specification need not be processed in time series in the order described in the flowchart; they may include processes executed in parallel or individually (for example, parallel processing or object-based processing).

FIG. 3 shows an example in which the audio determination unit 120 determines whether the input audio collected by the sound collection unit 110 contains nearby sound, but the present invention is not limited to this example. For example, the audio determination unit 120 may receive the sounds separated by the sound source separation unit 140, estimate the sound source position of each separated sound, determine whether the separated sounds include nearby sound, and output the separated sounds to the audio mixing unit 150. In this case, the sound source separation unit 140 separates the audio into individual sound sources blindly, without initial values.

It is also possible to create a computer program that causes hardware such as the CPU, ROM, and RAM built into the audio recording device 10, the audio reproduction device 11, and the audio reproduction device 12 to perform functions equivalent to those of the respective components of these devices described above. A storage medium storing the computer program is also provided. In addition, by implementing each functional block shown in the functional block diagrams of the audio recording device 10, the audio reproduction device 11, and the audio reproduction device 12 in hardware, the series of processes can be realized in hardware.

FIG. 1 is an explanatory diagram showing an example of a scene in which the audio recording device according to the first embodiment of the present invention is used.
FIG. 2 is an explanatory diagram showing the time-domain amplitude of audio recorded by an ordinary audio recording method.
FIG. 3 is a functional block diagram showing the configuration of an audio recording device as an example of the audio processing device according to the first embodiment.
FIG. 4 is a functional block diagram showing the configuration of the audio determination unit.
FIG. 5 is an explanatory diagram showing how the sound source position of the input audio is estimated based on the phase difference between two input audio signals.
FIG. 6 is an explanatory diagram showing how the sound source position of the input audio is estimated based on the phase differences among three input audio signals.
FIG. 7 is an explanatory diagram showing how the sound source position of the input audio is estimated based on the volumes of two input audio signals.
FIG. 8 is an explanatory diagram showing how the sound source position of the input audio is estimated based on the volumes of three input audio signals.
FIG. 9 is an explanatory diagram showing the positional relationship between the audio recording device and the operator.
FIG. 10 is a flowchart showing the flow of the audio processing method executed in the audio recording device according to the first embodiment.
FIG. 11 is a functional block diagram showing the configuration of the audio reproduction device according to the second embodiment of the present invention.
FIG. 12 is a functional block diagram showing the configuration of the audio reproduction device according to the third embodiment of the present invention.
FIG. 13 is an explanatory diagram contrasting the volume of audio before AGC is applied with the volume after AGC is applied.

Explanation of symbols

10 Audio recording device
11, 12 Audio reproduction device
110 Sound collection unit
120 Audio determination unit
124 Volume detector
134 Sound quality detector
136 Distance and direction estimator
138 Operator voice estimator
140 Sound source separation unit
150 Audio mixing unit
160 Recording unit
170, 172 Storage unit
174 Reproduction unit
180 Audio output unit
190 Volume correction unit

Claims

1. A sound processing apparatus comprising:
a sound determination unit that determines whether or not a first sound emitted from a specific sound source is included in an input sound;
a sound separation unit that, when the sound determination unit determines that the first sound is included in the input sound, separates the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source; and
a sound mixing unit that mixes the first sound and the second sound separated by the sound separation unit at an arbitrary volume ratio.

2. The sound processing apparatus according to claim 1, wherein the specific sound source is located within a set distance from a recording position of the input sound.

3. The sound processing apparatus according to claim 2, wherein the first sound includes a sound attributable to an operator of a device used to collect the input sound, and
the second sound includes a sound emitted from a sound collection target.

4. The sound processing apparatus according to claim 3, wherein the sound determination unit determines whether or not the first sound is included in the input sound based on at least one of a volume and a sound quality of the input sound.

5. The sound processing apparatus according to claim 4, further comprising an imaging unit that captures images,
wherein the sound determination unit includes a position information calculation unit that calculates position information of a sound source based on at least one of a volume or a phase of sound emitted from one or more sound sources included in the input sound, and the sound determination unit determines that the input sound includes the first sound emitted from the specific sound source when the position information calculation unit calculates that the position of the sound source is behind the imaging direction of the imaging unit and the input sound has a sound quality that matches or approximates a human voice.

6. The sound processing apparatus according to claim 4, wherein the sound determination unit determines that the input sound includes the first sound emitted from the specific sound source when the position of the sound source of the input sound is within a set distance from the sound collection position, the input sound includes an impulse sound, and the volume of the input sound is larger than a past average volume.

7. The sound processing apparatus according to claim 1, further comprising:
a plurality of sound collection units that collect the input sound; and
a recording unit that records the mixed sound produced by the sound mixing unit on a storage medium.

8. The sound processing apparatus according to claim 1, further comprising:
a storage medium storing the input sound; and
a reproduction unit that reproduces the input sound stored in the storage medium and outputs it to at least one of the position information calculation unit, the sound determination unit, and the sound separation unit.

9. The sound processing apparatus according to claim 1, further comprising
a volume correction unit that, when the volume of the input sound has been corrected, performs reverse correction on the volume of the second sound separated by the sound separation unit.

10. A sound processing apparatus comprising:
a sound separation unit that separates an input sound;
a sound determination unit that determines whether or not the sounds separated by the sound separation unit include a first sound emitted from a specific sound source; and
a sound mixing unit that mixes, at an arbitrary volume ratio, the first sound separated by the sound separation unit and a second sound emitted from a sound source other than the specific sound source.

11. A program for causing a computer to function as a sound processing apparatus comprising:
a sound determination unit that determines, based on position information of a sound source, whether or not a first sound emitted from a specific sound source is included in an input sound;
a sound separation unit that, when the sound determination unit determines that the first sound is included in the input sound, separates the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source; and
a sound mixing unit that mixes the first sound and the second sound separated by the sound separation unit at an arbitrary volume ratio.

12. The program according to claim 11, wherein the sound determination unit determines whether or not the first sound is included in the input sound based on at least one of a volume and a sound quality of the input sound.

13. The program according to claim 12, wherein the sound processing apparatus further includes an imaging unit that captures images,
the sound determination unit includes a position information calculation unit that calculates position information of a sound source based on at least one of a volume or a phase of sound emitted from one or more sound sources included in the input sound, and the sound determination unit determines that the input sound includes the first sound emitted from the specific sound source when the position information calculation unit calculates that the position of the sound source is behind the imaging direction of the imaging unit and the input sound has a sound quality that matches or approximates a human voice.

14. The program according to claim 12, wherein the sound determination unit determines that the input sound includes the first sound emitted from the specific sound source when the position of the sound source of the input sound is within a set distance from the sound collection position, the input sound includes an impulse sound, and the volume of the input sound is larger than a past average volume.

15. A sound processing method comprising:
determining, based on position information of a sound source, whether or not a first sound emitted from a specific sound source is included in an input sound;
separating the input sound into the first sound and a second sound emitted from a sound source other than the specific sound source when it is determined that the first sound is included in the input sound; and
mixing the separated first sound and second sound at an arbitrary volume ratio.