JP2010239542A

JP2010239542A - Voice processor

Info

Publication number: JP2010239542A
Application number: JP2009087197A
Authority: JP
Inventors: Naoki Nitta; 直樹仁田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2009-03-31
Filing date: 2009-03-31
Publication date: 2010-10-21

Abstract

PROBLEM TO BE SOLVED: To automatically apply voice correction optimized for the user.

SOLUTION: A voice processor includes an attribute estimation unit 10, which estimates attributes relating to the age, or the age and gender, of a person from an image of that person photographed by a photographing unit 50, and a voice correction unit 20, which performs auditory compensation processing on a voice signal according to the attributes estimated by the attribute estimation unit 10.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a voice processing device.

Conventionally, a voice correction device is known in which the user's age information is entered through an operation unit and the frequency characteristics and level of the voice are corrected to match the hearing loss typical of the user's age group (see Patent Document 1).

Patent Document 1: Japanese Patent No. 3236268

However, the conventional voice correction device described above requires users to enter their own age information through the operation unit. It therefore cannot automatically apply the optimal voice correction when the user changes.

The present invention has been made in view of the above, and its object is to provide a voice processing device capable of automatically applying the optimal voice correction for each user.

The present invention has been made to solve the above problem. The voice processing device according to the present invention comprises attribute estimation means for estimating, based on an image of a person photographed by photographing means, an attribute relating to the age, or the age and gender, of that person, and voice correction means for applying to a voice signal auditory compensation processing according to the attribute estimated by the attribute estimation means.

With this configuration, the person's age and gender are estimated from a captured image of that person and auditory compensation processing is performed accordingly, so voice correction suited to the person's attributes can be applied without the user having to enter age information or the like.

Further, in the above voice processing device, the auditory compensation processing includes processing that corrects the volume of the voice signal for each frequency according to the attribute estimated by the attribute estimation means, processing that shapes the formants of the voice signal according to that attribute, or processing that converts the speech rate of the voice signal according to that attribute.

With this configuration, the volume is changed for each frequency, so a voice that is easy for the user to hear can be produced. For example, raising the volume of high-pitched sounds makes them easier to hear for elderly listeners, who have difficulty hearing them. In addition, shaping the formants of the voice signal improves the sound quality, which also makes the voice easier for the user to hear. Finally, converting the speech rate brings the voice to a speed that the user can follow easily.

Further, the above voice processing device includes distance calculation means for calculating the distance to the person, and the voice correction means corrects the volume of the voice signal according to the distance calculated by the distance calculation means.

With this configuration, the volume is corrected with the distance to the person taken into account, so more appropriate voice correction can be applied depending on whether the user is near or far.

Further, in the above voice processing device, the voice correction means sets a smaller gain the larger the volume of the voice signal before processing, and corrects the volume of the voice signal according to that gain.

With this configuration, when the volume of the voice signal before processing is high, the volume is increased with only a small gain. This prevents the voice from being output at an excessive volume and keeps the corrected voice comfortable to listen to.

According to the present invention, the optimal voice correction can be applied automatically for each user.

FIG. 1 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment of the present invention.
FIG. 2 is an example of the correction amount data stored in the correction amount storage unit 204.
FIG. 3 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment of the present invention.
FIG. 4 is a diagram explaining how the correction amount G(f) is determined in the third embodiment of the present invention.
FIG. 5 is a diagram explaining the formant shaping process.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(First embodiment)

FIG. 1 is a functional block diagram showing the configuration of the voice processing device according to the first embodiment of the present invention. In FIG. 1, the voice processing device includes an attribute estimation unit 10 and a voice correction unit 20. The attribute estimation unit 10 comprises an age estimation unit 101 and a gender estimation unit 102. The voice correction unit 20 comprises a Fourier transform unit 201, a spectrum correction unit 202, a correction amount determination unit 203, a correction amount storage unit 204, and an inverse Fourier transform unit 205.

The voice processing device is installed in, for example, a voice guidance device in a public facility that gives users various kinds of spoken guidance in response to their operations, or in an interactive robot. These voice guidance devices and robots are provided with an imaging unit 50.

The imaging unit 50 photographs the face of the user in front of it and outputs the resulting image data of the user's face to the attribute estimation unit 10 of the voice processing device. Either still images or moving images may be captured. With still images, image data of the user's face can be obtained, for example, by capturing images automatically at a predetermined interval (for example, every second) and determining by image processing whether each image contains a person's face, or by capturing an image when the user actively operates a predetermined button (for example, the start button of the voice guidance device). With moving images, frames containing a person's face can be extracted from the video by the same kind of image processing.

The attribute estimation unit 10 estimates the user's age and gender attributes from the image data of the user's face input from the imaging unit 50. Specifically, based on this image data, the age estimation unit 101 estimates the user's age and the gender estimation unit 102 estimates the user's gender. Because human hearing differs with age and gender, the estimated age and gender can serve as an index of the hearing ability of the person in the image data.

The age estimation unit 101 refers to a predetermined database (not shown) and, based on the input image data of the user's face, estimates the age of the person in the image, or an age bracket covering a range of ages (for example, 60s or late 60s). Similarly, the gender estimation unit 102 refers to a predetermined database (not shown) and estimates the gender of the person in the image based on the input image data of the user's face. The functions of the age estimation unit 101 and the gender estimation unit 102 need not be clearly separated; age and gender may instead be estimated together by referring to the database based on the input image data of the user.

The predetermined database stores average face image data for each gender and each age (or age bracket): for example, average face image data D1 for men in their 60s, D2 for women in their 60s, D3 for men in their 70s, D4 for women in their 70s, and so on. The database is built in advance by, for example, averaging a large number of sample face images. The age estimation unit 101 and the gender estimation unit 102 compare the input image data of the user with the image data stored in the database, select the stored image data most similar to the input, and take the age (age bracket) and gender associated with the selected image data as the estimation result. For example, if image data D1 is selected, the estimation result is "age = 60s, gender = male".
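The matching step described above can be sketched as a nearest-template search. This is only an illustration under stated assumptions: the patent does not specify a similarity measure, so Euclidean distance over short feature vectors is assumed here, and `AVERAGE_FACES`, `euclidean`, and `estimate_attributes` are hypothetical names with made-up values standing in for the stored average face data D1 to D4.

```python
from math import sqrt

# Hypothetical stand-ins for the average face data D1..D4. Each entry maps an
# (age bracket, gender) label to a small feature vector; a real system would
# store image data or features extracted from it.
AVERAGE_FACES = {
    ("60s", "male"): [0.62, 0.40, 0.81],
    ("60s", "female"): [0.58, 0.35, 0.70],
    ("70s", "male"): [0.75, 0.52, 0.90],
    ("70s", "female"): [0.71, 0.47, 0.83],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_attributes(face, database=AVERAGE_FACES):
    """Return the (age bracket, gender) label of the most similar stored face."""
    return min(database, key=lambda label: euclidean(face, database[label]))
```

Calling `estimate_attributes` with a vector close to the stored entry for men in their 60s returns `("60s", "male")`, which corresponds to selecting image data D1 in the text.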

Instead of matching against average face image data for each gender and age as described above, the age estimation unit 101 and the gender estimation unit 102 may use another method. For example, facial feature parameters characteristic of age and gender, such as the number and density of wrinkles, may be quantified for each age and gender and stored in a database, and the feature parameter values extracted in the same way from the input image data may be compared with that database to determine age and gender. Alternatively, instead of the face, hair color and volume or the overall appearance (posture, etc.) may be determined from the image data and used to estimate age and gender.

The attribute estimation unit 10 outputs the resulting estimates of the user's age and gender to the voice correction unit 20. Because the user's age and gender attributes are estimated from image data in this way, the user is spared the effort of entering them.

The voice correction unit 20 applies auditory compensation processing suited to the user to the input voice signal, according to the estimates of the user's age and gender input from the attribute estimation unit 10. The input voice signal is, for example, the voice with which the voice guidance device (or interactive robot) equipped with this voice processing device is to guide the user, and is supplied to the voice correction unit 20 from a predetermined block of the voice guidance device (or interactive robot). The auditory compensation processing performed by the voice correction unit 20 corrects the volume of the input voice for each frequency, using correction amounts that depend on the user's age and gender (see FIG. 2, described later). As noted above, age and gender are indices of the user's hearing ability, so this auditory compensation processing can correct the input voice signal into the voice best suited to the user.

The correction amount storage unit 204 stores the correction amounts (gains) used to correct the volume of the input voice, for each user age (or age bracket) and each gender. The correction amounts are the values G(f1), G(f2), G(f3), ... that specify how much to increase the sound pressure level at the representative frequencies f1, f2, f3, ... of the input voice signal. FIG. 2 shows an example of the correction amount data stored in the correction amount storage unit 204. The correction amounts are in decibels.

According to FIG. 2, when the user is, for example, a man in his 60s, the volume of the input voice is corrected by increasing the sound pressure level by G(f1) = 5 dB at f1 = 125 Hz, G(f2) = 5 dB at f2 = 250 Hz, G(f3) = 6 dB at f3 = 500 Hz, G(f4) = 7 dB at f4 = 1000 Hz, G(f5) = 10 dB at f5 = 1500 Hz, G(f6) = 12 dB at f6 = 2000 Hz, G(f7) = 20 dB at f7 = 3000 Hz, G(f8) = 28 dB at f8 = 4000 Hz, G(f9) = 32 dB at f9 = 6000 Hz, and G(f10) = 39 dB at f10 = 8000 Hz.

The correction amount data in FIG. 2 is prepared in advance, for example, from hearing measurements of a large number of people. Specifically, hearing is measured for each age bracket and gender, and statistics are compiled on the minimum audible threshold (the lowest sound pressure level the subject can hear). The correction amount is set large at frequencies where the minimum audible threshold is high and small at frequencies where it is low. Because the minimum audible threshold changes with age and also differs by gender, correction amount data for each age bracket and gender can be obtained as in FIG. 2. The correction amounts in FIG. 2 grow larger with both age bracket and frequency. This is because elderly people hear high-pitched sounds less easily than low-pitched sounds, and their minimum audible threshold is higher the higher the pitch. For older age brackets, the auditory compensation processing therefore needs to raise the sound pressure level at high frequencies by more than at low frequencies.

The correction amount determination unit 203 obtains the estimates of the user's age and gender from the attribute estimation unit 10 and reads from the correction amount storage unit 204 the correction amount data G(f1), G(f2), G(f3), ... corresponding to that age and gender (that is, one column of FIG. 2). The correction amount determination unit 203 then interpolates the data to obtain the correction amount G(f) for any frequency f within the frequency band of the voice, and passes G(f) to the spectrum correction unit 202.
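As a rough sketch of the interpolation step, the following uses the representative frequencies and the correction amounts quoted in the text for a man in his 60s. The patent does not say which interpolation is used, so piecewise-linear interpolation in frequency, held constant outside the tabulated range, is an assumption made here; `correction_db` and the constant names are hypothetical.

```python
from bisect import bisect_right

# Representative frequencies and the gains quoted for a man in his 60s.
FREQS_HZ = [125, 250, 500, 1000, 1500, 2000, 3000, 4000, 6000, 8000]
GAINS_60S_MALE_DB = [5, 5, 6, 7, 10, 12, 20, 28, 32, 39]


def correction_db(f, freqs=FREQS_HZ, gains=GAINS_60S_MALE_DB):
    """Correction amount G(f) in dB at frequency f (Hz).

    Assumption: linear interpolation between neighbouring table entries,
    with the end values held outside the tabulated range.
    """
    if f <= freqs[0]:
        return float(gains[0])
    if f >= freqs[-1]:
        return float(gains[-1])
    i = bisect_right(freqs, f) - 1  # index of the table entry at or below f
    f0, f1 = freqs[i], freqs[i + 1]
    g0, g1 = gains[i], gains[i + 1]
    return g0 + (g1 - g0) * (f - f0) / (f1 - f0)
```

At a tabulated frequency the function returns the stored value, and halfway between 3000 Hz and 4000 Hz it returns the midpoint of 20 dB and 28 dB. Interpolating on a logarithmic frequency axis would be an equally plausible choice.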

Meanwhile, the voice signal to be subjected to auditory compensation processing is input to the Fourier transform unit 201. The Fourier transform unit 201 applies a Fourier transform to the input time-domain voice signal to obtain the frequency-domain voice signal S(f), and outputs S(f) to the spectrum correction unit 202.

The spectrum correction unit 202 corrects the spectrum of the frequency-domain voice signal S(f) input from the Fourier transform unit 201 according to the correction amount G(f) specified by the correction amount determination unit 203, and outputs the corrected voice signal S'(f) to the inverse Fourier transform unit 205. Spectrum correction is, for example, processing that increases the sound pressure level S(f) of the input voice signal at any frequency f by the correction amount G(f) (that is, applies a gain of G(f)). In this case the spectrum of the input voice signal changes from S(f) to S(f) + G(f), and the corrected voice signal is S'(f) = S(f) + G(f), where S, S', and G are all expressed logarithmically. By using the correction amount G(f) for each age and gender in this way, the volume (sound pressure level) of the input voice is corrected for each frequency according to the user's age and gender.
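The chain of Fourier transform, spectrum correction, and inverse Fourier transform can be illustrated with a small pure-Python sketch. Adding G(f) decibels to a sound pressure level is equivalent to multiplying the linear amplitude by 10^(G(f)/20). The function names, the naive O(n²) transform, and the single-frame processing are all assumptions made for illustration; a practical system would use an FFT with overlapping windowed frames.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def apply_gain_db(signal, sample_rate, gain_db):
    """Apply S'(f) = S(f) + G(f) in dB; gain_db maps frequency in Hz to dB."""
    n = len(signal)
    corrected = []
    for k, value in enumerate(dft(signal)):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        # A gain of G dB multiplies the linear amplitude by 10^(G/20).
        corrected.append(value * 10 ** (gain_db(freq) / 20))
    return idft(corrected)
```

Passing a function such as the interpolated correction amount for the estimated age and gender as `gain_db` raises each frequency component by its own number of decibels.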

The inverse Fourier transform unit 205 applies an inverse Fourier transform to the frequency-domain voice signal S'(f) whose spectrum has been corrected by the spectrum correction unit 202 to obtain a time-domain voice signal, and outputs it as the output voice signal of the voice correction unit 20.

As described above, with the voice processing device of this embodiment, when a user is in front of the imaging unit 50, the user's age and gender are estimated from the image data captured by the imaging unit 50, and the volume of the voice is corrected for each frequency according to the estimated age and gender. Auditory compensation processing suited to the user can therefore be applied to the voice automatically, without the user having to enter information such as age.

(Second embodiment)

FIG. 3 is a functional block diagram showing the configuration of the voice processing device according to the second embodiment of the present invention. In FIG. 3, the voice processing device additionally includes a distance calculation unit 30. This embodiment extends the first embodiment in that the correction amount determination unit 203 also takes the output of the distance calculation unit 30 into account when determining the correction amount.

The distance calculation unit 30 obtains image data containing the user's face from the imaging unit 50, calculates the distance between the imaging unit 50 and the user from the size of the face in the image, and outputs the resulting distance d to the correction amount determination unit 203. To obtain d, the angle φ that the user's face subtends at the imaging unit 50 is determined from the size of the face in the image, and d is then calculated from φ and an assumed actual size h of a human face using the relation d = h / tan φ ≈ h / φ. The distance calculation unit 30 may also calculate the distance to the user by other methods. For example, the distance can be calculated by emitting ultrasonic waves and measuring the time until the wave reflected from the user is received.
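The relation d = h / tan φ can be tried out with a short sketch. How φ is obtained from the image is not detailed in the text, so a simple pinhole camera model with a focal length expressed in pixels is assumed here; the function names and the default face height of 0.24 m are hypothetical.

```python
import math


def subtended_angle(face_height_px, focal_length_px):
    """Angle phi (radians) subtended by the face, assuming a pinhole camera."""
    return math.atan(face_height_px / focal_length_px)


def distance_to_user(face_height_px, focal_length_px, face_height_m=0.24):
    """Distance d = h / tan(phi); face_height_m is an assumed typical value."""
    phi = subtended_angle(face_height_px, focal_length_px)
    return face_height_m / math.tan(phi)
```

Under these assumptions a face that appears half as tall in the image is estimated to be twice as far away.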

The correction amount determination unit 203 changes the correction amount G(f) obtained as in the first embodiment according to the distance d calculated by the distance calculation unit 30, and passes the changed correction amount G'(f) to the spectrum correction unit 202. The amount by which G(f) is changed to give G'(f) may be a constant that does not depend on frequency. The spectrum correction unit 202 corrects the spectrum of the voice signal S(f) according to G'(f) in the same way as in the first embodiment.

The correction amount G'(f) is made larger when the user is far from the imaging unit 50 and smaller when the user is close to it. As a result, the voice is corrected to a higher volume when the user is far away and to a lower volume when the user is nearby.

For example, when sound emitted from a source reaches the user's ear, the sound intensity at the user's position is inversely proportional to the square of the distance between the source and the user, so the sound pressure level falls by about 6 dB each time the distance doubles. The correction amount G'(f) is therefore set to

$$G'(f) = G(f) + 6 \log_2 (d / d_0)$$

where d0 is a predetermined reference distance. In this case, G'(f) = G(f) + 6 when the distance between the user and the imaging unit 50 is d = 2·d0, G'(f) = G(f) + 12 when d = 4·d0, G'(f) = G(f) + 18 when d = 8·d0, G'(f) = G(f) − 6 when d = d0/2, and so on. The attenuation with distance is thus compensated, and the sound pressure level at the user's position can be kept constant regardless of the distance between the user and the imaging unit 50.
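The distance-dependent correction above translates directly into a one-line function. The sketch below is illustrative only; the function and parameter names are hypothetical, and distances may be in any unit as long as both use the same one.

```python
import math


def distance_adjusted_gain(gain_db, distance, reference_distance):
    """G'(f) = G(f) + 6 * log2(d / d0), all gains in dB."""
    return gain_db + 6.0 * math.log2(distance / reference_distance)
```

For a base correction of 10 dB, a user at twice the reference distance gets 16 dB and a user at half the reference distance gets 4 dB, matching the values listed in the text.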

In this way, the voice processing device of this embodiment can automatically correct the volume of the voice appropriately according to the distance to the user calculated by the distance calculation unit 30.

(Third embodiment)

In this embodiment, to prevent the volume of the output voice from becoming excessive, the correction amount G(f) is made smaller the larger the volume of the input voice. The method of determining the correction amount G(f) is described with reference to FIG. 4.

As in the first embodiment, the correction amount determination unit 203 first obtains from the correction amount storage unit 204 the correction amount data G(f1) corresponding to the user's age and gender. It also obtains from the Fourier transform unit 201 the sound pressure level S(f1) of the input voice signal at frequency f1. From G(f1) and S(f1), the correction amount determination unit 203 calculates the modified correction amount G'(f1) using the following equation:

$$
G'(f_1) =
\begin{cases}
G(f_1) & (0 \le S(f_1) < L_{\mathrm{min}}) \\
G(f_1) \cdot \alpha(S(f_1)) & (L_{\mathrm{min}} \le S(f_1) < L_{\mathrm{th}}) \\
L_{\mathrm{th}} - S(f_1) & (S(f_1) \ge L_{\mathrm{th}})
\end{cases}
$$

Here α(L) is a monotonically decreasing function of the input sound pressure level L that satisfies 0 ≦ α(L) ≦ 1, α(L_min) = 1, and α(L_th) = 0. L_th is the maximum sound pressure level at which the listener does not feel discomfort from the sound being too loud. L_min is an arbitrary value smaller than L_th, for example the minimum audible threshold (the lowest sound pressure level the subject can hear).
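The piecewise rule can be sketched as follows. The text only requires α to decrease monotonically from 1 at L_min to 0 at L_th, so the linear form used here is an assumption, as are the default levels of 20 dB and 100 dB and all function names.

```python
def alpha_linear(level_db, l_min, l_th):
    """Assumed form of alpha: falls linearly from 1 at l_min to 0 at l_th."""
    return (l_th - level_db) / (l_th - l_min)


def limited_gain(gain_db, level_db, l_min=20.0, l_th=100.0):
    """Modified correction amount G'(f1) for input level S(f1), in dB.

    l_min and l_th are hypothetical example values for L_min and L_th.
    """
    if level_db < l_min:
        return gain_db  # quiet input: full gain
    if level_db < l_th:
        return gain_db * alpha_linear(level_db, l_min, l_th)  # tapered gain
    return l_th - level_db  # loud input: pull the output down to l_th
```

With this linear α, the output level S(f1) + G'(f1) stays at or below L_th only when G(f1) does not exceed L_th − L_min, so a steeper α or an extra clamp would be needed for larger gains.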

When the modified correction amount G'(f1) is applied as a gain to the input voice, the sound pressure level at frequency f1 behaves as shown in FIG. 4(A): it increases with a gain G'(f1) that becomes smaller as the input sound pressure level S(f1) becomes larger, and it takes the constant value L_th once S(f1) exceeds the maximum sound pressure level L_th. The output sound pressure level therefore never exceeds L_th, and the listener is spared the discomfort of excessively loud sound.

The correction amount G(f1) in the first embodiment, by contrast, is a constant that does not depend on the volume of the input voice. When it is applied as a gain to the input voice, the output sound pressure level exceeds the maximum sound pressure level L_th for large input sound pressure levels, as shown in FIG. 4(B). In this respect the present (third) embodiment is superior to the first embodiment in terms of listening comfort.

The correction amount determination unit 203 then calculates the modified correction amounts G'(f2), G'(f3), ... for the other frequencies in the same way. The function α(L) may differ from frequency to frequency. The correction amount determination unit 203 interpolates the resulting data G'(f1), G'(f2), G'(f3), ... as in the first embodiment to obtain the modified correction amount G'(f) for any frequency f within the frequency band of the voice. The spectrum correction unit 202 then corrects the spectrum of the voice signal S(f) according to G'(f), as in the first embodiment.

Thus, the audio processing apparatus of this embodiment prevents the volume of the output sound from becoming excessive.

An embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to the one described above, and various design changes and the like can be made without departing from the gist of the invention.

For example, the spectrum correction unit 202 described above may be replaced with a formant shaping unit that shapes the formants of the audio signal S(f) input from the Fourier transform unit 201. Here, formant shaping is, as shown in FIG. 5, a process that increases the sound pressure level at the frequencies (F1, F2, F3, ...) of the first, second, third, ... formants of the input audio signal S(f), and decreases the sound pressure level at the frequencies where it dips between the first and second formants, between the second and third formants, and so on. Because this process emphasizes each formant, it makes the speech clearer. For example, formant shaping may be performed when the user is older and omitted when the user is younger, or the degree of formant shaping (the amount by which the sound pressure level is raised or lowered) may be made larger the older the user is. In this way, auditory compensation processing suited to the user can be applied to the audio.
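A minimal sketch of the formant shaping described above is given below. It assumes that the spectral envelope is available as a list of levels in dB and that the formant bins have already been located; the age threshold of 60, the slope of 0.25 dB per year, and the 6 dB ceiling are hypothetical values, as are the function names.

```python
def shaping_depth_db(age):
    """Hypothetical policy: no shaping below age 60, then deeper shaping with age."""
    if age < 60:
        return 0.0
    return min(6.0, (age - 60) * 0.25)


def shape_formants(levels_db, formant_bins, depth_db):
    """Raise the level at each formant bin and lower it at the dip between formants.

    levels_db    : spectral envelope, one level in dB per frequency bin
    formant_bins : ascending bin indices of F1, F2, F3, ...
    depth_db     : amount by which peaks are raised and dips are lowered
    """
    shaped = list(levels_db)
    for b in formant_bins:
        shaped[b] += depth_db
    for left, right in zip(formant_bins, formant_bins[1:]):
        if right - left < 2:
            continue  # no bin lies strictly between these two formants
        # The dip is taken as the lowest-level bin strictly between the formants.
        dip = min(range(left + 1, right), key=lambda i: levels_db[i])
        shaped[dip] -= depth_db
    return shaped
```

This sketch changes only the peak and dip bins themselves; a practical implementation would more likely spread the change smoothly over neighbouring bins.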

Alternatively, the spectrum correction unit 202 may be replaced with a speech rate conversion unit that converts the speech rate of the input speech. For example, the speech rate after conversion is changed, or speech rate conversion is switched on and off, according to the user's age. This also allows auditory compensation processing suited to the user to be applied to the audio.
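The age-dependent switching of speech rate conversion can be sketched as follows. The text does not name a conversion algorithm, so the plain overlap-add time stretch below is only a stand-in (practical systems typically use pitch-synchronous or waveform-similarity methods to avoid artifacts), and the age threshold, rate values, and function names are hypothetical.

```python
import math


def speech_rate_for_age(age):
    """Hypothetical policy: conversion is off below age 60, then speech is slowed."""
    if age < 60:
        return 1.0                      # 1.0 means conversion is disabled
    return max(0.7, 1.0 - 0.01 * (age - 60))


def time_stretch_ola(samples, rate, frame=256):
    """Naive overlap-add time stretch; rate < 1 lengthens (slows down) the signal."""
    if rate == 1.0 or len(samples) < frame:
        return list(samples)
    hop_out = frame // 2                # synthesis hop (50 % overlap)
    hop_in = hop_out * rate             # analysis hop, may be fractional
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    n_frames = int((len(samples) - frame) / hop_in) + 1
    out = [0.0] * ((n_frames - 1) * hop_out + frame)
    norm = [0.0] * len(out)
    for k in range(n_frames):
        src = int(k * hop_in)           # where the frame is read from the input
        dst = k * hop_out               # where the frame is written in the output
        for n in range(frame):
            out[dst + n] += samples[src + n] * window[n]
            norm[dst + n] += window[n]
    # Divide by the accumulated window weight so that overlaps do not change the level.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]
```

A caller would obtain `rate = speech_rate_for_age(age)` from the estimated age and pass it to `time_stretch_ola`, so that a rate of 1.0 corresponds to leaving the speech unconverted.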

A filter bank may also be used in place of the Fourier transform unit 201. The filter bank generates an audio signal for each predetermined frequency band from the input audio signal. In this case, the spectrum correction unit 202 corrects the spectrum of the audio signal in each frequency band by the correction amount specified by the correction amount determination unit 203.
A cosine transform or a wavelet transform may also be used in place of the Fourier transform.
The audio correction unit 20 may also perform the auditory compensation processing according only to the user's age estimated by the age estimation unit 101.
The second embodiment and the third embodiment may also be combined.
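The filter bank variant mentioned above can be sketched with a deliberately simple two-band split. The moving-average low-pass filter, the complementary high band, the number of bands, and the function names are all assumptions made for illustration; an actual filter bank would normally use more bands and better filters.

```python
def split_two_bands(samples, taps=8):
    """Split a signal into a low band (moving average) and the complementary high band.

    Because high = input - low, the two bands sum back to the input exactly.
    """
    low = []
    for n in range(len(samples)):
        start = max(0, n - taps + 1)
        window = samples[start:n + 1]
        low.append(sum(window) / len(window))
    high = [x - l for x, l in zip(samples, low)]
    return low, high


def correct_bands(samples, gains_db, taps=8):
    """Apply one correction amount (in dB) per band and sum the bands again."""
    bands = split_two_bands(samples, taps)
    out = [0.0] * len(samples)
    for band, gain_db in zip(bands, gains_db):
        scale = 10.0 ** (gain_db / 20.0)   # dB to linear amplitude factor
        for n, value in enumerate(band):
            out[n] += scale * value
    return out
```

Here `gains_db` plays the role of the correction amounts specified per frequency band; with 0 dB in every band the output reproduces the input.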

DESCRIPTION OF SYMBOLS: 10 ... attribute estimation unit; 101 ... age estimation unit; 102 ... gender estimation unit; 20 ... audio correction unit; 201 ... Fourier transform unit; 202 ... spectrum correction unit; 203 ... correction amount determination unit; 204 ... correction amount storage unit; 205 ... inverse Fourier transform unit; 30 ... distance calculation unit; 50 ... photographing unit

Claims

1. An audio processing apparatus comprising:
attribute estimation means for estimating an attribute relating to the age, or to the age and gender, of a person based on an image of the person photographed by photographing means; and
audio correction means for performing auditory compensation processing on an audio signal according to the attribute estimated by the attribute estimation means.

2. The audio processing apparatus according to claim 1, wherein the auditory compensation processing includes a process of correcting the volume of the audio signal for each frequency according to the attribute estimated by the attribute estimation means, a process of shaping the formants of the audio signal according to the attribute estimated by the attribute estimation means, or a process of converting the speech rate of the audio signal according to the attribute estimated by the attribute estimation means.

3. The audio processing apparatus according to claim 1, further comprising distance calculation means for calculating a distance to the person,
wherein the audio correction means performs a process of correcting the volume of the audio signal according to the distance calculated by the distance calculation means.

4. The audio processing apparatus according to any one of claims 1 to 3, wherein the audio correction means performs a process of setting a smaller gain the larger the volume of the audio signal before processing is, and correcting the volume of the audio signal according to that gain.