JP2017187686A

JP2017187686A - Voice selection device and program

Info

Publication number: JP2017187686A
Application number: JP2016077455A
Authority: JP
Inventors: Nobumasa Seiyama; Reiko Saito; Kazuho Onoe; Atsushi Imai; Toru Tsugi
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2016-04-07
Filing date: 2016-04-07
Publication date: 2017-10-12
Anticipated expiration: 2036-04-07
Also published as: JP6671221B2

Abstract

PROBLEM TO BE SOLVED: To select, from a plurality of complementary voices, a complementary voice that is easy to distinguish from a program voice when the complementary voice is added to the program voice for presentation. SOLUTION: A feature amount calculation unit 11-n of a voice selection device 1 reads voice waveform data of a program voice from a program voice DB 10-n and calculates an acoustic feature amount. A feature amount calculation unit 21-m reads voice waveform data of a complementary voice from a complementary voice DB 20-m and calculates an acoustic feature amount. A similarity calculation unit 22-m calculates the similarities between each of the 1st to Nth program voices and the mth complementary voice. A similarity addition unit 23-m calculates the sum of the similarities between the mth complementary voice and the program voices of the 1st to Nth program speakers. A selection unit 24 identifies the smallest of the similarity sums obtained for the 1st to Mth complementary voices and selects the complementary voice DB 20 corresponding to that smallest sum. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice selection device and a program that select, from a plurality of commentary voices, the commentary voice to be added to a program voice when generating a program voice with commentary.

Conventionally, in producing commentary broadcast programs for television, a commentary manuscript describing scenes or caption content for visually impaired viewers is prepared separately from the program's script or screenplay. A narrator reads the commentary manuscript aloud as a commentary voice during sections containing only silence or background sound (pause sections), so that it does not overlap sections containing spoken sound such as dialogue or narration (program voice).

When the commentary voice is recorded, the start timing and speaking rate of the utterance must be adjusted, which requires considerable time and cost, including rehearsals. To address this, a technique for producing the audio of a commentary broadcast program quickly and at low cost has been disclosed (see, for example, Patent Document 1).

In this technique, the program voice and text related to the program content are input, and a commentary voice is generated from the text by speech synthesis. Pause sections are then detected in the program voice, the speaking rate of the commentary voice is converted to fit the length of each pause section, and the rate-converted commentary voice is added to the pause section.

Japanese Patent No. 4594908

However, the technique of Patent Document 1 sometimes fails to detect pause sections in the program voice correctly. In such cases the commentary voice cannot be inserted at an appropriate timing and speaking rate, and as a result an appropriate commentary voice cannot be provided.

One conceivable way to solve this problem is to generate a program voice with commentary in which the program voice and the commentary voice overlap. However, when the program voice and the commentary voice are similar, it is difficult to pick out the information in the commentary voice from the generated program voice with commentary.

Thus, when a commentary voice that supplements the information in the program voice of a television broadcast (hereinafter, complementary voice) is added to that program voice to generate a program voice with commentary, a distinguishable complementary voice cannot always be provided appropriately. No method for solving this problem has been proposed.

The present invention has been made to solve the above problem. Its object is to provide a voice selection device and a program capable of selecting, from a plurality of complementary voices, a complementary voice that is easy to distinguish from the program voice when the complementary voice is added to the program voice for presentation.

To solve the above problem, the voice selection device of claim 1 is a voice selection device that selects, from a plurality of complementary voices, the complementary voice to be added to a program voice for presentation, the device comprising: a program voice DB (database) storing a predetermined number, one or more, of program voice data items; a complementary voice DB storing a predetermined number, two or more, of complementary voice data items; a feature amount calculation unit that calculates an acoustic feature amount for each of the program voice data items stored in the program voice DB and for each of the complementary voice data items stored in the complementary voice DB; a similarity calculation unit that calculates similarities between the acoustic feature amount of each program voice data item and the acoustic feature amount of each complementary voice data item calculated by the feature amount calculation unit; a similarity addition unit that, for each complementary voice data item, adds up the similarities calculated by the similarity calculation unit between the acoustic feature amount of each program voice data item and the acoustic feature amount of that complementary voice data item to obtain a sum; and a selection unit that identifies the smallest of the sums obtained by the similarity addition unit for the complementary voice data items and selects, from the predetermined number of complementary voice data items, the complementary voice data item corresponding to that smallest sum.

In the voice selection device of claim 2, which is the voice selection device according to claim 1, the feature amount calculation unit, for each of the program voice data and the complementary voice data: cuts out voice data in frames of a predetermined length; obtains a frequency characteristic for each frame of voice data; obtains, based on the frequency characteristic, a spectral feature amount comprising static coefficients consisting of mel-frequency cepstral coefficients and log energy, together with the first-order and second-order regression coefficients of the static coefficients; calculates, from the spectral feature amount using the EM algorithm, GMM parameters consisting of as many mixture weights and Gaussian distributions as the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, an i-vector serving as the acoustic feature amount.

In the voice selection device of claim 3, which is the voice selection device according to claim 1, the feature amount calculation unit, for each of the program voice data and the complementary voice data: cuts out voice data in frames of a predetermined length; sets fundamental period candidates for each frame of voice data; determines the degree of periodicity of the fundamental period candidates and extracts the fundamental period from among them; obtains, based on the fundamental period, a pitch feature amount comprising the log fundamental frequency together with its first-order and second-order regression coefficients; calculates, from the pitch feature amount using the EM algorithm, GMM parameters consisting of as many mixture weights and Gaussian distributions as the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, an i-vector serving as the acoustic feature amount.

In the voice selection device of claim 4, which is the voice selection device according to claim 1, the feature amount calculation unit, for each of the program voice data and the complementary voice data: cuts out voice data in frames of a predetermined length; obtains a frequency characteristic for each frame of voice data; obtains, based on the frequency characteristic, a spectral feature amount comprising static coefficients consisting of mel-frequency cepstral coefficients and log energy, together with the first-order and second-order regression coefficients of the static coefficients; calculates, from the spectral feature amount using the EM algorithm, GMM parameters consisting of as many mixture weights and Gaussian distributions as the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, a first i-vector serving as the acoustic feature amount. The feature amount calculation unit further: sets fundamental period candidates for each frame of voice data; determines the degree of periodicity of the fundamental period candidates and extracts the fundamental period from among them; obtains, based on the fundamental period, a pitch feature amount comprising the log fundamental frequency together with its first-order and second-order regression coefficients; calculates, from the pitch feature amount using the EM algorithm, GMM parameters consisting of as many mixture weights and Gaussian distributions as the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, a second i-vector serving as the acoustic feature amount. The similarity calculation unit calculates similarities between the first i-vector of each of the predetermined number of program voice data items and the first i-vector of each of the predetermined number of complementary voice data items, and calculates similarities between the second i-vector of each of the predetermined number of program voice data items and the second i-vector of each of the predetermined number of complementary voice data items. The similarity addition unit, for each complementary voice data item: adds up the similarities between the first i-vector of each program voice data item and the first i-vector of that complementary voice data item to obtain a first addition result; adds up the similarities between the second i-vector of each program voice data item and the second i-vector of that complementary voice data item to obtain a second addition result; and obtains the sum as a weighted addition of the first addition result and the second addition result.

Furthermore, the voice selection program of claim 5 causes a computer to function as the voice selection device according to any one of claims 1 to 4.

As described above, the present invention makes it possible to select, from a plurality of complementary voices, a complementary voice that is easy to distinguish from the program voice when it is added to the program voice for presentation. Therefore, even when the selected complementary voice is added to the program voice and the two are presented at the same time, listeners can easily tell the program voice and the complementary voice apart.

A block diagram showing a configuration example of the voice selection device according to an embodiment of the present invention.
A flowchart showing a processing example of the feature amount calculation unit of Example 1.
A diagram explaining the GMM parameter λ calculated by the processing of step S213.
A diagram explaining the GMM supervector M calculated by the processing of step S214.
A flowchart showing a processing example of the feature amount calculation unit of Example 2.
A flowchart showing a processing example of section determination for voice frames, performed as preprocessing for the feature amount calculation unit of Example 2.
A diagram explaining an example of obtaining the fundamental period of silent and unvoiced sections from the fundamental periods of the preceding and following voiced sections.

Embodiments of the present invention are described in detail below with reference to the drawings. The present invention calculates acoustic feature amounts for one or more program voices and two or more complementary voices, calculates, for each of the two or more complementary voices, its similarity to the one or more program voices, and selects from the two or more complementary voices the one with the lowest similarity.

As a result, a complementary voice whose acoustic features are dissimilar to those of the program voice is selected. Therefore, even when the program voice and the complementary voice are presented simultaneously, listeners can easily tell them apart.

[Voice selection device]
First, a voice selection device according to an embodiment of the present invention is described. FIG. 1 is a block diagram showing a configuration example of the voice selection device according to the embodiment. The voice selection device 1 includes program voice DBs (databases) 10-1 to 10-N, feature amount calculation units 11-1 to 11-N, complementary voice DBs 20-1 to 20-M, feature amount calculation units 21-1 to 21-M, similarity calculation units 22-1 to 22-M, similarity addition units 23-1 to 23-M, and a selection unit 24.

N is an integer of 1 or more and corresponds to the number of speakers of the program voices (program voice speakers) stored in the program voice DBs 10-1 to 10-N. M is an integer of 2 or more and corresponds to the number of speakers of the complementary voices (complementary voice speakers) stored in the complementary voice DBs 20-1 to 20-M. In the following, n = 1, ..., N and m = 1, ..., M.

The program voice DB 10-n is a database storing voice waveform data of the program voice of one program voice speaker (program voice data). The voice waveform data of the program voice is assumed to be sampled at a sampling frequency of 16 kHz with 16-bit quantization.

The feature amount calculation unit 11-n reads the voice waveform data of the program voice of the nth program voice speaker from the corresponding program voice DB 10-n and calculates the acoustic feature amount of the program voice from that waveform data. It then outputs the acoustic feature amount of the program voice of the nth program voice speaker to the similarity calculation units 22-1 to 22-M.

The complementary voice DB 20-m is a database storing voice waveform data of the complementary voice of one complementary voice speaker (complementary voice data). Like the program voice, the voice waveform data of the complementary voice is assumed to be sampled at a sampling frequency of 16 kHz with 16-bit quantization. The voice waveform data of the complementary voice may be, for example, actual voice data recorded for addition to the program voice, voice data created by speech synthesis (that is, not actual recorded complementary voice data), or voice data contained in a voice database used for speech synthesis.

The feature amount calculation unit 21-m reads the voice waveform data of the complementary voice of the mth complementary voice speaker from the corresponding complementary voice DB 20-m and calculates the acoustic feature amount of the complementary voice from that waveform data. It then outputs the acoustic feature amount of the complementary voice of the mth complementary voice speaker to the corresponding similarity calculation unit 22-m.

The similarity calculation unit 22-m receives the acoustic feature amounts of the program voices of the 1st to Nth program voice speakers from the feature amount calculation units 11-1 to 11-N, and receives the acoustic feature amount of the complementary voice of the mth complementary voice speaker from the corresponding feature amount calculation unit 21-m.

The similarity calculation unit 22-m calculates the similarity between the acoustic feature amount of the program voice of the 1st program voice speaker and the acoustic feature amount of the complementary voice of the mth complementary voice speaker. In the same way, it calculates the similarity between each of the acoustic feature amounts of the program voices of the 2nd to Nth program voice speakers and the acoustic feature amount of the complementary voice of the mth complementary voice speaker. It then outputs the N similarities between the program voices of the 1st to Nth program voice speakers and the complementary voice of the mth complementary voice speaker to the corresponding similarity addition unit 23-m.

Here, let $w_{in}$ denote the acoustic feature amount of the program voice of the nth program voice speaker, let $w_{cm}$ denote the acoustic feature amount of the complementary voice of the mth complementary voice speaker, and let the similarity be the cosine similarity $\cos(w_{in}, w_{cm})$. The cosine similarity between the program voice of the nth program voice speaker and the complementary voice of the mth complementary voice speaker is calculated by the following equation.

$$\cos(w_{in}, w_{cm}) = \frac{w_{in} \cdot w_{cm}}{\|w_{in}\| \, \|w_{cm}\|} \tag{1}$$

The numerator on the right-hand side of equation (1) is the inner product of $w_{in}$ and $w_{cm}$, and the denominator is the product of their norms.
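The cosine similarity described above can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function name `cosine_similarity` and the decision to raise an error for a zero vector are assumptions, and only the standard library is used so that the snippet runs as-is.

```python
import math
from typing import Sequence


def cosine_similarity(w_in: Sequence[float], w_cm: Sequence[float]) -> float:
    """Cosine similarity between a program-voice feature vector (w_in)
    and a complementary-voice feature vector (w_cm)."""
    if len(w_in) != len(w_cm):
        raise ValueError("vectors must have the same dimension")
    # Numerator: inner product of the two vectors.
    dot = sum(a * b for a, b in zip(w_in, w_cm))
    # Denominator: product of the two Euclidean norms.
    norm_in = math.sqrt(sum(a * a for a in w_in))
    norm_cm = math.sqrt(sum(b * b for b in w_cm))
    if norm_in == 0.0 or norm_cm == 0.0:
        # Assumption: the source does not say how zero vectors are handled.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_in * norm_cm)
```

Parallel vectors give a value of 1, orthogonal vectors give 0, and opposite vectors give -1, so a smaller value means the two voices are less alike in the feature space.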

The similarity addition unit 23-m receives, from the corresponding similarity calculation unit 22-m, the similarities between the program voices of the 1st to Nth program voice speakers and the complementary voice of the mth complementary voice speaker. It adds these similarities to obtain their sum, and outputs this sum (the similarity sum for the complementary voice of the mth complementary voice speaker) to the selection unit 24.

Here, letting $s_m$ denote the sum of the similarities $\cos(w_{in}, w_{cm})$ for the complementary voice of the mth complementary voice speaker, $s_m$ is calculated by the following equation.

$$s_m = \sum_{n=1}^{N} \cos(w_{in}, w_{cm})$$

The selection unit 24 receives the similarity sums from the similarity addition units 23-1 to 23-M and identifies the smallest of them. It then selects, from the complementary voice DBs 20-1 to 20-M (that is, from the M complementary voice speakers), the complementary voice DB 20 (complementary voice speaker) corresponding to the smallest similarity sum, and outputs selection information.

Here, let the complementary voice DB 20 (complementary voice speaker) corresponding to the smallest similarity sum $s_m$ be the complementary voice DB 20-c (complementary voice speaker c), and let the selection information be c (one of the values 1 to M). The selection information c is then given by the following equation.

$$c = \operatorname*{arg\,min}_{m \in \{1, \dots, M\}} s_m$$
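The summation and selection steps can be illustrated together as follows. This is a minimal sketch under stated assumptions: the names `cosine` and `select_complementary_voice` are hypothetical, the feature vectors are assumed to be already computed, and ties are broken in favor of the lowest index because the source does not specify a tie rule.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def select_complementary_voice(
    program_vectors: Sequence[Sequence[float]],
    complementary_vectors: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return (c, sums): c is the 1-based index of the complementary voice
    whose similarity sum over all program voices is smallest."""
    # s_m: sum over the N program voices of the similarity to voice m.
    sums = [
        sum(cosine(w_in, w_cm) for w_in in program_vectors)
        for w_cm in complementary_vectors
    ]
    # c: index of the minimum sum (first one wins on ties; an assumption).
    c = min(range(len(sums)), key=sums.__getitem__) + 1
    return c, sums


if __name__ == "__main__":
    program = [[1.0, 0.0], [0.9, 0.1]]                   # N = 2 program voices
    candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # M = 3 candidates
    c, sums = select_complementary_voice(program, candidates)
    print(c, [round(s, 3) for s in sums])
```

In the toy data, the third candidate points away from both program vectors, so its similarity sum is the smallest and it is the one selected.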

As described above, in the voice selection device 1 according to the embodiment, the selection unit 24 selects, from the complementary voice DBs 20-1 to 20-M (from the M complementary voice speakers), the complementary voice DB 20-c (complementary voice speaker c) whose acoustic features are least similar to the program voices. The selected complementary voice DB 20-c is used when the complementary voice is added to the program voice to generate a program voice with commentary. Consequently, even if adding the complementary voice results in the program voice and the complementary voice being presented at the same time, listeners can easily tell them apart.

The voice selection device 1 according to the embodiment is described concretely below using Examples 1 to 3. The feature amount calculation units 11-1 to 11-N and 21-1 to 21-M are referred to collectively as the feature amount calculation units 11 and 21.

In Examples 1 to 3, the feature amount calculation units 11 and 21 calculate the acoustic feature amount using the i-vector technique employed in speaker recognition and speaker verification. For details of i-vectors, see the following document.
[Non-Patent Document 1]
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-end factor analysis for speaker verification", IEEE Trans. Audio Speech Lang. Process., 19, 788-798 (2011)

[Example 1]
First, Example 1 is described. Example 1 selects a complementary voice that is easy to distinguish from the program voice in terms of voice quality. Specifically, Example 1 selects one complementary voice from a plurality of complementary voices based on an acoustic feature amount derived from a spectral feature amount comprising static coefficients, namely mel-frequency cepstral coefficients (MFCC) and log energy (E), together with their first-order and second-order regression coefficients.

As the acoustic feature amount, the feature amount calculation units 11 and 21 obtain a GMM supervector by concatenating, for the number of mixtures, the mean vectors of a Gaussian mixture model (GMM) based on the spectral feature amount, and calculate an i-vector from it. For the method of calculating the spectral feature amount, see the following document.
[Non-Patent Document 2]
The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department
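The concatenation step that produces the GMM supervector can be sketched as below. The sketch assumes the GMM mean vectors have already been estimated (the EM training and the i-vector extraction itself are not shown), and the function name `gmm_supervector` is hypothetical. With 12 MFCCs plus log energy and their first-order and second-order regression coefficients, each mean vector would have (12 + 1) × 3 = 39 dimensions, so C mixtures would give a supervector of 39C dimensions.

```python
from typing import List, Sequence


def gmm_supervector(mean_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the mean vectors of all mixture components, in order,
    into a single supervector."""
    if not mean_vectors:
        raise ValueError("at least one mean vector is required")
    dim = len(mean_vectors[0])
    if any(len(mean) != dim for mean in mean_vectors):
        raise ValueError("all mean vectors must have the same dimension")
    # Flatten component by component: [mu_1, mu_2, ..., mu_C].
    return [value for mean in mean_vectors for value in mean]
```

For example, four 39-dimensional mean vectors produce a 156-dimensional supervector.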

FIG. 2 is a flowchart showing a processing example of the feature amount calculation units 11 and 21 of Example 1. The feature amount calculation units 11 and 21 read the speaker's voice waveform data from the program voice DB 10 or the complementary voice DB 20 (step S201), and cut out frames of voice data (voice frames) from the voice waveform data with a window width of 25 ms and a shift width of 10 ms (step S202).
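The framing in step S202 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `frame_signal` and the 16 kHz sampling rate in the usage line are assumptions, since the text gives only the 25 ms window width and the 10 ms shift width.

```python
def frame_signal(samples, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Split a waveform into overlapping voice frames (cf. step S202).

    samples: sequence of float samples; sample_rate: samples per second.
    Trailing samples that do not fill a whole window are dropped (assumption).
    """
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(list(samples[start:start + win]))
        start += shift
    return frames


# One second of dummy audio at an assumed 16 kHz sampling rate:
# each frame then holds 400 samples and frames start every 160 samples.
frames = frame_signal([0.0] * 16000, 16000)
```

How the last partial frame is handled is not stated in the text; dropping it is the simplest choice, and zero-padding would be an equally plausible alternative.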

The feature amount calculation units 11 and 21 apply high-frequency emphasis (pre-emphasis) to each voice frame with a pre-emphasis coefficient of 0.97 (step S203). They then multiply the pre-emphasized voice frame by a Hamming window with a window width of 25 ms (step S204) and perform a 1024-point discrete Fourier transform (FFT) to obtain the frequency characteristic (step S205).
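Steps S203 and S204 can be sketched as below. The coefficient 0.97 and the Hamming window come from the text; the function names and the treatment of the first sample of each frame (passed through unchanged) are assumptions made for this illustration.

```python
import math


def pre_emphasize(frame, coeff=0.97):
    """First-order high-frequency emphasis: y[n] = x[n] - coeff * x[n-1].

    The first sample has no predecessor inside the frame and is kept as is
    (an assumption; other conventions exist).
    """
    if not frame:
        return []
    return [frame[0]] + [frame[n] - coeff * frame[n - 1]
                         for n in range(1, len(frame))]


def hamming_window(length):
    """Hamming window coefficients 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def window_frame(frame):
    """Apply pre-emphasis (S203), then the Hamming window (S204)."""
    emphasized = pre_emphasize(frame)
    window = hamming_window(len(frame))
    return [s * w for s, w in zip(emphasized, window)]
```

The windowed frame would then be zero-padded to 1024 points before the FFT of step S205, which is omitted here.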

The feature amount calculation units 11 and 21 apply a mel filter bank to the frequency characteristic to obtain 26-channel filter bank coefficients (step S206). They then apply a discrete cosine transform (DCT) to the filter bank coefficients to calculate 12-dimensional mel-frequency cepstral coefficients (MFCC) (step S207).
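The DCT of step S207 can be sketched as follows. The sketch assumes that the input is the logarithm of the 26 filter bank outputs and that the DCT follows the convention of the HTK Book (Non-Patent Document 2); the text itself does not spell out either point, and the function name is hypothetical.

```python
import math


def mfcc_from_log_fbank(log_fbank, num_ceps=12):
    """DCT of log filter bank outputs, keeping coefficients 1..num_ceps.

    c_i = sqrt(2/N) * sum_j m_j * cos(pi * i * (j - 0.5) / N), j = 1..N,
    written here with a 0-based j, hence (j + 0.5).
    """
    n = len(log_fbank)
    scale = math.sqrt(2.0 / n)
    return [
        scale * sum(m * math.cos(math.pi * i * (j + 0.5) / n)
                    for j, m in enumerate(log_fbank))
        for i in range(1, num_ceps + 1)
    ]


# 26 channels in, 12 cepstral coefficients out.
ceps = mfcc_from_log_fbank([1.0] * 26)
```

A flat filter bank output carries no spectral shape, so all 12 coefficients are zero up to rounding, which is a quick sanity check for the transform.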

Branching from step S202, the feature amount calculation units 11 and 21 also calculate the logarithmic energy (E) of each voice frame (step S208).

The feature amount calculation units 11 and 21 combine the 12-dimensional MFCC and the logarithmic energy (E) into 13-dimensional static coefficients (step S209). For these static coefficients, they then calculate the first-order differences ΔMFCC and ΔE, which are the first-order regression coefficients, and the second-order differences Δ²MFCC and Δ²E, which are the second-order regression coefficients (steps S210 and S211). The feature amount calculation units 11 and 21 set the MFCC, the logarithmic energy (E), the first-order differences ΔMFCC and ΔE, and the second-order differences Δ²MFCC and Δ²E as the spectral feature amount (step S212).

As a result, a spectral feature amount consisting of D_F (= 39) coefficients is obtained for each voice frame: 12 MFCC, 1 logarithmic energy (E), 12 first-order differences ΔMFCC, 1 first-order difference ΔE, 12 second-order differences Δ²MFCC, and 1 second-order difference Δ²E.
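Steps S208 to S212 can be sketched as follows. The regression formula and the window of two frames on each side follow the common HTK-style definition and are assumptions; the text states only that first-order and second-order regression coefficients are computed. All function names are hypothetical.

```python
import math


def log_energy(frame, floor=1e-10):
    """Logarithmic energy of one frame (S208); floor avoids log(0)."""
    return math.log(max(sum(s * s for s in frame), floor))


def regression_deltas(seq, width=2):
    """Regression coefficients over time for a list of per-frame vectors.

    d_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..width,
    with the first and last frames repeated at the edges (assumption).
    """
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    last = len(seq) - 1
    deltas = []
    for t in range(len(seq)):
        row = []
        for d in range(len(seq[t])):
            acc = 0.0
            for k in range(1, width + 1):
                ahead = seq[min(t + k, last)][d]
                behind = seq[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def spectral_features(mfcc_seq, energy_seq):
    """Stack 13 static, 13 first-order and 13 second-order coefficients."""
    static = [list(c) + [e] for c, e in zip(mfcc_seq, energy_seq)]  # S209
    d1 = regression_deltas(static)                                   # S210
    d2 = regression_deltas(d1)                                       # S211
    return [s + a + b for s, a, b in zip(static, d1, d2)]            # S212
```

With 12 MFCC per frame, each returned vector has 39 entries, ordered as MFCC, E, ΔMFCC, ΔE, Δ²MFCC, Δ²E.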

Using the EM (Expectation-Maximization) algorithm, the feature amount calculation units 11 and 21 calculate the GMM parameter λ for the speaker's entire voice waveform data from the spectral feature amounts of D_F (= 39) coefficients calculated for each voice frame (the coefficients of all voice frames) (step S213). For details of calculating the GMM parameter λ with the EM algorithm, see the following document.
[Non-Patent Document 3]
REFERENCE MANUAL for Speech Signal Processing Toolkit Ver. 3.9

The GMM parameter λ consists of C (= 512) mixture weights and C Gaussian distributions, where C is the number of mixture components, as shown in the following equation. The mixture weight is denoted W. Each Gaussian distribution is represented by a mean vector μ consisting of D_F mean values and a vector σ² consisting of D_F variance values.
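The equation referred to here is not reproduced in this text. From the definitions in this paragraph and the indexing used in the description of FIG. 3, it can presumably be written as follows; the exact notation is an assumption.

```latex
\lambda = \bigl\{\, W(c),\ \mu_c,\ \sigma_c^2 \,\bigr\}_{c=0}^{C-1}, \qquad
\mu_c = \bigl(\mu_c(0), \ldots, \mu_c(D_F-1)\bigr), \quad
\sigma_c^2 = \bigl(\sigma_c^2(0), \ldots, \sigma_c^2(D_F-1)\bigr)
```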

FIG. 3 is a diagram explaining the GMM parameter λ calculated in step S213. As described above, in step S213 the GMM parameter λ is calculated with the EM algorithm from the spectral feature amounts of D_F (= 39) coefficients for each voice frame (the coefficients of all voice frames).

As shown in FIG. 3, for the 0th of the C mixture components, the GMM parameter λ consists of the mixture weight W(0) and a Gaussian distribution. This Gaussian distribution is represented by the mean vector μ_0(0), ..., μ_0(D_F−1) consisting of D_F mean values and the vector σ_0²(0), ..., σ_0²(D_F−1) consisting of D_F variance values.

Similarly, for the (C−1)th of the C mixture components, the GMM parameter λ consists of the mixture weight W(C−1) and a Gaussian distribution. This Gaussian distribution is represented by the mean vector μ_{C-1}(0), ..., μ_{C-1}(D_F−1) consisting of D_F mean values and the vector σ_{C-1}²(0), ..., σ_{C-1}²(D_F−1) consisting of D_F variance values.

Returning to FIG. 2, after step S213 the feature amount calculation units 11 and 21 obtain the GMM supervector M from the GMM parameter λ (step S214). Specifically, from the GMM parameter λ, which consists of C mixture weights and C Gaussian distributions (each with a mean vector μ of D_F mean values and a vector σ² of D_F variance values), they extract only the mean vectors μ. They then concatenate the C mean vectors μ, each consisting of D_F mean values, to obtain the GMM supervector M. The GMM supervector M is a real vector of C·D_F dimensions, that is, $M \in \mathbb{R}^{C \cdot D_F}$.

FIG. 4 is a diagram explaining the GMM supervector M calculated in step S214. As shown in FIG. 4, the GMM supervector M is composed of the mean vector μ_0(0), ..., μ_0(D_F−1) of D_F mean values for the 0th mixture component, and so on, through the mean vector μ_{C-1}(0), ..., μ_{C-1}(D_F−1) of D_F mean values for the (C−1)th mixture component.
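The concatenation in step S214 can be sketched as follows. The function name and the toy values are hypothetical; in the text C = 512 and D_F = 39 for the spectral feature amount.

```python
def gmm_supervector(means):
    """Concatenate C mean vectors (each of length D_F) into one vector.

    The mixture weights and variances of the GMM are deliberately ignored,
    as in step S214.
    """
    supervector = []
    for mean in means:
        supervector.extend(mean)
    return supervector


# Toy GMM with C = 2 components and D_F = 3 coefficients per mean vector.
means = [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]]
M = gmm_supervector(means)  # C * D_F = 6 entries
```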

Returning to FIG. 2, after step S214 the feature amount calculation units 11 and 21 use the method described in Non-Patent Document 1 above to calculate, based on the GMM supervector M, the i-vector w, which is the acoustic feature amount satisfying the following equation (step S215).
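The equation itself is not reproduced in this text. In the total-variability model of Non-Patent Document 1, the relation between the GMM supervector M and the i-vector w is usually written as below, where m and T are the quantities defined in the paragraphs that follow. That the omitted equation has exactly this form is an assumption.

```latex
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```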

The i-vector w is a real vector of D_T dimensions, that is, $w \in \mathbb{R}^{D_T}$.

Here, m is a GMM supervector trained on a large amount of voice data from unspecified speakers, and T is a low-rank rectangular matrix (D_T << C·D_F). The rectangular matrix T is a real matrix of C·D_F × D_T dimensions, that is, $T \in \mathbb{R}^{C \cdot D_F \times D_T}$.

w follows the Gaussian distribution N(w; 0, I), whose mean vector is 0 and whose covariance matrix is the identity matrix I. The mean vector 0 is a real vector of D_T dimensions, that is, $0 \in \mathbb{R}^{D_T}$.

The covariance matrix I is a real matrix of D_T × D_T dimensions, that is, $I \in \mathbb{R}^{D_T \times D_T}$.

Note that the feature amount calculation units 11 and 21 compensate the calculated i-vector w for acoustic variation within the same speaker by processing such as LDA (Linear Discriminant Analysis) or WCCN (Within-Class Covariance Normalization). The same applies to Examples 2 and 3 described later.

The processes of the similarity calculation units 22-1 to 22-M, the similarity addition units 23-1 to 23-M, and the selection unit 24 are the same as those described for FIG. 1.

As described above, for the voice waveform data read from the program voice DB 10 or the complementary voice DB 20, the feature amount calculation units 11 and 21 of Example 1 obtain the GMM supervector M by concatenating the C mean vectors μ of the Gaussian mixture model (GMM) of the spectral feature amount. Based on the GMM supervector M, they then calculate the i-vector, which is the acoustic feature amount using the spectral feature amount.

Based on the i-vectors calculated by the feature amount calculation units 11 and 21, the downstream selection unit 24 selects, from the complementary voice DBs 20-1 to 20-M (that is, from the M complementary voice speakers), the complementary voice DB 20-c (complementary voice speaker c) whose acoustic features are least similar to the program voices.
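The selection can be sketched as follows. The rule of choosing the complementary voice with the smallest sum of similarities to the N program voices is taken from the description of FIG. 1. The similarity measure is defined by equations (1) and (2), which are not shown in this section, so cosine similarity is used here purely as a stand-in assumption, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_complementary_voice(program_ivectors, complementary_ivectors):
    """Return (index, totals): index of the least similar complementary voice.

    totals[m] is the sum over all program voices of the similarity between
    that program voice and the m-th complementary voice (0-based index).
    """
    totals = [
        sum(cosine_similarity(p, c) for p in program_ivectors)
        for c in complementary_ivectors
    ]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Toy 2-dimensional i-vectors: two program speakers, three candidates.
program = [[1.0, 0.0], [0.9, 0.1]]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
best, totals = select_complementary_voice(program, candidates)
```

In this toy case the third candidate points away from both program speakers, so it has the smallest total and is selected.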

Here, the complementary voice DB 20-c (complementary voice speaker c) is selected using as an index the acoustic feature amount calculated from the spectral feature amount, and the spectral feature amount reflects the frequency components of the voice. Voice quality, in turn, is determined by the frequency components of the voice.

Therefore, even when the program voice and the complementary voice are presented simultaneously as a result of adding the complementary voice to the program voice, the listener can easily distinguish between them. In this way, a complementary voice whose speaker's voice quality is easy to tell apart is obtained.

[Example 2]
Next, Example 2 will be described. Example 2 selects a complementary voice that is easy to distinguish from the program voice in terms of voice pitch. Specifically, Example 2 selects one complementary voice from a plurality of complementary voices based on an acoustic feature amount that uses a pitch feature amount, which consists of the logarithmic fundamental frequency (LF0) and its first-order and second-order regression coefficients.

As the acoustic feature amount, the feature amount calculation units 11 and 21 obtain a GMM supervector by concatenating the mean vectors of a Gaussian mixture model (GMM) of the pitch feature amount, one per mixture component, and calculate an i-vector from it. For the method of calculating the pitch feature amount, see the following documents.
[Non-Patent Document 4]
Tsugi, Seiyama, Miyasaka, “Speech fundamental period extraction method using autocorrelation functions obtained from multiple window widths”, IEICE Transactions A, Vol. J80-A, No. 9, pp. 1341-1350, September 1997
[Non-Patent Document 5]
Seiyama, Imai, Mishima, Tsugi, Miyasaka, “Development of a high-quality real-time speech rate conversion system”, IEICE Transactions D-II, Vol. J84-D-II, No. 6, pp. 918-926, June 2001

FIG. 5 is a flowchart showing a processing example of the feature amount calculation units 11 and 21 of Example 2. The feature amount calculation units 11 and 21 read voice waveform data from the program voice DB 10 or the complementary voice DB 20 (step S501). They then low-pass filter the voice waveform data at a cutoff frequency of 1 kHz and decimate it by a factor of 4 (step S502). From the low-pass-filtered and decimated voice waveform data, they cut out frames of voice data (voice frames) with a predetermined window width (step S503).
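Step S502 can be sketched as follows. The text fixes only the 1 kHz cutoff and the decimation factor of 4; the filter type (a Hamming-windowed sinc FIR filter), its length of 101 taps, and the function names are assumptions chosen for this illustration.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass filter, normalized to unity DC gain."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def lowpass_and_decimate(samples, sample_rate, cutoff_hz=1000.0, factor=4):
    """Filter at cutoff_hz, then keep every factor-th sample (S502).

    Only the output samples that survive decimation are computed.
    Samples outside the input are treated as zero.
    """
    taps = lowpass_fir(cutoff_hz, sample_rate)
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(samples):
                acc += tap * samples[idx]
        out.append(acc)
    return out, sample_rate / factor
```

With an assumed 16 kHz input, the output runs at 4 kHz, whose 2 kHz Nyquist frequency lies safely above the 1 kHz cutoff.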

For each cut-out voice frame, the feature amount calculation units 11 and 21 calculate the autocorrelation function and find a plurality of local maxima, each within a specified range. They then interpolate the neighborhood of these local maxima by a factor of 4 and set the position of the largest local maximum as the position of a fundamental period candidate (step S504).

The feature amount calculation units 11 and 21 divide the value of the autocorrelation function at the position of the fundamental period candidate by the value of the zeroth-order autocorrelation function to obtain a value indicating the degree of periodicity (step S505). They then weight these values, add the weighted values, and, using the result of the addition as an index, select the most suitable fundamental period candidate as the fundamental period (step S506).
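A simplified sketch of steps S504 and S505 for a single frame is given below. It omits the 4x interpolation around the local maxima and the weighting over several candidates of step S506, which are detailed in Non-Patent Document 4; the function names and the lag range in the usage lines are assumptions.

```python
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (max_lag < len(frame))."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def period_candidate(frame, min_lag, max_lag):
    """Return (lag, periodicity) for the largest local maximum.

    periodicity is r[lag] / r[0], the degree of periodicity of S505.
    Returns (None, 0.0) when the frame is empty of energy or has no peak.
    """
    r = autocorrelation(frame, max_lag)
    if r[0] <= 0.0:
        return None, 0.0
    best_lag, best_val = None, float("-inf")
    for lag in range(max(min_lag, 1), max_lag):
        is_peak = r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]
        if is_peak and r[lag] > best_val:
            best_lag, best_val = lag, r[lag]
    if best_lag is None:
        return None, 0.0
    return best_lag, best_val / r[0]


# A sinusoid with a period of 40 samples should yield a candidate at lag 40.
frame = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
lag, periodicity = period_candidate(frame, 20, 100)
```

The lag is measured in samples of the decimated signal, so dividing it by the decimated sampling rate gives the fundamental period in seconds.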

Here, when a voice frame belongs to a voiced section, the feature amount calculation units 11 and 21 may obtain the fundamental period of that voice frame and perform the following processing using only such fundamental periods. Furthermore, when a voice frame belongs to an unvoiced section or a silent section, the feature amount calculation units 11 and 21 may obtain its fundamental period by interpolating the fundamental periods of the voice frames in the preceding and following voiced sections, and use this fundamental period as well in the following processing. Details will be described later.

The feature amount calculation units 11 and 21 take the reciprocal of the fundamental period as the fundamental frequency (F0) and calculate the logarithmic fundamental frequency (LF0) as its natural logarithm (step S507). For the one-dimensional LF0, they calculate the first-order difference ΔLF0, which is the first-order regression coefficient, and the second-order difference Δ²LF0, which is the second-order regression coefficient (steps S508 and S509). The feature amount calculation units 11 and 21 set LF0, the first-order difference ΔLF0, and the second-order difference Δ²LF0 as the pitch feature amount (step S510).
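Step S507 amounts to one line of arithmetic, sketched below. The function name is hypothetical, and expressing the fundamental period as a lag in samples at a known sampling rate is an assumption about how the period is stored.

```python
import math


def lf0_from_lag(lag_samples, sample_rate):
    """LF0 = ln(F0), where F0 = 1 / period = sample_rate / lag_samples."""
    f0_hz = sample_rate / lag_samples
    return math.log(f0_hz)


# A period of 40 samples at 4 kHz is 10 ms, i.e. F0 = 100 Hz.
lf0 = lf0_from_lag(40, 4000.0)
```

The first-order and second-order differences of steps S508 and S509 are then computed over the frame sequence of LF0 values in the same way as for any other per-frame coefficient.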

As a result, a pitch feature amount consisting of D_F (= 3) coefficients is obtained for each voice frame: 1 logarithmic fundamental frequency (LF0), 1 first-order difference ΔLF0, and 1 second-order difference Δ²LF0.

Using the EM algorithm, the feature amount calculation units 11 and 21 calculate the GMM parameter λ for the speaker's entire voice waveform data from the pitch feature amounts of D_F (= 3) coefficients calculated for each voice frame (the coefficients of all voice frames) (step S511). They then obtain the GMM supervector M from the GMM parameter λ (step S512).

Based on the GMM supervector M, the feature amount calculation units 11 and 21 use the method described in Non-Patent Document 1 above to calculate the i-vector w, which is the acoustic feature amount (step S513).

As described above, for the voice waveform data read from the program voice DB 10 or the complementary voice DB 20, the feature amount calculation units 11 and 21 of Example 2 obtain the GMM supervector M by concatenating the C mean vectors μ of the Gaussian mixture model (GMM) of the pitch feature amount. Based on the GMM supervector M, they then calculate the i-vector, which is the acoustic feature amount using the pitch feature amount.

Here, the complementary voice DB 20-c (complementary voice speaker c) is selected using as an index the acoustic feature amount calculated from the pitch feature amount, and the pitch feature amount is a numerical value representing the pitch of the voice.

Therefore, even when the program voice and the complementary voice are presented simultaneously as a result of adding the complementary voice to the program voice, the listener can easily distinguish between them. In this way, a complementary voice whose speaker's voice pitch is easy to tell apart is obtained.

As shown in FIG. 5, the feature amount calculation units 11 and 21 obtain the fundamental period of each voice frame, use it to calculate the logarithmic fundamental frequency (LF0) and related values, and calculate the i-vector w, which is the acoustic feature amount. In doing so, the feature amount calculation units 11 and 21 may calculate the i-vector w using only the fundamental periods of the voice frames in voiced sections. Alternatively, they may obtain the fundamental periods of unvoiced and silent sections by interpolating the fundamental periods of the voice frames in the preceding and following voiced sections, and then calculate the i-vector w using both the fundamental periods of the voice frames in voiced sections and those of the unvoiced and silent sections.

FIG. 6 is a flowchart showing an example of the section determination processing for voice frames, performed as preprocessing for the processing shown in FIG. 5. As this preprocessing, the feature amount calculation units 11 and 21 determine whether each voice frame belongs to a voiced section, an unvoiced section, or a silent section.

The feature amount calculation units 11 and 21 read the speaker's voice waveform data from the program voice DB 10 or the complementary voice DB 20 (step S601) and apply high-frequency emphasis (pre-emphasis) to the voice waveform data (step S602). They then cut out frames of voice data (voice frames) with a predetermined window width from the pre-emphasized voice waveform data (step S603). Steps S604 to S612 below are performed for each voice frame.

The feature amount calculation units 11 and 21 calculate the power of the voice frame (step S604) and determine whether the power is greater than a preset threshold (step S605). If the power is greater than the threshold (step S605: Y), they treat the voice frame as belonging to a non-silent section and proceed to step S607.

If, on the other hand, the power of the voice frame is not greater than the threshold in step S605 (step S605: N), the feature amount calculation units 11 and 21 treat the voice frame as belonging to a silent section and set the section as a silent section (step S606).

Proceeding from step S605 when the power of the voice frame is greater than the threshold, the feature amount calculation units 11 and 21 calculate the number of zero crossings of the voice frame (step S607) and determine whether it is smaller than a preset threshold (step S608). If the number of zero crossings is smaller than the threshold (step S608: Y), they treat the voice frame as belonging to a non-fricative section and proceed to step S610.

If, on the other hand, the number of zero crossings of the voice frame is not smaller than the threshold in step S608 (step S608: N), the feature amount calculation units 11 and 21 treat the voice frame as belonging to a fricative section and set the section as an unvoiced section (step S609).

Proceeding from step S608 when the number of zero crossings is smaller than the threshold, the feature amount calculation units 11 and 21 calculate the autocorrelation function of the voice frame (step S610) and determine whether it is greater than a preset threshold (step S611). If the autocorrelation function is greater than the threshold (step S611: Y), they treat the voice frame as belonging to a voiced section and set the section as a voiced section (step S612).

If, on the other hand, the autocorrelation function of the voice frame is not greater than the threshold in step S611 (step S611: N), the feature amount calculation units 11 and 21 treat the voice frame as belonging to an unvoiced section and set the section as an unvoiced section (step S609).

In this way, each voice frame is determined to belong to a voiced section, an unvoiced section, or a silent section. The feature amount calculation units 11 and 21 calculate the i-vector w, which is the acoustic feature amount, using only the fundamental periods of the voice frames in voiced sections. Alternatively, they may obtain the fundamental periods of unvoiced or silent sections based on the fundamental periods of the voice frames in the preceding and following voiced sections, and use these fundamental periods as well in calculating the i-vector w.
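The decision sequence of steps S604 to S612 can be sketched as follows. The order of the three tests follows FIG. 6 as described above. The threshold values are not given in the text and must be supplied by the caller, and the use of the normalized autocorrelation peak over a lag range as "the autocorrelation function" compared in step S611 is an assumption. All function names are hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of the frame (S604)."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples (S607)."""
    return sum(1 for a, b in zip(frame, frame[1:])
               if (a >= 0.0) != (b >= 0.0))


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Largest r[lag] / r[0] for min_lag <= lag <= max_lag (S610)."""
    r0 = sum(s * s for s in frame)
    if r0 <= 0.0:
        return 0.0
    n = len(frame)
    return max(
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(min_lag, max_lag + 1)
    ) / r0


def classify_frame(frame, power_thr, zc_thr, ac_thr, min_lag, max_lag):
    """Return 'silent', 'unvoiced' or 'voiced' for one voice frame."""
    if frame_power(frame) <= power_thr:
        return "silent"      # S605: N -> S606
    if zero_crossings(frame) >= zc_thr:
        return "unvoiced"    # S608: N (fricative) -> S609
    if normalized_autocorr_peak(frame, min_lag, max_lag) > ac_thr:
        return "voiced"      # S611: Y -> S612
    return "unvoiced"        # S611: N -> S609
```

A call such as `classify_frame(frame, 0.01, 50, 0.5, 20, 100)` uses purely illustrative thresholds; in practice they would be tuned to the recording level and the frame length.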

FIG. 7 is a diagram explaining an example of obtaining the fundamental periods of silent and unvoiced sections from the fundamental periods of the preceding and following voiced sections. Suppose that the sections of the voice frames have been determined in time series as shown in FIG. 7. For a voiced section, the feature amount calculation units 11 and 21 obtain the fundamental period of each voice frame in that section. For a silent section sandwiched between voiced sections (see α in FIG. 7), they calculate the fundamental period of each voice frame in the silent section by interpolation between the fundamental period near the end of the preceding voiced section and the fundamental period near the start of the following voiced section. The same applies to unvoiced sections (see β in FIG. 7).
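The gap filling described for FIG. 7 can be sketched as follows. The text says only that interpolation is used; linear interpolation between the last voiced frame before the gap and the first voiced frame after it is an assumption, as is the convention of marking silent and unvoiced frames with `None`.

```python
def fill_unvoiced_periods(periods):
    """Interpolate fundamental periods across silent and unvoiced gaps.

    periods: per-frame fundamental period, or None for frames in silent
    or unvoiced sections. Gaps that are not enclosed by voiced frames on
    both sides (at the very start or end) are left as None.
    """
    filled = list(periods)
    voiced = [i for i, p in enumerate(periods) if p is not None]
    for left, right in zip(voiced, voiced[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = (1.0 - weight) * periods[left] + weight * periods[right]
    return filled


# Two voiced frames (10.0 and 13.0) enclosing a gap of two frames.
filled = fill_unvoiced_periods([None, 10.0, None, None, 13.0, None])
```

Here the two frames inside the gap receive periods close to 11.0 and 12.0, while the leading and trailing `None` entries stay untouched because they are not sandwiched between voiced sections.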

[Example 3]
Next, Example 3 will be described. Example 3 combines Examples 1 and 2 and selects a complementary voice that is easy to distinguish from the program voice in terms of both voice quality and voice pitch. Specifically, Example 3 selects one complementary voice from a plurality of complementary voices based on the acoustic feature amount using the spectral feature amount of Example 1 and the acoustic feature amount using the pitch feature amount of Example 2.

As in Example 1, the feature amount calculation units 11 and 21 obtain, as an acoustic feature amount, a GMM supervector by concatenating the mean vectors of a Gaussian mixture model (GMM) of the spectral feature amount, one per mixture component, and calculate an i-vector from it. In addition, as in Example 2, they obtain, as an acoustic feature amount, a GMM supervector by concatenating the mean vectors of a GMM of the pitch feature amount, one per mixture component, and calculate an i-vector from it.

Specifically, the feature amount calculation units 11 and 21 calculate the i-vector w_s based on the spectral feature amount by performing the processing shown in FIG. 2, and the i-vector w_p based on the pitch feature amount by performing the processing shown in FIG. 5.

The similarity calculation unit 22-m receives, from the feature amount calculation units 11-1 to 11-N, the 1st to Nth i-vectors w_s based on the spectral feature amount and i-vectors w_p based on the pitch feature amount. It also receives, from the corresponding feature amount calculation unit 21-m, the mth i-vector w_s based on the spectral feature amount and the mth i-vector w_p based on the pitch feature amount.

For each of the i-vector w_s based on the spectral feature amount and the i-vector w_p based on the pitch feature amount, the similarity calculation unit 22-m calculates the similarity between each of the first to Nth i-vectors w and the mth i-vector w. The similarity calculation unit 22-m then outputs the similarities between the first to Nth program audios and the mth complementary audio to the corresponding similarity addition unit 23-m.
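Equations (1) and (2) are not part of this passage, so the similarity measure in the following sketch is an assumption: cosine similarity, a common choice for comparing i-vectors. The function names and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarities_to_programs(
    program_ivectors: Sequence[Sequence[float]],
    complementary_ivector: Sequence[float],
) -> List[float]:
    """Similarity between each of the N program i-vectors and the mth complementary i-vector."""
    return [cosine_similarity(w_n, complementary_ivector) for w_n in program_ivectors]


# Hypothetical 2-dimensional spectral i-vectors for N = 2 program speakers.
program_w_s = [[1.0, 0.0], [0.0, 1.0]]
complementary_w_s = [1.0, 0.0]
sims_s = similarities_to_programs(program_w_s, complementary_w_s)
```

The same call would be made a second time with the pitch-based i-vectors w_p, giving two lists of N similarities per complementary audio.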

The similarity addition unit 23-m receives, from the corresponding similarity calculation unit 22-m, the similarities between the first to Nth program audios and the mth complementary audio for each of the i-vector w_s based on the spectral feature amount and the i-vector w_p based on the pitch feature amount. For each of the two i-vector types, the similarity addition unit 23-m adds the similarities to calculate a sum of similarities. This yields one sum of similarities for the acoustic feature amount using the spectral feature amount and another for the acoustic feature amount using the pitch feature amount. The similarity addition unit 23-m then adds the two sums, weighted by a preset weighting coefficient, to obtain a combined sum of similarities, and outputs it to the selection unit 24.

Here, let s_Sm be the sum of the similarities obtained by equations (1) and (2) above for the i-vector w_s based on the spectral feature amount, and let s_Pm be the sum of the similarities obtained by equations (1) and (2) above for the i-vector w_p based on the pitch feature amount. With g as the weighting coefficient, the combined sum of similarities s_SPm, which is the result of weighting and adding the sums s_Sm and s_Pm, is expressed by the following equation.

The weighting coefficient g is a real number that takes a value in the following range.

The case g = 1.0 corresponds to Example 1, and the case g = 0.0 corresponds to Example 2.
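The equation and the range referred to above are not reproduced in this text. As an assumption rather than a quotation, a convex combination is the form consistent with the statement that g = 1.0 reduces to Example 1 and g = 0.0 reduces to Example 2:

```latex
s_{SPm} = g \, s_{Sm} + (1 - g) \, s_{Pm}, \qquad 0.0 \le g \le 1.0
```

Under this form, g weights the spectral sum s_Sm and 1 - g weights the pitch sum s_Pm.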

The selection unit 24 receives the combined sums of similarities from the similarity addition units 23-1 to 23-M and identifies the smallest among them. The selection unit 24 then selects, from among the complementary audio DBs 20-1 to 20-M (that is, from among the M complementary-audio speakers), the complementary audio DB 20 (complementary-audio speaker) corresponding to the smallest combined sum of similarities, and outputs selection information.

Here, let the complementary audio DB 20 (complementary-audio speaker) corresponding to the smallest combined sum of similarities s_SPm be the complementary audio DB 20-c (complementary-audio speaker c), and let the selection information be c (a value from 1 to M). The selection information c is then selected by the following equation.
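The weighting and selection steps can be sketched as follows. The sketch assumes that the combined sum is the convex combination g * s_Sm + (1 - g) * s_Pm and that c is the index m minimizing it; neither formula is reproduced in this text, and the function names and similarity values are hypothetical.

```python
from typing import Sequence


def weighted_total(s_spectral_sum: float, s_pitch_sum: float, g: float) -> float:
    """Combine the two similarity sums with weight g (assumed convex combination)."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("g is assumed to lie in [0.0, 1.0]")
    return g * s_spectral_sum + (1.0 - g) * s_pitch_sum


def select_complementary(
    spectral_sims: Sequence[Sequence[float]],
    pitch_sims: Sequence[Sequence[float]],
    g: float,
) -> int:
    """Return the 1-based index c of the candidate with the smallest combined sum.

    spectral_sims[m][n] and pitch_sims[m][n] hold the similarity between
    complementary audio m and program audio n for each feature type.
    """
    if not spectral_sims or len(spectral_sims) != len(pitch_sims):
        raise ValueError("both inputs need the same non-zero number of candidates")
    totals = [
        weighted_total(sum(s_row), sum(p_row), g)
        for s_row, p_row in zip(spectral_sims, pitch_sims)
    ]
    return min(range(len(totals)), key=totals.__getitem__) + 1


# Hypothetical similarities for M = 3 candidates and N = 2 program speakers.
spectral = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.4]]
pitch = [[0.1, 0.2], [0.9, 0.7], [0.3, 0.2]]

c_spectral_only = select_complementary(spectral, pitch, g=1.0)
c_pitch_only = select_complementary(spectral, pitch, g=0.0)
c_balanced = select_complementary(spectral, pitch, g=0.5)
```

With these made-up numbers the three settings of g select three different candidates, which shows how the weighting coefficient shifts the selection between voice quality and voice pitch.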

As described above, the feature amount calculation units 11 and 21 of Example 3 calculate, as acoustic feature amounts, both an i-vector based on the spectral feature amount and an i-vector based on the pitch feature amount.

For each of the i-vector based on the spectral feature amount and the i-vector based on the pitch feature amount, the similarity calculation unit 22-m calculates the similarities between the first to Nth program audios and the mth complementary audio. For each of the two i-vector types, the similarity addition unit 23-m then adds the similarities to calculate a sum of similarities, and weights and adds the two sums to obtain the combined sum of similarities.

Based on the combined sums of similarities, the selection unit 24 selects, from among the complementary audio DBs 20-1 to 20-M (from among the M complementary-audio speakers), the complementary audio DB 20-c (complementary-audio speaker c) whose acoustic features are least similar to those of the program audio.

Here, the complementary audio DB 20-c (complementary-audio speaker c) is selected using, as indices, the acoustic feature amount calculated from the spectral feature amount and the acoustic feature amount calculated from the pitch feature amount. As described above, the spectral feature amount reflects the frequency components of the voice, and voice quality is determined by those frequency components. Voice pitch, in turn, is determined by the pitch feature amount.

Therefore, even when adding the complementary audio to the program audio results in the two being presented simultaneously, listeners can easily tell the program audio and the complementary audio apart. In other words, a complementary audio is obtained whose speaker is easy to distinguish in both voice quality and voice pitch.

In particular, the combined sum of similarities, which is the index used to select the complementary audio DB 20-c (complementary-audio speaker c), reflects the weights applied to the i-vector based on the spectral feature amount and to the i-vector based on the pitch feature amount. That is, when voice quality is to be emphasized, bringing the weighting coefficient for the i-vector based on the spectral feature amount close to 1.0 yields a combined sum of similarities that reflects voice quality. When voice pitch is to be emphasized, bringing the weighting coefficient for the i-vector based on the pitch feature amount close to 1.0 yields a combined sum of similarities that reflects voice pitch. Therefore, by presetting a weighting coefficient suited to the program audio, a complementary audio that is even easier to distinguish from that program audio can be obtained.

The present invention has been described above with reference to Examples 1 to 3. However, the present invention is not limited to Examples 1 to 3 and can be modified in various ways without departing from its technical idea. For example, Example 1 calculates an acoustic feature amount based on the spectral feature amount, Example 2 calculates an acoustic feature amount based on the pitch feature amount, and Example 3 calculates both. The present invention does not limit the method of calculating the acoustic feature amount to a method based on the spectral feature amount or a method based on the pitch feature amount; other methods may be used.

For example, assume that three different types of acoustic feature amounts are calculated using three different methods. The feature amount calculation units 11 and 21 calculate first to third i-vectors using the first to third methods, respectively. For each of the first to third i-vectors, the similarity calculation unit 22-m calculates the similarities between the first to Nth program audios and the mth complementary audio. For each of the first to third i-vectors, the similarity addition unit 23-m then adds the similarities to calculate a sum of similarities, and weights and adds the three sums to obtain the combined sum of similarities. Based on the combined sums of similarities, the selection unit 24 selects, from among the complementary audio DBs 20-1 to 20-M (from among the M complementary-audio speakers), the complementary audio DB 20-c (complementary-audio speaker c) whose acoustic features are least similar to those of the program audio.
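The three-feature variant can be sketched as a generalization to K feature types. This is an illustration under stated assumptions: the text does not say how the three weights are constrained, so the sketch only requires them to be non-negative, and the function name `select_with_k_features` and all similarity values are hypothetical.

```python
from typing import Sequence


def select_with_k_features(
    sims: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> int:
    """Return the 1-based index c of the candidate with the smallest weighted total.

    sims[k][m][n] is the similarity, under feature type k, between
    complementary audio m and program audio n; weights[k] is the weight
    of feature type k (assumed non-negative).
    """
    if not sims or len(sims) != len(weights):
        raise ValueError("one weight is required per feature type")
    if any(w < 0.0 for w in weights):
        raise ValueError("weights are assumed to be non-negative")
    num_candidates = len(sims[0])
    if num_candidates == 0 or any(len(f) != num_candidates for f in sims):
        raise ValueError("every feature type must cover the same candidates")
    totals = []
    for m in range(num_candidates):
        # Sum over the N program audios per feature type, then weight and add.
        totals.append(sum(w * sum(feature[m]) for w, feature in zip(weights, sims)))
    return min(range(num_candidates), key=totals.__getitem__) + 1


# Hypothetical similarities: K = 3 feature types, M = 2 candidates, N = 2 programs.
feature_1 = [[0.9, 0.9], [0.1, 0.1]]
feature_2 = [[0.1, 0.1], [0.8, 0.8]]
feature_3 = [[0.2, 0.2], [0.6, 0.6]]
all_sims = [feature_1, feature_2, feature_3]

c_first_heavy = select_with_k_features(all_sims, weights=[0.6, 0.2, 0.2])
c_others_heavy = select_with_k_features(all_sims, weights=[0.2, 0.4, 0.4])
```

Setting K = 2 with weights g and 1 - g recovers the two-feature case of Example 3 under the same assumptions.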

Note that an ordinary computer can be used as the hardware configuration of the voice selection device 1 according to the embodiment of the present invention. The voice selection device 1 is configured by a computer including a CPU, a volatile storage medium such as a RAM, a non-volatile storage medium such as a ROM, interfaces, and the like. The functions of the feature amount calculation units 11-1 to 11-N, the feature amount calculation units 21-1 to 21-M, the similarity calculation units 22-1 to 22-M, the similarity addition units 23-1 to 23-M, and the selection unit 24 provided in the voice selection device 1 are each realized by causing the CPU to execute a program describing these functions. These programs (voice selection programs) can be stored and distributed on a storage medium such as a magnetic disk (floppy (registered trademark) disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, and can also be transmitted and received via a network.

1 Voice selection device
10-1 to 10-N Program audio DB
11-1 to 11-N, 21-1 to 21-M Feature amount calculation unit
20-1 to 20-M Complementary audio DB
22-1 to 22-M Similarity calculation unit
23-1 to 23-M Similarity addition unit
24 Selection unit

Claims

A voice selection device that, when program audio is presented with complementary audio added to it, selects the complementary audio from a plurality of complementary audios, the voice selection device comprising:
a program audio DB (database) storing a predetermined number, one or more, of program audio data;
a complementary audio DB storing a predetermined number, two or more, of complementary audio data;
a feature amount calculation unit that calculates an acoustic feature amount for each of the predetermined number of program audio data stored in the program audio DB, and calculates an acoustic feature amount for each of the predetermined number of complementary audio data stored in the complementary audio DB;
a similarity calculation unit that calculates the similarity between the acoustic feature amount of each of the predetermined number of program audio data and the acoustic feature amount of each of the predetermined number of complementary audio data, both calculated by the feature amount calculation unit;
a similarity addition unit that, for each of the complementary audio data, adds the similarities, calculated by the similarity calculation unit, between the acoustic feature amounts of the predetermined number of program audio data and the acoustic feature amount of that complementary audio data, to obtain a sum; and
a selection unit that identifies the minimum sum among the sums obtained by the similarity addition unit for the respective complementary audio data, and selects, from the predetermined number of complementary audio data, the complementary audio data corresponding to the minimum sum.

The voice selection device according to claim 1, wherein
the feature amount calculation unit,
for each of the program audio data and the complementary audio data, cuts out audio data in units of frames of a predetermined length; obtains a frequency characteristic for each frame of audio data; obtains, based on the frequency characteristic, a spectral feature amount including static coefficients composed of mel-frequency cepstral coefficients and logarithmic energy, and first-order and second-order regression coefficients of the static coefficients; calculates, based on the spectral feature amount, GMM parameters composed of mixture weights and Gaussian distributions for the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, an i-vector as the acoustic feature amount.

The voice selection device according to claim 1, wherein
the feature amount calculation unit,
for each of the program audio data and the complementary audio data, cuts out audio data in units of frames of a predetermined length; sets fundamental period candidates for each frame of audio data; extracts a fundamental period from the fundamental period candidates by obtaining the degree of periodicity of the candidates; obtains, based on the fundamental period, a pitch feature amount including the logarithmic fundamental frequency and first-order and second-order regression coefficients of the logarithmic fundamental frequency; calculates, based on the pitch feature amount and using an EM algorithm, GMM parameters composed of mixture weights for the number of mixtures and Gaussian distributions for the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, an i-vector as the acoustic feature amount.

The voice selection device according to claim 1, wherein
the feature amount calculation unit,
for each of the program audio data and the complementary audio data, cuts out audio data in units of frames of a predetermined length; obtains a frequency characteristic for each frame of audio data; obtains, based on the frequency characteristic, a spectral feature amount including static coefficients composed of mel-frequency cepstral coefficients and logarithmic energy, and first-order and second-order regression coefficients of the static coefficients; calculates, based on the spectral feature amount, GMM parameters composed of mixture weights and Gaussian distributions for the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, a first i-vector as the acoustic feature amount, and
for each frame of audio data, sets fundamental period candidates; extracts a fundamental period from the fundamental period candidates by obtaining the degree of periodicity of the candidates; obtains, based on the fundamental period, a pitch feature amount including the logarithmic fundamental frequency and first-order and second-order regression coefficients of the logarithmic fundamental frequency; calculates, based on the pitch feature amount and using an EM algorithm, GMM parameters composed of mixture weights for the number of mixtures and Gaussian distributions for the number of mixtures; extracts the mean vectors of the Gaussian distributions from the GMM parameters; obtains a GMM supervector by concatenating the mean vectors for the number of mixtures; and calculates, based on the GMM supervector, a second i-vector as the acoustic feature amount,
the similarity calculation unit
calculates the similarity between the first i-vector of each of the predetermined number of program audio data and the first i-vector of each of the predetermined number of complementary audio data, both calculated by the feature amount calculation unit, and
calculates the similarity between the second i-vector of each of the predetermined number of program audio data and the second i-vector of each of the predetermined number of complementary audio data, both calculated by the feature amount calculation unit, and
the similarity addition unit,
for each of the complementary audio data, adds the similarities, calculated by the similarity calculation unit, between the first i-vectors of the predetermined number of program audio data and the first i-vector of that complementary audio data, to obtain a first addition result,
for each of the complementary audio data, adds the similarities, calculated by the similarity calculation unit, between the second i-vectors of the predetermined number of program audio data and the second i-vector of that complementary audio data, to obtain a second addition result, and
weights and adds the first addition result and the second addition result to obtain the sum.

A voice selection program for causing a computer to function as the voice selection device according to any one of claims 1 to 4.