JP2008026489A

JP2008026489A - Voice signal conversion apparatus

Info

Publication number: JP2008026489A
Application number: JP2006197173A
Authority: JP
Inventors: Makoto Shosakai; 誠庄境
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2006-07-19
Filing date: 2006-07-19
Publication date: 2008-02-07
Anticipated expiration: 2026-07-19
Also published as: JP4996156B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice signal conversion apparatus suitable for accurately converting a feature parameter sequence when speaker variation, style variation, and environmental variation jointly affect the signal.

SOLUTION: A feature parameter sequence is extracted from an input voice signal and converted into the feature parameter sequence of the voice signal of a first reference speaker 16a. The converted sequence is then converted, in turn, into the feature parameter sequence of the voice signal of a first reference style 16b, of a second reference style 16c, and of a second reference speaker 16d. Finally, the result is converted into the feature parameter sequence of the output voice signal, and the output voice signal is generated from that sequence.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to an apparatus and method for converting a voice signal, and to a system and terminal that provide a voice signal conversion service. In particular, it relates to a voice signal conversion apparatus suited to accurately converting the feature parameter sequence of a voice signal when speaker variation, style variation, and environmental variation act in combination. By converting the feature parameter sequences of the acoustic models used in speech recognition, the apparatus can be applied to acoustic model conversion. It can also be applied to voice quality conversion, in which input speech is converted into speech with a different voice quality or speaking style; in that case, a voice signal is generated from the converted feature parameter sequence.

Speech recognition processing consists of two parts. Feature analysis converts the speech samples uttered by a speaker into a feature parameter sequence. Feature matching compares that sequence with information on the feature parameters of vocabulary words stored in advance in a storage device such as memory or a hard disk, and takes the most similar speech as the recognition result. Cepstrum analysis and linear prediction analysis are well-known feature analysis methods and are described in detail in Chapter 3, "Signal Processing and Analysis Methods for Speech Recognition," of Non-Patent Document 1. Recognition of speech from unspecified speakers is generally called speaker-independent speech recognition. Because information on the feature parameters of vocabulary words is stored in the storage device in advance, the user does not have to register the words to be recognized, as is required in speaker-dependent speech recognition. The method generally used both to create the information on the feature parameters of vocabulary words and to match that information against the feature parameter sequence converted from input speech is the hidden Markov model (hereinafter, HMM). In the HMM method, speech units such as syllables, demisyllables, phonemes, biphones, and triphones are modeled by HMMs. These models are generally called acoustic models. Methods for creating acoustic models, for example the well-known EM algorithm, are described in detail in Chapter 6, "Theory and Implementation of Hidden Markov Models," of Non-Patent Document 1. Using the well-known Viterbi algorithm described in the same document, a person skilled in the art can readily construct a speaker-independent speech recognition apparatus.
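The feature matching step described above can be illustrated with a minimal Viterbi decoder. This is only a sketch under stated assumptions: the two states `"s"` and `"a"`, the quantized observation symbols `"hi"` and `"lo"`, and all probabilities are hypothetical toy values, not taken from the patent or from Non-Patent Document 1. Real recognizers score continuous feature vectors rather than discrete symbols.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a nested dict of probabilities into log-probabilities."""
    return {k: {j: math.log(v) for j, v in row.items()} for k, row in table.items()}


# Hypothetical toy model: a fricative-like state "s" followed by a vowel-like state "a".
states = ["s", "a"]
log_start = {"s": math.log(0.8), "a": math.log(0.2)}
log_trans = to_log({"s": {"s": 0.6, "a": 0.4}, "a": {"s": 0.1, "a": 0.9}})
log_emit = to_log({"s": {"hi": 0.9, "lo": 0.1}, "a": {"hi": 0.2, "lo": 0.8}})

path, logp = viterbi(["hi", "hi", "lo", "lo"], states, log_start, log_trans, log_emit)
print(path, math.exp(logp))
```

Working in the log domain avoids numerical underflow on long observation sequences, which is why the tables are converted before decoding.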

In voice quality conversion, on the other hand, an input voice signal is converted into a target output voice signal. When the main source of variation in the voice signal is speaker variation, that is, the difference between the vocal tract of the speaker of the input voice signal and that of the speaker of the output voice signal, the methods of Non-Patent Documents 2 and 3 are known to be effective as feature parameter sequence conversion methods based on a Gaussian mixture model (GMM).

Non-Patent Document 1: Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition," Prentice Hall Signal Processing Series, 1993.
Non-Patent Document 2: Y. Stylianou, O. Cappe and E. Moulines, "Continuous Probabilistic Transform for Voice Conversion," IEEE Proc. on Speech and Audio Processing, 1996.
Non-Patent Document 3: A. Kain and M. Macon, "Spectral Voice Conversion for Text-to-Speech Synthesis," ICASSP, 1998.
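The general idea of GMM-based feature conversion referred to above can be sketched as a posterior-weighted sum of per-component linear regressions. The sketch below is a simplified scalar version under stated assumptions: real systems operate on feature vectors with covariance matrices, and the `Component` class, its field names, and the numeric values are hypothetical illustrations rather than the formulation of Non-Patent Documents 2 and 3.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """One mixture component of a joint source/target Gaussian (scalar case)."""
    weight: float
    mean_x: float   # source-speaker mean
    mean_y: float   # target-speaker mean
    var_x: float    # source variance
    cov_xy: float   # source/target covariance


def gaussian(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, components):
    """Map a source feature x to the target space.

    Each component contributes its conditional mean E[y | x], weighted by the
    posterior probability that x was generated by that component.
    Assumes x is close enough to some component that the total likelihood
    does not underflow to zero.
    """
    likes = [c.weight * gaussian(x, c.mean_x, c.var_x) for c in components]
    total = sum(likes)
    return sum(
        (like / total) * (c.mean_y + c.cov_xy / c.var_x * (x - c.mean_x))
        for like, c in zip(likes, components)
    )


# Hypothetical two-component model.
gmm = [
    Component(weight=0.5, mean_x=-1.0, mean_y=-2.0, var_x=1.0, cov_xy=0.5),
    Component(weight=0.5, mean_x=1.0, mean_y=3.0, var_x=1.0, cov_xy=0.5),
]
converted = convert(0.0, gmm)
print(converted)
```

Because the posterior weights vary smoothly with `x`, the mapping is continuous, unlike a hard codebook lookup that switches abruptly between classes.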

In speech recognition, speaker-independent acoustic models based on HMMs are developed in advance from large speech corpora, as described above. It was long believed that the larger the speech corpus, the better the recognition performance, and large corpora were collected accordingly. Recently, however, it has become clear that indiscriminately enlarging the corpus yields diminishing gains and does not lead to a fundamental improvement in recognition performance. For example, building an acoustic model from a large corpus of read sentences does not improve the recognition of words or connected digits. Likewise, even among sentence material, an acoustic model built from a corpus of read phoneme-balanced sentences does not improve the recognition of lecture speech, and vice versa.

That is, even for the same vowel "a", the "a" in the digit "3 (san)", the "a" in the isolated word "asahi", and an "a" appearing in a sentence may have different frequency spectra, even for the same speaker. Across speakers, voice pitch and manner of speaking also differ, so the spectral variation is larger still. Consequently, which instances of "a" share the same frequency spectrum cannot be determined solely from the pronunciation notation, that is, the phonemic transcription, of the digits, words, and sentences that contain them. Japanese is conventionally described as being built from a phoneme set of vowels and consonants, but humans are not robots and cannot pronounce the "a" in a digit, word, or sentence identically every time. The frequency spectrum changes under the influence of one or more preceding and following phonemes. Unfortunately, these spectral changes have been neither fully explained nor expressed mathematically. With conventional technology, obtaining the feature parameter sequence of the frequency spectrum of an "a" in digits, words, or sentences that are not contained in an already collected speech corpus requires collecting a new corpus that contains them, which increases the corpus collection cost.

Car navigation systems and hands-free calling systems have emerged as applications of speech recognition in automobiles. Research to date has shown that, even for the same speaker, the frequency spectrum of speech uttered while driving a car differs from that of speech uttered while sitting in a quiet room. Collecting a speech corpus from speakers sitting in a quiet room is relatively inexpensive, and such corpora are plentiful. With conventional technology, however, obtaining the frequency spectrum of speech uttered while driving requires collecting a new corpus of speech uttered while driving. This collection is very expensive, and when the budget is constrained, the amount of speech that can be collected is severely limited.

To solve these problems, one can apply voice signal conversion to an already collected speech corpus and thereby generate, in a pseudo manner, a speech corpus with the desired recording conditions. Specifically, the feature parameter sequence of the frequency spectrum of the digits, words, and sentences contained in an already collected small speech corpus is taken as the conversion source, and the feature parameter sequence of the frequency spectrum of digits, words, and sentences under the desired recording conditions is taken as the conversion target. The feature parameter sequence conversion is trained on the already collected speech corpus and a small speech corpus collected under the desired recording conditions. Similarly, for the feature parameter sequence of the frequency spectrum of speech uttered while driving, the conversion is trained on a speech corpus recorded while sitting in a quiet room and a small amount of speech recorded while driving.

In conventional voice signal conversion, however, the feature parameter sequence can be converted accurately, using the techniques of Non-Patent Documents 2 and 3 and the like, only when the main source of variation in the voice signal is speaker variation. When factors other than speaker variation are involved, the feature parameter sequence cannot be converted accurately. For example, when the frequency spectrum of a voice signal is to be converted between digits, words, and sentences as described above, style variation between the speaking style or utterance content of the input voice signal and that of the output voice signal is considered to have an effect. Likewise, when a voice signal uttered in a quiet environment is to be converted into one uttered in a noisy environment as described above, environmental variation between the noise or reverberation of the utterance environment of the input voice signal and that of the output voice signal is considered to have an effect. Here, converting accurately means converting the feature parameter sequence so that, from the voice signal of a given speaker, a voice signal close to speech uttered by the target speaker is obtained.

The present invention was made in view of these unsolved problems of the conventional technology. Its object is to provide a voice signal conversion apparatus suitable for accurately converting a feature parameter sequence when speaker variation, style variation, and environmental variation act in combination.

To achieve the above object, the voice signal conversion apparatus according to claim 1 of the present invention is a voice signal conversion apparatus that converts an input voice signal into a target output voice signal, comprising: feature parameter sequence extraction means for extracting from the input voice signal a high-dimensional feature parameter sequence of at least a predetermined number of dimensions; acoustic model map storage means for storing an acoustic model map together with high-dimensional acoustic models; feature parameter sequence conversion means for converting the feature parameter sequence extracted by the extraction means into the feature parameter sequence of the output voice signal according to a combination of at least two of variation between speaker attributes, variation between style attributes, and variation between environment attributes; and voice signal generation means for generating the output voice signal from the feature parameter sequence converted by the conversion means. The high-dimensional acoustic models are generated by grouping speech data obtained from a plurality of speakers on the basis of three attributes (speaker attribute, style attribute, and environment attribute) and building, from the speech data belonging to each group, an acoustic model having a high-dimensional feature parameter sequence of at least the predetermined number of dimensions. The acoustic model map is composed of acoustic-model-corresponding low-dimensional vectors, which have fewer dimensions than the high-dimensional models and are obtained by converting the high-dimensional acoustic models while preserving the mathematical distance relationships among them. In the acoustic model map, the distribution region of the low-dimensional vectors that share the same environment attribute contains the distribution regions of the low-dimensional vectors for a plurality of different style attributes, and each of those style-specific distribution regions contains the distribution regions of the low-dimensional vectors for a plurality of different speaker attributes.
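The acoustic model map described above requires a projection from high-dimensional models to low-dimensional vectors that preserves the distances among the models. The passage does not name the projection method, so the following is only one possible sketch: gradient descent on a raw stress function (a basic form of multidimensional scaling), with Euclidean distance between toy vectors standing in for a distance between acoustic models. The function names, the toy `models`, and the step-size handling are all assumptions made for illustration.

```python
import math
import random


def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(target, low):
    """Sum of squared differences between map distances and target distances."""
    d = pairwise(low)
    n = len(low)
    return sum((d[i][j] - target[i][j]) ** 2 for i in range(n) for j in range(i + 1, n))


def build_map(target, dim=2, iters=200, seed=0):
    """Place one low-dimensional vector per model so that distances match target."""
    rng = random.Random(seed)
    n = len(target)
    low = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    initial = current = stress(target, low)
    step = 0.1
    for _ in range(iters):
        d = pairwise(low)
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j and d[i][j] > 0.0:
                    coef = 2.0 * (d[i][j] - target[i][j]) / d[i][j]
                    for k in range(dim):
                        grad[i][k] += coef * (low[i][k] - low[j][k])
        trial = [[low[i][k] - step * grad[i][k] for k in range(dim)] for i in range(n)]
        trial_stress = stress(target, trial)
        if trial_stress < current:
            low, current = trial, trial_stress   # accept only improving steps
        else:
            step *= 0.5                          # otherwise shrink the step
    return low, initial, current


# Toy stand-ins for high-dimensional acoustic models: two well-separated groups.
models = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [5, 5, 5, 5], [6, 5, 5, 5]]
low, initial_stress, final_stress = build_map(pairwise(models))
print(initial_stress, final_stress)
```

Only improving steps are accepted, so the final stress can never exceed the initial stress; models that are far apart in the original space tend to end up far apart on the map as well.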

With this configuration, the feature parameter sequence extraction means extracts from the input voice signal a high-dimensional feature parameter sequence of at least the predetermined number of dimensions. The feature parameter sequence conversion means then converts the extracted sequence into the feature parameter sequence of the output voice signal according to a combination of at least two of variation between speaker attributes, variation between style attributes, and variation between environment attributes. Finally, the voice signal generation means generates the output voice signal from the converted feature parameter sequence.

Here, conversion according to variation between speaker attributes includes, for example, converting the source feature parameter sequence into a feature parameter sequence having a different speaker attribute, or converting those parts of the feature parameter sequence that relate to the speaker. The same applies below to the voice signal conversion method according to claim 10.

Conversion according to variation between style attributes includes, for example, converting the source feature parameter sequence into a feature parameter sequence having a different style attribute, or converting those parts of the feature parameter sequence that relate to the style. The same applies below to the voice signal conversion method according to claim 10.

Conversion according to variation between environment attributes includes, for example, converting the source feature parameter sequence into a feature parameter sequence having a different environment attribute, or converting those parts of the feature parameter sequence that relate to the environment. The same applies below to the voice signal conversion method according to claim 10.

Further, the voice signal conversion apparatus according to claim 2 of the present invention is the voice signal conversion apparatus according to claim 1, wherein the variation between speaker attributes is the variation between the speaker attribute relating to the vocal tract of the speaker of the input voice signal and the speaker attribute relating to the vocal tract of the speaker of the output voice signal.

With this configuration, when conversion is performed according to speaker variation, the feature parameter sequence conversion means converts the feature parameter sequence according to the variation between the speaker attribute relating to the vocal tract of the speaker of the input voice signal and the speaker attribute relating to the vocal tract of the speaker of the output voice signal.

Further, the voice signal conversion apparatus according to claim 3 of the present invention is the voice signal conversion apparatus according to claim 1, wherein the variation between style attributes is the variation between the style attribute relating to the speaking style or utterance content of the input voice signal and the style attribute relating to the speaking style or utterance content of the output voice signal.

With this configuration, when conversion is performed according to style variation, the feature parameter sequence conversion means converts the feature parameter sequence according to the variation between the style attribute relating to the speaking style or utterance content of the input voice signal and the style attribute relating to the speaking style or utterance content of the output voice signal.

Further, the voice signal conversion apparatus according to claim 4 of the present invention is the voice signal conversion apparatus according to claim 1, wherein the variation between environment attributes is the variation between the environment attribute relating to the noise or reverberation of the utterance environment of the input voice signal and the environment attribute relating to the noise or reverberation of the utterance environment of the output voice signal.

With this configuration, when conversion is performed according to environmental variation, the feature parameter sequence conversion means converts the feature parameter sequence according to the variation between the environment attribute relating to the noise or reverberation of the utterance environment of the input voice signal and the environment attribute relating to the noise or reverberation of the utterance environment of the output voice signal.

The present inventor further found that, when speaker variation, style variation, and environmental variation act in combination, their influence is largest for environmental variation, followed by style variation and then speaker variation, and that accuracy can be improved by converting the feature parameter sequence with this order taken into account.

Further, the voice signal conversion apparatus according to claim 5 of the present invention is the voice signal conversion apparatus according to any one of claims 1 to 4, wherein the feature parameter sequence conversion means comprises first, second, and third variation conversion means.

The first variation conversion means converts the feature parameter sequence extracted by the feature parameter sequence extraction means into any one of: the feature parameter sequence of the voice signal of a first reference speaker having a speaker attribute different from that of the input voice signal; the feature parameter sequence of the voice signal of a first reference style having a style attribute different from that of the input voice signal; and the feature parameter sequence of the voice signal of a second reference style having a style attribute different from that of the output voice signal.

The second variation conversion means converts the feature parameter sequence converted by the first variation conversion means into any one of: the feature parameter sequence of the voice signal of the first reference style; the feature parameter sequence of the voice signal of the second reference style; and the feature parameter sequence of the voice signal of a second reference speaker having a speaker attribute different from that of the output voice signal. The sequence chosen must come later than the conversion target of the first variation conversion means in the ordering first reference speaker, first reference style, second reference style, second reference speaker.

The third variation conversion means converts the feature parameter sequence converted by the second variation conversion means into the feature parameter sequence of the output voice signal.

The feature parameter sequence of the voice signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the low-dimensional vectors having the same environment attribute and style attribute as the input voice signal. The feature parameter sequence of the voice signal of the first reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the low-dimensional vector that has an average style attribute within the distribution region of the low-dimensional vectors having the same environment attribute as the input voice signal. The feature parameter sequence of the voice signal of the second reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the low-dimensional vector that has an average style attribute within the distribution region of the low-dimensional vectors having the same environment attribute as the output voice signal. The feature parameter sequence of the voice signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the low-dimensional vector that has an average speaker attribute within the distribution region of the low-dimensional vectors having the same environment attribute and style attribute as the output voice signal.

With this configuration, the first variation conversion means converts the extracted feature parameter sequence into any one of the feature parameter sequences of the voice signals of the first reference speaker, the first reference style, and the second reference style. The second variation conversion means then converts the result into any one of the feature parameter sequences of the voice signals of the first reference style, the second reference style, and the second reference speaker, choosing one that comes later than the conversion target of the first variation conversion means in the ordering first reference speaker, first reference style, second reference style, second reference speaker. The third variation conversion means then converts the result into the feature parameter sequence of the output voice signal.

Each reference is defined on the acoustic model map as the high-dimensional acoustic model whose low-dimensional vector is average within a particular distribution region:

- First reference speaker: average speaker attribute within the region of low-dimensional vectors having the same environment attribute and style attribute as the input voice signal.
- First reference style: average style attribute within the region of low-dimensional vectors having the same environment attribute as the input voice signal.
- Second reference style: average style attribute within the region of low-dimensional vectors having the same environment attribute as the output voice signal.
- Second reference speaker: average speaker attribute within the region of low-dimensional vectors having the same environment attribute and style attribute as the output voice signal.
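The selection of a reference from the acoustic model map can be sketched as follows. The passage says only that the reference has an "average" attribute within a distribution region; this sketch assumes one plausible reading, namely the model whose low-dimensional vector lies closest to the centroid of the region. The `model_map` entries, their identifiers, and the function `reference_model` are hypothetical.

```python
import math


def reference_model(model_map, **attrs):
    """Return the entry nearest the centroid of the region matching attrs.

    "Average" is interpreted here as nearest-to-centroid, which is an
    assumption made for illustration.
    """
    region = [m for m in model_map if all(m[k] == v for k, v in attrs.items())]
    dim = len(region[0]["vec"])
    centroid = [sum(m["vec"][k] for m in region) / len(region) for k in range(dim)]
    return min(region, key=lambda m: math.dist(m["vec"], centroid))


# Hypothetical map: each entry links a high-dimensional model id to its 2-D vector.
model_map = [
    {"id": "quiet/read/spk1", "env": "quiet", "style": "read", "vec": (0.0, 0.0)},
    {"id": "quiet/read/spk2", "env": "quiet", "style": "read", "vec": (1.0, 0.0)},
    {"id": "quiet/read/spk3", "env": "quiet", "style": "read", "vec": (2.0, 0.0)},
    {"id": "quiet/digits/spk1", "env": "quiet", "style": "digits", "vec": (4.0, 4.0)},
    {"id": "quiet/digits/spk2", "env": "quiet", "style": "digits", "vec": (5.0, 4.0)},
    {"id": "car/read/spk1", "env": "car", "style": "read", "vec": (20.0, 0.0)},
    {"id": "car/read/spk2", "env": "car", "style": "read", "vec": (21.0, 1.0)},
]

# First reference speaker: same environment and style as the input signal.
first_ref_speaker = reference_model(model_map, env="quiet", style="read")
# First reference style: same environment as the input signal, any style.
first_ref_style = reference_model(model_map, env="quiet")
print(first_ref_speaker["id"], first_ref_style["id"])
```

The nesting of regions shows up in the filters: the speaker reference is chosen inside an environment-and-style region, while the style reference is chosen inside the wider environment region that contains it.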

That is, the first variation conversion means and the second variation conversion means perform one of the following: conversion via the first reference speaker and the first reference style; conversion via the first reference speaker and the second reference style; conversion via the first reference speaker and the second reference speaker; conversion via the first reference style and the second reference style; conversion via the first reference style and the second reference speaker; or conversion via the second reference style and the second reference speaker. Here, the conversion to the first reference speaker corresponds to the speaker variation between the speaker of the input speech signal and the first reference speaker, and the conversion from the first reference speaker to the first reference style corresponds to the style variation between the style of the input speech signal and the first reference style. The conversion from the first reference style to the second reference style corresponds to the environment variation between the environments of the input speech signal and the output speech signal, and the conversion from the second reference style to the second reference speaker corresponds to the style variation between the second reference style and the style of the output speech signal. Therefore, the conversion by the first variation conversion means and the second variation conversion means is performed in the order of speaker variation, style variation, and environment variation.

Here, the first variation conversion means need only convert to any one of the feature parameter sequence of the speech signal of the first reference speaker, the feature parameter sequence of the speech signal of the first reference style, and the feature parameter sequence of the speech signal of the second reference style, and in doing so it may convert the feature parameter sequence in multiple stages. For example, when converting to the feature parameter sequence of the speech signal of the first reference style, the feature parameter sequence of the input speech signal may first be converted into the feature parameter sequence of the speech signal of the first reference speaker, and the converted feature parameter sequence may then be converted into the feature parameter sequence of the speech signal of the first reference style. The same applies to the second variation conversion means.

Furthermore, the speech signal conversion apparatus according to claim 6 of the present invention is the speech signal conversion apparatus according to any one of claims 1 to 4, wherein the feature parameter sequence conversion means comprises: first speaker variation conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into the feature parameter sequence of a speech signal of a first reference speaker having the same style attribute and environment attribute as the input speech signal and a different speaker attribute; first style variation conversion means for converting the feature parameter sequence converted by the first speaker variation conversion means into the feature parameter sequence of a speech signal of a first reference style having the same environment attribute as the input speech signal and a different style attribute; environment variation conversion means for converting the feature parameter sequence converted by the first style variation conversion means into the feature parameter sequence of a speech signal of a second reference style having the same environment attribute as the output speech signal and a different style attribute; second style variation conversion means for converting the feature parameter sequence converted by the environment variation conversion means into the feature parameter sequence of a speech signal of a second reference speaker having the same style attribute and environment attribute as the output speech signal and a different speaker attribute; and second speaker variation conversion means for converting the feature parameter sequence converted by the second style variation conversion means into the feature parameter sequence of the output speech signal, wherein the feature parameter sequence of the speech signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the input speech signal; the feature parameter sequence of the speech signal of the first reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the input speech signal; the feature parameter sequence of the speech signal of the second reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the output speech signal; and the feature parameter sequence of the speech signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the output speech signal.

With this configuration, the first speaker variation conversion means converts the extracted feature parameter sequence into the feature parameter sequence of the speech signal of the first reference speaker. This conversion corresponds to the speaker variation between the speaker of the input speech signal and the first reference speaker. The feature parameter sequence of the speech signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the input speech signal.
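
As an illustration only, the choice of a reference speaker from the acoustic model map can be sketched as follows. The sketch assumes that "average speaker attribute" can be approximated by the low-dimensional vector nearest the centroid of the region; the text does not specify this rule, and the model identifiers, the two-dimensional coordinates, and the function name `pick_reference` are all hypothetical.

```python
from math import dist
from statistics import fmean


def pick_reference(region):
    """Return the id of the low-dimensional vector nearest the region's centroid.

    `region` maps a hypothetical model id to its coordinates on the map.
    Nearest-to-centroid is an assumed stand-in for "average attribute".
    """
    dims = range(len(next(iter(region.values()))))
    centroid = [fmean(v[d] for v in region.values()) for d in dims]
    return min(region, key=lambda k: dist(region[k], centroid))


# Hypothetical map coordinates of models that share the input signal's
# environment attribute and style attribute.
region = {
    "spk_a": (0.0, 0.0),
    "spk_b": (1.0, 1.0),
    "spk_c": (2.0, 2.5),
    "spk_d": (5.0, 0.5),
}
reference_id = pick_reference(region)
```

The high-dimensional acoustic model associated with `reference_id` would then supply the feature parameter sequence of the reference speaker; the same selection could be repeated for the reference styles with a differently restricted region.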

Next, the first style variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the speech signal of the first reference style. This conversion corresponds to the style variation between the style of the input speech signal and the first reference style. The feature parameter sequence of the speech signal of the first reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the input speech signal.

Next, the environment variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the speech signal of the second reference style. This conversion corresponds to the environment variation between the environments of the input speech signal and the output speech signal. The feature parameter sequence of the speech signal of the second reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the output speech signal.

Next, the second style variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the speech signal of the second reference speaker. This conversion corresponds to the style variation between the second reference style and the style of the output speech signal. The feature parameter sequence of the speech signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the output speech signal.

Then, the second speaker variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the output speech signal. This conversion corresponds to the speaker variation between the second reference speaker and the speaker of the output speech signal.
Therefore, the feature parameter sequence of the input speech signal is converted into the feature parameter sequence of the output speech signal in the order of speaker variation, style variation, and environment variation.
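
The five-stage chain described above can be sketched as a simple composition of stage functions. This is a toy illustration, not the conversion the patent defines: each stage is assumed to be a fixed offset added to every feature vector, and `make_shift`, `convert`, and all numeric values are hypothetical.

```python
from functools import reduce


def make_shift(offset):
    """Hypothetical stage: add a fixed offset to every feature vector."""
    def stage(sequence):
        return [[x + o for x, o in zip(frame, offset)] for frame in sequence]
    return stage


def convert(sequence, stages):
    """Apply the stages in order, feeding each result into the next stage."""
    return reduce(lambda seq, stage: stage(seq), stages, sequence)


stages = [
    make_shift([1.0, 0.0]),   # input speaker -> first reference speaker
    make_shift([0.0, 2.0]),   # first reference speaker -> first reference style
    make_shift([-0.5, 0.0]),  # first reference style -> second reference style
    make_shift([0.0, -1.0]),  # second reference style -> second reference speaker
    make_shift([0.5, 0.5]),   # second reference speaker -> output speech signal
]
output = convert([[0.0, 0.0], [1.0, 1.0]], stages)
```

Keeping each variation as a separate stage means a stage can be swapped out when only one attribute of the input or output changes, which is the practical benefit of routing through the reference speakers and reference styles.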

Furthermore, the speech signal conversion apparatus according to claim 7 of the present invention is the speech signal conversion apparatus according to any one of claims 1 to 4, wherein the feature parameter sequence conversion means comprises: first speaker variation conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into the feature parameter sequence of a speech signal of a first reference speaker having the same style attribute and environment attribute as the input speech signal and a different speaker attribute; first style variation conversion means for converting the feature parameter sequence converted by the first speaker variation conversion means into the feature parameter sequence of a speech signal of either a first reference style having the same environment attribute as the input speech signal and a different style attribute, or a second reference style having the same environment attribute as the output speech signal and a different style attribute; second style variation conversion means for converting the feature parameter sequence converted by the first style variation conversion means into the feature parameter sequence of a speech signal of a second reference speaker having the same style attribute and environment attribute as the output speech signal and a different speaker attribute; and second speaker variation conversion means for converting the feature parameter sequence converted by the second style variation conversion means into the feature parameter sequence of the output speech signal, wherein the feature parameter sequence of the speech signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the input speech signal; the feature parameter sequence of the speech signal of the first reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the input speech signal; the feature parameter sequence of the speech signal of the second reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the output speech signal; and the feature parameter sequence of the speech signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the output speech signal.

Next, the first style variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the speech signal of the first reference style or the second reference style. This conversion corresponds to the style variation between the style of the input speech signal and the first reference style or the second reference style. The feature parameter sequence of the speech signal of the first reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the input speech signal. The feature parameter sequence of the speech signal of the second reference style is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average style attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute as the output speech signal.

Next, the second style variation conversion means converts the converted feature parameter sequence into the feature parameter sequence of the speech signal of the second reference speaker. This conversion corresponds to the style variation between the first reference style or the second reference style and the style of the output speech signal. The feature parameter sequence of the speech signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model corresponding to the acoustic-model-corresponding low-dimensional vector that has an average speaker attribute within the distribution region, in the acoustic model map, of the acoustic-model-corresponding low-dimensional vectors having the same environment attribute and style attribute as the output speech signal.

Furthermore, the speech signal conversion apparatus according to claim 8 of the present invention is the speech signal conversion apparatus according to any one of claims 5 to 7, wherein the acoustic model map storage means further stores a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of speech signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of speech signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of speech signals, the apparatus comprising: input speech signal attribute identifying means for obtaining, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models of the acoustic model map storage means, the likelihood with respect to the feature parameter sequence extracted by the feature parameter sequence extraction means, and identifying the speaker attribute, the style attribute, and the environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input speech signal; output feature parameter sequence extraction means for extracting a feature parameter sequence from a sample of the output speech signal; output speech signal attribute identifying means for obtaining, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models of the attribute model storage means, the likelihood with respect to the feature parameter sequence extracted by the output feature parameter sequence extraction means, and identifying the speaker attribute, the style attribute, and the environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the output speech signal; variation conversion function storage means for storing a plurality of first speaker variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the first reference speaker, a plurality of first style variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the first reference style, a plurality of environment variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the second reference style, a plurality of second style variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the second reference speaker, and a plurality of second speaker variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the output speech signal; and variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function on the basis of the attributes identified by the input speech signal attribute identifying means and the output speech signal attribute identifying means, wherein the feature parameter sequence conversion means converts the feature parameter sequence on the basis of the variation conversion functions selected by the variation conversion function selection means.

With this configuration, the input speech signal attribute identifying means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models of the acoustic model map storage means, the likelihood with respect to the extracted feature parameter sequence, and identifies the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the attributes of the input speech signal.

In addition, the output feature parameter sequence extraction means extracts a feature parameter sequence from a sample of the output speech signal. Next, the output speech signal attribute identifying means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models of the acoustic model map storage means, the likelihood with respect to the extracted feature parameter sequence, and identifies the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the attributes of the output speech signal.
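
The maximum-likelihood identification described above can be sketched as follows. The sketch assumes that each attribute model is a single diagonal-covariance Gaussian, which is a deliberate simplification of the high-dimensional acoustic models in the text; the attribute names, the model parameters, and the functions `log_likelihood` and `identify` are hypothetical.

```python
import math


def log_likelihood(sequence, mean, var):
    """Log-likelihood of a frame sequence under a diagonal Gaussian.

    The Gaussian is an assumed stand-in for a high-dimensional acoustic model.
    """
    total = 0.0
    for frame in sequence:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify(sequence, models):
    """Return the attribute whose model gives the maximum likelihood."""
    return max(models, key=lambda name: log_likelihood(sequence, *models[name]))


# Hypothetical speaker attribute models: attribute -> (mean, variance).
speaker_models = {
    "adult_male": ([0.0, 0.0], [1.0, 1.0]),
    "child": ([3.0, 3.0], [1.0, 1.0]),
}
features = [[2.8, 3.1], [3.2, 2.9]]
speaker = identify(features, speaker_models)
```

The same `identify` call would be repeated with the style attribute models and the environment attribute models, for the input signal and for the sample of the output signal, to obtain all six attributes.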

Then, the variation conversion function selection means selects, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function on the basis of the identified attributes, and the feature parameter sequence conversion means converts the feature parameter sequence on the basis of the selected variation conversion functions.
Furthermore, the speech signal conversion apparatus according to claim 9 of the present invention is the speech signal conversion apparatus according to any one of claims 5 to 7, wherein the acoustic model map storage means includes attribute model storage means for storing a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of speech signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of speech signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of speech signals, the apparatus comprising: input speech signal attribute identifying means for obtaining, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models of the attribute model storage means, the likelihood with respect to the feature parameter sequence extracted by the feature parameter sequence extraction means, and identifying the speaker attribute, the style attribute, and the environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input speech signal; output speech signal attribute input means for inputting the speaker attribute, style attribute, and environment attribute of the output speech signal; output speech signal attribute identifying means for obtaining the Gaussian mixture model whose speaker attribute, style attribute, and environment attribute best match the speaker attribute, style attribute, and environment attribute input by the output speech signal attribute input means, and identifying the speaker attribute, the style attribute, and the environment attribute corresponding to that Gaussian mixture model as the speaker attribute, style attribute, and environment attribute of the output speech signal; variation conversion function storage means for storing a plurality of first speaker variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the first reference speaker, a plurality of first style variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the first reference style, a plurality of environment variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the second reference style, a plurality of second style variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of the second reference speaker, and a plurality of second speaker variation conversion functions each converting a given feature parameter sequence into the feature parameter sequence of the output speech signal; and variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function on the basis of the attributes identified by the input speech signal attribute identifying means and the output speech signal attribute identifying means, wherein the feature parameter sequence conversion means converts the feature parameter sequence on the basis of the variation conversion functions selected by the variation conversion function selection means.

With this configuration, the input speech signal attribute identification means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models and the environment attribute high-dimensional acoustic models in the attribute model storage means, the likelihood with respect to the extracted feature parameter sequence. The speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are then identified as the attributes of the input speech signal.
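The maximum-likelihood identification described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each attribute model is a diagonal-covariance Gaussian mixture given as a list of `(weight, mean, variance)` tuples; the function names and the model layout are hypothetical and do not come from the patent.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_likelihood(frames, mixture):
    """Total log-likelihood of a feature sequence under one mixture model.

    mixture is a list of (weight, mean, variance) tuples (assumed layout).
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss(x, m, v) for w, m, v in mixture]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total

def identify(frames, models):
    """Return the attribute label whose model gives the maximum likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, models[label]))

def identify_attributes(frames, speaker_models, style_models, env_models):
    """Identify speaker, style and environment attributes independently."""
    return (identify(frames, speaker_models),
            identify(frames, style_models),
            identify(frames, env_models))
```

Each of the three attribute sets is searched separately here; whether the patent searches them jointly or separately is not stated in this passage, so that choice is an assumption of the sketch.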

In addition, the speaker attribute, style attribute and environment attribute of the output speech signal are input through the output speech signal attribute input means. The output speech signal attribute identification means then obtains the Gaussian mixture model whose speaker attribute, style attribute and environment attribute best match the input attributes, and the speaker attribute, style attribute and environment attribute corresponding to that Gaussian mixture model are identified as the attributes of the output speech signal.

Then, based on the identified attributes, the variation conversion function selection means selects the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function and the second speaker variation conversion function from the variation conversion function storage means, and the feature parameter sequence conversion means converts the feature parameter sequence based on the selected variation conversion functions.
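The selection and chained application of the five variation conversion functions can be sketched as follows. This is a hedged illustration, not the patent's implementation: the store layout, the lookup keys, and the use of simple offset functions (`shift`) are all assumptions made so the example is runnable.

```python
from functools import reduce

def shift(offset):
    """Hypothetical conversion function: add a fixed offset to each frame."""
    return lambda frame: [x + o for x, o in zip(frame, offset)]

def select_chain(store, src, dst):
    """Select the five functions in the order the text lists them.

    src and dst are (speaker, style, environment) tuples for the input and
    output speech signals. The key layout of `store` is an assumption.
    """
    s_spk, s_sty, s_env = src
    d_spk, d_sty, d_env = dst
    return [
        store["speaker1"][(s_spk, s_sty, s_env)],  # input -> first reference speaker
        store["style1"][(s_sty, s_env)],           # -> first reference style
        store["environment"][(s_env, d_env)],      # -> second reference style
        store["style2"][(d_sty, d_env)],           # -> second reference speaker
        store["speaker2"][(d_spk, d_sty, d_env)],  # -> output speech signal
    ]

def convert(frames, chain):
    """Apply the selected functions, in order, to every frame."""
    return [reduce(lambda frame, fn: fn(frame), chain, f) for f in frames]
```

Keeping the five stages as separate, individually selected functions mirrors the text: each stage removes or adds one kind of variation (speaker, style or environment), so a function stored for one attribute combination can be reused across many input and output pairs.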

Meanwhile, to achieve the above object, the speech signal conversion method according to claim 10 of the present invention is a speech signal conversion method for converting an input speech signal into a target output speech signal, comprising: a feature parameter sequence extraction step of extracting, from the input speech signal, a high-dimensional feature parameter sequence having at least a predetermined number of dimensions; an acoustic model map storage step of grouping speech data acquired from a plurality of speakers based on three attributes, namely a speaker attribute, a style attribute and an environment attribute, generating, from the speech data belonging to each group, high-dimensional acoustic models having a high-dimensional feature parameter sequence of at least the predetermined number of dimensions, and storing, together with the high-dimensional acoustic models, an acoustic model map composed of acoustic-model-compatible low-dimensional vectors that have fewer dimensions than the high-dimensional acoustic models and are obtained by converting the high-dimensional acoustic models while preserving the mathematical distance relationships among them; a feature parameter sequence conversion step of converting the feature parameter sequence extracted in the feature parameter sequence extraction step into the feature parameter sequence of the output speech signal according to a combination of at least two of variation between speaker attributes, variation between style attributes and variation between environment attributes; and a speech signal generation step of generating the output speech signal from the feature parameter sequence converted in the feature parameter sequence conversion step, wherein, in the acoustic model map, the distribution region of the acoustic-model-compatible low-dimensional vectors having the same environment attribute contains the distribution regions of a plurality of acoustic-model-compatible low-dimensional vectors having different style attributes, and each of those distribution regions having different style attributes contains the distribution regions of a plurality of acoustic-model-compatible low-dimensional vectors having different speaker attributes.
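The claim requires only that the low-dimensional vectors preserve the mathematical distance relationships among the high-dimensional acoustic models; this passage does not name a projection method. As a rough illustration, the sketch below uses a generic multidimensional-scaling approach (gradient descent on a raw stress function), which is a stand-in chosen for this example and not necessarily the method the patent uses. The distance matrix and all names are hypothetical.

```python
import math
import random

def stress(points, dist):
    """Sum of squared gaps between map distances and model distances."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(points[i], points[j]) - dist[i][j]) ** 2
    return total

def project(dist, dims=2, steps=500, lr=0.05, seed=0):
    """Place one low-dimensional vector per acoustic model.

    dist[i][j] is an assumed, precomputed distance between models i and j.
    """
    rng = random.Random(seed)
    n = len(dist)
    pts = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dims for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(pts[i], pts[j]) or 1e-9  # avoid division by zero
                coeff = 2.0 * (d - dist[i][j]) / d
                for k in range(dims):
                    g = coeff * (pts[i][k] - pts[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
        for i in range(n):
            for k in range(dims):
                pts[i][k] -= lr * grads[i][k]
    return pts
```

With vectors laid out this way, models that are acoustically close land close together on the map, which is what allows the nested regions (environment containing style, style containing speaker) to be read off as contiguous areas.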

Meanwhile, to achieve the above object, the speech signal conversion service providing system according to claim 11 of the present invention is a speech signal conversion service providing system in which a conversion function providing terminal and a portable terminal are communicably connected and which provides a speech conversion service for converting an input speech signal into a target output speech signal, wherein the conversion function providing terminal comprises: acoustic model map storage means that groups speech data acquired from a plurality of speakers based on three attributes, namely a speaker attribute, a style attribute and an environment attribute, generates, from the speech data belonging to each group, high-dimensional acoustic models having a high-dimensional feature parameter sequence of at least a predetermined number of dimensions, stores, together with the high-dimensional acoustic models, an acoustic model map composed of acoustic-model-compatible low-dimensional vectors that have fewer dimensions than the high-dimensional acoustic models and are obtained by converting the high-dimensional acoustic models while preserving the mathematical distance relationships among them, and further stores a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of speech signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of speech signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of speech signals; input feature parameter sequence reception means for receiving the feature parameter sequence of the input speech signal; input speech signal attribute identification means for obtaining, for the speaker attribute, style attribute and environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood with respect to the feature parameter sequence received by the input feature parameter sequence reception means, and identifying the speaker attribute, style attribute and environment attribute corresponding to the models giving the maximum likelihood as the speaker attribute, style attribute and environment attribute of the input speech signal; output feature parameter sequence reception means for receiving the feature parameter sequence of the output speech signal; output speech signal attribute identification means for obtaining, for the speaker attribute, style attribute and environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood with respect to the feature parameter sequence received by the output feature parameter sequence reception means, and identifying the speaker attribute, style attribute and environment attribute corresponding to the models giving the maximum likelihood as the speaker attribute, style attribute and environment attribute of the output speech signal; variation conversion function storage means for storing a plurality of first speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a first reference speaker having the same style attribute and environment attribute as the input speech signal and a different speaker attribute, a plurality of first style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a first reference style having the same environment attribute as the input speech signal and a different style attribute, a plurality of environment variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a second reference style having the same environment attribute as the output speech signal and a different style attribute, a plurality of second style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a second reference speaker having the same style attribute and environment attribute as the output speech signal and a different speaker attribute, and a plurality of second speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the output speech signal; variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function and the second speaker variation conversion function based on the attributes identified by the input speech signal attribute identification means and the output speech signal attribute identification means; and conversion function transmission means for transmitting the conversion functions selected by the variation conversion function selection means to the portable terminal, and wherein the portable terminal comprises: input speech signal input means for inputting the input speech signal; input feature parameter sequence extraction means for extracting, from the input speech signal input by the input speech signal input means, a high-dimensional feature parameter sequence of at least the predetermined number of dimensions; input feature parameter sequence transmission means for transmitting the feature parameter sequence extracted by the input feature parameter sequence extraction means to the conversion function providing terminal; output speech signal input means for inputting a sample of the output speech signal; output feature parameter sequence extraction means for extracting a feature parameter sequence from the output speech signal input by the output speech signal input means; output feature parameter sequence transmission means for transmitting the feature parameter sequence extracted by the output feature parameter sequence extraction means to the conversion function providing terminal; conversion function reception means for receiving the conversion functions; feature parameter sequence conversion means for converting, based on the conversion functions received by the conversion function reception means, the feature parameter sequence extracted by the input feature parameter sequence extraction means into the feature parameter sequence of the output speech signal; and speech signal generation means for generating the output speech signal from the feature parameter sequence converted by the feature parameter sequence conversion means.

With this configuration, an effect equivalent to that of the speech signal conversion device according to claim 8 is obtained. That is, in the portable terminal, when an input speech signal is input through the input speech signal input means, the input feature parameter sequence extraction means extracts a feature parameter sequence from that input speech signal, and the input feature parameter sequence transmission means transmits the extracted feature parameter sequence to the conversion function providing terminal.

In the conversion function providing terminal, when the input feature parameter sequence reception means receives the feature parameter sequence, the input speech signal attribute identification means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models and the environment attribute high-dimensional acoustic models in the attribute model storage means, the likelihood with respect to the received feature parameter sequence. The speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the input speech signal.

In the portable terminal, when a sample of the output speech signal is input through the output speech signal input means, the output feature parameter sequence extraction means extracts, from that output speech signal, a high-dimensional feature parameter sequence of at least the predetermined number of dimensions, and the output feature parameter sequence transmission means transmits the extracted feature parameter sequence to the conversion function providing terminal.

In the conversion function providing terminal, when the output feature parameter sequence reception means receives the feature parameter sequence, the output speech signal attribute identification means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models and the environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood with respect to the received feature parameter sequence. The speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the output speech signal.

Once the attributes of the input speech signal and the output speech signal have been identified, the variation conversion function selection means selects, based on the identified attributes, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function and the second speaker variation conversion function from the variation conversion function storage means, and the conversion function transmission means transmits the selected conversion functions to the portable terminal.

In the portable terminal, when the conversion function reception means receives the conversion functions, the feature parameter sequence conversion means converts the extracted feature parameter sequence of the input speech signal into the feature parameter sequence of the output speech signal based on the received conversion functions. The speech signal generation means then generates the output speech signal from the converted feature parameter sequence.
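The exchange between the portable terminal and the conversion function providing terminal can be sketched as below. This is a simplified, in-process illustration under several assumptions: direct method calls stand in for the network, a nearest-mean search stands in for the maximum-likelihood identification, a single attribute label replaces the speaker, style and environment triple, and one offset vector replaces the five selected conversion functions. All class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

def mean_vector(frames):
    """Average the frames of a feature parameter sequence."""
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]

def nearest(label_means, frames):
    """Stand-in for maximum-likelihood identification: closest mean wins."""
    centre = mean_vector(frames)
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(label_means[label], centre))
    return min(label_means, key=sq_dist)

@dataclass
class ProvidingTerminal:
    label_means: dict              # attribute label -> mean feature vector
    offsets: dict                  # (source label, target label) -> offset
    source: Optional[str] = None
    target: Optional[str] = None

    def receive_input_features(self, frames):
        self.source = nearest(self.label_means, frames)

    def receive_output_features(self, frames):
        self.target = nearest(self.label_means, frames)

    def send_conversion_function(self):
        # Selection happens on the server; only the function is returned.
        return self.offsets[(self.source, self.target)]

@dataclass
class PortableTerminal:
    server: ProvidingTerminal

    def convert(self, input_frames, output_sample_frames):
        self.server.receive_input_features(input_frames)
        self.server.receive_output_features(output_sample_frames)
        offset = self.server.send_conversion_function()
        # Conversion itself runs on the portable terminal.
        return [[x + o for x, o in zip(f, offset)] for f in input_frames]
```

The division of labor follows the text: the models and the function store stay on the providing terminal, while feature extraction, conversion and signal generation stay on the portable terminal, so only feature sequences and conversion functions cross the link.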

Furthermore, the speech signal conversion service providing system according to claim 12 of the present invention is a speech signal conversion service providing system in which a conversion function providing terminal and a portable terminal are communicably connected and which provides a speech conversion service for converting an input speech signal into a target output speech signal, wherein the conversion function providing terminal comprises: acoustic model map storage means that groups speech data acquired from a plurality of speakers based on three attributes, namely a speaker attribute, a style attribute and an environment attribute, generates, from the speech data belonging to each group, high-dimensional acoustic models having a high-dimensional feature parameter sequence of at least a predetermined number of dimensions, stores, together with the high-dimensional acoustic models, an acoustic model map composed of acoustic-model-compatible low-dimensional vectors that have fewer dimensions than the high-dimensional acoustic models and are obtained by converting the high-dimensional acoustic models while preserving the mathematical distance relationships among them, and further stores a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of speech signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of speech signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of speech signals; input feature parameter sequence reception means for receiving the feature parameter sequence of the input speech signal; input speech signal attribute identification means for obtaining, for the speaker attribute, style attribute and environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood with respect to the feature parameter sequence received by the input feature parameter sequence reception means, and identifying the speaker attribute, style attribute and environment attribute corresponding to the models giving the maximum likelihood as the speaker attribute, style attribute and environment attribute of the input speech signal; attribute data reception means for receiving attribute data indicating the speaker attribute, style attribute and environment attribute of the output speech signal; output speech signal attribute identification means for obtaining a Gaussian mixture model having the speaker attribute, style attribute and environment attribute that best match those of the attribute data received by the attribute data reception means, and identifying the speaker attribute, style attribute and environment attribute corresponding to that Gaussian mixture model as the speaker attribute, style attribute and environment attribute of the output speech signal; variation conversion function storage means for storing a plurality of first speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a first reference speaker having the same style attribute and environment attribute as the input speech signal and a different speaker attribute, a plurality of first style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a first reference style having the same environment attribute as the input speech signal and a different style attribute, a plurality of environment variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a second reference style having the same environment attribute as the output speech signal and a different style attribute, a plurality of second style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the speech signal of a second reference speaker having the same style attribute and environment attribute as the output speech signal and a different speaker attribute, and a plurality of second speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the output speech signal; variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function and the second speaker variation conversion function based on the attributes identified by the input speech signal attribute identification means and the output speech signal attribute identification means; and conversion function transmission means for transmitting the conversion functions selected by the variation conversion function selection means to the portable terminal, and wherein the portable terminal comprises: input speech signal input means for inputting the input speech signal; input feature parameter sequence extraction means for extracting, from the input speech signal input by the input speech signal input means, a high-dimensional feature parameter sequence of at least the predetermined number of dimensions; input feature parameter sequence transmission means for transmitting the feature parameter sequence extracted by the input feature parameter sequence extraction means to the conversion function providing terminal; output speech signal attribute input means for inputting the speaker attribute, style attribute and environment attribute of the output speech signal; attribute data transmission means for transmitting attribute data indicating the speaker attribute, style attribute and environment attribute input by the output speech signal attribute input means to the conversion function providing terminal; conversion function reception means for receiving the conversion functions; feature parameter sequence conversion means for converting, based on the conversion functions received by the conversion function reception means, the feature parameter sequence extracted by the input feature parameter sequence extraction means into the feature parameter sequence of the output speech signal; and speech signal generation means for generating the output speech signal from the feature parameter sequence converted by the feature parameter sequence conversion means.

With this configuration, an effect equivalent to that of the speech signal conversion device according to claim 9 is obtained. That is, in the portable terminal, when an input speech signal is input through the input speech signal input means, the input feature parameter sequence extraction means extracts, from that input speech signal, a high-dimensional feature parameter sequence of at least the predetermined number of dimensions, and the input feature parameter sequence transmission means transmits the extracted feature parameter sequence to the conversion function providing terminal.

In the conversion function providing terminal, when the input feature parameter sequence reception means receives the feature parameter sequence, the input speech signal attribute identification means obtains, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models and the environment attribute high-dimensional acoustic models in the storage means, the likelihood with respect to the received feature parameter sequence. The speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the input speech signal.

In the portable terminal, when the speaker attribute, style attribute and environment attribute of the output speech signal are input through the output speech signal attribute input means, the attribute data transmission means transmits attribute data indicating the input speaker attribute, style attribute and environment attribute to the conversion function providing terminal.

In the conversion function providing terminal, when the attribute data reception means receives the attribute data, the output speech signal attribute identification means obtains the Gaussian mixture model whose speaker attribute, style attribute and environment attribute best match those of the received attribute data, and the speaker attribute, style attribute and environment attribute corresponding to that Gaussian mixture model are identified as the attributes of the output speech signal.
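The passage does not define how "best match" between the requested attributes and a stored model is measured. As one possible reading, and only as an assumption, the sketch below scores each stored model by the number of requested attributes it shares and returns the highest-scoring one. The attribute names and model labels are hypothetical.

```python
ATTRIBUTE_KEYS = ("speaker", "style", "environment")

def match_score(requested, model_attrs):
    """Count how many of the three attributes agree (assumed criterion)."""
    return sum(requested[k] == model_attrs[k] for k in ATTRIBUTE_KEYS)

def best_matching_model(requested, models):
    """Return the label of the stored model that best fits the request.

    models maps a model label to its attribute dictionary. Ties go to the
    model listed first, because max() keeps the first maximum it sees.
    """
    return max(models, key=lambda label: match_score(requested, models[label]))

def identify_output_attributes(requested, models):
    """Attributes of the best-matching model become the output attributes."""
    return models[best_matching_model(requested, models)]
```

A request may name a combination for which no model was trained; returning the attributes of the closest stored model, rather than the request itself, guarantees that the later selection step always finds conversion functions for the identified attributes.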

Once the attributes of the input speech signal and the output speech signal have been identified, the variation conversion function selection means selects, based on the identified attributes, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function and the second speaker variation conversion function from the variation conversion function storage means, and the conversion function transmission means transmits the selected conversion functions to the portable terminal.

In the portable terminal, when the conversion function is received by the conversion function receiving means, the feature parameter series conversion means converts the extracted feature parameter series of the input speech signal into the feature parameter series of the output speech signal based on the received conversion function. The speech signal generation means then generates the output speech signal from the converted feature parameter series.
Furthermore, the speech signal conversion service providing system according to claim 13 of the present invention is the speech signal conversion service providing system according to either of claims 11 and 12, wherein the conversion function providing terminal further comprises charging means for performing charging processing for the user of the portable terminal when the conversion function is transmitted by the conversion function transmission means.

With this configuration, in the conversion function providing terminal, when the conversion function is transmitted by the conversion function transmission means, the charging means performs charging processing for the user of the portable terminal.
Meanwhile, in order to achieve the above object, the conversion function providing terminal according to claim 14 of the present invention is a conversion function providing terminal applied to the speech signal conversion service providing system according to claim 11, and comprises the acoustic model map storage means, the input feature parameter series receiving means, the input speech signal attribute identification means, the output feature parameter series receiving means, the output speech signal attribute identification means, the variation conversion function storage means, the variation conversion function selection means, and the conversion function transmission means.

With this configuration, an operation equivalent to that of the conversion function providing terminal in the speech signal conversion service providing system according to claim 11 is obtained.
Furthermore, the conversion function providing terminal according to claim 15 of the present invention is a conversion function providing terminal applied to the speech signal conversion service providing system according to claim 12, and comprises the acoustic model map storage means, the input feature parameter series receiving means, the input speech signal attribute identification means, the attribute data receiving means, the output speech signal attribute identification means, the variation conversion function storage means, the variation conversion function selection means, and the conversion function transmission means.

With this configuration, an operation equivalent to that of the conversion function providing terminal in the speech signal conversion service providing system according to claim 12 is obtained.
Meanwhile, in order to achieve the above object, the portable terminal according to claim 16 of the present invention is a portable terminal applied to the speech signal conversion service providing system according to claim 11, and comprises the input speech signal input means, the input feature parameter series extraction means, the input feature parameter series transmission means, the output speech signal input means, the output feature parameter series extraction means, the output feature parameter series transmission means, the conversion function receiving means, the feature parameter series conversion means, and the speech signal generation means.

With this configuration, an operation equivalent to that of the portable terminal in the speech signal conversion service providing system according to claim 11 is obtained.
Furthermore, the portable terminal according to claim 17 of the present invention is a portable terminal applied to the speech signal conversion service providing system according to claim 12, and comprises the input speech signal input means, the input feature parameter series extraction means, the input feature parameter series transmission means, the output speech signal attribute input means, the attribute data transmission means, the conversion function receiving means, the feature parameter series conversion means, and the speech signal generation means.
With this configuration, an operation equivalent to that of the portable terminal in the speech signal conversion service providing system according to claim 12 is obtained.

As described above, according to the speech signal conversion apparatus of claim 1 of the present invention, the feature parameter series is converted according to a combination of at least two of variation between speaker attributes, variation between style attributes, and variation between environment attributes, so the feature parameter series can be converted more accurately than before.
Furthermore, according to the speech signal conversion apparatus of claims 5 to 7 of the present invention, the feature parameter series is converted taking into account the order of variation between speaker attributes, variation between style attributes, and variation between environment attributes, so the feature parameter series can be converted still more accurately.

Furthermore, according to the speech signal conversion apparatus of claim 8 or 9 of the present invention, the feature parameter series is converted after the speaker attributes, style attributes, and environment attributes of the input speech signal and the output speech signal have been identified, so the feature parameter series can be converted still more accurately.
Meanwhile, according to the speech signal conversion method of claim 10 of the present invention, an effect equivalent to that of the speech signal conversion apparatus of claim 1 is obtained.

Meanwhile, according to the speech signal conversion service providing system of claim 11 of the present invention, an effect equivalent to that of the speech signal conversion apparatus of claim 8 is obtained.
Furthermore, according to the speech signal conversion service providing system of claim 12 of the present invention, an effect equivalent to that of the speech signal conversion apparatus of claim 9 is obtained.
Meanwhile, according to the conversion function providing terminal of claim 14 of the present invention, an effect equivalent to that of the speech signal conversion service providing system of claim 11 is obtained.

Furthermore, according to the conversion function providing terminal of claim 15 of the present invention, an effect equivalent to that of the speech signal conversion service providing system of claim 12 is obtained.
Meanwhile, according to the portable terminal of claim 16 of the present invention, an effect equivalent to that of the speech signal conversion service providing system of claim 11 is obtained.
Furthermore, according to the portable terminal of claim 17 of the present invention, an effect equivalent to that of the speech signal conversion service providing system of claim 12 is obtained.

A first embodiment of the present invention will now be described with reference to the drawings. FIGS. 1 to 10 show the first embodiment of the speech signal conversion apparatus according to the present invention. FIGS. 14, 15, and 16 are examples of acoustic model maps created from speech data acquired from multiple speakers.
First, the basic principle of this embodiment will be described.

The speech data used to create the acoustic model maps illustrated in FIGS. 14, 15, and 16 was obtained by having a large number of female speakers utter five types of tasks (commands, city names, four-digit numbers, kana, and surnames) in a quiet room. Exhibition-hall noise was superimposed on this clean speech at a signal-to-noise ratio (SNR) of 20 dB. A 21-dimensional feature parameter series was obtained from the clean speech and the noise-superimposed speech, and a speaker-dependent acoustic model was created for each task and each female speaker. FIG. 14 shows the distribution of the acoustic models when these speaker-dependent acoustic models are mapped onto a two-dimensional plane by the COSMOS method described below. The symbols in FIG. 14 denote the tasks as follows: ■ command, ◆ city name, ▲ four-digit number, □ kana, △ surname. FIG. 14 also shows a curve separating clean speech from noise-superimposed speech. As FIG. 14 shows, the clusters of clean speech and noise-superimposed speech are separated, which indicates that the environment attribute (clean or noise-superimposed) has a greater influence than the style attribute (task type) or the speaker attribute.
Next, an acoustic model map restricted to the noise-superimposed speech at an SNR of 20 dB was created, as shown in FIG. 15. The symbols in FIG. 15 denote the tasks in the same way: ■ command, ◆ city name, ▲ four-digit number, □ kana, △ surname. As FIG. 15 shows, although there is some overlap, the clusters for the individual tasks (command, city name, four-digit number, kana, and surname) are separated. The same holds for clean speech.
Finally, an acoustic model map restricted to the noise-superimposed speech at an SNR of 20 dB and to the command task was created, as shown in FIG. 16. The symbol ■ in FIG. 16 denotes the command task. As FIG. 16 shows, the speaker-dependent acoustic models of the command task form a circular cluster. The same holds for the other task types and for clean speech.
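Superimposing noise at a given SNR amounts to scaling the noise so that the ratio of clean-speech power to noise power matches the target. The Python sketch below illustrates this under stated assumptions: power is averaged over the whole signal, and the function name `mix_at_snr` is hypothetical. The description does not say how the 20 dB mixture was actually produced.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average power of a sampled signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add noise to clean speech so that the mixture has the requested SNR.

    SNR (dB) = 10 * log10(P_clean / P_scaled_noise), so the noise gain is
    sqrt(P_clean / (P_noise * 10 ** (snr_db / 10))).
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal has zero power")
    gain = math.sqrt(mean_power(clean) / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]
```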

These results show that the amount of variation is largest for variation between environment attributes, followed by variation between style attributes and then variation between speaker attributes; variation between style attributes encompasses variation between speaker attributes, and variation between environment attributes encompasses variation between style attributes.
FIG. 1 is a conceptual diagram showing the distribution of acoustic models when acoustic models having speaker, style, and environment attributes are mapped onto a two-dimensional plane by the COSMOS method described below. The distribution region of the acoustic-model-corresponding low-dimensional vectors that share the same environment attribute contains the distribution regions of the vectors for the different style attributes, and each of those regions in turn contains the distribution regions of the vectors for the different speaker attributes. In addition, when the environment attribute and the style attribute are fixed, the distribution region of the low-dimensional vectors for the different speaker attributes spreads in a circle, with the vector of the most average speaker attribute located at the center of the circle, as shown in M. Shozakai and G. Nagino, "Acoustic Space Analysis Method Utilizing Statistical Multidimensional Scaling Technique," Proc. NSIP (International Workshop on Nonlinear Signal and Image Processing), pp. 430-435, Sapporo, Japan, May 2005.

In this embodiment, the feature parameter series of the input speech signal is converted into the feature parameter series of the target output speech signal according to speaker variation, style variation, and environmental variation. The conversion refers to an acoustic model map, such as that shown in FIG. 1, in which acoustic models having speaker, style, and environment attributes are mapped onto a two-dimensional plane by the COSMOS method.
The method for creating the acoustic model map of FIG. 1 is briefly described below.

In the COSMOS method, a large, unspecified collection of speech corpora acquired from multiple speakers is first divided into groups based on three specific conditions: a speaker attribute relating to the speaker's vocal tract, a style attribute relating to the speaking style or utterance content, and an environment attribute relating to noise or reverberation in the speaking environment. Attribute data indicating the speaker attribute, style attribute, and environment attribute is attached to each speech corpus.
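The grouping step can be illustrated with a small Python sketch that buckets utterances by their attached attribute data. The `Utterance` record and its field names are hypothetical; the description only states that each corpus carries speaker, style, and environment attribute data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Utterance(NamedTuple):
    """One corpus entry with its attached attribute data (hypothetical layout)."""
    utterance_id: str
    speaker: str
    style: str
    environment: str


GroupKey = Tuple[str, str, str]  # (environment, style, speaker)


def group_corpus(utterances: Iterable[Utterance]) -> Dict[GroupKey, List[str]]:
    """Collect utterance IDs into one group per attribute combination."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for utt in utterances:
        groups[(utt.environment, utt.style, utt.speaker)].append(utt.utterance_id)
    return dict(groups)
```

One acoustic model would then be trained per group, so the number of groups determines the number of points on the acoustic model map.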

Next, for each group, a high-dimensional acoustic model having a feature parameter series composed of high-dimensional feature parameters (hereinafter, a high-dimensional acoustic model) is generated from the speech corpus of that group. Each high-dimensional acoustic model consists of a set of HMMs for multiple speech units.
Next, the mathematical distances between the generated high-dimensional acoustic models are calculated, and each high-dimensional acoustic model is converted into an acoustic-model-corresponding low-dimensional vector based on the generated models and the calculated distances. All high-dimensional acoustic models are converted while preserving their mutual distance relationships, so that two models separated by a small mathematical distance lie close together and two models separated by a large mathematical distance lie far apart.
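A distance-preserving projection of this kind can be sketched as follows. This passage does not name the projection algorithm, so the Python code below uses plain gradient descent on the raw stress (a generic multidimensional scaling formulation) purely as a stand-in. The function names, step size, and iteration count are assumptions.

```python
import math
import random
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def stress(distances: Matrix, points: Sequence[Sequence[float]]) -> float:
    """Sum of squared differences between target and embedded distances."""
    n = len(distances)
    return sum((distances[i][j] - math.dist(points[i], points[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def embed_2d(distances: Matrix, iterations: int = 300,
             step: float = 0.05, seed: int = 0) -> List[List[float]]:
    """Place one 2-D point per model so that pairwise distances are kept."""
    n = len(distances)
    rng = random.Random(seed)
    points = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    for _ in range(iterations):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                current = math.dist(points[i], points[j]) or 1e-12
                coef = 2.0 * (current - distances[i][j]) / current
                grads[i][0] += coef * (points[i][0] - points[j][0])
                grads[i][1] += coef * (points[i][1] - points[j][1])
        # Update all points together after the gradients are computed.
        for i in range(n):
            points[i][0] -= step / n * grads[i][0]
            points[i][1] -= step / n * grads[i][1]
    return points
```

Models that are close in the distance matrix are pulled together and distant models are pushed apart, which is the property the map relies on.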

FIG. 1 is the acoustic model map obtained by mapping each acoustic-model-corresponding low-dimensional vector onto a two-dimensional plane.
The COSMOS method is described in detail in M. Shozakai and G. Nagino, "Analysis of Speaking Styles by Two-Dimensional Visualization of Aggregate of Acoustic Models," Proc. ICSLP, vol. 1, pp. 717-720, 2004.

The acoustic model map of FIG. 1 shows the case where the speech corpora are grouped by two environment attributes 10a and 10b, four style attributes 12a to 12d, and four speaker attributes 14a to 14d. As FIG. 1 shows, a distribution region is formed on the two-dimensional plane for each of the environment attributes 10a and 10b. This indicates that environmental variation has the greatest influence on the similarity of acoustic models. In other words, if the environment attribute differs, the acoustic models differ greatly even for the same speaker and the same style.

Further, within the distribution region of the environment attribute 10a, a distribution region is formed for each of the style attributes 12a to 12d. Likewise, within the distribution region of the environment attribute 10b, a distribution region is formed for each of the style attributes 12a to 12d. This indicates that style variation has the next greatest influence on the similarity of acoustic models. In other words, if the style attribute differs, the acoustic models differ greatly even for the same speaker.

Further, within each distribution region of the style attribute 12a, a distribution region is formed for each of the speaker attributes 14a to 14d. Likewise, within each distribution region of the style attributes 12b, 12c, and 12d, a distribution region is formed for each of the speaker attributes 14a to 14d. This indicates that speaker variation has the smallest influence on the similarity of acoustic models.

Next, the procedure for converting the feature parameter series will be described.
FIG. 2 illustrates the case where the feature parameter series of the input speech signal is converted into the feature parameter series of the output speech signal according to speaker variation, style variation, and environmental variation.
When speaker variation, style variation, and environmental variation act in combination, the feature parameter series of the input speech signal is not converted directly into that of the output speech signal. Because the influence is greatest for environmental variation, followed by style variation and then speaker variation, the feature parameter series is converted in stages that take this order into account.

Each high-dimensional acoustic model is generated from the speech corpus of the group to which it belongs, among the speech corpora grouped by speaker attribute, style attribute, and environment attribute, and therefore has a speaker attribute, a style attribute, and an environment attribute. The attributes are assigned by associating each high-dimensional acoustic model with the attribute data of the speech corpus from which it was generated.

Prior to conversion, the input speech signal is input, a feature parameter series is extracted from it, and the speaker attribute, style attribute, and environment attribute of the input speech signal are identified based on the extracted feature parameter series. This specifies the position of the feature parameter series of the input speech signal on the acoustic model map.
Similarly, a sample of the output speech signal is input, a feature parameter series is extracted from it, and the speaker attribute, style attribute, and environment attribute of the output speech signal are identified based on the extracted feature parameter series. This specifies the position of the feature parameter series of the output speech signal on the acoustic model map.

Once attribute identification is complete, the feature parameter series of the input speech signal is converted in ascending order of influence: speaker variation first, then style variation, then environmental variation.
First, because speaker variation has the smallest influence, the feature parameter series of the input speech signal is converted into the feature parameter series of the speech signal of the first reference speaker 16a, as shown in FIG. 2. The feature parameter series of the speech signal of the first reference speaker 16a is that of a speaker having the average speaker attribute within the same environment attribute and style attribute as the input speech signal. This conversion corresponds to the speaker variation between the speaker of the input speech signal and the first reference speaker 16a.

Next, because style variation has the next smallest influence, the feature parameter series of the speech signal of the first reference speaker 16a is converted into the feature parameter series of the speech signal of the first reference style 16b. The feature parameter series of the speech signal of the first reference style 16b is that of a style having the average style attribute within the same environment attribute as the input speech signal. This conversion corresponds to the style variation between the style of the input speech signal and the first reference style 16b.

Then, because environmental variation has the greatest influence, the feature parameter series of the speech signal of the first reference style 16b is converted into the feature parameter series of the speech signal of the second reference style 16c. The feature parameter series of the speech signal of the second reference style 16c is that of a style having the average style attribute within the same environment attribute as the output speech signal. This conversion corresponds to the environmental variation between the environment of the input speech signal and the environment of the output speech signal.

From this point on, the conversion toward the feature parameter series of the output speech signal proceeds in the reverse of the order of speaker variation, style variation, and environmental variation.
First, the feature parameter series of the speech signal of the second reference style 16c is converted into the feature parameter series of the speech signal of the second reference speaker 16d. The feature parameter series of the speech signal of the second reference speaker 16d is that of a speaker having the average speaker attribute within the same environment attribute and style attribute as the output speech signal. This conversion corresponds to the style variation between the second reference style 16c and the style of the output speech signal.

Then, the feature parameter series of the speech signal of the second reference speaker 16d is converted into the feature parameter series of the output speech signal. This conversion corresponds to the speaker variation between the second reference speaker 16d and the speaker of the output speech signal.
These conversions may use, as appropriate, a GMM-based speech signal conversion method or the like. In that case, each conversion means can be implemented as a speech signal conversion function. With this arrangement, even if any one of the feature parameter series of the input speech signal, of the first reference speaker 16a, of the first reference style 16b, of the second reference style 16c, of the second reference speaker 16d, or of the output speech signal changes, it suffices to change only the minimum necessary speech signal conversion functions rather than all of them. For example, to change only the speaker of the output speech signal across an environmental variation, only the conversion function that converts the feature parameter series of the second reference speaker 16d into that of the output speech signal needs to be changed. To change both the style and the speaker of the output speech signal, the conversion function from the second reference style 16c to the second reference speaker 16d and the conversion function from the second reference speaker 16d to the output speech signal need to be changed.
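The benefit of building the conversion from separate functions can be illustrated with a short Python sketch. The toy `offset` stages below merely stand in for real GMM-based conversion functions, and all names are hypothetical. The point is only that the five stages are applied in sequence and that a single stage can be swapped without touching the others.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Transform = Callable[[Vector], Vector]


def convert_series(series: Sequence[Vector],
                   stages: Sequence[Transform]) -> List[Vector]:
    """Apply the conversion stages, in order, to every feature vector."""
    converted = []
    for frame in series:
        for stage in stages:
            frame = stage(frame)
        converted.append(frame)
    return converted


def offset(delta: float) -> Transform:
    """Toy conversion function: shifts every feature by a constant."""
    def apply(frame: Vector) -> Vector:
        return [value + delta for value in frame]
    return apply


# Input -> 16a -> 16b -> 16c -> 16d -> output, one function per step.
stages = [offset(1.0), offset(2.0), offset(3.0), offset(4.0), offset(5.0)]

# Changing only the output speaker replaces only the last function.
stages_new_speaker = stages[:4] + [offset(-5.0)]
```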

Next, the configuration of the network system to which the present invention is applied will be described.
FIG. 3 is a block diagram showing the configuration of the network system to which the present invention is applied.
As shown in FIG. 3, a speech signal conversion server 100 and a relay station 220, which relays communication between a portable terminal 200 and a network 199, are connected to the network 199.
A plurality of base stations 210 that communicate wirelessly with the portable terminal 200 are connected to the relay station 220. When the portable terminal 200 connects to the network 199, the relay station 220 acts as a terminal on the network 199 on behalf of the portable terminal 200: it forwards data received from the portable terminal 200 via a base station 210 to the destination terminal over the network 199, and forwards data from the destination terminal on the network 199 to the portable terminal 200 via the base station 210.

Next, the configuration of the speech signal conversion server 100 will be described.
FIG. 4 shows the hardware configuration of the speech signal conversion server 100.
As shown in FIG. 4, the speech signal conversion server 100 comprises a CPU 30 that performs computation and controls the entire system based on a control program, a ROM 32 that stores the control program of the CPU 30 and the like in a predetermined area in advance, a RAM 34 for storing data read from the ROM 32 and the like and computation results needed during processing by the CPU 30, and an I/F 38 that mediates data input and output with external devices. These are interconnected by a bus 39, a signal line for transferring data, so that they can exchange data with one another.

Connected to the I/F 38 as external devices are an input device 40, consisting of a keyboard, a mouse, and the like, that allows data entry as a human interface; a storage device 42 that stores data, tables, and the like as files; a display device 44 that displays a screen based on an image signal; and a signal line for connecting to the network 199.
Next, the data structure of the storage device 42 will be described.

FIG. 5 is a diagram schematically showing the data structure of the attribute models.
As shown in FIG. 5, the storage device 42 stores, as data, a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of audio signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of audio signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of audio signals.

The speaker attribute high-dimensional acoustic models can be modeled by GMMs that correspond to differences in speaker attributes relating to the speaker's vocal tract, the style attribute high-dimensional acoustic models by GMMs that correspond to differences in style attributes relating to speaking style and utterance content, and the environment attribute high-dimensional acoustic models by GMMs that correspond to differences in environment attributes relating to noise and reverberation in the speaking environment. These attribute models are used to identify the attributes of the input audio signal and the output audio signal.

FIG. 6 is a diagram schematically showing the data structure of the audio signal conversion functions.
As shown in FIG. 6, the storage device 42 stores, as data, a plurality of first speaker variation conversion functions that convert a given feature parameter sequence into the feature parameter sequence of the audio signal of the first reference speaker 16a, and a plurality of first style variation conversion functions that convert a given feature parameter sequence into the feature parameter sequence of the audio signal of the first reference style 16b. It also stores a plurality of environment variation conversion functions that convert a given feature parameter sequence into the feature parameter sequence of the audio signal of the second reference style 16c, a plurality of second style variation conversion functions that convert a given feature parameter sequence into the feature parameter sequence of the audio signal of the second reference speaker 16d, and a plurality of second speaker variation conversion functions that convert a given feature parameter sequence into the feature parameter sequence of the output audio signal. These conversion functions can be created, for example, by the methods of Non-Patent Documents 2 and 3.

Next, the processing executed by the CPU 30 will be described.
The CPU 30 consists of a microprocessing unit and the like. It starts a predetermined program stored in a predetermined area of the ROM 32 and, in accordance with that program, executes the conversion function providing process shown in the flowchart of FIG. 7.
FIG. 7 is a flowchart showing the conversion function providing process.

When the conversion function providing process is executed by the CPU 30, the process first proceeds to step S100, as shown in FIG. 7.
In step S100, it is determined whether an acquisition request for audio signal conversion functions has been received. If a request has been received (Yes), the process proceeds to step S102; otherwise (No), the process waits in step S100 until an acquisition request is received.

In step S102, the feature parameter sequence of the input audio signal is received. The process proceeds to step S104, where the feature parameter sequence of the output audio signal is received, and then to step S106.
In step S106, the likelihood of the feature parameter sequence received in step S102 is computed for each of the speaker attribute, style attribute, and environment attribute high-dimensional acoustic models in the storage device 42. The speaker attribute, style attribute, and environment attribute corresponding to the models that give the maximum likelihood are identified as the speaker attribute, style attribute, and environment attribute of the input audio signal.

Next, the process proceeds to step S108, where the likelihood of the feature parameter sequence received in step S104 is computed for each of the speaker attribute, style attribute, and environment attribute high-dimensional acoustic models in the storage device 42. The speaker attribute, style attribute, and environment attribute corresponding to the models that give the maximum likelihood are identified as the speaker attribute, style attribute, and environment attribute of the output audio signal.
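The maximum-likelihood identification in steps S106 and S108 can be illustrated with a minimal Python sketch. It assumes diagonal-covariance GMMs stored as lists of `(weight, means, variances)` tuples and one-dimensional toy features; the function names, the model layout, and the attribute labels are hypothetical and do not come from the specification.

```python
import math

def diag_gmm_loglik(frames, gmm):
    """Total log-likelihood of a feature parameter sequence under a diagonal GMM.

    gmm is a list of (weight, means, variances) tuples (an assumed layout).
    """
    total = 0.0
    for x in frames:
        comp_logs = []
        for weight, means, variances in gmm:
            log_p = math.log(weight)
            for xi, mu, var in zip(x, means, variances):
                log_p += -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
            comp_logs.append(log_p)
        # Log-sum-exp over mixture components for numerical stability.
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total

def identify_attribute(frames, models):
    """Return the label of the attribute model that gives the maximum likelihood."""
    return max(models, key=lambda label: diag_gmm_loglik(frames, models[label]))

# Toy speaker attribute models; the labels and values are illustrative only.
speaker_models = {
    "low_pitch": [(1.0, [0.0], [1.0])],
    "high_pitch": [(1.0, [5.0], [1.0])],
}
frames = [[4.6], [5.2], [4.9]]
print(identify_attribute(frames, speaker_models))  # prints "high_pitch"
```

Under this sketch the same routine would be run three times per signal, once each over the speaker, style, and environment model sets.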

Next, the process proceeds to step S110, where, based on the attributes identified in steps S106 and S108, the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function, and second speaker variation conversion function corresponding to the speaker attributes, style attributes, and environment attributes of the input and output audio signals are selected from the storage device 42. The process then proceeds to step S112, where the selected conversion functions are transmitted to the requesting mobile terminal 200.
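As a rough illustration of the selection in step S110, the sketch below looks up one function per stage from a table keyed by the identified attributes. How the stored functions are actually indexed is not stated in the text, so the keying scheme here (input attributes for the first two stages, an input/output environment pair for the environment stage, output attributes for the last two stages) is an assumption, as are all names and labels.

```python
def select_conversion_functions(store, input_attrs, output_attrs):
    """Pick one conversion function per stage, in application order (step S110).

    The keying scheme is an assumption made for illustration only.
    """
    return [
        store["first_speaker"][input_attrs["speaker"]],
        store["first_style"][input_attrs["style"]],
        store["environment"][(input_attrs["environment"], output_attrs["environment"])],
        store["second_style"][output_attrs["style"]],
        store["second_speaker"][output_attrs["speaker"]],
    ]

# Hypothetical contents of the storage device; strings stand in for real functions.
store = {
    "first_speaker": {"adult_male": "F_spk1[adult_male]"},
    "first_style": {"read": "F_sty1[read]"},
    "environment": {("quiet_room", "car"): "F_env[quiet_room->car]"},
    "second_style": {"spontaneous": "F_sty2[spontaneous]"},
    "second_speaker": {"adult_female": "F_spk2[adult_female]"},
}
input_attrs = {"speaker": "adult_male", "style": "read", "environment": "quiet_room"}
output_attrs = {"speaker": "adult_female", "style": "spontaneous", "environment": "car"}
selected = select_conversion_functions(store, input_attrs, output_attrs)
print(selected)
```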

Next, the process proceeds to step S114, where billing processing is executed for the user of the requesting mobile terminal 200. The user of the mobile terminal 200 can be identified, for example, by requesting user information from the mobile terminal 200 when the conversion functions are provided and using the user information received in response.
When the processing of step S114 is complete, the series of processes ends and control returns to the original process.
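The overall flow of steps S102 to S114 can be summarized in a short Python sketch. The request fields (`terminal_id`, `user_id`, and so on) and the injected helper callables are hypothetical stand-ins; the text does not specify a message format or a billing interface.

```python
def provide_conversion_functions(request, identify, select, send, charge):
    """One pass of the conversion function providing process after a request arrives."""
    input_attrs = identify(request["input_features"])     # S102, S106
    output_attrs = identify(request["output_features"])   # S104, S108
    functions = select(input_attrs, output_attrs)          # S110
    send(request["terminal_id"], functions)                # S112
    charge(request["user_id"])                             # S114
    return functions

# Demo with stand-in helpers (all hypothetical).
sent, billed = [], []
result = provide_conversion_functions(
    request={"terminal_id": "t-1", "user_id": "u-1",
             "input_features": [[0.1]], "output_features": [[5.0]]},
    identify=lambda frames: "near_zero" if frames[0][0] < 1.0 else "near_five",
    select=lambda src, dst: [f"{src}->reference", f"reference->{dst}"],
    send=lambda terminal, functions: sent.append((terminal, functions)),
    charge=billed.append,
)
print(result)  # prints ['near_zero->reference', 'reference->near_five']
```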

Next, the configuration of the mobile terminal 200 will be described.
FIG. 8 is a diagram showing the hardware configuration of the mobile terminal 200.
As shown in FIG. 8, the mobile terminal 200 comprises a CPU 50 that performs calculations and controls the entire system based on a control program, a ROM 52 that stores the control program of the CPU 50 and the like in a predetermined area in advance, a RAM 54 that stores data read from the ROM 52 and the like as well as calculation results needed during processing by the CPU 50, and an I/F 58 that mediates data input and output with external devices. These components are connected to one another by a bus 59, a signal line for transferring data, so that they can exchange data.

Connected to the I/F 58 are a wireless communication device 60 that communicates wirelessly with the base stations 210, a key panel 62 that allows data entry with a plurality of keys as a human interface, an LCD (Liquid Crystal Display) 64 that displays a screen based on an image signal, a microphone 66 that receives sound and converts it into an audio signal, and a speaker 68 that receives an audio signal and converts it into sound.

The CPU 50 consists of a microprocessing unit and the like. It starts a predetermined program stored in a predetermined area of the ROM 52 and, in accordance with that program, executes the conversion function acquisition process and the audio signal conversion process shown in the flowcharts of FIGS. 9 and 10, respectively, in a time-division manner.
First, the conversion function acquisition process will be described.

FIG. 9 is a flowchart showing the conversion function acquisition process.
When the conversion function acquisition process is executed by the CPU 50, the process first proceeds to step S200, as shown in FIG. 9.
In step S200, it is determined whether an acquisition request for audio signal conversion functions has been entered from the key panel 62. If a request has been entered (Yes), the process proceeds to step S202; otherwise (No), the process waits in step S200 until an acquisition request is entered.

In step S202, the input audio signal is input from the microphone 66. The process proceeds to step S204, where it is determined whether a predetermined time has elapsed since input of the input audio signal began. If the predetermined time has elapsed (Yes), the process proceeds to step S206, where a feature parameter sequence is extracted from the input audio signal, and then to step S208.
In step S208, a sample of the output audio signal is input from the microphone 66. The process proceeds to step S210, where it is determined whether a predetermined time has elapsed since input of the output audio signal began. If the predetermined time has elapsed (Yes), the process proceeds to step S212, where a feature parameter sequence is extracted from the output audio signal, and then to step S214.

In step S214, an acquisition request for audio signal conversion functions is transmitted to the audio signal conversion server 100. The process proceeds to step S216, where the extracted feature parameter sequence of the input audio signal is transmitted to the audio signal conversion server 100, then to step S218, where the extracted feature parameter sequence of the output audio signal is transmitted to the audio signal conversion server 100, and then to step S220.

In step S220, it is determined whether the conversion functions have been received. If they have been received (Yes), the process proceeds to step S222, where the received conversion functions are stored in the RAM 54; the series of processes then ends and control returns to the original process.
If it is determined in step S220 that the conversion functions have not been received (No), the process waits in step S220 until they are received.

If it is determined in step S210 that the predetermined time has not elapsed since input of the output audio signal began (No), the process returns to step S208.
If it is determined in step S204 that the predetermined time has not elapsed since input of the input audio signal began (No), the process returns to step S202.
Next, the audio signal conversion process will be described.

FIG. 10 is a flowchart showing the audio signal conversion process.
When the audio signal conversion process is executed by the CPU 50, the process first proceeds to step S300, as shown in FIG. 10.
In step S300, it is determined whether an audio signal conversion request has been entered from the key panel 62. If a request has been entered (Yes), the process proceeds to step S302; otherwise (No), the process waits in step S300 until a conversion request is entered.

In step S302, the conversion functions are read from the RAM 54. The process proceeds to step S304, where the input audio signal is input from the microphone 66, then to step S306, where a feature parameter sequence is extracted from the input audio signal, and then to step S308.
In step S308, the extracted feature parameter sequence of the input audio signal is converted into the feature parameter sequence of the output audio signal based on the conversion functions that were read. The process proceeds to step S310, where an output audio signal is generated from the converted feature parameter sequence and output from the speaker 68, and then to step S312.

In step S312, it is determined whether a request to end audio signal conversion has been entered from the key panel 62. If an end request has been entered (Yes), the series of processes ends and control returns to the original process.
If it is determined in step S312 that no end request has been entered (No), the process returns to step S304.
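The loop formed by steps S304 to S312 can be sketched in Python as follows. The callables passed in (`read_input`, `extract`, `convert`, `synthesize`, `play`, `end_requested`) are hypothetical stand-ins for the microphone, feature extraction, conversion functions, signal generation, speaker, and key panel; none of these names appear in the specification.

```python
def run_conversion_loop(read_input, extract, convert, synthesize, play, end_requested):
    """Repeat steps S304 to S312 until an end request is detected; return the pass count."""
    passes = 0
    while True:
        signal = read_input()              # S304: input from the microphone
        features = extract(signal)         # S306: extract the feature parameter sequence
        converted = convert(features)      # S308: apply the conversion functions
        play(synthesize(converted))        # S310: generate and output the audio signal
        passes += 1
        if end_requested():                # S312: end request entered?
            return passes

# Demo with stand-in callables (all hypothetical).
blocks = iter([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
played = []
passes = run_conversion_loop(
    read_input=lambda: next(blocks),
    extract=lambda signal: signal,
    convert=lambda feats: [2.0 * x for x in feats],
    synthesize=lambda feats: feats,
    play=played.append,
    end_requested=lambda: len(played) >= 3,
)
print(passes, played)  # prints 3 [[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]]
```

The end check sits at the bottom of the loop, matching the flowchart order in which at least one block is converted before step S312 is evaluated.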

Next, the operation of the present embodiment will be described.
First, the case where conversion functions are acquired by the mobile terminal 200 will be described.
To acquire conversion functions, the user enters an acquisition request on the mobile terminal 200 and inputs the input audio signal and a sample of the output audio signal, each from the microphone 66.
When the input audio signal is input to the mobile terminal 200, a feature parameter sequence is extracted from it in step S206. When the sample of the output audio signal is input, a feature parameter sequence is extracted from it in step S212. Then, in steps S214 to S218, the extracted feature parameter sequences of the input audio signal and the output audio signal are transmitted to the audio signal conversion server 100 together with the acquisition request.
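The terminal-side acquisition described above can be sketched in Python under a few assumptions: audio arrives in blocks from a `read_block` callable, the "predetermined time" is a plain duration in seconds, and the three transmissions of steps S214 to S218 are represented as dictionaries. The function names and message fields are hypothetical.

```python
import time

def capture_for(read_block, duration_s, clock=time.monotonic):
    """Steps S202/S204 (or S208/S210): read blocks until the predetermined time has elapsed."""
    start = clock()
    blocks = []
    while True:
        blocks.append(read_block())
        if clock() - start >= duration_s:
            return blocks

def build_acquisition_messages(input_features, output_features):
    """Steps S214 to S218: the request first, then the two feature parameter sequences."""
    return [
        {"type": "acquisition_request"},
        {"type": "input_features", "frames": input_features},
        {"type": "output_features", "frames": output_features},
    ]

# Demo with a fake clock so that the example is deterministic.
ticks = iter([0.0, 0.4, 0.8, 1.2])
blocks = capture_for(lambda: [0.0], duration_s=1.0, clock=lambda: next(ticks))
print(len(blocks))  # prints 3
```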

When the audio signal conversion server 100 receives the feature parameter sequences together with the acquisition request, in step S106 the likelihood of the feature parameter sequence of the input audio signal is computed for each of the speaker attribute, style attribute, and environment attribute high-dimensional acoustic models in the storage device 42, and the speaker attribute, style attribute, and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the input audio signal. Likewise, in step S108 the likelihood of the received feature parameter sequence of the output audio signal is computed for each of the speaker attribute, style attribute, and environment attribute high-dimensional acoustic models in the storage device 42, and the speaker attribute, style attribute, and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the output audio signal.

Then, in steps S110 and S112, the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function, and second speaker variation conversion function are selected based on the identified attributes, and the selected conversion functions are transmitted to the mobile terminal 200. In step S114, billing processing is performed for the user of the mobile terminal 200.
When the mobile terminal 200 receives the conversion functions, they are stored in the RAM 54 in step S222.

Next, the case where an audio signal is converted will be described.
To convert an audio signal, the user enters a conversion request on the mobile terminal 200 and inputs the input audio signal from the microphone 66.
When the conversion request is entered on the mobile terminal 200, the conversion functions are read from the RAM 54 in step S302. When the input audio signal is input, a feature parameter sequence is extracted from it in step S306, and in step S308 the extracted feature parameter sequence of the input audio signal is converted into the feature parameter sequence of the output audio signal based on the conversion functions that were read. Then, in step S310, an output audio signal is generated from the converted feature parameter sequence and output from the speaker 68. Conversion of the audio signal continues until an end request is entered.

The above description covers the case where the sample of the output audio signal and the input audio signal are input on the same mobile terminal 200. The invention is not limited to this, however; the sample of the output audio signal and the input audio signal may each be input on different mobile terminals 200.
Likewise, the above description covers the case where input of the input audio signal, conversion of the feature parameter sequence, and output of the output audio signal are all performed on the same mobile terminal 200. The invention is not limited to this; the input audio signal may be input and the output audio signal may be output on different mobile terminals 200. This can be applied, for example, to a call between two parties in which one speaker's voice is converted into another voice.

In this way, in the present embodiment, a feature parameter sequence is extracted from the input audio signal and converted in five stages: the feature parameter sequence of the input audio signal is converted into that of the audio signal of the first reference speaker 16a; the result is converted into that of the audio signal of the first reference style 16b; the result is converted into that of the audio signal of the second reference style 16c; the result is converted into that of the audio signal of the second reference speaker 16d; and the result is converted into the feature parameter sequence of the output audio signal. An output audio signal is then generated from the converted feature parameter sequence.
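The five-stage conversion can be pictured as a simple function composition. The Python sketch below uses per-element affine maps purely as stand-ins for the conversion functions; the actual functions are created by the methods of the cited non-patent documents and are not reproduced here, so `make_affine`, `convert_cascade`, and all numeric values are assumptions for illustration.

```python
from functools import reduce

def make_affine(scale, bias):
    """Return a toy conversion function that maps every element x to scale * x + bias."""
    def apply(frames):
        return [[scale * x + bias for x in frame] for frame in frames]
    return apply

def convert_cascade(frames, stages):
    """Apply the conversion stages in order, feeding each output into the next stage."""
    return reduce(lambda acc, stage: stage(acc), stages, frames)

stages = [
    make_affine(1.0, -1.0),  # first speaker variation: input -> first reference speaker 16a
    make_affine(2.0, 0.0),   # first style variation: -> first reference style 16b
    make_affine(1.0, 0.5),   # environment variation: -> second reference style 16c
    make_affine(0.5, 0.0),   # second style variation: -> second reference speaker 16d
    make_affine(1.0, 1.0),   # second speaker variation: -> output audio signal
]
print(convert_cascade([[1.0, 2.0]], stages))  # prints [[1.25, 2.25]]
```

Reversing the list of stages gives a different result for the same input, which is one way to see why the order of the speaker, style, and environment stages matters in a cascade of this kind.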

Because the feature parameter sequence is thus converted with the order of speaker variation, style variation, and environment variation taken into account, it can be converted more accurately than with conventional techniques.
The audio signal conversion process of the present embodiment can be applied, for example, in speech recognition of digits, words, and sentences, to learning the feature parameter sequence conversion of frequency spectra between digits, words, and sentences from a small speech corpus that has already been collected. Specifically, for example, a speech corpus of digits is collected in advance, a feature parameter sequence is extracted from the digit speech corpus, and that sequence is converted into the feature parameter sequence of a speech corpus of words or sentences. Frequency spectra can thereby be converted among digits, words, and sentences without collecting a new large-scale speech corpus, which substantially reduces the cost of speech corpus collection.

As another example, in speech recognition for a car navigation system or the like, the process can be applied to learning a feature parameter sequence conversion that converts a speech corpus uttered while seated in a quiet room into one matching a small speech corpus recorded in a car while driving. Specifically, a speech corpus of unspecified speakers is collected in a quiet room, a feature parameter sequence is extracted from the quiet-room speech corpus, and that sequence is converted into the feature parameter sequence of a speech corpus recorded in a car while driving. Frequency spectra can thereby be converted without collecting a new large-scale speech corpus in a car while driving, which substantially reduces the cost of speech corpus collection.

Furthermore, in the present embodiment, the attributes of the input audio signal and the output audio signal are identified based on the feature parameter sequences of the input and output audio signals and on the attribute models. Based on the identified attributes, the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function, and second speaker variation conversion function are selected, and the feature parameter sequence is converted based on the selected variation conversion functions.

Because the feature parameter sequence is thus converted after the attributes of the input and output audio signals have been identified, it can be converted even more accurately.
In the first embodiment described above, the audio signal conversion server 100 corresponds to the conversion function providing terminal of claim 11, 13, or 14; the storage device 42 corresponds to the attribute model storage means of claim 8 or to the variation conversion function storage means of claim 8, 11, or 14; and step S102 corresponds to the input feature parameter sequence receiving means of claim 11 or 14. Step S104 corresponds to the output feature parameter sequence receiving means of claim 11 or 14, step S106 corresponds to the input audio signal attribute identification means of claim 8, 11, or 14, and step S108 corresponds to the output audio signal attribute identification means of claim 8, 11, or 14.

Also in the first embodiment, step S110 corresponds to the variation conversion function selection means of claim 8, 11, or 14; step S112 corresponds to the conversion function transmission means of claim 11, 13, or 14; step S114 corresponds to the billing means of claim 13; and step S202 corresponds to the input audio signal input means of claim 11 or 16. Steps S206 and S306 correspond to the feature parameter sequence extraction means of claim 1, 5, 6, or 8, to the input feature parameter sequence extraction means of claim 11 or 16, or to the feature parameter sequence extraction step of claim 10, and step S208 corresponds to the output audio signal input means of claim 11 or 16.

Also in the first embodiment, step S212 corresponds to the output feature parameter sequence extraction means of claim 8, 11, or 16; step S216 corresponds to the input feature parameter sequence transmission means of claim 11 or 16; and step S218 corresponds to the output feature parameter sequence transmission means of claim 11 or 16. Step S222 corresponds to the conversion function receiving means of claim 11 or 16, and step S308 corresponds to the feature parameter sequence conversion means of claim 1, 5, 6, 8, 11, or 16 or to the feature parameter sequence conversion step of claim 10.

Also in the first embodiment, the first speaker variation conversion function and the first style variation conversion function correspond to the first variation conversion means of claim 5; the environment variation conversion function and the second style variation conversion function correspond to the second variation conversion means of claim 5; and the first speaker variation conversion function corresponds to the first speaker variation conversion means of claim 6. The first style variation conversion function corresponds to the first style variation conversion means of claim 6, the environment variation conversion function corresponds to the environment variation conversion means of claim 6, the second style variation conversion function corresponds to the second style variation conversion means of claim 6, and the second speaker variation conversion function corresponds to the third variation conversion means of claim 5 or to the second speaker variation conversion means of claim 6.

In the first embodiment, step S310 corresponds to the voice signal generation means of claim 1, 11 or 16, or to the voice signal generation step of claim 10.
Next, a second embodiment of the present invention will be described with reference to the drawings. FIGS. 11 and 12 show the second embodiment of the voice signal conversion apparatus according to the present invention.

The present embodiment differs from the first embodiment in that the attributes of the output voice signal are identified from attribute data of the output voice signal rather than from a sample of the output voice signal. Only the parts that differ from the first embodiment are described below; parts in common with the first embodiment are given the same reference numerals and their description is omitted.
First, the processing executed by the CPU 30 will be described.

The CPU 30 executes the conversion function providing process shown in the flowchart of FIG. 11 in place of the conversion function providing process of FIG. 7.
FIG. 11 is a flowchart showing the conversion function providing process.
When the conversion function providing process is executed by the CPU 30, the process first proceeds to step S400, as shown in FIG. 11.

In step S400, it is determined whether an acquisition request for a voice signal conversion function has been received. If it is determined that an acquisition request has been received (Yes), the process proceeds to step S402; otherwise (No), the process waits in step S400 until an acquisition request is received.
In step S402, the feature parameter sequence of the input voice signal is received. The process then proceeds to step S404, where attribute data indicating the speaker attribute, style attribute and environment attribute of the output voice signal is received, and then to step S406.

In step S406, for each of the speaker attribute high-dimensional acoustic models, style attribute high-dimensional acoustic models and environment attribute high-dimensional acoustic models in the storage device 42, the likelihood of the feature parameter sequence received in step S402 is computed. The speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are identified as the speaker attribute, style attribute and environment attribute of the input voice signal.
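The maximum-likelihood identification of step S406 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each attribute model is a single diagonal-covariance Gaussian; the names `log_likelihood`, `identify_attribute` and `speaker_models`, and the toy numbers, are hypothetical and do not come from the specification. Only the speaker attribute is shown; the same routine would be applied to the style and environment model sets.

```python
import math


def log_likelihood(frames, mean, var):
    """Total log-likelihood of a frame sequence under a diagonal Gaussian.

    Assumption: a single Gaussian stands in for the high-dimensional
    acoustic model, whose actual form is not specified here.
    """
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify_attribute(frames, models):
    """Return the attribute label whose model gives the maximum likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, *models[label]))


# Hypothetical toy models: label -> (mean vector, variance vector).
speaker_models = {
    "speaker_A": ([0.0, 0.0], [1.0, 1.0]),
    "speaker_B": ([3.0, 3.0], [1.0, 1.0]),
}

# A two-frame feature parameter sequence lying close to speaker_B's mean.
frames = [[2.8, 3.1], [3.2, 2.9]]
identified = identify_attribute(frames, speaker_models)
```

Working in the log domain keeps the per-frame products numerically stable for long sequences, which is why the sketch sums log terms rather than multiplying densities.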

Next, the process proceeds to step S408, where the GMM whose speaker attribute, style attribute and environment attribute best match those of the attribute data received in step S404 is found, and the speaker attribute, style attribute and environment attribute corresponding to that GMM are identified as the speaker attribute, style attribute and environment attribute of the output voice signal.
Next, the process proceeds to step S410, where, based on the attributes identified in steps S406 and S408, the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function and second speaker variation conversion function corresponding to the speaker, style and environment attributes of the input and output voice signals are selected from the storage device 42. The process then proceeds to step S412, where the selected conversion functions are transmitted to the requesting mobile terminal 200.
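Steps S408 and S410 can be sketched roughly as follows. This is only an illustration under stated assumptions: the "best match" is taken to be the registered attribute triple that shares the most fields with the requested attribute data, and the conversion functions are assumed to be stored in tables keyed by attribute. The scoring rule, the table layout and every name in the block (`best_match`, `select_functions`, `table`, the attribute labels) are hypothetical.

```python
FIELDS = ("speaker", "style", "environment")


def best_match(requested, registered):
    """S408 (assumed rule): pick the registered triple sharing the most fields."""
    def score(candidate):
        return sum(1 for key in FIELDS if candidate[key] == requested[key])
    return max(registered, key=score)


def select_functions(table, src, dst):
    """S410: select the five conversion functions for input (src) and output (dst)."""
    return {
        "first_speaker": table["speaker"][src["speaker"]],
        "first_style": table["style"][src["style"]],
        "environment": table["environment"][(src["environment"], dst["environment"])],
        "second_style": table["style"][dst["style"]],
        "second_speaker": table["speaker"][dst["speaker"]],
    }


# Hypothetical registered attribute triples and conversion function identifiers.
registered = [
    {"speaker": "adult_male", "style": "normal", "environment": "quiet"},
    {"speaker": "adult_female", "style": "fast", "environment": "car"},
]
table = {
    "speaker": {"adult_male": "spk_fn_male", "adult_female": "spk_fn_female"},
    "style": {"normal": "style_fn_normal", "fast": "style_fn_fast"},
    "environment": {("quiet", "car"): "env_fn_quiet_to_car"},
}

src = registered[0]  # attributes identified for the input voice signal (S406)
requested = {"speaker": "adult_female", "style": "fast", "environment": "quiet"}
dst = best_match(requested, registered)  # attributes for the output voice signal
selected = select_functions(table, src, dst)
```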

Next, the process proceeds to step S414, where billing processing for the user of the requesting mobile terminal 200 is executed. The user of the mobile terminal 200 can be identified, for example, by requesting user information from the mobile terminal 200 when the conversion functions are provided and using the user information received in response.
When the processing of step S414 is finished, the series of processes ends and control returns to the original process.

Next, the processing executed by the CPU 50 will be described.
The CPU 50 executes the conversion function acquisition process shown in the flowchart of FIG. 12 in place of the conversion function acquisition process of FIG. 9.
FIG. 12 is a flowchart showing the conversion function acquisition process.
When the conversion function acquisition process is executed by the CPU 50, the process first proceeds to step S500, as shown in FIG. 12.

In step S500, it is determined whether an acquisition request for a voice signal conversion function has been input from the key panel 62. If it is determined that an acquisition request has been input (Yes), the process proceeds to step S502; otherwise (No), the process waits in step S500 until an acquisition request is input.
In step S502, the input voice signal is input from the microphone 66. The process then proceeds to step S504, where it is determined whether a predetermined time has elapsed since input of the input voice signal began. If it is determined that the predetermined time has elapsed (Yes), the process proceeds to step S506, where a feature parameter sequence is extracted from the input voice signal, and then to step S508.

In step S508, attribute data indicating the speaker attribute, style attribute and environment attribute of the output voice signal is input from the key panel 62, and the process proceeds to step S514.
In step S514, an acquisition request for a voice signal conversion function is transmitted to the voice signal conversion server 100. The process then proceeds to step S516, where the extracted feature parameter sequence of the input voice signal is transmitted to the voice signal conversion server 100; to step S518, where the attribute data entered for the output voice signal is transmitted to the voice signal conversion server 100; and then to step S520.

In step S520, it is determined whether the conversion functions have been received. If it is determined that they have been received (Yes), the process proceeds to step S522, where the received conversion functions are stored in the RAM 54; the series of processes then ends and control returns to the original process.
If, on the other hand, it is determined in step S520 that the conversion functions have not been received (No), the process waits in step S520 until they are received.
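The terminal-side flow of steps S502 to S522 can be sketched as below. This is a hedged illustration, not the terminal's actual firmware: the microphone, feature extractor, transport and clock are passed in as plain callables so the control flow can be followed without hardware, and all names (`acquire_conversion_functions`, the message labels, the fake inputs) are assumptions made for this example.

```python
def acquire_conversion_functions(read_chunk, extract_features, attribute_data,
                                 send, receive, duration_s, clock):
    """Sketch of the conversion function acquisition flow (S502 to S522)."""
    start = clock()
    samples = []
    # S502/S504: keep reading input until the predetermined time has elapsed.
    while True:
        samples.extend(read_chunk())
        if clock() - start >= duration_s:
            break
    features = extract_features(samples)         # S506
    send("acquisition_request", None)            # S514
    send("input_features", features)             # S516
    send("output_attributes", attribute_data)    # S518
    functions = None
    while functions is None:                     # S520: wait for the reply
        functions = receive()
    return functions                             # S522: caller stores the result


# Fake collaborators standing in for the microphone, network and clock.
ticks = iter([0.0, 0.4, 0.8, 1.2])
replies = iter([None, {"first_speaker": "f1"}])
sent = []
attributes = {"speaker": "adult_female", "style": "fast", "environment": "car"}

result = acquire_conversion_functions(
    read_chunk=lambda: [0.0, 0.1],
    extract_features=lambda s: [len(s)],
    attribute_data=attributes,
    send=lambda kind, payload: sent.append((kind, payload)),
    receive=lambda: next(replies),
    duration_s=1.0,
    clock=lambda: next(ticks),
)
```

Injecting the clock and transport is a design choice for the sketch only; it makes the message order S514, S516, S518 easy to check in isolation.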

If, on the other hand, it is determined in step S504 that the predetermined time has not elapsed since input of the input voice signal began (No), the process returns to step S502.
Next, the operation of the present embodiment will be described.
To acquire the conversion functions, the user enters an acquisition request at the mobile terminal 200, inputs the input voice signal from the microphone 66, and enters the attribute data of the output voice signal from the key panel 62.

When the input voice signal and the attribute data are input at the mobile terminal 200, a feature parameter sequence is extracted from the input voice signal through steps S506 to S518, and the extracted feature parameter sequence and the entered attribute data are transmitted to the voice signal conversion server 100 together with the acquisition request.
When the voice signal conversion server 100 receives the feature parameter sequence and the attribute data together with the acquisition request, then, through step S406, the likelihood of the feature parameter sequence of the input voice signal is computed for the speaker attribute high-dimensional acoustic models, style attribute high-dimensional acoustic models and environment attribute high-dimensional acoustic models in the storage device 42, and the speaker attribute, style attribute and environment attribute corresponding to the models that give the maximum likelihood are identified as the attributes of the input voice signal. Through step S408, the GMM whose speaker attribute, style attribute and environment attribute best match those of the received attribute data is found, and the speaker attribute, style attribute and environment attribute corresponding to that GMM are identified as the attributes of the output voice signal.

Then, through steps S410 and S412, the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function and second speaker variation conversion function are selected based on the identified attributes, and the selected conversion functions are transmitted to the mobile terminal 200. Through step S414, billing processing for the user of the mobile terminal 200 is performed.
When the mobile terminal 200 receives the conversion functions, they are stored in the RAM 54 through step S522.
Conversion of a voice signal is performed in the same manner as in the first embodiment.

The above description covers the case where the attribute data of the output voice signal and the input voice signal are input at the same mobile terminal 200; however, they may instead be input at different mobile terminals 200.
In this way, in the present embodiment, the attributes of the input voice signal are identified based on the feature parameter sequence of the input voice signal and the attribute models; the attributes of the output voice signal are identified based on the attribute data of the output voice signal; the first speaker variation conversion function, first style variation conversion function, environment variation conversion function, second style variation conversion function and second speaker variation conversion function are selected based on the identified attributes; and the feature parameter sequence is converted based on the selected variation conversion functions.

As a result, the feature parameter sequence is converted after the attributes of the input and output voice signals have been identified, so the feature parameter sequence can be converted accurately, as in the first embodiment.
In the second embodiment, the voice signal conversion server 100 corresponds to the conversion function providing terminal of claim 12 or 15; the storage device 42 corresponds to the attribute model storage means of claim 9 or to the variation conversion function storage means of claim 9, 12 or 15; and step S402 corresponds to the input feature parameter sequence receiving means of claim 12 or 15. Step S404 corresponds to the attribute data receiving means of claim 12 or 15, step S406 to the input voice signal attribute identification means of claim 9, 12 or 15, and step S408 to the output voice signal attribute identification means of claim 9, 12 or 15.

In the second embodiment, step S410 corresponds to the variation conversion function selection means of claim 9, 12 or 15; step S412 corresponds to the conversion function transmission means of claim 12 or 15; and step S502 corresponds to the input voice signal input means of claim 12 or 17. Steps S506 and S306 correspond to the feature parameter sequence extraction means of claim 9 or to the input feature parameter sequence extraction means of claim 12 or 17, and step S508 corresponds to the output voice signal attribute input means of claim 9, 12 or 17.

In the second embodiment, step S516 corresponds to the input feature parameter sequence transmission means of claim 12 or 17; step S518 corresponds to the attribute data transmission means of claim 12 or 17; and step S522 corresponds to the conversion function receiving means of claim 12 or 17. Step S308 corresponds to the feature parameter sequence conversion means of claim 9, 12 or 17, and step S310 corresponds to the voice signal generation means of claim 12 or 17.

In the first and second embodiments, the feature parameter sequence is converted across an environment variation; however, when the input and output voice signals have the same environment attribute, the feature parameter sequence can be converted as follows.
FIG. 13 shows the case where the feature parameter sequence of the input voice signal is converted into the feature parameter sequence of the output voice signal in accordance with speaker variation and style variation.

The acoustic model map of FIG. 13 is for the case where the speech corpus is grouped by one environment attribute 10a, four style attributes 12a to 12d, and four speaker attributes 14a to 14d. As FIG. 13 shows, only one distribution region, corresponding to the environment attribute 10a, is formed on the two-dimensional plane.
Once identification of the attributes is complete, the feature parameter sequence of the input voice signal is converted in order of increasing influence: speaker variation first, then style variation.

First, because speaker variation has the smallest influence, the feature parameter sequence of the input voice signal is converted into the feature parameter sequence of the voice signal of the first reference speaker 16a, as shown in FIG. 13.
Then, because style variation has the largest influence, the feature parameter sequence of the voice signal of the first reference speaker 16a is converted into the feature parameter sequence of the voice signal of the third reference style 16e. The feature parameter sequence of the voice signal of the third reference style 16e is that of a voice signal whose style has the average style attribute within the environment attribute shared by the input and output voice signals. This conversion corresponds to the style variation between the styles of the input and output voice signals and the third reference style 16e.

Thereafter, conversion to the feature parameter sequence of the output voice signal proceeds in the reverse order of the speaker variation and style variation.
First, the feature parameter sequence of the voice signal of the third reference style 16e is converted into the feature parameter sequence of the voice signal of the second reference speaker 16d.
Then, the feature parameter sequence of the voice signal of the second reference speaker 16d is converted into the feature parameter sequence of the output voice signal.
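The four-stage conversion of FIG. 13 amounts to applying the variation conversion functions one after another. The sketch below illustrates that composition only; it assumes, hypothetically, that each conversion function is a fixed per-dimension offset applied to every frame, which is far simpler than a real conversion function. The names `make_shift` and `apply_chain` and the offset values are invented for the example.

```python
from functools import reduce


def make_shift(offset):
    """Build a toy conversion function that adds a fixed offset to each frame."""
    def convert(frames):
        return [[x + o for x, o in zip(frame, offset)] for frame in frames]
    return convert


def apply_chain(frames, stages):
    """Apply the conversion stages in order, feeding each output to the next."""
    return reduce(lambda seq, stage: stage(seq), stages, frames)


# Hypothetical offsets; one stage per hop in FIG. 13.
stages = [
    make_shift([1.0, 0.0]),   # input -> first reference speaker 16a
    make_shift([0.0, 2.0]),   # 16a -> third reference style 16e
    make_shift([-1.0, 0.0]),  # 16e -> second reference speaker 16d
    make_shift([0.5, 0.5]),   # 16d -> output voice signal
]

converted = apply_chain([[0.0, 0.0], [1.0, 1.0]], stages)
```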

In the first and second embodiments, the feature parameter sequence is converted in accordance with speaker variation, style variation and environment variation. However, a configuration that converts in accordance with speaker variation and style variation, one that converts in accordance with speaker variation and environment variation, or one that converts in accordance with style variation and environment variation may also be adopted.
When conversion is performed in accordance with speaker variation and style variation, the feature parameter sequence of the input voice signal is converted into the feature parameter sequence of the voice signal of a first reference speaker having the same style attribute as, and a different speaker attribute from, the input voice signal. The converted feature parameter sequence is converted into the feature parameter sequence of the voice signal of a second reference speaker having the same style attribute as, and a different speaker attribute from, the output voice signal, and that sequence is then converted into the feature parameter sequence of the output voice signal. In this case, the feature parameter sequence of the input voice signal may also be converted directly into that of the voice signal of the second reference speaker, or the feature parameter sequence of the voice signal of the first reference speaker may be converted directly into that of the output voice signal.

When conversion is performed in accordance with speaker variation and environment variation, the feature parameter sequence of the input voice signal is converted into the feature parameter sequence of the voice signal of a first reference speaker having the same environment attribute as, and a different speaker attribute from, the input voice signal. The converted feature parameter sequence is converted into the feature parameter sequence of the voice signal of a second reference speaker having the same environment attribute as, and a different speaker attribute from, the output voice signal, and that sequence is then converted into the feature parameter sequence of the output voice signal. In this case, the feature parameter sequence of the input voice signal may also be converted directly into that of the voice signal of the second reference speaker, or the feature parameter sequence of the voice signal of the first reference speaker may be converted directly into that of the output voice signal.

When conversion is performed in accordance with style variation and environment variation, the feature parameter sequence of the input voice signal is converted into the feature parameter sequence of the voice signal of a first reference style having the same environment attribute as, and a different style attribute from, the input voice signal. The converted feature parameter sequence is converted into the feature parameter sequence of the voice signal of a second reference style having the same environment attribute as, and a different style attribute from, the output voice signal, and that sequence is then converted into the feature parameter sequence of the output voice signal. In this case, the feature parameter sequence of the input voice signal may also be converted directly into that of the voice signal of the second reference style, or the feature parameter sequence of the voice signal of the first reference style may be converted directly into that of the output voice signal.

In the first and second embodiments, the conversion passes through the first reference speaker 16a, the first reference style 16b, the second reference style 16c and the second reference speaker 16d. However, it suffices for the conversion to pass through at least two of the first reference speaker 16a, the first reference style 16b, the second reference style 16c and the second reference speaker 16d. Likewise, in the example of FIG. 13, it suffices for the conversion to pass through at least two of the first reference speaker 16a, the third reference style 16e and the second reference speaker 16d.
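The "at least two reference points" condition can be illustrated by a small route builder. This is a sketch under an assumption the text does not state explicitly: the chosen reference points are visited in the same relative order as in the full path (16a, 16b, 16c, 16d). The function name `build_route` and the string labels are hypothetical.

```python
CANONICAL = ["16a", "16b", "16c", "16d"]


def build_route(waypoints, canonical=CANONICAL):
    """Return the list of conversion hops from input to output.

    Assumption: selected reference points keep their order from the full path.
    """
    wanted = set(waypoints)
    chosen = [point for point in canonical if point in wanted]
    if len(chosen) < 2:
        raise ValueError("at least two reference points are required")
    nodes = ["input"] + chosen + ["output"]
    return list(zip(nodes, nodes[1:]))


# Route that passes through only the two reference speakers.
route = build_route(["16d", "16a"])
```

For the FIG. 13 case, the same helper could be called with `canonical=["16a", "16e", "16d"]`.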

In the second embodiment, the attributes of the output voice signal are identified based on attribute data of the output voice signal; however, the attributes of the input voice signal may likewise be identified based on attribute data of the input voice signal.
In the first and second embodiments, the invention is configured as a network system having the voice signal conversion server 100 and the mobile terminal 200; however, the voice signal conversion server 100 and the mobile terminal 200 may also be configured as a single integrated apparatus.

In the first and second embodiments, the processes shown in the flowcharts of FIGS. 7, 9, 10, 11 and 12 are all described as being performed by executing a control program stored in advance in the ROMs 32 and 52. However, a program describing these procedures may instead be read into the RAMs 34 and 54 from a storage medium on which it is stored, and then executed.

Here, the storage medium may be a semiconductor storage medium such as a RAM or ROM, a magnetic storage medium such as an FD or HD, an optically read storage medium such as a CD, CDV, LD or DVD, or a magnetic storage/optically read storage medium such as an MO. It includes any storage medium that can be read by a computer, regardless of whether the reading method is electronic, magnetic, optical or otherwise.

FIG. 1 is a diagram showing the distribution of acoustic models when acoustic models having speaker attributes, style attributes and environment attributes are mapped onto a two-dimensional plane by the COSMOS method.
FIG. 2 is a diagram showing the case where the feature parameter sequence of an input voice signal is converted into the feature parameter sequence of an output voice signal in accordance with speaker variation, style variation and environment variation.
FIG. 3 is a block diagram showing the configuration of a network system to which the present invention is applied.
FIG. 4 is a diagram showing the hardware configuration of the voice signal conversion server 100.
FIG. 5 is a schematic diagram of the data structure of the attribute models.
FIG. 6 is a schematic diagram of the data structure of the voice signal conversion functions.
FIG. 7 is a flowchart showing the conversion function providing process.
FIG. 8 is a diagram showing the hardware configuration of the mobile terminal 200.
FIG. 9 is a flowchart showing the conversion function acquisition process.
FIG. 10 is a flowchart showing the voice signal conversion process.
FIG. 11 is a flowchart showing the conversion function providing process.
FIG. 12 is a flowchart showing the conversion function acquisition process.
FIG. 13 is a diagram showing the case where the feature parameter sequence of an input voice signal is converted into the feature parameter sequence of an output voice signal in accordance with speaker variation and style variation.
FIG. 14 is an example of an acoustic model map on a two-dimensional plane created from voice data, acquired from a plurality of speakers, having two kinds of environment attributes and five kinds of style attributes.
FIG. 15 is an example of an acoustic model map on a two-dimensional plane created from voice data, acquired from a plurality of speakers, having one kind of environment attribute and five kinds of style attributes.
FIG. 16 is an example of an acoustic model map on a two-dimensional plane created from voice data, acquired from a plurality of speakers, having one kind of environment attribute and one kind of style attribute.

Explanation of symbols

10a, 10b: environment attribute; 12a to 12d: style attribute; 14a to 14d: speaker attribute; 16a: first reference speaker; 16b: first reference style; 16c: second reference style; 16d: second reference speaker; 16e: third reference style; 100: voice signal conversion server; 200: mobile terminal; 30, 50: CPU; 32, 52: ROM; 34, 54: RAM; 38, 58: I/F; 39, 59: bus; 40: input device; 42: storage device; 44: display device; 60: wireless communication device; 62: key panel; 64: LCD; 66: microphone; 68: speaker; 210: base station; 220: relay station; 199: network

Claims

An audio signal conversion device for converting an input audio signal into a target output audio signal, comprising:
feature parameter sequence extraction means for extracting, from the input audio signal, a high-dimensional feature parameter sequence having at least a predetermined number of dimensions;
acoustic model map storage means for storing an acoustic model map together with high-dimensional acoustic models, wherein audio data obtained from a plurality of speakers is grouped based on three attributes, namely speaker attributes, style attributes, and environment attributes, each high-dimensional acoustic model is composed of high-dimensional feature parameter sequences, having at least the predetermined number of dimensions, generated from the audio data belonging to one group, and the acoustic model map consists of low-dimensional vectors, each corresponding to one acoustic model and having fewer dimensions than the high-dimensional acoustic models, obtained by converting the high-dimensional acoustic models while maintaining the mathematical distance relationships among them;
feature parameter sequence conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into the feature parameter sequence of the output audio signal in accordance with a combination of at least two of: variation between speaker attributes, variation between style attributes, and variation between environment attributes; and
audio signal generation means for generating the output audio signal from the feature parameter sequence converted by the feature parameter sequence conversion means,
wherein, in the acoustic model map, the distribution region of the low-dimensional vectors corresponding to acoustic models having the same environment attribute contains a plurality of distribution regions of low-dimensional vectors corresponding to acoustic models having different style attributes, and each of those distribution regions in turn contains a plurality of distribution regions of low-dimensional vectors corresponding to acoustic models having different speaker attributes.
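As an illustration only, the requirement that the low-dimensional map maintain the distance relationships among the high-dimensional acoustic models can be checked numerically. The sketch below uses Sammon's stress as the measure; the claim does not name a specific measure, so this choice is an assumption. Representing each acoustic model as a plain vector with Euclidean distance, and the names `sammon_stress`, `models_high`, `map_good`, and `map_bad`, are likewise hypothetical simplifications.

```python
import math
from itertools import combinations


def sammon_stress(high_vectors, low_vectors):
    """Sammon's stress between a high-dimensional point set and its
    low-dimensional image: 0.0 means every pairwise distance is preserved."""
    pairs = list(combinations(range(len(high_vectors)), 2))
    # Pairwise distances before and after the dimensionality reduction.
    d_high = [math.dist(high_vectors[i], high_vectors[j]) for i, j in pairs]
    d_low = [math.dist(low_vectors[i], low_vectors[j]) for i, j in pairs]
    # Assumes all high-dimensional points are distinct (no zero distances).
    return sum((h - l) ** 2 / h for h, l in zip(d_high, d_low)) / sum(d_high)


# Hypothetical stand-ins for three high-dimensional acoustic models.
models_high = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
# A 2-D placement that keeps all three pairwise distances (3, 4, 5).
map_good = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# A degenerate placement that collapses every model onto one point.
map_bad = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

print(sammon_stress(models_high, map_good))
print(sammon_stress(models_high, map_bad))
```

A placement that preserves every pairwise distance gives a stress of zero, while the collapsed placement gives a much larger value, so a lower stress indicates a map that better reflects the original distance relationships.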

The audio signal conversion device according to claim 1,
wherein the variation between the speaker attributes is a variation between a speaker attribute relating to the vocal tract of the speaker of the input audio signal and a speaker attribute relating to the vocal tract of the speaker of the output audio signal.

The audio signal conversion device according to claim 1,
wherein the variation between the style attributes is a variation between a style attribute relating to the utterance style or utterance content of the input audio signal and a style attribute relating to the utterance style or utterance content of the output audio signal.

The audio signal conversion device according to claim 1,
wherein the variation between the environment attributes is a variation between an environment attribute relating to noise or reverberation in the utterance environment of the input audio signal and an environment attribute relating to noise or reverberation in the utterance environment of the output audio signal.

The audio signal conversion device according to any one of claims 1 to 4,
wherein the feature parameter sequence conversion means includes:
first variation conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into any one of: the feature parameter sequence of an audio signal of a first reference speaker having a speaker attribute different from that of the input audio signal; the feature parameter sequence of a first reference style audio signal having a style attribute different from that of the input audio signal; and the feature parameter sequence of a second reference style audio signal having a style attribute different from that of the output audio signal;
second variation conversion means for converting the feature parameter sequence converted by the first variation conversion means into any one of: the feature parameter sequence of the first reference style audio signal; the feature parameter sequence of the second reference style audio signal; and the feature parameter sequence of an audio signal of a second reference speaker having a speaker attribute different from that of the output audio signal, the conversion target being one that comes later than the conversion target of the first variation conversion means in the order of the first reference speaker, the first reference style, the second reference style, and the second reference speaker; and
third variation conversion means for converting the feature parameter sequence converted by the second variation conversion means into the feature parameter sequence of the output audio signal,
wherein the feature parameter sequence of the audio signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the input audio signal,
the feature parameter sequence of the first reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the input audio signal,
the feature parameter sequence of the second reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the output audio signal, and
the feature parameter sequence of the audio signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the output audio signal.
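One possible reading of "an acoustic model having an average attribute within a distribution region" is the model whose low-dimensional vector lies closest to the centroid of that region. The sketch below illustrates that reading only; the centroid interpretation, the function name `reference_model`, and the sample coordinates are assumptions and are not stated in the claims.

```python
import math


def reference_model(region):
    """Return the id of the model whose low-dimensional vector lies
    closest to the centroid of the given distribution region."""
    vectors = list(region.values())
    dims = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    return min(region, key=lambda model_id: math.dist(region[model_id], centroid))


# Hypothetical region: speakers sharing one environment and one style.
region = {
    "speaker_a": (0.0, 0.0),
    "speaker_b": (2.0, 0.0),
    "speaker_c": (1.0, 0.1),
}
print(reference_model(region))
```

Under this reading, the high-dimensional acoustic model stored alongside the selected low-dimensional vector would supply the reference feature parameter sequence.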

The audio signal conversion device according to any one of claims 1 to 4,
wherein the feature parameter sequence conversion means includes:
first speaker variation conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into the feature parameter sequence of an audio signal of a first reference speaker having the same style attribute and environment attribute as the input audio signal and a different speaker attribute;
first style variation conversion means for converting the feature parameter sequence converted by the first speaker variation conversion means into the feature parameter sequence of a first reference style audio signal having the same environment attribute as the input audio signal and a different style attribute;
environment variation conversion means for converting the feature parameter sequence converted by the first style variation conversion means into the feature parameter sequence of a second reference style audio signal having the same environment attribute as the output audio signal and a different style attribute;
second style variation conversion means for converting the feature parameter sequence converted by the environment variation conversion means into the feature parameter sequence of an audio signal of a second reference speaker having the same style attribute and environment attribute as the output audio signal and a different speaker attribute; and
second speaker variation conversion means for converting the feature parameter sequence converted by the second style variation conversion means into the feature parameter sequence of the output audio signal,
wherein the feature parameter sequence of the audio signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the input audio signal,
the feature parameter sequence of the first reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the input audio signal,
the feature parameter sequence of the second reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the output audio signal, and
the feature parameter sequence of the audio signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the output audio signal.
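The five conversion means recited above act as a chain, each taking the previous stage's output as its input. A minimal sketch of that chaining follows. Each stage is modeled as a fixed additive offset applied to every frame, which is a hypothetical stand-in for the actual conversion functions; `make_shift`, `convert_chain`, and all offset values are illustrative assumptions.

```python
from functools import reduce


def make_shift(offset):
    """Build a toy conversion that adds a fixed offset to every frame."""
    def convert(frames):
        return [[x + o for x, o in zip(frame, offset)] for frame in frames]
    return convert


def convert_chain(frames, stages):
    """Apply each conversion stage in order to the whole frame sequence."""
    return reduce(lambda seq, stage: stage(seq), stages, frames)


# Hypothetical offsets, one per stage, in the order recited in the claim.
stages = [
    make_shift([1, 0]),   # input speaker -> first reference speaker
    make_shift([0, 2]),   # first reference speaker -> first reference style
    make_shift([3, 0]),   # first reference style -> second reference style
    make_shift([0, -1]),  # second reference style -> second reference speaker
    make_shift([-2, 1]),  # second reference speaker -> output speaker
]
input_frames = [[0, 0], [10, 10]]
print(convert_chain(input_frames, stages))
```

Keeping each stage as a separate function mirrors the claim structure: a stage can be swapped for one matched to a different speaker, style, or environment attribute without touching the others.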

The audio signal conversion device according to any one of claims 1 to 4,
wherein the feature parameter sequence conversion means includes:
first speaker variation conversion means for converting the feature parameter sequence extracted by the feature parameter sequence extraction means into the feature parameter sequence of an audio signal of a first reference speaker having the same style attribute and environment attribute as the input audio signal and a different speaker attribute;
first style variation conversion means for converting the feature parameter sequence converted by the first speaker variation conversion means into the feature parameter sequence of either a first reference style audio signal having the same environment attribute as the input audio signal and a different style attribute, or a second reference style audio signal having the same environment attribute as the output audio signal and a different style attribute;
second style variation conversion means for converting the feature parameter sequence converted by the first style variation conversion means into the feature parameter sequence of an audio signal of a second reference speaker having the same style attribute and environment attribute as the output audio signal and a different speaker attribute; and
second speaker variation conversion means for converting the feature parameter sequence converted by the second style variation conversion means into the feature parameter sequence of the output audio signal,
wherein the feature parameter sequence of the audio signal of the first reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the input audio signal,
the feature parameter sequence of the first reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the input audio signal,
the feature parameter sequence of the second reference style audio signal is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average style attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute as the output audio signal, and
the feature parameter sequence of the audio signal of the second reference speaker is the feature parameter sequence of the high-dimensional acoustic model whose low-dimensional vector, in the acoustic model map, corresponds to an acoustic model having an average speaker attribute within the distribution region of low-dimensional vectors of acoustic models having the same environment attribute and style attribute as the output audio signal.

The audio signal conversion device according to any one of claims 5 to 7,
wherein the acoustic model map storage means further stores a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of audio signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of audio signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of audio signals, the device further comprising:
input audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood of the feature parameter sequence extracted by the feature parameter sequence extraction means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input audio signal;
output feature parameter sequence extraction means for extracting a feature parameter sequence from a sample of the output audio signal;
output audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the attribute model storage means, the likelihood of the feature parameter sequence extracted by the output feature parameter sequence extraction means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the output audio signal;
variation conversion function storage means for storing: a plurality of first speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the audio signal of the first reference speaker; a plurality of first style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the first reference style audio signal; a plurality of environment variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the second reference style audio signal; a plurality of second style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the audio signal of the second reference speaker; and a plurality of second speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the output audio signal; and
variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function, based on the attributes identified by the input audio signal attribute identification means and the output audio signal attribute identification means,
wherein the feature parameter sequence conversion means converts the feature parameter sequence based on the variation conversion functions selected by the variation conversion function selection means.
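The attribute identification and function selection recited above can be sketched as a maximum-likelihood lookup. For brevity the sketch scores each attribute with a single diagonal-covariance Gaussian and handles only the speaker attribute; this simplification, the names `log_likelihood`, `identify_attribute`, `speaker_models`, and `first_speaker_functions`, and all numeric values are assumptions made for illustration.

```python
import math


def log_likelihood(frames, mean, var):
    """Log-likelihood of a frame sequence under one diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify_attribute(frames, models):
    """Return the attribute label whose model gives the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, *models[label]))


# Hypothetical speaker attribute models: label -> (mean, variance).
speaker_models = {
    "speaker_low": ([0.0, 0.0], [1.0, 1.0]),
    "speaker_high": ([5.0, 5.0], [1.0, 1.0]),
}
# Hypothetical table of stored conversion functions, keyed by attribute.
first_speaker_functions = {
    "speaker_low": "f_low_to_reference",
    "speaker_high": "f_high_to_reference",
}

frames = [[4.8, 5.1], [5.2, 4.9]]
label = identify_attribute(frames, speaker_models)
print(label, first_speaker_functions[label])
```

The same lookup would be repeated for the style and environment attributes of both the input and the output audio signal, and the identified attributes would then index the stored first speaker, first style, environment, second style, and second speaker variation conversion functions.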

The audio signal conversion device according to any one of claims 5 to 7,
wherein the acoustic model map storage means includes attribute model storage means for storing a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of audio signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of audio signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of audio signals, the device further comprising:
input audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the attribute model storage means, the likelihood of the feature parameter sequence extracted by the feature parameter sequence extraction means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input audio signal;
output audio signal attribute input means for inputting the speaker attribute, style attribute, and environment attribute of the output audio signal;
output audio signal attribute identification means for finding the Gaussian mixture model whose speaker attribute, style attribute, and environment attribute best match those input by the output audio signal attribute input means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to that Gaussian mixture model as the speaker attribute, style attribute, and environment attribute of the output audio signal;
variation conversion function storage means for storing: a plurality of first speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the audio signal of the first reference speaker; a plurality of first style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the first reference style audio signal; a plurality of environment variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the second reference style audio signal; a plurality of second style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the audio signal of the second reference speaker; and a plurality of second speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the output audio signal; and
variation conversion function selection means for selecting, from the variation conversion function storage means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function, based on the attributes identified by the input audio signal attribute identification means and the output audio signal attribute identification means,
wherein the feature parameter sequence conversion means converts the feature parameter sequence based on the variation conversion functions selected by the variation conversion function selection means.

An audio signal conversion method for converting an input audio signal into a target output audio signal, comprising:
a feature parameter sequence extraction step of extracting, from the input audio signal, a high-dimensional feature parameter sequence having at least a predetermined number of dimensions;
an acoustic model map storage step of storing an acoustic model map together with high-dimensional acoustic models, wherein audio data obtained from a plurality of speakers is grouped based on three attributes, namely speaker attributes, style attributes, and environment attributes, each high-dimensional acoustic model is composed of high-dimensional feature parameter sequences, having at least the predetermined number of dimensions, generated from the audio data belonging to one group, and the acoustic model map consists of low-dimensional vectors, each corresponding to one acoustic model and having fewer dimensions than the high-dimensional acoustic models, obtained by converting the high-dimensional acoustic models while maintaining the mathematical distance relationships among them;
a feature parameter sequence conversion step of converting the feature parameter sequence extracted in the feature parameter sequence extraction step into the feature parameter sequence of the output audio signal in accordance with a combination of at least two of: variation between speaker attributes, variation between style attributes, and variation between environment attributes; and
an audio signal generation step of generating the output audio signal from the feature parameter sequence converted in the feature parameter sequence conversion step,
wherein, in the acoustic model map, the distribution region of the low-dimensional vectors corresponding to acoustic models having the same environment attribute contains a plurality of distribution regions of low-dimensional vectors corresponding to acoustic models having different style attributes, and each of those distribution regions in turn contains a plurality of distribution regions of low-dimensional vectors corresponding to acoustic models having different speaker attributes.

An audio signal conversion service providing system in which a conversion function providing terminal and a portable terminal are communicably connected to each other, and which provides an audio conversion service for converting an input audio signal into a target output audio signal, wherein
the conversion function providing terminal comprises:
acoustic model map storage means for storing an acoustic model map together with high-dimensional acoustic models, and for further storing a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of audio signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of audio signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of audio signals, wherein audio data obtained from a plurality of speakers is grouped based on three attributes, namely speaker attributes, style attributes, and environment attributes, each high-dimensional acoustic model is composed of high-dimensional feature parameter sequences, having at least a predetermined number of dimensions, generated from the audio data belonging to one group, and the acoustic model map consists of low-dimensional vectors, each corresponding to one acoustic model and having fewer dimensions than the high-dimensional acoustic models, obtained by converting the high-dimensional acoustic models while maintaining the mathematical distance relationships among them;
Input feature parameter sequence receiving means for receiving a feature parameter sequence of the input speech signal;
input audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood of the feature parameter sequence received by the input feature parameter sequence receiving means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input audio signal;
Output feature parameter sequence receiving means for receiving a feature parameter sequence of the output audio signal;
output audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood of the feature parameter sequence received by the output feature parameter sequence receiving means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the output audio signal;
variation conversion function storage means for storing: a plurality of first speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of an audio signal of a first reference speaker having the same style attribute and environment attribute as the input audio signal and a different speaker attribute; a plurality of first style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of a first reference style audio signal having the same environment attribute as the input audio signal and a different style attribute; a plurality of environment variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of a second reference style audio signal having the same environment attribute as the output audio signal and a different style attribute; a plurality of second style variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of an audio signal of a second reference speaker having the same style attribute and environment attribute as the output audio signal and a different speaker attribute; and a plurality of second speaker variation conversion functions, each converting a given feature parameter sequence into the feature parameter sequence of the output audio signal;
Based on the attributes identified by the input voice signal attribute identification means and the output voice signal attribute identification means, the first speaker fluctuation conversion function, the first style fluctuation conversion function, the environment fluctuation conversion function, and the second style A fluctuation conversion function selecting means for selecting a fluctuation conversion function and the second speaker fluctuation conversion function from the fluctuation conversion function storage means;
Conversion function transmitting means for transmitting the conversion function selected by the variation conversion function selection means to the portable terminal,
and the portable terminal comprises:
Input voice signal input means for inputting the input voice signal;
Input feature parameter series extraction means for extracting a high-dimensional feature parameter series of a predetermined dimension number or more from the input voice signal input by the input voice signal input means;
Input feature parameter sequence transmitting means for transmitting the feature parameter sequence extracted by the input feature parameter sequence extracting means to the conversion function providing terminal;
Output audio signal input means for inputting a sample of the output audio signal;
Output feature parameter series extraction means for extracting a feature parameter series from the output voice signal input by the output voice signal input means;
Output feature parameter sequence transmitting means for transmitting the feature parameter sequence extracted by the output feature parameter sequence extracting means to the conversion function providing terminal;
Conversion function receiving means for receiving the conversion function;
Feature parameter sequence conversion means for converting the feature parameter sequence extracted by the input feature parameter sequence extraction means into a feature parameter sequence of the output audio signal based on the conversion function received by the conversion function receiving means;
and audio signal generation means for generating the output audio signal from the feature parameter sequence converted by the feature parameter sequence conversion means.

An audio signal conversion service providing system in which a conversion function providing terminal and a portable terminal are communicably connected to each other, and which provides an audio conversion service for converting an input audio signal into a target output audio signal, wherein
the conversion function providing terminal comprises:
acoustic model map storage means for storing an acoustic model map together with high-dimensional acoustic models, and for further storing a set of speaker attribute high-dimensional acoustic models corresponding to a plurality of speaker attributes of audio signals, a set of style attribute high-dimensional acoustic models corresponding to a plurality of style attributes of audio signals, and a set of environment attribute high-dimensional acoustic models corresponding to a plurality of environment attributes of audio signals, wherein audio data obtained from a plurality of speakers is grouped based on three attributes, namely speaker attributes, style attributes, and environment attributes, each high-dimensional acoustic model is composed of high-dimensional feature parameter sequences, having at least a predetermined number of dimensions, generated from the audio data belonging to one group, and the acoustic model map consists of low-dimensional vectors, each corresponding to one acoustic model and having fewer dimensions than the high-dimensional acoustic models, obtained by converting the high-dimensional acoustic models while maintaining the mathematical distance relationships among them;
Input feature parameter sequence receiving means for receiving a feature parameter sequence of the input speech signal;
input audio signal attribute identification means for computing, for the speaker attribute high-dimensional acoustic models, the style attribute high-dimensional acoustic models, and the environment attribute high-dimensional acoustic models in the acoustic model map storage means, the likelihood of the feature parameter sequence received by the input feature parameter sequence receiving means, and identifying the speaker attribute, style attribute, and environment attribute corresponding to the speaker attribute high-dimensional acoustic model, the style attribute high-dimensional acoustic model, and the environment attribute high-dimensional acoustic model that give the maximum likelihood as the speaker attribute, style attribute, and environment attribute of the input audio signal;
Attribute data receiving means for receiving attribute data indicating speaker attributes, style attributes and environment attributes of the output audio signal;
Output voice signal attribute identifying means for obtaining a mixed normal distribution model whose speaker attribute, style attribute, and environment attribute best match those of the attribute data received by the attribute data receiving means, and for identifying the speaker attribute, style attribute, and environment attribute corresponding to that mixed normal distribution model as the speaker attribute, style attribute, and environment attribute of the output voice signal;
Variation conversion function storage means for storing: a plurality of first speaker variation conversion functions for converting a given feature parameter sequence into a feature parameter sequence of a speech signal of a first reference speaker having the same style attribute and environment attribute as the input speech signal and a different speaker attribute; a plurality of first style variation conversion functions for converting a given feature parameter sequence into a feature parameter sequence of a first reference style voice signal having the same environment attribute as the input voice signal and a different style attribute; a plurality of environment variation conversion functions for converting a given feature parameter sequence into a feature parameter sequence of a second reference style voice signal having the same environment attribute as the output voice signal and a different style attribute; a plurality of second style variation conversion functions for converting a given feature parameter sequence into a feature parameter sequence of a voice signal of a second reference speaker having the same style attribute and environment attribute as the output voice signal and a different speaker attribute; and a plurality of second speaker variation conversion functions for converting a given feature parameter sequence into a feature parameter sequence of the output speech signal;
Variation conversion function selecting means for selecting, based on the attributes identified by the input voice signal attribute identification means and the output voice signal attribute identification means, the first speaker variation conversion function, the first style variation conversion function, the environment variation conversion function, the second style variation conversion function, and the second speaker variation conversion function from the variation conversion function storage means;
Conversion function transmitting means for transmitting the conversion function selected by the variation conversion function selection means to the portable terminal,
The portable terminal comprises:
Input voice signal input means for inputting the input voice signal;
Input feature parameter series extraction means for extracting a high-dimensional feature parameter series of a predetermined dimension number or more from the input voice signal input by the input voice signal input means;
Input feature parameter sequence transmitting means for transmitting the feature parameter sequence extracted by the input feature parameter sequence extracting means to the conversion function providing terminal;
Output audio signal attribute input means for inputting speaker attributes, style attributes and environment attributes of the output audio signal;
Attribute data transmitting means for transmitting attribute data indicating speaker attributes, style attributes and environment attributes input by the output audio signal attribute input means to the conversion function providing terminal;
Conversion function receiving means for receiving the conversion function;
Feature parameter sequence conversion means for converting the feature parameter sequence extracted by the input feature parameter sequence extraction means into a feature parameter sequence of the output audio signal based on the conversion function received by the conversion function receiving means;
Audio signal generation means for generating the output audio signal from the feature parameter series converted by the feature parameter series conversion means, the audio signal conversion service providing system being characterized by comprising the above means.
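
As an illustration only, the selection and chained application of the five conversion functions recited above can be sketched in Python. The per-dimension affine form of each function, the lookup keys, and all names (`Attributes`, `affine`, `select_chain`, `convert_sequence`) are assumptions made for this sketch; the claim does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Frame = List[float]                      # one feature parameter vector
Conversion = Callable[[Frame], Frame]    # one variation conversion function


@dataclass(frozen=True)
class Attributes:
    """Speaker, style and environment attributes of a voice signal."""
    speaker: str
    style: str
    environment: str


def affine(scale: float, bias: float) -> Conversion:
    """Hypothetical conversion function: y = scale * x + bias per dimension."""
    return lambda frame: [scale * x + bias for x in frame]


def select_chain(store: Dict[str, Dict[tuple, Conversion]],
                 src: Attributes, dst: Attributes) -> List[Conversion]:
    """Pick the five functions from the identified input/output attributes.

    The dictionary keys are assumed for illustration only.
    """
    return [
        # input speaker -> first reference speaker
        store["speaker_1"][(src.speaker, src.style, src.environment)],
        # input style -> first reference style
        store["style_1"][(src.style, src.environment)],
        # input environment -> output environment (second reference style)
        store["environment"][(src.environment, dst.environment)],
        # second reference style -> output style (second reference speaker)
        store["style_2"][(dst.style, dst.environment)],
        # second reference speaker -> output speaker
        store["speaker_2"][(dst.speaker, dst.style, dst.environment)],
    ]


def convert_sequence(frames: List[Frame],
                     chain: List[Conversion]) -> List[Frame]:
    """Apply the selected functions in order to every frame."""
    converted = []
    for frame in frames:
        for fn in chain:
            frame = fn(frame)
        converted.append(frame)
    return converted
```

In this sketch the selection step would run on the conversion function providing terminal and the application step on the portable terminal, mirroring the division of means in the claim.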

In the audio signal conversion service providing system according to claim 11 or 12,
The conversion function providing terminal further includes:
Charging means for performing charging processing for a user of the portable terminal when the conversion function is transmitted by the conversion function transmitting means.

A conversion function providing terminal applied to the audio signal conversion service providing system according to claim 11,
The conversion function providing terminal comprising the acoustic model map storage means, the input feature parameter series reception means, the input speech signal attribute identification means, the output feature parameter series reception means, the output speech signal attribute identification means, the variation conversion function storage means, the variation conversion function selection means, and the conversion function transmission means.

A conversion function providing terminal applied to the audio signal conversion service providing system according to claim 12,
The conversion function providing terminal comprising the acoustic model map storage means, the input feature parameter series reception means, the input audio signal attribute identification means, the attribute data reception means, the output audio signal attribute identification means, the variation conversion function storage means, the variation conversion function selection means, and the conversion function transmitting means.
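
As an illustration only, the maximum-likelihood identification performed by the input audio signal attribute identification means can be sketched in Python. Modeling each attribute model as a single diagonal-covariance Gaussian, and the names `log_likelihood`, `identify_attribute`, and `identify_input_attributes`, are assumptions made for this sketch; the claims state only that the models giving the maximum likelihood are selected.

```python
import math
from typing import Dict, List, Tuple

Frame = List[float]
# Assumed model form: (means, variances), one entry per feature dimension.
Gaussian = Tuple[List[float], List[float]]


def log_likelihood(frames: List[Frame], model: Gaussian) -> float:
    """Total log-likelihood of a feature parameter series under one model."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, means, variances):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def identify_attribute(frames: List[Frame],
                       models: Dict[str, Gaussian]) -> str:
    """Return the attribute label whose model gives the maximum likelihood."""
    return max(models, key=lambda label: log_likelihood(frames, models[label]))


def identify_input_attributes(frames: List[Frame],
                              speaker_models: Dict[str, Gaussian],
                              style_models: Dict[str, Gaussian],
                              environment_models: Dict[str, Gaussian]
                              ) -> Tuple[str, str, str]:
    """Identify speaker, style and environment attributes independently."""
    return (
        identify_attribute(frames, speaker_models),
        identify_attribute(frames, style_models),
        identify_attribute(frames, environment_models),
    )
```

The three attribute sets are searched separately here, so the cost grows with the sum, not the product, of the numbers of speaker, style, and environment models.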

A portable terminal applied to the audio signal conversion service providing system according to claim 11,
The portable terminal comprising the input voice signal input means, the input feature parameter series extraction means, the input feature parameter series transmission means, the output voice signal input means, the output feature parameter series extraction means, the output feature parameter series transmission means, the conversion function reception means, the feature parameter series conversion means, and the voice signal generation means.

A portable terminal applied to the audio signal conversion service providing system according to claim 12,
The portable terminal comprising the input voice signal input means, the input feature parameter series extraction means, the input feature parameter series transmission means, the output voice signal attribute input means, the attribute data transmission means, the conversion function reception means, the feature parameter series conversion means, and the audio signal generating means.