JP2002297199A

JP2002297199A - Method and device for discriminating synthesized voice and voice synthesizer

Info

Publication number: JP2002297199A
Application number: JP2001097158A
Authority: JP
Inventors: Yoshinori Shiga; 芳則志賀
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-03-29
Filing date: 2001-03-29
Publication date: 2002-10-11

Abstract

PROBLEM TO BE SOLVED: To provide a method for determining whether or not an input speech signal is a synthesized speech signal. SOLUTION: A speech synthesizer 12 generates a synthesized speech signal by performing speech synthesis on an input character string, adds to it discrimination information indicating that the speech is synthesized, and outputs the resulting speech signal 18. A synthesized speech discrimination device 20 on the receiving side checks the input speech signal for the presence or absence of the discrimination information and, when the information is detected, judges that the input speech signal is the synthesized speech signal 18.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a synthesized speech discrimination method and device for distinguishing speech synthesized by a speech synthesizer from speech actually uttered by a human, and to a speech synthesizer.

[0002]

2. Description of the Related Art

So-called rule-based speech synthesis, which synthesizes speech from text given as a character string, has advanced remarkably. Recently, as shown in FIG. 1, a technique is becoming established in which utterances of announcers and narrators are recorded; the recorded uttered speech 1 is converted into a speech database 2 that a computer can use; a learning processing unit 3 uses a statistical learning algorithm to acquire automatically, from the speech database 2, control information for the timbre (phonological information) and intonation (prosodic information) of the speech, thereby creating a control information database 4; and a rule-based speech synthesis processing unit 6 uses the control information database 4 to generate synthesized speech from input text 5.

[0003] Such techniques are described in detail in, for example, X. Huang et al., "Recent Improvements on Microsoft's Trainable Text-To-Speech System - Whistler," IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 959-962, 1997.

[0004] Synthesized speech obtained by such methods retains the characteristics of the recorded speaker, such as an announcer or narrator, well. That is, the speech actually uttered by the recorded speaker and the speech produced by rule-based synthesis are very similar in voice quality and intonation, and a listener who hears only a little of it cannot easily tell whether it is the recorded speaker's uttered speech or synthesized speech. The distinction becomes especially difficult when the speech has been degraded by band limitation or the like, as when it passes through a transmission system such as a telephone line.

[0005] Furthermore, techniques are being studied that use rule-based speech synthesis to generate synthesized speech imitating real uttered speech, by capturing the speech of the user or of someone other than the user on a personal computer or the like. In the near future it is expected to become possible to synthesize one's own voice or another person's voice with ease.

[0006]

Problems to be Solved by the Invention

As described above, advances in rule-based speech synthesis make it possible to generate synthesized speech close to the speech actually uttered by the recorded speaker. This brings various benefits, such as synthesizing speech that sounds like a celebrity's utterance or having e-mail read aloud in the sender's voice.

[0007] On the other hand, misuse of rule-based speech synthesis makes it easy to impersonate another person. This could, for example, render meaningless the speaker identification and speaker verification techniques that identify or verify a speaker by voice, and there is concern that it could also affect criminal investigations that rely on voiceprint analysis.

SUMMARY OF THE INVENTION

[0008] The present invention has been made to solve these problems, and its object is to provide a synthesized speech discrimination method and device capable of determining whether an input speech signal is a synthesized speech signal, and a speech synthesizer.

[0009]

Means for Solving the Problems

To solve the above problems, the synthesized speech discrimination method according to the present invention is fundamentally characterized by outputting a speech signal in which discrimination information, for discriminating that the signal is a synthesized speech signal, has been added to a synthesized speech signal generated by performing speech synthesis in accordance with an input character string. The discrimination information added to the synthesized speech signal is desirably information that is not lost through analog transmission, band limitation, or sound quality degradation; for example, the discrimination information is added by modifying the frequency characteristics of the synthesized speech signal. More specifically, the modification of the frequency characteristics is a change in signal power in a frequency band outside the main frequency band of the synthesized speech signal. The presence or absence of the discrimination information is then detected from an input speech signal, and when the discrimination information is detected, the input speech signal is determined to be a synthesized speech signal.

[0010] A speech synthesizer according to the present invention comprises a speech synthesis unit that generates a synthesized speech signal by performing speech synthesis in accordance with an input character string, and a discrimination information adding unit that outputs a speech signal obtained by adding, to the synthesized speech signal, discrimination information for discriminating that it is a synthesized speech signal.

[0011] A synthesized speech discrimination device according to the present invention, used in combination with this speech synthesizer, comprises a discrimination unit that detects, from an input speech signal, the presence or absence of the discrimination information added to the synthesized speech signal in the speech synthesizer and determines, when the discrimination information is detected, that the input speech signal is a synthesized speech signal, and a display unit that displays the discrimination result of the discrimination unit.

[0012] The present invention can also provide a program, or a recording medium on which the program is recorded, for causing a computer to execute a process of generating a synthesized speech signal by performing speech synthesis in accordance with an input character string, and a process of outputting a speech signal obtained by adding, to the synthesized speech signal, discrimination information for discriminating that it is a synthesized speech signal.

[0013] Furthermore, the present invention can provide a program, or a recording medium on which the program is recorded, for causing a computer to execute a process of detecting, from an input speech signal, the presence or absence of discrimination information added to a synthesized speech signal and determining, when the discrimination information is detected, that the input speech signal is the synthesized speech signal, and a process of displaying the result of this determination.

[0014] Thus, according to the present invention, a speech signal is output in which discrimination information indicating that it is a synthesized signal has been added to the synthesized speech signal, so that the side receiving the signal can determine whether the input speech signal is a synthesized speech signal or a speech signal uttered by a human. This makes it possible to prevent criminal misuse of speech synthesis, namely fraud against speaker identification or speaker verification systems through impersonation of another person, and the disruption of speech-based forensic investigation. In addition, because the discrimination information is added to the synthesized speech signal itself, the receiving side can easily determine whether or not an input speech signal is a synthesized speech signal.

[0015]

Embodiments of the Invention

Embodiments of the present invention will be described below with reference to the drawings. FIG. 2 is a block diagram showing the schematic configuration of a speech synthesizer and a synthesized speech discrimination device according to an embodiment of the present invention.

[0016] The speech synthesizer 12 is implemented on an information processing device such as a personal computer by executing dedicated software (text-to-speech software) supplied on a recording medium such as a CD-ROM, flexible disk, or memory card, or via a communication medium such as a network. It has a text-to-speech (TTS) processing function that generates speech from input text.

[0017] In the speech synthesizer 12, the text to be converted to speech (read aloud) is stored as a text file 11. Under control of the text-to-speech software, text containing a mixture of kanji and kana is read from the text file 11 and input to a text-to-speech processing unit 13, which synthesizes speech corresponding to the input text. The text-to-speech processing unit 13 consists broadly of a text analysis unit 14, which analyzes the input text and generates phonetic symbols, and a speech generation unit 15, which generates speech from the phonetic symbols.

[0018] The speech generation unit 15 synthesizes speech having the characteristics of a specific speaker by referring to a speech dictionary 16 that stores that speaker's voice quality and prosodic features. Details of how a specific speaker's voice quality and prosodic features are acquired and used in rule-based speech synthesis are given in, for example, the reference cited above (X. Huang et al., "Recent Improvements on Microsoft's Trainable Text-To-Speech System - Whistler," IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 959-962, 1997), so a description is omitted here.

[0019] A conventional text-to-speech synthesizer outputs the synthesized speech signal as is. As noted above, a recent speech synthesizer that can synthesize speech retaining a speaker's characteristics readily allows its high-quality speech to be misused to impersonate that speaker.

[0020] To avoid this problem, the speech synthesizer 12 of this embodiment passes the synthesized speech signal output from the text-to-speech processing unit 13 through a synthesized speech discrimination information adding unit 17, which adds to the synthesized speech signal discrimination information indicating that it differs from a speech signal uttered by a human (in this case, the specific speaker). The speech synthesizer 12 thus outputs a speech signal 18 to which the discrimination information has been added, and this signal is transmitted over a transmission path 19.

[0021] The synthesized speech discrimination device 20 determines whether the input speech signal arriving over the transmission path 19 is the synthesized speech signal 18 or a speech signal uttered by a human. The synthesized speech discrimination device 20 consists of a discrimination unit 21 and a discrimination result display unit 22. The discrimination unit 21 detects the presence or absence of the discrimination information in the input speech signal and, when the discrimination information is detected, determines that the input speech signal is the synthesized speech signal 18. The discrimination result is presented by the discrimination result display unit 22 visually or audibly.

[0022] Next, the synthesized speech discrimination information adding unit 17 will be described in detail. The discrimination information that the adding unit 17 adds to the synthesized speech signal is information for discriminating that the synthesized speech signal differs from a speech signal uttered by a human, and it can take various forms. For example, in the transmission of digital video and digital audio there is a technique called digital watermarking for preventing unauthorized copying. Digital watermarking basically adds other information digitally and directly to image or audio data, within a range in which the change to the image or audio is not perceived. If speech is handled only as a digital signal, this digital watermarking technique can be used to add the discrimination information for synthesized speech. In many cases, however, speech is transmitted as an analog signal, or travels through the air before being converted into an electrical signal and transmitted. The discrimination information must therefore not be lost through analog transmission, band limitation in the transmission path, or sound quality degradation.

[0023] In this embodiment, therefore, the synthesized speech discrimination information adding unit 17 is configured to modify the frequency characteristics of the synthesized speech signal. More specifically, the modification that the adding unit 17 applies is a change (attenuation or increase) in signal power in a frequency band outside the main frequency band of the synthesized speech signal. This can be realized, for example, by using a notch filter as the synthesized speech discrimination information adding unit 17.

[0024] FIG. 3 shows an example of the frequency characteristics of this notch filter, which strongly attenuates the signal power in a very narrow frequency band on the relatively high-frequency side. The center frequency of the notch filter, that is, the center frequency fd of the band in which the notch filter attenuates the signal power of the synthesized speech signal, is chosen to be about 4 to 5 kHz, which is higher than the 2 to 3 kHz main frequency band of the speech signal.
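The notch-filter marking described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation of the adding unit 17: the second-order (biquad) notch form, the 16 kHz sampling rate, fd = 4500 Hz, the quality factor `q`, and all function names are assumptions chosen for the example.

```python
import math

def notch_coefficients(fd_hz, fs_hz, q=30.0):
    """Second-order notch (common audio-EQ biquad form), normalized so a[0] == 1.
    q is an assumed parameter: a larger q gives a narrower notch."""
    w0 = 2.0 * math.pi * fd_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def biquad(b, a, x):
    """Direct-form I second-order filter applied to a list of samples."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def add_discrimination_info(synth, fd_hz=4500.0, fs_hz=16000.0, q=30.0):
    """Stand-in for adding unit 17: suppress a narrow band around fd_hz."""
    b, a = notch_coefficients(fd_hz, fs_hz, q)
    return biquad(b, a, synth)

def tone_power(x, f_hz, fs_hz):
    """Power of the component at f_hz, found by correlating with sine and cosine."""
    n = len(x)
    re = sum(v * math.cos(2.0 * math.pi * f_hz * i / fs_hz) for i, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * f_hz * i / fs_hz) for i, v in enumerate(x))
    return (re * re + im * im) / (n * n)
```

Applied to a test signal containing a 500 Hz tone and a 4500 Hz tone, `add_discrimination_info` should leave the 500 Hz component essentially unchanged while removing almost all power at 4500 Hz, which `tone_power` can confirm once the filter's start-up transient has died away.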

[0025] With such a notch filter, the synthesized speech signal output through the synthesized speech discrimination information adding unit 17 suffers little degradation in sound quality perceptible to the human ear. When the signal is examined, for example by frequency analysis, however, the signal power at the center frequency fd of the notch filter is found to be extremely small, so it can easily be determined that the speech was synthesized by machine.

[0026] The synthesized speech discrimination device 20 determines whether the input speech signal is a synthesized speech signal or a speech signal uttered by a human by detecting this drop in signal power at a specific frequency component caused by the notch filter. FIG. 4 shows a specific configuration example of the discrimination unit 21 in the synthesized speech discrimination device 20 for the case in which the adding unit 17 is a notch filter as described above.

[0027] The synthesized speech signal 18, which is the input speech signal, is split into two branches. One branch is input to a power calculator 32 through a band-pass filter (BPF) 31 having the same center frequency as the center frequency fd of the notch filter that constitutes the adding unit 17; the other branch is input directly to a second power calculator 33.

[0028] The power calculator 32 obtains the signal power of the frequency components of the synthesized speech signal 18 that pass through the band-pass filter 31. The power calculator 33 obtains the signal power of the synthesized speech signal 18 over its entire frequency band. A divider 34 divides the signal power obtained by the power calculator 32 by the full-band signal power obtained by the power calculator 33.

[0029] The quotient from the divider 34 is compared with a predetermined threshold by a threshold processing unit 35. If the quotient is smaller than the threshold, the input speech signal is judged to be a synthesized speech signal; otherwise it is judged to be a speech signal uttered by a human. The result is displayed by the discrimination result display unit 22 of FIG. 2.
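The chain of band-pass filter 31, power calculators 32 and 33, divider 34, and threshold processing unit 35 can be illustrated as follows. This is a minimal sketch under stated assumptions: the biquad band-pass form, the quality factor `q`, and the threshold value 0.01 are hypothetical and were picked for a toy harmonic test signal, not taken from the patent. In this toy setup the band-pass is made narrower than the marking notch, since a band-pass much wider than the notch would mostly see unattenuated neighboring components.

```python
import math

def biquad(b, a, x):
    """Direct-form I second-order filter applied to a list of samples."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def bandpass_coefficients(fd_hz, fs_hz, q):
    """Second-order band-pass with unity gain at fd_hz (common audio-EQ biquad form)."""
    w0 = 2.0 * math.pi * fd_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [alpha / a0, 0.0, -alpha / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def mean_power(x):
    return sum(v * v for v in x) / len(x)

def band_power_ratio(signal, fd_hz, fs_hz, q=45.0):
    b, a = bandpass_coefficients(fd_hz, fs_hz, q)
    p_band = mean_power(biquad(b, a, signal))  # BPF 31 and power calculator 32
    p_total = mean_power(signal)               # power calculator 33
    if p_total == 0.0:
        raise ValueError("silent input: the power ratio is undefined")
    return p_band / p_total                    # divider 34

def is_synthesized(signal, fd_hz, fs_hz, threshold=0.01):
    """Threshold processing unit 35; the threshold is a hypothetical value."""
    return band_power_ratio(signal, fd_hz, fs_hz) < threshold
```

Dividing by the full-band power makes the decision independent of the overall signal level, so the threshold only has to reflect how deep the notch is relative to ordinary speech energy near fd.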

[0030] Next, another embodiment of the present invention will be described. The basic configurations of the speech synthesizer 12 and the synthesized speech discrimination device 20 in this embodiment are the same as in the preceding embodiment and are as shown in FIG. 2.

[0031] In this embodiment, a notch filter whose center frequency varies according to a predetermined function of time is used as the synthesized speech discrimination information adding unit 17. The manner of variation differs according to the manufacturer name and model name of the speech synthesizer 12. For example, as shown in FIGS. 5 and 6, in Company A's Model X speech synthesizer the time function of the notch filter's center frequency is F(t); in the Model Y speech synthesizer it is a different function, G(t); and in Company B's Model Z speech synthesizer it is yet another function, H(t).
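A notch whose center frequency follows a time function can be sketched by recomputing the filter coefficients at every sample. The trajectories below are purely hypothetical stand-ins for F(t), G(t), and H(t), whose actual shapes are given only in the figures; the biquad form, the parameter `q`, and the assignment of Model Y to Company A are likewise assumptions made for this sketch.

```python
import math

def time_varying_notch(x, fs_hz, center_hz_at, q=15.0):
    """Notch whose center frequency follows center_hz_at(t), with t in seconds."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        w0 = 2.0 * math.pi * center_hz_at(n / fs_hz) / fs_hz
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        c = -2.0 * math.cos(w0) / a0  # value shared by the b1 and a1 coefficients
        yn = (xn + x2) / a0 + c * x1 - c * y1 - ((1.0 - alpha) / a0) * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

# Hypothetical center-frequency trajectories in Hz (placeholders for F, G, H).
def F(t):
    return 4000.0 + 1000.0 * t

def G(t):
    return 5000.0 - 1000.0 * t

def H(t):
    return 4500.0 + 300.0 * math.sin(2.0 * math.pi * 2.0 * t)

# Assumed mapping from (manufacturer, model) to trajectory, in the spirit of FIG. 5.
TRAJECTORIES = {
    ("Company A", "Model X"): F,
    ("Company A", "Model Y"): G,
    ("Company B", "Model Z"): H,
}
```

For example, `time_varying_notch(signal, 16000.0, TRAJECTORIES[("Company A", "Model X")])` would mark one second of audio with the sweeping notch assumed here for Model X.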

[0032] FIG. 7 shows the configuration of the discrimination unit 21 in the synthesized speech discrimination device 20 of this embodiment. In the discrimination unit 21 of FIG. 7, the speech signal received over the transmission path 19 is input to band-pass filters (BPF) 41, 42, and 43, whose center frequencies vary with time according to the same three functions as the notch filter of the adding unit 17. The output signals of the band-pass filters 41, 42, and 43 are input to average power calculators 44, 45, and 46, which calculate the average power within the utterance section. Meanwhile, an average power calculator 47 calculates the average power within the utterance section over the entire frequency band of the input speech signal, without band-pass filtering.

[0033] The dividers 48, 49, and 50 divide the average powers within the utterance section of the output signals of the band-pass filters 41, 42, and 43, calculated by the average power calculators 44, 45, and 46 respectively, by the average power within the utterance section over the entire frequency band of the input speech signal, calculated by the average power calculator 47. A minimum value detector 51 finds the minimum of the division results of the dividers 48, 49, and 50, and a threshold processing unit 52 compares this minimum with a predetermined threshold. In this way, the manufacturer name and model name corresponding to the function whose division result is below the threshold and is the smallest are determined. For this purpose, the discrimination unit 21 is assumed to hold a table, such as that shown in FIG. 5, that associates the functions F(t), G(t), and H(t) with manufacturer names and model names.
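The identification step (tracking band-pass filters, per-branch power ratios, minimum detection, thresholding, and table lookup) might look like the sketch below. Everything concrete in it is assumed for illustration: the trajectories standing in for F(t), G(t), and H(t), the table entries, the quality factor, and the threshold 0.003, which was chosen for a toy harmonic signal. The whole input is treated as the utterance section.

```python
import math

def tracking_biquad(x, fs_hz, center_hz_at, q, kind):
    """Second-order notch or band-pass whose center follows center_hz_at(t)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for n, xn in enumerate(x):
        w0 = 2.0 * math.pi * center_hz_at(n / fs_hz) / fs_hz
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        a1 = -2.0 * math.cos(w0) / a0
        a2 = (1.0 - alpha) / a0
        if kind == "notch":
            b0, b1, b2 = 1.0 / a0, a1, 1.0 / a0
        else:  # "bandpass", unity gain at the center frequency
            b0, b1, b2 = alpha / a0, 0.0, -alpha / a0
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def mean_power(x):
    return sum(v * v for v in x) / len(x)

# Hypothetical trajectories in Hz (placeholders for F, G, H) and lookup table.
def F(t):
    return 4000.0 + 1000.0 * t

def G(t):
    return 5000.0 - 1000.0 * t

def H(t):
    return 4500.0 + 300.0 * math.sin(2.0 * math.pi * 2.0 * t)

MODEL_TABLE = {
    F: ("Company A", "Model X"),
    G: ("Company A", "Model Y"),
    H: ("Company B", "Model Z"),
}

def identify(signal, fs_hz, threshold=0.003, q=60.0):
    """Return (manufacturer, model), or None when the input is judged to be human speech."""
    total = mean_power(signal)                         # average power calculator 47
    ratios = {}
    for trajectory in MODEL_TABLE:                     # BPFs 41-43, calculators 44-46
        band = tracking_biquad(signal, fs_hz, trajectory, q, "bandpass")
        ratios[trajectory] = mean_power(band) / total  # dividers 48-50
    best = min(ratios, key=ratios.get)                 # minimum value detection
    if ratios[best] < threshold:                       # threshold processing unit 52
        return MODEL_TABLE[best]
    return None
```

Taking the minimum before thresholding means that only the best-matching trajectory is ever reported, even if a second branch happens to dip below the threshold because two trajectories cross.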

[0034] The manufacturer name and model name determined by the discrimination unit 21 are displayed by the discrimination result display unit 22 of FIG. 2. If the division result does not fall below the threshold for any of the functions F(t), G(t), and H(t) shown in FIG. 6, the input speech signal is judged to be a speech signal uttered by a human, and this is indicated on the display unit 22.

[0035] According to this embodiment, it is possible not only to determine that the input speech signal is a synthesized speech signal but also to identify the manufacturer and model of the speech synthesizer that generated it, which makes the invention more effective for crime prevention.

[0036] Embodiments of the present invention have been described above, but the present invention is not limited to them. For example, the above embodiments use a change in signal power in a specific frequency band of the speech signal as the synthesized speech discrimination information. Other methods may be used instead, such as manipulating phase information, which humans do not readily perceive, so that it differs from that of human speech, or coarsely quantizing timing information (pause length, phoneme duration, utterance time, and so on), for example by synthesizing so that utterances fit exactly into durations that are multiples of 50 ms (a human cannot utter with such precise timing). It is also possible to change the center frequency of the notch filter for each phoneme, which can make the perceptual degradation less noticeable.
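The timing-quantization variant can be illustrated with integer durations in milliseconds. This is a hedged sketch only: the function names, the rounding rule, and the 2 ms detection tolerance are assumptions, and the text does not say how segment durations would be measured on the receiving side.

```python
def quantize_durations(durations_ms, step_ms=50):
    """Round each duration to the nearest positive multiple of step_ms."""
    return [max(step_ms, step_ms * round(d / step_ms)) for d in durations_ms]

def looks_quantized(durations_ms, step_ms=50, tolerance_ms=2):
    """True if every measured duration lies within tolerance_ms of a multiple of step_ms."""
    if not durations_ms:
        return False
    for d in durations_ms:
        remainder = d % step_ms
        if min(remainder, step_ms - remainder) > tolerance_ms:
            return False
    return True
```

With these definitions, `quantize_durations([83, 212, 147, 96])` gives `[100, 200, 150, 100]`, which `looks_quantized` accepts, while the original durations are rejected.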

[0037] In short, in a speech synthesizer whose original purpose is to bring the synthesized speech signal closer to human speech, the present invention is centered on deliberately making part of the synthesized speech signal differ from a human speech signal, within a range that humans do not readily perceive, so that the synthesized speech signal is easy to discriminate. Various modifications can be made without departing from this gist.

[0038]

Effects of the Invention

As described above, according to the present invention it is possible to determine whether an input speech signal is a synthesized speech signal, and thereby to prevent criminal misuse of speech synthesis, namely fraud against speaker identification or speaker verification systems through impersonation of another person, and the disruption of speech-based forensic investigation.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a rule-based speech synthesis system.

FIG. 2 is a block diagram showing the configuration of a speech synthesizer and a synthesized speech discrimination device according to an embodiment of the present invention.

FIG. 3 is a diagram showing the frequency characteristics of the notch filter used in the synthesized speech discrimination information adding unit of the embodiment.

FIG. 4 is a block diagram showing the configuration of the discrimination unit in the synthesized speech discrimination device of the embodiment.

FIG. 5 is a diagram explaining the time functions of the notch filter center frequency, corresponding to the manufacturer names and model names of speech synthesizers, used in the synthesized speech discrimination information adding unit according to another embodiment of the present invention.

FIG. 6 is a diagram showing the frequency characteristics corresponding to the various time functions of the notch filter used in the synthesized speech discrimination information adding unit of that embodiment.

FIG. 7 is a block diagram showing the configuration of the discrimination unit in the synthesized speech discrimination device of that embodiment.

[Explanation of symbols]

11 ... Text file
12 ... Speech synthesizer
13 ... Text-to-speech processing unit
14 ... Text analysis unit
15 ... Speech generation unit
16 ... Speech dictionary
17 ... Synthesized speech discrimination information adding unit
18 ... Synthesized speech signal
19 ... Transmission path
20 ... Synthesized speech discrimination device
21 ... Discrimination unit
22 ... Discrimination result display unit

Claims

1. A synthesized speech discrimination method characterized by outputting a speech signal in which discrimination information, for discriminating that the signal is a synthesized speech signal, is added to a synthesized speech signal generated by performing speech synthesis in accordance with an input character string.

2. The synthesized speech discrimination method according to claim 1, wherein the discrimination information is added to the synthesized speech signal by modifying a frequency characteristic of the synthesized speech signal.

3. The synthesized speech discrimination method according to claim 2, wherein the modification of the frequency characteristic applied to the synthesized speech signal is a change in signal power in a frequency band outside the main frequency band of the synthesized speech signal.

4. A synthesized speech discrimination method comprising: detecting the presence or absence of said discrimination information from an input speech signal; and determining, when said discrimination information is detected, that said input speech signal is said synthesized speech signal.

5. A speech synthesizer comprising: a speech synthesis unit that generates a synthesized speech signal by performing speech synthesis in accordance with an input character string; and a discrimination information adding unit that outputs a speech signal obtained by adding, to the synthesized speech signal, discrimination information for discriminating that it is a synthesized speech signal.

6. A synthesized speech discrimination device comprising: a discrimination unit that detects, from an input speech signal, the presence or absence of said discrimination information added to said synthesized speech signal in the speech synthesizer according to claim 5, and that determines, when said discrimination information is detected, that said input speech signal is said synthesized speech signal; and a display unit that displays a discrimination result of the discrimination unit.

7. A program for causing a computer to execute: a process of generating a synthesized speech signal by performing speech synthesis in accordance with an input character string; and a process of outputting a speech signal obtained by adding, to the synthesized speech signal, discrimination information for discriminating that it is a synthesized speech signal.

8. A program for causing a computer to execute: a process of detecting, from an input speech signal, the presence or absence of discrimination information added to a synthesized speech signal and determining, when the discrimination information is detected, that the input speech signal is the synthesized speech signal; and a process of displaying a result of the determination.