JPH06337700A - Voice synthesizer

Info

Publication number: JPH06337700A
Application number: JP5127275A
Authority: JP
Inventors: Kimu Kiyunho Rooken; キムキュンホローケン
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1993-05-28
Filing date: 1993-05-28
Publication date: 1994-12-06

Abstract

PURPOSE: To easily generate a synthesized sound that closely resembles the voice of a speaker. CONSTITUTION: The device comprises a voice input means 10 for inputting the voice of the speaker; a recognition processing means 11 that receives the input voice and recognizes it based on a prescribed acoustic segment network; an acoustic segment network storage means 13 that stores an acoustic segment network for each word to be synthesized; an acoustic segment network updating means 12 that compares the features of the input voice with the acoustic segment network and updates and registers the network according to the comparison result; a character input means 14 for inputting, as a character string, the word spoken by the speaker; a synthesis processing means 15 that synthesizes the input word by a prescribed voice synthesis process based on the acoustic segment network; and a synthesized sound output means 16 that outputs the synthesized sound.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice synthesizer that outputs synthesized speech similar to the voice of a user (speaker). The invention is ultimately applicable to information processing devices having a voice dialogue function.

[0002]

[Prior Art] Conventional voice recognition devices include Japanese Unexamined Patent Publication No. 63-223793, "Registered voice input/output device"; Japanese Unexamined Patent Publication No. 63-183498, "Registered voice input/output device"; and Japanese Unexamined Patent Publication No. 61-231600, "Voice recognition device". The first outputs the voice of a registered speaker as synthesized sound: the speaker's voice is converted into feature parameters in advance, the similarity between the input voice and the registered voice is calculated at the time of use, and the entry whose pre-registered feature parameters are most similar is output as the synthesized sound of the recognition result. The similarity is determined by the well-known PARCOR analysis.
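As background for the PARCOR analysis mentioned above: PARCOR (partial autocorrelation) coefficients are commonly obtained as the reflection coefficients of the Levinson-Durbin recursion applied to the autocorrelation of a speech frame. The following is a minimal sketch of that general technique, not the procedure of the cited publications. The function names and the sample autocorrelation values are assumptions made for illustration, and sign conventions for the coefficients differ between texts.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def reflection_coefficients(r, order):
    """Levinson-Durbin recursion; returns k_1..k_order for autocorrelation r."""
    a = [1.0] + [0.0] * order   # prediction polynomial, a[0] is fixed at 1
    err = float(r[0])           # prediction error energy
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:          # silent or degenerate frame: pad and stop
            ks.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return ks


# Illustrative autocorrelation values (hypothetical, not measured speech).
ks = reflection_coefficients([1.0, 0.5, 0.25], order=2)
```

Two frames could then be compared by a distance between their coefficient vectors; the choice of distance is left open here because the source does not state one.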

[0003] Whereas the first uses PARCOR analysis, the second determines the similarity using ADPCM synthesis. The third allows voice registration to be performed on behalf of a specific speaker: the registered words of the specific speaker are registered automatically in advance, using speech synthesized by a voice synthesis unit.

[0004] FIG. 4 is a block diagram of the main part of an example of a conventional voice synthesizer using acoustic segment networks. In the figure, 41 is an acoustic segment network storage unit, 42 is a character input unit, 43 is a synthesis processing unit, and 44 is a synthesized sound output unit. In this method, a plurality of acoustic segment networks are stored in advance in the acoustic segment network storage unit; when a voice is input from a speaker, the acoustic segment network most similar to that voice is extracted and the corresponding synthesized sound is generated.
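The selection step of this conventional method can be pictured as a nearest-neighbor lookup. The sketch below assumes, purely for illustration, that each prepared network is summarized by a fixed-length feature vector and that similarity is measured by Euclidean distance; the source specifies neither, and the names `prepared` and `most_similar` are hypothetical.

```python
import math


def most_similar(prepared, observed):
    """Return the name of the prepared feature vector closest to `observed`."""
    return min(prepared, key=lambda name: math.dist(prepared[name], observed))


# Hypothetical prepared networks, each reduced to a two-element feature vector.
prepared = {"network_A": [0.0, 0.0], "network_B": [1.0, 1.0]}
choice = most_similar(prepared, [0.9, 0.8])
```

Because the result is always one of the prepared entries, the output can only be as close to the speaker as the nearest prepared network, which is the limitation the invention addresses.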

[0005]

[Problems to Be Solved by the Invention] However, the conventional method described above merely extracts, from a plurality of acoustic segment networks prepared in advance, the network most similar to the speaker's voice, so many problems remain regarding similarity to the speaker's voice. An object of the present invention is to easily generate a synthesized sound that closely resembles the voice of the speaker.

[0006]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the present invention. As shown, the present invention is a voice synthesizer that generates a synthesized sound similar to the voice of a speaker, comprising: voice input means 10 for inputting the voice of the speaker; recognition processing means 11 for receiving the input voice from the voice input means and recognizing it based on a predetermined acoustic segment network; acoustic segment network storage means 13 for storing an acoustic segment network corresponding to each word to be synthesized; acoustic segment network updating means 12 for extracting features of the input voice, comparing those features with the acoustic segment network stored in the acoustic segment network storage means, and updating and registering the acoustic segment network based on the features of the input voice according to the comparison result; character input means 14 for inputting, as a word in the form of a character string, what the speaker says; synthesis processing means 15 for synthesizing the word input from the character input means by a predetermined voice synthesis process based on the acoustic segment network; and synthesized sound output means 16 for outputting the synthesized speech. The features of the speaker's voice are extracted and stored in the acoustic segment network storage means. Thereafter, when the synthesis processing means synthesizes a character string input from the character input means by the predetermined voice synthesis process, the synthesized sound is generated according to the updated acoustic segment network stored in the acoustic segment network storage means, so that a synthesized sound resembling the speaker's voice is produced.

[0007] The acoustic segment network updating means 12 includes a feature extraction unit 22 that extracts acoustic features of the voice input by the speaker, a feature determination unit 23 that compares the extracted features with the acoustic segment network held in the acoustic segment network storage means and makes a determination, and a feature registration unit 24 that updates and registers the acoustic segment network based on the determination result of the feature determination unit.

[0008] The voice synthesizer further includes an input analysis unit 312 that analyzes a sentence input by the speaker by voice, a knowledge base 35 that stores knowledge about a task, and a sentence generation unit 38 that refers to the knowledge base in response to input from the speaker and generates a predetermined sentence.

[0009]

[Operation] According to the present invention, when the speaker inputs a voice through voice input means such as a microphone, the acoustic segment network for that voice is stored. When a word spoken by the speaker is then input as a character string through character input means such as a keyboard, a synthesized sound similar to the speaker's voice can easily be generated according to the stored acoustic segment network.

[0010]

[Embodiments] FIG. 2 is a block diagram of an embodiment of the present invention. Reference numeral 20 is a voice input unit, 21 is a recognition processing unit, 22 is a feature extraction unit, 23 is a feature determination unit, 24 is a feature registration unit, 25 is an acoustic segment network storage unit, 26 is a character input unit, 27 is a synthesis processing unit, and 28 is a voice output unit.

[0011] The voice input unit 20 is, for example, a microphone; the character input unit 26 is, for example, a keyboard; and the voice output unit 28 is, for example, a speaker. The recognition processing unit 21 recognizes the pattern of the voice from the microphone. The feature extraction unit 22 extracts the acoustic features of the user's voice. The feature determination unit 23 compares the word networks stored in the acoustic segment network storage unit 25 with the acoustic features extracted by the feature extraction unit 22 and determines the difference. The feature registration unit 24 stores the features obtained by the feature determination unit 23 in the acoustic segment network. The acoustic segment network storage unit 25 stores the acoustic segment networks of the group of words to be recognized or synthesized. The synthesis processing unit 27 converts the character string input from the keyboard into synthesized sound.

[0012] The operation of this embodiment, configured as above, is described below. An acoustic segment may be a phoneme or a syllable. These segments are connected according to rules so as to represent the various pronunciation variations of each word to be recognized or synthesized, and the resulting acoustic segment network is then registered in the acoustic segment network storage unit 25.
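One way to picture such a network is as a small directed graph per word, in which each arc carries one segment and alternative paths encode alternative pronunciations. The sketch below is an assumption about the data structure, since the source does not define one; the class name `SegmentNetwork`, its methods, and the example word with an optional final vowel are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentNetwork:
    """Hypothetical per-word network: arcs carry segments (phonemes or syllables)."""
    word: str
    arcs: dict = field(default_factory=dict)  # node -> list of (segment, next node)
    start: int = 0
    end: int = 1

    def add_arc(self, src, segment, dst):
        self.arcs.setdefault(src, []).append((segment, dst))

    def pronunciations(self):
        """List every segment sequence from start to end (network assumed acyclic)."""
        results = []
        stack = [(self.start, [])]
        while stack:
            node, path = stack.pop()
            if node == self.end:
                results.append(path)
                continue
            for segment, nxt in self.arcs.get(node, []):
                stack.append((nxt, path + [segment]))
        return sorted(results)


# Illustrative word whose final vowel may be dropped in casual speech.
net = SegmentNetwork("desu")
net.add_arc(0, "d", 2)
net.add_arc(2, "e", 3)
net.add_arc(3, "s", 4)
net.add_arc(4, "u", 1)   # full form ends with the vowel
net.add_arc(3, "s", 1)   # variant that skips the final vowel
variants = net.pronunciations()
```

Here `variants` contains both the full and the shortened segment sequence, which is the kind of pronunciation variation the connection rules are meant to capture.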

[0013] Next, when a word input by voice by a particular user is recognized, the feature extraction unit 22 extracts the acoustic features of that user's voice and passes the result to the feature determination unit 23. The feature determination unit 23 compares the result with the word network stored in the acoustic segment network storage unit 25 and determines whether there is a difference. If there is a difference, the feature registration unit 24 registers it in the acoustic segment network storage unit 25, thereby updating the network.
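The compare-then-register flow of the feature determination unit and feature registration unit can be sketched as follows. This is a simplified illustration under stated assumptions: stored features are plain lists of numbers keyed by word, "a difference" means any element differing by more than a tolerance, and registration overwrites the stored features. None of these details is given in the source, and the function names are hypothetical.

```python
def differs(stored, observed, tolerance=1e-3):
    """Role of the feature determination unit: is there a difference?"""
    if len(stored) != len(observed):
        return True
    return any(abs(s - o) > tolerance for s, o in zip(stored, observed))


def update_network(store, word, observed, tolerance=1e-3):
    """Role of the feature registration unit: register only when a difference exists.

    Returns True if the store was updated, False if it was left unchanged.
    """
    stored = store.get(word)
    if stored is not None and not differs(stored, observed, tolerance):
        return False
    store[word] = list(observed)  # assumption: new features replace the old ones
    return True


# Hypothetical stored features for one word.
store = {"hai": [1.0, 2.0]}
unchanged = update_network(store, "hai", [1.0, 2.0])
changed = update_network(store, "hai", [1.2, 2.0])
```

After the second call the store holds the speaker's own features, so later synthesis that reads from the store reflects that speaker.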

[0014] When converting a character string entered by the user on the keyboard into synthesized sound, the synthesis processing unit 27 uses the updated acoustic segment network stored in the acoustic segment network storage unit 25, and thereby generates a synthesized sound close to the user's voice.

FIG. 3 is a block diagram of another embodiment of the present invention, showing a voice dialogue system. In the figure, 30 is a microphone serving as voice input means; 31 is a voice recognition processing unit; 32 is a CRT serving as means for displaying the recognition result; 33 is a feature extraction unit that extracts the acoustic features of the user's voice; 34 is a feature determination unit that compares the word networks stored in the acoustic segment network storage unit 36 with the acoustic features extracted by the feature extraction unit 33 and determines the difference; 37 is a feature registration unit that registers those features in the acoustic segment network storage unit 36; 36 is an acoustic segment network storage unit that stores the acoustic segment networks of the group of words to be recognized or synthesized; 39 is a keyboard serving as means for inputting the character string of a word; 310 is a synthesis processing unit that converts the character string input from the keyboard into synthesized sound; 311 is a speaker serving as means for outputting the synthesized speech; 35 is a knowledge base that stores knowledge about the task under discussion; 38 is a sentence generation unit that uses the knowledge base to generate a predetermined sentence in response to the user's input; and 312 is an input sentence analysis unit that analyzes a sentence input by the user by voice.

[0015] Examples of tasks include a summary of an international conference, reservation of Shinkansen tickets, and domestic travel guidance. The knowledge base stores knowledge about such a task expressed as rules, for example information about travel packages to the Tohoku region departing in December. It gives the system the ability to analyze a question asked by the user by voice and to generate an appropriate answer as a sentence.
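As a rough illustration of how a knowledge base and sentence generation unit might cooperate, the sketch below stores task knowledge as simple attribute records and builds an answer sentence from the records that match an analyzed query. The record layout, the placeholder entry "Package A", and the function name `generate_answer` are assumptions made for this example; the source states only that the knowledge is expressed as rules.

```python
# Hypothetical knowledge base: each entry is a record of attribute/value pairs.
KNOWLEDGE_BASE = [
    {"region": "Tohoku", "month": 12, "package": "Package A"},  # placeholder entry
]


def generate_answer(query, knowledge_base):
    """Build an answer sentence from the entries that match every query attribute."""
    matches = [entry for entry in knowledge_base
               if all(entry.get(key) == value for key, value in query.items())]
    if not matches:
        return "No matching package was found."
    names = ", ".join(entry["package"] for entry in matches)
    return f"Available packages: {names}."


# The query stands in for the output of the input sentence analysis unit.
answer = generate_answer({"region": "Tohoku", "month": 12}, KNOWLEDGE_BASE)
```

In the system of FIG. 3, the resulting sentence would then be passed to the synthesis processing unit so that the answer is spoken in a voice close to the user's.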

[0016]

[Effects of the Invention] As described above, the present invention makes it possible to easily generate a synthesized sound that closely resembles the voice of the speaker.

[Brief description of drawings]

FIG. 1 is a block diagram showing the principle of the present invention.

FIG. 2 is a block diagram of an embodiment of the present invention.

FIG. 3 is a block diagram of another embodiment of the present invention.

FIG. 4 is a block diagram of a conventional example.

[Explanation of symbols]

10, 20, 30: voice input unit
11, 21, 31: recognition processing unit
12: acoustic segment network updating unit
13, 25, 36: acoustic segment network storage unit
14, 26, 39, 42: character input unit
15, 27, 43, 310: synthesis processing unit
16, 28, 44, 311: synthesized sound output unit
22, 33: feature extraction unit
23, 34: feature determination unit
24, 37: feature registration unit
35: knowledge database
38: sentence generation unit
312: input sentence analysis unit

Claims

1. A voice synthesizer for generating a synthesized sound similar to the voice of a speaker, comprising: voice input means (10) for inputting the voice of the speaker; recognition processing means (11) for receiving the input voice from the voice input means and recognizing the input voice based on a predetermined acoustic segment network; acoustic segment network storage means (13) for storing an acoustic segment network corresponding to each word to be synthesized; acoustic segment network updating means (12) for extracting features of the input voice, comparing the features with the acoustic segment network stored in the acoustic segment network storage means, and updating and registering the acoustic segment network based on the features of the input voice according to the comparison result; character input means (14) for inputting, as a word in the form of a character string, what the speaker says; synthesis processing means (15) for synthesizing the word input from the character input means by a predetermined voice synthesis process based on the acoustic segment network; and synthesized sound output means (16) for outputting the synthesized speech; wherein the features of the speaker's voice are extracted and stored in the acoustic segment network storage means, and thereafter, when the synthesis processing means synthesizes a character string input from the character input means by the predetermined voice synthesis process, the synthesized sound is generated according to the updated acoustic segment network stored in the acoustic segment network storage means, so that a synthesized sound resembling the voice of the speaker is generated.

2. The voice synthesizer according to claim 1, wherein the acoustic segment network updating means (12) comprises: a feature extraction unit (22) that extracts acoustic features of the voice input by the speaker; a feature determination unit (23) that compares the extracted features with the acoustic segment network held in the acoustic segment network storage means and makes a determination; and a feature registration unit (24) that updates and registers the acoustic segment network based on the determination result of the feature determination unit.

3. The voice synthesizer according to claim 1, further comprising: an input analysis unit (312) that analyzes a sentence input by the speaker by voice; a knowledge base (35) that stores knowledge about a task, the task being information such as recent reservations and guidance; and a sentence generation unit (38) that refers to the knowledge base in response to input from the speaker and generates a predetermined sentence.