JP4433734B2 - Speech analysis / synthesis apparatus, speech analysis apparatus, and program

Info

Publication number: JP4433734B2
Application number: JP2003320310A
Authority: JP
Inventors: 克瀬戸口
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2003-09-11
Filing date: 2003-09-11
Publication date: 2010-03-17
Anticipated expiration: 2023-09-11
Also published as: JP2005084660A

Abstract

PROBLEM TO BE SOLVED: To provide a technique that makes it possible to always synthesize an appropriate speech waveform by using parameters extracted by analyzing an original speech waveform.

SOLUTION: In an analysis phase, the speech data output from an A/D converter 8 is analyzed frame by frame. The extracted parameters include PARCOR coefficients and a voiced sound ratio indicating the degree to which the speech represented by the speech data is voiced. In a synthesis phase, a driving sound source waveform is generated by mixing a Rosenberg wave generated at a designated pitch with a white noise wave, which has no pitch, in accordance with the voiced sound ratio. A synthesis filter unit 45 generates one frame of speech data from the driving sound source waveform and the PARCOR coefficients.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a technique for analyzing a speech waveform and synthesizing a speech waveform using the analysis result.

A speech analysis/synthesis apparatus, which analyzes a speech waveform and synthesizes a speech waveform using the analysis result, is also used to apply acoustic effects such as changing the voice quality of an input speech waveform.
The voice quality is changed, for example, by manipulating the formants of the speech (for example, a human voice), or by passing the speech through bandpass filters (BPFs) to determine an amplitude value for each band and then passing the speech through a filter constructed from those amplitude values. The latter method is mainly used in vocoders (speech analysis/synthesis apparatuses) that sound a human voice at a pitch designated by a keyboard or the like. One example of a speech analysis/synthesis apparatus adopting the latter method is described in Patent Document 1.

In the conventional speech analysis/synthesis apparatus described in Patent Document 1, an input speech waveform is analyzed to extract parameters, a driving sound source waveform corresponding to a designated pitch is generated, and an output speech waveform is synthesized from the extracted parameters and the driving sound source waveform. A pulse waveform, such as a triangular wave or a waveform with a varying pulse width, was generated as the driving sound source waveform. With such a pulse waveform, however, natural human speech cannot be reproduced. The apparatus also determines whether the input speech is voiced and, according to the result, selects either the driving sound source waveform or a white noise waveform for unvoiced sound for use in synthesis. Switching in this way can make the synthesized speech waveform change abruptly and sound unnatural. Even if the two are simply crossfaded according to the determination result, the mix does not necessarily follow the changes in the original speech waveform appropriately, so an unnatural synthesized waveform cannot be reliably avoided.

Speech that sounds unnatural usually gives listeners a sense of incongruity, all the more so when it is the voice of someone familiar. For this reason, it is very important that the synthesized speech always be appropriate, that is, that it always sound natural.
The parameters may also be extracted for purposes such as speech encoding. In such encoding, it is desirable to reproduce the original information as faithfully as possible. From this viewpoint as well, it is very important that the synthesized speech always sound natural.
Patent Document 1: JP 09-204185 A

An object of the present invention is to provide a technique that makes it possible to always synthesize an appropriate speech waveform using parameters extracted by analyzing an original speech waveform.

The speech analysis/synthesis apparatus of the present invention, which analyzes a first speech waveform and synthesizes a second speech waveform using the analysis result, comprises: first analysis means for analyzing the first speech waveform frame by frame to extract parameters; second analysis means for analyzing the first speech waveform frame by frame, calculating autocorrelation values of the frequency amplitude values of the first speech waveform in each frame, calculating a standard deviation from the autocorrelation values of each frame, and extracting a voiced sound ratio whose value increases as the standard deviation decreases; pitch designation means for designating a pitch; sound source waveform generation means for generating, at the pitch designated by the pitch designation means, a sound source waveform that simulates a vocal cord sound source waveform; other sound source waveform generation means for generating another sound source waveform that has no pitch; and speech waveform synthesis means for generating a driving sound source waveform by mixing the sound source waveform with the other sound source waveform such that the mixing ratio of the sound source waveform increases as the extracted voiced sound ratio increases, and for synthesizing the second speech waveform using the driving sound source waveform and the parameters.

Preferably, the parameters and the voiced sound ratio obtained from the first speech waveform are stored in storage means capable of storing a plurality of data groups, each consisting of the parameters and the voiced sound ratio, and the speech waveform synthesis means synthesizes the second speech waveform using one of the data groups stored in the storage means.

Preferably, the first analysis means extracts the frequency amplitude values and phase information of the first speech waveform in each frame, calculates autocorrelation values of the frequency amplitude values, and extracts, as one of the parameters, the pitch of the first speech waveform from the phase information and the frequency amplitude value at which the autocorrelation value is maximized.

Preferably, when a pitch change of at least a predetermined value between the pitch extracted in the current frame and the pitch extracted in the preceding frame occurs a predetermined number of times in succession, the first analysis means adopts the pitch extracted in the preceding frame as the parameter; otherwise, it adopts the pitch extracted in the current frame as the parameter.

The speech analysis apparatus of the present invention, which extracts parameters from a speech waveform, comprises: speech waveform acquisition means for acquiring a speech waveform; first analysis means for analyzing the acquired speech waveform frame by frame and extracting, as a parameter, filter coefficients used in a synthesis filter for synthesizing the speech waveform; and second analysis means for analyzing the acquired speech waveform frame by frame, calculating autocorrelation values of the frequency amplitude values of the speech waveform in each frame, calculating a standard deviation from the autocorrelation values of each frame, and extracting, as a parameter for generating the sound source waveform input to the synthesis filter, a voiced sound ratio whose value increases as the standard deviation decreases.

The program of the present invention provides the functions for implementing the speech analysis/synthesis apparatus and the speech analysis apparatus, respectively.

The present invention analyzes the first speech waveform and, in addition to parameters such as filter coefficients, extracts a voiced sound ratio indicating the degree to which the speech represented by the first speech waveform is voiced.
Voiced and unvoiced sounds are not sharply distinguished, and speech can be in a state intermediate between the two. Mixing the sound source waveform for voiced sound with the sound source waveform for unvoiced sound on the basis of the voiced sound ratio therefore yields a driving sound source waveform better suited to reproducing such intermediate states. More appropriate speech data can then be synthesized from that driving sound source waveform, and the speech sounded from the speech data becomes more natural.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
<First embodiment>
FIG. 1 is a block diagram of an electronic musical instrument equipped with the speech analysis/synthesis apparatus according to this embodiment.
As shown in FIG. 1, the electronic musical instrument comprises: a CPU 1 that controls the entire instrument; a keyboard 2 having a plurality of keys; a switch unit 3 having various switches; a ROM 4 storing the program executed by the CPU 1 and various control data; a RAM 5 used as work memory by the CPU 1; a display unit 6 including, for example, a liquid crystal display (LCD) and a plurality of LEDs; an A/D converter 8 that performs A/D conversion of the analog audio signal input from a microphone 7 connected to a terminal (not shown) and outputs the resulting speech data; a musical sound generation unit 9 that generates waveform data for sounding musical tones in accordance with instructions from the CPU 1; a D/A converter 10 that performs D/A conversion of the waveform data generated by the musical sound generation unit 9 and outputs an analog audio signal; an amplifier 11 that amplifies the audio signal; a speaker 12 that converts the amplified audio signal into sound; and an external storage device 13 that accesses, for example, a removable storage medium. The CPU 1, keyboard 2, switch unit 3, ROM 4, RAM 5, display unit 6, A/D converter 8, musical sound generation unit 9, and external storage device 13 are connected to one another by a bus. The external storage device 13 is, for example, a flexible disk drive, a CD-ROM drive, or a magneto-optical disk drive. The switch unit 3 includes, in addition to the various switches operated by the user, a detection circuit for detecting changes in the state of those switches.

The CPU 1 realizes automatic performance by processing sequence data for performance playback (for example, a Standard MIDI File (SMF)) stored in the ROM 4 or in a storage medium accessible by the external storage device 13. For this purpose, as shown in FIG. 2, the switch unit 3 includes a start (START) switch 21 for instructing the start of automatic performance, a stop (STOP) switch 22 for instructing its end, and a tempo (TEMPO) switch 23 for instructing a change of the tempo value during automatic performance. The tempo switch 23 consists of an up (UP) switch 23a for raising the tempo value and a down (DOWN) switch 23b for lowering it. Although not shown, the switch unit 3 also includes an analysis switch for instructing analysis of the speech input from the microphone 7, a synthesis switch for instructing speech synthesis using the analysis result, a pitch selection switch for selecting whether pitch designation by operation of the keyboard 2 is enabled during synthesis, and a file selection switch for selecting which of the saved analysis results is used for synthesis.

In the electronic musical instrument configured as described above, the speech analysis/synthesis apparatus according to this embodiment is implemented so that it can apply, to speech input from the microphone 7, an acoustic effect that converts the pitch of the speech to a designated pitch. Speech may also be input via the external storage device 13 or via a communication network such as a LAN or a public network.

FIG. 3 is a functional configuration diagram of the speech analysis/synthesis apparatus according to this embodiment.
A speech waveform with an acoustic effect applied is generated by analyzing the original speech waveform to extract parameters and then using the extracted parameters together with a generated sound source waveform. Accordingly, as shown in FIG. 3, the functional configuration of the speech analysis/synthesis apparatus is divided into a part for the analysis stage (analysis phase) and a part for the stage in which a speech waveform is synthesized from the analysis result (synthesis phase). The analysis-phase part corresponds to a speech encoder, that is, to a speech analysis apparatus, and the synthesis-phase part corresponds to a speech decoder, that is, to a speech synthesis apparatus.

First, the configuration for the analysis phase and the operation of each unit will be described.
The A/D converter (ADC) 8 shown in FIG. 3 converts the analog audio signal output from the microphone 7 into digital speech data. For example, A/D conversion is performed at a sampling frequency of 22,050 Hz with 16-bit resolution. Hereinafter, the speech data obtained by this A/D conversion is called "original speech data" or "original waveform data" for convenience, and the speech input to the microphone 7 is called "original speech".

The frame extraction unit 31 receives the original speech data and extracts frames by cutting them out at a predetermined size. The size, that is, the number of speech data samples, is 512, for example. As shown in FIG. 6, consecutive frames overlap by an overlap factor, which is the frame size divided by the hop size. In the example of FIG. 6 the overlap factor is 8, so the hop size is 64 (512 / 64 = 8). Each extracted frame (one frame size of speech data) is sent to the level calculation unit 32, the linear prediction analysis unit 33, and the pitch extraction/voiced sound ratio calculation unit 34.
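The framing described above can be sketched as follows. This is a minimal illustration, assuming the input is a plain list of samples and that trailing samples that do not fill a whole frame are dropped; the function name `extract_frames` is hypothetical and does not come from the patent.

```python
def extract_frames(samples, frame_size=512, overlap_factor=8):
    """Cut overlapping frames; hop size = frame_size / overlap_factor."""
    hop = frame_size // overlap_factor  # 512 / 8 = 64 samples
    frames = []
    # Assumption: an incomplete final frame is discarded.
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(samples[start:start + frame_size])
    return frames
```

With the default values each frame shares 448 of its 512 samples with the next one.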

The level calculation unit 32 calculates, for each frame it receives, a level value (level parameter) indicating the volume level. The value is calculated, for example, by computing the sum of squares of the original speech data in the frame and dividing that sum by the frame size.
The linear prediction analysis unit 33 performs linear prediction analysis on each frame it receives and calculates PARCOR coefficients. For speech data with a sampling frequency of 22,050 Hz, an order of 26 or 28 is appropriate for linear prediction analysis (LPC: Linear Predictive Coding). This embodiment uses order 26, so 26 PARCOR coefficients are calculated from each frame using the Levinson-Durbin algorithm.
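The two per-frame calculations above can be sketched as below. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, the autocorrelation method is assumed for the linear prediction analysis, and the PARCOR sign convention (which differs between textbooks) is assumed to be the one where the prediction-error filter is A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.

```python
def frame_level(frame):
    """Level value: sum of squares divided by the frame size."""
    return sum(s * s for s in frame) / len(frame)

def autocorrelation(frame, max_lag):
    """Time-domain autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def parcor_coefficients(frame, order=26):
    """Levinson-Durbin recursion returning the reflection (PARCOR) coefficients."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order          # prediction-error filter coefficients
    err = r[0]                         # prediction error energy
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:                 # silent or fully predictable frame
            ks.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return ks
```

For a non-silent frame the autocorrelation method keeps every coefficient inside (-1, 1), which is what makes the corresponding synthesis filter stable.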

The pitch extraction/voiced sound ratio calculation unit 34 calculates the pitch frequency (repetition period) of the original speech and a voiced sound ratio indicating the degree to which the original speech is voiced.
FIG. 4 is a functional configuration diagram of the calculation unit 34. The functional configuration of the calculation unit 34 and the operation of each of its parts are described in detail below with reference to FIG. 4.

A frame sent from the frame extraction unit 31 is input to the high-pass filter (HPF) unit 51. The HPF unit 51 removes the low-frequency components and outputs the frame to the windowing unit 52. The windowing unit 52 multiplies the frame (one frame size of speech data) by a window function, for example a Hanning window.
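The windowing step can be sketched as follows. The text does not say whether the symmetric or the periodic form of the Hanning (Hann) window is used, so the periodic form here is an assumption; the function names are hypothetical, and the high-pass filter is omitted because its characteristics are not given.

```python
import math

def hann_window(size):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / size)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]

def apply_window(frame, window):
    """Multiply a frame sample by sample with the window."""
    return [s * w for s, w in zip(frame, window)]
```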

The fast Fourier transform (FFT) unit 53 performs an FFT on the frame after multiplication by the window function. The frequency amplitude/phase calculation unit 54 obtains the frequency amplitude values and phases from the resulting real and imaginary values. The frequency amplitude autocorrelation calculation unit 55 computes the autocorrelation of the frequency amplitude values to obtain correlation values. Denoting the frequency amplitude values by x_i, the correlation value r_n is calculated by an autocorrelation formula over x_i with lag n (the formula itself is not reproduced in this text). Because the indices of 256 and above, in the second half of the frame, mirror the first half, only the 256 frequency amplitude values x_i of the first half, with indices 0 to 255, need to be calculated.
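A sketch of this chain (FFT, amplitude and phase, autocorrelation of the amplitudes) is given below. Since the autocorrelation formula is not available in this text, the plain unnormalized form r_n = sum of x_i * x_(i+n) over the first 256 bins is an assumption, as are all function names; the small recursive radix-2 FFT is only there to keep the example self-contained.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def amplitude_and_phase(spectrum, bins=256):
    """Amplitude and phase of the first `bins` FFT bins (the non-mirrored half)."""
    amps = [abs(c) for c in spectrum[:bins]]
    phases = [cmath.phase(c) for c in spectrum[:bins]]
    return amps, phases

def amplitude_autocorrelation(amps, max_lag):
    """Assumed form: r[n] = sum_i amps[i] * amps[i + n], for n = 0..max_lag."""
    count = len(amps)
    return [sum(amps[i] * amps[i + lag] for i in range(count - lag))
            for lag in range(max_lag + 1)]
```

For a signal whose harmonics fall on bins 4, 8, 12, and so on, the largest correlation value at a nonzero lag appears at lag 4, the spacing of the harmonics.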

Taking the autocorrelation of the frequency amplitude values x_i makes the correlation value r_n large for speech and musical tones that have a harmonic structure, that is, overtone components. This allows accurate pitch extraction that is not affected by noise and the like. When the original speech is assumed to be a human voice, only the range of pitches that such a voice can have needs to be considered, so it is sufficient to calculate the correlation values r_n for index values 1 ≤ n ≤ 13, which correspond to 40 to 500 Hz.
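The index range can be checked with a short sketch. Assuming a 22,050 Hz sampling frequency and a 512-point FFT, adjacent indices are 22,050 / 512 ≈ 43.07 Hz apart, so indices 1 to 13 span roughly 43 to 560 Hz, which approximately covers the stated 40 to 500 Hz. The function names are hypothetical.

```python
def bin_frequency(index, sample_rate=22050, frame_size=512):
    """Frequency in Hz that an FFT index (or autocorrelation lag) corresponds to."""
    return index * sample_rate / frame_size

def pick_pitch_index(r, lowest=1, highest=13):
    """Lag n in [lowest, highest] with the largest correlation value r[n]."""
    return max(range(lowest, highest + 1), key=lambda n: r[n])
```

Lag 0 is excluded because r_0 is the total energy of the amplitude spectrum and is always the largest value.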

The value of n that maximizes the correlation value r_n is the index of the frequency amplitude value x_i corresponding to the pitch frequency. In practice, however, a frequency calculated from that index alone is often not very accurate. To extract the pitch frequency of the original speech with high accuracy, an accurate pitch frequency is therefore calculated by a phase difference measurement method. This calculation is performed by the phase-difference-measurement frequency calculation unit 57 (hereinafter abbreviated as the "frequency calculation unit"). The frequency calculation unit 57 calculates the pitch frequency with reference to the phase data 56 of the previous frame.
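The details of the phase difference measurement are not given in this passage. The sketch below uses the standard phase-vocoder estimator as a stand-in, which is an assumption: it compares the phase advance of one bin between two frames spaced one hop apart with the advance expected for the bin's center frequency. All names and default values are illustrative.

```python
import math

def wrap_phase(phase):
    """Wrap a phase in radians into the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi

def refine_frequency(bin_index, phase_prev, phase_cur,
                     sample_rate=22050, frame_size=512, hop=64):
    """Estimate the frequency near `bin_index` from the inter-frame phase difference."""
    # Phase advance over one hop for a sinusoid exactly at the bin center.
    expected = 2.0 * math.pi * bin_index * hop / frame_size
    # Deviation of the measured advance from the expected one.
    deviation = wrap_phase(phase_cur - phase_prev - expected)
    # Convert the deviation into a fractional bin offset, then into Hz.
    true_bin = bin_index + deviation * frame_size / (2.0 * math.pi * hop)
    return true_bin * sample_rate / frame_size
```

With a hop of 64 samples the estimate is unambiguous within about ±172 Hz of the bin center, far wider than the 43 Hz bin spacing.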

Pitch extraction based on the autocorrelation of the frequency amplitude values x_i can extract the pitch with high accuracy for ordinary spoken speech. For singing voices or voices produced with unusual articulation, however, the pitch cannot always be extracted with high accuracy. The pitch correction unit 58 therefore corrects the pitch frequency calculated by the frequency calculation unit 57. The corrected pitch frequency is output from the pitch extraction/voiced sound ratio calculation unit 34.

The autocorrelation standard deviation calculation unit 59 (hereinafter abbreviated as the "standard deviation calculation unit") receives the correlation values r_n from the frequency amplitude autocorrelation calculation unit 55 and calculates their standard deviation. As is well known, the standard deviation indicates the degree of dispersion (scatter). The smaller the standard deviation, the more the power is concentrated at specific frequencies, and the more likely the sound is voiced. Conversely, the larger the standard deviation, the more evenly the power is spread over various frequencies, and the more likely the sound is unvoiced. Based on this observation, the standard deviation calculation unit 59 calculates from the standard deviation a voiced sound ratio mix (0 ≤ mix ≤ 1) indicating the degree to which the original speech is voiced. The voiced sound ratio mix is output from the pitch extraction/voiced sound ratio calculation unit 34.
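The passage states only that mix lies between 0 and 1 and grows as the standard deviation shrinks; the actual mapping is not given. The sketch below therefore uses a clamped linear mapping between two hypothetical thresholds, `sd_low` and `sd_high`, purely to illustrate the direction of the relationship. Whether the correlation values are normalized beforehand is also unknown, and here they are used as given.

```python
import statistics

def voiced_ratio(correlations, sd_low, sd_high):
    """Map the standard deviation of r_n to a voiced sound ratio in [0, 1].

    sd_low and sd_high are hypothetical tuning thresholds:
    sd <= sd_low gives mix = 1 (fully voiced), sd >= sd_high gives mix = 0.
    """
    sd = statistics.pstdev(correlations)   # population standard deviation
    mix = 1.0 - (sd - sd_low) / (sd_high - sd_low)
    return min(1.0, max(0.0, mix))
```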

In this way, as shown in FIG. 5, the level calculation unit 32, the linear prediction analysis unit 33, and the pitch extraction/voiced sound ratio calculation unit 34 calculate and output, for each frame, the level value, the PARCOR coefficients, and the voiced sound ratio mix and pitch frequency, respectively. These are temporarily stored in the parameter buffer 35 as the parameters representing the analysis result and are saved as a parameter file 36 as needed. The parameter buffer 35 and the parameter file 36 are held in, for example, the RAM 5 or a storage medium accessible by the external storage device 13. For convenience, the following description assumes that they are held in the RAM 5.

The analysis-phase configuration, that is, the units 31 to 34 that analyze the original speech data output from the A/D converter 8, is realized by, for example, the CPU 1, the switch unit 3, the ROM 4, the RAM 5, and the external storage device 13. The synthesis-phase configuration, that is, the units 41 to 47 that synthesize a speech waveform using the parameters stored in the parameter buffer 35 and output it to the D/A converter 10, is realized by, for example, the CPU 1, the keyboard 2, the switch unit 3, the ROM 4, the RAM 5, the musical sound generation unit 9, and the external storage device 13.

Next, the configuration for the synthesis phase and the operation of each unit will be described.
For speech waveform synthesis, the parameter file 36 selected by the user is read out and stored in the parameter buffer 35.
The changeover switch 41 outputs to the Rosenberg wave generation unit 42 either the pitch frequency stored as a parameter in the parameter buffer 35 or the pitch frequency designated by the user operating the keyboard 2. Which of the two is output depends on the operation of the pitch selection switch.

The Rosenberg wave generation unit 42 generates a Rosenberg wave such as that shown in FIG. 7 at the pitch input from the changeover switch 41. The Rosenberg wave is a waveform that simulates a vocal cord sound source waveform. The OQ (open quotient) parameter, which controls the length of the period during which the vocal cords are open, is fixed at 0.5 in this embodiment, and the amplitude parameter AV is set to 1. The white noise generation unit 43 generates a white noise wave. As is well known, a white noise wave, unlike a Rosenberg wave, has no pitch.
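The exact pulse shape of FIG. 7 is not available in this text. As an illustration only, the sketch below uses one common polynomial glottal pulse in the Rosenberg-Klatt style, g(x) = AV * (27/4) * (x^2 - x^3) over the open phase, scaled so that its peak equals AV; this choice, the function name, and the default sampling rate are all assumptions.

```python
def rosenberg_wave(pitch_hz, num_samples, sample_rate=22050, oq=0.5, av=1.0):
    """Generate a glottal pulse train at pitch_hz (assumed polynomial pulse shape)."""
    wave = []
    for n in range(num_samples):
        # Position within the current glottal cycle, in [0, 1).
        phase = (n * pitch_hz / sample_rate) % 1.0
        if phase < oq:
            x = phase / oq                      # 0..1 across the open phase
            wave.append(av * 6.75 * (x * x - x * x * x))
        else:
            wave.append(0.0)                    # closed phase
    return wave
```

With `oq=0.5` the pulse occupies the first half of each period and the second half is zero; the peak is reached two thirds of the way through the open phase.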

The Rosenberg wave generated by the Rosenberg wave generation unit 42 is multiplied by the voiced sound ratio mix stored in the parameter buffer 35 and stored in the driving sound source buffer 44. The white noise wave generated by the white noise generation unit 43 is multiplied by the value obtained by subtracting the voiced sound ratio mix from 1 (= 1 − mix) and added to the Rosenberg wave stored in the driving sound source buffer 44. In this way, a driving sound source waveform in which the two waveforms are mixed according to the voiced sound ratio mix is prepared in the driving sound source buffer 44.
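The mixing rule above amounts to excitation = mix * voiced + (1 − mix) * noise, sample by sample. A minimal sketch follows; the function names and the uniform noise distribution are assumptions, since the text does not specify how the white noise is produced.

```python
import random

def white_noise(num_samples, seed=None):
    """Uniform white noise in [-1, 1]; the distribution is an assumption."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

def mix_excitation(voiced, noise, mix):
    """Driving sound source: mix * voiced + (1 - mix) * noise."""
    return [mix * v + (1.0 - mix) * u for v, u in zip(voiced, noise)]
```

Setting `mix` to 1 reproduces the pitched wave alone and setting it to 0 reproduces the noise alone, so a hard voiced/unvoiced switch is the special case in which mix only ever takes those two values.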

The Rosenberg wave is the driving sound source waveform for voiced sound, and the white noise wave is the driving sound source waveform for unvoiced sound. However, voiced and unvoiced sounds are not sharply distinguished, and speech can be in a state intermediate between the two. For this reason, the voiced sound ratio mix is calculated and the Rosenberg wave and the white noise wave are mixed according to it, producing a driving sound source waveform better suited to reproducing such intermediate states. Generating the driving sound source waveform in this way makes it possible to synthesize a speech waveform that always sounds natural.

The synthesis filter unit 45 generates one frame of speech waveform (speech data) from the one frame of driving sound source waveform prepared in the driving sound source buffer 44 and the PARCOR coefficients stored in the parameter buffer 35. The driving sound source waveform is used as the input, and the PARCOR coefficients are used as the filter coefficients. In addition, the unit calculates the sum of squares of all the generated speech data and divides it by the frame size (512 here), divides the level value in the parameter buffer 35 (the value calculated in the same way from the original speech data) by that quotient, and outputs the square root of the result to the multiplier 46 as a multiplication coefficient.
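A sketch of these two steps is shown below: an all-pole lattice filter driven by the excitation, followed by the level-matching coefficient. The lattice structure, the sign convention (prediction-error filter A(z) = 1 + a_1 z^-1 + ...), and the resetting of the filter state for every call are assumptions; the text only states that the PARCOR coefficients serve as filter coefficients. The function names are hypothetical.

```python
import math

def lattice_synthesize(excitation, ks):
    """All-pole lattice filter using reflection (PARCOR) coefficients ks[0..M-1]."""
    order = len(ks)
    b = [0.0] * (order + 1)        # b[m] holds the backward signal b_m[n-1]
    out = []
    for x in excitation:
        f = x                      # forward signal at the top stage
        for m in range(order, 0, -1):
            f = f - ks[m - 1] * b[m - 1]        # f_{m-1}[n]
            b[m] = ks[m - 1] * f + b[m - 1]     # b_m[n], used at the next sample
        b[0] = f
        out.append(f)
    return out

def level_gain(synth, target_level):
    """sqrt(target_level / (sum of squares of synth / frame size))."""
    level = sum(s * s for s in synth) / len(synth)
    if level == 0.0:
        return 0.0                 # assumption: a silent frame gets zero gain
    return math.sqrt(target_level / level)
```

Multiplying the synthesized frame by `level_gain(...)` makes its mean-square level equal to the level value measured on the original frame.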

The multiplier 46 multiplies the speech data generated by the synthesis filter unit 45 by the multiplication coefficient. As shown in FIG. 8, the frame addition unit 47 multiplies the one frame of speech data output by the multiplier 46 by a Hanning window and superimposes the result on other frames (overlap-add) with the same overlap factor as in the analysis phase. The speech data obtained after this superposition is output to the D/A converter (DAC) 10.
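The windowing and overlap-add performed by the frame addition unit can be sketched as follows. The periodic form of the Hanning window and the function names are assumptions; the text only states that a Hanning window and a common overlap factor are used.

```python
import math


def hann(n):
    """Periodic Hanning window of length n (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(frames, frame_size, overlap_factor):
    """Window each frame and add it into the output at hop-sized offsets."""
    hop = frame_size // overlap_factor
    window = hann(frame_size)
    out = [0.0] * (hop * (len(frames) - 1) + frame_size)
    for index, frame in enumerate(frames):
        start = index * hop
        for i in range(frame_size):
            out[start + i] += frame[i] * window[i]
    return out


# Six constant frames, frame size 8, overlap factor 4 (hop of 2 samples).
result = overlap_add([[1.0] * 8 for _ in range(6)], 8, 4)
```

With this window, frames of constant value sum to a constant in the region where all overlapping frames are present, which is why the same overlap factor must be used at every frame boundary.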

The frame addition unit 47 is realized by, for example, the CPU 1, the ROM 4, and the RAM 5. Superposition with other frames is performed using, for example, an output buffer, which is an area provided in the RAM 5. For this reason, in addition to generating waveform data for sounding a musical tone in accordance with instructions from the CPU 1 and outputting it to the D/A converter 10, the musical tone generation unit 9 can pass speech data read from the output buffer and sent to it on to the D/A converter 10 unchanged.

As described above, the speech analysis/synthesis apparatus according to this embodiment analyzes original speech data, synthesizes speech data using the analysis result, and emits the synthesized speech from the speaker 12. The operation of the electronic musical instrument that realizes this speech analysis/synthesis apparatus is described in detail below with reference to the flowcharts shown in FIGS. 9 to 18.

FIG. 9 is a flowchart of the overall process. The overall process is described in detail first, with reference to FIG. 9. The overall process is realized by the CPU 1 executing a program stored in the ROM 4 and using the resources of the electronic musical instrument.
First, in step 901, an initialization process is executed when the power is turned on. In the following step 902, a switch process is executed to respond to user operations on the switches constituting the switch unit 3. In the switch process, for example, the detection circuit of the switch unit 3 is made to detect the states of the various switches, the detection results are received and analyzed, and the type of each switch whose state has changed and the nature of the change are identified.

In step 903, which follows step 902, a keyboard process is executed to respond to user operations on the keyboard 2. By executing the keyboard process, musical tones are emitted from the speaker 12 in response to performance operations on the keyboard 2. The process then proceeds to step 904.
In step 904, a display process is executed in which the LCD or LEDs constituting the display unit 6 are driven to present information to the user. After its execution, the process returns to step 902. As a result, the processing loop formed by steps 902 to 904 is executed repeatedly while the power is on.

Next, the subroutine processes executed within the overall process are described in detail.
FIG. 10 is a flowchart of the switch process executed as step 902. It shows the flow of processing performed after the detection results received from the detection circuit of the switch unit 3 have been analyzed. Of the subroutine processes executed within the overall process, the switch process is described first, with reference to FIG. 10.

First, in step 1001, it is determined whether the analysis switch has been turned on (operated). If the user has operated the switch, the determination is YES; in step 1002 the value of the variable ana_flg is inverted, that is, 0 is assigned if the previous value was 1 and 1 is assigned if the previous value was 0, and the process proceeds to step 1003. Otherwise, the determination is NO and the process proceeds to step 1006. A value of 1 in ana_flg indicates that the original speech input from the microphone 7 is to be analyzed.

In step 1003, it is determined whether the value of the variable ana_flg is 1. If the user has operated the analysis switch to instruct analysis of the original speech, 1 has been assigned to the variable, so the determination is YES; in step 1004 the parameter buffer 35 is cleared, a value designating the parameter storage area located at its head (referred to here as an "address" for convenience) is assigned to the variable param_buf_adr, and the process proceeds to step 1006. Otherwise, the determination is NO; in step 1005 the analysis results stored in the buffer 35 are written to the RAM 5 and saved as a parameter file 36, and the process then proceeds to step 1006. In this way, in this embodiment, the original speech is analyzed from the time the analysis switch is operated until it is operated again, and the analysis results are stored sequentially in the buffer 35.

In step 1006, it is determined whether the synthesis switch has been turned on (operated). If the user has operated the switch, the determination is YES; the value of the variable syn_flg is inverted in step 1007, and the process proceeds to step 1008. Otherwise, the determination is NO and the process proceeds to step 1010.

In step 1008, it is determined whether the value of the variable syn_flg is 1. If 1 has been assigned to the variable, the determination is YES; in step 1009 the parameter file 36 designated by the value of the variable file_num is read into the parameter buffer 35, and the process proceeds to step 1010. Otherwise, the determination is NO and the process proceeds directly to step 1010. In this way, in this embodiment, the parameter file 36 used for synthesis is switched by operating the synthesis switch.

In step 1010, it is determined whether the pitch selection switch has been turned on (operated). If the user has operated the switch, the determination is YES; the value of the variable pitch_sel is inverted in step 1011, and the process proceeds to step 1012. Otherwise, the determination is NO and the process proceeds directly to step 1012.

In step 1012, it is determined whether the file selection switch has been turned on. If the user has operated the switch, the determination is YES; the value of the variable file_num is changed (updated) cyclically in step 1013, an other-switch process for responding to operations on the remaining switches is executed in step 1014, and the series of processes ends. Otherwise, the determination is NO; the process of step 1014 is executed next, and the series of processes then ends.

The cyclic update of the variable file_num is performed according to, for example, the number of existing parameter files 36 and their file numbers. This allows the user to select any one of the existing parameter files 36 by operating the file selection switch. The parameter file 36 selected in this way is read into the parameter buffer 35 when the synthesis switch is operated.
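The switch process of steps 1001 to 1013 can be summarized as a small state update. The sketch below is an illustration only: the class `PanelState`, the switch names, and the choice of saving a new parameter file under the currently selected number are all assumptions, since the text does not say how a saved file is numbered.

```python
class PanelState:
    """Hypothetical container for the flags named in the text."""

    def __init__(self):
        self.ana_flg = 0      # 1 while the original speech is being analyzed
        self.syn_flg = 0      # 1 while key presses trigger synthesis
        self.pitch_sel = 0    # 0: pitch from keyboard, 1: pitch from analysis
        self.file_num = 0     # currently selected parameter file
        self.param_buf = []   # stands in for parameter buffer 35
        self.files = {}       # stands in for parameter files 36


def switch_process(state, operated):
    """Apply one scan of the switch process; `operated` is a set of names."""
    if "analysis" in operated:                      # steps 1001-1002
        state.ana_flg ^= 1
        if state.ana_flg == 1:                      # step 1004
            state.param_buf = []
        else:                                       # step 1005
            # Assumption: the file is saved under the selected number.
            state.files[state.file_num] = list(state.param_buf)
    if "synthesis" in operated:                     # steps 1006-1007
        state.syn_flg ^= 1
        if state.syn_flg == 1 and state.file_num in state.files:  # step 1009
            state.param_buf = list(state.files[state.file_num])
    if "pitch_select" in operated:                  # steps 1010-1011
        state.pitch_sel ^= 1
    if "file_select" in operated and state.files:   # steps 1012-1013
        numbers = sorted(state.files)
        if state.file_num in numbers:
            position = numbers.index(state.file_num)
            state.file_num = numbers[(position + 1) % len(numbers)]
        else:
            state.file_num = numbers[0]
    return state
```

Each flag is toggled by exclusive-or with 1, which is the inversion between 0 and 1 that the text describes for ana_flg, syn_flg, and pitch_sel.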

FIG. 11 is a flowchart of the keyboard process executed as step 903 in the overall process shown in FIG. 9. The keyboard process is described in detail next, with reference to FIG. 11.
First, in step 1101, event data (for example, MIDI data) representing an operation performed by the user on the keyboard 2 is received from the keyboard 2, and the content of the operation is determined. If no event data representing an operation has been received, it is determined that the state of the keyboard 2 has not changed, and the series of processes ends here. If the event data indicates that a key has been pressed, the process proceeds to step 1102; if the event data indicates that a key has been released, the process proceeds to step 1107.

In step 1102, it is determined whether the value of the variable syn_flg is 1. If the value is not 1, the determination is NO; in step 1103 a command (for example, MIDI data) instructing that a musical tone of the pitch assigned to the pressed key be sounded is sent to the musical tone generation unit 9, and the series of processes ends. Otherwise, the determination is YES and the process proceeds to step 1104.

In step 1104, the output buffer of the frame addition unit 47 is cleared, a value designating the address at its head (here, an area that stores one item of speech data) is assigned to the variable out_buf_adr, and a value designating the address at the head of the parameter buffer 35 (here, an area that stores parameter data) is assigned to the variable param_buf_adr. In the following step 1105, the frame synthesis process that synthesizes one frame of speech data as described above is executed, and in step 1106, the value 1, which indicates that sound is being produced from the synthesized speech data, is assigned to the variable note_on. The series of processes then ends.

On the other hand, in step 1107, to which the process proceeds when the event data received from the keyboard 2 in step 1101 is determined to indicate a key release, it is determined whether the value of the variable syn_flg is 1. If the value is not 1, the determination is NO; in step 1108 a command (for example, MIDI data) instructing that the musical tone of the pitch assigned to the released key be silenced is sent to the musical tone generation unit 9, and the series of processes ends. Otherwise, the determination is YES and the process proceeds to step 1109, where the sound production performed by reading the synthesized speech data from the output buffer is terminated, the value 0, which indicates that no sound is being produced, is assigned to the variable note_on, and the series of processes ends.

In this way, when the value of the variable syn_flg is 1, speech data is synthesized using the parameters stored in the parameter buffer 35, and speech (a musical tone) is sounded from that speech data. As a result, from the time the user operates the synthesis switch until the user operates it again, the synthesized speech is sounded according to the operations on the keyboard 2, that is, according to which key (pitch) is pressed and for how long it is held.

FIG. 12 is a flowchart of the frame synthesis process executed as step 1105. The synthesis process is described in detail next, with reference to FIG. 12.
First, in step 1201, it is determined whether the value of the variable pitch_sel is 0. If the user has operated the pitch selection switch so that operations on the keyboard 2 are used to designate the pitch of the synthesized speech, 0 has been assigned to the variable, so the determination is YES; in step 1202 the pitch assigned to the pressed key is set as the pitch of the driving sound source waveform, and the process proceeds to step 1204. Otherwise, the determination is NO; in step 1203 the pitch stored as a parameter in the parameter buffer 35 is set as that pitch, and the process proceeds to step 1204.

In step 1204, parameters are read from the address of the parameter buffer 35 designated by the value of the variable param_buf_adr, speech data is synthesized, and the one frame of speech data obtained is added to the frames already generated, so that the newly synthesized frame is added to the speech data held in the output buffer. Specifically, a Rosenberg wave of the pitch set in step 1202 or 1203 and a white noise wave are generated, and the two waveforms are mixed according to the voiced sound ratio mix to generate the driving sound source waveform. Speech data is synthesized from the generated driving sound source waveform and the PARCOR coefficients, its level is adjusted, and it is multiplied by a Hanning window. The result is then added to, and thus superimposed on, the previously generated frames starting at the address designated by the value of the variable out_buf_adr. After the superposition, a value designating the address at which writing of the next frame should start is assigned to the variable out_buf_adr. The series of processes then ends.
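The text states only that the PARCOR coefficients serve as the filter coefficients. One common way to realize such a filter is an all-pole lattice, sketched below as an assumption rather than as the structure used in the embodiment. The sign convention assumes PARCOR coefficients computed as in step 1503, with the highest-order prediction coefficient equal to the negated PARCOR coefficient as in step 1504.

```python
def lattice_synthesis(excitation, k):
    """All-pole lattice filter driven by `excitation`.

    k[0] is k_1, k[1] is k_2, and so on. The recursion used is
        f_{m-1}(t) = f_m(t) + k_m * b_{m-1}(t-1)
        b_m(t)     = b_{m-1}(t-1) - k_m * f_{m-1}(t)
    with output f_0(t) and b_0(t) = f_0(t).
    """
    order = len(k)
    b = [0.0] * (order + 1)   # b[m] holds b_m(t-1), the delayed backward error
    out = []
    for x in excitation:
        f = x                                  # f_P(t): the input sample
        for m in range(order, 0, -1):
            f = f + k[m - 1] * b[m - 1]        # f_{m-1}(t)
            b[m] = b[m - 1] - k[m - 1] * f     # b_m(t) from b_{m-1}(t-1)
        b[0] = f                               # b_0(t) equals the output
        out.append(f)
    return out


# Impulse responses for two small hypothetical coefficient sets.
first_order = lattice_synthesis([1.0, 0.0, 0.0], [0.5])
second_order = lattice_synthesis([1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.5])
```

For k = [0.5] the filter reduces to y(t) = x(t) + 0.5 y(t−1), and for k = [0, 0.5] to y(t) = x(t) + 0.5 y(t−2), which gives a quick check of the sign convention.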

In this way, in this embodiment, when speech data is synthesized, either the pitch of the key pressed on the keyboard 2 or the pitch of the original speech can be selected by operating the pitch selection switch.
FIG. 13 is a flowchart of the musical tone timer interrupt process. This process is executed in response to an interrupt signal generated, for example, at the sampling period, in order to analyze original speech data or synthesize speech data. For example, in the switch process shown in FIG. 10, the prohibition on the interrupt is lifted (the interrupt is enabled) when 1 is newly assigned to at least one of the variables ana_flg and syn_flg, and the interrupt is prohibited (disabled) when both values become 0. The timer interrupt process is described in detail next, with reference to FIG. 13.

First, in step 1301, an analysis process for analyzing the input original speech data is executed. In the following step 1302, it is determined whether the value of the variable syn_flg is 1. If 1 has been assigned to the variable, the determination is YES and the process proceeds to step 1303; otherwise, the determination is NO and the series of processes ends here.

In step 1303, it is determined whether the value of the variable note_on is 1. If a musical tone based on the synthesized speech data should be sounding, 1 has been assigned to the variable, so the determination is YES; in step 1304 the speech data in the output buffer is sent to the D/A converter 10 via the musical tone generation unit 9 so that the tone continues to sound, and the process proceeds to step 1305.

In step 1305, it is determined whether it is the frame synthesis timing. If it is, the determination is YES; unless the parameters of the last frame in the parameter buffer 35 have already been read, the value of the variable param_buf_adr is updated to point to the frame whose parameters should be read next (step 1306), the frame synthesis process shown in FIG. 12 is then executed, and the series of processes ends. Otherwise, the determination is NO and the series of processes ends here.

As described above, when the frame size is 512 and the overlap factor is 8, and the musical tone timer interrupt process is executed once per sampling period, the frame synthesis timing arrives at every 64th execution.
In this way, while speech data is being synthesized, the frame whose parameters are read from the parameter buffer 35 is changed as time elapses from the start of synthesis. Changes in the sound quality of the original speech are thereby reflected in the synthesized speech data.
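The timing arithmetic above (512 ÷ 8 = 64 sample periods between frames) and the behavior of step 1306 can be sketched as follows. The class `FrameClock` and the function `advance_param_index` are hypothetical names introduced for illustration.

```python
FRAME_SIZE = 512
OVERLAP_FACTOR = 8
HOP = FRAME_SIZE // OVERLAP_FACTOR  # 64 samples between frame starts


class FrameClock:
    """Counts sample-period interrupts and flags every HOP-th one."""

    def __init__(self, hop=HOP):
        self.hop = hop
        self.count = 0

    def tick(self):
        """Call once per interrupt; returns True at frame timing."""
        self.count += 1
        if self.count == self.hop:
            self.count = 0
            return True
        return False


def advance_param_index(index, frame_count):
    """Move to the next analyzed frame, holding at the last one."""
    return index + 1 if index < frame_count - 1 else index


clock = FrameClock()
hits = [i for i in range(1, 257) if clock.tick()]
```

Holding the index at the last frame mirrors the condition in step 1305 that the address is updated only while the parameters of the final frame have not yet been read.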

Next, the analysis process executed as step 1301 is described in detail with reference to the flowchart shown in FIG. 14.
First, in step 1401, it is determined whether the value of the variable ana_flg is 1. If 1 has not been assigned to it, the determination is NO and the series of processes ends here. Otherwise, the determination is YES; the original speech data digitized by A/D conversion and output by the A/D converter 8 is captured and stored in an area secured in, for example, the RAM 5 (hereinafter the "input buffer") (step 1402), and the process proceeds to step 1403.

In step 1403, it is determined whether it is the frame analysis timing. The interval at which this timing arrives is the same as that of the frame synthesis timing. If it is the timing, the determination is YES and the process proceeds to step 1404. Otherwise, the determination is NO and the series of processes ends here.

In step 1404, original speech data of one frame size is extracted from the input buffer according to the value of the overlap factor and multiplied by a Hanning window. In the next step 1405, the windowed frame is analyzed, and a parameter calculation process is executed to calculate parameters such as the level value, the PARCOR coefficients, the pitch frequency, and the voiced sound ratio mix. The calculated parameters are then written to the parameter buffer 35 starting at the address designated by the value of the variable param_buf_adr, that value is updated to reflect the write (step 1406), and the process proceeds to step 1407.

In step 1407, it is determined whether the parameter buffer 35 is full, that is, whether no capacity remains in the parameter buffer 35 for writing the parameters to be extracted from the next frame. If no capacity remains, the determination is YES; in step 1408 the parameters stored in the parameter buffer 35 are saved as a parameter file 36, 0 is assigned to the variable ana_flg, and the series of processes ends. Thus, once as many parameters as the parameter buffer 35 can hold have been written, analysis of the original speech data ends automatically. Otherwise, the determination is NO and the series of processes ends here.
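A minimal sketch of steps 1404 to 1407 follows. It assumes that the newest 512 samples of the input buffer form the frame, that the level is the mean square of the windowed frame, and that the parameter buffer is a list with a fixed capacity; the function names are hypothetical.

```python
import math

FRAME_SIZE = 512


def windowed_frame(input_buffer, frame_size=FRAME_SIZE):
    """Return the newest frame_size samples times a Hanning window."""
    frame = input_buffer[-frame_size:]
    if len(frame) < frame_size:
        raise ValueError("not enough samples for one frame")
    return [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / frame_size))
            for i, s in enumerate(frame)]


def frame_level(frame):
    """Mean square of the frame (sum of squares / frame size)."""
    return sum(s * s for s in frame) / len(frame)


def store_parameters(param_buffer, params, capacity):
    """Append one frame's parameters; return True when the buffer is full."""
    param_buffer.append(params)
    return len(param_buffer) >= capacity


frame = windowed_frame([1.0] * 600)
buffer_35 = []
full_after_first = store_parameters(buffer_35, {"level": frame_level(frame)}, 2)
full_after_second = store_parameters(buffer_35, {"level": frame_level(frame)}, 2)
```

When `store_parameters` returns True, the caller would save the buffer as a parameter file and clear ana_flg, as in step 1408.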

The subroutine processes executed within the parameter calculation process of step 1405 are described in detail below. Only those relating to the PARCOR coefficients, the pitch frequency and its correction, and the voiced sound ratio mix are described, with reference to the flowcharts shown in FIGS. 15 to 18.

FIG. 15 is a flowchart of the PARCOR coefficient calculation process. This calculation process is described in detail first, with reference to FIG. 15. As described above, the process calculates the PARCOR coefficients using the Levinson-Durbin algorithm.
First, in step 1501, the short-time autocorrelation function is calculated from one frame of speech data. With the short-time autocorrelation function denoted by R and the speech data by y, it is calculated by the following formula. In the figure, the symbol P attached as a subscript to the symbol R, that is, the order P, has the value 26 as described above.

In step 1502, which follows step 1501, the autocorrelation value $R_1$ is assigned to the element of the array variable W whose subscript (the number in parentheses) is 0 (hereinafter written $W_0$; the same notation is used for other elements), the autocorrelation value $R_0$ is assigned to element $E_0$ of the array variable E, and 1 is assigned to the variable n. The process then proceeds to step 1503.

In step 1503, the value of element $W_{n-1}$ divided by the value of element $E_{n-1}$ ($= W_{n-1} / E_{n-1}$) is assigned to element $k_n$ of the array variable k, and the value of element $E_{n-1}$ multiplied by one minus the square of $k_n$ ($= E_{n-1}(1 - k_n^2)$) is assigned to element $E_n$. The value assigned to element $k_n$ is a PARCOR coefficient (partial autocorrelation coefficient).

In step 1504, which follows step 1503, the negative of element $k_n$ ($= -k_n$) is assigned to the element of the array variable α designated by two values of the variable n (written $\alpha_n^{(n)}$ in the figure; this notation is used hereinafter), and the value obtained by subtracting the product of $k_n$ and $\alpha_{n-i}^{(n-1)}$ from $\alpha_i^{(n-1)}$ ($= \alpha_i^{(n-1)} - k_n \alpha_{n-i}^{(n-1)}$) is assigned to element $\alpha_i^{(n)}$ (the element designated by the values of the variables n and i). The latter assignment is made for every element $\alpha_i^{(n)}$ whose index i lies in the range $1 \le i \le n-1$. After these assignments are complete, the process proceeds to step 1505.

In step 1505, it is determined whether the value of the variable n is equal to the value of the order P. If the values match, the determination is YES, the calculation of the PARCOR coefficients is regarded as complete, and the series of processes ends. Otherwise, the determination is NO and the process proceeds to step 1506.

In step 1506, the value obtained by the following equation is assigned to element $W_n$. The value of the variable n is then incremented in step 1507, and the process returns to step 1503.

In this way, by repeatedly executing the processing loop formed by steps 1503 to 1507 until the determination in step 1505 becomes YES, PARCOR coefficients for the order P are assigned to the elements $k_n$ designated by the variable n.
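The loop of steps 1502 to 1507 can be sketched in Python as below. Two parts are assumptions, because the corresponding formulas are not reproduced in this text: the form of the short-time autocorrelation (a plain lagged sum of products) and the update of $W_n$ in step 1506, for which the standard Levinson-Durbin expression consistent with $W_0 = R_1$ and $\alpha_n^{(n)} = -k_n$ is used, namely $W_n = R_{n+1} + \sum_{i=1}^{n} \alpha_i^{(n)} R_{n+1-i}$.

```python
def autocorrelation(y, order):
    """Short-time autocorrelation R_0..R_order of one frame (assumed form)."""
    n = len(y)
    return [sum(y[t] * y[t + lag] for t in range(n - lag))
            for lag in range(order + 1)]


def parcor(r, order):
    """Return PARCOR coefficients k_1..k_order from autocorrelation r."""
    if r[0] <= 0.0:
        raise ValueError("R_0 must be positive")
    w = r[1]                       # W_0 = R_1        (step 1502)
    e = r[0]                       # E_0 = R_0
    alpha = [0.0] * (order + 1)    # alpha[i] holds alpha_i^(n); index 0 unused
    k = []
    for n in range(1, order + 1):
        k_n = w / e                                   # step 1503
        e *= 1.0 - k_n * k_n                          # E_n
        prev = alpha[:]                               # alpha^(n-1)
        alpha[n] = -k_n                               # step 1504
        for i in range(1, n):
            alpha[i] = prev[i] - k_n * prev[n - i]
        k.append(k_n)
        if n == order:                                # step 1505
            break
        # step 1506 (assumed form of the W_n update)
        w = r[n + 1] + sum(alpha[i] * r[n + 1 - i] for i in range(1, n + 1))
    return k


# Autocorrelation of a first-order process: only k_1 should be non-zero.
coefficients = parcor([1.0, 0.5, 0.25, 0.125], 3)
```

In the embodiment the order P is 26, so `parcor(autocorrelation(frame, 26), 26)` would yield 26 coefficients per frame under these assumptions.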
FIG. 16 is a flowchart of the pitch frequency calculation process. This calculation process is described in detail next, with reference to FIG. 16.

First, in step 1601, an FFT is performed on the frame after multiplication by the Hanning window, the frequency amplitude values x and the phases are obtained from the resulting real and imaginary values, and the autocorrelation function r of the frequency amplitude values x is calculated using equation (1) above. The phases obtained are assigned to the elements of the one-dimensional array variable phase. At this point, the values previously assigned to the elements of the array variable phase have been assigned to the elements of the array variable old_phase designated by the same subscripts. The values assigned to the elements of old_phase correspond to the phase data 56 of the previous frame shown in FIG. 4. The value of the subscript n at which the correlation value r is largest within the frequency range of the human voice is then identified and assigned to the variable i. The value assigned to the variable i is the index of the frequency amplitude value x that corresponds to the pitch frequency.

In step 1602, which follows step 1601, the value obtained by subtracting the value of element old_phase[i] of the array variable old_phase from the value of element phase[i] of the array variable phase, designated by the value of the variable i, is assigned to the variable delta_phase (a phase shift value indicating the phase shift), and the value of element phase[n] is assigned to element old_phase[n] (1 ≤ n ≤ 13). The assignment is limited to the range 1 ≤ n ≤ 13 because the pitch frequency need only be searched for among the frequencies that the human voice can take. After these assignments are complete, the process proceeds to step 1603.

In step 1603, the theoretical phase advance is obtained by multiplying together the inter-grid angular frequency of the FFT, Δω (= 2π × Δf, where Δf = sampling frequency FS ÷ number of FFT points (frame size)), the variable i, and the time difference from the previous frame, Δt (= hop size H ÷ sampling frequency FS). This value is subtracted from the value of the variable delta_phase, and the result (= delta_phase − Δω·i·Δt) is assigned to the variable offset_phase. The value calculated in this way corresponds to the deviation between the FFT frequency grid and the pitch frequency.

In the next step 1604, normalization is performed so that the value of offset_phase satisfies −π < offset_phase < π. In the following step 1605, the normalized value of offset_phase is divided by the time difference Δt, and the result is assigned back to the same variable. The process then proceeds to step 1606. The newly assigned value expresses the phase offset in the dimension of angular frequency.

In step 1606, the variable delta is assigned the value of offset_phase divided by the inter-grid angular frequency Δω. The value obtained in this way is the deviation from the FFT frequency grid, that is, the fractional part of the variable i (index value). Accordingly, in the following step 1607, the sum of i and delta is multiplied by the inter-grid frequency Δf to calculate the pitch frequency (= (i + delta) × Δf), which is assigned to the variable freq. The series of processes then ends.
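Steps 1602 to 1607 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `refine_pitch`, its argument names, and the modulo-based wrapping are assumptions, and the peak index i and the two phase values are taken as given from step 1601.

```python
import math


def refine_pitch(i, phase_i, old_phase_i, fs, frame_size, hop_size):
    """Refine the FFT bin index i into a pitch frequency in Hz.

    phase_i and old_phase_i are the phases (radians) of bin i in the
    current and previous frames. All names are illustrative.
    """
    delta_f = fs / frame_size              # inter-grid frequency
    delta_omega = 2.0 * math.pi * delta_f  # inter-grid angular frequency
    delta_t = hop_size / fs                # time between frames

    delta_phase = phase_i - old_phase_i                     # step 1602
    offset_phase = delta_phase - delta_omega * i * delta_t  # step 1603
    # Step 1604: wrap into [-pi, pi). The text states an open interval;
    # the boundary case is treated here as an implementation detail.
    offset_phase = (offset_phase + math.pi) % (2.0 * math.pi) - math.pi
    offset_phase /= delta_t                                 # step 1605
    delta = offset_phase / delta_omega                      # step 1606
    return (i + delta) * delta_f                            # step 1607
```

With a hypothetical sampling frequency of 8000 Hz, a frame size of 512 and a hop size of 64, the grid spacing is 15.625 Hz, so a 200 Hz tone falls between bins 12 and 13; the function recovers the fractional position from the phase advance between frames.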

As described above, the correlation value r becomes large for speech and musical tones that have a harmonic structure, that is, overtone components. This makes it possible to extract the pitch accurately while excluding the influence of noise and the like. In addition, because the phase difference measurement performed afterwards calculates the frequency offset between FFT frequency grid points, the low frequency resolution of the FFT no longer affects pitch extraction. As a result, highly accurate pitch extraction that is robust to noise and the like is achieved.

The pitch frequency obtained as described above is highly accurate when the target is speech such as ordinary spoken language. However, high accuracy is not necessarily obtained for speech produced when singing or with unusual articulation. For this reason, in this embodiment, the pitch correction process shown in FIG. 17 is executed after the pitch frequency calculation process. The pitch correction process is described next in detail with reference to FIG. 17. In FIG. 17, the symbol "pitch" denotes the pitch frequency assigned to the variable freq, and the symbol "pitch_old" denotes the pitch frequency extracted (identified) in the previous frame.

First, in step 1701, the previous pitch frequency pitch_old is subtracted from the current pitch frequency pitch, and the result is assigned to the variable x. In the next step 1702, the absolute value of x is assigned to x. The process then proceeds to step 1703.

In step 1703, it is determined whether the value of x is greater than the constant THRESH. The constant THRESH is set in order to determine whether a frequency change has occurred that is unlikely to occur within a short time (here, the time difference Δt). If such a change has occurred in the pitch frequency, the determination is YES and the process proceeds to step 1705. Otherwise, the determination is NO, and in step 1704 the variable counter is set to 0 and the current pitch frequency pitch is set as the previous-frame pitch frequency pitch_old. The current pitch frequency pitch is thereby adopted as accurate, and the series of processes ends.

In step 1705, the value of the variable counter is incremented. In the following step 1706, it is determined whether the value of counter is smaller than the value of the constant H. If it is smaller, the determination is YES, and in step 1707 the previous-frame pitch frequency pitch_old is set as the current pitch frequency pitch. The current pitch frequency is thereby rejected as likely to be inaccurate, and the series of processes ends. Otherwise, the determination is NO, and the processing of step 1704 is executed next.

In this way, in this embodiment, when the change in pitch frequency between frames is relatively large, the new value is rejected unless the change persists for the number of frames set by the constant H. Pitch frequencies that were temporarily extracted incorrectly are thereby invalidated, so that the pitch frequency can be extracted more appropriately and stably overall. A value of about 30 for the constant THRESH and about 4 for the constant H was confirmed to be appropriate, but it is desirable to adjust these as needed according to the person who inputs the original voice, the way the voice is produced, and so on.
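The flow of FIG. 17 can be sketched as a small state machine. The class name `PitchCorrector` and its interface are assumptions made for illustration; the constants use the approximate values reported above, and the unit of THRESH is assumed here to be Hz.

```python
THRESH = 30.0  # largest plausible pitch change between frames (assumed Hz)
H = 4          # number of consecutive large changes needed for acceptance


class PitchCorrector:
    """Rejects isolated pitch jumps, following steps 1701 to 1707."""

    def __init__(self, initial_pitch):
        self.pitch_old = initial_pitch
        self.counter = 0

    def correct(self, pitch):
        x = abs(pitch - self.pitch_old)   # steps 1701 and 1702
        if x > THRESH:                    # step 1703
            self.counter += 1             # step 1705
            if self.counter < H:          # step 1706
                return self.pitch_old     # step 1707: reject the new value
        # Step 1704: adopt the new value and reset the counter.
        self.counter = 0
        self.pitch_old = pitch
        return pitch
```

With these constants, a jump from 205 to 400 is held back for three frames and adopted on the fourth consecutive frame in which it persists.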

FIG. 18 is a flowchart of the mixing ratio (voiced sound ratio) calculation process. Lastly, this calculation process is described in detail with reference to FIG. 18.
First, in step 1801, the standard deviation deviation of the autocorrelation values r is calculated, and it is determined whether that value is less than or equal to the constant CROSS_MIN. The constant CROSS_MIN is set as the criterion for regarding the sound as completely voiced. If the value of deviation is less than or equal to it, the determination is YES, the mixing ratio (voiced sound ratio) mix is set to 1 in step 1802, and the series of processes ends. Otherwise, the determination is NO and the process proceeds to step 1803.

In step 1803, it is determined whether the value of deviation is greater than or equal to the constant CROSS_MAX. The constant CROSS_MAX is set as the criterion for regarding the sound as completely unvoiced. If the value of deviation is greater than or equal to it, the determination is YES, the mixing ratio (voiced sound ratio) mix is set to 0 in step 1804, and the series of processes ends. Otherwise, the determination is NO and the process proceeds to step 1805, where the result of subtracting deviation from CROSS_MAX is divided by the result of subtracting CROSS_MIN from CROSS_MAX, and the quotient is set as the mixing ratio mix. The series of processes then ends.

In this way, in this embodiment, the mixing ratio mix is set to 1 when the sound can be regarded as completely voiced and to 0 when it can be regarded as completely unvoiced; in all other cases it is set to a value that varies with the value of deviation. The value of mix is thus varied within the range 0 < mix < 1 only for speech that cannot be regarded as completely voiced or completely unvoiced, so that the driving sound source waveform is generated from both the Rosenberg wave and the white noise wave only when necessary. This allows the driving sound source waveform to be generated more appropriately, and therefore speech data that always sounds natural can be synthesized. A value of about 0.36 for the constant CROSS_MAX and about 0.14 for the constant CROSS_MIN was confirmed to be appropriate, but it is desirable to adjust these as needed according to the person who inputs the original voice, the way the voice is produced, and so on.
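The mapping of FIG. 18 amounts to a clamped linear interpolation, sketched below. The function name `mixing_ratio` is an assumption, the constants use the approximate values reported above, and the standard deviation of the autocorrelation values is taken as an input because its calculation is not detailed in this passage.

```python
CROSS_MIN = 0.14  # at or below this the frame is treated as fully voiced
CROSS_MAX = 0.36  # at or above this the frame is treated as fully unvoiced


def mixing_ratio(deviation):
    """Return the voiced sound ratio mix in [0, 1] for one frame."""
    if deviation <= CROSS_MIN:   # steps 1801 and 1802
        return 1.0
    if deviation >= CROSS_MAX:   # steps 1803 and 1804
        return 0.0
    # Step 1805: linear interpolation between the two thresholds.
    return (CROSS_MAX - deviation) / (CROSS_MAX - CROSS_MIN)
```

For example, a deviation of 0.25, halfway between the two thresholds, gives a ratio of about 0.5.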

In this embodiment, the parameters extracted from the original speech data are first saved as a parameter file 36, but the extraction (analysis) and the synthesis of speech data using the extracted parameters may instead be carried out in real time. The parameters may also be output to an external device.

Several syntheses of speech data may be carried out in parallel, and synthesis may be carried out with the parameters manipulated. The pitch of the speech data to be synthesized may be specified using a stringed instrument, a wind instrument, or the like, or the pitch of a musical tone sounded by playing back an SMF may be used.

The PARCOR coefficients are extracted using an all-pole filter based on linear analysis, but the filter may instead be a pole-zero filter using a Kalman filter or the like, or a lattice filter. Coefficients other than PARCOR coefficients may also be extracted as the filter coefficients.

A white noise wave is generated as the sound source waveform for unvoiced sound, but a residual signal may instead be calculated during speech analysis and, when the voiced sound ratio mix is low, saved as the sound source waveform and used at synthesis time. The speech to be analyzed is assumed to be a human voice, but it need not be limited to that. From the standpoint of maintaining the accuracy of pitch extraction, however, a sound whose pitch range can be identified in advance is desirable.
<Second embodiment>
In the first embodiment, speech data can be synthesized, on the basis of the analysis result of the original speech input from the microphone 7, in response to operations on the keyboard 2. In the second embodiment, by contrast, the synthesis of speech data is performed in synchronization with the playback of sequence data (here, an SMF).

The configuration of the electronic musical instrument equipped with the speech analysis/synthesis apparatus according to the second embodiment is basically the same as in the first embodiment. Its operation is also mostly the same or does not differ greatly. For parts that are the same, or that do not differ enough to need distinguishing, the reference numerals used in the description of the first embodiment are therefore used as they are, and the description focuses on the parts that differ from the first embodiment.

A tempo can be specified for the playback of sequence data. The second embodiment therefore supports changes of tempo. The pitch of the input speech is used as the pitch of the speech data to be synthesized. As a result, when the parameter file 36 that has been selected stores the analysis result of the voice of a person other than the person inputting the speech, the speech data is synthesized in a form that converts the voice quality. The person inputting the speech is assumed to be a singer singing a song.

FIG. 19 is a functional configuration diagram for the analysis phase of the speech analysis/synthesis apparatus according to the second embodiment. First, the configuration for the analysis phase and the operation of each unit are described specifically with reference to FIG. 19. As noted above, parts that are the same as in the first embodiment, or that do not differ enough to need distinguishing, are given the same reference numerals. The same applies to the configuration for the synthesis phase described later.

The tempo indicating means 1901 indicates the tempo to the singer. The tempo value set in it is acquired by the tempo acquisition unit 1902 and stored in the parameter buffer 35 as a parameter. The indicating means 1901 may be an external device such as a metronome or a rhythm box.

Unlike the first embodiment, the second embodiment does not extract the pitch of the original speech. For this reason, a voiced sound ratio calculation unit 1903 that calculates only the voiced sound ratio mix is provided in place of the pitch extraction/voiced sound ratio calculation unit 34. The calculation method is the same as in the first embodiment.
FIG. 20 is a functional configuration diagram for the synthesis phase of the speech analysis/synthesis apparatus according to the second embodiment. Next, the configuration for the synthesis phase and the operation of each unit are described specifically with reference to FIG. 20.

The SMF 2001 is the sequence data to be played back. It consists of MIDI (event) data, which indicates the content of the performance events to be generated, with time data added to indicate when each event is to be processed. The sequencer 2002 plays back the SMF 2001 configured in this way. The synthesis of speech data is performed in synchronization with the playback of the SMF 2001 by the sequencer 2002.

The operation unit 2003 changes the tempo in response to user operations on the tempo switch 23 shown in FIG. 2. Naturally, when the tempo is changed, the speed at which the SMF 2001 is played back must also be changed. The time control unit 2004 therefore implements the change in playback speed that accompanies a change in tempo. It does so by changing the period of the clock supplied to the sequencer 2002.

The pitch extraction unit 2005 extracts the pitch of the original speech data output by the A/D converter 8 using the same method as in the first embodiment, and outputs the extracted pitch frequency. The driving sound source generation unit 2006 generates the driving sound source waveform using at least one of the Rosenberg wave and the white noise wave that it generates, according to the value of the voiced sound ratio (mixing ratio) mix among the parameters read from the parameter file 36 into the parameter buffer 35. The pitch frequency extracted by the pitch extraction unit 2005 is reflected in the generation of the Rosenberg wave. The driving sound source generation unit 2006 includes the Rosenberg wave generation unit 42 and the white noise generation unit 43 shown in FIG. 3. It multiplies the waveforms they generate by the value of mix and by the value obtained by subtracting mix from 1, respectively, mixes the results to generate the driving sound source waveform, and stores it in the driving sound source buffer 44. The driving sound source waveform stored in the driving sound source buffer 44 is sent to the synthesis filter 45.
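The mixing performed by the driving sound source generation unit can be sketched as below. The function name `mix_excitation` is an assumption, and the two input waveforms stand in for the outputs of the Rosenberg wave generation unit 42 and the white noise generation unit 43, whose generation is not reproduced here.

```python
def mix_excitation(voiced, noise, mix):
    """Blend one frame of voiced and noise excitation.

    voiced: samples of the pitched (Rosenberg) wave
    noise:  samples of the white noise wave, same length
    mix:    voiced sound ratio in [0, 1]
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    if len(voiced) != len(noise):
        raise ValueError("waveforms must have the same length")
    # Voiced part weighted by mix, noise part weighted by (1 - mix).
    return [mix * v + (1.0 - mix) * n for v, n in zip(voiced, noise)]
```

When mix is 1 the result is the pitched wave alone, and when mix is 0 it is the noise alone, which matches the two limiting cases of the voiced sound ratio.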

The file control unit 2007, following instructions from the time control unit 2004, reads parameters from one of the parameter files 36 and stores them in the parameter buffer 35. The stored parameters are sent as needed to the time control unit 2004, the synthesis filter 45, and the output control unit 2008, again as instructed by the time control unit 2004.

The synthesis filter unit 45 synthesizes one frame of speech data using the parameters (PARCOR coefficients) sent from the file control unit 2007 and the driving sound source waveform, and sends it to the output control unit 2008.
The output control unit 2008 has an output buffer. It multiplies the one frame of speech data synthesized by the synthesis filter unit 45 by a Hanning window and stores the resulting speech data in the output buffer so that it overlaps the other frames according to the overlap factor.
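The windowed overlap-add performed by the output control unit can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a periodic form of the Hanning window is used (the text does not say which form is intended), and no gain normalization is applied.

```python
import math


def hann(n):
    """Periodic Hanning window of length n (an assumed window form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def overlap_add(frames, hop_size):
    """Window each synthesized frame and sum it into an output buffer.

    frames:   list of equal-length sample lists, one per frame
    hop_size: number of samples between the starts of successive frames
    """
    frame_size = len(frames[0])
    window = hann(frame_size)
    out = [0.0] * (hop_size * (len(frames) - 1) + frame_size)
    for idx, frame in enumerate(frames):
        start = idx * hop_size
        for k, sample in enumerate(frame):
            out[start + k] += window[k] * sample
    return out
```

With an overlap factor of 8 (frame size ÷ hop size), the shifted copies of this window sum to a constant, so a steady input stays steady in the region where all eight frames overlap.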

The speech data read from the output buffer is added to (mixed with) the waveform data output by the sequencer 2002 by the adder 2009. The speech data after the addition is output to the D/A converter 10.
The waveform data output by the sequencer 2002 is generated by the musical tone generation unit 9. The output buffer is an area secured in the RAM 5. For this reason, the musical tone generation unit 9 is provided with a function for mixing the data it generates with data input from elsewhere. The adder 2009 is thereby realized by the musical tone generation unit 9.

The analysis of the speech waveform is performed with a frame size of 512. However, because the playback speed of the SMF 2001 changes when the tempo is changed, the synthesis of speech data must follow that change. The parameters stored frame by frame in the parameter file 36 have to be used at a rate that matches the playback speed.

This means that the synthesized speech data must be compressed or expanded along the time axis while maintaining the specified pitch (here, the pitch of the original speech). For this reason, in this embodiment, the frame size is changed according to the tempo value specified by operating the tempo switch 23 (see FIG. 2). The frame size is determined by the time control unit 2004, specifically as described below with reference to FIG. 21.

The time control unit 2004 receives the specified tempo value from the operation unit 2003 and the tempo value stored in the parameter file 36 from the file control unit 2007. For convenience, the former is referred to here as the synthesis tempo value and the latter as the analysis tempo value.

When the synthesis tempo value is twice the analysis tempo value, then, as shown in FIG. 21 and provided the sampling frequencies are equal, one frame of speech data must be output at synthesis in half the time it took at analysis. The frame size is therefore set to 256, half the analysis value. The hop size is likewise halved to 32, so that the overlap factor is kept unchanged. Conversely, when the synthesis tempo value is half the analysis tempo value, one frame of speech data must be output at synthesis in twice the time it took at analysis, so the frame size is set to 1024, twice the analysis value, and the hop size is also doubled to 128 in order to keep the overlap factor unchanged.

Changing the frame size (here including the hop size) according to the tempo value in this way allows speech data to be synthesized appropriately from the parameters obtained by analysis. As a result, the sound produced from the synthesized speech data together with the playback of the SMF 2001 is always natural.

Both the frame size and the hop size are integers. Depending on the relationship between the analysis tempo value and the synthesis tempo value, at least one of them might not come out as an integer. The actual determination is therefore made in a way that prevents this.
Regarding the operation of the electronic musical instrument that realizes the voice conversion apparatus according to the second embodiment, the main differences from the first embodiment are as follows: in the analysis process (see FIG. 14), a tempo value is acquired at analysis time and saved as a parameter; and in the frame synthesis process (see FIG. 12), one frame of speech data is synthesized at the determined size. Among the other processes, the switch process executed as step 902 in the overall process shown in FIG. 9 and the musical tone timer interrupt process (see FIG. 13) differ relatively greatly from the first embodiment. For the second embodiment, therefore, only the switch process and the timer interrupt process are described.

FIG. 22 is a flowchart of the switch process in the second embodiment. First, the switch process is described in detail with reference to FIG. 22. Steps whose processing is the same as in the first embodiment, or need not be specifically distinguished, are given the same reference numerals, and their description is omitted.

Step 2201 is reached when the determination in step 1001 is NO or after the processing of step 1004 or 1005 has been executed. In step 2201, it is determined whether the start switch 21 has been turned on and the value of the variable play_flg is 0. The variable play_flg is used to manage the playback of the SMF 2001; 0 indicates that it is not being played back and 1 indicates that playback is in progress. Accordingly, if the user operates the start switch 21 while the SMF 2001 is not being played back, the determination is YES and step 2202 is executed. In step 2202, the parameter file 36 specified by the value of the variable file_num is read into the parameter buffer 35, the prohibition on executing the timer interrupt process that counts unit time for managing the progress of playback of the SMF 2001 is lifted (shown in the figure as "synchronous clock start"), and 1 is assigned to play_flg. The process then proceeds to step 2203. Otherwise, the determination is NO and the processing of step 2203 is executed next.

The time data in the SMF 2001 expresses time in a predetermined unit time (in MIDI, a time corresponding to 1/24 of a quarter note). The timer interrupt process mentioned above (hereinafter called the "timekeeping timer interrupt process" for convenience) is executed by an interrupt signal generated every unit time. The arrival of the processing timing represented by the time data is determined by referring to the value of a variable that is updated each time this process is executed. The unit time is varied according to the tempo value, so that the SMF 2001 can be played back at a speed that matches the set tempo value.
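The dependence of the unit time on the tempo can be illustrated with a short calculation. This sketch assumes that the tempo value is expressed in quarter notes per minute, which this passage does not state; the function name `unit_time_seconds` is likewise hypothetical.

```python
TICKS_PER_QUARTER = 24  # unit time is 1/24 of a quarter note


def unit_time_seconds(tempo_bpm):
    """Interval between timekeeping interrupts, in seconds.

    tempo_bpm is assumed to be the tempo in quarter notes per minute.
    """
    if tempo_bpm <= 0:
        raise ValueError("tempo must be positive")
    seconds_per_quarter = 60.0 / tempo_bpm
    return seconds_per_quarter / TICKS_PER_QUARTER
```

At 120 quarter notes per minute this gives an interval of 1/48 second; doubling the tempo halves the interval, which is how a faster clock speeds up playback.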

In step 2203, it is determined whether the stop switch 22 has been turned on and the value of play_flg is 1. If play_flg holds 1, that is, if the user operates the stop switch 22 while the SMF 2001 is being played back, the determination is YES and step 2204 is executed. In step 2204, the output buffer is cleared to end the sounding of the speech data stored in it, execution of the timekeeping timer interrupt process is prohibited, and 0 is assigned to play_flg. The process then proceeds to step 2205. Otherwise, the determination is NO and the processing of step 2205 is executed next.

In step 2205, it is determined whether the up switch 23a, which is used to instruct an increase of the tempo value, has been turned on. If the user has operated the switch 23a, the determination is YES. In step 2206 the value of the variable tempo_user, which holds the user-specified tempo value, is incremented, and in step 2207 the frame correction process that corrects the frame size is executed. The process then proceeds to step 2208. Otherwise, the determination is NO and the process proceeds to step 2208 without executing any other step.

In step 2208, it is determined whether the down switch 23b, which is used to instruct a decrease of the tempo value, has been turned on. If the user has operated the switch 23b, the determination is YES. In step 2209 the value of tempo_user is decremented, and in step 2210 the frame correction process is executed. The process then proceeds to step 1012. Otherwise, the determination is NO and the process proceeds to step 1012 without executing any other step. The description of the processing from step 1012 onward is omitted.

In this way, in the second embodiment, the frame size used at synthesis is corrected and updated each time the user operates the up switch 23a or the down switch 23b. Speech data can thereby always be synthesized with an appropriate frame size.

図２３は、上記ステップ２２０７、或いは２２１０として実行されるフレーム修正処理のフローチャートである。次に図２３を参照して、その修正処理について詳細に説明する。 FIG. 23 is a flowchart of the frame correction process executed as step 2207 or 2210. Next, the correction process will be described in detail with reference to FIG. 23.
先ず、ステップ２３０１では、パラメータファイル３６から読み出した分析時のテンポ値ｔｅｍｐｏ＿ｏｒｇを変数ｔｅｍｐｏ＿ｕｓｅｒの値で除算し、その除算結果を分析時のホップサイズｈｏｐ＿ｓｉｚｅに乗算して得られる値の四捨五入した整数値（＝ＩＮＴ（ｈｏｐ＿ｓｉｚｅ×ｔｅｍｐｏ＿ｏｒｇ／ｔｅｍｐｏ＿ｕｓｅｒ））を変数ｎｅｗ＿ｈｏｐ＿ｓｉｚｅに代入する。続くステップ２３０２では、変数ｎｅｗ＿ｆｒａｍｅ＿ｓｉｚｅに、変数ｎｅｗ＿ｈｏｐ＿ｓｉｚｅの値にオーバーラップファクタＯＬＦの値（ここでは「８」を表記）を乗算した値を代入する。その次のステップ２３０３では、テンポ値ｔｅｍｐｏ＿ｏｒｇに分析時のフレームサイズｆｒａｍｅ＿ｓｉｚｅの値を乗算し、その乗算結果を変数ｎｅｗ＿ｆｒａｍｅ＿ｓｉｚｅの値で除算して得られる値の四捨五入した整数値（＝ＩＮＴ（ｔｅｍｐｏ＿ｏｒｇ×ｆｒａｍｅ＿ｓｉｚｅ／ｎｅｗ＿ｆｒａｍｅ＿ｓｉｚｅ））を変数ｔｅｍｐｏに代入する。一連の処理はその後に終了する。 First, in step 2301, the analysis-time tempo value tempo_org read from the parameter file 36 is divided by the value of the variable tempo_user, the quotient is multiplied by the analysis-time hop size hop_size, and the result rounded to the nearest integer (= INT(hop_size × tempo_org / tempo_user)) is assigned to the variable new_hop_size. In the subsequent step 2302, the value of the variable new_hop_size multiplied by the value of the overlap factor OLF (written as "8" here) is assigned to the variable new_frame_size. In the next step 2303, the tempo value tempo_org is multiplied by the analysis-time frame size frame_size, the product is divided by the value of the variable new_frame_size, and the result rounded to the nearest integer (= INT(tempo_org × frame_size / new_frame_size)) is assigned to the variable tempo. The series of processes then ends.
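
The arithmetic of steps 2301 to 2303 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `round_half_up` and `correct_frame` are hypothetical, the numeric values in the usage note are chosen only for illustration, and rounding is implemented as round-half-up on positive integers because Python's built-in `round` rounds halves to the nearest even number.

```python
OLF = 8  # overlap factor; the description writes this value as 8


def round_half_up(numerator: int, denominator: int) -> int:
    """Round numerator / denominator to the nearest integer, halves up.

    Assumes both arguments are positive integers.
    """
    return (2 * numerator + denominator) // (2 * denominator)


def correct_frame(hop_size: int, frame_size: int,
                  tempo_org: int, tempo_user: int, olf: int = OLF):
    """Return (new_hop_size, new_frame_size, tempo) as in steps 2301-2303."""
    if tempo_user <= 0:
        raise ValueError("tempo_user must be positive")
    # Step 2301: hop size scaled by tempo_org / tempo_user, rounded.
    new_hop_size = round_half_up(hop_size * tempo_org, tempo_user)
    if new_hop_size <= 0:
        raise ValueError("resulting hop size is not positive")
    # Step 2302: frame size is the hop size times the overlap factor.
    new_frame_size = new_hop_size * olf
    # Step 2303: tempo value actually realized by the integer frame size.
    tempo = round_half_up(tempo_org * frame_size, new_frame_size)
    return new_hop_size, new_frame_size, tempo
```

For example, with the illustrative values hop_size = 32, frame_size = 256, tempo_org = 120 and tempo_user = 100, the sketch returns a hop size of 38, a frame size of 304 and a tempo of 101. The tempo obtained differs slightly from the requested value because the hop size is rounded to an integer first.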

このように、第２の実施例では、ユーザがテンポ値を変更すると、先ず、変更後のテンポ値に対応するホップサイズを整数値で求め、その求めたホップサイズから次に設定するフレームサイズを求め、求めたフレームサイズから実際に設定するテンポ値を求めている。それにより、ホップサイズ、フレームサイズ、及び実際のテンポ値を全て整数値で決定している。 As described above, in the second embodiment, when the user changes the tempo value, first, the hop size corresponding to the changed tempo value is obtained as an integer value, and the frame size to be set next is determined from the obtained hop size. The tempo value actually set is obtained from the obtained frame size. As a result, the hop size, frame size, and actual tempo value are all determined as integer values.

図２４は、第２の実施例における楽音タイマインタラプト処理のフローチャートである。これは、例えばサンプリング周期で発生する割り込み信号により実行される処理である。例えば図２２に示すスイッチ処理において、変数ａｎａ＿ｆｌｇ、及びｐｌａｙ＿ｆｌｇのうちの少なくとも一方に１を新たに代入したときに割り込み（実行）禁止が解除され（割り込みが有効とされ）、それらの値がともに０となったときに割り込みが禁止される（割り込みが無効とされる）ようになっている。最後に図２４を参照して、そのタイマインタラプト処理について詳細に説明する。 FIG. 24 is a flowchart of the musical tone timer interrupt process in the second embodiment. This process is executed in response to an interrupt signal generated, for example, at every sampling period. For example, in the switch process shown in FIG. 22, the interrupt (execution) prohibition is canceled (the interrupt is enabled) when 1 is newly assigned to at least one of the variables ana_flg and play_flg, and the interrupt is prohibited (disabled) when both of these values become 0. Finally, the timer interrupt process will be described in detail with reference to FIG. 24.

先ず、ステップ２４０１では、入力した元音声データを分析するための分析処理を実行する。続くステップ２４０２では、変数ｐｌａｙ＿ｆｌｇの値が１か否か判定する。その変数に１が代入されていた場合、判定はＹＥＳとなってステップ２４０３に移行し、そうでない場合には、判定はＮＯとなり、ここで一連の処理を終了する。 First, in step 2401, an analysis process for analyzing the input original voice data is executed. In the following step 2402, it is determined whether or not the value of the variable play_flg is 1. If 1 has been assigned to the variable, the determination is YES and the process proceeds to step 2403; otherwise, the determination is NO and the series of processes ends here.

ステップ２４０３では、上記計時用タイマインタラプト処理の実行によって更新される変数の値を参照してＳＭＦ２００１の再生を進行させるＳＭＦ再生処理を実行する。その再生は、処理タイミングとなったＭＩＤＩデータを楽音発生部９に順次、送出していくことで行われる。その再生処理の実行後は、ステップ２４０４に移行する。 In step 2403, an SMF playback process is executed, which advances the playback of the SMF 2001 by referring to the value of the variable updated by the above timekeeping timer interrupt process. Playback is performed by sequentially sending the MIDI data whose processing timing has arrived to the musical sound generating unit 9. After the playback process is executed, the process proceeds to step 2404.

ステップ２４０４では、音声入力があるか否か判定する。マイク７に所定の音量以上の音量の元音声が入力されていなかった場合、音声入力はないとして判定はＮＯとなり、出力バッファに格納された音声データによる発音を終了させ、変数ｎｏｔｅ＿ｏｎに０を代入する処理をステップ２４０５で実行してから、一連の処理を終了する。そうでない場合には、判定はＹＥＳとなってステップ２４０６に移行する。その変数ｎｏｔｅ＿ｏｎはローカル変数であり、音声入力が継続している間、１を代入し、その間でなければ０を代入するようになっている。 In step 2404, it is determined whether or not there is a voice input. If original voice at or above a predetermined volume has not been input to the microphone 7, it is assumed that there is no voice input and the determination is NO; in step 2405, sound generation from the voice data stored in the output buffer is terminated and 0 is assigned to the variable note_on, after which the series of processes ends. Otherwise, the determination is YES and the process proceeds to step 2406. The variable note_on is a local variable to which 1 is assigned while voice input continues and 0 is assigned otherwise.

ステップ２４０６では、変数ｎｏｔｅ＿ｏｎの値が１か否か判定する。音声入力が継続していた場合、それには１が代入されていることから、判定はＹＥＳとなってステップ２４０８に移行する。そうでない場合には、判定はＮＯとなり、出力バッファをクリアし、その先頭に位置するアドレス（ここでは１音声データを格納する領域のことである）を指定する値を変数ｏｕｔ＿ｂｕｆ＿ａｄｒに代入し、パラメータバッファ３５の先頭に位置するアドレス（ここではパラメータデータを格納する領域のことである）を指定する値を変数ｐａｒａｍ＿ｂｕｆ＿ａｄｒに代入し、更に変数ｎｏｔｅ＿ｏｎに１を代入する処理をステップ２４０７で行った後、ステップ２４１１に移行する。 In step 2406, it is determined whether or not the value of the variable note_on is 1. If the voice input has continued, 1 is assigned to it, so the determination is yes and the process moves to step 2408. Otherwise, the determination is no, the output buffer is cleared, a value specifying the address located at the head of the output buffer (here, an area for storing one audio data) is substituted into the variable out_buf_adr, and the parameter After performing a process of substituting a value specifying an address (here, an area for storing parameter data) located at the head of the buffer 35 into a variable param_buf_adr and further substituting 1 into a variable note_on in step 2407, Control goes to step 2411.
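
The branching of steps 2404 to 2407 can be sketched in Python as follows. This is only an illustrative sketch: the class `VoiceState`, the function `on_voice_check`, the returned labels, and the use of list indices in place of buffer addresses are all assumptions introduced here, and the actual sound output and synthesis are left out.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceState:
    """Hypothetical container for the variables named in the description."""
    note_on: int = 0
    out_buf: list = field(default_factory=lambda: [0.0] * 16)
    out_buf_adr: int = 0
    param_buf_adr: int = 0


def on_voice_check(state: VoiceState, input_level: float,
                   threshold: float) -> str:
    """Return which branch was taken: 'stop', 'continue' or 'start'."""
    # Step 2404: is original voice at or above the predetermined volume?
    if input_level < threshold:
        # Step 2405: end sound generation and clear note_on.
        state.note_on = 0
        return "stop"
    # Step 2406: has voice input been continuing?
    if state.note_on == 1:
        return "continue"  # proceeds to step 2408
    # Step 2407: clear the output buffer, rewind both addresses to the
    # head of their buffers, and mark the voice input as started.
    state.out_buf = [0.0] * len(state.out_buf)
    state.out_buf_adr = 0
    state.param_buf_adr = 0
    state.note_on = 1
    return "start"  # proceeds to step 2411
```

Under these assumptions, the first interrupt at which voice is detected returns "start", later interrupts return "continue" while the voice persists, and the first interrupt without voice returns "stop".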

ステップ２４０８では、楽音生成部９を介して出力バッファの音声データをＤ／Ａ変換器１０に送出することにより楽音の発音を継続させる処理を実行する。このとき、ＭＩＤＩデータで指示された楽音を発音中であった場合、楽音生成部９は、その楽音発音用に生成した波形データに音声データをｍｉｘさせる。 In step 2408, a process for continuing the sound generation of the musical sound by sending the audio data in the output buffer to the D / A converter 10 via the musical sound generating unit 9 is executed. At this time, if the musical sound instructed by the MIDI data is being generated, the musical sound generating unit 9 mixes the sound data with the waveform data generated for the musical sound generation.

ステップ２４０８に続くステップ２４０９では、フレーム合成タイミングか否か判定する。そのタイミング（フレームサイズによってその周期は変動する）であった場合、判定はＹＥＳとなり、パラメータバッファ３５の最後に位置するフレームのパラメータを読み出していなければ次にパラメータを読み出すべきフレームに応じて変数ｐａｒａｍ＿ｂｕｆ＿ａｄｒの値を更新する（ステップ２４１０）。その更新後に移行するステップ２４１１では、パラメータバッファ３５の変数ｐａｒａｍ＿ｂｕｆ＿ａｄｒの値で指定されるアドレスからパラメータを読み込み、音声データの合成を行い、その合成によって得られた１フレーム分の音声データを既に生成した他のフレームに加算することにより、新たに合成した１フレーム分の音声データを出力バッファに保存された音声データに加える。具体的には、元音声データから抽出したピッチのRosenberg 波、及びホワイトノイズ波を生成し、生成したそれらの波形を有声音比率ｍｉｘに応じて混合することにより駆動音源波形を生成し、生成した駆動音源波形、及びＰＡＲＣＯＲ係数から、テンポ値に応じて変化させるフレームサイズの音声データを合成し、合成した音声データにハニング窓を乗算した後、変数ｏｕｔ＿ｂｕｆ＿ａｄｒの値で指定されるアドレスから既に生成した他のフレームに加算して重畳し、その重畳後に次のフレームの書き込みを開始すべきアドレスを指定する値をテンポ値に応じて変化させるホップサイズを考慮して決定し変数ｏｕｔ＿ｂｕｆ＿ａｄｒに代入して更新する。一連の処理はその後に終了する。 In step 2409 following step 2408, it is determined whether or not it is the frame synthesis timing. If it is that timing (whose period varies with the frame size), the determination is YES, and, if the parameters of the last frame in the parameter buffer 35 have not yet been read, the value of the variable param_buf_adr is updated according to the frame whose parameters are to be read next (step 2410). In step 2411, to which the process proceeds after the update, the parameters are read from the address of the parameter buffer 35 specified by the value of the variable param_buf_adr, voice data is synthesized, and the resulting one frame of voice data is added to the other frames already generated, so that the newly synthesized frame is added to the voice data stored in the output buffer. Specifically, a Rosenberg wave at the pitch extracted from the original voice data and a white noise wave are generated, and the driving sound source waveform is generated by mixing these waveforms according to the voiced sound ratio mix. Voice data of a frame size that changes according to the tempo value is synthesized from the generated driving sound source waveform and the PARCOR coefficients, the synthesized voice data is multiplied by a Hanning window, and the result is added to and superimposed on the other frames already generated, starting at the address specified by the value of the variable out_buf_adr. After the superposition, the value specifying the address at which writing of the next frame should start is determined in consideration of the hop size, which changes according to the tempo value, and is assigned to the variable out_buf_adr. The series of processes then ends.
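
Two parts of step 2411, the mixing of the two source waveforms and the windowed overlap-add into the output buffer, can be sketched in Python as follows. This is a sketch under stated assumptions and not the embodiment's implementation: the function names are hypothetical, the linear crossfade in `mix_excitation` is an assumed mixing law (this passage only states that the waveforms are mixed according to the voiced sound ratio mix), a periodic Hanning window is assumed, and the generation of the Rosenberg wave and the PARCOR synthesis filter are omitted.

```python
import math


def hanning(n: int) -> list:
    """Periodic Hanning window of length n (assumed window variant)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def mix_excitation(voiced: list, noise: list, mix: float) -> list:
    """Blend a pitched wave and a noise wave by the voiced ratio mix.

    Assumed law: mix = 1 gives only the pitched wave, mix = 0 only noise.
    """
    return [mix * v + (1.0 - mix) * w for v, w in zip(voiced, noise)]


def overlap_add(out_buf: list, frame: list,
                out_buf_adr: int, hop_size: int) -> int:
    """Window one synthesized frame, add it into out_buf at out_buf_adr,
    and return the address at which the next frame should start."""
    window = hanning(len(frame))
    for i, (sample, w) in enumerate(zip(frame, window)):
        out_buf[out_buf_adr + i] += sample * w
    return out_buf_adr + hop_size
```

With a hop size of one eighth of the frame size, as implied by an overlap factor of 8, the periodic Hanning windows of overlapping frames sum to a constant of 4 in the steady state, so a gain correction may be needed depending on how the output level is defined; that correction is not shown here.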

このようにして、第２の実施例では、音声入力ありと判定した場合にのみ、音声データの合成を行うようにしている。それにより、音声データの合成を行わせる期間、そのタイミング、及びそれのピッチ（音高）を歌唱者が音声の発音を通して制御できるようにさせている。 In this manner, in the second embodiment, the voice data is synthesized only when it is determined that there is a voice input. Thereby, the period when the voice data is synthesized, the timing, and the pitch (pitch) thereof can be controlled by the singer through the pronunciation of the voice.

なお、第２の実施例は、シーケンサーを搭載した電子楽器に本発明を適用したものであるが、シーケンサーを搭載していない装置に本発明を適用させても良い。その場合でも、ＳＭＰＴＥ等の同期方法を使用することにより、シーケンサーによるシーケンスデータ（ＳＭＦ等）の再生と同期させることができる。 In the second embodiment, the present invention is applied to an electronic musical instrument equipped with a sequencer. However, the present invention may be applied to an apparatus not equipped with a sequencer. Even in such a case, by using a synchronization method such as SMPTE, it is possible to synchronize with reproduction of sequence data (SMF, etc.) by a sequencer.

フレームサイズの設定は、パラメータファイル３６に格納された分析時テンポ値を基準にする形で行っているが、基準となるテンポ値としてはＳＭＦ２００１等に設定されるテンポ値、或いはその再生時に設定されていたテンポ値などを採用するようにしても良い。そのテンポ値をユーザが任意に設定できるようにしても良い。このようなことから、分析時に必ずしもテンポ値をパラメータとして保存しなくとも良い。 The frame size is set with the analysis-time tempo value stored in the parameter file 36 as the reference, but the tempo value set in the SMF 2001 or the like, or the tempo value that was set at the time of its playback, may be adopted as the reference tempo value instead. The user may also be allowed to set that tempo value arbitrarily. For this reason, the tempo value does not necessarily have to be saved as a parameter at the time of analysis.

本実施例（第１及び第２の実施例）は、音声分析と音声合成の両方を行えるようになっているが、それらは別の装置として実現させても良い。つまり本発明を適用させた音声分析装置、音声合成装置は必ずしも同じ装置に搭載しなくとも良い。そのようにしても、抽出したパラメータをデータファイルとして様々な人の間で交換することができる。 In this embodiment (first and second embodiments), both voice analysis and voice synthesis can be performed, but they may be realized as separate devices. That is, the speech analysis apparatus and speech synthesis apparatus to which the present invention is applied are not necessarily installed in the same apparatus. Even in such a case, the extracted parameters can be exchanged between various people as a data file.

上述したような音声分析合成装置、音声分析装置、音声合成装置、或いはその変形例を実現させるようなプログラムは、ＣＤ−ＲＯＭ、ＤＶＤ、或いは光磁気ディスク等の記録媒体に記録させて配布しても良い。或いは、公衆網等で用いられる伝送媒体を介して、そのプログラムの一部、若しくは全部を配信するようにしても良い。そのようにした場合には、ユーザーはプログラムを取得してコンピュータなどのデータ処理装置にロードすることにより、そのデータ処理装置を用いて本発明を適用させた音声分析合成装置、音声分析装置、或いは音声合成装置を実現させることができる。このことから、記録媒体は、プログラムを配信する装置がアクセスできるものであっても良い。 A program that realizes the speech analysis / synthesis apparatus, speech analysis apparatus, or speech synthesis apparatus described above, or a modification thereof, may be recorded on a recording medium such as a CD-ROM, DVD, or magneto-optical disk and distributed. Alternatively, part or all of the program may be distributed via a transmission medium used in a public network or the like. In such a case, the user can obtain the program and load it into a data processing device such as a computer, thereby realizing, with that data processing device, a speech analysis / synthesis apparatus, speech analysis apparatus, or speech synthesis apparatus to which the present invention is applied. Accordingly, the recording medium may be one that is accessible by a device that distributes the program.

本実施例による音声分析合成装置を搭載した電子楽器の構成図である。 It is a block diagram of an electronic musical instrument equipped with the speech analysis / synthesis apparatus according to the present embodiment.
スイッチ部を構成する一部のスイッチの配置を示す図である。 It is a diagram showing the arrangement of some of the switches constituting the switch unit.
第１の実施例による音声分析合成装置の機能構成図である。 It is a functional configuration diagram of the speech analysis / synthesis apparatus according to the first embodiment.
ピッチ抽出／有声音比率算出部の機能構成図である。 It is a functional configuration diagram of the pitch extraction / voiced sound ratio calculation unit.
パラメータバッファに格納されるパラメータを説明する図である。 It is a diagram explaining the parameters stored in the parameter buffer.
フレームの切り出し方法を説明する図である。 It is a diagram explaining the method of cutting out frames.
Rosenberg 波を説明する図である。 It is a diagram explaining the Rosenberg wave.
フレーム単位で合成した音声データのフレーム間の加算方法を説明する図である。 It is a diagram explaining the method of adding, between frames, the voice data synthesized in units of frames.
全体処理のフローチャートである。 It is a flowchart of the overall process.
スイッチ処理のフローチャートである。 It is a flowchart of the switch process.
鍵盤処理のフローチャートである。 It is a flowchart of the keyboard process.
フレーム合成処理のフローチャートである。 It is a flowchart of the frame synthesis process.
楽音タイマインタラプト処理のフローチャートである。 It is a flowchart of the musical tone timer interrupt process.
分析処理のフローチャートである。 It is a flowchart of the analysis process.
ＰＡＲＣＯＲ係数算出処理のフローチャートである。 It is a flowchart of the PARCOR coefficient calculation process.
ピッチ周波数算出処理のフローチャートである。 It is a flowchart of the pitch frequency calculation process.
ピッチ補正処理のフローチャートである。 It is a flowchart of the pitch correction process.
混合比算出処理のフローチャートである。 It is a flowchart of the mixing ratio calculation process.
第２の実施例による音声分析合成装置の分析フェーズ用の機能構成図である。 It is a functional configuration diagram for the analysis phase of the speech analysis / synthesis apparatus according to the second embodiment.
第２の実施例による音声分析合成装置の合成フェーズ用の機能構成図である。 It is a functional configuration diagram for the synthesis phase of the speech analysis / synthesis apparatus according to the second embodiment.
フレームサイズの変更方法を説明する図である。 It is a diagram explaining the method of changing the frame size.
スイッチ処理のフローチャートである（第２の実施例）。 It is a flowchart of the switch process (second embodiment).
フレーム修正処理のフローチャートである（第２の実施例）。 It is a flowchart of the frame correction process (second embodiment).
楽音タイマインタラプト処理のフローチャートである（第２の実施例）。 It is a flowchart of the musical tone timer interrupt process (second embodiment).

Explanation of symbols

１ＣＰＵ
３スイッチ部
４ＲＯＭ
５ＲＡＭ
７マイク
８Ａ／Ｄ変換器
９楽音生成部
１０Ｄ／Ａ変換器
１１アンプ
１２スピーカ
１３外部記憶装置

1 CPU
3 Switch unit
4 ROM
5 RAM
7 Microphone
8 A/D converter
9 Musical sound generation unit
10 D/A converter
11 Amplifier
12 Speaker
13 External storage device

Claims

In a speech analysis and synthesis device that analyzes a first speech waveform and synthesizes a second speech waveform using the analysis result,
First analysis means for analyzing the first speech waveform in units of frames and extracting parameters;
Second analysis means for analyzing the first speech waveform in units of frames to calculate the autocorrelation value of the frequency amplitude value of the first speech waveform in each frame, calculating a standard deviation from the autocorrelation value of each frame, and extracting a voiced sound ratio whose value increases as the value of the standard deviation decreases;
A pitch designating means for designating the pitch;
A sound source waveform generating means for generating a sound source waveform simulating a vocal cord sound source waveform at a pitch specified by the pitch specifying means;
Other sound source waveform generating means for generating another sound source waveform having no pitch,
Speech waveform synthesis means for generating a drive sound source waveform by mixing the sound source waveform with the other sound source waveform so that the mixing ratio of the sound source waveform increases as the value of the extracted voiced sound ratio increases, and synthesizing the second speech waveform using the drive sound source waveform and the parameter;
A speech analysis / synthesis apparatus comprising:

The parameter obtained from the first speech waveform and the voiced sound ratio are stored in a storage unit capable of storing a plurality of data groups including the parameter and the voiced sound ratio,
The speech waveform synthesis means synthesizes the second speech waveform using one of the data groups stored in the storage means;
The speech analysis / synthesis apparatus according to claim 1.

The first analysis means extracts the frequency amplitude value and phase information of the first speech waveform in each frame, calculates an autocorrelation value of the frequency amplitude value, and extracts, as the parameter, the pitch of the first speech waveform from the frequency amplitude value and the phase information at which the autocorrelation value is maximized,
The speech analysis / synthesis apparatus according to claim 1.

When a pitch change of a predetermined value or more between the pitch extracted in the current frame and the pitch extracted in the frame preceding the current frame has continued for a predetermined number of consecutive times, the first analysis means adopts the pitch extracted in the preceding frame as the parameter, and otherwise adopts the pitch extracted in the current frame as the parameter,
The speech analysis / synthesis apparatus according to claim 3.

In a speech analyzer that extracts parameters from speech waveforms,
Voice waveform acquisition means for acquiring the voice waveform;
A first analysis unit that analyzes the speech waveform acquired by the speech waveform acquisition unit in units of frames and extracts filter coefficients used in a synthesis filter for synthesis of the speech waveform as a parameter;
The speech waveform acquired by the speech waveform acquisition means is analyzed in units of frames to calculate the autocorrelation value of the frequency amplitude value of the first speech waveform in each frame, and the standard from the autocorrelation value of each frame Second analysis means for calculating a deviation, and extracting a voiced sound ratio whose value increases as the value of the standard deviation decreases as a parameter for generating a sound source waveform input to the synthesis filter;
A voice analysis device comprising:

A program to be executed by a speech analysis / synthesis apparatus that analyzes a first speech waveform and synthesizes a second speech waveform using the analysis result,
A first analysis function for analyzing the first speech waveform in units of frames and extracting parameters;
A second analysis function for analyzing the first speech waveform in units of frames to calculate the autocorrelation value of the frequency amplitude value of the first speech waveform in each frame, calculating a standard deviation from the autocorrelation value of each frame, and extracting a voiced sound ratio whose value increases as the value of the standard deviation decreases;
A pitch specification function for specifying the pitch,
A sound source waveform generating function for generating a sound source waveform simulating a vocal cord sound source waveform at a pitch specified by the pitch specifying function;
Other sound source waveform generation functions for generating other sound source waveforms having no pitch,
A speech waveform synthesis function for generating a drive sound source waveform by mixing the sound source waveform with the other sound source waveform so that the mixing ratio of the sound source waveform increases as the value of the extracted voiced sound ratio increases, and synthesizing the second speech waveform using the drive sound source waveform and the parameter;
A program to realize

A program to be executed by a speech analyzer that extracts parameters from a speech waveform,
A voice waveform acquisition function for acquiring the voice waveform;
A first analysis function for analyzing a voice waveform acquired by the voice waveform acquisition function in units of frames and extracting filter coefficients used for a synthesis filter for synthesis of the voice waveform as a parameter;
A second analysis function for analyzing the speech waveform acquired by the speech waveform acquisition function in units of frames to calculate an autocorrelation value, calculating a standard deviation from the autocorrelation value of each frame, and extracting, as a parameter for generating a sound source waveform input to the synthesis filter, a voiced sound ratio whose value increases as the value of the standard deviation decreases;
A program to realize