JP2013015693A - Spoken word analyzer, method thereof, and program - Google Patents

Spoken word analyzer, method thereof, and program

Info

Publication number
JP2013015693A
JP2013015693A (application JP2011148817A)
Authority
JP
Japan
Prior art keywords
utterance
speech
fundamental frequency
sequence
accent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2011148817A
Other languages
Japanese (ja)
Other versions
JP5588932B2 (en)
Inventor
Hideji Nakajima
秀治 中嶋
Hideyuki Mizuno
秀之 水野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2011148817A priority Critical patent/JP5588932B2/en
Publication of JP2013015693A publication Critical patent/JP2013015693A/en
Application granted granted Critical
Publication of JP5588932B2 publication Critical patent/JP5588932B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

PROBLEM TO BE SOLVED: To provide a spoken word analyzer that extracts prominence sections from speech data without using training data.
SOLUTION: The spoken word analyzer takes as input a prominence extraction target speech to which language labels have been assigned, and synthesizes read-aloud-style synthetic speech having those language labels. With the target speech and the synthetic speech as inputs, it extracts a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech. A prominence section extraction unit in the analyzer takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the target speech, and outputs prominence section information on the basis of a correlation between X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

Description

The present invention relates to a spoken word analyzer, a method thereof, and a program for automatically extracting speech sections corresponding to prominence, that is, emphasis and suppression, in spoken utterances.

For example, in "expressive speech" uttered naturally in scenes such as delivering lines matched to a movie scene, telling a fairy tale, advertising products through media such as television, and answering calls at a call center, prominence in the form of emphasis and suppression is used frequently. Such prominence is relative: it becomes apparent only through comparison with some reference. It is therefore difficult to automatically extract prominence from a given speech signal alone when the reference is unknown. Until now, prominence sections have been designated in advance, and speech uttered with prominence in those sections has been recorded and used.

In conventional automatic prominence labeling, the assignment of "emphasis" or "no emphasis (non-emphasis)" is formulated as a binary classification problem, and emphasized locations are extracted with a binary classifier. This method is disclosed in Non-Patent Document 1. Non-Patent Document 1 requires training speech data in which emphasis sections have been labeled manually in advance. In the training speech, locations without emphasis are given a non-emphasis label at the same time that the emphasis sections are labeled.

The binary classifier is built with the following input variables: a sequence of category labels representing speech units such as syllables; numerical values indicating the position of each speech unit within its phrase or sentence; category labels representing linguistic features related to prosody, such as the position of the accent nucleus in each phrase; and the difference between the fundamental frequency of synthetic speech generated from these by an ordinary speech synthesizer and the fundamental frequency of the original training speech. Its output variable is the binary label "emphasis" or "non-emphasis". Using the constructed binary classifier, new speech data other than the training data is classified as emphasis or non-emphasis, and the prominence sections corresponding to emphasis are extracted from the speech data.
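As a rough, hypothetical illustration of this prior-art formulation (the actual features and classifier of Non-Patent Document 1 are more elaborate; the feature layout below is invented for the example), a binary emphasis classifier over per-unit feature vectors could be set up like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one speech unit (e.g. a syllable). The three features are
# hypothetical stand-ins for the kinds of inputs described above:
# [relative position of the unit in its phrase,
#  accent-nucleus position of the phrase,
#  F0 difference between ordinary synthetic speech and the original training speech (Hz)]
X_train = np.array([[0.2, 1, -12.0],
                    [0.5, 1,  35.0],
                    [0.8, 2,   4.0],
                    [0.3, 0,  41.0]])
y_train = np.array([0, 1, 0, 1])   # manual labels: 1 = emphasis, 0 = non-emphasis

clf = LogisticRegression().fit(X_train, y_train)

# New speech data (other than the training data) is classified unit by unit;
# units predicted as 1 form the extracted emphasis sections.
X_new = np.array([[0.4, 1, 28.0]])
print(clf.predict(X_new))
```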

J. Xu and L.-H. Cai, "Automatic emphasis labeling for emotional speech by measuring prosody generation error," Proceedings of ICIC, pp. 177-186, 2009.

However, the conventional method requires training data with emphasis/non-emphasis labels in order to extract prominence sections. Constructing a binary classifier that discriminates prominence sections with high accuracy requires a large amount of accurately labeled training data, and such accurately labeled speech data can only be prepared by hand, which is costly.

Thus, automatic extraction of prominence sections is difficult, and in many studies prior to Non-Patent Document 1, prominence labels were attached to the text in advance and speech was recorded by having a speaker utter the labeled locations with prominence. With that method, however, it is difficult to build a speech database consisting of naturally uttered speech in which such emphasis and non-emphasis occur at natural rates.

The present invention was made in view of these problems, and its object is to provide a spoken word analyzer, a method thereof, and a program capable of efficiently extracting prominence sections from speech data without preparing speech data to which emphasis/non-emphasis labels have been assigned manually in advance.

The spoken word analyzer of the present invention takes as input a prominence extraction target speech to which language labels have been assigned, and comprises a speech synthesis unit, a fundamental frequency sequence extraction unit, and a prominence section extraction unit. The speech synthesis unit takes the language labels as input and synthesizes read-aloud-style synthetic speech that has those language labels. The fundamental frequency sequence extraction unit takes the synthetic speech and the prominence extraction target speech as input and extracts a fundamental frequency sequence X1 of the target speech and a fundamental frequency sequence X2 of the synthetic speech. The prominence section extraction unit takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the target speech, and outputs prominence section information on the basis of the correlation between X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between X1 and X2 with respect to the amount of fundamental frequency variation between those accent phrases.

According to the spoken word analyzer of the present invention, prominence section information can be obtained from the prominence extraction target speech without requiring a large amount of training data with accurate emphasis/non-emphasis labels. The high cost of the human labor needed to prepare training data can therefore be eliminated. Furthermore, because accurate prominence sections can be extracted from naturally uttered speech data, the spoken word analyzer of the present invention contributes to greatly accelerating research and development based on speech with natural prominence. In addition, since the synthetic speech generated from the language labels is read-aloud speech with a small range of variation, clearly defined prominence section information can be obtained with that speech as the reference.

FIG. 1 is a diagram showing an example of the functional configuration of the spoken word analyzer 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the spoken word analyzer 100.
FIG. 3 is a diagram showing an example of the functional configuration of the prominence section extraction unit 30.
FIG. 4 is a diagram showing the operation flow of the prominence section extraction unit 30.
FIG. 5 is a diagram showing the determination flow of the prominence determination means 34.
FIG. 6 is a diagram showing an example of the movement of specific accent phrase means M1_i and M2_i.

Embodiments of the present invention are described below with reference to the drawings. The same reference numerals are given to identical items across the drawings, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the spoken word analyzer 100 of the present invention, and FIG. 2 shows its operation flow. The spoken word analyzer 100 comprises a speech synthesis unit 10, a fundamental frequency sequence extraction unit 20, a prominence section extraction unit 30, and a control unit 40. The functions of each part of the spoken word analyzer 100 are realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

Before describing the embodiment, emphasized and suppressed locations are defined. An emphasized or suppressed location can be defined as a relative change within one utterance or within a sequence of utterances. In this embodiment, the reference needed to measure such relative change is speech uttered in a read-aloud style without emphasis or suppression, and the locations where changes appear when expressively uttered speech is compared with that read-aloud speech are taken as the extraction targets. Such changes appear as differences in various physical quantities, such as fundamental frequency variation, utterance duration, and voice quality; this embodiment focuses on changes in the fundamental frequency. That is, a location where the fundamental frequency is relatively high is defined as "emphasis," and a location where it is relatively low is defined as "suppression." In this embodiment, the unit of emphasized and suppressed locations is the accent phrase.

The speech synthesis unit 10 takes as input the language labels assigned to the prominence extraction target speech and synthesizes read-aloud-style synthetic speech that has those language labels (step S10).

The language labels include the type of each speech segment, such as phonemes and syllables, together with the start and end times of those segments; the start and end times of pause segments; and the accent phrase boundaries, their start and end times, and the accent type of each accent phrase. From this information, the length of each accent phrase and the position of each speech unit within its accent phrase can be computed and assigned automatically.
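To make the role of the language labels concrete, the sketch below shows one possible in-memory representation and the two derived quantities mentioned above; the field names and units are assumptions for illustration, not the label format used by the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    kind: str      # e.g. "phoneme", "syllable", "pause"
    label: str     # e.g. "a", "k", "sil"
    start: float   # start time (s)
    end: float     # end time (s)

@dataclass
class AccentPhrase:
    start: float            # accent phrase start time (s)
    end: float              # accent phrase end time (s)
    accent_type: int        # accent type of the accent phrase
    segments: List[Segment]

def phrase_length(ap: AccentPhrase) -> float:
    """Length of the accent phrase, computed from its boundary times."""
    return ap.end - ap.start

def unit_positions(ap: AccentPhrase) -> List[float]:
    """Relative position (0..1) of each speech unit within its accent phrase."""
    return [(s.start - ap.start) / phrase_length(ap) for s in ap.segments]
```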

The synthetic speech generated by the speech synthesis unit 10 can be synthesized with a conventional speech synthesizer (reference: T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," Proceedings of EUROSPEECH, pp. 2347-2350, 1999). Depending on the implementation of the speech synthesis unit 10, the fundamental frequency sequence X2 may be output at the same time as the synthetic speech.

The fundamental frequency sequence extraction unit 20 takes as input the prominence extraction target speech and the synthetic speech synthesized by the speech synthesis unit 10, and extracts the fundamental frequency sequence X1 of the target speech and the fundamental frequency sequence X2 of the synthetic speech (step S20). When the speech synthesis unit 10 outputs the synthetic speech and its fundamental frequency sequence X2 at the same time (see the broken line in FIG. 1), the fundamental frequency sequence extraction unit 20 extracts only the fundamental frequency sequence X1 of the target speech. The fundamental frequency is defined with respect to the shortest period of a periodic signal and is perceived as the pitch of the voice. In the case of speech, neither the opening and closing intervals of the vocal folds nor the waveform is constant, so a fundamental frequency in the strict sense does not exist; the instantaneous frequency of the fundamental wave is therefore taken as the fundamental frequency. The fundamental frequency can be obtained, for example, every 1 ms. Its unit is originally Hz, but either the raw value or the value converted to the natural logarithm with base e (Napier's number) may be used.

One known method of extracting the fundamental frequency is to design a band-pass filter that passes only the fundamental wave and to perform a wavelet transform using its impulse response as the mother wavelet, thereby extracting the fundamental component (reference: H. Katayose, A. de Cheveigne and R. D. Patterson, "Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity," Eurospeech '99, pp. 2781-2784, 1999). Several other methods of extracting the fundamental frequency exist, such as computing it from the time-domain autocorrelation coefficients. Obtaining the fundamental frequency is itself a conventional technique, and a detailed description is omitted.
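As one concrete illustration of the time-domain autocorrelation approach mentioned above (a minimal sketch, not the wavelet-based method of the cited reference; frame size, search range, and voicing threshold are assumptions), the per-frame fundamental frequency could be estimated roughly as follows:

```python
import numpy as np

def estimate_f0_autocorr(x, fs, frame_ms=40.0, hop_ms=1.0,
                         f0_min=60.0, f0_max=400.0):
    """Rough per-frame F0 (Hz) from the time-domain autocorrelation: pick the
    autocorrelation peak inside the plausible pitch-lag range.  0.0 marks
    frames treated as unvoiced."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)               # 1 ms shift, as in the description
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), frame - 1)
    f0 = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] - np.mean(x[start:start + frame])
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        if ac[0] <= 0:                          # silent frame
            f0.append(0.0)
            continue
        ac = ac / ac[0]                         # normalize by zero-lag energy
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0.append(fs / lag if ac[lag] > 0.3 else 0.0)   # 0.3: assumed voicing threshold
    return np.array(f0)
```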

The prominence section extraction unit 30 takes as input the fundamental frequency sequences X1 and X2 extracted by the fundamental frequency sequence extraction unit 20, together with the language labels, and compares the two fundamental frequency sequences. As noted above, the fundamental frequency sequence X2 may instead be input directly from the speech synthesis unit 10. Because the language label information is identical for the fundamental frequency sequence X1 of the target speech and the fundamental frequency sequence X2 of the synthetic speech, the two sequences can be compared. The prominence section extraction unit 30 compares the fundamental frequencies of the sequences X1 and X2, extracts the prominence sections of the target speech, and outputs prominence section information (step S30). The control unit 40 controls the time-sequential operation of the units described above.

As described above, the spoken word analyzer 100 of the present invention synthesizes synthetic speech from the language labels assigned to the prominence extraction target speech, which is uttered naturally in an expressive tone. It then extracts prominence sections by comparing the fundamental frequency sequence X1 of the target speech with the fundamental frequency sequence X2 of the synthetic speech at positions where the language labels match. Therefore, unlike the conventional method, prominence sections can be extracted from the target speech without using training speech data. As noted above, the speech synthesis unit 10 and the fundamental frequency sequence extraction unit 20 can be realized with conventional techniques, but the overall configuration of the spoken word analyzer 100 shown in FIG. 1 is itself new. A particularly new part is the prominence section extraction unit 30, which extracts prominence sections from the two fundamental frequency sequences.

FIG. 3 shows a more detailed example of the functional configuration of the prominence section extraction unit 30, and its operation is described below. The prominence section extraction unit 30 comprises mean/standard deviation calculation means 31, normalization means 32, accent phrase mean sequence extraction means 33, and prominence determination means 34. FIG. 4 shows the operation flow of the prominence section extraction unit 30.

The mean/standard deviation calculation means 31 takes as input the fundamental frequency sequences X1 and X2 output by the fundamental frequency sequence extraction unit 20, and calculates the utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and the utterance standard deviations σ1 and σ2 (step S31).

The normalization means 32 obtains the utterance-normalized sequences M1 and M2 by subtracting the utterance means μ1 and μ2 from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations σ1 and σ2 (step S32, equation (1)).

M1 = (X1 - μ1) / σ1,   M2 = (X2 - μ2) / σ2   (1)

This normalization makes it possible to remove individual differences between speakers and improves the accuracy of prominence section extraction.
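A minimal sketch of steps S31-S32 (the utterance-level normalization of equation (1)) is given below; treating frames with F0 = 0 as unvoiced and excluding them from the statistics is an assumption, not stated in the patent.

```python
import numpy as np

def normalize_utterance(f0):
    """Utterance-level normalization of equation (1): subtract the utterance
    mean and divide by the utterance standard deviation."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    mu = f0[voiced].mean()       # utterance mean      (step S31)
    sigma = f0[voiced].std()     # utterance std. dev. (step S31)
    m = np.zeros_like(f0)
    m[voiced] = (f0[voiced] - mu) / sigma   # equation (1) (step S32)
    return m, mu, sigma

# M1, mu1, sigma1 = normalize_utterance(X1)   # target speech
# M2, mu2, sigma2 = normalize_utterance(X2)   # synthetic speech
```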

The accent phrase mean sequence extraction means 33 takes as input the utterance-normalized sequences M1 and M2 and the language labels, divides each of the sequences M1 and M2 by accent phrase i, and calculates the accent phrase means M1_i and M2_i, which are the mean normalized values of the i-th accent phrase (step S33), where i ranges from 1 to n.
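Step S33 can be sketched as follows, assuming the accent-phrase boundaries are available from the language labels as start/end times in seconds and the normalized F0 sequence has a constant 1 ms frame shift (both assumptions for the sketch).

```python
import numpy as np

def accent_phrase_means(m, phrase_bounds, hop_s=0.001):
    """Mean normalized F0 per accent phrase (step S33).
    m            : utterance-normalized F0 sequence, one value per frame
    phrase_bounds: list of (start_s, end_s) per accent phrase, taken from the
                   language labels (this representation is an assumption)
    hop_s        : frame shift in seconds (1 ms, as in the description)"""
    means = []
    for start, end in phrase_bounds:
        seg = m[int(start / hop_s):int(end / hop_s)]
        seg = seg[seg != 0]                  # skip unvoiced frames (assumption)
        means.append(float(seg.mean()) if seg.size else 0.0)
    return np.array(means)                   # [M_1, ..., M_n]
```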

The prominence determination means 34 takes the accent phrase means M1_i and M2_i as input and, for each pair of adjacent accent phrases up to the last accent phrase n, outputs prominence section information representing the prominence sections on the basis of the correlation of the direction of variation and a comparison of the amount of variation (step S34).

Specifically, the prominence determination means 34 takes the accent phrase means M1_i and M2_i as input, computes the means μ1_i (equation (2)) and μ2_i (equation (3)) over each pair of adjacent accent phrases, the correlation coefficient ρ12_i between them, the amounts of variation Δ1_i and Δ2_i, and the variation ratio r_i obtained by dividing Δ1_i by Δ2_i, and outputs prominence section information indicating, for each accent phrase i, emphasis, suppression, or a prominence-free section with neither emphasis nor suppression.

The correlation coefficient ρ12_i is given by equation (4).

(Equations (2)-(4): μ1_i and μ2_i are the means of the utterance-normalized sequences M1 and M2 over the pair of adjacent accent phrases i and i+1, and ρ12_i is the correlation coefficient between the two sequences over that pair; the original equation images are not reproduced here.)

The amounts of variation Δ1_i and Δ2_i and the variation ratio r_i are given by equations (5), (6), and (7).

(Equations (5) and (6): Δ1_i and Δ2_i, the amounts of variation of M1 and M2 between the adjacent accent phrases i and i+1; the original equation images are not reproduced here. Equation (7): r_i = Δ1_i / Δ2_i.)

FIG. 5 shows the operation flow of the prominence determination means 34, and the flow of its determination is described below. In the operation flow of FIG. 5, with i denoting the accent phrase number, 1 the number of the first accent phrase in the sentence, and n the number of the last accent phrase, the following processing, which uses accent phrases i and i+1 to classify accent phrase i+1 as emphasis, suppression, or a prominence-free section with neither emphasis nor suppression, is repeated up to the last accent phrase n (step S340).

The prominence determination means 34 computes the correlation coefficient ρ12_i for each accent phrase i (step S341). It then computes the amounts of variation Δ1_i and Δ2_i between the adjacent accent phrases i and i+1 and the variation ratio r_i (step S342).

When the correlation coefficient ρ12_i indicates a positive correlation (YES in step S343), accent phrase i+1 is determined to be an emphasis section if Δ1_i > Δ2_i and the variation ratio r_i is outside the threshold interval (YES in step S344); emphasis is determined, for example, when the variation ratio r_i exceeds 120%. If NO is determined in step S344, accent phrase i+1 is determined to be a suppression section if Δ1_i < Δ2_i and the variation ratio r_i is outside the threshold interval (YES in step S345); suppression is determined, for example, when the variation ratio r_i is less than 80%. If NO is determined in step S345, accent phrase i+1 is determined to have neither emphasis nor suppression.

If it is determined in step S343 that there is no positive correlation (NO in step S343), it is determined whether the correlation is negative (step S346). If it is determined in step S346 that there is no negative correlation either, accent phrase i+1 is determined to have neither emphasis nor suppression (NO in step S346).

If it is determined in step S346 that there is a negative correlation (YES in step S346), accent phrase i+1 is determined to be emphasis when the variation Δ1_i is in the positive direction and the variation Δ2_i is in the negative direction (YES in step S347). If NO is determined in step S347, accent phrase i+1 is determined to be suppression when the variation Δ1_i is in the negative direction and the variation Δ2_i is in the positive direction (YES in step S348). If NO is determined in step S348, accent phrase i+1 is determined to have neither emphasis nor suppression.
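The sketch below follows the determination flow of FIG. 5 under the following assumptions: Δ1_i and Δ2_i are taken as the differences M1_{i+1} − M1_i and M2_{i+1} − M2_i, the sign of the correlation is taken from the sign of Δ1_i·Δ2_i (over a pair of two values, a positive or negative correlation reduces to the two deltas moving in the same or opposite directions), and the threshold interval is the 80%-120% band given as an example above.

```python
import numpy as np

def classify_prominence(m1, m2, r_low=0.8, r_high=1.2):
    """Classify accent phrases 2..n as 'emphasis', 'suppression' or 'none',
    following the flow of FIG. 5.  m1, m2: accent phrase means M1_i, M2_i of
    the target and the synthetic speech."""
    labels = ["none"]                        # the first accent phrase has no preceding pair
    for i in range(len(m1) - 1):
        d1 = m1[i + 1] - m1[i]               # variation of the target speech   (Δ1_i)
        d2 = m2[i + 1] - m2[i]               # variation of the synthetic speech (Δ2_i)
        r = d1 / d2 if d2 != 0 else np.inf   # variation ratio r_i = Δ1_i / Δ2_i
        if d1 * d2 > 0:                      # same direction -> positive correlation
            if d1 > d2 and not (r_low <= r <= r_high):
                labels.append("emphasis")        # e.g. r_i above 120% (step S344)
            elif d1 < d2 and not (r_low <= r <= r_high):
                labels.append("suppression")     # e.g. r_i below 80%  (step S345)
            else:
                labels.append("none")
        elif d1 * d2 < 0:                    # opposite directions -> negative correlation
            labels.append("emphasis" if d1 > 0 else "suppression")   # steps S347/S348
        else:
            labels.append("none")            # neither positive nor negative correlation
    return labels
```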

FIG. 6 shows a concrete example of the movement of the accent phrase mean sequences M1_i and M2_i and further illustrates the operation of the prominence determination means 34. The horizontal axis of FIG. 6 represents the progression of accent phrases, that is, elapsed time, and the vertical axis represents the normalized value. The movement of the accent phrase mean sequences at accent phrases i and i+1 in FIG. 6 shows an example in which accent phrase i+1 of the prominence extraction target speech is determined to be emphasis. Since the change in the mean μ1_i between the accent phrases is in the positive direction and the change in the mean μ2_i is in the negative direction, the correlation coefficient ρ12_i indicates a negative correlation. In this case, in the operation flow of FIG. 5, step S346 gives YES, and since the variation Δ1_i is in the positive direction and the variation Δ2_i is in the negative direction, step S347 gives YES, so accent phrase i+1 is determined to be emphasis.

Next, the change in the mean μ1_{i+1} between accent phrases i+1 and i+2 is in the negative direction and the change in the mean μ2_{i+1} is also in the negative direction, so the correlation coefficient ρ12_{i+1} indicates a positive correlation (YES in step S343). In this case, the variation ratio r_{i+1} is clearly within the threshold interval, that is, within the range of 80% to 120%, so accent phrase i+2 is determined to have neither emphasis nor suppression (NO in step S345).

As described above, the prominence of an accent phrase can be determined by using the correlation coefficient of the changes of the accent phrase mean sequence M1_i derived from the prominence extraction target speech and the accent phrase mean sequence M2_i derived from the synthetic speech generated from the language labels, together with the amounts of variation Δ1_i and Δ2_i and the variation ratio r_i. Thus, the spoken word analyzer 100 of the present invention makes it possible to obtain prominence section information from speech data without requiring a large amount of training data with accurate emphasis/non-emphasis labels. In addition, since the synthetic speech generated from the language labels is read-aloud speech with a small range of variation, clearly defined prominence section information can be obtained with that speech as the reference.
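Putting the sketches above together, the overall flow of FIG. 2 might look as follows; synthesize_read_speech is a hypothetical stand-in for an external read-aloud synthesizer (e.g. an HMM-based one), and the helper functions are the earlier sketches, so this is an illustration of the processing order rather than a definitive implementation.

```python
def analyze_prominence(target_wav, fs, labels,
                       phrase_bounds_target, phrase_bounds_synth,
                       synthesize_read_speech):
    """End-to-end sketch of the flow of FIG. 2, chaining the earlier sketches.
    synthesize_read_speech(labels) -> waveform is a hypothetical stand-in for
    the speech synthesis unit 10."""
    synth_wav = synthesize_read_speech(labels)                  # step S10
    x1 = estimate_f0_autocorr(target_wav, fs)                   # step S20
    x2 = estimate_f0_autocorr(synth_wav, fs)
    m1, _, _ = normalize_utterance(x1)                          # steps S31-S32
    m2, _, _ = normalize_utterance(x2)
    m1_i = accent_phrase_means(m1, phrase_bounds_target)        # step S33
    m2_i = accent_phrase_means(m2, phrase_bounds_synth)
    return classify_prominence(m1_i, m2_i)                      # step S34
```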

When the processing means of the above apparatus are realized by a computer, the processing contents of the functions each apparatus should have are described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the recording device of a server computer and transferring it from the server computer to other computers via a network.

Each of the means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be realized in hardware.

Claims (5)

1. A spoken word analyzer that takes as input a prominence extraction target speech to which language labels have been assigned, comprising:
a speech synthesis unit that synthesizes read-aloud-style synthetic speech having the language labels;
a fundamental frequency sequence extraction unit that takes the synthetic speech and the prominence extraction target speech as input and extracts a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech; and
a prominence section extraction unit that takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the prominence extraction target speech, and outputs prominence section information on the basis of a correlation between the fundamental frequency sequences X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between the fundamental frequency sequences X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

2. The spoken word analyzer according to claim 1, wherein the prominence section extraction unit comprises:
mean/standard deviation calculation means that takes the fundamental frequency sequences X1 and X2 as input and calculates utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and utterance standard deviations σ1 and σ2, which are their standard deviations;
normalization means that obtains utterance-normalized sequences M1 and M2 by subtracting the utterance means from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations;
accent phrase mean sequence extraction means that takes the utterance-normalized sequences M1 and M2 and the language labels as input, divides each sequence into accent phrases i, and obtains accent phrase means M1_i and M2_i of the utterance-normalized sequences for each accent phrase i (i = 1 to n); and
prominence determination means that takes the accent phrase means M1_i and M2_i as input and outputs, for each pair of adjacent accent phrase means up to the last accent phrase n, prominence section information representing prominence sections on the basis of a correlation of the direction of variation and a comparison of the amount of variation.

3. A spoken word analysis method comprising:
a speech synthesis step of taking as input a prominence extraction target speech to which language labels have been assigned and synthesizing read-aloud-style synthetic speech having the language labels;
a fundamental frequency sequence extraction step of taking the synthetic speech and the prominence extraction target speech as input and extracting a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech; and
a prominence section extraction step of taking the fundamental frequency sequences X1 and X2 and the language labels as input, extracting prominence sections of the prominence extraction target speech, and outputting prominence section information on the basis of a correlation between the fundamental frequency sequences X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between the fundamental frequency sequences X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

4. The spoken word analysis method according to claim 3, wherein the prominence section extraction step comprises:
a mean/standard deviation calculation step of taking the fundamental frequency sequences X1 and X2 as input and calculating utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and utterance standard deviations σ1 and σ2, which are their standard deviations;
a normalization step of obtaining utterance-normalized sequences M1 and M2 by subtracting the utterance means from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations;
an accent phrase mean sequence extraction step of taking the utterance-normalized sequences M1 and M2 and the language labels as input, dividing each sequence into accent phrases i, and obtaining accent phrase means M1_i and M2_i of the utterance-normalized sequences for each accent phrase i (i = 1 to n); and
a prominence determination step of taking the accent phrase means M1_i and M2_i as input and outputting, for each pair of adjacent accent phrase means up to the last accent phrase n, prominence section information representing prominence sections on the basis of a correlation of the direction of variation and a comparison of the amount of variation.

5. A program for causing a computer to function as the spoken word analyzer according to claim 1 or 2.
JP2011148817A 2011-07-05 2011-07-05 Speech analysis device, method and program Expired - Fee Related JP5588932B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011148817A JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011148817A JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Publications (2)

Publication Number Publication Date
JP2013015693A true JP2013015693A (en) 2013-01-24
JP5588932B2 JP5588932B2 (en) 2014-09-10

Family

ID=47688422

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011148817A Expired - Fee Related JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Country Status (1)

Country Link
JP (1) JP5588932B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016142936A (en) * 2015-02-03 2016-08-08 株式会社日立超エル・エス・アイ・システムズ Preparing method for data for speech synthesis, and preparing device data for speech synthesis
JP2016173430A (en) * 2015-03-17 2016-09-29 日本電信電話株式会社 Speech intention model learning device, speech intention extraction device, speech intention model learning method, speech intention extraction method and program
JP2020154332A (en) * 2020-06-17 2020-09-24 カシオ計算機株式会社 Emotion estimation device, emotion estimation method, and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
WO1997036286A1 (en) * 1996-03-25 1997-10-02 Arcadia, Inc. Sound source generator, voice synthesizer and voice synthesizing method
JP2001147919A (en) * 1999-11-24 2001-05-29 Sharp Corp Device and method for processing voice and storage medium to be utilized therefor
JP2003316378A (en) * 2001-08-08 2003-11-07 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and apparatus and program therefor
JP2008040259A (en) * 2006-08-08 2008-02-21 Yamaha Corp Musical piece practice assisting device, dynamic time warping module, and program

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
WO1997036286A1 (en) * 1996-03-25 1997-10-02 Arcadia, Inc. Sound source generator, voice synthesizer and voice synthesizing method
JP2001147919A (en) * 1999-11-24 2001-05-29 Sharp Corp Device and method for processing voice and storage medium to be utilized therefor
JP2003316378A (en) * 2001-08-08 2003-11-07 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and apparatus and program therefor
JP2008040259A (en) * 2006-08-08 2008-02-21 Yamaha Corp Musical piece practice assisting device, dynamic time warping module, and program

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CSNG199700054006; 大槻 恭士: "Properties of phoneme labels assigned by automatic training," IEICE Technical Report, Speech, SP96-25, 1996-06-14, pp. 39-46, The Institute of Electronics, Information and Communication Engineers *
CSNG200000598001; 小林 聡: "Perceived pitch, loudness, and speed of speech and related physical quantities," IPSJ SIG Technical Report, Vol. 96, No. 123, 1996-12-13, pp. 1-8, Information Processing Society of Japan *
JPN6014030292; 大槻 恭士: "Properties of phoneme labels assigned by automatic training," IEICE Technical Report, Speech, SP96-25, 1996-06-14, pp. 39-46, The Institute of Electronics, Information and Communication Engineers *
JPN6014030293; 小林 聡: "Perceived pitch, loudness, and speed of speech and related physical quantities," IPSJ SIG Technical Report, Vol. 96, No. 123, 1996-12-13, pp. 1-8, Information Processing Society of Japan *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016142936A (en) * 2015-02-03 2016-08-08 株式会社日立超エル・エス・アイ・システムズ Preparing method for data for speech synthesis, and preparing device data for speech synthesis
JP2016173430A (en) * 2015-03-17 2016-09-29 日本電信電話株式会社 Speech intention model learning device, speech intention extraction device, speech intention model learning method, speech intention extraction method and program
JP2020154332A (en) * 2020-06-17 2020-09-24 カシオ計算機株式会社 Emotion estimation device, emotion estimation method, and program
JP7001126B2 (en) 2020-06-17 2022-01-19 カシオ計算機株式会社 Emotion estimation device, emotion estimation method and program

Also Published As

Publication number Publication date
JP5588932B2 (en) 2014-09-10

Similar Documents

Publication Publication Date Title
JP4882899B2 (en) Speech analysis apparatus, speech analysis method, and computer program
JP6266372B2 (en) Speech synthesis dictionary generation apparatus, speech synthesis dictionary generation method, and program
JP5888356B2 (en) Voice search device, voice search method and program
US20120078625A1 (en) Waveform analysis of speech
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
Bone et al. Classifying language-related developmental disorders from speech cues: the promise and the potential confounds.
JP4353202B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
JP5588932B2 (en) Speech analysis device, method and program
Yadav et al. Prosodic mapping using neural networks for emotion conversion in Hindi language
Gowda et al. Time-varying quasi-closed-phase analysis for accurate formant tracking in speech signals
Gutkin et al. Building statistical parametric multi-speaker synthesis for bangladeshi bangla
RU2510954C2 (en) Method of re-sounding audio materials and apparatus for realising said method
Nandi et al. Language identification using Hilbert envelope and phase information of linear prediction residual
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
Hasija et al. Recognition of Children Punjabi Speech using Tonal Non-Tonal Classifier
JP4469986B2 (en) Acoustic signal analysis method and acoustic signal synthesis method
Villavicencio et al. Efficient pitch estimation on natural opera-singing by a spectral correlation based strategy
Nandi et al. Sub-segmental, segmental and supra-segmental analysis of linear prediction residual signal for language identification
JP5875504B2 (en) Speech analysis device, method and program
JP6367773B2 (en) Speech enhancement device, speech enhancement method, and speech enhancement program
Kafentzis et al. Analysis of emotional speech using an adaptive sinusoidal model
Yun et al. Bilingual voice conversion by weighted frequency warping based on formant space
Lipeika Optimization of formant feature based speech recognition
JP5722295B2 (en) Acoustic model generation method, speech synthesis method, apparatus and program thereof
JP2009058548A (en) Speech retrieval device

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130722

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20140120

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140128

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140324

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140430

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140527

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20140722

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20140728

R150 Certificate of patent or registration of utility model

Ref document number: 5588932

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

LAPS Cancellation because of no payment of annual fees