JP2013015693A - Spoken word analyzer, method thereof, and program - Google Patents

Spoken word analyzer, method thereof, and program

Info

Publication number
JP2013015693A
JP2013015693A (application JP2011148817A)
Authority
JP
Japan
Prior art keywords
utterance
speech
fundamental frequency
sequence
accent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2011148817A
Other languages
Japanese (ja)
Other versions
JP5588932B2 (en)
Inventor
Hideji Nakajima
秀治 中嶋
Hideyuki Mizuno
秀之 水野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2011148817A priority Critical patent/JP5588932B2/en
Publication of JP2013015693A publication Critical patent/JP2013015693A/en
Application granted granted Critical
Publication of JP5588932B2 publication Critical patent/JP5588932B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

PROBLEM TO BE SOLVED: To provide a spoken word analyzer that extracts prominence sections from speech data without using training data.
SOLUTION: The spoken word analyzer takes as input a prominence extraction target speech to which language labels have been assigned, and synthesizes read-aloud-style synthetic speech having those language labels. With the target speech and the synthetic speech as inputs, it extracts a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech. A prominence section extraction unit in the analyzer takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the target speech, and outputs prominence section information on the basis of a correlation between X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

Description

The present invention relates to a spoken word analyzer, a method thereof, and a program for automatically extracting speech sections corresponding to prominence, that is, emphasis and suppression, in spoken utterances.

For example, in "expressive speech" uttered naturally in scenes such as delivering lines matched to a movie scene, telling a fairy tale, advertising products through media such as television, and answering calls at a call center, prominence in the form of emphasis and suppression is used frequently. Such prominence is relative: it becomes apparent only through comparison with some reference. It is therefore difficult to automatically extract prominence from a given speech signal alone when the reference is unknown. Until now, prominence sections have been designated in advance, and speech uttered with prominence in those sections has been recorded and used.

In conventional automatic prominence labeling, the assignment of "emphasis" or "no emphasis (non-emphasis)" is formulated as a binary classification problem, and emphasized locations are extracted with a binary classifier. This method is disclosed in Non-Patent Document 1. Non-Patent Document 1 requires training speech data in which emphasis sections have been labeled manually in advance. In the training speech, locations without emphasis are given a non-emphasis label at the same time that the emphasis sections are labeled.

The binary classifier is built with the following input variables: a sequence of category labels representing speech units such as syllables; numerical values indicating the position of each speech unit within its phrase or sentence; category labels representing linguistic features related to prosody, such as the position of the accent nucleus in each phrase; and the difference between the fundamental frequency of synthetic speech generated from these by an ordinary speech synthesizer and the fundamental frequency of the original training speech. Its output variable is the binary label "emphasis" or "non-emphasis". Using the constructed binary classifier, new speech data other than the training data is classified as emphasis or non-emphasis, and the prominence sections corresponding to emphasis are extracted from the speech data.
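As a rough, hypothetical illustration of this prior-art formulation (the actual features and classifier of Non-Patent Document 1 are more elaborate; the feature layout below is invented for the example), a binary emphasis classifier over per-unit feature vectors could be set up like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one speech unit (e.g. a syllable). The three features are
# hypothetical stand-ins for the kinds of inputs described above:
# [relative position of the unit in its phrase,
#  accent-nucleus position of the phrase,
#  F0 difference between ordinary synthetic speech and the original training speech (Hz)]
X_train = np.array([[0.2, 1, -12.0],
                    [0.5, 1,  35.0],
                    [0.8, 2,   4.0],
                    [0.3, 0,  41.0]])
y_train = np.array([0, 1, 0, 1])   # manual labels: 1 = emphasis, 0 = non-emphasis

clf = LogisticRegression().fit(X_train, y_train)

# New speech data (other than the training data) is classified unit by unit;
# units predicted as 1 form the extracted emphasis sections.
X_new = np.array([[0.4, 1, 28.0]])
print(clf.predict(X_new))
```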

J. Xu and L.-H. Cai, "Automatic emphasis labeling for emotional speech by measuring prosody generation error," Proceedings of ICIC, pp. 177-186, 2009.

However, the conventional method requires training data with emphasis/non-emphasis labels in order to extract prominence sections. Constructing a binary classifier that discriminates prominence sections with high accuracy requires a large amount of accurately labeled training data, and such accurately labeled speech data can only be prepared by hand, which is costly.

Thus, automatic extraction of prominence sections is difficult, and in many studies prior to Non-Patent Document 1, prominence labels were attached to the text in advance and speech was recorded by having a speaker utter the labeled locations with prominence. With that method, however, it is difficult to build a speech database consisting of naturally uttered speech in which such emphasis and non-emphasis occur at natural rates.

The present invention was made in view of these problems, and its object is to provide a spoken word analyzer, a method thereof, and a program capable of efficiently extracting prominence sections from speech data without preparing speech data to which emphasis/non-emphasis labels have been assigned manually in advance.

The spoken word analyzer of the present invention takes as input a prominence extraction target speech to which language labels have been assigned, and comprises a speech synthesis unit, a fundamental frequency sequence extraction unit, and a prominence section extraction unit. The speech synthesis unit takes the language labels as input and synthesizes read-aloud-style synthetic speech that has those language labels. The fundamental frequency sequence extraction unit takes the synthetic speech and the prominence extraction target speech as input and extracts a fundamental frequency sequence X1 of the target speech and a fundamental frequency sequence X2 of the synthetic speech. The prominence section extraction unit takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the target speech, and outputs prominence section information on the basis of the correlation between X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between X1 and X2 with respect to the amount of fundamental frequency variation between those accent phrases.

According to the spoken word analyzer of the present invention, prominence section information can be obtained from the prominence extraction target speech without requiring a large amount of training data with accurate emphasis/non-emphasis labels. The high cost of the human labor needed to prepare training data can therefore be eliminated. Furthermore, because accurate prominence sections can be extracted from naturally uttered speech data, the spoken word analyzer of the present invention contributes to greatly accelerating research and development based on speech with natural prominence. In addition, since the synthetic speech generated from the language labels is read-aloud speech with a small range of variation, clearly defined prominence section information can be obtained with that speech as the reference.

FIG. 1 is a diagram showing an example of the functional configuration of the spoken word analyzer 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the spoken word analyzer 100.
FIG. 3 is a diagram showing an example of the functional configuration of the prominence section extraction unit 30.
FIG. 4 is a diagram showing the operation flow of the prominence section extraction unit 30.
FIG. 5 is a diagram showing the determination flow of the prominence determination means 34.
FIG. 6 is a diagram showing an example of the movement of specific accent phrase means M1_i and M2_i.

Embodiments of the present invention are described below with reference to the drawings. The same reference numerals are given to identical items across the drawings, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the spoken word analyzer 100 of the present invention, and FIG. 2 shows its operation flow. The spoken word analyzer 100 comprises a speech synthesis unit 10, a fundamental frequency sequence extraction unit 20, a prominence section extraction unit 30, and a control unit 40. The functions of each part of the spoken word analyzer 100 are realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

Before describing the embodiment, emphasized and suppressed locations are defined. An emphasized or suppressed location can be defined as a relative change within one utterance or within a sequence of utterances. In this embodiment, the reference needed to measure such relative change is speech uttered in a read-aloud style without emphasis or suppression, and the locations where changes appear when expressively uttered speech is compared with that read-aloud speech are taken as the extraction targets. Such changes appear as differences in various physical quantities, such as fundamental frequency variation, utterance duration, and voice quality; this embodiment focuses on changes in the fundamental frequency. That is, a location where the fundamental frequency is relatively high is defined as "emphasis," and a location where it is relatively low is defined as "suppression." In this embodiment, the unit of emphasized and suppressed locations is the accent phrase.

The speech synthesis unit 10 takes as input the language labels assigned to the prominence extraction target speech and synthesizes read-aloud-style synthetic speech that has those language labels (step S10).

The language labels include the type of each speech segment, such as phonemes and syllables, together with the start and end times of those segments; the start and end times of pause segments; and the accent phrase boundaries, their start and end times, and the accent type of each accent phrase. From this information, the length of each accent phrase and the position of each speech unit within its accent phrase can be computed and assigned automatically.
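To make the role of the language labels concrete, the sketch below shows one possible in-memory representation and the two derived quantities mentioned above; the field names and units are assumptions for illustration, not the label format used by the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    kind: str      # e.g. "phoneme", "syllable", "pause"
    label: str     # e.g. "a", "k", "sil"
    start: float   # start time (s)
    end: float     # end time (s)

@dataclass
class AccentPhrase:
    start: float            # accent phrase start time (s)
    end: float              # accent phrase end time (s)
    accent_type: int        # accent type of the accent phrase
    segments: List[Segment]

def phrase_length(ap: AccentPhrase) -> float:
    """Length of the accent phrase, computed from its boundary times."""
    return ap.end - ap.start

def unit_positions(ap: AccentPhrase) -> List[float]:
    """Relative position (0..1) of each speech unit within its accent phrase."""
    return [(s.start - ap.start) / phrase_length(ap) for s in ap.segments]
```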

The synthetic speech generated by the speech synthesis unit 10 can be synthesized with a conventional speech synthesizer (reference: T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," Proceedings of EUROSPEECH, pp. 2347-2350, 1999). Depending on the implementation of the speech synthesis unit 10, the fundamental frequency sequence X2 may be output at the same time as the synthetic speech.

The fundamental frequency sequence extraction unit 20 takes as input the prominence extraction target speech and the synthetic speech synthesized by the speech synthesis unit 10, and extracts the fundamental frequency sequence X1 of the target speech and the fundamental frequency sequence X2 of the synthetic speech (step S20). When the speech synthesis unit 10 outputs the synthetic speech and its fundamental frequency sequence X2 at the same time (see the broken line in FIG. 1), the fundamental frequency sequence extraction unit 20 extracts only the fundamental frequency sequence X1 of the target speech. The fundamental frequency is defined with respect to the shortest period of a periodic signal and is perceived as the pitch of the voice. In the case of speech, neither the opening and closing intervals of the vocal folds nor the waveform is constant, so a fundamental frequency in the strict sense does not exist; the instantaneous frequency of the fundamental wave is therefore taken as the fundamental frequency. The fundamental frequency can be obtained, for example, every 1 ms. Its unit is originally Hz, but either the raw value or the value converted to the natural logarithm with base e (Napier's number) may be used.

One known method of extracting the fundamental frequency is to design a band-pass filter that passes only the fundamental wave and to perform a wavelet transform using its impulse response as the mother wavelet, thereby extracting the fundamental component (reference: H. Katayose, A. de Cheveigne and R. D. Patterson, "Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity," Eurospeech '99, pp. 2781-2784, 1999). Several other methods of extracting the fundamental frequency exist, such as computing it from the time-domain autocorrelation coefficients. Obtaining the fundamental frequency is itself a conventional technique, and a detailed description is omitted.
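As one concrete illustration of the time-domain autocorrelation approach mentioned above (a minimal sketch, not the wavelet-based method of the cited reference; frame size, search range, and voicing threshold are assumptions), the per-frame fundamental frequency could be estimated roughly as follows:

```python
import numpy as np

def estimate_f0_autocorr(x, fs, frame_ms=40.0, hop_ms=1.0,
                         f0_min=60.0, f0_max=400.0):
    """Rough per-frame F0 (Hz) from the time-domain autocorrelation: pick the
    autocorrelation peak inside the plausible pitch-lag range.  0.0 marks
    frames treated as unvoiced."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)               # 1 ms shift, as in the description
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), frame - 1)
    f0 = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] - np.mean(x[start:start + frame])
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        if ac[0] <= 0:                          # silent frame
            f0.append(0.0)
            continue
        ac = ac / ac[0]                         # normalize by zero-lag energy
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0.append(fs / lag if ac[lag] > 0.3 else 0.0)   # 0.3: assumed voicing threshold
    return np.array(f0)
```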

The prominence section extraction unit 30 takes as input the fundamental frequency sequences X1 and X2 extracted by the fundamental frequency sequence extraction unit 20, together with the language labels, and compares the two fundamental frequency sequences. As noted above, the fundamental frequency sequence X2 may instead be input directly from the speech synthesis unit 10. Because the language label information is identical for the fundamental frequency sequence X1 of the target speech and the fundamental frequency sequence X2 of the synthetic speech, the two sequences can be compared. The prominence section extraction unit 30 compares the fundamental frequencies of the sequences X1 and X2, extracts the prominence sections of the target speech, and outputs prominence section information (step S30). The control unit 40 controls the time-sequential operation of the units described above.

As described above, the spoken word analyzer 100 of the present invention synthesizes synthetic speech from the language labels assigned to the prominence extraction target speech, which is uttered naturally in an expressive tone. It then extracts prominence sections by comparing the fundamental frequency sequence X1 of the target speech with the fundamental frequency sequence X2 of the synthetic speech at positions where the language labels match. Therefore, unlike the conventional method, prominence sections can be extracted from the target speech without using training speech data. As noted above, the speech synthesis unit 10 and the fundamental frequency sequence extraction unit 20 can be realized with conventional techniques, but the overall configuration of the spoken word analyzer 100 shown in FIG. 1 is itself new. A particularly new part is the prominence section extraction unit 30, which extracts prominence sections from the two fundamental frequency sequences.

FIG. 3 shows a more detailed example of the functional configuration of the prominence section extraction unit 30, and its operation is described below. The prominence section extraction unit 30 comprises mean/standard deviation calculation means 31, normalization means 32, accent phrase mean sequence extraction means 33, and prominence determination means 34. FIG. 4 shows the operation flow of the prominence section extraction unit 30.

The mean/standard deviation calculation means 31 takes as input the fundamental frequency sequences X1 and X2 output by the fundamental frequency sequence extraction unit 20, and calculates the utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and the utterance standard deviations σ1 and σ2 (step S31).

The normalization means 32 obtains the utterance-normalized sequences M1 and M2 by subtracting the utterance means μ1 and μ2 from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations σ1 and σ2 (step S32, equation (1)).

M1 = (X1 - μ1) / σ1,   M2 = (X2 - μ2) / σ2   (1)

This normalization makes it possible to remove individual differences between speakers and improves the accuracy of prominence section extraction.
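A minimal sketch of steps S31-S32 (the utterance-level normalization of equation (1)) is given below; treating frames with F0 = 0 as unvoiced and excluding them from the statistics is an assumption, not stated in the patent.

```python
import numpy as np

def normalize_utterance(f0):
    """Utterance-level normalization of equation (1): subtract the utterance
    mean and divide by the utterance standard deviation."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    mu = f0[voiced].mean()       # utterance mean      (step S31)
    sigma = f0[voiced].std()     # utterance std. dev. (step S31)
    m = np.zeros_like(f0)
    m[voiced] = (f0[voiced] - mu) / sigma   # equation (1) (step S32)
    return m, mu, sigma

# M1, mu1, sigma1 = normalize_utterance(X1)   # target speech
# M2, mu2, sigma2 = normalize_utterance(X2)   # synthetic speech
```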

The accent phrase mean sequence extraction means 33 takes as input the utterance-normalized sequences M1 and M2 and the language labels, divides each of the sequences M1 and M2 by accent phrase i, and calculates the accent phrase means M1_i and M2_i, which are the mean normalized values of the i-th accent phrase (step S33), where i ranges from 1 to n.
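Step S33 can be sketched as follows, assuming the accent-phrase boundaries are available from the language labels as start/end times in seconds and the normalized F0 sequence has a constant 1 ms frame shift (both assumptions for the sketch).

```python
import numpy as np

def accent_phrase_means(m, phrase_bounds, hop_s=0.001):
    """Mean normalized F0 per accent phrase (step S33).
    m            : utterance-normalized F0 sequence, one value per frame
    phrase_bounds: list of (start_s, end_s) per accent phrase, taken from the
                   language labels (this representation is an assumption)
    hop_s        : frame shift in seconds (1 ms, as in the description)"""
    means = []
    for start, end in phrase_bounds:
        seg = m[int(start / hop_s):int(end / hop_s)]
        seg = seg[seg != 0]                  # skip unvoiced frames (assumption)
        means.append(float(seg.mean()) if seg.size else 0.0)
    return np.array(means)                   # [M_1, ..., M_n]
```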

The prominence determination means 34 takes the accent phrase means M1_i and M2_i as input and, for each pair of adjacent accent phrases up to the last accent phrase n, outputs prominence section information representing the prominence sections on the basis of the correlation of the direction of variation and a comparison of the amount of variation (step S34).

Specifically, the prominence determination means 34 takes the accent phrase means M1_i and M2_i as input, computes the means μ1_i (equation (2)) and μ2_i (equation (3)) over each pair of adjacent accent phrases, the correlation coefficient ρ12_i between them, the amounts of variation Δ1_i and Δ2_i, and the variation ratio r_i obtained by dividing Δ1_i by Δ2_i, and outputs prominence section information indicating, for each accent phrase i, emphasis, suppression, or a prominence-free section with neither emphasis nor suppression.

The correlation coefficient ρ12_i is given by equation (4).

(Equations (2)-(4): μ1_i and μ2_i are the means of the utterance-normalized sequences M1 and M2 over the pair of adjacent accent phrases i and i+1, and ρ12_i is the correlation coefficient between the two sequences over that pair; the original equation images are not reproduced here.)

The amounts of variation Δ1_i and Δ2_i and the variation ratio r_i are given by equations (5), (6), and (7).

(Equations (5) and (6): Δ1_i and Δ2_i, the amounts of variation of M1 and M2 between the adjacent accent phrases i and i+1; the original equation images are not reproduced here. Equation (7): r_i = Δ1_i / Δ2_i.)

FIG. 5 shows the operation flow of the prominence determination means 34, and the flow of its determination is described below. In the operation flow of FIG. 5, with i denoting the accent phrase number, 1 the number of the first accent phrase in the sentence, and n the number of the last accent phrase, the following processing, which uses accent phrases i and i+1 to classify accent phrase i+1 as emphasis, suppression, or a prominence-free section with neither emphasis nor suppression, is repeated up to the last accent phrase n (step S340).

The prominence determination means 34 computes the correlation coefficient ρ12_i for each accent phrase i (step S341). It then computes the amounts of variation Δ1_i and Δ2_i between the adjacent accent phrases i and i+1 and the variation ratio r_i (step S342).

When the correlation coefficient ρ12_i indicates a positive correlation (YES in step S343), accent phrase i+1 is determined to be an emphasis section if Δ1_i > Δ2_i and the variation ratio r_i is outside the threshold interval (YES in step S344); emphasis is determined, for example, when the variation ratio r_i exceeds 120%. If NO is determined in step S344, accent phrase i+1 is determined to be a suppression section if Δ1_i < Δ2_i and the variation ratio r_i is outside the threshold interval (YES in step S345); suppression is determined, for example, when the variation ratio r_i is less than 80%. If NO is determined in step S345, accent phrase i+1 is determined to have neither emphasis nor suppression.

If it is determined in step S343 that there is no positive correlation (NO in step S343), it is determined whether the correlation is negative (step S346). If it is determined in step S346 that there is no negative correlation either, accent phrase i+1 is determined to have neither emphasis nor suppression (NO in step S346).

If it is determined in step S346 that there is a negative correlation (YES in step S346), accent phrase i+1 is determined to be emphasis when the variation Δ1_i is in the positive direction and the variation Δ2_i is in the negative direction (YES in step S347). If NO is determined in step S347, accent phrase i+1 is determined to be suppression when the variation Δ1_i is in the negative direction and the variation Δ2_i is in the positive direction (YES in step S348). If NO is determined in step S348, accent phrase i+1 is determined to have neither emphasis nor suppression.
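The sketch below follows the determination flow of FIG. 5 under the following assumptions: Δ1_i and Δ2_i are taken as the differences M1_{i+1} − M1_i and M2_{i+1} − M2_i, the sign of the correlation is taken from the sign of Δ1_i·Δ2_i (over a pair of two values, a positive or negative correlation reduces to the two deltas moving in the same or opposite directions), and the threshold interval is the 80%-120% band given as an example above.

```python
import numpy as np

def classify_prominence(m1, m2, r_low=0.8, r_high=1.2):
    """Classify accent phrases 2..n as 'emphasis', 'suppression' or 'none',
    following the flow of FIG. 5.  m1, m2: accent phrase means M1_i, M2_i of
    the target and the synthetic speech."""
    labels = ["none"]                        # the first accent phrase has no preceding pair
    for i in range(len(m1) - 1):
        d1 = m1[i + 1] - m1[i]               # variation of the target speech   (Δ1_i)
        d2 = m2[i + 1] - m2[i]               # variation of the synthetic speech (Δ2_i)
        r = d1 / d2 if d2 != 0 else np.inf   # variation ratio r_i = Δ1_i / Δ2_i
        if d1 * d2 > 0:                      # same direction -> positive correlation
            if d1 > d2 and not (r_low <= r <= r_high):
                labels.append("emphasis")        # e.g. r_i above 120% (step S344)
            elif d1 < d2 and not (r_low <= r <= r_high):
                labels.append("suppression")     # e.g. r_i below 80%  (step S345)
            else:
                labels.append("none")
        elif d1 * d2 < 0:                    # opposite directions -> negative correlation
            labels.append("emphasis" if d1 > 0 else "suppression")   # steps S347/S348
        else:
            labels.append("none")            # neither positive nor negative correlation
    return labels
```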

FIG. 6 shows a concrete example of the movement of the accent phrase mean sequences M1_i and M2_i and further illustrates the operation of the prominence determination means 34. The horizontal axis of FIG. 6 represents the progression of accent phrases, that is, elapsed time, and the vertical axis represents the normalized value. The movement of the accent phrase mean sequences at accent phrases i and i+1 in FIG. 6 shows an example in which accent phrase i+1 of the prominence extraction target speech is determined to be emphasis. Since the change in the mean μ1_i between the accent phrases is in the positive direction and the change in the mean μ2_i is in the negative direction, the correlation coefficient ρ12_i indicates a negative correlation. In this case, in the operation flow of FIG. 5, step S346 gives YES, and since the variation Δ1_i is in the positive direction and the variation Δ2_i is in the negative direction, step S347 gives YES, so accent phrase i+1 is determined to be emphasis.

Next, the change in the mean μ1_{i+1} between accent phrases i+1 and i+2 is in the negative direction and the change in the mean μ2_{i+1} is also in the negative direction, so the correlation coefficient ρ12_{i+1} indicates a positive correlation (YES in step S343). In this case, the variation ratio r_{i+1} is clearly within the threshold interval, that is, within the range of 80% to 120%, so accent phrase i+2 is determined to have neither emphasis nor suppression (NO in step S345).

As described above, the prominence of an accent phrase can be determined by using the correlation coefficient of the changes of the accent phrase mean sequence M1_i derived from the prominence extraction target speech and the accent phrase mean sequence M2_i derived from the synthetic speech generated from the language labels, together with the amounts of variation Δ1_i and Δ2_i and the variation ratio r_i. Thus, the spoken word analyzer 100 of the present invention makes it possible to obtain prominence section information from speech data without requiring a large amount of training data with accurate emphasis/non-emphasis labels. In addition, since the synthetic speech generated from the language labels is read-aloud speech with a small range of variation, clearly defined prominence section information can be obtained with that speech as the reference.
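Putting the sketches above together, the overall flow of FIG. 2 might look as follows; synthesize_read_speech is a hypothetical stand-in for an external read-aloud synthesizer (e.g. an HMM-based one), and the helper functions are the earlier sketches, so this is an illustration of the processing order rather than a definitive implementation.

```python
def analyze_prominence(target_wav, fs, labels,
                       phrase_bounds_target, phrase_bounds_synth,
                       synthesize_read_speech):
    """End-to-end sketch of the flow of FIG. 2, chaining the earlier sketches.
    synthesize_read_speech(labels) -> waveform is a hypothetical stand-in for
    the speech synthesis unit 10."""
    synth_wav = synthesize_read_speech(labels)                  # step S10
    x1 = estimate_f0_autocorr(target_wav, fs)                   # step S20
    x2 = estimate_f0_autocorr(synth_wav, fs)
    m1, _, _ = normalize_utterance(x1)                          # steps S31-S32
    m2, _, _ = normalize_utterance(x2)
    m1_i = accent_phrase_means(m1, phrase_bounds_target)        # step S33
    m2_i = accent_phrase_means(m2, phrase_bounds_synth)
    return classify_prominence(m1_i, m2_i)                      # step S34
```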

When the processing means of the above apparatus are realized by a computer, the processing contents of the functions each apparatus should have are described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the recording device of a server computer and transferring it from the server computer to other computers via a network.

Each of the means may be configured by executing a predetermined program on a computer, or at least part of the processing contents may be realized in hardware.

Claims (5)

1. A spoken word analyzer that takes as input a prominence extraction target speech to which language labels have been assigned, comprising:
a speech synthesis unit that synthesizes read-aloud-style synthetic speech having the language labels;
a fundamental frequency sequence extraction unit that takes the synthetic speech and the prominence extraction target speech as input and extracts a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech; and
a prominence section extraction unit that takes the fundamental frequency sequences X1 and X2 and the language labels as input, extracts prominence sections of the prominence extraction target speech, and outputs prominence section information on the basis of a correlation between the fundamental frequency sequences X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between the fundamental frequency sequences X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

2. The spoken word analyzer according to claim 1, wherein the prominence section extraction unit comprises:
mean/standard deviation calculation means that takes the fundamental frequency sequences X1 and X2 as input and calculates utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and utterance standard deviations σ1 and σ2, which are their standard deviations;
normalization means that obtains utterance-normalized sequences M1 and M2 by subtracting the utterance means from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations;
accent phrase mean sequence extraction means that takes the utterance-normalized sequences M1 and M2 and the language labels as input, divides each sequence into accent phrases i, and obtains accent phrase means M1_i and M2_i of the utterance-normalized sequences for each accent phrase i (i = 1 to n); and
prominence determination means that takes the accent phrase means M1_i and M2_i as input and outputs, for each pair of adjacent accent phrase means up to the last accent phrase n, prominence section information representing prominence sections on the basis of a correlation of the direction of variation and a comparison of the amount of variation.

3. A spoken word analysis method comprising:
a speech synthesis step of taking as input a prominence extraction target speech to which language labels have been assigned and synthesizing read-aloud-style synthetic speech having the language labels;
a fundamental frequency sequence extraction step of taking the synthetic speech and the prominence extraction target speech as input and extracting a fundamental frequency sequence X1 of the prominence extraction target speech and a fundamental frequency sequence X2 of the synthetic speech; and
a prominence section extraction step of taking the fundamental frequency sequences X1 and X2 and the language labels as input, extracting prominence sections of the prominence extraction target speech, and outputting prominence section information on the basis of a correlation between the fundamental frequency sequences X1 and X2 with respect to the direction of fundamental frequency variation between accent phrases and a comparison between the fundamental frequency sequences X1 and X2 with respect to the amount of fundamental frequency variation between the accent phrases.

4. The spoken word analysis method according to claim 3, wherein the prominence section extraction step comprises:
a mean/standard deviation calculation step of taking the fundamental frequency sequences X1 and X2 as input and calculating utterance means μ1 and μ2, which are the mean values of the fundamental frequency over each entire utterance, and utterance standard deviations σ1 and σ2, which are their standard deviations;
a normalization step of obtaining utterance-normalized sequences M1 and M2 by subtracting the utterance means from the fundamental frequency sequences X1 and X2 and dividing the results by the utterance standard deviations;
an accent phrase mean sequence extraction step of taking the utterance-normalized sequences M1 and M2 and the language labels as input, dividing each sequence into accent phrases i, and obtaining accent phrase means M1_i and M2_i of the utterance-normalized sequences for each accent phrase i (i = 1 to n); and
a prominence determination step of taking the accent phrase means M1_i and M2_i as input and outputting, for each pair of adjacent accent phrase means up to the last accent phrase n, prominence section information representing prominence sections on the basis of a correlation of the direction of variation and a comparison of the amount of variation.

5. A program for causing a computer to function as the spoken word analyzer according to claim 1 or 2.
JP2011148817A 2011-07-05 2011-07-05 Speech analysis device, method and program Expired - Fee Related JP5588932B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011148817A JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011148817A JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Publications (2)

Publication Number Publication Date
JP2013015693A true JP2013015693A (en) 2013-01-24
JP5588932B2 JP5588932B2 (en) 2014-09-10

Family

ID=47688422

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011148817A Expired - Fee Related JP5588932B2 (en) 2011-07-05 2011-07-05 Speech analysis device, method and program

Country Status (1)

Country Link
JP (1) JP5588932B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016142936A (en) * 2015-02-03 2016-08-08 株式会社日立超エル・エス・アイ・システムズ Preparing method for data for speech synthesis, and preparing device data for speech synthesis
JP2016173430A (en) * 2015-03-17 2016-09-29 日本電信電話株式会社 Speech intention model learning device, speech intention extraction device, speech intention model learning method, speech intention extraction method and program
JP2020154332A (en) * 2020-06-17 2020-09-24 カシオ計算機株式会社 Emotion estimation device, emotion estimation method, and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
WO1997036286A1 (en) * 1996-03-25 1997-10-02 Arcadia, Inc. Sound source generator, voice synthesizer and voice synthesizing method
JP2001147919A (en) * 1999-11-24 2001-05-29 Sharp Corp Device and method for processing voice and storage medium to be utilized therefor
JP2003316378A (en) * 2001-08-08 2003-11-07 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and apparatus and program therefor
JP2008040259A (en) * 2006-08-08 2008-02-21 Yamaha Corp Musical piece practice assisting device, dynamic time warping module, and program

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0580791A (en) * 1991-09-20 1993-04-02 Hitachi Ltd Device and method for speech rule synthesis
WO1997036286A1 (en) * 1996-03-25 1997-10-02 Arcadia, Inc. Sound source generator, voice synthesizer and voice synthesizing method
JP2001147919A (en) * 1999-11-24 2001-05-29 Sharp Corp Device and method for processing voice and storage medium to be utilized therefor
JP2003316378A (en) * 2001-08-08 2003-11-07 Nippon Telegr & Teleph Corp <Ntt> Speech processing method and apparatus and program therefor
JP2008040259A (en) * 2006-08-08 2008-02-21 Yamaha Corp Musical piece practice assisting device, dynamic time warping module, and program

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CSNG199700054006; 大槻 恭士: "Properties of phoneme labels assigned by automatic training," IEICE Technical Report, Speech, SP96-25, 1996-06-14, pp. 39-46, The Institute of Electronics, Information and Communication Engineers *
CSNG200000598001; 小林 聡: "Perceived pitch, loudness, and speed of speech and related physical quantities," IPSJ SIG Technical Report, Vol. 96, No. 123, 1996-12-13, pp. 1-8, Information Processing Society of Japan *
JPN6014030292; 大槻 恭士: "Properties of phoneme labels assigned by automatic training," IEICE Technical Report, Speech, SP96-25, 1996-06-14, pp. 39-46, The Institute of Electronics, Information and Communication Engineers *
JPN6014030293; 小林 聡: "Perceived pitch, loudness, and speed of speech and related physical quantities," IPSJ SIG Technical Report, Vol. 96, No. 123, 1996-12-13, pp. 1-8, Information Processing Society of Japan *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016142936A (en) * 2015-02-03 2016-08-08 株式会社日立超エル・エス・アイ・システムズ Preparing method for data for speech synthesis, and preparing device data for speech synthesis
JP2016173430A (en) * 2015-03-17 2016-09-29 日本電信電話株式会社 Speech intention model learning device, speech intention extraction device, speech intention model learning method, speech intention extraction method and program
JP2020154332A (en) * 2020-06-17 2020-09-24 カシオ計算機株式会社 Emotion estimation device, emotion estimation method, and program
JP7001126B2 (en) 2020-06-17 2022-01-19 カシオ計算機株式会社 Emotion estimation device, emotion estimation method and program

Also Published As

Publication number Publication date
JP5588932B2 (en) 2014-09-10

Similar Documents

Publication Publication Date Title
JP4882899B2 (en) Speech analysis apparatus, speech analysis method, and computer program
JP6266372B2 (en) Speech synthesis dictionary generation apparatus, speech synthesis dictionary generation method, and program
JP5888356B2 (en) Voice search device, voice search method and program
US20120078625A1 (en) Waveform analysis of speech
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
Bone et al. Classifying language-related developmental disorders from speech cues: the promise and the potential confounds.
JP4353202B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
JP5588932B2 (en) Speech analysis device, method and program
Yadav et al. Prosodic mapping using neural networks for emotion conversion in Hindi language
Gowda et al. Time-varying quasi-closed-phase analysis for accurate formant tracking in speech signals
Gutkin et al. Building statistical parametric multi-speaker synthesis for bangladeshi bangla
RU2510954C2 (en) Method of re-sounding audio materials and apparatus for realising said method
Nandi et al. Language identification using Hilbert envelope and phase information of linear prediction residual
JP4839970B2 (en) Prosody identification apparatus and method, and speech recognition apparatus and method
Hasija et al. Recognition of Children Punjabi Speech using Tonal Non-Tonal Classifier
JP4469986B2 (en) Acoustic signal analysis method and acoustic signal synthesis method
Villavicencio et al. Efficient pitch estimation on natural opera-singing by a spectral correlation based strategy
Nandi et al. Sub-segmental, segmental and supra-segmental analysis of linear prediction residual signal for language identification
JP5875504B2 (en) Speech analysis device, method and program
JP6367773B2 (en) Speech enhancement device, speech enhancement method, and speech enhancement program
Kafentzis et al. Analysis of emotional speech using an adaptive sinusoidal model
Yun et al. Bilingual voice conversion by weighted frequency warping based on formant space
Lipeika Optimization of formant feature based speech recognition
JP5722295B2 (en) Acoustic model generation method, speech synthesis method, apparatus and program thereof
JP2009058548A (en) Speech retrieval device

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130722

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20140120

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140128

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140324

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140430

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140527

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20140722

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20140728

R150 Certificate of patent or registration of utility model

Ref document number: 5588932

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

LAPS Cancellation because of no payment of annual fees