JP2015212732A

JP2015212732A - Sound metaphor recognition device and program

Info

Publication number: JP2015212732A
Application number: JP2014094694A
Authority: JP
Inventors: 彰夫小林; Akio Kobayashi
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-05-01
Filing date: 2014-05-01
Publication date: 2015-11-26

Abstract

PROBLEM TO BE SOLVED: To obtain a text expression representing a sound metaphor for non-speech included in audio data, together with the display attributes of that text expression. SOLUTION: A non-speech interval detection unit 24 detects a non-speech interval in audio data. An acoustic feature extraction unit 25 extracts acoustic features from the audio data in the non-speech interval. A linguistic feature extraction unit 26 identifies a linguistic feature extraction interval that includes the portions before and after the non-speech interval, and extracts linguistic features from utterance content data corresponding to the audio data in that interval. A prosodic feature extraction unit 27 identifies a prosodic feature extraction interval that includes the portions before and after the non-speech interval, and extracts prosodic features from the audio data in that interval. Using a statistically trained sound metaphor recognition model, a sound metaphor recognition unit 28 calculates, from the extracted acoustic, linguistic, and prosodic features, the posterior probability of a sound metaphor consisting of a non-speech text expression and its display attributes, and outputs a sound metaphor selected on the basis of the posterior probability.

Description

The present invention relates to a sound metaphor recognition device and a program.

The audio of many broadcast programs does not consist of spoken language alone; non-linguistic sounds (for example, laughter), applause, background music, and the like are added to meet the demands of program direction. Such non-speech is considered to play an important role in conveying information, just as spoken language does, for example by signaling a change of scene. Non-speech is therefore one of the elements essential for viewers to understand a program.

Meanwhile, there is a technique that recognizes environmental sounds among non-speech and outputs them as onomatopoeic words (see, for example, Patent Document 1). This technique simply converts environmental sounds into onomatopoeic words, which are one form of text expression, and presents them to persons with disabilities. There are also a technique that automatically generates onomatopoeia hypotheses close to a user's impression ratings from the phonological features of onomatopoeia (onomatopoeic and mimetic words) (see, for example, Patent Document 2), and a technique that adds to an image an onomatopoeic word selected by the user together with the environmental sound linked to that word (see, for example, Patent Document 3). In addition, there are a technique that generates onomatopoeic words from handwriting (see, for example, Patent Document 4) and a technique that assigns dynamic sound metaphor attributes based on line drawings in digital comics (see, for example, Non-Patent Document 1). Furthermore, there is a technique that performs speech recognition on onomatopoeia uttered by a user and, based on the recognition result, provides an interface for changing line-drawing attributes (such as line type) in computer graphics or for operating a game (see, for example, Non-Patent Document 2).

Patent Document 1: JP 2007-334149 A
Patent Document 2: JP 2013-33351 A
Patent Document 3: JP H11-261947 A
Patent Document 4: JP 2012-18544 A

Non-Patent Document 1: Ayaka Terashima and 2 others, "A Method for Adding Dynamic Sound Metaphors Focusing on the Drawing of Effect Lines", Proceedings of the 26th Annual Conference of the Japanese Society for Artificial Intelligence (2012), 2012, lM2-OS-8b-3, pp. 1-4
Non-Patent Document 2: Keisuke Kanbara and 1 other, "Multimodal Interaction Using Onomatopoeia", Proceedings of the 25th Annual Conference of the Japanese Society for Artificial Intelligence (2011), 2011, 1C2-OS4b-12, pp. 1-2

When sound metaphors are considered from the viewpoint of information accessibility for hearing-impaired and elderly people, the major questions are which text expression to select for a given non-speech sound, and how to place the selected text expression in the video content as a graphic that adequately conveys the information. In general, the work of expressing non-speech as a sound metaphor and placing it in the video is done by hand, so production costs tend to be high. Reducing this manual work can substantially lower production costs.

However, the technique of Patent Document 1 is not one that takes broadcast audio as its target and estimates, in addition to the text expression of non-speech, the display attributes used when that text expression is displayed graphically. In other words, uses such as employing the text expression obtained by recognizing an environmental sound as an expression within video are outside its scope. Patent Documents 2 to 4 and Non-Patent Document 1 do not generate onomatopoeia from environmental sounds. Non-Patent Document 2 is an example of an interactive application based on speech recognition of a user's utterances, and does not recognize non-speech occurring in a real environment.

The present invention has been made in view of these circumstances, and provides a sound metaphor recognition device and a program capable of obtaining a text expression representing a sound metaphor for non-speech included in audio data, together with the display attributes of that text expression.

One aspect of the present invention is a sound metaphor recognition device comprising: a non-speech interval detection unit that matches audio data against a statistical acoustic model for detecting non-speech intervals and detects a non-speech interval in the audio data; an acoustic feature extraction unit that extracts acoustic features from the audio data in the non-speech interval detected by the non-speech interval detection unit; a linguistic feature extraction unit that identifies a linguistic feature extraction interval that includes the detected non-speech interval and is longer than it by a predetermined amount, and extracts linguistic features from utterance content data corresponding to the audio data in the identified linguistic feature extraction interval; a prosodic feature extraction unit that identifies a prosodic feature extraction interval that includes the detected non-speech interval and is longer than it by a predetermined amount, and extracts prosodic features from the audio data in the identified prosodic feature extraction interval; and a sound metaphor recognition unit that, using a statistically trained sound metaphor recognition model which takes acoustic, linguistic, and prosodic features as input and yields a sound metaphor consisting of a non-speech text expression and the display attributes of that text expression, calculates the posterior probability of a sound metaphor from the acoustic features extracted by the acoustic feature extraction unit, the linguistic features extracted by the linguistic feature extraction unit, and the prosodic features extracted by the prosodic feature extraction unit, and outputs data of a sound metaphor selected on the basis of the calculated posterior probability.
According to this invention, the sound metaphor recognition device detects a non-speech interval in audio data and extracts acoustic features from the audio data in the detected non-speech interval. The device also identifies a linguistic feature extraction interval that includes the non-speech interval and is longer than it by a predetermined amount, and extracts linguistic features from the utterance content corresponding to the audio data in that interval. Further, the device identifies a prosodic feature extraction interval that includes the non-speech interval and is longer than it by a predetermined amount, and extracts prosodic features from the audio data in that interval. Using a statistically trained sound metaphor recognition model, the device calculates, from the extracted acoustic, prosodic, and linguistic features, the posterior probability of a sound metaphor consisting of a non-speech text expression and its display attributes. This posterior probability is a conditional probability expressing how likely the sound metaphor is given the features. The device outputs data of a sound metaphor selected on the basis of the calculated posterior probability.
The sound metaphor recognition device can thereby obtain an appropriate sound metaphor for the non-speech included in the audio data.
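The two interval-related steps described above, widening the detected non-speech interval and choosing an output from the posterior probabilities, can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Interval`, `extend_interval`, and `select_sound_metaphor` are hypothetical, the symmetric margin on both sides is an assumption (the text only says the extraction interval contains the non-speech interval and is longer by a predetermined amount), and picking the single highest posterior is one possible reading of "selected on the basis of the posterior probability".

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Interval:
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def extend_interval(non_speech: Interval, margin: float, audio_length: float) -> Interval:
    """Return an interval that contains `non_speech` and is widened by `margin`.

    Assumption: the same margin is added before and after, and the result
    is clipped to the bounds of the audio.
    """
    return Interval(
        start=max(0.0, non_speech.start - margin),
        end=min(audio_length, non_speech.end + margin),
    )


def select_sound_metaphor(posteriors: Dict[str, float]) -> str:
    """Pick the candidate with the highest posterior probability (assumed rule)."""
    return max(posteriors, key=posteriors.get)


if __name__ == "__main__":
    detected = Interval(10.0, 12.0)
    print(extend_interval(detected, margin=3.0, audio_length=14.0))
    print(select_sound_metaphor({"clap": 0.2, "laugh": 0.7, "music": 0.1}))
```

The linguistic and prosodic extraction intervals could use different margins; the sketch keeps one function and leaves the margin as a parameter for that reason.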

One aspect of the present invention is the sound metaphor recognition device described above, wherein the display attributes are one or more of the typeface, size, color, and display motion of the characters representing the text expression.
According to this invention, the sound metaphor recognition device outputs data of a sound metaphor consisting of a text expression of the non-speech included in the audio data and one or more of the typeface, size, color, and display motion of the characters representing that text expression.
The sound metaphor recognition device can thereby obtain richly expressive sound metaphor data.
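One way to picture the output data described here is a record that pairs the text expression with its display attributes. The sketch below is only an illustration: the class names `DisplayAttributes` and `SoundMetaphor`, the field types, and the example values are assumptions, not taken from the patent. The check in `__post_init__` mirrors the "one or more of" wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisplayAttributes:
    # Each attribute is optional, but at least one must be present.
    font: Optional[str] = None     # typeface of the characters
    size: Optional[int] = None     # character size (unit is an assumption, e.g. points)
    color: Optional[str] = None    # character color
    motion: Optional[str] = None   # display motion, e.g. growing or shaking

    def __post_init__(self) -> None:
        if all(v is None for v in (self.font, self.size, self.color, self.motion)):
            raise ValueError("at least one display attribute is required")


@dataclass
class SoundMetaphor:
    text: str                      # text expression of the non-speech sound
    attributes: DisplayAttributes  # how that text is to be displayed


if __name__ == "__main__":
    metaphor = SoundMetaphor("clap clap", DisplayAttributes(color="yellow", motion="grow"))
    print(metaphor)
```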

One aspect of the present invention is the sound metaphor recognition device described above, wherein the sound metaphor recognition model has: a first neural network that takes as input the acoustic features obtained from each of the time-ordered frames into which the audio data of a non-speech interval is divided, and outputs acoustic features converted to a lower dimension than the input; a second neural network that takes as input the prosodic features of a prosodic feature extraction interval and outputs prosodic features converted to a lower dimension than the input; a third neural network that takes as input the linguistic features of a linguistic feature extraction interval and outputs linguistic features converted to a lower dimension than the input; and a fourth neural network that takes as input the acoustic features output by the first neural network, the prosodic features output by the second neural network, and the linguistic features output by the third neural network, and outputs the posterior probabilities of the text expression and of the display attributes of a sound metaphor; the acoustic feature extraction unit extracts acoustic features from each of the frames into which the audio data of the non-speech interval detected by the non-speech interval detection unit is divided; and the sound metaphor recognition unit inputs the acoustic features extracted from each of the time-ordered frames by the acoustic feature extraction unit to the first neural network, the prosodic features extracted by the prosodic feature extraction unit to the second neural network, and the linguistic features extracted by the linguistic feature extraction unit to the third neural network, and calculates the posterior probabilities of the text expression and of the display attributes of the sound metaphor, which are the output of the fourth neural network.
According to this invention, the sound metaphor recognition device feeds the acoustic features of each time-ordered frame of the audio data in the non-speech interval to the first neural network and calculates low-dimensional acoustic features. The device also feeds the prosodic features of the prosodic feature extraction interval to the second neural network and calculates low-dimensional prosodic features. Further, the device feeds the linguistic features of the linguistic feature extraction interval to the third neural network and calculates low-dimensional linguistic features. The device then feeds the low-dimensional acoustic features calculated by the first neural network, the low-dimensional prosodic features calculated by the second neural network, and the low-dimensional linguistic features calculated by the third neural network to the fourth neural network, and calculates the posterior probabilities of the text expression and of the display attributes of the sound metaphor. The device outputs data of a sound metaphor selected on the basis of the calculated posterior probabilities.
The sound metaphor recognition device can thereby obtain the posterior probability of a sound metaphor accurately, using the acoustic features of each frame of the audio data in the non-speech interval, the prosodic features of the prosodic feature extraction interval, and the linguistic features of the linguistic feature extraction interval.
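The data flow through the four networks can be sketched in plain Python as below. This is a rough illustration under several assumptions that the passage above does not settle: each network is reduced to a single dense layer, weights are random stand-ins for trained parameters, the per-frame outputs of the first network are averaged over time, and the fourth network is given one softmax output for the text expression and one for a single display attribute. The names `SoundMetaphorModel`, `make_layer`, `dense`, and `softmax` are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases; stand-ins for trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, activation):
    """Apply one fully connected layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class SoundMetaphorModel:
    def __init__(self, dims, bottleneck, n_texts, n_attr_values, seed=0):
        rng = random.Random(seed)
        # Networks 1 to 3: one per feature type, each mapping to a lower dimension.
        self.reducers = {name: make_layer(d, bottleneck, rng) for name, d in dims.items()}
        joint = bottleneck * len(dims)
        # Network 4 (assumed layout): separate softmax outputs for text and attribute.
        self.text_out = make_layer(joint, n_texts, rng)
        self.attr_out = make_layer(joint, n_attr_values, rng)

    def posteriors(self, acoustic_frames, prosodic, linguistic):
        # Network 1 is applied to each time-ordered frame; averaging is an assumption.
        reduced = [dense(self.reducers["acoustic"], f, math.tanh) for f in acoustic_frames]
        acoustic = [sum(col) / len(col) for col in zip(*reduced)]
        joint = (acoustic
                 + dense(self.reducers["prosodic"], prosodic, math.tanh)
                 + dense(self.reducers["linguistic"], linguistic, math.tanh))
        identity = lambda v: v
        return (softmax(dense(self.text_out, joint, identity)),
                softmax(dense(self.attr_out, joint, identity)))


if __name__ == "__main__":
    model = SoundMetaphorModel({"acoustic": 6, "prosodic": 4, "linguistic": 8},
                               bottleneck=2, n_texts=3, n_attr_values=2)
    text_post, attr_post = model.posteriors([[0.1] * 6, [0.2] * 6], [0.5] * 4, [1.0] * 8)
    print(text_post, attr_post)
```

A trained model would replace the random layers, and several display attributes (typeface, size, color, motion) would each need their own output; the sketch shows only how the three reduced feature vectors are joined before the final network.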

One aspect of the present invention is the sound metaphor recognition device described above, further comprising a result editing unit that generates video data in which the sound metaphor selected by the sound metaphor recognition unit is displayed superimposed on the video data corresponding to the audio data.
According to this invention, the sound metaphor recognition device generates video data in which the sound metaphor for the non-speech included in the audio data is displayed superimposed on the video data corresponding to that audio data.
The sound metaphor recognition device can thereby generate program video to which sound metaphors have been added.

One aspect of the present invention is the sound metaphor recognition device described above, wherein the result editing unit receives a correction instruction for the sound metaphor selected by the sound metaphor recognition unit and generates video data in which the sound metaphor corrected according to the correction instruction is displayed superimposed on the video data corresponding to the audio data, the device further comprising a sound metaphor recognition model training unit that updates the sound metaphor recognition model on the basis of the corrected sound metaphor and of the acoustic, prosodic, and linguistic features that were input to the sound metaphor recognition model when the sound metaphor was obtained.
According to this invention, the sound metaphor recognition device corrects the recognized sound metaphor according to the correction instruction, generates video data in which the corrected sound metaphor is displayed superimposed on the video data corresponding to the audio data, and updates the sound metaphor recognition model on the basis of the correction result.
The sound metaphor recognition device can thereby learn a sound metaphor recognition model that matches the user's directing tendencies and preferences.
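The passage above does not say how the model is updated from a correction. As one possible illustration only, the sketch below applies a single cross-entropy gradient step to a linear softmax classifier, using the features that produced the original output and the user's corrected label as the target. The functions `posterior` and `update_with_correction`, the learning rate, and the use of plain gradient descent are all assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def posterior(weights, features):
    """Posterior over candidates for a linear softmax classifier."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])


def update_with_correction(weights, features, corrected_index, learning_rate=0.1):
    """One gradient step that treats the user's corrected choice as the target.

    `weights` is modified in place. The gradient of the cross-entropy loss
    with respect to the logit of candidate k is (p_k - target_k).
    """
    probs = posterior(weights, features)
    for k, row in enumerate(weights):
        grad = probs[k] - (1.0 if k == corrected_index else 0.0)
        for j, x in enumerate(features):
            row[j] -= learning_rate * grad * x


if __name__ == "__main__":
    weights = [[0.0, 0.0], [0.0, 0.0]]  # two candidates, two features
    features = [1.0, 2.0]               # features stored when the output was produced
    before = posterior(weights, features)
    update_with_correction(weights, features, corrected_index=1)
    after = posterior(weights, features)
    print(before, after)                # the corrected candidate becomes more probable
```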

One aspect of the present invention is the sound metaphor recognition device described above, further comprising: a speech interval detection unit that detects a speech interval in the audio data by matching against a statistical acoustic model for speech interval detection; and a speech recognition unit that performs speech recognition on the audio data in the speech interval detected by the speech interval detection unit and outputs utterance content data obtained as the recognition result, wherein the linguistic feature extraction unit extracts the linguistic features in the linguistic feature extraction interval from the utterance content data output by the speech recognition unit.
According to this invention, the sound metaphor recognition device extracts the linguistic features in the linguistic feature extraction interval from the speech recognition result for the audio data.
The sound metaphor recognition device can thereby obtain the linguistic features needed for sound metaphor recognition even when no utterance content data is attached to the audio data.
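As a rough illustration of taking linguistic features from a speech recognition result, the sketch below counts the recognized words that overlap the linguistic feature extraction interval. The form of the features is an assumption: the passage above does not define them, and word counts over a fixed vocabulary are used here only as a simple stand-in. The function `linguistic_features` and the `(start, end, word)` tuple format for recognized words are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def linguistic_features(words: List[Tuple[float, float, str]],
                        start: float, end: float,
                        vocabulary: List[str]) -> List[int]:
    """Count vocabulary words whose time span overlaps the interval (start, end).

    `words` holds (start_time, end_time, word) tuples, assumed to come from
    a speech recognizer that reports word timings in seconds.
    """
    counts = Counter(word for (w_start, w_end, word) in words
                     if w_start < end and w_end > start)
    return [counts[v] for v in vocabulary]


if __name__ == "__main__":
    recognized = [(0.0, 0.5, "goal"), (0.6, 1.0, "wow"),
                  (5.0, 5.4, "wow"), (9.0, 9.5, "goal")]
    print(linguistic_features(recognized, 0.5, 6.0, ["goal", "wow", "nice"]))
```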

One aspect of the present invention is a program for causing a computer to function as a sound metaphor recognition device comprising: non-speech interval detection means that matches audio data against a statistical acoustic model for detecting non-speech intervals and detects a non-speech interval in the audio data; acoustic feature extraction means that extracts acoustic features from the audio data in the non-speech interval detected by the non-speech interval detection means; linguistic feature extraction means that identifies a linguistic feature extraction interval that includes the detected non-speech interval and is longer than it by a predetermined amount, and extracts linguistic features from utterance content data corresponding to the audio data in the identified linguistic feature extraction interval; prosodic feature extraction means that identifies a prosodic feature extraction interval that includes the detected non-speech interval and is longer than it by a predetermined amount, and extracts prosodic features from the audio data in the identified prosodic feature extraction interval; and sound metaphor recognition means that, using a statistically trained sound metaphor recognition model which takes acoustic, linguistic, and prosodic features as input and yields a sound metaphor consisting of a non-speech text expression and the display attributes of that text expression, calculates the posterior probability of a sound metaphor from the acoustic features extracted by the acoustic feature extraction means, the linguistic features extracted by the linguistic feature extraction means, and the prosodic features extracted by the prosodic feature extraction means, and outputs data of a sound metaphor selected on the basis of the calculated posterior probability.

According to the present invention, it is possible to obtain a text expression representing a sound metaphor for non-speech included in audio data, together with the display attributes of that text expression.

A diagram showing an overview of the sound metaphor recognition processing in a sound metaphor recognition device according to one embodiment of the present invention.
A functional block diagram showing the configuration of the sound metaphor recognition device according to the embodiment.
A diagram showing the sound metaphor recognition processing flow of the sound metaphor recognition processing unit according to the embodiment.
A diagram showing the HMM (Hidden Markov Model) for non-speech interval detection according to the embodiment.
A diagram showing the non-speech interval detection processing flow of the non-speech interval detection unit according to the embodiment.
A diagram showing the feature extraction intervals according to the embodiment.
A diagram showing the sound metaphor recognition model according to the embodiment.
A diagram showing the editing screen according to the embodiment.
A diagram showing the sound metaphor table included in the training data according to the embodiment.
A diagram showing the caption table included in the training data according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The sound metaphor recognition device of this embodiment recognizes non-speech (sounds other than spoken voice and speech sounds) included in audio data, outputs text representing the non-speech together with the display attributes associated with that text, and generates a sound metaphor as a graphic expression.

At present, techniques that use speech recognition to produce captions for broadcast programs are in practical use. Because broadcast captions are mainly intended to provide information accessibility for hearing-impaired and elderly people, the target of speech recognition is limited to the spoken language in a program that can be converted to text. However, many broadcast programs do not consist of spoken language alone. For example, non-speech such as non-linguistic sounds (for example, laughter), applause, and background music is added to the audio of broadcast programs to meet the demands of program direction. Such non-speech, that is, sound other than spoken language, is considered to play an important role in conveying information, just as spoken language does, for example by signaling a change of scene. For this reason, non-speech is one of the elements essential for viewers to understand a program.
Broadcast captions can be understood as one method of visually expressing spoken language as text, but in current caption broadcasting only spoken language is captioned, and non-speech is seldom expressed as text.

Meanwhile, comics use a technique called the sound metaphor as a visual expression of text. The sound metaphor is one of the elements that make up comics, and is defined as sound symbolism such as onomatopoeic and mimetic words, or as a visually rendered linguistic expression that conveys the feelings of the characters. In conventional terms, a sound metaphor is a kind of typography, characterized in comics by attributes such as the character font (including handwriting), color scheme, and placement. In comics the story usually advances through the characters' lines shown in speech balloons, but sound metaphors are also important information that helps readers understand the story and feel a sense of presence.

Sound metaphors in comics are a technique for expression in which still images are arranged as panels on a medium such as paper or a computer display. In broadcast programs as well, expressions similar to these sound metaphors are conspicuous as a directing technique, particularly in the variety programs of commercial broadcasters. In the video of a broadcast program, however, attributes such as change over time (motion such as a change in size) are given in addition to the text placement and color scheme that are attributes of sound metaphors in comics. Sound metaphors in broadcast programs are therefore defined not as the static sound metaphors used in comics but as dynamic sound metaphors (usually called character animation or motion typography). Like sound metaphors in comics, sound metaphors in broadcast programs have potential as a form of video expression that leaves a strong impression on viewers. Accordingly, if non-speech can be expressed visually by sound metaphors, considerable industrial application can be expected, for example as a directing technique for producing programs that deepen the understanding of hearing-impaired viewers.

Whether in comics or in broadcast programs, sound metaphors are placed on still images or video according to the intent of the author or director. An experienced director can select an appropriate text expression and attributes for a non-speech sound and then place the sound metaphor, but in general it is difficult to convert non-speech into a linguistic expression that suits the directing intent. This is because the image evoked by non-speech depends on individual subjectivity, so the means of conveying a message accurately to viewers cannot be uniquely determined.
Moreover, when sound metaphors are considered from the viewpoint of information accessibility for hearing-impaired and elderly people, the major question is which text expression to select for a given non-speech sound and how to place it in the video content as a graphic that adequately conveys the information.

With respect to this problem, if non-speech can be recognized automatically and the color scheme and size of the characters can be estimated appropriately for a sound metaphor in video content, the production cost of presenting sound visually can be reduced substantially. It therefore suffices to develop a method that statistically estimates the text expression of non-speech and its attributes, on the condition that appropriate training data is available.
A text animation generation method based on such sound metaphors is one kind of program production method based on pattern recognition and machine learning, and is considered capable of becoming a new method that substantially renews conventional production methods.
Using speech recognition, the sound metaphor recognition device of this embodiment recognizes non-speech from the acoustic characteristics of the non-speech in the audio data and the linguistic characteristics in its vicinity by means of a statistically estimated sound metaphor recognition model, and generates a sound metaphor that includes a text expression and the on-screen display attributes of that text expression.

FIG. 1 is a diagram showing an outline of the sound metaphor recognition process in the sound metaphor recognition device of the present embodiment.
The sound metaphor recognition device of the present embodiment estimates information about a sound metaphor from the speech (spoken language) and non-speech (non-linguistic sound) contained in audio data such as that of a broadcast program. It applies a statistical method to estimate sound metaphors from the audio data. A sound metaphor consists of non-speech text and display attributes. The user corrects the estimated sound metaphor according to the directorial intent, for example with a video editing machine. In this way, the device enables sound metaphor expressions to be produced at low cost.

The sound metaphor of the present embodiment is intended for presentation on the display screen of a broadcast program. It therefore consists of a text expression of a non-speech sound and display attributes.
The text expression of a non-speech sound is a word string that expresses the sound. Examples are onomatopoeic words, mimetic words, onomatopoeia in general, and interjections, such as 「ワンワン」 ("wan-wan", a dog's bark) and 「ガチャガチャ」 ("gacha-gacha", a clattering sound); the expression consists of a single word or a short phrase.
Display attributes are attributes that accompany the text expression when it is shown on screen. They include character attributes and display motion. Character attributes are the font (typeface, character design), size, color, and so on of the characters used to display the text expression. Display motion is a change in size, movement, or the like while the text expression is displayed. Movement covers not only the case where the display position moves but also the case where it does not.
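As an illustration of the structure just described, a sound metaphor can be sketched as a text expression paired with optional display attributes. The class and field names below (`SoundMetaphor`, `DisplayAttributes`, `motion`, and so on) and the example values are assumptions made for this sketch; the description does not prescribe any particular data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DisplayAttributes:
    """Attributes that accompany the text when it is shown on screen.

    Every field is optional because the description only requires
    one or more of typeface, size, color, and display motion.
    """
    font: Optional[str] = None      # typeface / character design
    size: Optional[int] = None      # character size (units are an assumption)
    color: Optional[str] = None     # e.g. an RGB hex string
    motion: Optional[str] = None    # e.g. "shake", "grow", or "static"


@dataclass
class SoundMetaphor:
    """Text expression of a non-speech sound plus its display attributes."""
    text: str
    attributes: DisplayAttributes = field(default_factory=DisplayAttributes)


# Hypothetical example: a dog's bark shown in large, shaking orange letters.
bark = SoundMetaphor(
    text="ワンワン",
    attributes=DisplayAttributes(font="rounded", size=48,
                                 color="#ff6600", motion="shake"),
)
```

A `motion` value such as `"static"` covers the case, mentioned above, in which the display position does not move.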

Details of the components of a sound metaphor are described in, for example, 水口充 and one other, 「文字アニメーションの自動合成の試み」 ("An attempt at automatic synthesis of character animation"), IPSJ SIG Technical Report (Information Processing Society of Japan), 2005, 2005-HI-116(15), pp. 97-104.
The text expressions and display attributes described above are only examples; anything that can be displayed on screen can be a component of a sound metaphor. For example, ASCII art (a kind of graphic expression made of characters) used on Internet bulletin boards and in e-mail, and emoji used in social networking services, can be components.

FIG. 2 is a block diagram showing the configuration of a sound metaphor recognition device 1 according to an embodiment of the present invention; only the functional blocks relevant to the present embodiment are shown. The sound metaphor recognition device 1 is implemented by a computer. As shown in the figure, it comprises a sound metaphor recognition processing unit 2, a display unit 3, an input unit 4, and a sound metaphor recognition model learning processing unit 5. The sound metaphor recognition processing unit 2 performs sound metaphor recognition using a sound metaphor recognition model. The sound metaphor recognition model is a statistical model that takes acoustic, linguistic, and prosodic features as input and yields the posterior probability of a sound metaphor (the conditional probability expressing how likely the sound metaphor is given the features). The display unit 3 is a display that shows images. The display unit 3 may instead cause images to be shown on the display of a computer terminal connected to the sound metaphor recognition device 1 over a network. The input unit 4 is a keyboard, a mouse, a sensor arranged on a touch panel, or the like, and receives input operations from the user. The input unit 4 may instead receive information on input operations performed by the user at a computer terminal connected to the sound metaphor recognition device 1 over a network. The sound metaphor recognition model learning processing unit 5 learns the sound metaphor recognition model that the sound metaphor recognition processing unit 2 uses for sound metaphor recognition.

As features for sound metaphor recognition, the sound metaphor recognition processing unit 2 obtains linguistic features from the text data produced by speech recognition of the speech sections in the input audio data D1, and obtains prosodic features of the speech sections from the input audio data D1. For this purpose, the sound metaphor recognition processing unit 2 performs speech recognition on the speech sections of the input audio data D1 separately from sound metaphor recognition. It therefore splits the input audio data D1 into two streams, one used as input to speech recognition and the other as input to sound metaphor recognition.

The sound metaphor recognition processing unit 2 comprises an acoustic model storage unit 20, a language model storage unit 21, a speech section detection unit 22, a speech recognition unit 23, a non-speech section detection unit 24, an acoustic feature extraction unit 25, a linguistic feature extraction unit 26, a prosodic feature extraction unit 27, a sound metaphor recognition unit 28, and a result editing unit 29.

The acoustic model storage unit 20 stores a statistical acoustic model for speech section detection, a statistical acoustic model for non-speech section detection, and a statistical acoustic model for speech recognition. The language model storage unit 21 stores a statistical language model for speech recognition. As preprocessing for speech recognition, the speech section detection unit 22 matches the input audio data D1 against the statistical acoustic model for speech section detection stored in the acoustic model storage unit 20 and identifies the speech sections in the input audio data D1. A speech section is a section of spoken voice (spoken language). The speech recognition unit 23 performs speech recognition on the input audio data D1 of the speech sections identified by the speech section detection unit 22, using the statistical acoustic model for speech recognition stored in the acoustic model storage unit 20 and the statistical language model for speech recognition stored in the language model storage unit 21. The speech recognition unit 23 outputs speech recognition result data D2 containing the recognition result for the utterance content.

As preprocessing for sound metaphor recognition, the non-speech section detection unit 24 matches the input audio data D1 against the statistical acoustic model for non-speech section detection stored in the acoustic model storage unit 20 and identifies the non-speech sections in the input audio data D1. The acoustic feature extraction unit 25 obtains acoustic features from the input audio data D1 of each non-speech section identified by the non-speech section detection unit 24. The linguistic feature extraction unit 26 extracts the linguistic features of a linguistic feature extraction section from the text data of the speech recognition result indicated by the speech recognition result data D2. The linguistic feature extraction section contains the non-speech section detected by the non-speech section detection unit 24 and is longer than that section by a predetermined amount. The prosodic feature extraction unit 27 extracts prosodic features from the input audio data D1 in a prosodic feature extraction section. The prosodic feature extraction section likewise contains the non-speech section detected by the non-speech section detection unit 24 and is longer than it by a predetermined amount. In the present embodiment the prosodic feature extraction section is the same as the linguistic feature extraction section, but the two may differ.
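The relationship between a non-speech section and its feature extraction section can be sketched as follows. This is a minimal illustration that assumes the section is widened by the same margin before and after and clipped to the bounds of the audio; the function name `extraction_window` and the `margin` parameter are hypothetical, and the description only states that the extraction section contains the non-speech section and is longer by a predetermined amount.

```python
def extraction_window(nonspeech_start, nonspeech_end, margin, audio_duration):
    """Return (start, end) in seconds of a feature extraction section.

    The section contains the non-speech section and extends `margin`
    seconds on each side (a symmetric margin is an assumption of this
    sketch), clipped to the range [0, audio_duration].
    """
    if nonspeech_end < nonspeech_start:
        raise ValueError("section end precedes section start")
    start = max(0.0, nonspeech_start - margin)
    end = min(audio_duration, nonspeech_end + margin)
    return start, end


# A non-speech sound from 10 s to 12 s with a 3 s margin on each side.
window = extraction_window(10.0, 12.0, margin=3.0, audio_duration=60.0)
```

Because the embodiment allows the prosodic and linguistic extraction sections to differ, the same function could be called with a different margin for each feature type.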

The sound metaphor recognition unit 28 performs sound metaphor recognition by feeding the acoustic features obtained by the acoustic feature extraction unit 25, the linguistic features extracted by the linguistic feature extraction unit 26, and the prosodic features obtained by the prosodic feature extraction unit 27 into the sound metaphor recognition model, a statistical model for sound metaphor recognition. It outputs sound metaphor data D3 containing the recognition result to the result editing unit 29. The recognition result is a sound metaphor consisting of a text expression (character string) of the non-speech sound and its display attributes. The display attributes are one or more of the typeface, size, color, and display motion of the characters representing the text expression.
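Selecting a sound metaphor from posterior probabilities can be sketched as below. The sketch assumes that the model has already produced one real-valued score per candidate sound metaphor and that posteriors are obtained by normalizing those scores with a softmax; both the scoring scheme and the names `posteriors` and `select_sound_metaphor` are assumptions for illustration, since this passage does not specify the form of the model.

```python
import math


def posteriors(scores):
    """Convert candidate scores into probabilities that sum to 1 (softmax)."""
    top = max(scores.values())
    # Subtracting the maximum keeps exp() from overflowing.
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def select_sound_metaphor(scores):
    """Return the candidate with the highest posterior and that posterior."""
    post = posteriors(scores)
    best = max(post, key=post.get)
    return best, post[best]


# Hypothetical scores for three candidate text expressions.
candidate_scores = {"wan-wan": 2.0, "gacha-gacha": 0.0, "(laughter)": -1.0}
best_label, best_posterior = select_sound_metaphor(candidate_scores)
```

In practice each candidate would pair a text expression with display attributes; plain strings are used here only to keep the example short.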

The result editing unit 29 generates a screen in which the sound metaphor indicated by the sound metaphor data D3 is superimposed on the display screen of the input video data D4, from which the input audio data D1 was extracted, and causes the display unit 3 to show it. In accordance with correction instructions entered through the input unit 4, the result editing unit 29 corrects the text expression and display attributes of the sound metaphor indicated by the sound metaphor data D3. It outputs completed video data D5, produced by editing the video so that the corrected sound metaphor is superimposed on the display screen of the input video data D4. In addition, it outputs corrected sound metaphor data D10, containing the result of correcting the sound metaphor indicated by the sound metaphor data D3 according to the correction instructions, together with the acoustic, prosodic, and linguistic features from which the sound metaphor indicated by the sound metaphor data D3 was obtained.

The sound metaphor recognition model learning processing unit 5 learns the sound metaphor recognition model before the sound metaphor recognition processing unit 2 performs sound metaphor recognition. It comprises a spoken language resource storage unit 50, a learning acoustic feature extraction unit 51, a learning linguistic feature extraction unit 52, a learning prosodic feature extraction unit 53, a sound metaphor recognition model learning unit 54, and a sound metaphor recognition model storage unit 55.

The spoken language resource storage unit 50 stores the training data for the sound metaphor recognition model. The training data associates training audio data, sound metaphors, and text data of the utterance content with one another. Each sound metaphor is annotated with its non-speech section, given by the start and end times of the non-speech sound in the training audio data. The text data of the utterance content is annotated with its speech section, given by the start and end times of the utterance in the training audio data.

The learning acoustic feature extraction unit 51 obtains, from the training data stored in the spoken language resource storage unit 50, non-speech data D6, which is the training audio data of the non-speech section corresponding to a sound metaphor, and extracts acoustic features from it. The learning linguistic feature extraction unit 52 identifies the linguistic feature extraction section corresponding to the non-speech section of the non-speech data D6, obtains the text data D7 of the utterance content in that section from the training data stored in the spoken language resource storage unit 50, and extracts linguistic features from the text data D7. The learning prosodic feature extraction unit 53 identifies the prosodic feature extraction section corresponding to the non-speech section of the non-speech data D6, obtains audio data D8, which is the training audio data of that section, from the training data stored in the spoken language resource storage unit 50, and extracts prosodic features from the audio data D8.
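How time-annotated training data might be queried for the utterances that fall in an extraction section can be sketched as follows. The record type `Utterance`, the function `utterances_in_window`, and the rule that an utterance is selected when it overlaps the section at all are assumptions of this sketch; the description states only that utterance text carries start and end times and that the text for the extraction section is obtained.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Utterance text annotated with its speech section (seconds)."""
    text: str
    start: float
    end: float


def utterances_in_window(utterances, window):
    """Return utterances whose speech section overlaps the window.

    `window` is a (start, end) pair such as a linguistic feature
    extraction section derived from a labeled non-speech section.
    """
    w_start, w_end = window
    return [u for u in utterances if u.start < w_end and u.end > w_start]


# Hypothetical transcript of a training clip.
transcript = [
    Utterance("good evening", 0.0, 2.0),
    Utterance("listen to that dog", 9.0, 11.0),
    Utterance("back to the studio", 20.0, 22.0),
]
# Extraction section around a non-speech sound labeled at 10 s to 12 s.
selected = utterances_in_window(transcript, (7.0, 15.0))
```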

The sound metaphor recognition model learning unit 54 receives the acoustic features extracted by the learning acoustic feature extraction unit 51, the linguistic features extracted by the learning linguistic feature extraction unit 52, and the prosodic features extracted by the learning prosodic feature extraction unit 53. It reads the sound metaphor data D9 corresponding to the non-speech section of the non-speech data D6 from the training data stored in the spoken language resource storage unit 50. Using the received acoustic, linguistic, and prosodic features and the sound metaphor data D9, it learns the sound metaphor recognition model by statistical means and stores the learned model in the sound metaphor recognition model storage unit 55. The sound metaphor recognition unit 28 performs sound metaphor recognition using the model stored in the sound metaphor recognition model storage unit 55.

The sound metaphor recognition model learning unit 54 also updates the sound metaphor recognition model on the basis of corrected sound metaphors. The spoken language resource storage unit 50 stores the corrected sound metaphor data D10 output from the result editing unit 29 in association with the acoustic, prosodic, and linguistic features from which the uncorrected sound metaphor was obtained. The sound metaphor recognition model learning unit 54 reads these data from the spoken language resource storage unit 50 and updates the sound metaphor recognition model stored in the sound metaphor recognition model storage unit 55. The sound metaphor recognition model learning unit 54 may update the sound metaphor recognition model separately for each category. In that case, the result editing unit 29 outputs user profile data D11 of the user who made the correction, in association with the corrected sound metaphor data D10. For each category obtained from the user profile data D11, the sound metaphor recognition model learning unit 54 updates the sound metaphor recognition model using the corrected sound metaphor data D10 and the acoustic, prosodic, and linguistic features from which the uncorrected sound metaphor was obtained. It stores the category in association with the updated sound metaphor recognition model. The sound metaphor recognition unit 28 reads the sound metaphor recognition model corresponding to the category entered through the input unit 4 from the sound metaphor recognition model storage unit 55 and performs sound metaphor recognition.

Next, the operation of the sound metaphor recognition device 1 will be described.
First, the sound metaphor recognition process in the sound metaphor recognition device 1 will be described.
The sound metaphor recognition processing unit 2 of the sound metaphor recognition device 1 takes acoustic, linguistic, and prosodic features as input to the sound metaphor recognition process. It therefore runs the acoustic feature extraction by the acoustic feature extraction unit 25, the linguistic feature extraction by the linguistic feature extraction unit 26, and the prosodic feature extraction by the prosodic feature extraction unit 27 in coordination.

For example, suppose that the non-speech section detection unit 24 has detected a total of N non-speech sections. First, the acoustic feature extraction unit 25 extracts acoustic features for the n-th (n = 1, ..., N) non-speech section. Next, the linguistic feature extraction unit 26 extracts linguistic features from the linguistic feature extraction section determined from the start and end times of the n-th non-speech section. The linguistic features are based on the frequencies of word strings in the speech recognition result within that section. Similarly, the prosodic feature extraction unit 27 extracts the prosodic features of the prosodic feature extraction section determined by the n-th non-speech section. Finally, the sound metaphor recognition unit 28 combines the acoustic, linguistic, and prosodic features extracted for the n-th non-speech section by the acoustic feature extraction unit 25, the linguistic feature extraction unit 26, and the prosodic feature extraction unit 27 into a single input feature. The sound metaphor recognition unit 28 performs sound metaphor recognition on each of the N input features.
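The per-section feature assembly described above can be sketched as follows. The sketch assumes that linguistic features are counts of single words over a fixed vocabulary and that the three feature types are combined by simple concatenation; the description says only that linguistic features are based on word-string frequencies and that the features are integrated, so the vocabulary, the function names, and the numeric values are all hypothetical.

```python
from collections import Counter


def word_frequency_features(timed_words, window, vocabulary):
    """Count vocabulary words uttered inside the extraction section.

    `timed_words` is a list of (word, time_in_seconds) pairs taken
    from a speech recognition result; `window` is (start, end).
    """
    start, end = window
    counts = Counter(w for w, t in timed_words if start <= t <= end)
    return [counts[v] for v in vocabulary]


def combine_features(acoustic, linguistic, prosodic):
    """Integrate the three feature vectors into one input vector."""
    return list(acoustic) + list(linguistic) + list(prosodic)


# Hypothetical recognition result and vocabulary.
recognized = [("dog", 5.0), ("is", 5.2), ("barking", 5.5), ("dog", 20.0)]
vocab = ["dog", "cat", "barking"]

linguistic = word_frequency_features(recognized, (4.0, 8.0), vocab)
input_feature = combine_features([0.1, 0.2], linguistic, [120.0])
```

Running this once per detected non-speech section yields the N input features on which recognition is performed.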

The sound metaphor recognition process in the sound metaphor recognition device 1 is described in detail below.
FIG. 3 is a diagram showing the flow of the sound metaphor recognition process of the sound metaphor recognition device 1.
First, the sound metaphor recognition device 1 stores the statistical acoustic models for speech section detection, non-speech section detection, and speech recognition in the acoustic model storage unit 20, and stores the statistical language model for speech recognition in the language model storage unit 21. The sound metaphor recognition model storage unit 55 holds a sound metaphor recognition model learned by the sound metaphor recognition model learning process described later.
The statistical acoustic model for speech section detection and the statistical acoustic model and statistical language model for speech recognition may be the same as conventional ones. In the present embodiment, an HMM (Hidden Markov Model) and GMMs (Gaussian Mixture Models) are used as the statistical acoustic model for non-speech section detection. The HMM and GMMs for non-speech section detection are trained by conventional methods, using as training data audio labeled with the three classes speech, non-speech, and silence. In the non-speech GMM, for example, each of the mixed Gaussian components is made to represent the characteristics of a different kind of non-speech sound. The HMM for non-speech section detection is described later with reference to FIG. 4.
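To make the role of the per-class GMMs concrete, the following sketch scores one feature vector against a Gaussian mixture for each of the three classes. It assumes diagonal covariances and uses toy two-dimensional parameters; the function names and all numeric values are hypothetical and are not taken from the embodiment.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log likelihood of x under a GMM.

    `components` is a list of (weight, mean, var) triples.
    The log-sum-exp form avoids numerical underflow.
    """
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def score_frame(x, class_gmms):
    """Return the log likelihood of one frame under every class GMM."""
    return {name: gmm_log_likelihood(x, comps)
            for name, comps in class_gmms.items()}


# Toy models: the non-speech GMM has two components, standing for
# two different kinds of non-speech sound.
class_gmms = {
    "speech": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "nonspeech": [(0.5, [5.0, 5.0], [1.0, 1.0]),
                  (0.5, [-5.0, 5.0], [1.0, 1.0])],
    "silence": [(1.0, [0.0, -5.0], [1.0, 1.0])],
}
frame_scores = score_frame([4.8, 5.1], class_gmms)
```

The resulting per-class log likelihoods are the kind of acoustic scores that an HMM over the three classes would combine with its transition probabilities.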

Each time input audio data D1 is input, the sound metaphor recognition processing unit 2 of the sound metaphor recognition device 1 performs the process shown in FIG. 3. To cut out speech sections and non-speech sections from the input audio data D1, in which speech and non-speech are mixed, it branches the input audio data D1 into two. It feeds one branch to the speech section detection unit 22 and the other to the non-speech section detection unit 24.

Using conventional techniques, the speech section detection unit 22 detects and cuts out the speech sections of the input audio data D1 that need to be transcribed (step S105). A speech section may overlap with non-speech such as background sound. In the present embodiment, speech sections are detected by the techniques described in Japanese Unexamined Patent Application Publication No. 2007-233148 and No. 2007-233149. The speech section detection unit 22 outputs speech section data, cut out of the input audio data D1 for each detected speech section, to the speech recognition unit 23.

Specifically, each time input audio data D1 is input, the speech section detection unit 22 divides the audio indicated by the input audio data D1 into input frames, each frame being one processing unit of a predetermined time interval. It calculates the acoustic features of each of a predetermined number of input frames selected in chronological order. The state transition network for speech section detection is a left-to-right HMM that passes, without skipping, through the three states non-speech, speech, and silence between the start and the end of an utterance. A non-speech state may be used in place of the silence state. The speech section detection unit 22 reads the non-speech and speech acoustic models from the acoustic model storage unit 20 and uses them to calculate the acoustic score (log likelihood) of each input frame. The non-speech acoustic model represents HMMs for silence, non-speech, and the like. The speech acoustic model consists of a phoneme HMM for each phoneme. The speech section detection unit 22 keeps a record of the state transitions for each input frame, traces that record back from the current state toward the start state, and uses the state transition network to calculate the cumulative acoustic score of each state sequence from the input frame at the start of processing (the starting point). When the difference between the largest of these cumulative acoustic scores and the acoustic score of the starting point exceeds a threshold, it takes as the utterance start time the time a predetermined interval before the last time at which the sequence with the largest cumulative score was in the non-speech state.
For input frames after the utterance start time has been detected, the speech section detection unit 22 likewise calculates the cumulative acoustic score of each state sequence from the input frame at the start of processing up to the current input frame. It determines whether the difference between the largest cumulative acoustic score among all state sequences and the largest cumulative acoustic score among those state sequences that run from speech to the silence end state has exceeded a threshold. When the threshold has remained exceeded for a predetermined time, it takes as the utterance end time the time a predetermined interval before the time at which that period elapsed.
The speech section detection unit 22 outputs speech section data, which collects the input frames from the utterance start time to the utterance end time, to the speech recognition unit 23. It also outputs information indicating the identified speech section to the prosodic feature extraction unit 27.
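The timing rule for the utterance end can be sketched in isolation. The sketch assumes that a score difference has already been computed for every frame, and it only illustrates the rule that the threshold must stay exceeded for a set duration, after which the end time is placed a set interval earlier. The function name and the parameters `hold_frames` and `backoff_frames` are hypothetical, and the computation of the cumulative acoustic scores themselves is not reproduced here.

```python
def detect_end_frame(score_gaps, threshold, hold_frames, backoff_frames):
    """Return the frame index taken as the utterance end, or None.

    `score_gaps[i]` is the score difference observed at frame i.
    Once the gap has exceeded `threshold` for `hold_frames`
    consecutive frames, the end is placed `backoff_frames` frames
    before the frame at which that duration elapsed.
    """
    run = 0
    for i, gap in enumerate(score_gaps):
        run = run + 1 if gap > threshold else 0
        if run >= hold_frames:
            return max(0, i - backoff_frames)
    return None


# The gap exceeds the threshold briefly at frames 2-3, then stays
# above it from frame 5 onward.
gaps = [0.0, 0.0, 5.0, 5.0, 0.0, 5.0, 5.0, 5.0, 5.0]
end_frame = detect_end_frame(gaps, threshold=1.0,
                             hold_frames=3, backoff_frames=2)
```

Requiring the condition to persist keeps a short pause inside an utterance from being mistaken for its end.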

The speech recognition unit 23 performs speech recognition on the speech section data output by the speech section detection unit 22 by conventional techniques, using the statistical acoustic model for speech recognition stored in the acoustic model storage unit 20 and the statistical language model stored in the language model storage unit 21 (step S110). In the present embodiment, the speech recognition unit 23 uses an HMM and GMMs as the statistical acoustic model, and obtains the recognition result by multi-pass speech recognition with a word n-gram language model as the statistical language model. The recognition result is text segmented into words, and the speech recognition unit 23 attaches to each word the time at which it was uttered. The speech recognition unit 23 outputs speech recognition result data D2 containing the recognition result.

Meanwhile, the non-speech section detection unit 24 detects and cuts out the non-speech sections of the input audio data D1, which consist of non-speech including background sound and the like (step S115). In the present embodiment, the non-speech section detection unit 24 detects non-speech sections including those that overlap portions to be transcribed by speech recognition. It detects non-speech sections with the same algorithm as the speech section detection unit 22, using the GMMs and HMM for non-speech section detection stored in the acoustic model storage unit 20. The difference is that the speech section detection unit 22 targets speech sections, whereas the non-speech section detection unit 24 targets sections of non-speech audio. In addition, the HMM for non-speech section detection is used in place of the state transition network for speech section detection.

FIG. 4 shows the HMM for non-speech section detection stored in the acoustic model storage unit 20. In this embodiment, the HMM is configured as a so-called ergodic HMM. As shown in the figure, this ergodic HMM represents transitions among three classes: speech, non-speech, and silence. Each transition is assigned a transition probability obtained by training.

FIG. 5 shows the non-speech section detection processing flow of the non-speech section detection unit 24 and details the processing of step S115 in FIG. 3. First, each time input audio data D1 is input, the non-speech section detection unit 24 divides it into input frames, each being one processing unit of a predetermined duration. An input frame is the unit over which acoustic feature quantities are computed, and is normally 10 milliseconds long or close to it.

The non-speech section detection unit 24 acquires a predetermined number of input frames, earliest first, from among the input frames not yet processed (step S205). It computes the acoustic feature quantity of each acquired input frame. It then reads from the acoustic model storage unit 20 the GMMs for speech, non-speech, and silence, which are the states of the HMM. It matches these GMMs against the acoustic feature quantity of each input frame to compute the acoustic score of each frame and, where necessary, makes transitions between HMM states (step S210). If the predetermined number of input frames needed for traceback has not yet been processed (step S215: NO), the non-speech section detection unit 24 returns to step S205, acquires new input frames, and computes their acoustic scores.

When the predetermined number of input frames needed for traceback has been processed (step S215: YES), the non-speech section detection unit 24 obtains by traceback a list of the state sequences leading to the current state (step S220). That is, it traces the record of state transitions back from the current state toward the start state and, using the ergodic HMM shown in FIG. 4, computes the cumulative acoustic score of each state sequence from the state of the input frame at which processing began (the start state) to the current state. In doing so, it sorts the sequences in descending order of cumulative acoustic score.

The non-speech section detection unit 24 compares the first-ranked and second-ranked sequences among the HMM state sequences obtained by traceback (step S225). If the difference between their cumulative acoustic scores is at or below a predetermined threshold, it judges that the section is not yet determined (step S230: NO), returns to step S205, and computes acoustic scores for new input frames. If it judges that the difference exceeds the predetermined threshold (step S230: YES), it takes the first-ranked sequence as the determined section. Finally, the non-speech section detection unit 24 outputs, as non-speech section frame data, frame sequences in which the frames of the determined non-speech sections are gathered in time order (step S235). Each frame sequence carries the start time of the frames, the end time, or both.
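The detection flow of steps S205 to S235 can be illustrated with a small decoder over the three-state ergodic HMM. This is only a sketch under stated assumptions: the per-frame log scores are taken as given (the embodiment obtains them from GMMs), the first-versus-second comparison is approximated by the gap between the two best final-state scores rather than a full ranked list of sequences, and the names `viterbi` and `non_speech_runs` are hypothetical.

```python
import numpy as np

STATES = ("speech", "non_speech", "silence")

def viterbi(log_emis, log_trans, log_init):
    """Decode the best state path; also return the score gap to the runner-up.

    log_emis:  (n_frames, n_states) per-frame log acoustic scores (assumed given)
    log_trans: (n_states, n_states) log transition probabilities
    log_init:  (n_states,) log initial probabilities
    """
    n_frames, n_states = log_emis.shape
    score = log_init + log_emis[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans      # cand[i, j]: reach state j from i
        back[t] = cand.argmax(axis=0)          # best predecessor of each state
        score = cand.max(axis=0) + log_emis[t]
    ranked = np.sort(score)[::-1]
    margin = ranked[0] - ranked[1]             # simplified 1st-vs-2nd comparison
    state = int(score.argmax())
    path = [state]
    for t in range(n_frames - 1, 0, -1):       # traceback toward the start state
        state = int(back[t, state])
        path.append(state)
    return path[::-1], margin

def non_speech_runs(path):
    """Group consecutive non-speech frames into (start_frame, end_frame) pairs.

    end_frame is exclusive.
    """
    target = STATES.index("non_speech")
    runs, start = [], None
    for i, s in enumerate(list(path) + [None]):  # sentinel closes a trailing run
        if s == target and start is None:
            start = i
        elif s != target and start is not None:
            runs.append((start, i))
            start = None
    return runs
```

A caller would accept the decoded path only when `margin` exceeds a chosen threshold and otherwise read more frames, mirroring the branch at step S230. With 10 ms frames, multiplying the frame indices by 0.01 gives section times in seconds.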

In FIG. 3, the sound metaphor recognition processing unit 2 performs the processing of steps S125 to S140 below for each of the N non-speech sections detected by the non-speech section detection unit 24. The sound metaphor recognition processing unit 2 sets n = 1 as the initial value (step S120).

The acoustic feature quantity extraction unit 25 extracts the acoustic feature quantity of each frame contained in the n-th non-speech section frame data output by the non-speech section detection unit 24 (step S125). In this embodiment, log mel filter bank outputs, which are commonly used in speech recognition, serve as the acoustic feature quantities. The acoustic feature quantity extraction unit 25 normalizes the log mel filter bank outputs to zero mean and unit variance beforehand. In speech recognition, audio data normally undergoes a discrete Fourier transform, is passed through a mel filter bank, is log-transformed, and is then converted by a discrete cosine transform into mel-frequency cepstral coefficients (MFCCs), which serve as the acoustic feature quantities. The acoustic feature quantity extraction unit 25, however, omits the discrete cosine transform so that the frequency content of the sound is used directly as the acoustic feature quantity.
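As a rough illustration of this feature pipeline (DFT, mel filter bank, logarithm, then normalization, with no discrete cosine transform), the following NumPy sketch may help. The sampling rate, frame length, hop, FFT size, and number of mel channels are assumptions, as is the choice to compute the normalization statistics over the section itself; the source states only that normalization is done beforehand.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale; shape (n_mels, n_fft//2+1)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_features(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    """Normalized log mel filter bank outputs, one row per 10 ms frame (assumed sizes)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    logmel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Zero mean, unit variance per channel; no DCT, so channels stay in frequency order.
    return (logmel - logmel.mean(axis=0)) / (logmel.std(axis=0) + 1e-8)
```

Skipping the DCT keeps each output dimension tied to a frequency band, which suits a convolutional front end that exploits locality along the frequency axis.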

The language feature quantity extraction unit 26 extracts, from the speech recognition result indicated by the speech recognition result data D2, the linguistic feature quantities required by the sound metaphor recognition unit 28 (step S130).
FIG. 6 shows the feature quantity extraction sections. The language feature quantity extraction unit 26 determines the start point (start time) and end point (end time) of the language feature quantity extraction section from the beginning (start time) and end (end time) of the n-th non-speech section identified by the non-speech section detection unit 24. Specifically, the language feature quantity extraction section runs from the time K time units (seconds) before the beginning of the non-speech section to the time K time units (seconds) after its end.
This rests on the assumption that the linguistic context that strongly influences the estimation of the verbal expression of a non-speech sound is confined to the non-speech section and its surroundings. In this embodiment, that context is obtained from the frequency distribution of words.

The language feature quantity extraction unit 26 extracts the language feature quantity from those words in the word sequence of the speech recognition result indicated by the speech recognition result data D2 that fall within the language feature quantity extraction section. In this embodiment, the language feature quantity is defined as a relative frequency vector. Let |V| be the size of the speech recognition vocabulary V, v (v ∈ V) each word in V, and M the total number of words in the language feature quantity extraction section; the language feature quantity w is then given by equation (1).

Language feature quantity
$$ w = \left[ \frac{c(v_1)}{M},\ \frac{c(v_2)}{M},\ \ldots,\ \frac{c(v_{|V|})}{M} \right]^{T} \tag{1} $$

In equation (1), $T$ denotes transposition and $v_1, v_2, \ldots$ are the words $v$. In addition, $c(v)$ is a function that returns the frequency of word $v$ within the language feature quantity extraction section, and satisfies $\sum_{v \in V} c(v) = M$.
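The extraction window and equation (1) can be sketched together in a few lines. This is an illustrative sketch, not the embodiment's implementation: the function name `language_feature`, the representation of recognized words as `(word, time_sec)` pairs, and the all-zero vector returned for an empty window are assumptions.

```python
from collections import Counter

def language_feature(words, vocab, seg_start, seg_end, k_sec):
    """Relative frequency vector w of equation (1).

    words:     list of (word, time_sec) pairs from the recognition result
    vocab:     ordered list of vocabulary words v_1 ... v_|V|
    seg_start: start time of the non-speech section, in seconds
    seg_end:   end time of the non-speech section, in seconds
    k_sec:     the margin K added before and after the section
    """
    lo, hi = seg_start - k_sec, seg_end + k_sec
    known = set(vocab)
    counts = Counter(w for w, t in words if lo <= t <= hi and w in known)
    total = sum(counts.values())           # M: number of words in the window
    if total == 0:
        return [0.0] * len(vocab)          # assumption: empty window gives zeros
    return [counts[v] / total for v in vocab]
```

For a non-speech section from 5.0 s to 6.0 s and K = 1.0, only words uttered between 4.0 s and 7.0 s contribute, and the returned entries sum to 1 whenever the window is non-empty.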

Note that when text data such as subtitles corresponding to the speech sections of the input audio data D1 is supplied, and start and end times are already attached to that text data, the text data may be input directly to the language feature quantity extraction unit 26. In that case the language feature quantity extraction unit 26 extracts the language feature quantity from the word sequence within the language feature quantity extraction section of the input text data instead of from the speech recognition result data D2. This allows the processing of steps S105 and S110 in FIG. 3 to be omitted.

Emotion contained in the spoken voice is also information needed to obtain an appropriate sound metaphor. This embodiment assumes that the prosody of speech carries emotion-related information, extracts the prosody from the audio data, and uses it for sound metaphor recognition as a prosodic feature quantity. As shown in FIG. 6, the prosodic feature quantity extraction section, from which the prosodic feature quantity is extracted, is the same as the language feature quantity extraction section.

In FIG. 3, the prosodic feature quantity extraction unit 27 determines the start and end points of the prosodic feature quantity extraction section based on the beginning and end indicated by the speech section information output by the speech section detection unit 22. If the prosodic feature quantity extraction section is to differ from the language feature quantity extraction section, the value of K in FIG. 6 can be changed. From the input audio data D1 corresponding to the determined section, the prosodic feature quantity extraction unit 27 extracts the fundamental frequency (the lowest of the vocal-fold frequencies) and the power of voiced sounds (mainly vowels), which constitute the prosodic information of the speech. It log-transforms the extracted fundamental frequency and power and computes a delta fundamental frequency feature vector and a delta power feature vector, each a sequence of temporal changes. These vectors take the corresponding delta fundamental frequency and delta power values in voiced intervals and are set to zero elsewhere. The prosodic feature quantity extraction unit 27 concatenates these vectors to form the prosodic feature quantity (step S135).
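The construction of the prosodic feature quantity can be sketched as follows. The sketch assumes that per-frame fundamental frequency, power, and a voiced/unvoiced flag are already available from some pitch tracker, and it uses a simple first difference as the delta; the source does not specify how the temporal change is computed, and the function name `prosodic_feature` is hypothetical.

```python
import numpy as np

def prosodic_feature(f0_hz, power, voiced):
    """Concatenate delta log-F0 and delta log-power vectors, zeroed outside voiced frames.

    f0_hz, power: per-frame arrays; voiced: per-frame boolean flags.
    A delta is kept only where both the frame and its predecessor are voiced.
    """
    voiced = np.asarray(voiced, dtype=bool)
    f0 = np.where(voiced, np.asarray(f0_hz, dtype=float), 1.0)  # avoid log(0) when unvoiced
    log_f0 = np.log(f0)
    log_pw = np.log(np.maximum(np.asarray(power, dtype=float), 1e-10))
    d_f0 = np.diff(log_f0, prepend=log_f0[0])   # first difference (assumed delta)
    d_pw = np.diff(log_pw, prepend=log_pw[0])
    prev_voiced = np.concatenate(([False], voiced[:-1]))
    valid = voiced & prev_voiced
    return np.concatenate([np.where(valid, d_f0, 0.0), np.where(valid, d_pw, 0.0)])
```

Working with deltas of log values makes the feature insensitive to a speaker's absolute pitch and to recording level, since a constant scale factor cancels in the difference.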

The sound metaphor recognition unit 28 generates the n-th input feature quantity (step S140). The acoustic feature quantity in the input feature quantity is the one that the acoustic feature quantity extraction unit 25 extracted from each frame of the n-th non-speech section. The language feature quantity in the input feature quantity is the one that the language feature quantity extraction unit 26 extracted from the language feature quantity extraction section corresponding to the n-th non-speech section. The prosodic feature quantity in the input feature quantity is the one that the prosodic feature quantity extraction unit 27 extracted from the prosodic feature quantity extraction section corresponding to the n-th non-speech section.

If the processing of steps S125 to S140 has not yet been completed for all N non-speech sections detected by the non-speech section detection unit 24, the sound metaphor recognition processing unit 2 adds 1 to n and repeats the processing from step S125 for the n-th frame data. When the processing of steps S125 to S140 has been completed for all N non-speech sections detected by the non-speech section detection unit 24, the sound metaphor recognition processing unit 2 proceeds to the processing from step S150 (step S145).

Using the sound metaphor recognition model stored in the sound metaphor recognition model storage unit 55, the sound metaphor recognition unit 28 performs the sound metaphor recognition described below on each of the N input feature quantities (step S150). Conventional speech recognition estimates text from speech, whereas sound metaphor recognition estimates the pair that makes up a sound metaphor: the text expression of the non-speech sound and its display attributes. The text expression of a non-speech sound is a word sequence that expresses the sound. The display attributes are one or more of font (typeface or lettering design), size, color, display motion, and the like. In this embodiment, a neural network is used as the statistical sound metaphor recognition model for recognizing the text expression and display attributes that make up a sound metaphor.

FIG. 7 shows the sound metaphor recognition model used in this embodiment. As shown in the figure, the sound metaphor recognition model is a multilayer neural network that takes acoustic, language, and prosodic feature quantities as input and outputs the posterior probabilities of the text expression of the sound metaphor and the posterior probabilities of its display attributes. For convenience, the model can be divided into a first neural network A1, a second neural network A2, a third neural network A3, and a fourth neural network A4.

The input layer of the sound metaphor recognition model consists of the input layers of the first neural network A1, the second neural network A2, and the third neural network A3. The input layer of the first neural network A1 takes a variable-length acoustic feature quantity as input. The first neural network A1 is a convolutional neural network that converts its input into a fixed-length feature quantity at a hidden layer partway between input and output. The output of the first neural network A1 is the input acoustic feature quantity converted to a lower dimension than the input. The input layer of the second neural network A2 takes a fixed-length prosodic feature quantity as input. The output of the second neural network A2 is the input prosodic feature quantity converted to a lower dimension than the input. The input layer of the third neural network A3 takes a fixed-length language feature quantity as input. The output of the third neural network A3 is the input language feature quantity converted to a lower dimension than the input.

The fourth neural network A4 takes as input the outputs of the first neural network A1, the second neural network A2, and the third neural network A3. The output layer of the fourth neural network A4 is the output layer of the sound metaphor recognition model and outputs the posterior probability of each sound metaphor component. As described above, the sound metaphor components are the text expression and the individual display attributes such as font, size, color, and display motion. In other words, the output layer of the sound metaphor recognition model outputs the posterior probabilities of the text expression of the sound metaphor and the posterior probabilities of each display attribute. The output layers for the text expression and for each display attribute have no connections to one another; each is connected only to the hidden layer below. In the figure, the output layer consists of a unit group B1 representing the posterior probabilities of the text expression, a unit group B2 representing the posterior probabilities of the first display attribute (display attribute 1), and a unit group B3 representing the posterior probabilities of the second display attribute (display attribute 2).
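The overall shape of this model (three input branches merged into a shared network with one independent output group per sound metaphor component) can be sketched as a NumPy forward pass with random, untrained weights. All layer sizes, the tanh and ReLU activations, the max-over-time pooling used to obtain a fixed length in A1, and the head names are assumptions made for illustration; the source specifies none of them.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Randomly initialized (weights, bias) pair; stands in for trained parameters."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

N_MEL, WIDTH, N_PROS, N_VOCAB = 40, 5, 20, 50     # assumed input sizes
HEADS = {"text": 6, "font": 3, "color": 4}        # assumed unit groups B1, B2, B3

conv_w = rng.normal(0.0, 0.1, (WIDTH * N_MEL, 16))  # A1: 16 filters spanning 5 frames
a1_out = dense(16, 8)
a2 = dense(N_PROS, 8)
a3 = dense(N_VOCAB, 8)
a4_hidden = dense(24, 32)
a4_heads = {name: dense(32, n) for name, n in HEADS.items()}

def forward(acoustic, prosodic, language):
    """acoustic: (n_frames, N_MEL) with any n_frames >= WIDTH; others fixed length."""
    # A1: 1-D convolution over time, then max pooling over time gives a fixed length.
    windows = np.stack([acoustic[t:t + WIDTH].ravel()
                        for t in range(len(acoustic) - WIDTH + 1)])
    pooled = np.maximum(windows @ conv_w, 0.0).max(axis=0)
    h1 = np.tanh(pooled @ a1_out[0] + a1_out[1])
    h2 = np.tanh(prosodic @ a2[0] + a2[1])        # A2: prosodic branch
    h3 = np.tanh(language @ a3[0] + a3[1])        # A3: language branch
    # A4: shared hidden layer over the concatenated branch outputs.
    h = np.tanh(np.concatenate([h1, h2, h3]) @ a4_hidden[0] + a4_hidden[1])
    # Each head has its own softmax and connects only to the shared hidden layer.
    return {name: softmax(h @ w + b) for name, (w, b) in a4_heads.items()}
```

Giving each component its own softmax means the posteriors within a group sum to one independently of the other groups, which matches the description of output groups that share only the hidden layer below them.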

Note that the number of layers of the multilayer neural network constituting the sound metaphor recognition model and the number of dimensions of each layer (including the number of kinds of display attributes of a sound metaphor) can be chosen freely according to the number of sound metaphor text expressions to be recognized and the amount of training data, and are determined when the sound metaphor recognition model is trained.

The sound metaphor recognition unit 28 inputs the acoustic feature quantity of the input feature quantity to the input layer of the first neural network A1 of the sound metaphor recognition model. It inputs the prosodic feature quantity of the input feature quantity to the input layer of the second neural network A2, and the language feature quantity of the input feature quantity to the input layer of the third neural network A3. From these acoustic, prosodic, and language feature quantities, the sound metaphor recognition unit 28 computes the values of the output layer with the sound metaphor recognition model. As the output of the model, it obtains an output vector whose elements are the values of the units in the output layer of the fourth neural network A4.

The sound metaphor recognition unit 28 selects an element for each sound metaphor component based on the posterior probabilities indicated by the element values of the output vector of the sound metaphor recognition model. The sound metaphor recognition model storage unit 55 stores the value of the sound metaphor component that corresponds to each element of the output vector. For each sound metaphor component, the sound metaphor recognition unit 28 selects the element of the output vector with the highest posterior probability, and takes the set of sound metaphor component values corresponding to the selected elements as the sound metaphor.

For example, suppose that unit group B1 corresponds to the text expression, unit group B2 to the font, and unit group B3 to the color. In this case, the output vector of the sound metaphor recognition model consists of the elements corresponding to the units of unit group B1, those corresponding to the units of unit group B2, and those corresponding to the units of unit group B3. The sound metaphor recognition model storage unit 55 stores the text expression associated with each element corresponding to a unit of unit group B1. It also stores the font type (Gothic, Mincho, ...) associated with each element corresponding to a unit of unit group B2, and the color (red, blue, ...) associated with each element corresponding to a unit of unit group B3.

The sound metaphor recognition unit 28 selects the element with the highest posterior probability among the output vector elements corresponding to unit group B1, and reads the text expression associated with it, "zawazawa", from the sound metaphor recognition model storage unit 55. Likewise, it selects the element with the highest posterior probability among the elements corresponding to unit group B2 and reads the associated font, "Gothic", from the sound metaphor recognition model storage unit 55. It further selects the element with the highest posterior probability among the elements corresponding to unit group B3 and reads the associated color, "red", from the sound metaphor recognition model storage unit 55. The sound metaphor recognition unit 28 takes the set of component values read in this way, namely the text expression "zawazawa", the font "Gothic", and the color "red", as the sound metaphor. In this manner, the sound metaphor recognition unit 28 selects, for each sound metaphor component, the output vector element with the highest posterior probability, and outputs sound metaphor data D3 containing the sound metaphor formed by the set of component values associated with the selected elements. Each sound metaphor is given the start time and end time of its non-speech section.
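The per-component selection can be sketched as an argmax over each slice of the output vector. The label tables below stand in for the associations held in the sound metaphor recognition model storage unit 55; apart from "zawazawa", "Gothic", "Mincho", "red", and "blue", which appear in the text, the labels and the function name are hypothetical.

```python
import numpy as np

# Stand-in for the label tables in storage unit 55; order matches B1, B2, B3.
LABELS = {
    "text": ["zawazawa", "pachipachi", "don"],   # last two are hypothetical
    "font": ["Gothic", "Mincho"],
    "color": ["red", "blue", "black"],           # "black" is hypothetical
}

def select_sound_metaphor(output_vector, labels=LABELS):
    """Pick the highest-posterior label in each unit group of the output vector."""
    result, offset = {}, 0
    for name, values in labels.items():
        group = output_vector[offset:offset + len(values)]  # this group's posteriors
        result[name] = values[int(np.argmax(group))]
        offset += len(values)
    return result
```

Because each group is decoded separately, the selected text expression, font, and color need not come from a single jointly scored combination.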

Note that the sound metaphor recognition unit 28 may instead set in the sound metaphor data D3, for each sound metaphor component, a predetermined number of component values in descending order of posterior probability. For example, it selects a predetermined number of elements, highest posterior probability first, from the output vector elements corresponding to unit group B1, and sets in the sound metaphor data D3 the text expressions associated with the selected elements together with their ranks by posterior probability. Similarly, it selects a predetermined number of elements, highest posterior probability first, from the elements corresponding to unit group B2, and sets the associated fonts and their ranks in the sound metaphor data D3. It also selects a predetermined number of elements, highest posterior probability first, from the elements corresponding to unit group B3, and sets the associated colors and their ranks in the sound metaphor data D3.
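This ranked variant amounts to keeping the top few candidates of each unit group. A minimal sketch, assuming the posteriors of one group and its label list are passed in directly; the function name `n_best` and the tuple layout are hypothetical.

```python
import numpy as np

def n_best(posteriors, values, n):
    """Return up to n (rank, value, posterior) tuples, highest posterior first."""
    order = np.argsort(posteriors)[::-1][:n]     # indices sorted by descending posterior
    return [(rank + 1, values[i], float(posteriors[i]))
            for rank, i in enumerate(order)]
```

Calling this once per unit group yields the ranked candidate lists that an editing screen could later offer to the user as alternatives.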

In FIG. 3, the result editing unit 29 places graphic objects, parameterized by the sound metaphor components indicated by the sound metaphor data D3, in the video of the input video data D4 that accompanies the input audio data D1, and displays them on the display unit 3 (step S155). In doing so, the result editing unit 29 aligns with the time information of the input video data D4 and places the graphic object of each sound metaphor for the duration of the non-speech section (start time to end time) attached to that sound metaphor. The user views the video displayed on the display unit 3 and performs video editing work, editing the sound metaphor components (graphic objects).

FIG. 8 shows the editing screen displayed on the display unit 3. The result editing unit 29 superimposes the graphic object 102 of the sound metaphor on the video of the input video data D4 and displays the result on the video display screen 101 of the editing screen shown in the figure. The result editing unit 29 also lists, on the recognition result display screen 103, the text expression and the display attribute values that are the sound metaphor components indicated by the sound metaphor data D3 currently shown on the video display screen 101. In addition, the result editing unit 29 displays the timeline of the input video data D4 on the timeline screen 104. As in ordinary commercial video editing applications, the timeline screen 104 shows the video and audio tracks and the current playback position. Each track indicates the presence of video data or audio data. In this embodiment, the non-speech sections set in the sound metaphor data D3 and the text expressions of the sound metaphors in those sections are added to the timeline screen 104 as a new track.

While viewing the sound metaphor recognition results shown on the display unit 3, the user (operator) places the recognized sound metaphors on the screen in line with the intended direction of the program. If the text expression of a sound metaphor does not suit that intent, the user corrects the text expression on the recognition result display screen 103. The result editing unit 29 replaces the text expression of the currently displayed sound metaphor graphic object 102 with the corrected text expression entered through the input unit 4. The display attributes of a sound metaphor can be changed in the same way. For example, the user corrects display attributes such as font, color, and size on the recognition result display screen 103. The result editing unit 29 replaces the display attributes of the currently displayed sound metaphor graphic object 102 with the corrected display attributes entered through the input unit 4. If the user enters an empty text expression through the input unit 4, the result editing unit 29 does not place the graphic object on the screen.

When making a correction, the user may enter an arbitrary text expression or display attribute value through the input unit 4, or may select the corrected display attribute value from values prepared in advance by the result editing unit 29. Alternatively, when the sound metaphor data D3 contains multiple text expressions and display attribute values ordered by posterior probability, the result editing unit 29 displays them as a list on the display unit 3, and the user may select the corrected text expression or display attribute value from that list through the input unit 4.
The result editing unit 29 generates and outputs completed video data D5, in which the corrected sound metaphor is superimposed on the display screen of the input video data D4. The result editing unit 29 also outputs corrected sound metaphor data D10 representing the corrected sound metaphor, together with the acoustic, prosodic, and language feature quantities in effect when the corrected sound metaphor data D10 was obtained.
The video editing work described above is implemented as dedicated computer software, or as plug-in software for commercially available software used in broadcast program production.
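The correction behavior of the result editing unit 29 can be illustrated with a minimal Python sketch. The class `SoundMetaphor`, the function `apply_correction`, and the dictionary layout of the display attributes are hypothetical names chosen for illustration; the embodiment does not prescribe a data layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SoundMetaphor:
    """A recognized sound metaphor: a text expression plus display attributes."""
    text: str
    # Assumed layout, e.g. {"font": ..., "color": ..., "size": ...}
    attributes: dict = field(default_factory=dict)


def apply_correction(original: SoundMetaphor,
                     new_text: Optional[str] = None,
                     new_attributes: Optional[dict] = None) -> Optional[SoundMetaphor]:
    """Return the corrected sound metaphor, or None if nothing should be placed.

    new_text=None keeps the recognized text. An empty string means the user
    cleared the text, in which case no graphic object is placed on the screen.
    """
    text = original.text if new_text is None else new_text
    if text == "":
        return None
    attributes = dict(original.attributes)   # copy so the original is untouched
    if new_attributes:
        attributes.update(new_attributes)    # only the edited attributes change
    return SoundMetaphor(text=text, attributes=attributes)
```

Returning `None` for an empty text expression mirrors the rule that no graphic object is placed in that case; attributes the user did not edit keep their recognized values.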

Next, the sound metaphor recognition model learning process in the sound metaphor recognition device 1 is described.
The sound metaphor recognition model that the sound metaphor recognition unit 28 uses for recognition must be learned in advance by statistical means. For this purpose, the learning data used to train the sound metaphor recognition model is stored in the spoken language resource storage unit 50. The learning data includes audio data, text accompanying that audio data (for example, transcripts or subtitles), and information about sound metaphors.

FIG. 9 shows the sound metaphor table included in the learning data, and FIG. 10 shows the text table included in the learning data.
As shown in FIG. 9, the sound metaphor table included in the learning data associates an audio file, a text file, the start and end times of a sound metaphor, the text expression of the sound metaphor, and the value of each display attribute of the sound metaphor (font, size, color, and so on). The start and end times of a sound metaphor are expressed as times measured from the beginning of the audio file. In FIG. 9, the audio file field holds a reference to the actual audio file, and the text file field holds a reference to the actual text file containing the transcript or subtitles. Sound metaphor components are accumulated by collecting past examples of manually produced sound metaphors and compiling them into a database.

As shown in FIG. 10, the text table included in the learning data associates the reference name of the actual text file, the start and end times of an utterance, and the utterance content. The start and end times of an utterance are expressed as times measured from the beginning of the audio file. The utterance content is a word sequence segmented into words. The sound metaphor recognition model learning processing unit 5 extracts prosodic feature quantities from the portion of the audio file cut out on the basis of the utterance start and end times set in the text table.
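As a rough illustration of the two tables, the following Python sketch models one row of each and looks up the utterances that overlap a given time interval. The class and field names, the use of seconds as the time unit, and the helper `utterances_in` are assumptions made for this sketch; FIG. 9 and FIG. 10 define only which items are associated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoundMetaphorRow:
    """One row of the sound metaphor table (FIG. 9)."""
    audio_file: str   # reference to the actual audio file
    text_file: str    # reference to the transcript or subtitle file
    start: float      # time from the beginning of the audio file (seconds assumed)
    end: float
    text: str         # text expression of the sound metaphor
    font: str
    size: int
    color: str


@dataclass
class UtteranceRow:
    """One row of the text table (FIG. 10)."""
    text_file: str    # reference name of the text file
    start: float
    end: float
    words: List[str]  # utterance content as a segmented word sequence


def utterances_in(rows: List[UtteranceRow], text_file: str,
                  start: float, end: float) -> List[UtteranceRow]:
    """Return the utterances of one text file that overlap the interval.

    Utterances that only touch an endpoint of the interval are not counted.
    """
    return [r for r in rows
            if r.text_file == text_file and r.start < end and r.end > start]
```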

The sound metaphor recognition model learning processing unit 5 constructs, from the learning data, the acoustic feature quantities of each non-speech section, the language feature quantities of each language feature quantity extraction section, and the prosodic feature quantities of each prosodic feature quantity extraction section, and learns the sound metaphor recognition model. Specifically, the sound metaphor recognition model learning processing unit 5 learns the sound metaphor recognition model as follows.

The learning acoustic feature quantity extraction unit 51 reads the start and end times from the sound metaphor table of FIG. 9 stored in the spoken language resource storage unit 50 and treats each pair as a non-speech section. Referring to the sound metaphor table, the learning acoustic feature quantity extraction unit 51 then reads the audio of each non-speech section from the spoken language resource storage unit 50 as non-speech data D6. It extracts acoustic feature quantities from the non-speech data D6 by the same processing as the acoustic feature quantity extraction unit 25.

For each non-speech section read by the learning acoustic feature quantity extraction unit 51, the learning language feature quantity extraction unit 52 identifies the corresponding language feature quantity extraction section in the same way as the language feature quantity extraction unit 26. The learning language feature quantity extraction unit 52 reads text data D7, the utterance content corresponding to the identified language feature quantity extraction section, from the text table of FIG. 10 stored in the spoken language resource storage unit 50. It then extracts language feature quantities based on relative frequency from the text data D7 in the same way as the language feature quantity extraction unit 26.
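A minimal sketch of this step follows. It assumes that the extraction section is obtained by widening the non-speech section by the same margin on both sides and that the language feature is the relative frequency of each word of a fixed vocabulary; the function names, the symmetric margin, and the vocabulary are illustrative assumptions, since this passage states only that the section is longer than the non-speech section and that the features are based on relative frequency.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def extraction_section(start: float, end: float, margin: float) -> Tuple[float, float]:
    """Widen a non-speech section by `margin` on each side (clamped at time 0)."""
    return max(0.0, start - margin), end + margin


def relative_frequency_features(words: Sequence[str],
                                vocabulary: Sequence[str]) -> List[float]:
    """Relative frequency of each vocabulary word among `words`.

    Words outside the vocabulary still count toward the total, so the
    returned values need not sum to 1.
    """
    total = len(words)
    if total == 0:
        return [0.0] * len(vocabulary)
    counts = Counter(words)
    return [counts[w] / total for w in vocabulary]
```

The same feature function can be shared between recognition and learning, which matches the statement that unit 52 processes text in the same way as unit 26.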

For each non-speech section read by the learning acoustic feature quantity extraction unit 51, the learning prosodic feature quantity extraction unit 53 identifies the corresponding prosodic feature quantity extraction section in the same way as the prosodic feature quantity extraction unit 27. The learning prosodic feature quantity extraction unit 53 cuts out audio data D8 corresponding to the identified prosodic feature quantity extraction section from the audio file stored in the spoken language resource storage unit 50. It then extracts prosodic feature quantities from the audio data D8 in the same way as the prosodic feature quantity extraction unit 27.

The sound metaphor recognition model learning unit 54 learns the neural network that serves as the sound metaphor recognition model. First, the sound metaphor recognition model learning unit 54 generates an input feature quantity for learning for each non-speech section extracted by the learning acoustic feature quantity extraction unit 51. The acoustic feature quantity of the input feature quantity is set to the acoustic feature quantity of the non-speech section extracted by the learning acoustic feature quantity extraction unit 51. The language feature quantity of the input feature quantity is set to the language feature quantity that the learning language feature quantity extraction unit 52 extracted from the language feature quantity extraction section corresponding to the non-speech section. The prosodic feature quantity of the input feature quantity is set to the prosodic feature quantity that the learning prosodic feature quantity extraction unit 53 extracted from the prosodic feature quantity extraction section corresponding to the non-speech section. The sound metaphor recognition model learning unit 54 learns the event model on the basis of the input feature quantity for learning and the sound metaphor (text expression and display attributes) read from the sound metaphor table stored in the spoken language resource storage unit 50 for the non-speech section of that input feature quantity. In other words, each item of learning data for the sound metaphor recognition model takes the acoustic, language, and prosodic feature quantities corresponding to the same non-speech section as one set of inputs, and takes the sound metaphor of that non-speech section as the output. At this time, the sound metaphor recognition model learning unit 54 associates each element of the vector (one-dimensional array) output from the output layer of the fourth neural network A4 of the sound metaphor recognition model with the hash value of a sound metaphor component obtained by a hash function.
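The association between output vector elements and hashed sound metaphor components can be sketched as follows. This passage does not name the hash function or the format of a component, so the use of SHA-1, the `key=value` component strings, and the function names below are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, List


def component_hash(component: str) -> str:
    """Stable hash of one sound metaphor component, e.g. 'text=BOOM' (assumed format)."""
    return hashlib.sha1(component.encode("utf-8")).hexdigest()


def build_output_index(components: List[str]) -> Dict[str, int]:
    """Assign each distinct component hash one position in the output vector."""
    index: Dict[str, int] = {}
    for c in components:
        h = component_hash(c)
        if h not in index:
            index[h] = len(index)   # next free output element
    return index


def target_vector(components: List[str], index: Dict[str, int]) -> List[float]:
    """Target for one training pair: 1.0 at the positions of its components."""
    vec = [0.0] * len(index)
    for c in components:
        vec[index[component_hash(c)]] = 1.0
    return vec
```

With such an index, the posterior probability that the fourth neural network A4 assigns to an output element can be traced back to the text expression or display attribute value that the element stands for.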

Given N sets of learning data for the sound metaphor recognition model (pairs of an input feature quantity and a sound metaphor), the sound metaphor recognition model learning unit 54 uses these data one set at a time to learn, by error backpropagation, the connection weights between the layers of the neural network that serves as the sound metaphor recognition model. The sound metaphor recognition model learning unit 54 repeats backpropagation learning over the N sets of learning data, and judges that learning has converged at the point where the discrimination performance on validation data, prepared separately from the learning data, reaches its maximum.
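The convergence criterion can be sketched as an early-stopping loop. The callables `train_one_epoch` and `validation_score` and the `patience` parameter are hypothetical: the passage above says only that learning is judged to have converged where validation performance peaks, and waiting a few epochs for a better score is one common way to detect that peak.

```python
from typing import Callable, Tuple


def train_until_validation_peak(train_one_epoch: Callable[[], None],
                                validation_score: Callable[[], float],
                                max_epochs: int = 100,
                                patience: int = 3) -> Tuple[int, float]:
    """Run epochs until the validation score stops improving.

    Returns (best_epoch, best_score), with epochs counted from 1.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()            # one backpropagation pass over the N pairs
        score = validation_score()   # discrimination performance on held-out data
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break                    # no improvement for `patience` epochs
    return best_epoch, best_score
```

In practice the connection weights from the best epoch would also be saved so that they can be restored after the loop ends; that bookkeeping is omitted here.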

The sound metaphor recognition model learning unit 54 performs this learning, using the same learning data, for each of several neural networks that differ in the number of layers and the number of units. The sound metaphor recognition model learning unit 54 stores the neural network with the highest discrimination performance on the validation data in the sound metaphor recognition model storage unit 55 as the sound metaphor recognition model. The sound metaphor recognition unit 28 performs sound metaphor recognition using the sound metaphor recognition model stored in the sound metaphor recognition model storage unit 55. When the user changes a result of sound metaphor recognition obtained with this model, feedback learning of the sound metaphor recognition model is performed.
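Selecting among networks of different sizes amounts to keeping the candidate with the best validation score. In the sketch below, `train_and_validate` is a hypothetical callable that trains a network with the given number of layers and units on the shared learning data and returns its discrimination performance on the validation data.

```python
from typing import Callable, Iterable, Tuple


def select_architecture(candidates: Iterable[Tuple[int, int]],
                        train_and_validate: Callable[[int, int], float]
                        ) -> Tuple[Tuple[int, int], float]:
    """Return the (layers, units) pair with the highest validation score."""
    best_config, best_score = None, float("-inf")
    for layers, units in candidates:
        score = train_and_validate(layers, units)   # same learning data each time
        if score > best_score:
            best_config, best_score = (layers, units), score
    if best_config is None:
        raise ValueError("no candidate architectures given")
    return best_config, best_score
```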

When the result editing unit 29 receives a change to the text expression or display attributes of a sound metaphor through the input unit 4, it changes the text expression or display attributes set in the sound metaphor data D3 accordingly. The result editing unit 29 stores in the spoken language resource storage unit 50 the corrected sound metaphor data D10, in which the changed sound metaphor is set, together with the acoustic, prosodic, and language feature quantities from which the uncorrected sound metaphor indicated by the sound metaphor data D3 was obtained. The sound metaphor recognition model learning unit 54 uses the corrected sound metaphor data D10 and the acoustic, prosodic, and language feature quantities read from the spoken language resource storage unit 50 to train the neural network by error backpropagation, and updates the sound metaphor recognition model currently stored in the sound metaphor recognition model storage unit 55.
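The effect of one feedback update can be shown on a deliberately reduced model. The sketch below performs a single gradient step on one softmax output layer with a cross-entropy loss, which is the simplest case of error backpropagation; the multi-network structure of the actual model, the learning rate, and the treatment of the corrected sound metaphor as a single class index are simplifying assumptions.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def feedback_step(weights: List[List[float]], x: List[float],
                  corrected_class: int, lr: float = 0.1) -> List[float]:
    """One gradient step toward the user's correction; updates `weights` in place.

    weights[k][j] connects input feature j to output class k. Returns the
    posterior probabilities computed before the update.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    probs = softmax(logits)
    for k, row in enumerate(weights):
        # Cross-entropy gradient with respect to logit k is (p_k - t_k).
        delta = probs[k] - (1.0 if k == corrected_class else 0.0)
        for j in range(len(row)):
            row[j] -= lr * delta * x[j]
    return probs
```

Repeating the step with the stored feature quantities raises the posterior probability of the corrected class, which is the intended effect of feeding corrections back into the model.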

When outputting the corrected sound metaphor data D10, the result editing unit 29 may output it in association with user profile data D11 of the user who edited the sound metaphor. The user profile data D11 includes, for example, a user ID, a group ID, and a list of the sound metaphors the user has produced. The user ID is a unique number identifying the user who is the producer. The group ID is the number of a group that produces similar broadcast programs. The group ID may, for example, represent a program, or it may represent a program genre such as documentary or period drama. The list of sound metaphors produced by the user is the user's past production history. The production history holds the serial numbers of the sound metaphor data accumulated in the spoken language resource storage unit 50 and the names of the broadcast programs for which the sound metaphor data was produced.

When the corrected sound metaphor data D10 is stored in the spoken language resource storage unit 50, the user profile data D11 associated with it is stored as well. For each category obtained from the user profile data D11, the sound metaphor recognition model learning unit 54 selects the sets of corrected sound metaphor data D10 and acoustic, prosodic, and language feature quantities, and uses them for feedback learning of the sound metaphor recognition model. A category is specified by one or more of the user ID, the group ID, and the sound metaphor list. This makes it possible to learn a sound metaphor recognition model matched to the user's production tendencies and preferences (which sound metaphors the user has favored), the program genre, and so on.
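Selecting feedback data by category can be sketched as a simple filter. The record layout `FeedbackSample` and the function `select_by_category` are hypothetical, and only the user ID and group ID are handled here; selection by sound metaphor list is left out of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedbackSample:
    """One stored correction with the profile of the user who made it."""
    user_id: int
    group_id: int
    features: List[float]   # acoustic, prosodic, and language feature quantities
    corrected_text: str     # text expression after correction


def select_by_category(samples: List[FeedbackSample],
                       user_id: Optional[int] = None,
                       group_id: Optional[int] = None) -> List[FeedbackSample]:
    """Keep the samples that match every category field that is given."""
    return [s for s in samples
            if (user_id is None or s.user_id == user_id)
            and (group_id is None or s.group_id == group_id)]
```

Training one model per selected subset would yield, for example, a model tuned to a single producer or to a single program genre.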

According to the embodiment described above, the sound metaphor recognition device 1 supports sound metaphor production with pattern recognition (machine learning) technology and can provide, at low cost, a new program production method unlike conventional ones. By supporting sound metaphor production, the sound metaphor recognition device 1 can also offer people who need subtitles, such as people with hearing impairments and older people, a service that adds value to conventional subtitles.

The sound metaphor recognition device 1 described above contains a computer system. The operating steps of the sound metaphor recognition device 1 are stored in program form on a computer-readable recording medium, and the processing described above is carried out when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The “computer system” also includes a website providing environment (or display environment) when a WWW system is used.
The “computer-readable recording medium” refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The term also covers media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period, such as the volatile memory inside the computer system that acts as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS 1 ... sound metaphor recognition device, 2 ... sound metaphor recognition processing unit, 3 ... display unit, 4 ... input unit, 5 ... sound metaphor recognition model learning processing unit, 20 ... acoustic model storage unit, 21 ... language model storage unit, 22 ... speech section detection unit, 23 ... speech recognition unit, 24 ... non-speech section detection unit, 25 ... acoustic feature quantity extraction unit, 26 ... language feature quantity extraction unit, 27 ... prosodic feature quantity extraction unit, 28 ... sound metaphor recognition unit, 29 ... result editing unit, 50 ... spoken language resource storage unit, 51 ... learning acoustic feature quantity extraction unit, 52 ... learning language feature quantity extraction unit, 53 ... learning prosodic feature quantity extraction unit, 54 ... sound metaphor recognition model learning unit, 55 ... sound metaphor recognition model storage unit

Claims

A sound metaphor recognition device comprising:
a non-speech section detection unit that detects a non-speech section in audio data by collating the audio data with a statistical acoustic model for non-speech section detection;
an acoustic feature quantity extraction unit that extracts an acoustic feature quantity from the audio data in the non-speech section detected by the non-speech section detection unit;
a language feature quantity extraction unit that identifies a language feature quantity extraction section, which contains the non-speech section detected by the non-speech section detection unit and is longer than the non-speech section by a predetermined amount, and extracts a language feature quantity from utterance content data corresponding to the audio data of the identified language feature quantity extraction section;
a prosodic feature quantity extraction unit that identifies a prosodic feature quantity extraction section, which contains the non-speech section detected by the non-speech section detection unit and is longer than the non-speech section by a predetermined amount, and extracts a prosodic feature quantity from the audio data of the identified prosodic feature quantity extraction section; and
a sound metaphor recognition unit that uses a statistically learned sound metaphor recognition model, which takes an acoustic feature quantity, a language feature quantity, and a prosodic feature quantity as input and yields a sound metaphor consisting of a text expression of non-speech and display attributes of the text expression, to calculate the posterior probability of a sound metaphor from the acoustic feature quantity extracted by the acoustic feature quantity extraction unit, the language feature quantity extracted by the language feature quantity extraction unit, and the prosodic feature quantity extracted by the prosodic feature quantity extraction unit, and outputs data of a sound metaphor selected on the basis of the calculated posterior probability.

The sound metaphor recognition device according to claim 1, wherein the display attributes are at least one of the font, size, color, and display motion of the characters representing the text expression.

The sound metaphor recognition device according to claim 1, wherein the sound metaphor recognition model comprises:
a first neural network that receives, in time order, the acoustic feature quantity obtained from each of the frames into which the audio data of the non-speech section is divided, and outputs an acoustic feature quantity obtained by converting the input acoustic feature quantity to a lower dimension than the input;
a second neural network that receives the prosodic feature quantity of the prosodic feature quantity extraction section and outputs a prosodic feature quantity obtained by converting the input prosodic feature quantity to a lower dimension than the input;
a third neural network that receives the language feature quantity of the language feature quantity extraction section and outputs a language feature quantity obtained by converting the input language feature quantity to a lower dimension than the input; and
a fourth neural network that receives the acoustic feature quantity output by the first neural network, the prosodic feature quantity output by the second neural network, and the language feature quantity output by the third neural network, and outputs the posterior probability of each text expression and each display attribute of the sound metaphor;
the acoustic feature quantity extraction unit extracts an acoustic feature quantity from each of the frames into which the audio data of the non-speech section detected by the non-speech section detection unit is divided; and
the sound metaphor recognition unit inputs, in time order, the acoustic feature quantities extracted from the frames by the acoustic feature quantity extraction unit to the first neural network, inputs the prosodic feature quantity extracted by the prosodic feature quantity extraction unit to the second neural network, inputs the language feature quantity extracted by the language feature quantity extraction unit to the third neural network, and calculates the posterior probabilities of the text expression and the display attributes of the sound metaphor as the output of the fourth neural network.

The sound metaphor recognition device according to any one of claims 1 to 3, further comprising a result editing unit that generates video data in which the sound metaphor selected by the sound metaphor recognition unit is displayed superimposed on video data corresponding to the audio data.

The sound metaphor recognition device according to claim 4, wherein the result editing unit receives a correction instruction for the sound metaphor selected by the sound metaphor recognition unit and generates video data in which the sound metaphor corrected on the basis of the correction instruction is displayed superimposed on the video data corresponding to the audio data,
the device further comprising a sound metaphor recognition model learning unit that updates the sound metaphor recognition model on the basis of the corrected sound metaphor and the acoustic feature quantity, the prosodic feature quantity, and the language feature quantity that were input to the sound metaphor recognition model when the sound metaphor was obtained.

The sound metaphor recognition device according to any one of claims 1 to 5, further comprising:
a speech section detection unit that detects a speech section in the audio data by collating the audio data with a statistical acoustic model for speech section detection; and
a speech recognition unit that performs speech recognition on the audio data in the speech section detected by the speech section detection unit and outputs utterance content data obtained as the result of the speech recognition;
wherein the language feature quantity extraction unit extracts the language feature quantity of the language feature quantity extraction section from the utterance content data output by the speech recognition unit.

A program for causing a computer to function as a sound metaphor recognition device comprising:
non-speech section detection means for detecting a non-speech section in audio data by collating the audio data with a statistical acoustic model for non-speech section detection;
acoustic feature quantity extraction means for extracting an acoustic feature quantity from the audio data in the non-speech section detected by the non-speech section detection means;
language feature quantity extraction means for identifying a language feature quantity extraction section, which contains the non-speech section detected by the non-speech section detection means and is longer than the non-speech section by a predetermined amount, and extracting a language feature quantity from utterance content data corresponding to the audio data of the identified language feature quantity extraction section;
prosodic feature quantity extraction means for identifying a prosodic feature quantity extraction section, which contains the non-speech section detected by the non-speech section detection means and is longer than the non-speech section by a predetermined amount, and extracting a prosodic feature quantity from the audio data of the identified prosodic feature quantity extraction section; and
sound metaphor recognition means for using a statistically learned sound metaphor recognition model, which takes an acoustic feature quantity, a language feature quantity, and a prosodic feature quantity as input and yields a sound metaphor consisting of a text expression of non-speech and display attributes of the text expression, to calculate the posterior probability of a sound metaphor from the acoustic feature quantity extracted by the acoustic feature quantity extraction means, the language feature quantity extracted by the language feature quantity extraction means, and the prosodic feature quantity extracted by the prosodic feature quantity extraction means, and outputting data of a sound metaphor selected on the basis of the calculated posterior probability.