JP2011250100A

JP2011250100A - Image processing system and method, and program

Info

Publication number: JP2011250100A
Application number: JP2010120726A
Authority: JP
Inventors: Tatsuki Sakaguchi; 竜己坂口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2010-05-26
Filing date: 2010-05-26
Publication date: 2011-12-08

Abstract

PROBLEM TO BE SOLVED: To visually present environmental sound to a user more effectively. SOLUTION: An analysis unit 24 performs blind sound source separation on the audio data of the sound constituting a content to extract the audio data of each sound source, and generates direction data indicating the direction of each sound source based on that audio data. The analysis unit 24 also determines whether the sound of each sound source is an environmental sound, that is, a sound other than a human voice, and converts the environmental sound into text. A visual information generation unit 25 generates an effect image for visually presenting the environmental sound based on the text-converted environmental sound. An image synthesis unit 26 overlays the effect image on the content image constituting the content, at a position determined by the direction data. The present invention can be applied to a video reproduction device.

Description

The present invention relates to an image processing apparatus, an image processing method, and a program, and more particularly to an image processing apparatus, method, and program that can visually present environmental sounds to a user more effectively.

Commercial packaged media such as DVD (Digital Versatile Disc) and BD (Blu-ray (registered trademark) Disc) can display text information and image information at a desired position on the content image, but they give no particular consideration to ambient environmental sounds, as distinct from human speech.

For example, for hearing-impaired viewers, dramatic effects that rely on environmental sounds, such as a door opening or closing, an approaching car, or a ringing telephone, are meaningless when viewing content. Moreover, for personally shot content that never goes through an authoring step, or for programs broadcast live, it is difficult to add such environmental sound effects in the first place.

In the United States, where closed captions are mandatory, closed captions are sometimes created by a person who listens to the content audio and types it. In such cases the user can read the transcribed characters as visual information, but closed captions cannot express which object or person on the screen, and at which position, is emitting an environmental sound or voice.

As a technique using speech recognition, a device has also been proposed that converts recognized speech into text and displays it on a head-mounted display for persons with disabilities (see, for example, Patent Document 1). Because this head-mounted display also shows the approximate position of the sound source of the transcribed speech, the user can tell from which direction the sound is coming.

Patent Document 1: JP 2007-334149 A

However, the techniques described above cannot effectively and visually present to the user the environmental sounds contained in arbitrary content.

For example, even if a user views content while wearing a head-mounted display that shows surrounding speech as text, the text of the recognized speech is simply displayed at the center of the display. The displayed text therefore has no sense of unity with the content, and the sound cannot be said to be presented effectively.

The present invention has been made in view of such circumstances, and makes it possible to visually present environmental sounds to a user more effectively.

An image processing apparatus according to one aspect of the present invention includes: sound source direction estimating means for estimating, based on audio data of sound constituting content, the direction of a sound source of the sound relative to a predetermined reference position; separating means for separating the audio data into audio data of environmental sound, which is the sound excluding speech sound uttered by a person, and audio data of the speech sound; environmental sound identifying means for performing speech recognition processing on the audio data of the environmental sound to convert the environmental sound into text; and image synthesizing means for synthesizing effect data of an effect image, which visually presents the environmental sound and is generated based on the text-converted environmental sound, with image data of an image constituting the content, such that the effect image is displayed on the image at a position determined by the estimation result of the direction of the sound source.

The image processing apparatus may further include sound source distance estimating means for estimating, based on the audio data of the sound, the distance from the reference position to the sound source, and the image synthesizing means may synthesize the effect data and the image data such that the effect image is displayed on the image at a position determined by the estimation result of the direction of the sound source and with a size determined by the estimation result of the distance to the sound source.

The image processing apparatus may further include sound source separating means for performing blind sound source separation processing on the audio data of the sound to separate the audio data into audio data of the sound of each sound source, and the separating means may separate the audio data of the environmental sound from the audio data of the speech sound by determining, for each piece of audio data separated by the sound source separating means, whether that audio data is audio data of the environmental sound.

The image synthesizing means may synthesize the effect data and the image data such that the size, color, or luminance of the effect image displayed on the image changes according to at least one of the direction of the sound source, the distance to the sound source, and the volume of the environmental sound.

The image processing apparatus may further include visual information generating means for generating the effect data based on the text-converted environmental sound such that supplementary information, which is a character string predetermined for the text-converted environmental sound and supplementing the content of the environmental sound, is displayed in the effect image.

The image processing apparatus may further include speech sound identifying means for performing speech recognition processing on the audio data of the speech sound separated by the separating means to convert the speech sound into text, and the image synthesizing means may synthesize the effect data and the image data such that the effect image of the environmental sound and an effect image that visually presents the speech sound are displayed on the image.

An image processing method or program according to one aspect of the present invention includes the steps of: estimating, based on audio data of sound constituting content, the direction of a sound source of the sound relative to a predetermined reference position; separating the audio data into audio data of environmental sound, which is the sound excluding speech sound uttered by a person, and audio data of the speech sound; performing speech recognition processing on the audio data of the environmental sound to convert the environmental sound into text; and synthesizing effect data of an effect image, which visually presents the environmental sound and is generated based on the text-converted environmental sound, with image data of an image constituting the content, such that the effect image is displayed on the image at a position determined by the estimation result of the direction of the sound source.

In one aspect of the present invention, the direction of a sound source of sound constituting content, relative to a predetermined reference position, is estimated based on audio data of the sound; the audio data is separated into audio data of environmental sound, which is the sound excluding speech sound uttered by a person, and audio data of the speech sound; speech recognition processing is performed on the audio data of the environmental sound to convert the environmental sound into text; and effect data of an effect image, which visually presents the environmental sound and is generated based on the text-converted environmental sound, is synthesized with image data of an image constituting the content, such that the effect image is displayed on the image at a position determined by the estimation result of the direction of the sound source.

According to one aspect of the present invention, environmental sounds can be visually presented to a user more effectively.

A diagram showing a configuration example of an embodiment of an image processing apparatus to which the present invention is applied.
A diagram showing a configuration example of the analysis unit.
A flowchart explaining content reproduction processing.
A diagram showing a display example of a synthesized content image.
A diagram showing a display example of a synthesized content image.
A flowchart explaining analysis processing.
A diagram showing another configuration example of the analysis unit.
A flowchart explaining analysis processing.
A block diagram showing a configuration example of a computer.

Embodiments to which the present invention is applied will be described below with reference to the drawings.

<First Embodiment>
[Configuration of image processing apparatus]
FIG. 1 is a diagram showing a configuration example of an embodiment of an image processing apparatus to which the present invention is applied.

The image processing apparatus 11 reads the content data of content from a recording medium 12, such as an optical disc, loaded into the image processing apparatus 11, and reproduces it. For example, the image processing apparatus 11 is a television receiver, a video playback device, a personal computer, or the like, and realizes functions to be implemented in, for example, a video playback application program.

In particular, the image processing apparatus 11 aims to help hearing-impaired users view content by converting the sound effects used for dramatic effect into image information and overlaying that image information at an appropriate position, without requiring extra effort when the content is authored.

The content data consists of moving image data for displaying a moving image as the content (hereinafter also referred to as a content image) and audio data of the sound accompanying that moving image. The moving image data and the audio data are encoded by a predetermined method.

The image processing apparatus 11 includes a reading unit 21, a video decoder 22, an audio decoder 23, an analysis unit 24, a visual information generation unit 25, an image synthesis unit 26, and a display unit 27.

The reading unit 21 reads content data from the recording medium 12, supplies the moving image data constituting the content data to the video decoder 22, and supplies the audio data constituting the content data to the audio decoder 23.

The video decoder 22 decodes the moving image data supplied from the reading unit 21 and supplies it to the image synthesis unit 26. The audio decoder 23 decodes the audio data supplied from the reading unit 21 and supplies it to the analysis unit 24 and the display unit 27.

The analysis unit 24 performs analysis processing on the audio data supplied from the audio decoder 23, generates direction data indicating the direction of the sound source of the sound reproduced from the audio data and distance data indicating the distance to that sound source, and supplies them to the image synthesis unit 26. Here, the direction and distance of the sound source are taken relative to the sound collection unit, such as a microphone, that picked up the sound.

In addition to sound picked up directly by the sound collection unit, the sound constituting the content may include sound, such as sound effects, that was added to (mixed into) that sound afterwards. Such added sound is treated as if it had been picked up by the sound collection unit. That is, the direction and distance of the sound source of sound such as sound effects are also estimated with the sound collection unit as the reference.

The analysis unit 24 also performs speech recognition processing on the audio data supplied from the audio decoder 23 and supplies a word string indicating the result of that processing to the visual information generation unit 25. In other words, the analysis unit 24 converts the sound reproduced from the audio data into text. For example, the word strings resulting from the speech recognition processing include not only word strings representing recognized human voices, such as the utterance "Hello", but also word strings representing recognized environmental sounds emitted around the sound collection unit, such as "pii-poo pii-poo" for the siren of an ambulance.

In the following, a voice emitted by a person, such as an utterance or a vocal imitation of a sound, is referred to as a speech sound, and all other surrounding sounds picked up when the sound constituting the content was collected are referred to as environmental sounds. Text data indicating a word string obtained by speech recognition processing on a speech sound is referred to as speech sound data, and text data indicating a word string obtained by speech recognition processing on an environmental sound is referred to as environmental sound data. The visual information generation unit 25 is therefore supplied with both speech sound data and environmental sound data.

The visual information generation unit 25 uses the speech sound data and the environmental sound data supplied from the analysis unit 24 to generate effect data that visually presents the word strings indicated by those data, and supplies the effect data to the image synthesis unit 26.

For example, the effect data is image data such as a text image obtained by converting a speech sound or environmental sound into text, or an image such as an illustration or texture predetermined for the speech recognition result of a speech sound or environmental sound. In the following, an image displayed from effect data is referred to as an effect image.

An image such as an illustration used as an effect image may be either a moving image or a still image, as long as it evokes the content of the utterance or of the environmental sound. A text image or illustration used as an effect image may also have a display effect in which its display format changes over time. In that case, for example, the color or position of the characters displayed in the effect image changes over time.

The image synthesis unit 26 uses the direction data and the distance data supplied from the analysis unit 24 to synthesize the moving image data supplied from the video decoder 22 with the effect data supplied from the visual information generation unit 25, and supplies the resulting moving image data to the display unit 27. For example, the moving image data and the effect data are synthesized such that the effect image is overlaid on the content image at a position determined by the direction data and with a size determined by the distance data. In the following, a content image on which an effect image has been overlaid is referred to as a synthesized content image, and the moving image data of the synthesized content image is referred to as synthesized moving image data.

The display unit 27 includes, for example, a liquid crystal display and a speaker. It displays the synthesized content image based on the synthesized moving image data supplied from the image synthesis unit 26 and outputs sound based on the audio data supplied from the audio decoder 23.

[Configuration of analysis unit]
In more detail, the analysis unit 24 in FIG. 1 is configured as shown in FIG. 2.

That is, the analysis unit 24 includes a sound source separation unit 51, a sound source direction estimation unit 52, a sound source distance estimation unit 53, an environmental sound/speech sound discrimination unit 54, an environmental sound identification unit 55, and an utterance content identification unit 56. The sound source separation unit 51 is supplied with audio data from the audio decoder 23.

The sound source separation unit 51 performs blind sound source separation processing based on independent component analysis on the audio data supplied from the audio decoder 23, extracts from it the audio data of the sound of each of one or more sound sources, and supplies the extracted audio data to the sound source direction estimation unit 52, the sound source distance estimation unit 53, and the environmental sound/speech sound discrimination unit 54.

For example, the sound of the audio data constituting the content data is a mixture of sounds emitted from one or more sound sources, such as a person speaking or an ambulance sounding its siren. When blind sound source separation processing is performed on the audio data of the content, audio data of the sound emitted from each sound source contained in the content sound is obtained. In the following, the sound from a single sound source is referred to as an individual sound, and the audio data of an individual sound is referred to as individual audio data.

The sound source direction estimation unit 52 performs sound source direction estimation processing on each piece of individual audio data supplied from the sound source separation unit 51, generates direction data indicating the direction of the sound source that emits the individual sound, and supplies it to the image synthesis unit 26. The sound source distance estimation unit 53 performs sound source distance estimation processing on each piece of individual audio data supplied from the sound source separation unit 51, generates distance data indicating the distance to the sound source that emits the individual sound, and supplies it to the image synthesis unit 26.

For each piece of individual audio data supplied from the sound source separation unit 51, the environmental sound/speech sound discrimination unit 54 determines whether the individual sound based on that data is a speech sound or an environmental sound, and switches the output destination of the individual audio data according to the result. That is, the environmental sound/speech sound discrimination unit 54 supplies the individual audio data of environmental sounds to the environmental sound identification unit 55 and the individual audio data of speech sounds to the utterance content identification unit 56.

The environmental sound identification unit 55 performs speech recognition processing on the individual audio data supplied from the environmental sound/speech sound discrimination unit 54 and supplies environmental sound data indicating the result to the visual information generation unit 25. The utterance content identification unit 56 performs speech recognition processing on the individual audio data supplied from the environmental sound/speech sound discrimination unit 54 and supplies speech sound data indicating the result to the visual information generation unit 25.
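The routing performed by the analysis unit 24 can be sketched as follows. This is only an illustrative sketch: the `IndividualSound` type, the function name, and the three callables standing in for units 54, 55, and 56 are hypothetical, and the source does not say how the discrimination or recognition is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class IndividualSound:
    """One separated sound source (hypothetical container)."""
    source_id: int
    samples: List[float]


def route_individual_sounds(
    sounds: List[IndividualSound],
    is_speech: Callable[[IndividualSound], bool],
    recognize_environmental: Callable[[IndividualSound], str],
    recognize_speech: Callable[[IndividualSound], str],
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Send each individual sound to the recognizer matching its kind.

    Returns (environmental sound data, speech sound data), each a list of
    (source_id, recognized text) pairs.
    """
    environmental_sound_data = []
    speech_sound_data = []
    for sound in sounds:
        if is_speech(sound):
            # Role of the utterance content identification unit 56.
            speech_sound_data.append((sound.source_id, recognize_speech(sound)))
        else:
            # Role of the environmental sound identification unit 55.
            environmental_sound_data.append(
                (sound.source_id, recognize_environmental(sound))
            )
    return environmental_sound_data, speech_sound_data
```

Passing the discriminator and recognizers as callables keeps the sketch independent of any particular classifier or speech recognition engine.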

[Description of content reproduction processing]
When the user loads the recording medium 12 on which content is recorded into the image processing apparatus 11 and operates the image processing apparatus 11 to instruct reproduction of the content, the image processing apparatus 11 starts content reproduction processing, in which it reads the content from the recording medium 12 and reproduces it.

The content reproduction processing performed by the image processing apparatus 11 will be described below with reference to the flowchart of FIG. 3.

In step S11, the reading unit 21 reads from the recording medium 12 the content data of the content whose reproduction the user has instructed. The reading unit 21 then supplies the moving image data and the audio data of the read content data to the video decoder 22 and the audio decoder 23, respectively.

In step S12, the video decoder 22 decodes the moving image data supplied from the reading unit 21 and supplies it to the image synthesis unit 26. In step S13, the audio decoder 23 decodes the audio data supplied from the reading unit 21 and supplies it to the analysis unit 24 and the display unit 27.

In step S14, the analysis unit 24 performs analysis processing and generates direction data, distance data, environmental sound data, and speech sound data from the audio data supplied from the audio decoder 23. The generated direction data and distance data are supplied to the image synthesis unit 26, and the environmental sound data and speech sound data are supplied to the visual information generation unit 25. The analysis processing is described in detail later.

In step S15, the visual information generation unit 25 generates effect data using the environmental sound data and the speech sound data supplied from the analysis unit 24, and supplies the effect data to the image synthesis unit 26. The effect data is generated for each individual sound, such as each speech sound and each environmental sound.

For example, the environmental sound data contains the word string obtained by the speech recognition processing and supplementary information about that word string. The supplementary information supplements the text-converted individual sound so that a user viewing the content can grasp in more detail, and more accurately, what is happening in the content with respect to that individual sound.

Specifically, suppose that speech recognition processing on the individual audio data of an environmental sound yields the word string "pii-poo pii-poo", which represents the siren of an ambulance. A predetermined character string "(ambulance siren)" is associated with this word string as supplementary information, and the environmental sound identification unit 55 outputs environmental sound data consisting of the word string "pii-poo pii-poo" and the supplementary information "(ambulance siren)".

When such environmental sound data is supplied, the visual information generation unit 25 generates, as effect data, the image data of an effect image that displays, for example, the characters of the word string "pii-poo pii-poo" and the characters of the supplementary information "(ambulance siren)". Generating effect data in this way, as image data of an effect image that contains both the word string obtained by converting the individual sound into text and the supplementary information for that word string, lets a user viewing the content grasp what is happening in it more accurately.
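The association between a recognized word string and its supplementary information can be sketched as a simple lookup. The table, the function names, and the dictionary layout below are hypothetical, and the siren onomatopoeia is romanized here as "pii-poo pii-poo" purely for illustration; the source only states that a predetermined character string is associated with the word string.

```python
from typing import Dict, List, Optional

# Hypothetical lookup table: recognized word string -> supplementary information.
SUPPLEMENTARY_INFO: Dict[str, str] = {
    "pii-poo pii-poo": "(ambulance siren)",
}


def build_environmental_sound_data(word_string: str) -> Dict[str, Optional[str]]:
    """Bundle a recognized word string with its supplementary information, if any."""
    return {
        "word_string": word_string,
        "supplementary_info": SUPPLEMENTARY_INFO.get(word_string),
    }


def effect_text_lines(environmental_sound_data: Dict[str, Optional[str]]) -> List[str]:
    """Return the text lines that an effect image would display."""
    lines = [environmental_sound_data["word_string"]]
    info = environmental_sound_data["supplementary_info"]
    if info is not None:
        # Only show supplementary information when an entry is registered.
        lines.append(info)
    return lines
```

A word string with no registered entry simply produces an effect image containing the word string alone.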

In step S16, the image synthesis unit 26 uses the direction data and the distance data from the analysis unit 24 to synthesize the moving image data from the video decoder 22 with the effect data from the visual information generation unit 25. The image synthesis unit 26 then supplies the synthesized moving image data of the resulting synthesized content image to the display unit 27.

For example, suppose that the horn of a passenger car is picked up by the sound collection unit as an individual sound, and that the position of the sound source indicated by the direction data of that individual sound is to the front left of the sound collection unit. In this case, the image synthesis unit 26 synthesizes the effect image with the content image such that the effect image of that individual sound (the horn) is displayed toward the back at the upper left of the content image, that is, toward the back at the upper left as seen by a user viewing the content image from the front.

At this time, the image composition unit 26 adjusts the size of the effect image combined with the content image according to the distance indicated by the distance data of the individual sound. Specifically, the longer the distance from the sound collection unit to the sound source, the smaller the effect image is displayed.

Note that when the content image and the effect image are combined, the imaging unit that captures the content image and the sound collection unit that collects the sound of the content are assumed to be at substantially the same position. The effect image of each individual sound is overlaid at the position on the content image determined by the direction data of that individual sound, in a size determined by the distance data. That is, the effect image of the individual sound from each sound source is displayed near that sound source on the content image.
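As a rough illustration of the overlay rule above, the following Python sketch maps a direction and a distance to a pixel position and a scale factor. The pinhole-style projection, the field-of-view values, the sign convention (negative azimuth to the left, positive elevation upward), the inverse-distance scaling, and every name in it are assumptions made for illustration. The description itself only states that the position follows the direction data and that the size shrinks with distance.

```python
import math

def effect_placement(azimuth_deg, elevation_deg, distance_m,
                     width=1920, height=1080,
                     h_fov_deg=60.0, v_fov_deg=34.0,
                     ref_distance_m=1.0):
    """Return (x, y, scale) for overlaying an effect image.

    Hypothetical helper: assumes the camera and the sound collection unit
    share one position, as the description does.
    """
    half_w, half_h = width / 2, height / 2
    # Project the direction onto the image plane; 0 degrees is the image centre.
    x = half_w * (1 + math.tan(math.radians(azimuth_deg))
                  / math.tan(math.radians(h_fov_deg / 2)))
    y = half_h * (1 - math.tan(math.radians(elevation_deg))
                  / math.tan(math.radians(v_fov_deg / 2)))
    # Keep the anchor point inside the frame.
    x = min(max(x, 0.0), float(width))
    y = min(max(y, 0.0), float(height))
    # Farther sources get a smaller effect image (inverse distance is an assumption).
    scale = ref_distance_m / max(distance_m, ref_distance_m)
    return x, y, scale
```

Under these assumptions, a source straight ahead at the reference distance lands at the image centre with scale 1, and a source to the upper left lands in the upper-left quadrant.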

The size of the effect image displayed on the content image may also be changed according to the volume of the individual sound for that effect image. In that case, for example, the sound source separation unit 51 supplies information indicating the volume of each individual sound to the image composition unit 26, and the image composition unit 26 combines the effect image, based on the supplied volume information, so that the effect image becomes larger as the volume becomes higher.
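The volume-dependent sizing could be sketched as follows. Using the RMS level as the "information indicating the volume", the reference level, and the clamping range are all assumptions; the description does not say how the volume is measured or how it maps to size.

```python
import math

def rms(samples):
    """Root-mean-square level of one frame of an individual sound."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def volume_scale(samples, ref_rms=0.1, min_scale=0.5, max_scale=2.0):
    """Scale factor that grows with volume, clamped to a displayable range.

    ref_rms, min_scale and max_scale are hypothetical tuning values.
    """
    return min(max(rms(samples) / ref_rms, min_scale), max_scale)
```

A louder frame yields a larger factor until the upper clamp is reached, which keeps very loud sounds from covering the whole content image.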

Furthermore, based on the direction data and distance data of an individual sound, the image composition unit 26 may change the color, brightness, or the like of the effect image of that individual sound according to the direction of and distance to its sound source.

Furthermore, when the audio data constituting the content data is multi-channel audio such as 5.1 channel, for example, an individual sound may be heard from behind the user; that is, the sound source of the individual sound may be located behind the sound collection unit as seen from the sound collection unit. In such a case, the effect image of that individual sound cannot be displayed at a corresponding position on the content image.

In such a case, therefore, the image composition unit 26 displays the effect image near an edge of the content image. The image composition unit 26 also displays, together with the effect image, an arrow symbol or supplementary information indicating that the sound comes from behind, so that the user viewing the content can tell that the effect image represents an individual sound coming from behind the user. This allows the user to know the sound source position of the individual sound indicated by the effect image more reliably and accurately.
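The handling of rear sources might look like the sketch below. The azimuth convention (0 degrees straight ahead, negative to the left, magnitudes above 90 degrees meaning "behind"), the choice of the left or right edge, the ASCII arrows, and the wording of the note are assumptions; the description only requires an edge position plus an arrow or a note saying that the sound comes from behind.

```python
def rear_aware_overlay(azimuth_deg, text, width=1920, margin=40, h_fov_deg=60.0):
    """Return (x, label) for an effect image, handling sources behind the viewer.

    Hypothetical helper; only the horizontal coordinate is computed.
    """
    half_fov = h_fov_deg / 2
    if abs(azimuth_deg) > 90:
        # Source is behind: pin the effect to the nearer side edge and flag it.
        on_left = azimuth_deg < 0
        x = margin if on_left else width - margin
        arrow = "<-" if on_left else "->"
        return x, f"{arrow} {text} (from behind)"
    # Source is in front: map the azimuth linearly across the field of view.
    clamped = max(-half_fov, min(half_fov, azimuth_deg))
    x = round(width / 2 * (1 + clamped / half_fov))
    return max(margin, min(width - margin, x)), text
```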

After generating the composite moving image data of the composite content image, the image composition unit 26 supplies that data to the display unit 27, and the process proceeds from step S16 to step S17.

In step S17, the display unit 27 reproduces the content by displaying the composite content image based on the composite moving image data from the image composition unit 26 and outputting sound based on the audio data from the audio decoder 23.

As a result, a composite content image such as those shown in FIG. 4 and FIG. 5 is displayed on the display unit 27.

For example, in FIG. 4, an ambulance is displayed on the left side of the composite content image C11, and an effect image EF11 for the siren sound (an environmental sound) emitted by the ambulance as the sound source is displayed below the ambulance. The effect image EF11 shows the characters "pii-poo pii-poo" (ピーポーピーポー), which are the ambulance siren converted into text, and the characters "(ambulance siren)" as supplementary information for the siren.

The displayed effect image EF11 moves together with the ambulance, which is the sound source. For example, when the ambulance moves toward the back of the screen and is displayed smaller, and the siren becomes quieter, the effect image EF11 is also displayed smaller in accordance with the change in the position of the ambulance and the volume of the siren.

In FIG. 5, an explosion occurs at approximately the center of the composite content image C12, and an effect image EF12 for the explosion sound is displayed in the lower part of the figure.

The effect image EF12 shows the characters "dokaaaaan" (ドカアアアアン), obtained by converting the explosion sound into text, rendered as a decorative-lettering texture. For example, when the volume of the explosion sound, which is an environmental sound, gradually decreases, the effect image EF12 is displayed so that it shrinks over time in accordance with the change in volume.

In the image processing apparatus 11, the processing of steps S11 through S17 described above is repeated for each predetermined unit, such as a frame of the content image, so the position and size of the effect image also change in accordance with movement of the sound source and changes in the volume of the sound.

Returning to the flowchart of FIG. 3, in step S18 the image processing apparatus 11 determines whether to end reproduction of the content. For example, when the user operates the image processing apparatus 11 and instructs it to end reproduction of the content, it is determined that reproduction is to end.

If it is determined in step S18 that reproduction is not to end, the process returns to step S11 and the processing described above is repeated. That is, the next frame of the content is read and reproduced.

If, on the other hand, it is determined in step S18 that reproduction of the content is to end, the image processing apparatus 11 ends reproduction of the content, and the content reproduction process ends.

In this way, the image processing apparatus 11 separates the sound of the content by sound source, converts each individual sound into text by speech recognition, and generates an effect image containing the resulting characters (word string) and supplementary information. The image processing apparatus 11 then determines, for each individual sound, the display position, size, color, and the like of the effect image according to the direction of and distance to the sound source of that individual sound, and overlays the effect image on the content image.

Therefore, the image processing apparatus 11 can display an effect image near the sound source of each individual sound on the content image. As a result, compared with simply displaying closed captions, or displaying text-converted sound together with the sound source position, the effect image can be given a sense of unity with the content image, and sounds such as environmental sounds can be presented visually to the user more effectively.

In particular, because the image processing apparatus 11 displays effect images for environmental sounds as well as speech sounds, the auditory information perceivable by hearing-impaired people, which was conventionally limited to captions for speech sounds, can be extended to environmental sounds. This lets the user pick up even the intent of the content producer and fully enjoy viewing the content.

In addition, because supplementary information is displayed together with the text-converted environmental sound as necessary, the user can grasp the content more accurately and enjoy viewing it even more.

Furthermore, because the image processing apparatus 11 generates effect images by analyzing the audio data of the content, effect images can be displayed at reproduction time even for content that has no captions in the first place, such as personal video shot with a camcorder or live broadcast programs.

[Description of analysis processing]
Next, the analysis processing corresponding to step S14 of FIG. 3 will be described with reference to the flowchart of FIG. 6.

In step S41, the sound source separation unit 51 performs blind sound source separation processing based on independent component analysis on the audio data supplied from the audio decoder 23, and extracts the audio data of each individual sound from that audio data.

For example, suppose that the audio data constituting the content data consists of audio data for two channels, the R (right) channel and the L (left) channel. In this case, the sound source separation unit 51 applies a Fourier transform to the audio data of the R and L channels, converting the audio data into frequency information made up of frequency components. This frequency information indicates the power of each frequency component of the sound.

Based on the frequency information, the sound source separation unit 51 then divides the entire frequency band indicated by the frequency information into a plurality of frequency bands, and generates frequency-division spectrum components indicating the power of the sound at each frequency within each divided frequency band. A frequency-division spectrum component is generated for each divided frequency band for each of the R and L channels.

Furthermore, for the frequency-division spectrum components of the same frequency band in the R and L channels, the sound source separation unit 51 calculates the ratio of the power at each frequency, and selects those frequency-division spectrum components for which the calculated ratio equals a predetermined value. The sound made up of the frequency-division spectrum components selected in this way is taken to be the individual sound to be extracted.

The sound source separation unit 51 applies an inverse Fourier transform to the selected frequency-division spectrum components of the R and L channels, and uses the resulting audio data of the R and L channels as the R-channel and L-channel audio data of the individual sound.

Note that the value of the power ratio used to extract an individual sound is predetermined by the proportion in which the level of that individual sound is distributed between the audio data of the R and L channels. Blind sound source separation processing is described in detail in, for example, Japanese Unexamined Patent Application Publication No. 2008-104240.
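The power-ratio selection described above can be sketched for a single short frame as follows. This is a simplified illustration, not the method of the cited publication: it works bin by bin rather than on grouped frequency bands, uses a naive DFT, takes the ratio as L power over R power, and accepts ratios within a relative tolerance instead of an exact match. All names and default values are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short illustrative frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def extract_by_power_ratio(left, right, target_ratio, tolerance=0.25, floor=1e-12):
    """Keep only bins whose L/R power ratio is close to target_ratio.

    Returns the (left, right) audio of the extracted individual sound.
    """
    spec_l, spec_r = dft(left), dft(right)
    keep = []
    for l, r in zip(spec_l, spec_r):
        p_l, p_r = abs(l) ** 2, abs(r) ** 2
        if p_l + p_r < floor:
            keep.append(False)  # ignore bins that are effectively silent
            continue
        ratio = p_l / (p_r + floor)
        keep.append(abs(ratio - target_ratio) <= tolerance * target_ratio)
    sel_l = [l if k else 0j for l, k in zip(spec_l, keep)]
    sel_r = [r if k else 0j for r, k in zip(spec_r, keep)]
    return idft(sel_l), idft(sel_r)
```

For a mixture of two tones panned differently between the channels, choosing the power ratio of one tone returns that tone alone in both output channels.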

Having separated the audio data of the content into the audio data of the individual sounds, the sound source separation unit 51 supplies the audio data of those individual sounds to the sound source direction estimation unit 52, the sound source distance estimation unit 53, and the environmental sound/speech sound identification unit 54.

By performing blind sound source separation processing in this way and extracting the audio data of each individual sound from the audio data, the direction of and distance to the sound source of each individual sound can be obtained more accurately, and the effect image can be displayed at a position closer to the sound source.

In step S42, the sound source direction estimation unit 52 performs sound source direction estimation processing on the audio data of each individual sound supplied from the sound source separation unit 51, and estimates the direction of the sound source of each individual sound.

For example, when the audio data constituting the content data consists of audio data for the R and L channels, the sound source direction estimation unit 52 applies a Fourier transform to the audio data of the individual sound (individual audio data). The sound source direction estimation unit 52 then compares the resulting frequency information of the R and L channels and detects the phase shift between the audio data of the R and L channels, thereby estimating the direction of the sound source of the individual sound.

The sound source direction estimation unit 52 generates direction data indicating the estimated direction of the sound source of each individual sound, and supplies it to the image composition unit 26. Estimation of the sound source direction is described in detail in, for example, Japanese Unexamined Patent Application Publication No. 2010-20294.
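One common way to turn an inter-channel phase shift into a direction is sketched below; it is not taken from the cited publication. It assumes two microphones a known distance apart, a far-field source, and a phase shift small enough not to wrap around. The microphone spacing, the speed of sound, the use of the strongest frequency bin only, and the sign convention (negative angles toward the left microphone) are all assumptions.

```python
import cmath
import math

def estimate_azimuth_deg(left, right, sample_rate,
                         mic_spacing_m=0.2, speed_of_sound=343.0):
    """Estimate a source azimuth from the L/R phase shift at the strongest bin."""
    n = len(left)
    best_k, best_cross = 0, 0j
    for k in range(1, n // 2):
        w = [cmath.exp(-2j * math.pi * k * t / n) for t in range(n)]
        l = sum(s * c for s, c in zip(left, w))
        r = sum(s * c for s, c in zip(right, w))
        cross = l * r.conjugate()  # its phase is phase(L) - phase(R)
        if abs(cross) > abs(best_cross):
            best_k, best_cross = k, cross
    if best_k == 0:
        return 0.0  # silent frame: no direction information
    freq = best_k * sample_rate / n
    # Positive delay: the right channel lags, so the source is nearer the left.
    delay = cmath.phase(best_cross) / (2 * math.pi * freq)
    s = max(-1.0, min(1.0, speed_of_sound * delay / mic_spacing_m))
    return -math.degrees(math.asin(s))
```

Identical channels give 0 degrees, and a right channel that lags the left gives a negative (leftward) angle under the convention assumed here.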

In step S43, the sound source distance estimation unit 53 performs sound source distance estimation processing on each piece of individual audio data supplied from the sound source separation unit 51, and estimates the distance to the sound source of each individual sound.

For example, when the audio data constituting the content data consists of audio data for the R and L channels, the sound source distance estimation unit 53 performs a discrete Fourier transform on the individual audio data of the R and L channels to obtain a phase difference spectrum.

Furthermore, from this phase difference spectrum, the sound source distance estimation unit 53 obtains the standard deviation of the phase difference at each frequency, and calculates, as a feature quantity, the average of those standard deviations over a predetermined frequency band. The sound source distance estimation unit 53 then estimates the distance from the sound source to the sound collection unit by substituting the feature quantity into a function obtained in advance.
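The feature computation can be illustrated with the sketch below, which starts from phase-difference spectra that have already been computed for several frames. Taking the standard deviation across frames at each frequency bin is an assumption, since the description does not say over what the deviation is taken. The linear mapping in distance_from_feature merely stands in for the "function obtained in advance", whose form is not given; its coefficients are hypothetical.

```python
import statistics

def phase_spread_feature(phase_diff_frames, band):
    """Average, over a frequency band, of the per-bin standard deviation.

    phase_diff_frames: list of frames, each a list of L/R phase differences
    (radians) indexed by frequency bin. band: (first_bin, end_bin) half-open.
    """
    first_bin, end_bin = band
    per_bin_std = [
        statistics.pstdev([frame[k] for frame in phase_diff_frames])
        for k in range(first_bin, end_bin)
    ]
    return statistics.mean(per_bin_std)

def distance_from_feature(feature, slope=5.0, intercept=0.5):
    """Placeholder for the pre-computed function; coefficients are made up."""
    return slope * feature + intercept
```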

The sound source distance estimation unit 53 generates distance data indicating the estimated distance to the sound source of each individual sound, and supplies it to the image composition unit 26.

In step S44, the environmental sound/speech sound identification unit 54 determines, for each piece of individual audio data supplied from the sound source separation unit 51, whether the individual sound based on that data is a speech sound or an environmental sound.

For example, when the audio data constituting the content data consists of audio data for the R and L channels, the environmental sound/speech sound identification unit 54 obtains a sum signal of the individual audio data by adding the R-channel individual audio data and the L-channel individual audio data. The environmental sound/speech sound identification unit 54 also filters the sum signal with a filter that removes components in the frequency band of a typical human voice.

Furthermore, the environmental sound/speech sound identification unit 54 obtains a difference signal of the individual audio data by taking the difference between the R-channel individual audio data and the L-channel individual audio data, and then obtains the difference between the difference signal and the filtered sum signal.

When the difference between the difference signal and the filtered sum signal is equal to or greater than a predetermined threshold, the environmental sound/speech sound identification unit 54 determines that the individual sound being processed is an environmental sound.

When a human voice is picked up by two sound collection units for the R and L channels, the person who is the sound source is often located approximately midway between the two sound collection units. The human voice contained in the individual sounds of the R and L channels should therefore be at almost the same level (volume), so the difference signal obtained by taking the difference between those pieces of individual audio data should contain almost no human voice.

Consequently, when the human voice component is removed by filtering from the sum signal of the individual audio data of the R and L channels, and the difference between the filtered sum signal and the difference signal is obtained, that difference should be audio data containing only the environmental sound of the R or L channel. The environmental sound/speech sound identification unit 54 therefore determines that the individual sound being processed is an environmental sound when the obtained difference is equal to or greater than the threshold, and conversely that it is a speech sound when the difference is less than the threshold.
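The decision rule of step S44 can be sketched as follows. Several details are assumptions because the description leaves them open: the voice band is taken as 300 to 3400 Hz, the filter is an idealized one that zeroes DFT bins, the difference signal is R minus L, and the "difference" is compared with the threshold through its RMS level. All names and defaults are hypothetical.

```python
import cmath
import math

def remove_voice_band(signal, sample_rate, lo_hz=300.0, hi_hz=3400.0):
    """Idealized band-stop filter: zero every DFT bin inside the voice band."""
    n = len(signal)
    spec = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold mirrored bins
        if lo_hz <= freq <= hi_hz:
            spec[k] = 0j
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def is_environmental(left, right, sample_rate, threshold=0.05):
    """True if the individual sound is judged to be an environmental sound."""
    total = [r + l for l, r in zip(left, right)]          # sum signal
    filtered = remove_voice_band(total, sample_rate)      # voice band removed
    diff = [r - l for l, r in zip(left, right)]           # difference signal
    residual = [d - f for d, f in zip(diff, filtered)]
    level = math.sqrt(sum(v * v for v in residual) / len(residual))
    return level >= threshold
```

With the R minus L convention assumed here, the residual works out to roughly minus twice the left channel's content outside the voice band, so the sign convention matters; the description does not state which channel is subtracted from which in this step.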

Of the individual sounds, the environmental sound/speech sound identification unit 54 supplies the audio data of those determined to be speech sounds to the speech content identification unit 56, and supplies the audio data of those determined to be environmental sounds to the environmental sound identification unit 55.

In step S45, the speech content identification unit 56 performs speech recognition processing on the audio data of each individual sound supplied from the environmental sound/speech sound identification unit 54, and converts the speech content of the individual sound into text.

For example, the speech content identification unit 56 performs acoustic analysis processing on the audio data for each predetermined frame, and extracts feature quantities of predetermined features from the audio data. For example, a discrete Fourier transform is performed as the acoustic analysis processing, and the power spectrum is extracted as the feature quantity.
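The frame-wise acoustic analysis can be sketched as below. The frame length, the hop size, the absence of a window function, and the function name are assumptions; practical recognizers use much longer frames and usually derive further features from the power spectrum.

```python
import cmath
import math

def power_spectrum_frames(samples, frame_len=8, hop=4):
    """Split audio into overlapping frames and return each frame's power spectrum.

    Only the non-redundant bins 0 .. frame_len // 2 are kept per frame.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                        for t in range(frame_len))
            spectrum.append(abs(coeff) ** 2)
        frames.append(spectrum)
    return frames
```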

Next, the speech content identification unit 56 recognizes the individual sound by performing matching processing using the obtained feature quantities together with an acoustic model database, a dictionary database, and a grammar database recorded in advance.

Here, the acoustic model database consists of acoustic models, such as HMMs (Hidden Markov Models), that represent the acoustic features of each unit (PLU: Phoneme Like Unit), such as individual phonemes and syllables, in the language of the speech.

The dictionary database consists of a word dictionary describing, for each word to be recognized, phonological information about its pronunciation, together with information indicating the probability that a feature quantity is observed from each acoustic model. The grammar database consists of grammar rules (a language model) describing how the words registered in the word dictionary of the dictionary database chain together.

The speech content identification unit 56 refers to the word dictionary in the dictionary database and connects acoustic models from the acoustic model database to construct an acoustic model of each word (word model). It then connects several word models by referring to the grammar rules in the grammar database and, from among the sequences of word models connected in this way, outputs as the recognition result of the individual sound the word string corresponding to the sequence with the highest likelihood given the feature quantities. That is, speech sound data indicating the word string obtained by speech recognition is supplied to the visual information generation unit 25.

In step S46, the environmental sound identification unit 55 performs speech recognition processing on the audio data of each individual sound supplied from the environmental sound/speech sound identification unit 54, and converts those individual sounds, that is, the environmental sounds, into text.

The speech recognition processing in the environmental sound identification unit 55 is similar to that in the speech content identification unit 56. That is, feature quantities are extracted from the audio data, and matching processing is performed between the extracted feature quantities and each database.

However, the dictionary database recorded in the environmental sound identification unit 55 contains words different from those registered in the dictionary database of the speech content identification unit 56, such as the ambulance siren "pii-poo pii-poo" (ピーポーピーポー). The environmental sound identification unit 55 does not necessarily need to be provided with a grammar database.
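The environmental-sound dictionary can be pictured as a table that pairs each registered word with its supplementary information. The sketch below is only a stand-in: it replaces the acoustic-model matching described above with a nearest-template comparison, and the feature vectors, the "(explosion)" note, and all names are hypothetical.

```python
import math

# Hypothetical dictionary entries; the templates are made-up feature vectors.
ENV_DICTIONARY = [
    {"word": "pii-poo pii-poo", "note": "(ambulance siren)",
     "template": [0.1, 0.8, 0.1]},
    {"word": "dokaaaaan", "note": "(explosion)",
     "template": [0.9, 0.1, 0.0]},
]

def recognize_environmental(feature, dictionary=ENV_DICTIONARY):
    """Return (word, supplementary note) of the closest dictionary entry.

    Nearest-template lookup stands in for likelihood-based matching and
    needs no grammar database, since each entry is a whole word string.
    """
    best = min(dictionary, key=lambda e: math.dist(feature, e["template"]))
    return best["word"], best["note"]
```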

When the environmental sound has been converted into text and the resulting environmental sound data has been supplied from the environmental sound identification unit 55 to the visual information generation unit 25, the analysis processing ends, and the process then proceeds to step S15 of FIG. 3.

In this way, the analysis unit 24 separates the audio data of the content into audio data of speech sounds and environmental sounds, and performs sound source direction estimation, sound source distance estimation, speech recognition processing, and the like on each piece of audio data. The analysis unit 24 can therefore obtain information for each individual sound, such as the speech content and the direction of the sound source, more reliably. Moreover, the analysis unit 24 determines for each individual sound whether it is a speech sound or an environmental sound and performs speech recognition processing using a different dictionary according to the result, so individual sounds can be converted into text with higher accuracy.

<Second Embodiment>
[Configuration of analysis unit]
In the above description, effect images are displayed for both the speech sounds and the environmental sounds constituting the content. However, because closed captions or the like may already be available for speech sounds, effect images may be displayed for environmental sounds only.

In such a case, the analysis unit 24 is configured, for example, as shown in FIG. 7.

Specifically, the analysis unit 24 shown in FIG. 7 consists of the sound source direction estimation unit 52, an environmental sound/speech separation unit 81, and the environmental sound identification unit 55, and the audio data from the audio decoder 23 is supplied to the sound source direction estimation unit 52 and the environmental sound/speech separation unit 81.

In FIG. 7, parts corresponding to those in FIG. 2 are denoted by the same reference numerals, and their description is omitted as appropriate.

The environmental sound/speech separation unit 81 extracts the audio data of the environmental sound from the audio data supplied from the audio decoder 23, and supplies it to the environmental sound identification unit 55.

[Description of analysis processing]
Next, the analysis processing performed when the analysis unit 24 has the configuration of FIG. 7 will be described with reference to the flowchart of FIG. 8.

In step S71, the sound source direction estimation unit 52 generates, from the audio data supplied from the audio decoder 23, direction data indicating the direction of the sound source of the environmental sound, and supplies it to the image composition unit 26.

For example, the sound source direction estimation unit 52 applies a Fourier transform to the supplied audio data and, from the resulting frequency information and a separation matrix obtained by learning processing that applies independent component analysis, generates a separated signal of the environmental sound emitted from each sound source. The sound source direction estimation unit 52 then calculates a cross-covariance matrix between the frequency information and the separated signal of the environmental sound for corresponding time intervals, and calculates the phase differences between elements of the cross-covariance matrix to obtain the sound source direction of the environmental sound and generate the direction data.

In this case, no distance data is generated for the environmental sound, so the image composition unit 26 overlays the effect image of the environmental sound at the position in the content image determined by the direction data. Of course, the sound source distance estimation unit 53 may be provided so that distance data for the environmental sound is generated.

In step S72, the environmental sound/speech separation unit 81 extracts the audio data of the environmental sound from the audio data supplied from the audio decoder 23, and supplies it to the environmental sound identification unit 55.

For example, when the audio data constituting the content data consists of R-channel and L-channel audio data, the environmental sound/audio separation unit 81 obtains a sum signal of the audio data by adding the R-channel audio data and the L-channel audio data. The environmental sound/audio separation unit 81 then applies to the sum signal a filter that removes components in the frequency band of a typical human voice.

Further, the environmental sound/audio separation unit 81 obtains a difference signal of the audio data by subtracting the L-channel audio data from the R-channel audio data, and generates the R-channel audio data of the environmental sound by adding the difference signal to the filtered sum signal. The environmental sound/audio separation unit 81 also generates the L-channel audio data of the environmental sound by subtracting the difference signal from the filtered sum signal.

As in the processing of the environmental sound/utterance sound identification unit 54 described above, the sum signal and the difference signal do not contain the utterance sound, so the R-channel and L-channel audio data of the environmental sound can be extracted by taking the sum and difference of those signals. In other words, the audio data of the content can be separated into audio data of the utterance sound and audio data of the environmental sound. The environmental sound/utterance sound identification unit 54 supplies the obtained audio data of the environmental sound to the environmental sound identification unit 55.
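The sum and difference steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the environmental sound/audio separation unit 81. The text does not specify the voice-band filter, so the crude band-stop built from two one-pole low-pass filters, the 300 Hz to 3400 Hz band edges, and all function names are assumptions.

```python
import math


def low_pass(samples, cutoff_hz, sample_rate_hz):
    """One-pole low-pass filter (a deliberately crude stand-in)."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def remove_voice_band(samples, sample_rate_hz, low_hz=300.0, high_hz=3400.0):
    """Band-stop: keep content below low_hz plus content above high_hz.

    The band edges are assumed values for a typical voice band.
    """
    below = low_pass(samples, low_hz, sample_rate_hz)
    up_to_high = low_pass(samples, high_hz, sample_rate_hz)
    return [b + (x - u) for b, x, u in zip(below, samples, up_to_high)]


def extract_environmental(right, left, sample_rate_hz):
    """Return (env_right, env_left) following the sum/difference steps."""
    total = [r + l for r, l in zip(right, left)]   # sum signal R + L
    diff = [r - l for r, l in zip(right, left)]    # difference signal R - L
    filtered = remove_voice_band(total, sample_rate_hz)
    # Taken literally, these steps yield 2R and 2L when nothing is filtered;
    # any gain correction would be an additional assumption.
    env_right = [d + f for d, f in zip(diff, filtered)]
    env_left = [f - d for d, f in zip(diff, filtered)]
    return env_right, env_left
```

In this sketch, voice-band content that is identical in both channels cancels in the difference signal and is attenuated in the filtered sum signal, while content that differs between the channels survives through the difference signal.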

When the audio data of the environmental sound has been obtained in this way, the process of step S73 is performed and the analysis process ends. Since the process of step S73 is the same as the process of step S46 in FIG. 6, its description is omitted. When the analysis process ends, the process proceeds to step S15 in FIG. 3.

In this manner, the analysis unit 24 extracts only the audio data of the environmental sound from the audio data of the content and converts the environmental sound into text. As a result, the effect image of the environmental sound can be displayed on the content image, and the user can grasp the content of the environmental sound and the position of its sound source more accurately.

Although FIG. 7 shows a configuration in which the analysis unit 24 does not include the utterance content identification unit 56, the utterance content identification unit 56 may also be provided in the analysis unit 24 of FIG. 7.

The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the program constituting the software is installed from a program recording medium into a computer built into dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions when various programs are installed.

FIG. 9 is a block diagram illustrating a hardware configuration example of a computer that executes the above-described series of processes by means of a program.

In the computer, a CPU (Central Processing Unit) 301, a ROM (Read Only Memory) 302, and a RAM (Random Access Memory) 303 are connected to one another by a bus 304.

An input/output interface 305 is further connected to the bus 304. Connected to the input/output interface 305 are an input unit 306 including a keyboard, a mouse, a microphone, and the like; an output unit 307 including a display, a speaker, and the like; a recording unit 308 including a hard disk, a nonvolatile memory, and the like; a communication unit 309 including a network interface and the like; and a drive 310 that drives a removable medium 311 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory.

In the computer configured as described above, the CPU 301 loads, for example, a program recorded in the recording unit 308 into the RAM 303 via the input/output interface 305 and the bus 304 and executes it, whereby the above-described series of processes is performed.

The program executed by the computer (CPU 301) is provided recorded on the removable medium 311, which is a package medium such as a magnetic disk (including a flexible disk), an optical disc (CD-ROM (Compact Disc-Read Only Memory), DVD (Digital Versatile Disc), etc.), a magneto-optical disk, or a semiconductor memory, or is provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

The program can be installed in the recording unit 308 via the input/output interface 305 by loading the removable medium 311 into the drive 310. The program can also be received by the communication unit 309 via a wired or wireless transmission medium and installed in the recording unit 308. Alternatively, the program can be installed in advance in the ROM 302 or the recording unit 308.

The program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timing, such as when a call is made.

The embodiments of the present invention are not limited to the embodiment described above, and various modifications can be made without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS

11 image processing apparatus, 12 recording medium, 24 analysis unit, 25 visual information generation unit, 26 image synthesis unit, 27 display unit, 51 sound source separation unit, 52 sound source direction estimation unit, 53 sound source distance estimation unit, 54 environmental sound/utterance sound identification unit, 55 environmental sound identification unit, 56 utterance content identification unit, 81 environmental sound/audio separation unit

Claims

An image processing apparatus comprising:
sound source direction estimating means for estimating, based on audio data of sound constituting content, the direction of a sound source of the sound relative to a predetermined reference position;
separating means for separating the audio data into audio data of an environmental sound, which is the sound excluding an utterance sound produced by human speech, and audio data of the utterance sound;
environmental sound identifying means for performing speech recognition processing on the audio data of the environmental sound to convert the environmental sound into text; and
image synthesizing means for combining effect data of an effect image, which is generated based on the text-converted environmental sound and visually presents the environmental sound, with image data of an image constituting the content, so that the effect image is displayed at a position on the image determined by the estimation result of the direction of the sound source.

The image processing apparatus according to claim 1, further comprising sound source distance estimating means for estimating the distance from the reference position to the sound source based on the audio data of the sound,
wherein the image synthesizing means combines the effect data and the image data so that the effect image is displayed on the image at a position determined by the estimation result of the direction of the sound source and in a size determined by the estimation result of the distance to the sound source.

The image processing apparatus according to claim 1, further comprising sound source separation means for performing blind sound source separation processing on the audio data of the sound to separate the audio data into audio data of the sound of each sound source,
wherein the separating means determines, for each piece of audio data separated by the sound source separation means, whether or not the audio data is audio data of the environmental sound, thereby separating the audio data of the environmental sound from the audio data of the utterance sound.

The image processing apparatus according to claim 2, wherein the image synthesizing means combines the effect data and the image data so that the size, color, or brightness of the effect image displayed on the image changes according to at least one of the direction of the sound source, the distance to the sound source, and the volume of the environmental sound.

The image processing apparatus according to claim 2, further comprising visual information generation means for generating the effect data based on the text-converted environmental sound so that supplementary information, consisting of a character string that is predetermined for the text-converted environmental sound and supplements the content of the environmental sound, is displayed on the effect image.

The image processing apparatus according to claim 2, further comprising speech recognition means for performing speech recognition processing on the audio data of the utterance sound separated by the separating means to convert the utterance sound into text,
wherein the image synthesizing means combines the effect data and the image data so that the effect image of the environmental sound and an effect image that visually presents the utterance sound are displayed on the image.

An image processing method for an image processing apparatus that comprises:
sound source direction estimating means for estimating, based on audio data of sound constituting content, the direction of a sound source of the sound relative to a predetermined reference position;
separating means for separating the audio data into audio data of an environmental sound, which is the sound excluding an utterance sound produced by human speech, and audio data of the utterance sound;
environmental sound identifying means for performing speech recognition processing on the audio data of the environmental sound to convert the environmental sound into text; and
image synthesizing means for combining effect data of an effect image, which is generated based on the text-converted environmental sound and visually presents the environmental sound, with image data of an image constituting the content, so that the effect image is displayed at a position on the image determined by the estimation result of the direction of the sound source,
the image processing method including the steps of:
the sound source direction estimating means estimating the direction of the sound source of the sound;
the separating means separating the audio data into the audio data of the environmental sound and the audio data of the utterance sound;
the environmental sound identifying means converting the environmental sound into text; and
the image synthesizing means combining the effect data and the image data.

A program for causing a computer to execute processing including the steps of:
estimating, based on audio data of sound constituting content, the direction of a sound source of the sound relative to a predetermined reference position;
separating the audio data into audio data of an environmental sound, which is the sound excluding an utterance sound produced by human speech, and audio data of the utterance sound;
performing speech recognition processing on the audio data of the environmental sound to convert the environmental sound into text; and
combining effect data of an effect image, which is generated based on the text-converted environmental sound and visually presents the environmental sound, with image data of an image constituting the content, so that the effect image is displayed at a position on the image determined by the estimation result of the direction of the sound source.