JP2006139134A

JP2006139134A - Voice output control device, voice output control system, methods thereof, programs thereof, and recording medium recorded with those programs

Info

Publication number: JP2006139134A
Application number: JP2004329517A
Authority: JP
Inventors: Kazuya Takahashi; Ryuichiro Matsumoto; Kentaro Yamamoto
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2004-11-12
Filing date: 2004-11-12
Publication date: 2006-06-01

Abstract

PROBLEM TO BE SOLVED: To provide a voice output device that outputs voice appropriately in response to external sound.

SOLUTION: External sound data acquired by collecting external sound is divided into voice segment information at silent parts identified from the volume of the data. Each piece of voice segment information is converted into text form and subjected to language analysis, and voice segment information consisting of a single phrase is adopted as voice data. Relevance information between the voice data and each phrase in the voice segment information preceding it is generated as a score value that decreases as the elapsed time grows longer and as the number of phrases increases. Voice information is generated by associating the phrase information and the relevance information with the voice data. When a phrase identical to the phrase information of stored voice information is recognized in the external sound data, the voice information for that phrase is retrieved and, once a silent period of 1 to 2 seconds or longer has elapsed, the voice data of the voice information whose relevance information holds the highest score value is output as voice.

COPYRIGHT: (C) 2006, JPO & NCIPI

Description

The present invention relates to a voice output control device that outputs voice in response to external sound, a voice output control system, methods thereof, programs thereof, and a recording medium on which those programs are recorded.

Conventionally, various products that recognize speech and output voice data, such as robots, toys, and video game programs, have been widely used (see, for example, Patent Document 1). The device described in Patent Document 1, for example, is applied to a robot: a microphone placed at a predetermined position on the head unit collects surrounding sound, including the user's utterances. Based on the resulting audio signal, the device generates synthesized speech in which the prosody held in the state information of a model storage unit is controlled according to the values of an emotion model, and outputs that speech from a speaker.

JP 2002-304187 A (page 3, right column to page 10, left column)

However, a conventional voice output configuration such as the one described in Patent Document 1 outputs voice data stored in advance. One problem, for example, is that the voice data that can be uttered in response to the content of an utterance is limited.

In view of the above, an object of the present invention is to provide a voice output control device, a voice output control system, methods thereof, programs thereof, and a recording medium on which those programs are recorded, each of which outputs voice appropriately in response to sound from outside.

The invention according to claim 1 is a voice output control device that performs control to output voice in response to collected external sound, comprising: storage means constructed as a table structure that stores a plurality of pieces of voice information, each of which associates, in a single data structure, voice data on a voice extracted from uttered speech, phrase information on a phrase, and relevance information on the degree of relevance between the voice of the voice data and the phrase of the phrase information; external sound acquisition means for acquiring a series of external sound data on the external sound; phrase recognition means for recognizing a phrase contained in the external sound; voice search means for searching the storage means for the voice information corresponding to the recognized phrase; selection means for selecting predetermined voice data, based on the relevance information, from the voice information obtained by the search; and output control means for performing control to output the selected voice data from a speaker as the voice.

The invention according to claim 28 is a voice output control system that performs control to output voice in response to collected external sound, comprising: storage means constructed as a table structure that stores a plurality of pieces of voice information, each of which associates, in a single data structure, voice data on a voice extracted from uttered speech, phrase information on a phrase, and relevance information on the degree of relevance between the voice of the voice data and the phrase of the phrase information; and a terminal device that is connected to the storage means via a network so as to be able to acquire the voice information and that includes external sound acquisition means for acquiring a series of external sound data on the external sound, phrase recognition means for recognizing a phrase contained in the external sound, voice search means for searching the storage means via the network for the voice information corresponding to the recognized phrase, selection means for selecting predetermined voice data, based on the relevance information, from the voice information obtained by the search, and output control means for performing control to output the selected voice data from a speaker as the voice.

The invention according to claim 29 is a voice output control method in which computing means performs control to output voice in response to collected external sound, wherein the computing means: acquires a series of external sound data on the external sound; recognizes a phrase contained in the acquired external sound data; searches, for the voice information corresponding to the recognized phrase, storage means constructed as a table structure that stores a plurality of pieces of voice information, each of which associates, in a single data structure, voice data on a voice extracted from uttered speech, phrase information on a phrase, and relevance information on the degree of relevance between the voice of the voice data and the phrase of the phrase information; selects predetermined voice data, based on the relevance information, from the voice information obtained by the search; and performs control to output the selected voice data from a speaker as the voice.

The invention according to claim 30 is a voice output control program that causes computing means to function as the voice output control device according to any one of claims 1 to 27 or as the voice output control system according to claim 28.

The invention according to claim 31 is a voice output control program that causes computing means to execute the voice output control method according to claim 29.

The invention according to claim 32 is a recording medium on which the voice output control program according to claim 30 or claim 31 is recorded so as to be readable by computing means.

An embodiment of the present invention is described below with reference to the drawings. This embodiment takes as an example a voice output device of the present invention mounted on a moving body, for example a vehicle. Besides being mounted on a moving body, the voice output device of the present invention can also be installed in a building such as a house, or applied to a robot or the like. In this embodiment the external sound is described as voice such as surrounding utterances, but it is not limited to this; various sounds generated inside the vehicle and various sounds propagating into the vehicle from outside can also be targeted.

[Configuration of the voice output device]
FIG. 1 is a block diagram showing the schematic configuration of a voice output device according to an embodiment of the present invention. FIG. 2 is a conceptual diagram showing the schematic table structure of the voice data search table database in the storage means. FIG. 3 is an explanatory diagram conceptually showing how voice data and phrase information are extracted from external sound data: (A) is a waveform diagram based on the volume of the external sound data, (B) shows the phrases of the extracted phrase information, (C) shows the distance-based score values for voice data A, (D) shows the distance-based score values for voice data B, and (E) shows the number of pieces of phrase information extracted from the voice segment information. FIG. 4 is a table of the set values of the coefficient for the elapsed distance of a phrase from the voice data. FIG. 5 is a table showing how score values are calculated for the phrases corresponding to the voice data of a phrase recognized as a keyword.

In the figures, reference numeral 100 denotes a voice output device. The voice output device 100 outputs voice in response to sound generated inside a moving body, for example a vehicle, or sound propagating into the vehicle from outside. The moving body is not limited to vehicles such as automobiles and trains; the device can be applied to any moving body, such as an airplane or a ship. The voice output device 100 operates on power supplied from, for example, a battery (not shown) mounted on the vehicle. The voice output device 100 includes communication means 200, operation means 300, sound collecting means 400, sound generation means 500, voice data reading means 600, storage means 700 that also functions as a recording medium on which voice information is recorded, a memory 800, and computing means 900 that also functions as a voice information generation device and a voice output control device.

The communication means 200 acquires voice data on voice input from outside by receiving a wireless medium such as broadcast waves, and acquires voice data or voice information from a server device or the like via a network. Specifically, the communication means 200 has, for example, a tuner connected to an antenna (not shown) that receives broadcast waves such as terrestrial analog, terrestrial digital, or satellite digital broadcasts, and acquires the voice data transmitted as analog or digital signals via the antenna. The communication means 200 can also connect to networks such as the Internet based on a general-purpose protocol such as TCP/IP, an intranet, a LAN (Local Area Network), or a communication network or broadcast network in which multiple base stations capable of exchanging information over a wireless medium form a network, and it receives voice data, voice information, and the like from server devices, base stations, and so on via these networks. The communication means 200 is connected to the computing means 900 and outputs the acquired voice data and voice information to the computing means 900.

The operation means 300 has operation buttons, operation knobs, and the like (not shown) that accept input operations. The operation means 300 is connected to the computing means 900, outputs a predetermined operation signal to the computing means 900 in response to an input operation on the buttons or knobs, and causes the computing means 900 to set the various items corresponding to the input operation. Examples of the items set through the operation means 300 include settings for the operation of the voice output device 100 as a whole, such as specifying the information to be received by the communication means 200, the sound output state of the sound generation means 500, or the operation settings of the voice data reading means 600; entering phrases; instructing the processing or execution of various information stored in the storage means 700 or the memory 800; and instructing that various information be stored in the storage means 700 or the memory 800. The operation means 300 is not limited to input via operation buttons and knobs. Any configuration that allows the various items to be set can be used, for example input via a touch panel on a separately connected display device, voice input, or a configuration such as a remote controller that outputs signals via a wireless medium.

The sound collecting means 400 acquires, that is, collects, external sound from around the voice output device 100. The sound collecting means 400 has a microphone 410 installed, for example, on the dashboard of the vehicle. The sound collecting means 400 is connected to the computing means 900 and outputs external sound data on the external sound collected by the microphone 410 to the computing means 900.

The sound generation means 500 is connected to the computing means 900 and outputs the voice data and the like output from the computing means 900 as voice. The sound generation means 500 has an amplifier (not shown) that acquires and processes, for example amplifies, the analog-signal voice data and the like output from the computing means 900, a speaker 510 that outputs the voice data processed by the amplifier as voice, and so on. A configuration already installed in the vehicle may be used as the sound generation means 500. The sound generation means 500 may also be configured to output voice data or music data received by the communication means 200 or read by the voice data reading means 600, or voice data or music data stored in the storage means 700 or the memory 800.

The voice data reading means 600 has a drive, driver, or the like that stores data readably on a recording medium, for example a magnetic disk such as an HD (Hard Disk) or FD (Flexible Disk), an optical disc such as a CD (Compact Disc) or DVD (Digital Versatile Disc), a magneto-optical disk, a memory card, or a memory. It reads voice data, music data, and the like stored on the recording medium and outputs them to the computing means 900. The voice data reading means 600 may also be configured to store voice data or music data output from the computing means 900 on the recording medium. Furthermore, the voice data reading means 600 is not limited to a configuration with a drive or driver: for example, as a configuration shared with the communication means 200, it may be a TV receiver, radio receiver, or the like that receives broadcast waves, processes them as appropriate, and outputs them to the computing means 900 as voice data or music data, or it may acquire voice data or music data via a network and output them to the computing means 900.

The storage means 700 has, for example, a drive, driver, or the like in the same way as the voice data reading means 600, and builds on a recording medium a table structure that records a plurality of pieces of voice information. Specifically, the storage means 700 has a voice data database (DB) 710 and a voice data search table database 720 as shown in FIG. 2. The voice data DB 710 has a table structure that stores a plurality of records, each of which associates, in a single data structure, voice data on a voice with a voice data ID (identification), which is unique information identifying that voice data. The voice data is data on a relatively short colloquial utterance, for example a simple sentence, an exclamation, or a phrase extracted from spoken words. As shown in FIG. 2, the voice data search table DB 720 has a table structure that stores a plurality of pieces of voice information 720A, each of which associates, in a single data structure, a voice data ID 721, phrase information 722 on a phrase in text form serving as other voice data, and relevance information 723 on the degree of relevance, that is, the affinity, between the colloquial utterance that is the voice of the voice data corresponding to the voice data ID 721 and the phrase of the phrase information 722. The relevance information 723 is, for example, numerical data on a score expressed as a number. The phrase information 722 is not limited to a text data structure; the other voice data may be stored separately as data on voice in the same way as the voice data, as an ID number identifying such data, or the like. Moreover, it is not limited to voice data on utterances and can cover voice data on any sound, for example a railroad crossing or a horn. The relevance information 723 is likewise not limited to numerical data and may be any data structure that can differentiate degrees of relevance. The storage means 700 is also configured to be able to store other information, such as map information used by a separately connected navigation device.
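The two tables described above can be pictured with a minimal Python sketch. The class names, field names, and the `lookup` and `best_voice_data_id` helpers are illustrative assumptions, not part of the patent; only the three associated fields (voice data ID 721, phrase information 722, relevance information 723) and the idea of choosing by relevance come from the text.

```python
from dataclasses import dataclass


@dataclass
class VoiceDataRecord:
    """One record of the voice data DB 710: an ID plus the recorded audio."""
    voice_data_id: int
    samples: bytes  # raw audio; the encoding is an assumption of this sketch


@dataclass
class VoiceInfo:
    """One piece of voice information 720A in the search table DB 720."""
    voice_data_id: int  # voice data ID 721
    phrase: str         # phrase information 722 (text form)
    relevance: float    # relevance information 723 (score value)


def lookup(table, phrase):
    """Return every piece of voice information whose phrase matches."""
    return [row for row in table if row.phrase == phrase]


def best_voice_data_id(table, phrase):
    """Pick the voice data ID with the highest relevance score, or None."""
    rows = lookup(table, phrase)
    if not rows:
        return None
    return max(rows, key=lambda row: row.relevance).voice_data_id
```

A list of `VoiceInfo` rows stands in for the database here; a real device would presumably use persistent storage on the recording medium.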

The memory 800 stores, so that they can be read out as needed, settings entered through the operation means 300 and various data such as voice data and music data. The memory 800 also stores various programs that run on an OS (Operating System) controlling the operation of the voice output device 100 as a whole. The memory 800 is preferably a memory that retains its contents even when power is suddenly lost, for example due to a power failure, such as a CMOS (Complementary Metal-Oxide Semiconductor) memory. The memory 800 may also have a drive, driver, or the like that stores data readably on a recording medium such as an HD, DVD, or optical disc.

The computing means 900 performs control to output voice in response to the external sound collected by the sound collecting means 400, and control to generate the voice information 720A used to output voice. The computing means 900 has various input/output ports (not shown), for example a communication port to which the communication means 200 is connected, an input port to which the operation means 300 is connected, a sound collection control port to which the sound collecting means 400 is connected, a sound generation control port to which the sound generation means 500 is connected, a reading control port to which the voice data reading means 600 is connected, a storage port to which the storage means 700 is connected, and a memory port to which the memory 800 is connected. As its various programs, the computing means 900 has external sound acquisition means 901, sound characteristic recognition means 902, delimiter position recognition means 903, text format conversion means 904, language analysis means 905 serving as phrase recognition means, voice data generation means 906, relevance recognition means 907 that also functions as changing means, voice information generation means 908, keyword recognition means 909 that also functions as phrase recognition means, voice search means 910, selection means 911, output control means 912, timing means 913, and so on. The sound characteristic recognition means 902, delimiter position recognition means 903, text format conversion means 904, language analysis means 905 serving as phrase recognition means, voice data generation means 906, relevance recognition means 907, and voice information generation means 908 together constitute a voice information generation unit 900A, which also functions as a voice information generation device, that is, computing means that generates the voice information 720A used to output voice. The external sound acquisition means 901, sound characteristic recognition means 902, delimiter position recognition means 903, text format conversion means 904, language analysis means 905, keyword recognition means 909, voice search means 910, selection means 911, and output control means 912 together constitute a voice data output control unit 900B, which functions as a voice output control device that outputs voice in response to external sound.

The external sound acquisition means 901 acquires the series of external sound data output for the external sound collected by the sound collecting means 400. Specifically, the external sound acquisition means 901 acquires external sound data that forms a continuous waveform as shown in FIG. 3(A). The waveform can be acquired in any data format, such as an analog signal or a digital signal.

The sound characteristic recognition means 902 recognizes the sound characteristics of the external sound data acquired by the external sound acquisition means 901. For example, the sound characteristic recognition means 902 recognizes the external sound data as a waveform based on volume as the sound characteristic.

Based on the volume, which is the sound characteristic recognized by the sound characteristic recognition means 902, the delimiter position recognition means 903 recognizes so-called silent parts, that is, parts at or below a predetermined volume: a volume level at which, allowing for noise, there is no utterance. As indicated by the dotted lines in FIG. 3, the delimiter position recognition means 903 sets each recognized silent part as a delimiter position at which the external sound data is divided. The delimiter position recognition means 903 then divides the external sound data acquired by the external sound acquisition means 901 at the recognized delimiter positions to generate voice segment information.
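The volume-based division into voice segments can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of per-frame volume values, and `threshold` and `min_silence` (the number of consecutive quiet frames treated as a silent part) are hypothetical parameters that the text does not specify.

```python
def split_at_silence(volumes, threshold, min_silence):
    """Split per-frame volume values into non-silent segments.

    A run of at least `min_silence` consecutive frames at or below
    `threshold` is treated as a silent part (a delimiter position).
    Returns (start, end) frame index pairs, end exclusive.
    """
    segments = []
    start = None  # start index of the segment being built, if any
    quiet = 0     # length of the current run of quiet frames
    for i, volume in enumerate(volumes):
        if volume > threshold:
            if start is None:
                start = i
            quiet = 0
        else:
            quiet += 1
            if start is not None and quiet >= min_silence:
                # Close the segment at the first frame of the quiet run.
                segments.append((start, i - quiet + 1))
                start = None
    if start is not None:
        # Close a segment still open at the end, trimming trailing quiet frames.
        segments.append((start, len(volumes) - quiet))
    return segments
```

Short dips in volume that last fewer than `min_silence` frames stay inside a segment, which is one simple way to allow for noise as the text requires.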

The text format conversion means 904 converts each piece of voice segment information generated by the delimiter position recognition means 903 into text form to generate voice text information. Each piece of generated voice text information is associated with its corresponding voice segment information and stored temporarily in the memory 800 or the like.

The language analysis means 905 performs language analysis, such as morphological analysis and syntactic analysis, on each piece of voice text information generated by the text format conversion means 904. Specifically, as shown in FIG. 3(B), it recognizes the phrases in the voice segment information and generates phrase information 722 for each phrase.

Using the language analysis performed by the language analysis means 905, the voice data generation means 906 judges whether each piece of voice text information is a relatively short colloquial utterance, such as a simple sentence, an exclamation, a brief remark immediately following a question, or a phrase. For example, as shown in FIG. 3(B), it judges whether the language analysis means 905 generated only one piece of phrase information 722 or more than one. When the voice data generation means 906 judges that the voice text information is a short colloquial utterance, it generates the voice segment information associated with that voice text information as voice data. In this way, voice data is generated by extracting the portion of the external sound data that lies between delimiter positions. A new voice data ID 721 is assigned to the generated voice data, and the voice data is stored in the voice data DB 710 of the storage means 700 in association with that voice data ID 721 in a single data structure.
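As a rough illustration of this selection step, the sketch below adopts as voice data every segment for which language analysis yielded exactly one phrase, and assigns it a new ID. The function name, the representation of segments as lists of phrases, and the sequential integer IDs are assumptions made for the example.

```python
from itertools import count


def adopt_voice_data(segments, id_source=None):
    """Map a new voice data ID to each segment made up of exactly one phrase.

    `segments` is a list of phrase lists, one per voice segment, in time
    order. Returns {voice_data_id: segment_index}.
    """
    ids = id_source if id_source is not None else count(1)
    return {
        next(ids): index
        for index, phrases in enumerate(segments)
        if len(phrases) == 1
    }
```

For instance, with four segments of which the second and fourth contain a single phrase, the result maps ID 1 to index 1 and ID 2 to index 3.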

関連度認識手段９０７は、音声データ生成手段９０６で生成された音声セグメント情報である音声データに対して、外部音データの時系列における前に位置する他の音声データとなる他の音声セグメント情報の音声テキスト情報を構成する各語句との関連度合いを演算して関連度情報７２３を生成する。具体的には、関連度認識手段９０７は、音声データの口語に対して、他の音声セグメント情報の音声テキスト情報を言語解析手段９０５で言語解析により抽出された各語句情報７２２の語句の親和性となる関連度合いを数値にて認識する。この関連度合いの演算としては、音声データとの関連度合いが演算される対象となる語句と音声データとの距離、例えば音声データから語句までに遡る時間である時間経過を関連度合いとして演算する。 For the voice data, that is, the speech segment information selected by the voice data generation unit 906, the degree-of-relevance recognition unit 907 calculates the degree of relevance to each phrase in the speech text information of the other speech segment information that precedes it in the time series of the external sound data, and generates relevance information 723. Specifically, the degree-of-relevance recognition unit 907 expresses as a numerical value the affinity between the spoken expression of the voice data and each phrase of the phrase information 722 that the language analysis unit 905 extracted from the speech text information of those other segments. The degree of relevance is calculated from the distance between the voice data and the phrase in question, for example the elapsed time going back from the voice data to that phrase.

この経過時間の演算は、例えば、音声データの音声セグメント情報に対して、外部音データの時系列における他の音声セグメント情報までの距離すなわち他の音声セグメント情報の数と、各音声セグメント情報を構成する語句の数と、に基づいて演算する。具体的には、図３（Ｃ），（Ｄ）に示すように、直前の音声セグメント情報が最も高いスコアとなり遠くなる音声セグメント情報の数にしたがって値が小さくなるスコア値をあらかじめ設定する。すなわち、音声として出力させる音声データが、図３（Ｂ）に示すような「そうみたいよ」である場合には、その前に位置する音声セグメント情報である「あっちがパレットシティなの」に、「そうみたいよ」の音声データに対する関連度合いとして図３（Ｃ）に示すように３点がスコア付けされ、さらにその前の音声セグメント情報では２点、１点とスコア付けされる。同様に、音声として出力させる音声データが、図３（Ｂ）に示すような「マジで」である場合には、その前に位置する音声セグメント情報である「お台場の観覧車ってでっかくてキラキラしてるんやね」に、「マジで」の音声データに対する関連度合いとして図３（Ｄ）に示すように３点がスコア付けられ、さらにその前の音声セグメント情報である「そうみたいよ」には２点、さらにその前の「あっちがパレットシティなの」には１点がスコア付けされる。なお、このスコア付けの演算の他、以下の数１に示す式に基づいて演算してスコア付けしたり、数１の式で演算した値を図４に示すようにスコア値としてあらかじめ設定したりしてもよい。なお、図３（Ｃ），（Ｄ）は、３個前までの音声セグメント情報を対象として、音声セグメント情報毎に１点ずつ値が小さくなる状態に設定したスコア値を例示している。 The elapsed time is calculated, for example, from the distance between the speech segment information of the voice data and the other speech segment information in the time series of the external sound data, that is, the number of speech segments in between, and from the number of phrases that make up each piece of speech segment information. Specifically, as shown in FIGS. 3C and 3D, score values are set in advance so that the immediately preceding speech segment information receives the highest score and the value decreases with the number of segments back. That is, when the voice data to be output is “Looks like it” as shown in FIG. 3B, the preceding speech segment information “Is that Palette City over there?” is given 3 points as its degree of relevance to that voice data, as shown in FIG. 3C, and the segments before it are given 2 points and 1 point in turn. Similarly, when the voice data to be output is “Seriously” as shown in FIG. 3B, the preceding speech segment information “The Odaiba Ferris wheel is huge and sparkly, isn't it” is given 3 points as its degree of relevance to that voice data, as shown in FIG. 3D; the segment before it, “Looks like it”, is given 2 points, and the one before that, “Is that Palette City over there?”, is given 1 point. Besides this scoring, the score may be calculated from Equation 1 below, or values calculated with Equation 1 may be set in advance as score values as shown in FIG. 4. FIGS. 3C and 3D illustrate score values that decrease by one point per segment, covering up to the three preceding pieces of speech segment information.

（数１）
Ｓ＝ｌｏｇ₁₀Ｘ
Ｓ：時間経過のスコア値
Ｘ：対象の音声セグメント情報までの数（自然数）
(Equation 1)
S = log₁₀ X
S: score value for elapsed time
X: number of speech segments back to the target speech segment information (natural number)

また、音声セグメント情報を構成する語句の数によるスコア値は、例えば、図３（Ｅ）に示すように音声セグメント情報毎で言語解析により生成された語句情報７２２の数を認識し、数が多くなるにしたがってスコアの値が小さくなるように設定される。具体的には、音声セグメント情報の数に基づいて設定されたスコア値から語句情報７２２の数を除算し、各語句情報７２２のスコア値を演算する。この語句毎で演算したスコア値が、音声データの口語に対する関連度合いとして設定される。 The score value based on the number of phrases in the speech segment information is set, for example, by recognizing the number of pieces of phrase information 722 generated by language analysis for each piece of speech segment information, as shown in FIG. 3E, so that the score becomes smaller as that number increases. Specifically, the score value set according to the number of speech segments is divided by the number of pieces of phrase information 722 to obtain the score value of each piece of phrase information 722. The score value calculated for each phrase is set as its degree of relevance to the spoken expression of the voice data.
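The two-step scoring described above (points by segment distance, then division by the phrase count of each segment) can be sketched in Python as follows. This is only an illustrative sketch: the function name `relevance_scores`, the list-of-lists input, and the placeholder phrases are assumptions and do not appear in the disclosure.

```python
def relevance_scores(preceding_segments, max_back=3):
    """Return (phrase, score) pairs for the segments preceding a voice-data segment.

    preceding_segments: list of phrase lists, oldest first, newest last
    (hypothetical representation of the phrase information per segment).
    """
    pairs = []
    recent = preceding_segments[-max_back:]
    # The immediately preceding segment gets max_back points, the one
    # before it max_back - 1 points, and so on (3, 2, 1 by default).
    for distance, phrases in enumerate(reversed(recent), start=1):
        if not phrases:
            continue  # a segment without recognized phrases contributes nothing
        segment_score = max_back - distance + 1
        # Divide the segment score by the number of phrases in the segment.
        per_phrase = segment_score / len(phrases)
        pairs.extend((phrase, per_phrase) for phrase in phrases)
    return pairs


if __name__ == "__main__":
    segments = [["a"], ["b", "c"], ["d", "e", "f", "g"]]
    for phrase, score in relevance_scores(segments):
        print(phrase, score)
```

With these placeholder segments, each of the four phrases in the newest segment receives 3 / 4 = 0.75, each of the two phrases in the middle segment receives 2 / 2 = 1.0, and the single phrase in the oldest segment receives 1.0.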

なお、経過時間としては、単に対象となる語句までの語句の数や時間長に反比例してスコアの値が小さくなるように演算するなどしてもよい。さらに、関連度合いとしては、経過時間の概念に限らず、例えば構文解析による会話の応答関係によりスコア付けしたり、語句の品詞やアクセント、波形に基づく語尾の抑揚などに基づいてスコア付けしたり、語句の組み合わせにおける過去の出現頻度すなわち語句の組み合わせを履歴して出現する数に比例してスコア付けしたりするなどしてもよく、これら例示した方法と上記例示の方法とを適宜組み合わせるなどしてもよい。 The elapsed time may instead be handled simply by making the score value smaller in inverse proportion to the number of phrases, or the length of time, back to the target phrase. Furthermore, the degree of relevance is not limited to the notion of elapsed time. For example, it may be scored according to the response relationships in the conversation found by syntactic analysis, according to the part of speech or accent of a phrase or the sentence-final intonation derived from the waveform, or in proportion to the past frequency of a phrase combination, that is, the number of times the combination appears in a recorded history. These methods may also be combined as appropriate with the methods described above.

音声情報生成手段９０８は、音声データに語句情報７２２が関連度情報７２３とともに１つのデータ構造に関連付けられた音声情報７２０Ａを生成する。すなわち、音声情報生成手段９０８は、音声データに対して関連度認識手段９０７で認識した関連度合いとなる語句の語句情報７２２を、その関連度情報７２３とともに音声データの音声データＩＤ７２１に図２に示すように１つのデータ構造に関連付け、音声情報７２０Ａを生成する。そして、生成した音声情報７２０Ａは、記憶手段７００の音声データ検索テーブルＤＢ７２０に記憶される。この生成した音声情報７２０Ａの記憶の際、音声データおよび語句情報７２２の組み合わせが同じ音声情報７２０Ａが既に記憶されている場合、音声情報生成手段９０８は、既に記憶されている音声情報７２０Ａの関連度情報７２３に、今回演算した関連度情報７２３を反映させる。例えば、以下の数２に示す式に基づいて関連度合いを再演算し、得られた関連度情報７２３を更新する処理をする。なお、この数２に示す出現頻度を考慮した演算方法に限らず、既に記憶されている関連度情報７２３のスコアと新たに生成した音声情報７２０Ａの関連度情報７２３のスコアとの平均を単に演算する出現頻度を考慮しない演算方法などでもよい。さらには、新たに生成した音声情報７２０Ａで更新するなど、過去の関連度情報７２３を考慮せずにそのまま記憶させる構成などとしてもよい。 The voice information generation unit 908 generates voice information 720A in which the phrase information 722 is associated with one data structure together with the relevance information 723 in the voice data. That is, the voice information generating means 908 shows the phrase information 722 of the phrase having the degree of association recognized by the degree-of-association recognition means 907 for the voice data in the voice data ID 721 of the voice data together with the degree of association information 723 shown in FIG. As described above, the audio information 720A is generated in association with one data structure. The generated voice information 720A is stored in the voice data search table DB 720 of the storage unit 700. When storing the generated audio information 720A, if the audio information 720A having the same combination of the audio data and the phrase information 722 has already been stored, the audio information generating unit 908 determines the relevance of the already stored audio information 720A. The relevance information 723 calculated this time is reflected in the information 723. For example, the relevance degree is recalculated based on the following equation 2 and the obtained relevance degree information 723 is updated. 
The calculation is not limited to the method of Equation 2, which takes the appearance frequency into account. A method that ignores the appearance frequency may be used instead, for example simply averaging the score of the relevance information 723 already stored and the score of the relevance information 723 of the newly generated voice information 720A. Alternatively, the past relevance information 723 may be disregarded altogether, for example by overwriting the stored entry with the newly generated voice information 720A.

（数２）
Ｖ＝（Ｖ₀×ｎ／（ｎ＋１））＋（Ｖ₁×１／（ｎ＋１））
Ｖ：再演算した関連度合い
Ｖ₀：記憶されている音声情報７２０Ａの関連度情報７２３の関連度合い
Ｖ₁：新たに生成した音声情報７２０Ａの関連度情報７２３の関連度合い
ｎ：過去に出現した音声データおよび語句情報７２２の組み合わせの回数（出現頻度）
(Equation 2)
V = (V₀ × n / (n + 1)) + (V₁ × 1 / (n + 1))
V: recalculated degree of relevance
V₀: degree of relevance in the relevance information 723 of the stored voice information 720A
V₁: degree of relevance in the relevance information 723 of the newly generated voice information 720A
n: number of times the combination of voice data and phrase information 722 has appeared in the past (appearance frequency)
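Equation 2 weights the stored score by n / (n + 1) and the new score by 1 / (n + 1). A minimal Python sketch of that update is shown below; the function name `update_relevance` and its argument names are assumptions chosen for illustration.

```python
def update_relevance(stored, new, n):
    """Recalculate the degree of relevance following Equation 2.

    stored: V0, the degree of relevance already stored
    new:    V1, the degree of relevance just calculated
    n:      number of past appearances of the same combination
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return stored * n / (n + 1) + new * 1 / (n + 1)


if __name__ == "__main__":
    print(update_relevance(3.0, 1.0, 1))  # (3.0 * 1/2) + (1.0 * 1/2)
    print(update_relevance(2.0, 4.0, 3))  # (2.0 * 3/4) + (4.0 * 1/4)
```

If V₀ is itself the mean of n earlier scores, this update yields the mean of all n + 1 scores, so frequently seen combinations change more slowly than rarely seen ones.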

キーワード認識手段９０９は、集音している発話状況に基づいて、音声データを音声出力させるためのキーワードを認識する。すなわち、キーワード認識手段９０９は、集音している外部音データから言語解析手段９０５で認識した語句情報７２２に基づいて、記憶手段７００の音声データ検索テーブルＤＢ７２０に記憶した各音声情報７２０Ａの語句情報７２２の語句と同一のキーワードとなる語句が外部音データ中に出現するか否かを判断する。そして、キーワード認識手段９０９は、キーワードとなる語句を認識すると、キーワードが発話された旨の信号を出力する。この信号としては、キーワードとして認識した語句を特性する情報が含まれている。すなわち、所定の語句を検出した旨の信号である。 The keyword recognizing unit 909 recognizes a keyword for outputting voice data based on the collected utterance situation. That is, the keyword recognizing unit 909 uses the phrase information of each voice information 720A stored in the voice data search table DB 720 of the storage unit 700 based on the phrase information 722 recognized by the language analyzing unit 905 from the collected external sound data. It is determined whether or not a phrase that is the same keyword as the phrase 722 appears in the external sound data. When the keyword recognizing unit 909 recognizes a word or phrase as a keyword, it outputs a signal indicating that the keyword has been uttered. This signal includes information that characterizes a phrase recognized as a keyword. That is, a signal indicating that a predetermined word has been detected.

音声検索手段９１０は、キーワード認識手段９０９で認識した語句に対応する音声情報７２０Ａを、記憶手段７００の音声データ検索テーブルＤＢ７２０から検出する。この検出した音声情報７２０Ａは、例えばメモリ８００に適宜記憶される。 The voice search unit 910 detects the voice information 720A corresponding to the phrase recognized by the keyword recognition unit 909 from the voice data search table DB 720 of the storage unit 700. The detected audio information 720A is stored in the memory 800 as appropriate, for example.
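The lookup performed by the keyword recognition and voice search steps can be pictured as a filter over the search table. The sketch below is an assumption-laden illustration: the `VoiceInfo` record simply mirrors the three associated items (voice data ID 721, phrase information 722, relevance information 723), and `find_voice_info` and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VoiceInfo:
    voice_data_id: int  # corresponds to voice data ID 721
    phrase: str         # corresponds to phrase information 722
    relevance: float    # corresponds to relevance information 723


def find_voice_info(table, recognized_phrases):
    """Return the table entries whose phrase matches a recognized phrase."""
    wanted = set(recognized_phrases)
    return [entry for entry in table if entry.phrase in wanted]


if __name__ == "__main__":
    table = [
        VoiceInfo(1, "palette", 1.5),
        VoiceInfo(2, "ferris", 0.75),
        VoiceInfo(1, "city", 1.5),
    ]
    for hit in find_voice_info(table, ["ferris", "weather"]):
        print(hit)
```

Exact string matching is used here only for brevity; the text requires only that the same phrase be recognized in the external sound data.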

選出手段９１１は、音声検索手段９１０で検索された音声情報７２０Ａの関連度情報７２３に基づいて、所定の音声情報７２０Ａの音声データを選出する。すなわち、選出手段９１１は、スコア演算手段９１１Ａと、音声データ選出手段９１１Ｂと、を備えている。 The selection unit 911 selects the audio data of the predetermined audio information 720A based on the relevance information 723 of the audio information 720A searched by the audio search unit 910. That is, the selection unit 911 includes a score calculation unit 911A and a voice data selection unit 911B.

スコア演算手段９１１Ａは、検索された音声情報７２０Ａの音声データ毎に関連度合いを集計、すなわち、同一の口語となる音声データのスコアの値を合算する演算をし、スコアに関するスコア情報を生成する。例えば、図５に示すように、スコア演算手段９１１Ａは、外部音データにおける計時手段９１３で計時する現時点から音声検索手段９１０で検索した音声情報７２０Ａの語句情報７２２に対応する語句の位置までの時間長が長くなるにしたがって、関連度情報７２３の関連度合いのスコアの値を小さくする演算をする。この時間長が長くなるにしたがってスコア値を小さくする演算としては、関連度合いのスコアの値から、上述した関連度認識手段９０７により関連度合いを設定する際に利用する経過時間の演算方法、例えば数１で演算された値を減算する演算をする。さらに、スコア演算手段９１１Ａは、経過時間を考慮したスコア値を同一の口語となる音声データ毎に合算し、現在時点でのその音声データのスコア値としてスコア情報を生成する。なお、このスコア情報は、音声データに直接関連付けてもよいが、演算処理負荷を考慮して音声データＩＤ７２１に関連付けておくとよい。また、時間長である経過時間に基づいてスコア値を演算する構成に限らず、外部音データにおける現時点から音声検索手段９１０で検索した音声情報７２０Ａの語句情報７２２に対応する語句の位置までの語句の数が多くなるにしたがってスコア情報のスコアの値を小さくする演算、例えばあらかじめ数に対応して設定された設定値を除算する演算をするなどしてもよい。 The score calculation unit 911A totals the degree of relevance for each piece of voice data in the retrieved voice information 720A, that is, it sums the score values of voice data representing the same spoken expression, and generates score information. For example, as shown in FIG. 5, the score calculation unit 911A reduces the relevance score of the relevance information 723 as the length of time grows between the current time, measured by the time measuring unit 913, and the position in the external sound data of the phrase corresponding to the phrase information 722 of the voice information 720A retrieved by the voice search unit 910. To reduce the score as this time grows, it subtracts from the relevance score a value obtained by the elapsed-time calculation that the degree-of-relevance recognition unit 907 uses when setting the degree of relevance, for example the value calculated with Equation 1. The score calculation unit 911A then sums these elapsed-time-adjusted score values for each piece of voice data representing the same spoken expression, and generates score information as the score value of that voice data at the current time. The score information may be associated directly with the voice data, but in view of the processing load it is preferably associated with the voice data ID 721. The score need not be calculated from the elapsed time as a length of time. The score value may instead be reduced as the number of phrases increases between the current position in the external sound data and the position of the phrase corresponding to the phrase information 722 of the retrieved voice information 720A, for example by a division using a preset value that corresponds to that number.

音声データ選出手段９１１Ｂは、スコア演算手段９１１Ａで順次演算されるスコア情報のスコアの値に基づき、所定の音声データを選出する。例えば、音声データ選出手段９１１Ｂは、スコア情報のスコア値が最も高い音声データを音声出力候補として選出する。この選出された音声データは、メモリ８００などに適宜記憶される。なお、この音声データの記憶は、直接音声データを記憶してもよいが、上述したように、演算処理負荷を考慮して、音声データＩＤ７２１を記憶させておくとよい。 The voice data selection unit 911B selects predetermined voice data based on the score value of the score information sequentially calculated by the score calculation unit 911A. For example, the voice data selection unit 911B selects voice data having the highest score value of the score information as a voice output candidate. The selected audio data is appropriately stored in the memory 800 or the like. The audio data may be stored directly, but as described above, the audio data ID 721 may be stored in consideration of the calculation processing load.
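The combination of score calculation and selection can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the tuple layout, the function name `select_voice_data`, and in particular the reading of X in Equation 1 as "number of segments elapsed since the keyword was heard" are assumptions.

```python
import math
from collections import defaultdict


def select_voice_data(hits):
    """Pick the voice data ID with the highest elapsed-time-adjusted total.

    hits: iterable of (voice_data_id, relevance, segments_ago) tuples,
    a hypothetical flattening of the retrieved voice information.
    Returns None when there are no hits.
    """
    totals = defaultdict(float)
    for voice_data_id, relevance, segments_ago in hits:
        # Assumed decay term: subtract S = log10(X) from the stored score,
        # with X clamped to at least 1 so that the logarithm is defined.
        decayed = relevance - math.log10(max(segments_ago, 1))
        # Sum the adjusted scores of entries that share the same voice data.
        totals[voice_data_id] += decayed
    if not totals:
        return None
    return max(totals, key=totals.get)


if __name__ == "__main__":
    hits = [(1, 1.0, 1), (2, 0.75, 1), (2, 0.75, 1), (1, 1.0, 10)]
    print(select_voice_data(hits))
```

In this example voice data 1 totals 1.0 (its older hit is reduced to 0.0) and voice data 2 totals 1.5, so ID 2 is selected. The sketch does not clamp adjusted scores at zero; whether negative contributions should be allowed is not stated in the text.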

出力制御手段９１２は、選出手段９１１で選出された音声データを発音手段５００のスピーカ５１０から音声として出力させる制御をする。例えば、出力制御手段９１２は、外部音データにおける区切位置を認識すると、メモリ８００に記憶されている音声データＩＤ７２１に対応する音声データを記憶手段７００から読み取ってアナログ信号に適宜変換するなどの処理をし、発音手段５００へ出力する。この区切位置を認識して出力させる際、出力制御手段９１２は、区切位置の区間となる時間長が１〜２秒以上、好ましくは２秒以上であるかを否かを判断し、１〜２秒以上であると判断した場合に音声データを出力させる制御をし、区切位置の時間長が短い場合にはその音声データを出力させない。なお、この次の区切位置を認識するまでには、少なくとも１つの音声セグメント情報が生成されることから、それまで演算された関連度合いに関するスコア値がスコア演算手段により再演算されることとなり、メモリ８００に別の音声データＩＤ７２１が置換されている可能性がある。このため、出力制御手段９１２は、１〜２秒以上の区間となる区切位置を認識した時点でメモリ８００から音声データＩＤ７２１を取得して、音声出力させる制御を実施する。そして、音声出力させる制御としては、例えば米ＭＭＡ（MIDI Manufacturers Association）と日本ＭＩＤＩ評議会（Japan MIDI Standards Committee：ＪＭＳＣ）とにより規格化されたＧＭ（General MIDI）規格、あるいはＧＳ（General Standard）規格、またはＸＧ（Extended General MIDI）規格、さらにはＧＭレベル２規格などに基づくＭＩＤＩメッセージを利用するなどしてもよい。 The output control unit 912 performs control to output the audio data selected by the selection unit 911 as sound from the speaker 510 of the sound generation unit 500. For example, when the output control unit 912 recognizes the break position in the external sound data, the output control unit 912 reads the audio data corresponding to the audio data ID 721 stored in the memory 800 from the storage unit 700 and appropriately converts it into an analog signal. And output to the sound generation means 500. When recognizing and outputting the break position, the output control means 912 determines whether or not the length of time that is the section of the break position is 1 to 2 seconds or more, preferably 2 seconds or more. When it is determined that the time is equal to or greater than 2 seconds, the audio data is controlled to be output. When the time length of the separation position is short, the audio data is not output. Since at least one piece of speech segment information is generated until the next break position is recognized, the score value related to the degree of association calculated so far is recalculated by the score calculation means. 
As a result, the audio data ID 721 held in the memory 800 may have been replaced by a different one. For this reason, the output control unit 912 acquires the audio data ID 721 from the memory 800 at the moment it recognizes a break position lasting 1 to 2 seconds or more, and then performs the control to output the audio. The audio output may be controlled using, for example, MIDI messages based on the GM (General MIDI) standard, standardized by the US MMA (MIDI Manufacturers Association) and the Japan MIDI Standards Committee (JMSC), the GS (General Standard) standard, the XG (Extended General MIDI) standard, or the GM Level 2 standard.
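The output gating described above amounts to two rules: output only after a sufficiently long silent break, and read the candidate ID only at that moment. A small Python sketch under stated assumptions follows; `on_silence`, the dictionary standing in for the memory 800, and the default threshold of 2.0 seconds (the value the text calls preferable) are illustrative choices.

```python
def on_silence(silence_seconds, candidate_store, min_silence=2.0):
    """Return the voice data ID to play, or None if the pause is too short.

    candidate_store: dict standing in for the memory that holds the
    currently selected voice data ID (hypothetical representation).
    """
    if silence_seconds < min_silence:
        return None  # pause too short: do not output anything
    # Read the candidate only now, because rescoring between break
    # positions may have replaced the previously selected ID.
    return candidate_store.get("voice_data_id")


if __name__ == "__main__":
    store = {"voice_data_id": 7}
    print(on_silence(0.5, store))
    store["voice_data_id"] = 9  # rescoring replaced the candidate
    print(on_silence(2.5, store))
```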

計時手段９１３は、例えば内部クロックなどの基準パルスに基づいて現在時刻を認識する。そして、この計時手段９１３は、認識した現在時刻に関する時刻情報を適宜出力する。 The time measuring means 913 recognizes the current time based on a reference pulse such as an internal clock. Then, the time measuring unit 913 appropriately outputs time information regarding the recognized current time.

〔音声出力装置の動作〕 [Operation of the audio output device]
次に、上記音声出力装置１００の動作を図面に基づいて説明する。なお、音声情報７２０Ａの生成処理と、外部音に応じて音声を出力させる音声出力処理とは、同時に処理できるが、説明の都合上、分けて説明する。図６は、音声出力装置における音声情報の生成処理の動作を示すフローチャートである。図７は、音声出力装置における音声出力処理の動作を示すフローチャートである。 Next, the operation of the audio output device 100 will be described with reference to the drawings. Note that the generation process of the audio information 720A and the audio output process for outputting audio in accordance with the external sound can be performed simultaneously, but will be described separately for convenience of explanation. FIG. 6 is a flowchart showing the operation of the audio information generation process in the audio output device. FIG. 7 is a flowchart showing the operation of the audio output process in the audio output device.

（音声情報の生成処理） (Audio information generation process)
車両に搭乗した利用者がキー操作により車両のアクセサリ電源を投入することにより、車両のバッテリから音声出力装置１００に電力が供給される。この電力の供給により、音声出力装置１００は、演算手段９００は図示しない表示装置にメニュー画面などを表示させる処理をし、操作手段３００からの入力操作に基づく動作要求の設定の待機状態、すなわち動作待機状態となる。そして、演算手段９００は、メニュー画面に基づく操作手段３００からの音声情報７２０Ａの生成処理要求の信号を認識すると（ステップＳ１０）、例えば音声情報７２０Ａの生成方法が手動によるものか自動によるものかの選択入力を促す画面表示を表示装置に表示させる制御をする（ステップＳ１１）。 When a user in the vehicle turns on the vehicle's accessory power by a key operation, power is supplied from the vehicle battery to the audio output device 100. With this power supply, the calculation unit 900 of the audio output device 100 displays a menu screen or the like on a display device (not shown), and the device enters a state of waiting for an operation request set by an input operation on the operation unit 300, that is, an operation standby state. When the calculation unit 900 recognizes a signal from the operation unit 300, based on the menu screen, requesting generation of the audio information 720A (step S10), it controls the display device to show, for example, a screen prompting the user to select whether the audio information 720A is to be generated manually or automatically (step S11).

そして、ステップＳ１１において、手動による音声情報７２０Ａの生成処理要求を認識すると、演算手段９００は音声データを取得する処理をする（ステップＳ２０１）。この音声データの取得処理としては、例えばいずれの方法で音声データを取得するかの取得方法の操作手段３００による設定入力を促す画面表示をしたり、音声データを格納する機器やサーバなどの配信元を特定する操作手段３００による設定入力を促す画面表示などをしたりする制御をし、設定された取得方法で取得したり特定された配信元から音声データを取得する処理をする。 In step S11, upon recognizing the manual generation processing request for the audio information 720A, the computing unit 900 performs processing for acquiring audio data (step S201). As the voice data acquisition process, for example, a screen display for prompting the setting input by the operation unit 300 of the acquisition method of the voice data to be acquired, or a distribution source such as a device or a server storing the voice data Control for displaying a screen for prompting a setting input by the operation means 300 for specifying, and performing a process of acquiring audio data from the specified distribution source, acquired by a set acquisition method.

具体的には、例えば所定の音声に関する配信データを配信するサーバ装置や各種放送番組から配信データを受信して音声データを取得する場合、演算手段９００は、通信手段２００を制御してサーバ装置からネットワークを介して所望の音声の配信データを受信させ、外部音取得手段９０１で外部音データとして取得させて記憶手段７００に記憶させるとともに、出力制御手段９１２にてスピーカ５１０から出力させる処理をする。そして、利用者がスピーカ５１０から音声出力される状況を認識しつつ操作手段３００の操作により出力される音声から音声データとして切り出す開始位置と終了位置とを設定すると、区切位置認識手段９０３が配信データにおける開始位置と終了位置とを認識し、音声データ生成手段９０６が開始位置および終了位置間の配信データを音声データとして生成する。なお、記憶手段７００に記憶した配信データは、入力操作に基づいて削除したり、音声データを生成後に自動的に削除したりすればよい。 Specifically, for example, when receiving distribution data from a server device that distributes distribution data related to a predetermined sound or various broadcast programs to acquire sound data, the calculation unit 900 controls the communication unit 200 to control the communication from the server device. The desired audio distribution data is received via the network, is acquired as external sound data by the external sound acquisition means 901, is stored in the storage means 700, and is output from the speaker 510 by the output control means 912. Then, when the user sets the start position and the end position to be cut out as sound data from the sound output by the operation of the operation means 300 while recognizing the situation in which the sound is output from the speaker 510, the delimitation position recognition means 903 causes the distribution data to be distributed. The voice data generation unit 906 generates distribution data between the start position and the end position as voice data. The distribution data stored in the storage unit 700 may be deleted based on an input operation, or may be automatically deleted after generating voice data.

また、例えば光ディスクなどの着脱可能な記録媒体に記録された音声に関する記録データから音声データを取得する場合、演算手段９００は、音声データ読取手段６００を動作させ、所定の記憶データを読み取らせる。そして、上述した配信データから抽出する場合と同様に、入力設定された開始位置および終了位置に基づいて音声データ生成手段９０６により音声データを生成する。なお、記憶手段７００やメモリ８００に別途記憶された記憶データから抽出する場合も同様に、記憶手段７００やメモリ８００から記憶データを読み取って抽出すればよい。これらのように、ステップＳ２０１において、音声データ生成手段９０６により生成された音声データは、新たに音声データＩＤ７２１が設定され、この音声データＩＤ７２１と関連付けられて１つのデータ構造で記憶手段７００の音声データＤＢ７１０に記憶される。 Further, when acquiring audio data from audio-related recording data recorded on a detachable recording medium such as an optical disk, the arithmetic unit 900 operates the audio data reading unit 600 to read predetermined storage data. Then, as in the case of extracting from the distribution data described above, audio data is generated by the audio data generation unit 906 based on the input and set start position and end position. Similarly, when extracting from storage data separately stored in the storage unit 700 or the memory 800, the storage data may be read and extracted from the storage unit 700 or the memory 800. As described above, in step S201, the voice data generated by the voice data generation unit 906 is newly set with the voice data ID 721. The voice data stored in the storage unit 700 is associated with the voice data ID 721 in one data structure. Stored in the DB 710.

このステップＳ２０１の後、演算手段９００は、例えば生成した音声データを出力させるための関連する語句の設定を促す旨の画面表示を表示装置に表示する。具体的には、入力操作に基づいて、操作手段３００による入力操作にてテキスト入力可能なテキストボックスを有する画面表示を表示させ、この画面表示に基づいてテキスト入力された語句を言語解析手段９０５が語句情報７２２として生成する（ステップＳ２０２）。この生成された語句情報７２２は、メモリ８００に適宜記憶される。 After step S201, the calculation unit 900 displays a screen display for prompting the setting of a related word / phrase for outputting the generated voice data, for example. Specifically, on the basis of the input operation, a screen display having a text box in which text can be input by the input operation by the operation unit 300 is displayed, and the language analysis unit 905 displays the words / phrases inputted based on the screen display. It is generated as the phrase information 722 (step S202). The generated phrase information 722 is appropriately stored in the memory 800.

さらに、ステップＳ２０２の後、演算手段９００は、例えばステップＳ２０１で生成した音声データの口語とステップＳ２０２で生成した語句情報７２２の語句との関連度合いの設定を促す旨の画面表示を表示装置に表示する。具体的には、入力操作に基づいて、操作手段３００による入力操作にて数値入力可能なテキストボックスを有する画面表示を表示させ、この画面表示に基づいて数値入力された値を関連度認識手段９０７が関連度合いのスコアと認識して関連度情報７２３を生成する（ステップＳ２０３）。この生成された関連度情報７２３は、メモリ８００に適宜記憶される。 Further, after step S202, the calculation means 900 displays a screen display for prompting the user to set the degree of association between the spoken word of the voice data generated at step S201 and the phrase of the phrase information 722 generated at step S202. To do. Specifically, based on the input operation, a screen display having a text box in which a numerical value can be input by the input operation by the operation unit 300 is displayed, and the value input based on the screen display is used as the relevance recognition unit 907. Is recognized as a score of relevance, and relevance information 723 is generated (step S203). The generated relevance information 723 is stored in the memory 800 as appropriate.

この後、音声情報生成手段９０８は、メモリ８００に記憶された音声データに対応した音声データＩＤ７２１と、語句情報７２２と、関連度情報７２３とを１つのデータ構造に関連付けて音声情報７２０Ａを生成する（ステップＳ２０４）。そして、音声情報生成手段９０８は、生成した音声情報７２０Ａを記憶手段７００の音声データ検索テーブルＤＢ７２０に記憶させる（ステップＳ２０５）。この後、演算手段９００は、新たに他の音声情報７２０Ａの生成を確認、すなわち音声情報７２０Ａの生成処理の継続か否かの操作手段３００による設定入力を促す画面表示を表示装置に表示、すなわち処理の継続か否かを判断する処理をする（ステップＳ２０６）。このステップＳ２０６で処理の継続を要求する旨の入力操作を認識すると、ステップＳ２０１に戻って、手動による音声情報７２０Ａの生成処理を継続する。一方、ステップＳ２０６で処理を継続しないすなわち終了を要求する旨の入力操作を認識すると、音声情報７２０Ａを生成する処理を終了する。 Thereafter, the voice information generation unit 908 generates the voice information 720A by associating the voice data ID 721 corresponding to the voice data stored in the memory 800, the phrase information 722, and the relevance information 723 with one data structure. (Step S204). Then, the voice information generation unit 908 stores the generated voice information 720A in the voice data search table DB 720 of the storage unit 700 (step S205). Thereafter, the computing unit 900 newly confirms the generation of other audio information 720A, that is, displays a screen display prompting a setting input by the operation unit 300 as to whether or not the generation process of the audio information 720A is continued, that is, A process for determining whether or not to continue the process is performed (step S206). When an input operation for requesting continuation of the process is recognized in step S206, the process returns to step S201, and the manual generation process of the audio information 720A is continued. On the other hand, if the input operation not requesting for termination is recognized in step S206, the process for generating the audio information 720A is terminated.

一方、ステップＳ１１において、自動による音声情報７２０Ａの生成処理要求を認識、例えば自動処理を設定する入力操作あるいは外部音に対する音声の出力処理と平行して音声情報７２０Ａを生成する処理を実施させる入力操作などを演算手段９００が認識すると、演算手段９００は集音手段４００を制御してマイクロフォン４１０にて車内の外部音を集音させる（ステップＳ３０１）。このステップＳ３０１における集音処理により、演算手段９００の外部音取得手段９０１がマイクロフォン４１０で集音する外部音に対応する信号を、図３（Ａ）に示すように一連の外部音データとして取得する。この後、演算手段９００は、音特性認識手段９０２により外部音取得手段９０１で取得した外部音データの音特性、例えば音量の大きさを認識、すなわち一連の外部音データの音量を順次監視する（ステップＳ３０２）。 On the other hand, in step S11, an input operation for recognizing an automatic generation processing request for audio information 720A, for example, performing an operation for generating audio information 720A in parallel with an input operation for setting automatic processing or an audio output processing for external sound. When the calculation unit 900 recognizes the above, the calculation unit 900 controls the sound collection unit 400 to collect the external sound in the vehicle with the microphone 410 (step S301). By the sound collection processing in step S301, the external sound acquisition means 901 of the calculation means 900 acquires signals corresponding to the external sound collected by the microphone 410 as a series of external sound data as shown in FIG. . Thereafter, the calculation unit 900 recognizes the sound characteristics of the external sound data acquired by the external sound acquisition unit 901 by the sound characteristic recognition unit 902, for example, the volume level, that is, sequentially monitors the volume of a series of external sound data ( Step S302).

このステップＳ３０２における音特性認識手段９０２での認識する音量が、所定の音量以下、すなわちいわゆる無音となったことを区切位置認識手段９０３により認識すると、区切位置認識手段９０３は無音区間を区切位置として認識し、順次取得している外部音データを区切位置で分割して音声セグメント情報を生成する（ステップＳ３０３）。なお、音声セグメント情報は、連続して切り出した少なくとも４つ分以上をメモリ８００に記憶される。そして、演算手段９００は、テキスト形式変換手段９０４により、区切位置認識手段９０３で順次生成する音声セグメント情報をテキスト形式に変換して音声テキスト情報を生成する。さらに、演算手段９００は、言語解析手段９０５により、生成された音声テキスト情報を形態素解析や構文解析などの言語解析を実施し、例えば図３（Ｂ）に示すように、音声セグメント情報に含まれる語句を認識し、これら語句に関する語句情報７２２を生成する（ステップＳ３０４）。これら生成された語句情報７２２は、音声セグメント情報と関連付けられてメモリ８００に合わせて記憶される。 When the delimiter position recognizing unit 903 recognizes that the sound volume recognized by the sound characteristic recognizing unit 902 in step S302 is equal to or lower than a predetermined volume, that is, so-called silence, the delimiter position recognizing unit 903 uses the silent section as the delimiter position. Recognized and sequentially acquired external sound data is divided at a delimiter position to generate audio segment information (step S303). Note that the audio segment information is stored in the memory 800 for at least four or more cut out continuously. Then, the calculation unit 900 converts the speech segment information sequentially generated by the delimiter position recognition unit 903 into the text format by the text format conversion unit 904 to generate the speech text information. Further, the calculation means 900 performs language analysis such as morphological analysis and syntax analysis on the generated speech text information by the language analysis means 905, and is included in the speech segment information as shown in FIG. 3B, for example. Words are recognized and word information 722 relating to these words is generated (step S304). The generated phrase information 722 is stored in the memory 800 in association with the audio segment information.
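The silence-based segmentation in steps S302 and S303 can be illustrated with a short Python sketch. It assumes the external sound data has already been reduced to one volume value per frame; the function name `split_at_silence`, the frame representation, and the sample values are assumptions for illustration.

```python
def split_at_silence(volumes, threshold):
    """Split per-frame volume values into voiced segments.

    A frame whose volume is at or below threshold counts as silent.
    Returns a list of (start, end) frame indices, end exclusive.
    """
    segments = []
    start = None
    for i, volume in enumerate(volumes):
        if volume > threshold:
            if start is None:
                start = i  # a voiced segment begins here
        elif start is not None:
            segments.append((start, i))  # silence closes the segment
            start = None
    if start is not None:
        segments.append((start, len(volumes)))  # segment runs to the end
    return segments


if __name__ == "__main__":
    print(split_at_silence([0, 5, 6, 0, 0, 7, 0], threshold=1))
```

For the sample input, the voiced frames at indices 1 to 2 and at index 5 are returned as two separate segments, with the silent frames between them acting as the break position.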

そして、演算手段９００は、音声データ生成手段９０６により、順次生成される音声テキスト情報と語句情報７２２とに基づいて、言語解析により音声テキスト情報から抽出される語句が１つだけとなる音声セグメント情報を認識し、その音声セグメント情報を音声データとする（ステップＳ３０５）。そして、音声データ生成手段９０６は、設定した音声データに新たに音声データＩＤ７２１を関連付けて記憶手段７００の音声データＤＢ７１０に記憶させる。 Then, the calculation unit 900 uses the voice data generation unit 906 based on the voice text information and the phrase information 722 sequentially generated, and the voice segment information in which only one phrase is extracted from the voice text information by language analysis. And the voice segment information is used as voice data (step S305). Then, the voice data generation unit 906 newly associates the voice data ID 721 with the set voice data and stores it in the voice data DB 710 of the storage unit 700.

このステップＳ３０５における音声データの生成処理の後、この音声データの直前から少なくとも３つ前までに切り出された音声セグメント情報から抽出された語句情報７２２の語句と、ステップＳ３０５で生成した音声データの口語である語句との関連度合いを関連度認識手段９０７により認識する（ステップＳ３０６）。すなわち、音声データまでの外部音データにおける経過時間の長さとなる音声データに対して音声セグメント情報が外部音データの時系列で前に位置する数を計数する。具体的には、図３（Ｃ），（Ｄ）に示すように、音声データに対して直前に位置する音声セグメント情報に対しては３点、２つ前では２点、３つ前では１点のスコア値を、対応する音声セグメント情報に関連付けるスコア付けの処理をする。さらに、関連度認識手段９０７は、図３（Ｅ）に示すように、スコア付けされた各音声セグメント情報から抽出された語句情報７２２の数を計数する。そして、関連度認識手段９０７は、各音声セグメント情報に関連付けられたスコア値を計数した語句情報７２２の数で除算し、この除算により得られた値を語句情報７２２の関連度合いとして関連度情報７２３を生成する。 After the speech data generation process in step S305, the phrase of the phrase information 722 extracted from the speech segment information cut out at least three times before the speech data, and the spoken phrase of the speech data generated in step S305 The relevance recognition unit 907 recognizes the degree of relevance to the word or phrase (step S306). That is, the number of audio segment information positioned in the time series of the external sound data is counted with respect to the audio data having the length of elapsed time in the external sound data up to the audio data. Specifically, as shown in FIGS. 3C and 3D, the audio segment information positioned immediately before the audio data is 3 points, 2 points before 2 points, 1 point before 3 points. A scoring process for associating the score value of the point with the corresponding speech segment information is performed. Furthermore, the relevance recognition means 907 counts the number of the phrase information 722 extracted from each scored audio segment information as shown in FIG. Then, the degree-of-relevance recognition unit 907 divides the score value associated with each piece of speech segment information by the number of word / phrase information 722 counted, and the degree of relevance information 723 is obtained as a degree of relevance of the word / phrase information 722. Is generated.

After the relevance recognition in step S306, the calculation means 900 uses the voice information generation means 908 to generate voice information 720A by associating, in a single data structure, the voice data ID 721 corresponding to the voice data with the phrase information 722 and the relevance information 723 corresponding to that voice data (step S307). The voice information generation means 908 then stores the generated voice information 720A in the voice data search table DB 720 of the storage means 700 (step S308). If, during the storing in step S308, voice information 720A with the same combination of voice data and phrase information 722 is already stored, the voice information generation means 908 reflects the newly calculated relevance information 723 in the relevance information 723 of the stored voice information 720A, for example by recalculating the degree of relevance based on Equation 2, and updates the entry with the new relevance information 723. Thereafter, the calculation means 900 determines whether it has recognized a request to stop the automatic generation of voice information 720A (step S309). If it determines in step S309 that there is no stop request, it returns to step S301 and continues the automatic generation of voice information 720A; if it determines that there is a stop request, it ends the process of generating voice information 720A.
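The store-or-update behaviour of steps S307 and S308 might be sketched as follows. The class and field names are hypothetical. Equation 2 is not reproduced in this part of the description, so the merge rule is passed in as a function, and the default used here (a plain average of the old and new values) is only a placeholder assumption, not the patent's formula.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class VoiceInfo:
    """One row of the search table: voice data ID, phrase, relevance."""
    voice_data_id: int
    phrase: str
    relevance: float


def average(old: float, new: float) -> float:
    # Placeholder merge rule (assumption); stands in for Equation 2.
    return (old + new) / 2.0


class VoiceInfoTable:
    def __init__(self, merge: Callable[[float, float], float] = average) -> None:
        self._rows: Dict[Tuple[int, str], VoiceInfo] = {}
        self._merge = merge

    def store(self, voice_data_id: int, phrase: str, relevance: float) -> VoiceInfo:
        """Insert a new row, or merge relevance into an existing one."""
        key = (voice_data_id, phrase)
        row = self._rows.get(key)
        if row is None:
            row = VoiceInfo(voice_data_id, phrase, relevance)
            self._rows[key] = row
        else:
            row.relevance = self._merge(row.relevance, relevance)
        return row

    def find_by_phrase(self, phrase: str) -> List[VoiceInfo]:
        """Return every row whose phrase matches the recognized keyword."""
        return [r for r in self._rows.values() if r.phrase == phrase]
```

Keying rows on the pair (voice data ID, phrase) mirrors the condition in the text that an update happens only when the same combination of voice data and phrase information is already stored.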

(Voice output processing)
Meanwhile, when the calculation means 900, in its standby state, recognizes a signal from the operation means 300 (for example, entered via a menu screen) requesting voice output in response to external sound (step S401), it controls the sound collection means 400 so that the microphone 410 collects the external sound inside the vehicle (step S402). Through the sound collection in step S402, the external sound acquisition means 901 of the calculation means 900 acquires the signal corresponding to the external sound collected by the microphone 410 as a series of external sound data, as shown in FIG. 3(A). The calculation means 900 then uses the sound characteristic recognition means 902 to recognize a sound characteristic of the external sound data acquired by the external sound acquisition means 901, for example its volume; that is, it sequentially monitors the volume of the series of external sound data (step S403).

When the delimiter position recognition means 903 recognizes that the volume recognized by the sound characteristic recognition means 902 in step S403 has fallen to or below a predetermined volume, that is, that there is so-called silence, it recognizes the silent section as a delimiter position and divides the sequentially acquired external sound data at the delimiter positions to generate voice segment information (step S404). The generated voice segment information is stored in the memory 800 as appropriate. The number of stored segments may be limited to a preset number, for example about ten. This reduces the load on the memory 800 and the processing load.
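A minimal sketch of the silence-based segmentation in step S404, with the stored segments capped at a preset count, could look like this. The input is assumed to be a sequence of per-frame volume values; the function name, the frame representation, and flushing a trailing segment at the end of the input are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List


def split_at_silence(
    volumes: Iterable[float],
    threshold: float,
    max_segments: int = 10,
) -> List[List[float]]:
    """Split a volume sequence into segments at silent frames.

    A frame counts as silent when its volume is at or below `threshold`.
    Only the most recent `max_segments` segments are kept, which bounds
    memory use in the way the description suggests (about ten segments).
    """
    kept: Deque[List[float]] = deque(maxlen=max_segments)
    current: List[float] = []
    for volume in volumes:
        if volume <= threshold:
            if current:  # silence closes the segment in progress
                kept.append(current)
                current = []
        else:
            current.append(volume)
    if current:  # assumption: flush a segment still open at end of input
        kept.append(current)
    return list(kept)
```

Using a bounded `deque` means the oldest segment is discarded automatically once the cap is reached, so no explicit eviction logic is needed.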

After the voice segment information is generated in step S404, the calculation means 900 uses the keyword recognition means 909 to determine whether a phrase serving as a keyword, that is, a phrase identical to the phrase of the phrase information 722 of any voice information 720A stored in the voice data search table DB 720 of the storage means 700, appears in the external sound data. In other words, it performs language analysis of the external sound data (step S405) and monitors the external sound data for the appearance of a keyword (step S406). When a keyword is recognized in step S405, the voice search means 910 retrieves from the voice data search table DB 720 of the storage means 700 the voice information 720A whose phrase information 722 contains the keyword phrase (step S407).

After step S407, the calculation means 900 uses the selection means 911 to select, from the retrieved voice information 720A and on the basis of the relevance information 723, the voice information 720A whose phrase combination has a predetermined degree of relevance. Specifically, the score calculation means 911A of the selection means 911 recognizes the distance, that is, the length of time, from the current point in the external sound data as timed by the timekeeping means 913 to the position of the phrase corresponding to the phrase information 722 of the voice information 720A retrieved by the voice search means 910 (step S408). As this length of time, the score calculation means 911A recognizes, for example, the number of voice segments counted from the voice segment information that contains the keyword phrase of that phrase information 722. The score calculation means 911A then calculates the elapsed-time score value S, a coefficient of elapsed time, based for example on Equation 1 described above, and subtracts it from the relevance score value in the relevance information 723 of each piece of voice information 720A. Further, the score calculation means 911A sums the elapsed-time-adjusted score values for each piece of voice data having the same spoken phrase and generates score information as the score value of the voice data corresponding to the phrases recognized as keywords at the current point. Based on the score values of the score information calculated through the recognition of phrase distance in step S408, the calculation means 900 uses the voice data selection means 911B to select the voice data with the highest score value as the voice output candidate and stores the voice data ID 721 corresponding to that voice data in the memory 800 as appropriate (step S409).
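The selection in steps S408 and S409 can be sketched roughly as below. Equation 1 is not reproduced in this part of the description, so the elapsed-time coefficient S is supplied by the caller as a `decay` function; the `Hit` tuple layout and the function name are likewise illustrative assumptions.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, Optional, Tuple

# (spoken phrase of the voice data, stored relevance score,
#  number of segments between the keyword's segment and the current one)
Hit = Tuple[str, float, int]


def select_candidate(
    hits: Iterable[Hit],
    decay: Callable[[int], float],
) -> Optional[str]:
    """Pick the spoken phrase with the highest time-adjusted total score.

    For each hit, the elapsed-time coefficient S = decay(segments_back)
    is subtracted from the stored relevance, and the adjusted values are
    summed per spoken phrase. Returns None when there are no hits.
    """
    totals: DefaultDict[str, float] = defaultdict(float)
    for spoken, relevance, segments_back in hits:
        totals[spoken] += relevance - decay(segments_back)
    if not totals:
        return None
    return max(totals, key=lambda spoken: totals[spoken])
```

As a usage illustration with made-up relevance values and a hypothetical linear decay of 0.30 per segment (chosen only so that S is 0 for the current segment and 0.30 for the previous one, as in the Odaiba example in this section): with hits `("Beautiful!", 1.0, 0)` and `("Seriously!", 0.8, 0)` the candidate is "Beautiful!"; once both are one segment old and a fresh hit `("Seriously!", 0.9, 0)` arrives, the summed score for "Seriously!" overtakes it.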

After step S409, the calculation means 900 uses the output control means 912 to determine whether the length of the section recognized as a delimiter position by the delimiter position recognition means 903 has reached 1 to 2 seconds or more (step S410). If it determines in step S410 that the delimiter position has not lasted 1 to 2 seconds or more, it does not read out the voice data and instead determines whether it has recognized a request to stop the voice output processing (step S411). If it determines in step S411 that there is no stop request, it returns to step S402 and continues processing. If, on the other hand, it determines in step S411 that there is a request to stop the voice output processing, it ends the process of outputting voice in response to external sound.

If, in step S410, the output control means 912 recognizes that the length of the section recognized as a delimiter position by the delimiter position recognition means 903 has reached 1 to 2 seconds or more, it reads the voice data corresponding to the voice data ID 721 stored in the memory 800 from the storage means 700, processes it as appropriate, for example by converting it into an analog signal, and outputs it to the sound generation means 500 (step S412).
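The silence check of step S410 amounts to a simple timing gate, sketched below. The description gives the threshold only as "1 to 2 seconds or more", so it is left as a parameter here; the function name, the use of timestamps in seconds, and the default of 1.0 second are assumptions.

```python
from typing import Optional


def should_output(
    silence_started_at: Optional[float],
    now: float,
    min_silence_s: float = 1.0,
) -> bool:
    """Return True once the current silence has lasted long enough.

    `silence_started_at` is the time (in seconds) at which the volume
    fell to or below the silence threshold, or None while someone is
    still speaking.
    """
    if silence_started_at is None:
        return False  # no delimiter position is currently in progress
    return (now - silence_started_at) >= min_silence_s
```

Holding the candidate back until this gate opens is what keeps the device from talking over an ongoing utterance: if speech resumes first, the candidate is simply re-evaluated on the next pass.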

Consider the case where the delimiter position recognition means 903 divides the external sound data into the voice segment information "Speaking of which, Odaiba's" and the voice segment information "Palette City", as shown for example in FIG. 5(A). First, the delimiter position recognition means 903 cuts the voice segment information "Speaking of which, Odaiba's" out of the external sound data, and the language analysis means 905 extracts the phrase information 722 [Speaking of which] and [Odaiba's]. When the keyword recognition means 909 recognizes the phrase of each piece of phrase information 722 as a keyword, the voice search means 910 retrieves, for "Speaking of which", three pieces of voice information 720A associated respectively with the voice data "What is it?", "I don't want to hear it!", and "No hurry", and, for "Odaiba's", two pieces of voice information 720A associated respectively with the voice data "Beautiful!" and "Seriously!". The score calculation means 911A then recognizes the relevance information 723 of each piece of voice information 720A. At this point the next voice segment information has not yet been recognized; that is, the next delimiter position has not yet been recognized and the voice segment information "Palette City" has not been extracted, so the voice segment information "Speaking of which, Odaiba's" is the most recent relative to the current point. Consequently, the elapsed-time coefficient S given by Equation 1 for the phrase information 722 of each piece of voice information 720A is 0, and the voice data selection means 911B selects, from the score values of the relevance information 723, the voice data "Beautiful!", which has the highest degree of relevance, as the voice output candidate.

When the output control means 912 recognizes that the currently recognized delimiter position has lasted 1 to 2 seconds, the voice data "Beautiful!" is output. Suppose instead that it determines that the delimiter position has not lasted 1 to 2 seconds, which means that the next phrase, "Palette City", is being recognized. When the next delimiter position is recognized and the voice segment information "Palette City" is cut out, [Palette City] is recognized as phrase information 722 as it is, without being decomposed further. When the keyword recognition means 909 recognizes [Palette City] as a keyword phrase, the voice search means 910 retrieves the voice information 720A in which the voice data "Seriously!" is associated with "Palette City". The score calculation means 911A then reflects the time elapsed up to the current point in the relevance information 723 of each piece of voice information 720A. Specifically, [Palette City] is the most recent relative to the current point, so its coefficient S is 0, whereas the phrase information 722 [Speaking of which] and [Odaiba's] belongs to the preceding voice segment information, so the elapsed-time coefficient S is calculated from Equation 1 and the resulting value, 0.30, is subtracted from each relevance score value. Further, the score calculation means 911A totals the score values of the voice data "Seriously!", which share the same spoken phrase, to obtain the score information. As a result, the voice data "Beautiful!", which had the highest score value until then, now has a lower degree of relevance owing to the passage of time, while "Seriously!" now has the highest degree of relevance, and the voice output candidate is updated from "Beautiful!" to "Seriously!". In this way, the phrase most relevant to the utterance at the current point is selected and made ready for voice output.

After the voice data is output in step S412, the calculation means 900 uses the relevance recognition means 907 to recognize, from the external sound data that follows in time series the voice segment information at which the voice data was output, the evaluation of the content of the output voice data, and changes the degree of relevance of the voice information 720A of the output voice data according to that evaluation. Specifically, it determines whether the evaluation was favorable, for example by recognizing laughter from the sound characteristics of the external sound data or by recognizing, through language analysis, content that affirms the voice output such as "nice", "funny", or "hilarious" (step S413). If the relevance recognition means 907 determines in step S413 that the evaluation was favorable, it updates the voice information 720A by setting the score value of the relevance information 723 of the output voice information 720A to a higher value, for example by adding a preset value (step S414), and proceeds to step S411. If, on the other hand, it determines in step S413 that the evaluation was poor, for example because silence continues or because language analysis recognizes negative content such as "boring", "annoying", or "in the way", it updates the voice information 720A by setting the score value of the relevance information 723 to a lower value, for example by subtracting a preset value (step S415), and proceeds to step S411. The processing that changes the score value according to the evaluation is not limited to adding or subtracting a preset value; for example, the value added may grow as the laughter becomes louder or as the number of affirming words increases, or operations other than addition and subtraction may be performed using variables and coefficients.
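The feedback loop of steps S413 to S415 can be sketched as a small update function. The word lists are taken from the examples in the text (in English rendering); the function name, the fixed `step` of 0.5, the substring matching, and leaving the score unchanged when the reaction is neither clearly favorable nor clearly poor are assumptions made for this illustration.

```python
from typing import Tuple

POSITIVE: Tuple[str, ...] = ("nice", "funny", "hilarious")
NEGATIVE: Tuple[str, ...] = ("boring", "annoying", "in the way")


def adjust_relevance(
    score: float,
    transcript: str,
    laughter_detected: bool,
    step: float = 0.5,
) -> float:
    """Raise or lower a relevance score from the reaction that followed.

    `transcript` is the recognized text of what was said after the voice
    output; an empty transcript stands for continued silence.
    """
    text = transcript.lower()
    if laughter_detected or any(word in text for word in POSITIVE):
        return score + step  # favorable evaluation (step S414)
    if not text.strip() or any(word in text for word in NEGATIVE):
        return score - step  # poor evaluation (step S415)
    return score  # assumption: a neutral reaction leaves the score as is
```

A graded variant, as the text suggests, could scale `step` by the loudness of the laughter or by the number of affirming words instead of using a fixed value.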

[Operation and effects of the voice output device]
As described above, in the above embodiment, phrases contained in a series of external sound data relating to the collected external sound are recognized. Voice information 720A having phrase information 722 relating to a recognized phrase is retrieved from the storage means 700, which has a table structure storing a plurality of pieces of voice information 720A, each of which combines in a single data structure voice data relating to voice extracted from spoken utterances, phrase information 722 relating to a phrase, and relevance information 723 relating to the degree of relevance between the spoken phrase of the voice data and the phrase of the phrase information 722. From the retrieved voice information, the voice data of the voice information whose relevance information 723 has, for example, the highest score value is selected and output as voice from the speaker 510 as appropriate.

Consequently, voice data is output with a degree of relevance that follows the flow of the conversation, so that playful interjections (so-called chacha) are inserted into the conversation, and the output voice serves as a cue that readily stimulates further talk. A pleasant driving environment is therefore easily obtained. Moreover, because voice that is entirely unrelated to the flow of the conversation is not output, the conversation is not disrupted and a good conversational environment can be provided.

The voice data with the highest degree of relevance is output as voice. Because the voice most relevant at the current point in the flow of the conversation is output as a playful interjection (so-called chacha), conversation is stimulated effectively.

In addition, the sound characteristics of a series of external sound data relating to the collected external sound are recognized, the delimiter positions at which the external sound data is divided are recognized on the basis of those sound characteristics, and part of the external sound data, namely the voice segment information lying between delimiter positions, is generated as voice data. The degree of relevance between the generated voice data and the phrase information 722 extracted from the voice segment information located before and after it in the external sound data is then calculated to obtain the relevance information 723, and the voice data, the phrase information 722, and the relevance information 723 on the degree of relevance of the phrase of that phrase information 722 are associated with one another to generate voice information 720A in a single data structure.

Because the voice information 720A is thus generated from voice data extracted from the conversation in the vehicle itself, as the voice to be output in response to phrases in that conversation, a configuration that uses this voice information 720A to output voice data in response to external sound outputs voice data with a degree of relevance that follows the flow of the conversation; playful interjections (so-called chacha) are inserted into the conversation, and the output voice serves as a cue that readily stimulates further talk. A pleasant driving environment is thereby easily obtained. Moreover, because voice entirely unrelated to the flow of the conversation is not output, the conversation is not disrupted and a good conversational environment can be provided.

In particular, the voice data is extracted from external sound data based on utterances. Because sound that does not correspond to the flow of the conversation, such as mechanical noise, is therefore not output as voice data, the conversation is even less likely to be disrupted and a good conversational environment is obtained.

Furthermore, the voice segment information on which the voice data and the phrase information 722 are based is cut out according to the volume of the external sound data. Compared with cutting out the external sound data phrase by phrase, this allows the phrase information 722 to be associated with the voice data with a degree of relevance that better follows the flow of the conversation, so that appropriate voice can be output in response to external sound and a good conversational environment can be provided.

In addition, so-called silent portions at or below a predetermined volume are recognized as the positions at which the voice data and voice segment information are cut out, that is, the positions at which the external sound data is divided and the flow of the conversation breaks. Voice data and voice segment information that follow the flow of the conversation can therefore be extracted easily, and voice information 720A for providing a good conversational environment can be generated easily. Moreover, the external sound data need only be recognized as a volume-based waveform, so the configuration is easily simplified.

Furthermore, voice text information is generated from the voice segment information, phrases are recognized from this voice text information by language analysis such as morphological analysis and syntactic analysis, and each phrase is associated as a keyword for triggering output of the voice data to generate the voice information 720A. Voice information 720A that can appropriately output playful interjections (so-called chacha) in response to external sound can therefore be generated easily.

The relevance recognition means 907 sets the degree of relevance to the voice data on the basis of the distance to the voice data, that is, the length of time or the number of pieces of voice segment information. An appropriate degree of relevance can therefore be set with respect to the phrase serving as the keyword for selecting the voice data to be output in response to the conversation at the current point. Voice that is well related to the conversation can thus be output. In particular, using the length of time and the number of phrases makes it easy to set a degree of relevance for selecting appropriate voice data at the moment a playful interjection is to be inserted into the flow of the conversation, which in turn makes the voice information 720A easy to set up.

Moreover, the relevance recognition means 907 divides the relevance score value corresponding to the distance from the voice data by the number of pieces of phrase information 722 in that voice segment information. That is, when a piece of voice segment information contains several pieces of phrase information 722, the relevance of each to the voice data, which is the phrase spoken afterwards, is diluted in proportion to the number of pieces of phrase information 722, so the degree of relevance to the voice data can be set more appropriately.

In addition, the voice information 720A is generated by associating the phrase information 722 and the relevance information 723 with the voice data ID 721 that identifies the voice data. The voice information 720A corresponding to a phrase recognized in the external sound data being collected can therefore be searched with a relatively small load, the search for voice information 720A is easily sped up, and voice is output well in response to external sound. Furthermore, the data volume of the voice information 720A is small, which makes it easy to build and maintain the table structure of the storage means 700.

Furthermore, by basing the score value of the relevance information 723 of the voice information 720A on the frequency with which the combination of the spoken phrase of the voice data and the phrase of the phrase information 722 appears, voice that better matches the flow of the dialogue can be output as a playful interjection (so-called chacha).

In addition, the score value of the relevance information 723 is changed on the basis of how the user or the other participants in the dialogue evaluate the voice after it has been output. Voice output that better matches the user's preferences is therefore obtained.

Furthermore, the evaluation of the voice output is recognized by, for example, detecting laughter or searching for affirming phrases through language analysis. The evaluation can therefore be recognized easily and the configuration is easily simplified.

When the output control means 912 recognizes that a so-called silent period, during which the collected external sound data stays at or below a predetermined volume, has lasted a predetermined time, it performs control to output the voice data of the selected voice information 720A. This avoids the annoyance of, for example, voice data being output frequently in the middle of a dialogue. Furthermore, by outputting the voice data upon recognizing that the silent period has lasted 2 seconds or more, the output voice data triggers the start of dialogue, so dialogue is also promoted.

In addition, voice information 720A can be generated from voice data acquired by the communication means 200 from a server device or the like via a network, or from voice data recorded on a recording medium and acquired by the voice data reading means 600. The configuration is thus not limited to automatic extraction from the external sound data; voice data can also be obtained manually to generate voice information 720A, so that voice output in response to external sound can follow the user's preferences, for example by using the voice of a person the user likes.

Because the calculation means 900 is configured as a program running on, for example, a CPU (Central Processing Unit), installing the program readily yields a configuration that can output voice following the flow of the conversation, and its use is easily expanded. Furthermore, by recording the program on a recording medium and having the calculation means 900, that is, a computer, read it as appropriate, a configuration that can output voice following the flow of the dialogue is easily obtained, the program is easy to handle, and its use is easily expanded. The calculation means in the present invention is not limited to a single computer; it also includes a configuration in which a plurality of computers are combined in a network, an element such as the CPU or microcomputer mentioned above, and a circuit board on which a plurality of electronic components are mounted.

[Modifications of the embodiment]
The present invention is not limited to the embodiments described above and includes the following modifications within the scope in which the object of the present invention can be achieved.

That is, the invention is not limited to moving vehicles and can be applied to any moving body, such as an airplane or a ship. Furthermore, as noted above, it is not limited to installation in a vehicle; it may, for example, be installed in a building such as a house so as to reproduce the external environment in the space of a room.

The sound collecting means 400 may include, for example, four microphones 410 placed at the four corners of the vehicle's interior space so that external sound in the cabin can be collected from four directions. With this configuration, the volume characteristics and the like of the external sound data collected by each microphone 410 may be recognized to identify which occupant of the vehicle is speaking in the external sound data, and the position at which the speaking occupant changes may be used as the delimiter position. Furthermore, frequency, accent, and the like may be recognized as sound characteristics of the external sound to detect a change of the speaking occupant, and that position may be used as the delimiter position. Recognizing the speaker in this way makes it easier to identify the phrases related to the voice data of a single remark in a dialogue, gives a more appropriate setting of the degree of relevance, and yields voice output that is more relevant to the utterance.
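As an illustration only, the four-microphone variation above could be sketched in Python as follows. The sketch assumes that the loudest microphone in each frame indicates the occupant who is speaking; that rule, the function name `speaker_change_positions`, and the frame representation are hypothetical and are not taken from the specification.

```python
def speaker_change_positions(frames):
    """Return the frame indices at which the loudest microphone changes.

    frames -- list of tuples, one volume value per microphone for each frame.
    Assumption: the loudest microphone is the one nearest the current speaker.
    """
    positions = []
    previous = None  # index of the loudest microphone in the previous frame
    for i, volumes in enumerate(frames):
        loudest = max(range(len(volumes)), key=lambda m: volumes[m])
        if previous is not None and loudest != previous:
            positions.append(i)  # candidate delimiter position
        previous = loudest
    return positions
```

Each returned index is a candidate delimiter position. A practical system would presumably smooth the volumes over several frames first, so that brief noise does not register as a change of speaker.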

In the description above, the voice information 720A is generated by associating the phrase information 722 and the relevance information 723 with the voice data ID 721 that identifies the voice data. Alternatively, the voice information 720A may be generated as a data structure in which the phrase information 722 and the relevance information 723 are associated directly with the voice data. Such a configuration makes it easy to simplify the table structure of the storage means 700.
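The two layouts can be contrasted with a short Python sketch. It is illustrative only: the class names `VoiceInfoById` and `VoiceInfoDirect`, the field names, the helper `to_direct`, and the use of `bytes` to stand in for recorded audio are all assumptions, not part of the specification.

```python
from dataclasses import dataclass


@dataclass
class VoiceInfoById:
    """Layout of the embodiment: the record refers to the voice data by ID."""
    voice_data_id: int   # corresponds to voice data ID 721
    phrase: str          # corresponds to phrase information 722
    relevance: float     # corresponds to relevance information 723


@dataclass
class VoiceInfoDirect:
    """Alternative layout: phrase and relevance attached directly to the voice data."""
    voice_data: bytes    # the audio itself (placeholder type, an assumption)
    phrase: str
    relevance: float


def to_direct(record: VoiceInfoById, audio_table: dict[int, bytes]) -> VoiceInfoDirect:
    """Resolve the ID through a separate audio table and inline the audio."""
    return VoiceInfoDirect(audio_table[record.voice_data_id], record.phrase, record.relevance)
```

In the direct layout the separate audio table is no longer needed at lookup time, which is one way to read the simplification of the table structure mentioned above.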

Japanese was used as the example language for the configuration that converts the voice segment information into text format and recognizes phrases by language analysis. However, the target is not limited to Japanese; any language, such as English or Chinese, can be handled.

In the description above, volume is detected as a sound characteristic of the external sound data, and a so-called silent part, where the volume is at or below a predetermined value, is used as the delimiter position at which the external sound data is divided, that is, as the cut-out position of the voice segment information. However, the delimiter position may be recognized by any method. For example, the frequency of the sound, a voiceprint, or the like may be detected to identify the speaker, and the position at which the speaker changes may be used as the delimiter position; or clauses may be recognized by language analysis or the like, and the position at which a clause ends or at which the text is decomposed into phrases may be recognized as the delimiter position.
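The volume-based delimiter recognition described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the external sound data is represented as a list of per-frame volume values, and the function name `split_at_silence` and the parameter `min_silent_frames` are hypothetical.

```python
def split_at_silence(volumes, threshold, min_silent_frames=1):
    """Return (start, end) index pairs of voice segments; end is exclusive.

    A frame counts as silent when its volume is <= threshold. A run of at
    least `min_silent_frames` silent frames is treated as a delimiter position.
    """
    segments = []
    start = None     # index where the current segment began
    silent_run = 0   # length of the current run of silent frames
    for i, v in enumerate(volumes):
        if v <= threshold:
            silent_run += 1
            if start is not None and silent_run >= min_silent_frames:
                # Close the segment at the first frame of the silent run.
                segments.append((start, i - silent_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        # Trim any short trailing silence from the last segment.
        segments.append((start, len(volumes) - silent_run))
    return segments
```

Silent runs shorter than `min_silent_frames` stay inside the surrounding segment, so a brief pause within a remark does not split it.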

In the description above, the external sound data is decomposed into voice segment information, each segment is recognized as phrases by language analysis, and the voice data and the phrase information 722 serving as keywords are generated. Alternatively, without decomposing into voice segment information, a single remark may be extracted as voice data directly from the external sound data by language analysis, and the phrases uttered before that voice data may be recognized and associated with it as phrase information 722.

In the description above, the user's situation after voice output, that is, the evaluation of the output voice data, is recognized, and the degree of relevance is changed according to the evaluation. However, this process of changing the degree of relevance according to the evaluation may be omitted. Alternatively, the evaluation may be set as the degree of relevance as is.
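The three options in the preceding paragraph (adjust the degree of relevance, leave it unchanged, or use the evaluation itself) can be sketched in Python as follows. The value range of 0 to 1, the linear adjustment rule, and the names `update_relevance`, `mode`, and `rate` are assumptions made for illustration; the specification does not give a formula here.

```python
def update_relevance(relevance, evaluation, mode="adjust", rate=0.5):
    """Return the new degree of relevance for a voice data item that was output.

    relevance  -- current degree of relevance (assumed to lie in [0, 1])
    evaluation -- evaluation recognized after output (assumed to lie in [0, 1])
    mode       -- "adjust":  move the relevance toward the evaluation by `rate`
                  "none":    leave the relevance unchanged
                  "replace": use the evaluation itself as the relevance
    """
    if mode == "none":
        return relevance
    if mode == "replace":
        return evaluation
    if mode == "adjust":
        # Linear interpolation toward the evaluation (an assumed rule).
        return relevance + rate * (evaluation - relevance)
    raise ValueError(f"unknown mode: {mode}")
```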

In the description above, the voice information 720A is generated automatically by extracting voice data from the external sound data, and can also be generated manually from voice data acquired via a network or from a recording medium. However, the configuration may support only one of automatic or manual generation. Furthermore, voice information 720A may be acquired via a network from another voice output device 100, storage means 700, or the like. With such a configuration, a device that otherwise acquires voice information 720A automatically from external sound data obtains voice information 720A whose degree of relevance reflects other users rather than the content of the user's own utterances. The voice output as an interjection in response to an utterance then becomes more unexpected, which further stimulates conversation. Moreover, the output is in a voice other than that of the user or of the people the user has talked with, which adds to the unexpectedness.

In the description above, the output control means 912 performs control to output the voice data of the voice information 720A whose relevance information 723 indicates the highest degree of relevance among the voice information 720A selected by the selection means 911. Alternatively, one of the selected voice information 720A may be chosen at random and output. Also, because phrase information 722 is associated with voice data in the voice information 720A on the premise that the combination of phrases has a certain affinity, selecting any of the voice information 720A detected by the voice search means 910 is already based on a certain degree of relevance. The selection means 911 may therefore select one of the detected voice information 720A at random for output.
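The two selection strategies above, highest degree of relevance or random choice, can be sketched in Python as follows. The function name `choose_voice_info`, the `(voice_data, relevance)` tuple representation, and the optional `rng` parameter are hypothetical and serve only to illustrate the idea.

```python
import random


def choose_voice_info(candidates, strategy="highest", rng=None):
    """Pick one (voice_data, relevance) pair from the retrieved candidates.

    strategy -- "highest": the candidate with the highest degree of relevance
                "random":  any candidate, chosen uniformly at random
    rng      -- optional random.Random instance for reproducible choices
    """
    if not candidates:
        return None
    if strategy == "highest":
        return max(candidates, key=lambda c: c[1])
    if strategy == "random":
        return (rng or random).choice(candidates)
    raise ValueError(f"unknown strategy: {strategy}")
```

Random choice is still constrained by relevance in a weak sense, because every candidate was retrieved through its associated phrase information in the first place.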

In the description above, the output control means 912 performs control to output the voice data of the selected voice information 720A when it recognizes that a so-called silent period, in which the collected external sound data is at or below a predetermined volume, has lasted for a predetermined time. Alternatively, the voice data may be output upon recognizing the silence that forms a delimiter position, without waiting for the predetermined time to elapse. Furthermore, the voice data may be output immediately when voice information 720A corresponding to a phrase is detected. The silent period was set to about 1 to 2 seconds, but it is not limited to this period. The silent period and the timing of outputting voice data may also be made settable through the operation means 300.
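The output-timing options above can be summarized in a small Python sketch. The function name `should_output`, the mode names, and the default of 1.0 second (taken from the lower end of the 1 to 2 second range mentioned in the text) are assumptions for illustration.

```python
def should_output(phrase_matched, silence_seconds, mode="after_silence", required_silence=1.0):
    """Decide whether the selected voice data should be output now.

    phrase_matched   -- True once voice information matching a phrase was found
    silence_seconds  -- how long the collected sound has stayed at or below the
                        volume threshold (0.0 while someone is speaking)
    mode             -- "after_silence": wait until `required_silence` seconds pass
                        "at_delimiter":  output as soon as silence is recognized
                        "immediate":     output as soon as a match is found
    """
    if not phrase_matched:
        return False
    if mode == "immediate":
        return True
    if mode == "at_delimiter":
        return silence_seconds > 0.0
    if mode == "after_silence":
        return silence_seconds >= required_silence
    raise ValueError(f"unknown mode: {mode}")
```

Exposing `mode` and `required_silence` as parameters mirrors the suggestion that the silent period and output timing be settable through the operation means 300.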

In the description above, for the voice information 720A retrieved by the voice search means 910, the score value of the degree of relevance is calculated so that it decreases as the elapsed time increases, the score values are summed for each voice data item having the same spoken words to generate score information, and the voice data is selected according to how high the score value in the score information is. Alternatively, the summation of score values over voice data with the same spoken words may be omitted: only the calculation that reduces the score value according to elapsed time may be performed, and the voice data with the highest score value at the current point may be selected as the output candidate; or the voice data with the highest degree of relevance may simply be selected as the output candidate based on the degree of relevance of each voice information 720A. In addition, when score values are summed for the same spoken words, control may be performed so that the voice data output is the one that had the highest degree of relevance among the individual voice data before summation.
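The scoring variants above can be sketched in Python as follows. This is only a sketch: exponential decay with a `half_life` parameter is an assumed stand-in for the time-based reduction (the embodiment uses coefficient values set out in a table), and the function names and the tuple layout of `hits` are hypothetical.

```python
from collections import defaultdict


def decayed_score(relevance, elapsed_seconds, half_life=10.0):
    """Reduce a degree of relevance as elapsed time grows (decay form is assumed)."""
    return relevance * 0.5 ** (elapsed_seconds / half_life)


def select_by_summed_score(hits, half_life=10.0):
    """hits: list of (voice_data_id, spoken_words, relevance, elapsed_seconds).

    Scores are decayed by elapsed time and summed per identical spoken words.
    From the group with the highest total, the ID of the voice data whose own
    relevance (before summation) is highest is returned.
    """
    totals = defaultdict(float)
    for _id, words, relevance, elapsed in hits:
        totals[words] += decayed_score(relevance, elapsed, half_life)
    if not totals:
        return None
    best_words = max(totals, key=totals.get)
    best_hit = max((h for h in hits if h[1] == best_words), key=lambda h: h[2])
    return best_hit[0]


def select_without_summation(hits, half_life=10.0):
    """Variant: skip the summation and take the single highest decayed score."""
    if not hits:
        return None
    return max(hits, key=lambda h: decayed_score(h[2], h[3], half_life))[0]
```

With summation, a remark that was recorded several times with moderate relevance can outrank a single highly relevant remark; without summation, the single highest score always wins.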

Control of sound characteristics is not limited to recognizing external sound data or outputting voice data based on parameter values of MIDI messages conforming to the MIDI standard; for example, control may be based on current values.

Although the storage means 700 was described as being provided within the device, the storage means 700 may instead be a separate unit connected to the device body via a network as a system configuration. With such a configuration, the voice information 720A can be managed centrally, new voice information 720A can be registered, updated, and corrected easily, and the device configuration can be simplified.

In the description above, the voice information 720A is generated by associating the voice data with phrase information 722 about uttered phrases. Alternatively, for example, when 「またかよ」 ("not again") is uttered in response to the sound of a railroad crossing, that utterance may be generated as voice data, and the voice information 720A may be generated by associating the crossing sound, as other sound data, with this voice data together with relevance information 723 of a predetermined degree of relevance. The sound that triggers voice output is thus not limited to phrases; any sound can be targeted.

In addition, the specific structures and procedures used in carrying out the present invention may be changed as appropriate to other structures and the like within the range in which the object of the present invention can be achieved.

[Effects of the Embodiment]
As described above, phrases contained in a series of external sound data relating to the collected external sound are recognized. Voice information 720A having phrase information 722 relating to the recognized phrases is retrieved from the storage means 700, which has a table structure storing a plurality of items of voice information 720A. Each item combines, in one data structure, voice data relating to a voice extracted from spoken utterances, phrase information 722 relating to a phrase, and relevance information 723 relating to the degree of relevance between the spoken words of the voice data and the phrase of the phrase information 722. From the retrieved voice information, voice data is selected on the basis of the relevance information 723, for example the voice data of the voice information whose relevance information has the highest score value, and is output as voice from the speaker 510 as appropriate. As a result, voice data is output with a degree of relevance that follows the flow of the conversation, so that an interjection is inserted into the conversation, and the output voice acts as a trigger that readily encourages further utterances.

A block diagram showing the schematic configuration of the voice output device according to one embodiment of the present invention.
A conceptual diagram showing the schematic table structure of the voice data search table database of the storage means in the embodiment.
An explanatory diagram conceptually showing how voice data and phrase information are extracted from external sound data in the embodiment: (A) is a waveform diagram based on the volume of the external sound data, (B) shows the phrases of the extracted phrase information, (C) shows score values relating to distance with respect to voice data A, (D) shows score values relating to distance with respect to voice data B, and (E) shows the number of items of phrase information extracted from the voice segment information.
An explanatory diagram showing, in table form, the set values of the coefficient for the elapsed distance of a phrase with respect to voice data in the embodiment.
An explanatory diagram showing, in table form, the calculation of score values for the phrases corresponding to the voice data of a phrase recognized as a keyword in the embodiment.
A flowchart showing the operation of the voice information generation process in the voice output device of the embodiment.
A flowchart showing the operation of the voice output process in the voice output device of the embodiment.

Explanation of symbols

400 ... Sound collecting means that also functions as external sound acquisition means
900 ... Calculation means serving as a voice output control device that can function as a voice output control system
900A ... Voice information generation unit, which is a voice information generation device serving as calculation means
900B ... Voice data output control unit that also functions as a voice output control device
901 ... External sound acquisition means
902 ... Sound characteristic recognition means
903 ... Delimiter position recognition means
904 ... Text format conversion means
905 ... Language analysis means that also functions as phrase recognition means
906 ... Voice data generation means
907 ... Relevance recognition means that also functions as change means
908 ... Voice information generation means
910 ... Voice search means
911 ... Selection means
912 ... Output control means

Claims

An audio output control device that performs control to output voice according to collected external sound, comprising:
storage means constructed in a table structure that stores a plurality of items of voice information, each configured by associating, in one data structure, voice data relating to a voice extracted from spoken utterances, phrase information relating to a phrase, and relevance information relating to the degree of association between the spoken words of the voice data and the phrase of the phrase information;
external sound acquisition means for acquiring a series of external sound data relating to the external sound;
phrase recognition means for recognizing a phrase included in the external sound;
voice search means for searching the storage means for the voice information corresponding to the recognized phrase;
selection means for selecting predetermined voice data, based on the relevance information, from the voice information obtained by the search; and
output control means for performing control to output the selected voice data as voice from a speaker.

The audio output control device according to claim 1,
The speech output control device, wherein the phrase recognition means recognizes the phrase by language analysis of the external sound data.

The audio output control device according to claim 1 or 2, wherein
the output control means performs control to output, from the speaker, the voice data of the voice information whose relevance information indicates the highest degree of association among the voice information selected by the selection means.

The voice output control device according to any one of claims 1 to 3,
The audio output control device, wherein the output control means sets a high degree of association of the relevance information of the audio information corresponding to the audio data output from the speaker.

The voice output control device according to claim 1, wherein:
The selection means includes a score calculation means for adding up the degree of association for each voice data and generating score information regarding the score based on the degree of association of the plurality of pieces of voice information searched by the voice search means, and the score An audio output control device comprising: audio data selection means for selecting the audio data having a predetermined score based on the information.

The audio output control device according to claim 5,
The score calculation means of the selection means performs an operation that reduces the score value of the score information as the length of time from the current time in the acquired external sound data to the position of the phrase corresponding to the phrase information of the voice information retrieved by the voice search means increases.

The audio output control device according to claim 5,
The score calculation means of the selection means performs an operation that reduces the score value of the score information as the number of phrases from the current time in the acquired external sound data to the position of the phrase corresponding to the phrase information of the voice information retrieved by the voice search means increases.

The voice output control device according to any one of claims 5 to 7,
The output control unit controls the output of the audio data having a predetermined score from the speaker based on the score information.

The voice output control device according to any one of claims 5 to 8,
The output control means controls to output the audio data having the highest score value from the speaker based on the score information.

The voice output control device according to any one of claims 1 to 9,
Sound characteristic recognition means for recognizing sound characteristics of the external sound data acquired by the external sound acquisition means;
Partition position recognition means for recognizing a partition position where the external sound data is partitioned based on the sound characteristics,
The audio output control apparatus, wherein the output control means performs control to output the audio data from the speaker at the recognized delimiter position.

The voice output control device according to claim 10,
The audio output control device, wherein the delimiter position recognition means recognizes, as the delimiter position, a section at or below a predetermined volume, based on the volume level among the sound characteristics of the external sound data.

The voice output control device according to claim 10 or 11,
The audio output control device, wherein the delimiter position recognition means recognizes, as the delimiter position, a position at which the rate of change in volume is equal to or higher than a predetermined rate, based on the rate of change in volume among the sound characteristics of the external sound data.

The voice output control device according to any one of claims 10 to 12,
The delimiter position recognition means recognizes a textual phrase based on the sound characteristics of the external sound data, and recognizes, as the delimiter position, a position at which the phrase is decomposed into words.

The voice output control device according to any one of claims 10 to 13,
The voice output control device characterized in that the break position recognition means recognizes a change in the sound generation direction of the external sound and recognizes the position of the external sound data at which the sound generation direction changes as the break position.

The voice output control device according to any one of claims 10 to 14,
The voice output control device characterized in that the break position recognition means recognizes a change in a person based on a change in sound quality in sound characteristics of the external sound data, and recognizes a position where the person changes as the break position.

The voice output control device according to any one of claims 10 to 15,
The output control means controls to output the selected audio data from the speaker when recognizing that the period of the separation position recognized by the separation position recognition means is a predetermined time or more. Audio output control device.

The voice output control device according to any one of claims 1 to 16,
The storage means is constructed in a table structure that stores a plurality of items of voice information, each configured by associating, in one data structure, voice data relating to the voice extracted from a series of external sound data relating to the external sound, phrase information relating to phrases located before and after the voice data in the external sound data, and relevance information relating to the degree of association between the voice of the voice data and the phrase of the phrase information.

An audio output control device according to any one of claims 1 to 17,
The voice output control apparatus characterized in that the selection means selects the voice information other than the voice information of the voice data last output by the output control means.

The audio output control device according to any one of claims 1 to 18,
Equipped with a time measuring means for measuring time,
The audio information stored in the storage means is associated with time information related to time,
The voice output control device, wherein the voice search means selects the voice information associated with time information corresponding to a time counted by the time counting means.

The voice output control device according to any one of claims 1 to 19,
A voice output control device comprising voice information generation means for acquiring the external sound and generating the voice information stored in the storage means.

The audio output control device according to claim 20,
The audio output control device, wherein the audio information generation unit generates the audio information based on external sound data acquired by the external sound acquisition unit.

The audio output control device according to claim 20 or claim 21, wherein
The voice information generating means includes voice data related to the voice extracted from a series of external sound data related to the external sound, phrase information related to words included in positions before and after in the external sound data with respect to the voice data, and An audio output control apparatus, wherein the audio information is generated by associating relevance information related to the relevance of the audio of the audio data and the phrase of the phrase information with one data structure.

The voice output control device according to any one of claims 20 to 22,
The voice information generation means recognizes an evaluation of the output voice data based on the external sound data located after the time at which the voice data was output from the speaker under the control of the output control means, and calculates the degree of association according to the content of the evaluation.

The audio output control device according to any one of claims 20 to 23, wherein:
The voice information generation means recognizes an evaluation of the output voice data based on the external sound data located after the time at which the voice data was output from the speaker under the control of the output control means, and performs an operation that changes the degree of association of the relevance information of the voice data according to the content of the evaluation.

The audio output control device according to any one of claims 20 to 24, wherein:
The voice information generation means recognizes, as the evaluation, the loudness of laughter in the external sound data, and calculates the degree of association of the relevance information according to the recognized loudness of the laughter.

The audio output control device according to any one of claims 20 to 25, wherein
The voice information generation means recognizes, as the evaluation, the degree to which the content affirms the output voice data, by syntax analysis of the external sound data, and calculates the degree of association of the relevance information according to the recognized degree of affirmation.

The voice output control device according to any one of claims 20 to 26, wherein:
The voice information generation means calculates the frequency with which the combination of the voice of the voice data and the phrase of the phrase information appears in the external sound data, and changes the degree of association of the relevance information according to that appearance frequency.

An audio output control system that performs control to output voice according to collected external sound, comprising:
storage means constructed in a table structure that stores a plurality of items of voice information, each configured by associating, in one data structure, voice data relating to a voice extracted from spoken utterances, phrase information relating to a phrase, and relevance information relating to the degree of association between the spoken words of the voice data and the phrase of the phrase information; and
a terminal device connected to the storage means via a network so as to be able to acquire the voice information, the terminal device comprising: external sound acquisition means for acquiring a series of external sound data relating to the external sound; phrase recognition means for recognizing a phrase included in the external sound; voice search means for searching the storage means, via the network, for the voice information corresponding to the recognized phrase; selection means for selecting predetermined voice data, based on the relevance information, from the voice information obtained by the search; and output control means for performing control to output the selected voice data as voice from a speaker.

An audio output control method in which computing means performs control to output voice according to collected external sound, wherein the computing means:
acquires a series of external sound data relating to the external sound;
recognizes a phrase included in the acquired external sound data;
retrieves the voice information corresponding to the recognized phrase from storage means constructed in a table structure that stores a plurality of items of voice information, each configured by associating, in one data structure, voice data relating to a voice extracted from spoken utterances, phrase information relating to a phrase, and relevance information relating to the degree of association between the spoken words of the voice data and the phrase of the phrase information;
selects predetermined voice data, based on the relevance information, from the voice information obtained by the search; and
performs control to output the selected voice data as voice from a speaker.

An audio output control program for causing a computing means to function as the audio output control device according to any one of claims 1 to 27 or the audio output control system according to claim 28.

An audio output control program for causing an arithmetic means to execute the audio output control method according to claim 29.

A recording medium on which an audio output control program according to claim 30 or 31 is recorded so as to be readable by an arithmetic means.