JP2008107641A

JP2008107641A - Voice data retrieving apparatus

Info

Publication number: JP2008107641A
Application number: JP2006291437A
Authority: JP
Inventors: Juichi Sato; 寿一佐藤
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-10-26
Filing date: 2006-10-26
Publication date: 2008-05-08

Abstract

PROBLEM TO BE SOLVED: To accurately retrieve a desired part of recorded voice data.

SOLUTION: Voice in a conference is stored as voice data in a voice data storing section 17. A CPU 11 extracts a feature of the voice data for each predetermined frame and stores it, together with time information, as a feature data string in an analysis data storing section 18. For retrieval, an operator speaks a desired word into a microphone 16. The CPU 11 extracts the feature of this voice for each predetermined frame and stores it as a feature data string in a RAM 13. Coincidence between the feature data string in the RAM 13 and the feature data string in the analysis data storing section 18 is then detected, and the time information attached to the feature data string detected as coincident is extracted.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice data retrieval apparatus for retrieving a desired portion from stored voice data.

To search stored voice data for the part in which a desired keyword is spoken, one known method converts, for example, the audio of a meeting into text data and stores it. Text data serving as a search key is then entered, and the stored data is searched for the text portion that matches the search key (Patent Document 1).

A method has also been proposed in which the switching of operations in presentation application software is recorded in synchronization with the conference audio, and the audio is searched using the operation state of the presentation as a key (Patent Document 2).

Patent Document 1: JP 2002-366552 A
Patent Document 2: Japanese Patent No. 3637937

In Patent Document 1, however, clearly articulated speech such as narration can be converted into text with high accuracy, but current speech recognition technology is not accurate enough to transcribe what various people say in ordinary conversation, as in a meeting. And if the text is inaccurate, the desired data can hardly be retrieved at all.

The method of Patent Document 2 often cannot be used at all, because many meetings do not use presentation application software. Even when such software is used, the voice data being sought does not necessarily coincide with a switching point in the presentation operation, so an accurate search cannot be performed.

To solve the problems described above, an object of the present invention is to provide a voice data retrieval apparatus that can accurately retrieve desired voice data even when presentation application software or the like is not used.

To solve the above problems, the present invention comprises:
sound collection means for outputting voice data corresponding to collected sound;
voice data storage means for storing the voice data output by the sound collection means;
feature data generation means for analyzing the voice data output by the sound collection means and generating feature data indicating its features;
feature data storage means for storing the feature data generated by the feature data generation means together with time data corresponding to its generation time;
search key input instruction means for instructing input of a search key;
search feature data storage means for storing, as search feature data, the feature data generated by the feature data generation means while input of a search key is being instructed by the search key input instruction means;
search means for comparing the search feature data in the search feature data storage means with the feature data in the feature data storage means and retrieving feature data regarded as matching; and
reading means for reading the voice data stored in the voice data storage means from an address corresponding to the time data of the feature data retrieved by the search means.

Another aspect of the present invention comprises:
sound collection means for outputting voice data corresponding to collected sound;
voice data storage means for storing the voice data output by the sound collection means;
feature data generation means for analyzing the voice data output by the sound collection means and generating feature data indicating its features;
feature data storage means for storing the feature data generated by the feature data generation means together with time data corresponding to its generation time;
character string input means for inputting a character string;
a table in which phonemes that make up character strings are associated with feature data of the voice produced when each phoneme is pronounced;
conversion means for converting each character of the character string input by the character string input means into feature data by referring to the table;
search feature data storage means for storing the feature data converted by the conversion means as search feature data;
search means for comparing the search feature data in the search feature data storage means with the feature data in the feature data storage means and retrieving feature data regarded as matching; and
reading means for reading the voice data stored in the voice data storage means from an address corresponding to the time data of the feature data retrieved by the search means.

In another preferred aspect of the present invention,
the sound collection means has a plurality of microphones and voice data generation means that generates voice data corresponding to the sound collected by each microphone and attaches to the voice data identification data identifying which microphone the voice data came from;
the feature data storage means stores the feature data partitioned according to the identification data; and
when identification data is specified, the search means compares the feature data partitioned under the specified identification data with the search feature data in the search feature data storage means.

In another preferred aspect of the present invention,
the sound collection means has an array microphone whose sound collection direction is variable, sound collection direction control means that controls the sound collection direction of the array microphone and outputs direction data indicating that direction, and voice data generation means that generates voice data corresponding to the sound collected by the array microphone;
the feature data storage means stores the feature data partitioned according to the direction data; and
when direction data is specified, the search means compares the feature data partitioned under the specified direction data with the search feature data in the search feature data storage means.
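The partitioned storage in the two preferred aspects above can be pictured with a small Python sketch. It is only an illustration: the class name `PartitionedFeatureStore`, its methods, and the use of an in-memory dictionary are assumptions, not details given in the specification. The partition key stands for either the microphone identification data or the direction data.

```python
from collections import defaultdict


class PartitionedFeatureStore:
    """Hypothetical store that keeps feature data separated by a partition key.

    The key may be microphone identification data or direction data.
    """

    def __init__(self):
        self._parts = defaultdict(list)

    def add(self, key, time_s, features):
        """Store one frame's feature data string under its partition key."""
        self._parts[key].append((time_s, features))

    def candidates(self, key=None):
        """Return the records to compare against the search feature data.

        With a key, only that partition is searched; without one, all
        partitions are merged in time order.
        """
        if key is not None:
            return list(self._parts.get(key, []))
        merged = [rec for recs in self._parts.values() for rec in recs]
        return sorted(merged, key=lambda rec: rec[0])
```

Restricting the comparison to one partition reduces the amount of feature data that has to be compared when the operator already knows which microphone or direction the utterance came from.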

Because the comparison uses feature data extracted from the voice data, the voice data need not be converted into text data or the like, and an accurate search can be performed. Presentation software or the like is not required either.

(First Embodiment)
(A) Configuration
FIG. 1 is a block diagram showing the hardware configuration of a conference system according to a first embodiment of the present invention. The CPU (Central Processing Unit) 11 shown in FIG. 1 reads a computer program stored in the ROM (Read Only Memory) 12, loads it into the RAM (Random Access Memory) 13, and executes it, thereby controlling each part of the hardware. The RAM 13 is also used as a work area for the CPU 11. The operation unit 14 has various keys and outputs a signal corresponding to the pressed key to the CPU 11.

The microphone 16 collects ambient sound and outputs it as an audio signal. The input IF (Interface) 15 samples the audio signal (an analog signal) output from the microphone 16 at a predetermined sampling frequency and converts it into voice data Sad. FIG. 2 shows an example of the voice data Sad. As illustrated, it is a data string representing the amplitude at each sampling timing along the time axis.

The voice data storage unit 17 shown in FIG. 1 sequentially stores the voice data Sad output from the input IF 15 under the control of the CPU 11. The voice data Sad for each sampling timing is stored in turn, one sample per address, in the voice data storage unit 17.

The CPU 11 also analyzes the voice data Sad output from the input IF 15 to generate analysis data, and sequentially stores the generated analysis data in the analysis data storage unit 18.

The method of generating the analysis data is as follows. In this embodiment, as shown in FIG. 3, a fast Fourier transform (FFT) is applied to the voice data Sad for each frame of a predetermined time interval (10 ms in this embodiment) to generate a frequency spectrum. Examples of the frequency spectra of frames fr1 to fr3 in FIG. 3 are shown in FIG. 4 (a) to (c). As these figures show, the frequencies and amplitudes of the sine waves contained in each frame are extracted. The CPU 11 then applies the normalization described below to the frequencies and amplitudes extracted for each frame.
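As an illustration only, the framing and spectrum extraction described above can be sketched in Python. The sketch assumes a naive discrete Fourier transform from the standard library in place of an FFT (the result is the same, only slower), and it assumes that "sine waves contained in a frame" means spectral peaks above a threshold. The function names and the `floor` threshold are hypothetical.

```python
import cmath
import math


def split_frames(samples, sample_rate, frame_ms=10):
    """Split amplitude samples into consecutive fixed-length frames."""
    n = int(sample_rate * frame_ms / 1000)  # samples per frame
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def spectrum(frame, sample_rate):
    """Return (frequency_hz, amplitude) per bin below Nyquist (naive DFT)."""
    n = len(frame)
    bins = []
    for k in range(1, n // 2):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
        bins.append((k * sample_rate / n, 2 * abs(s) / n))
    return bins


def sine_components(frame, sample_rate, floor=0.05):
    """Keep bins that are local amplitude peaks above `floor` (assumed rule)."""
    spec = spectrum(frame, sample_rate)
    peaks = []
    for j, (freq, amp) in enumerate(spec):
        left = spec[j - 1][1] if j > 0 else 0.0
        right = spec[j + 1][1] if j + 1 < len(spec) else 0.0
        if amp >= floor and amp > left and amp > right:
            peaks.append((freq, amp))
    return peaks
```

With a 10 ms frame the bin spacing is 100 Hz regardless of the sampling rate, so components closer together than that cannot be separated in this sketch.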

First, for each frame, the lowest of the sine wave frequencies is taken as the pitch, and the average of the sine wave amplitudes is taken as the average sound pressure level of the frame. The frequency of each sine wave in the frame is then divided by the pitch, and the amplitude of each sine wave is divided by the average sound pressure level. As a result, a data string of normalized frequencies and amplitudes, ordered from low frequency to high, is generated for each frame: (f1, A1), (f2, A2), (f3, A3), and so on. The numbers indicate the order from the low-frequency side within each frame; the same number in different frames does not denote the same frequency or the same amplitude. In the following description, this data string is called a feature data string.
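The normalization step can be written compactly. The following Python sketch is a minimal illustration under the assumption that each frame is given as a list of (frequency, amplitude) pairs with non-zero frequencies and a non-zero mean amplitude; the function name `normalize_frame` is hypothetical.

```python
def normalize_frame(components):
    """Normalize one frame's (frequency, amplitude) pairs.

    Frequencies are divided by the lowest frequency (the pitch) and
    amplitudes by the mean amplitude (the average sound pressure level).
    """
    if not components:
        return []
    ordered = sorted(components)  # low frequency to high
    pitch = ordered[0][0]
    mean_amp = sum(amp for _, amp in ordered) / len(ordered)
    return [(freq / pitch, amp / mean_amp) for freq, amp in ordered]
```

Two frames whose components differ only by an overall pitch factor and an overall loudness factor normalize to the same feature data string, which is what lets a keyword spoken by one person match an utterance by another.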

FIG. 5 shows the contents of the analysis data storage unit 18. As illustrated, each record contains a frame number (fr1, fr2, fr3, ...), time data, and a feature data string. The time data is the start time of each frame. The time data need only correspond to the generation time of the feature data string: it may be the start time or end time of the frame, the time of writing to the analysis data storage unit 18, or the sound collection time of the first voice data in the frame. In short, any time that identifies when the feature data was generated will do.
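A record of this kind might be modeled as follows. This Python sketch is an assumption about layout, not the format of FIG. 5: the names `FrameRecord` and `build_records` are hypothetical, and the time data is taken to be the frame start time in seconds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FrameRecord:
    frame_no: int                        # fr1, fr2, fr3, ...
    time_s: float                        # time data (frame start time here)
    features: List[Tuple[float, float]]  # feature data string [(f, A), ...]


def build_records(feature_strings, start_time_s=0.0, frame_s=0.010):
    """Attach a frame number and start time to each feature data string."""
    return [
        FrameRecord(i + 1, start_time_s + i * frame_s, feats)
        for i, feats in enumerate(feature_strings)
    ]
```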

The display unit 20 shown in FIG. 1 has a display and shows predetermined characters and figures under the control of the CPU 11. The reproduction unit 21 converts the voice data Sad into an audio signal under the control of the CPU 11. The speaker 22 outputs the converted audio signal as sound.

(B) Operation
Next, the operation of this embodiment will be described, taking as an example the case where the audio of a conference is stored and a desired portion is then retrieved from it.

First, the microphone 16 is placed on a conference table or the like, and each statement by the conference participants is recorded. The microphone 16 collects each participant's speech and outputs it as an audio signal. As a result, voice data Sad as shown in FIG. 2 is output from the input IF 15, and the amplitude at each sampling timing is recorded sequentially in the voice data storage unit 17.

At the same time, the CPU 11 analyzes the voice data Sad and sequentially stores the analysis results in the analysis data storage unit 18, so that feature data strings as shown in FIG. 5 accumulate. In this way, each statement in the conference is stored as voice data Sad in the voice data storage unit 17, and its features are analyzed and stored as feature data strings in the analysis data storage unit 18.

Next, when the operator wants to hear a desired portion of the recorded voice data, the operator presses a predetermined button on the operation unit 14 and speaks the search keyword into the microphone 16. For example, if the keyword is "こんにちは" ("hello"), the operator presses the predetermined button on the operation unit 14 and says "こんにちは". Voice data Sad for this word is generated and stored in the RAM 13, and it is analyzed by the same processing as during recording of the conference. The analysis result is stored in a predetermined area of the RAM 13 as search feature data. FIG. 6 shows the stored contents. In this way, the feature data string of "こんにちは" is obtained for each of the frames FR1, FR2, and so on.

The CPU 11 then sequentially compares the feature data string of the keyword stored in the RAM 13 with the feature data strings of the conference audio stored in the analysis data storage unit 18. Matching between the feature data strings of two frames is determined as follows. For the first frame, for example, the feature data of frames FR1 and fr1 are compared in order from the lowest frequency to determine whether they match, and a predetermined tolerance is set for this determination.
For example, f1 of frame FR1 and f1 of frame fr1 are regarded as matching if they are within an allowable error (for example, 10%), even if they are not exactly equal. Similarly, the amplitudes are regarded as matching when the relative error of the amplitude A1 is within, for example, 10%. When both the frequency and the amplitude are regarded as matching, the sine wave component is regarded as matching. The comparison proceeds in the order (f1, A1), (f2, A2), (f3, A3), and so on, and if 90% of all samples (for example, 50 to 100 of them) are regarded as matching, the first frames FR1 and fr1 are determined to match. This determination is made for each frame.
Because the frequencies and amplitudes have been normalized as described above, the words are determined to match as long as the feature data match, even if the keyword spoken by the operator differs in pitch or sound pressure level from the conference speaker's utterance. Individual differences in pronunciation between the operator and the speaker therefore do not cause the same word to be treated as a different search target. The tolerances above can be set as appropriate for the circumstances of use. They may be set by key operation on the operation unit 14, or stored in advance as default values in the ROM 12 or the RAM 13.
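The frame-level match test can be sketched as follows. This Python example is a minimal illustration using the example values from the text (10% tolerance, 90% of components). The function names are hypothetical, and the handling of frames with different numbers of components (unpaired components count as mismatches) is an assumption, since the text does not address that case.

```python
def within(value, reference, tol):
    """True if `value` is within relative error `tol` of `reference`."""
    return abs(value - reference) <= tol * abs(reference)


def frames_match(query, stored, tol=0.10, min_ratio=0.90):
    """Compare two normalized feature data strings [(f, A), ...].

    Components are paired in order from the lowest frequency. A pair
    matches only if both frequency and amplitude are within `tol`.
    """
    total = max(len(query), len(stored))  # assumption: extras are mismatches
    if total == 0:
        return False
    hits = sum(
        1
        for (fq, aq), (fs, as_) in zip(query, stored)
        if within(fq, fs, tol) and within(aq, as_, tol)
    )
    return hits / total >= min_ratio
```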

The match search is described further here. The CPU 11 treats the keyword "こんにちは" stored in the RAM 13 as a single word and, for the run of consecutive frames corresponding to the pronunciation of that word (hereinafter called a frame group), analyzes the feature data strings in the analysis data storage unit 18 and extracts the matching frame groups. That is, starting from the first frame of "こんにちは", the feature data strings of the operator and of the conference speaker are compared in order.

The duration of the pronunciation may differ between the operator and the conference speaker, so the CPU 11 compares the two feature data strings corresponding to their pronunciations sequentially according to a DP (Dynamic Programming) matching algorithm. DP matching associates the frames in which the features of the operator's voice and the conference speaker's voice match. A search is therefore possible for the same pronunciation of "こんにちは" even when the durations differ. In other words, even if the number of frames in the "こんにちは" spoken by the operator differs from the number of frames for "こんにちは" extracted from the conference speaker's feature data string in the analysis data storage unit 18, a match can be found as long as both are the same pronunciation of "こんにちは".
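The specification does not give the details of the DP matching, so the following Python sketch shows one common form of it, dynamic time warping, as an assumption. The local cost is 0 when two frames are regarded as matching and 1 otherwise, and the frame comparison is passed in as the hypothetical predicate `same`. Both sequences are assumed to be non-empty.

```python
def dp_align(query, stored, same):
    """Align two frame sequences of different lengths by dynamic programming.

    Returns (total_mismatch_cost, path), where path is a list of
    (query_index, stored_index) pairs from the first frames to the last.
    """
    n, m = len(query), len(stored)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0 if same(query[i - 1], stored[j - 1]) else 1
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # stored frame is stretched
                cost[i][j - 1],      # query frame is stretched
            )
    # Backtrack from the end to recover which frames were paired.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[n][m], path
```

In the path, one query frame may be paired with several stored frames (or the reverse), which is how an utterance spoken slowly is aligned with the same word spoken quickly.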

Several frame groups matching "こんにちは" may be detected in the analysis data storage unit 18. In this embodiment, even after a match has been detected for a frame group of "こんにちは", the CPU 11 refers to the degree of match of each frame in the two frame groups and calculates the degree of match between the frame groups.

For example, if two frame groups detected as matching each have 100 frames, the feature data strings are regarded as matching in 97 of those frames, and the remaining 3 frames are regarded as not matching, the degree of match for this conference speaker's feature data string is calculated as 97%. Alternatively, the average of the degrees of match of the individual frame pairs may be used as the degree of match of the frame group. When the numbers of frames differ, the degree of match between the frame groups can be obtained by proportional scaling. For example, to compare a frame group of 30 frames with a frame group of 90 frames, the number of matching frames in the former is multiplied by 3 and divided by 90 to obtain the degree of match as a percentage.
The permissible proportion of frames in a frame group that may be regarded as mismatched is set in advance. The setting may be such that even a single mismatched frame rules out a match, or such that a mismatch of 20 to 30% is tolerated.
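The arithmetic in the two examples above can be checked with a short Python sketch. The function names and the `max_mismatch_percent` parameter are hypothetical; the calculation follows the proportional scaling described in the text.

```python
def group_match_percent(matched_frames, query_frames, stored_frames):
    """Degree of match between two frame groups, as a percentage.

    The matched count in the query group is scaled by the ratio of the
    group sizes and then divided by the size of the stored group.
    """
    scale = stored_frames / query_frames
    return 100.0 * matched_frames * scale / stored_frames


def accept(percent, max_mismatch_percent=20.0):
    """Apply the preset limit on the proportion of mismatched frames."""
    return percent >= 100.0 - max_mismatch_percent
```

For instance, 97 matching frames out of 100 against a group of 100 gives 97%, and 27 matching frames out of 30 against a group of 90 gives 27 × 3 / 90 = 90%. Setting `max_mismatch_percent` to 0 corresponds to rejecting a group with even one mismatched frame.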

In this way, the frame groups corresponding to "こんにちは" and their degrees of match are detected in the analysis data storage unit 18. FIG. 7 shows an example of the display on the display unit 20 when frame groups matching "こんにちは" have been detected, in this case at three locations in the analysis data storage unit 18. As illustrated, a number indicating the order of detection, the time, and the degree of match are displayed. The time is that of the first frame (see FIG. 5) of the frame group in the analysis data storage unit 18 that was determined to match.

A cursor Csr is displayed on the display unit 20 and moves under the control of the CPU 11 as a predetermined key on the operation unit 14 is pressed. When a predetermined key (such as the Enter key) is pressed, the CPU 11 takes the time indicated by the cursor Csr as the start time and reads the voice data corresponding to that time from the voice data storage unit 17. Because the voice data in the voice data storage unit 17 is stored sequentially according to the sampling timing, a difference of one address corresponds to one sampling period, so the address corresponding to the read start time is easily obtained. The voice data read out in this way is supplied to the reproduction unit 21, where a reproduction signal is generated and output as sound from the speaker 22.
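Because one address corresponds to one sampling period, the read address follows directly from the start time. The Python sketch below illustrates this under the assumption that addresses count samples from the start of the recording; the function and parameter names are hypothetical.

```python
def address_for_time(time_s, recording_start_s, sample_rate_hz, base_address=0):
    """Address of the sample recorded at `time_s`.

    One address holds one sample, so the offset is the elapsed time
    multiplied by the sampling frequency.
    """
    offset = round((time_s - recording_start_s) * sample_rate_hz)
    return base_address + offset


def read_from(samples, time_s, recording_start_s, sample_rate_hz):
    """Return the stored samples from the given start time onward."""
    return samples[address_for_time(time_s, recording_start_s, sample_rate_hz):]
```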

In this way, playback starts from the point that matches the "こんにちは" spoken by the operator, that is, from the point where a conference speaker says "こんにちは". If the reproduced audio is not the desired part, the operator can select another candidate from the displayed list and listen to it, and can thus easily find and hear the desired portion. In this embodiment, therefore, the desired voice data can be found by directly comparing the features of the recorded voice with those of the search voice, without any character string input or speech recognition.

(Second Embodiment)
Next, a second embodiment of the present invention will be described. In the following description, parts common to the first embodiment are given the same reference numerals and their description is omitted.
(A) Configuration
This embodiment differs from the first embodiment in that a text phoneme feature conversion unit 19 is provided. The text phoneme feature conversion unit 19 converts text data entered from the keyboard or the like of the operation unit 14 into a feature data string.

For example, when the character string "こんにちは" is entered from the keyboard of the operation unit 14, the input string is converted by morphological analysis into a hiragana string representing its actual pronunciation. Morphological analysis is the process of recognizing words in a character string. Unlike English text, Japanese text is not written with spaces between words, so cutting out and recognizing individual words is difficult. Morphological analysis therefore uses a morpheme dictionary database stored in advance (not shown) to divide the text into words and determine their parts of speech. In this embodiment, the text is further converted into kana corresponding to the sounds actually pronounced. Taking the word "こんにちは" as an example, the hiragana string representing its pronunciation is "こんにちわ" (konnichiwa). That is, the word "こんにちは" is extracted from the morpheme dictionary database, an internal pronunciation dictionary database (not shown) is consulted to recognize that its actual pronunciation is "こんにちわ", and the kana corresponding to that result is obtained.
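The dictionary lookup can be illustrated with a deliberately simplified Python sketch. It is not a real morphological analyzer: it assumes a greedy longest-match segmentation against a single combined dictionary, ignores parts of speech, and passes unknown characters through unchanged. The dictionary contents and the name `to_pronunciation` are hypothetical.

```python
# Hypothetical entries: written word -> kana as actually pronounced.
PRONUNCIATION = {
    "こんにちは": "こんにちわ",
    "会議": "かいぎ",
}


def to_pronunciation(text, lexicon=PRONUNCIATION):
    """Segment `text` by greedy longest match and return its pronunciation."""
    out, i = [], 0
    longest = max(map(len, lexicon))
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                out.append(lexicon[word])
                i += size
                break
        else:
            out.append(text[i])  # unknown character: keep as written
            i += 1
    return "".join(out)
```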

Once the phonemes have been obtained in this way, the text phoneme feature conversion unit 19 refers to the text phoneme feature conversion table stored within it (see FIG. 9) and generates search feature data corresponding to the actual pronunciation "こんにちわ". As shown in FIG. 9, the table holds a frame group for each of the phonemes "あ", "い", "う", and so on, and a feature data string is written in each frame of each frame group. These feature data strings are normalized in the same way as those stored in the analysis data storage unit 18 in the first embodiment. To simplify the illustration, FIG. 9 shows only five frames for each phoneme's frame group, but each group actually consists of more frames.
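The table lookup amounts to concatenating the frame groups of the successive phonemes. The Python sketch below assumes the table is a dictionary from a kana character to a list of frames, each frame being a feature data string; the function name and this layout are assumptions, not the format of FIG. 9.

```python
def text_to_feature_frames(kana, table):
    """Build a search frame group from a pronunciation string.

    `table` maps each phoneme (one kana character here) to its frame
    group: a list of frames, each a feature data string [(f, A), ...].
    A phoneme missing from the table raises KeyError.
    """
    frames = []
    for phoneme in kana:
        frames.extend(table[phoneme])
    return frames
```

The resulting frame group plays the same role as the one produced from the operator's spoken keyword in the first embodiment, so the same matching procedure can be applied to it.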

For English text and the like, morphological analysis is unnecessary. Instead, phonemes are extracted from the spelling of the input string by referring to a dictionary database, and the feature data strings corresponding to the extracted phonemes are obtained from the text phoneme feature conversion table shown in FIG. 9. In that case, feature data strings corresponding to English phonemes must be set in the table in advance.

(B) Operation
Next, the operation of this embodiment will be described. When the operator enters a keyword such as "こんにちは" from the keyboard of the operation unit 14, the text phoneme feature conversion unit 19 converts the input string by morphological analysis into the hiragana string "こんにちわ" representing its actual pronunciation, and obtains the frame group having the corresponding feature data strings by referring to the text phoneme feature conversion table shown in FIG. 9. The CPU 11 writes the frame group that the text phoneme feature conversion unit 19 obtained for "こんにちは" into the RAM 13.
Next, as in the first embodiment, the CPU 11 finds, among the frame groups in the analysis data storage unit 18, those that match the frame group written in the RAM 13, and displays the retrieved candidates on the display unit 20. When the operator selects the desired candidate from the display, the corresponding sound is emitted from the speaker 22. This operation is the same as in the first embodiment.
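The match search over the analysis data might look like the following sketch. The matching criterion is an assumption: this section does not spell out how frames are judged to match, so a per-value tolerance is used here for illustration, and the names `frames_match` and `search` are hypothetical.

```python
def frames_match(a, b, tolerance=0.1):
    """Assumed criterion: every value differs by at most `tolerance`."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def search(analysis, query, tolerance=0.1):
    """Slide the query frame group over the stored analysis data.

    analysis: list of (time, feature) pairs in time order.
    query:    list of features (the frame group held in RAM).
    Returns the time data of the first frame of every matching position.
    """
    hits = []
    n = len(query)
    if n == 0:
        return hits
    for start in range(len(analysis) - n + 1):
        window = analysis[start:start + n]
        if all(frames_match(f, q, tolerance) for (_, f), q in zip(window, query)):
            hits.append(window[0][0])
    return hits
```

Each returned time can then be used to locate the corresponding position in the stored voice data for playback.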

As described above, according to the second embodiment, even when a character string is typed on the keyboard, the frame group having the feature data strings corresponding to that string is identified, and the match search is a comparison between feature data strings. There is therefore no need to convert conference audio or the like into character strings by speech recognition; the search is performed by comparing voice features with each other.

(Modifications)
The present invention is not limited to the embodiments described above and can be implemented in various forms. Examples are given below.

(Modification 1)
When a common feature continues across a plurality of frames, a rule may be provided that treats those frames as identical based on the number of consecutive frames. For example, a rule may state that when frames whose feature data strings are regarded as matching continue from the 5th frame to the 30th frame, the phoneme over those frames is regarded as the same.
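One possible reading of such a rule is sketched below: consecutive frames that are regarded as matching are grouped into a run, and runs shorter than a minimum length are ignored. The function name `collapse_runs`, the `same` predicate, and the `min_run` parameter are assumptions made for illustration.

```python
def collapse_runs(frames, same, min_run=1):
    """Group consecutive frames that `same` regards as matching.

    Returns (representative_frame, start_index, run_length) for every run
    whose length is at least `min_run`. Each frame is compared with the
    first frame of its run.
    """
    runs = []
    i = 0
    while i < len(frames):
        j = i + 1
        while j < len(frames) and same(frames[i], frames[j]):
            j += 1
        if j - i >= min_run:
            runs.append((frames[i], i, j - i))
        i = j
    return runs
```

Treating each run as a single phoneme reduces the number of comparisons needed when long steady sounds span many frames.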

(Modification 2)
As shown in FIG. 10, a plurality of microphones 16 may be provided, such as microphones A, B, and C. In this case, input path information (identification data) may be assigned to each microphone input terminal and, as shown in FIG. 11, added to the analysis data. Speakers can then be distinguished and the voice data classified, so a search that combines input path information with a keyword narrows the search range and improves search efficiency. This works because speakers in a conference rarely talk at the same time, so the sound at a given time can be presumed to have been picked up by whichever one of microphones A, B, and C is directed at the current speaker, as shown in FIG. 12.
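Narrowing the search by input path can be illustrated as follows. The record layout (`time`, `path`, `feature`) is a hypothetical stand-in for the tagged analysis data of FIG. 11, and exact equality is used as the match test purely to keep the sketch short.

```python
from collections import namedtuple

# Hypothetical record layout for analysis data tagged with its input path.
Record = namedtuple("Record", ["time", "path", "feature"])


def search_by_path(records, path, query_feature):
    """Return times of records from `path` whose feature equals the query."""
    return [r.time for r in records if r.path == path and r.feature == query_feature]
```

Only records carrying the requested tag are compared, so the work per query shrinks roughly in proportion to the share of speech that came through that microphone.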

(Modification 3)
FIG. 10 shows an example with three microphones, but a microphone array system 30 having a plurality of microphones Mic may be used instead, as shown in FIG. 13. Because the microphone array system 30 can form the input direction of sound spatially, it is configured to supply the input IF 15 with the audio signal together with direction information indicating that input direction. The input IF 15 samples the audio signal at a predetermined sampling frequency, converts it into voice data Sad, and outputs the direction information. Under the control of the CPU 11, the voice data storage unit 17 sequentially stores the voice data Sad output by the input IF 15 and stores the direction information in a predetermined header. Because the direction information identifies the speaker, the voice data is classified in the same way as in Modification 2, and search efficiency can be improved.

(Modification 4)
Voice data may be stored in the voice data storage unit 17 in any form, as long as time and amplitude are associated with each other. For example, physical addresses in the storage area of the voice data storage unit 17 may correspond directly to time, or a header that stores the time may be inserted for each predetermined memory block. The memory block length may be fixed, or the data may take the form of variable-length memory blocks whose length is included in the header. When time data is attached to each memory block, the retrieved time is also discrete, in units of memory blocks, but if the memory block size is set appropriately, problems such as the retrieved time being ambiguous do not arise.

Alternatively, the voice data may be stored in a contiguous storage area, and a table that records the correspondence between time data and physical addresses in that storage area may be kept in a separate storage area.
In that case, as in the first and second embodiments, the voice data may also be stored in compressed form.
Furthermore, when recording conference audio or the like, storing voice data for silent periods is wasteful, so it is desirable not to store voice data that contains no amplitude value at or above a predetermined level. In this case, the time at which recording resumes may be stored as a time stamp, or, as described above, the time data may be included in the header of fixed-length or variable-length memory block data.
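A minimal sketch of block storage with a time header and silence skipping is given below. The block layout (a dictionary with `time`, `length`, and `data`), the function name `store_blocks`, and its parameters are assumptions chosen for illustration, not the storage format of the embodiment.

```python
def store_blocks(samples, sample_rate, block_len, threshold):
    """Split samples into blocks and keep only the non-silent ones.

    A block is kept when at least one sample has an absolute amplitude at or
    above `threshold`. Each kept block carries a header with its start time
    in seconds and its length, so gaps left by silence remain recoverable.
    """
    blocks = []
    for start in range(0, len(samples), block_len):
        chunk = samples[start:start + block_len]
        if max(abs(s) for s in chunk) >= threshold:
            blocks.append({
                "time": start / sample_rate,  # header: time data
                "length": len(chunk),         # header: block length
                "data": chunk,
            })
    return blocks
```

Because each block records its own start time, playback can begin at the block whose time is closest to the time data found by the search, even when silent blocks have been dropped.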

(Modification 5)
The algorithm used to generate the analysis data is not limited to the fast Fourier transform (FFT). Any algorithm may be used as long as it can generate, for each frame, a spectrum of frequency versus amplitude specific to that frame, as in FIG. 4. For example, another discrete Fourier transform algorithm or a wavelet transform may be used.
Each frame may overlap in time with the frames before and after it. This can improve the accuracy of the analysis.
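Overlapping frames and a per-frame spectrum can be sketched as follows. A direct discrete Fourier transform is used instead of an FFT so that the example needs only the standard library; the function names and the `hop` parameter (the step between frame starts) are assumptions for illustration.

```python
import cmath
import math


def split_frames(samples, frame_len, hop):
    """Cut samples into frames of `frame_len`; frames overlap when hop < frame_len."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Direct DFT magnitudes for bins 0..n/2 (slow, but dependency-free)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]
```

With `hop` set to half of `frame_len`, each sample away from the edges contributes to two frames, which is one common way of realizing the overlap described above.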

(Modification 6)
The voice data Sad may be analyzed at the same time as it is stored in the voice data storage unit 17, or the two steps may be performed separately. For example, stored voice data Sad may be read out later and then analyzed.

(Modification 7)
Data may be written directly to the voice data storage unit 17 and the analysis data storage unit 18, or it may first be buffered in a predetermined buffer memory or in a storage area of the RAM 13. With buffering, the data held in the temporary storage area can be searched quickly, which is useful when, for example, the operator wants to replay something said a moment ago.
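A bounded in-memory buffer of recent frames could be sketched like this. The class name `RecentFrames`, its capacity parameter, and the exact-equality match are assumptions for illustration; the embodiment does not specify the buffer structure.

```python
from collections import deque


class RecentFrames:
    """Keeps only the most recent `capacity` (time, feature) pairs."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def append(self, time, feature):
        # When the deque is full, the oldest entry is discarded automatically.
        self._buffer.append((time, feature))

    def find(self, feature):
        """Return the time of the most recent buffered frame equal to `feature`."""
        for time, stored in reversed(self._buffer):
            if stored == feature:
                return time
        return None
```

Searching newest-first suits the use case of replaying a remark made a short while ago, since the wanted frame is likely near the end of the buffer.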

(Modification 8)
In each of the embodiments described above, a feature code (feature data) may be assigned to the feature data strings. Similar feature data strings can in some cases be classified into a common set, and a feature code is assigned to each set so classified. The feature code is then added to each record of the tables, as indicated by the broken lines in FIGS. 5, 6, 9, and 11. With this configuration, matches between the operator's voice and the frames in the analysis data storage unit 18 can be detected by comparing feature codes, which greatly increases the speed of match detection. A certain tolerance may also be allowed in matching feature codes: feature codes whose feature data strings are similar may be treated as matching, either as an exact match or as a match with an assigned degree of agreement (for example, 80% or 90%).
In addition, with this approach only the feature code needs to be stored in the analysis data storage unit 18 in place of the feature data string, so the storage area can be made smaller.

One way to derive feature codes from feature data strings is, for example, to sample a substantial number of utterances of the Japanese syllabary (gojūon) in advance, analyze them by the method used in the embodiments described above, and assign a common feature code to analysis results that are similar to one another.
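Feature code assignment and tolerant code matching can be illustrated with the sketch below. It assumes a pre-built codebook of reference vectors, one per feature code, and uses nearest-reference assignment by squared Euclidean distance; that choice, the codebook contents, and the function names are assumptions, since the specification leaves the classification method open.

```python
def assign_code(feature, codebook):
    """Return the code whose reference vector is nearest to `feature`.

    codebook: dict mapping a feature code to its reference vector.
    Distance is squared Euclidean (an assumed similarity measure).
    """
    return min(
        codebook,
        key=lambda code: sum((a - b) ** 2 for a, b in zip(feature, codebook[code])),
    )


def match_ratio(query_codes, stored_codes):
    """Fraction of positions at which two equal-length code sequences agree."""
    if not query_codes or len(query_codes) != len(stored_codes):
        return 0.0
    agree = sum(q == s for q, s in zip(query_codes, stored_codes))
    return agree / len(query_codes)
```

A search could then accept a candidate when `match_ratio` reaches a chosen threshold such as 0.8 or 0.9, in line with the degrees of agreement mentioned above.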

FIG. 1 is a diagram showing the hardware configuration of one embodiment of the present invention.
FIG. 2 is a waveform diagram showing an example of voice data.
FIG. 3 is a waveform diagram showing the frames used for analyzing voice data.
FIG. 4 is a diagram showing the spectrum of each frame.
FIG. 5 is a diagram showing an example of analysis data.
FIG. 6 is a diagram showing an example of search feature data.
FIG. 7 is a diagram showing a display example of search results.
FIG. 8 is a diagram showing the hardware configuration of the second embodiment of the present invention.
FIG. 9 is a diagram showing the text phoneme feature conversion table.
FIG. 10 is a block diagram showing an input IF provided with a plurality of microphones.
FIG. 11 is a diagram showing an example of analysis data that includes input path information (tags).
FIG. 12 is a timing chart showing the timing of sound input to the plurality of microphones.
FIG. 13 is a block diagram showing an input IF provided with a microphone array system.

Explanation of symbols

11: CPU, 12: ROM, 13: RAM, 14: operation unit, 15: input IF, 16: microphone, 17: voice data storage unit, 18: analysis data storage unit.

Claims

A voice data retrieval apparatus comprising:
sound collection means for outputting voice data corresponding to the collected sound;
voice data storage means for storing the voice data output by the sound collection means;
feature data generation means for analyzing the voice data output by the sound collection means and generating feature data indicating its features;
feature data storage means for storing the feature data generated by the feature data generation means together with time data corresponding to the time of generation;
search key input instruction means for instructing input of a search key;
search feature data storage means for storing, as search feature data, the feature data generated by the feature data generation means when input of a search key is instructed by the search key input instruction means;
search means for comparing the search feature data in the search feature data storage means with the feature data in the feature data storage means and retrieving feature data regarded as matching; and
reading means for reading the voice data stored in the voice data storage means from an address corresponding to the time data of the feature data retrieved by the search means.

A voice data retrieval apparatus comprising:
sound collection means for outputting voice data corresponding to the collected sound;
voice data storage means for storing the voice data output by the sound collection means;
feature data generation means for analyzing the voice data output by the sound collection means and generating feature data indicating its features;
feature data storage means for storing the feature data generated by the feature data generation means together with time data corresponding to the time of generation;
character string input means for inputting a character string;
a table in which phonemes that are constituent elements of a character string are associated with the feature data of the voice produced when those phonemes are pronounced;
conversion means for converting each character of the character string input by the character string input means into feature data with reference to the table;
search feature data storage means for storing the feature data converted by the conversion means as search feature data;
search means for comparing the search feature data in the search feature data storage means with the feature data in the feature data storage means and retrieving feature data regarded as matching; and
reading means for reading the voice data stored in the voice data storage means from an address corresponding to the time data of the feature data retrieved by the search means.

The voice data retrieval apparatus according to claim 1 or 2, wherein
the sound collection means comprises a plurality of microphones and voice data generation means for generating voice data corresponding to the sound picked up by each microphone and attaching to the voice data identification data identifying which microphone the voice data came from,
the feature data storage means classifies and stores the feature data based on the identification data, and
the search means, when identification data is specified, compares the feature data classified under the specified identification data with the search feature data in the search feature data storage means.

The voice data retrieval apparatus according to claim 1 or 2, wherein
the sound collection means comprises an array microphone whose sound collection direction is variable, sound collection direction control means for controlling the sound collection direction of the array microphone and outputting direction data indicating that direction, and voice data generation means for generating voice data corresponding to the sound picked up by the array microphone,
the feature data storage means classifies and stores the feature data based on the direction data, and
the search means, when direction data is specified, compares the feature data classified under the specified direction data with the search feature data in the search feature data storage means.