JPH0668168A - Video retrieval method and device by acoustic keyword

Info

Publication number: JPH0668168A
Application number: JP4217620A
Authority: JP
Inventors: Yoko Niikura; 陽子新倉; Hiroshi Hamada; 洋浜田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1992-08-17
Filing date: 1992-08-17
Publication date: 1994-03-11

Abstract

PURPOSE: To provide a video retrieval method and device using acoustic and speech keywords, in which a required video is retrieved with the keyword as the retrieval key. CONSTITUTION: For databases 2 and 3, which store video information and the acoustic information corresponding to it, the device provides an acoustic information analysis section that analyzes the features of the acoustic information and converts them into an acoustic feature parameter time series, and a keyword feature analysis section 8 that analyzes the features of the keyword sound serving as the video retrieval key and converts them into a feature parameter time series of the keyword sound. Video information is retrieved, using a sound contained in the acoustic information as the keyword, by comparing the acoustic feature parameter time series in the database with the feature parameter time series of the video retrieval key.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a video retrieval method and device using acoustic keywords. More particularly, it relates to a video retrieval method and device using speech keywords that, for a database holding video information together with acoustic information (speech audio in particular), retrieves video information using a word contained in the audio information as the keyword.

[0002]

[Prior Art] Advances in audio processing, video processing and other signal processing technologies have made multimedia processing possible: computers and workstations, which traditionally handled only numeric and character data, can now also handle audio information and video information as data. (The description below focuses on speech as the representative kind of acoustic information.) In addition, advances in data compression and growth in storage capacity have greatly increased the volume of data handled. Once many kinds of data are handled and accumulated in large volume, there is demand for retrieving the necessary information quickly and easily.

[0003] Information is generally retrieved by assigning indexes to it in advance and then extracting the indexed portions. With text data, which consists of character strings, retrieval is easy: as in character string search, matching a specified keyword against the data allows indexes to be assigned at the matching locations. With information that is continuous and whose content changes over time, such as video, it is very difficult to find something of a fixed shape and assign an index to that part of the video. Moreover, when pieces of video information differ from one another as images yet carry a common or related meaning for the viewer, a search method that relies on commonality among the images cannot find the required video.

[0004] The following approaches have conventionally been used for video retrieval.
1. Sequential search. The audio and video, which are continuous information, are played back together from start to finish until the required video is found.
2. Use of correspondence with time codes. The audio and video information is first played back from the beginning, and time codes are associated with it during that playback. The required video is then extracted using the time codes as a cue.

[0005] 3. Image index. A moving image, which is continuous data, is divided into sections based on, for example, change points in camera position (scene changes) or changes in a specific object in the image. Within each section, the portion with the least change, the portion with the greatest change, or the first image of the section is taken as the representative image of that section and used as an index, which then serves as a cue for extracting the required video.

[0006]

[Problems to Be Solved by the Invention] However, the methods described above have the following problems.
1'. Sequential search. Because this is basically a search that plays back the original data as is, with no indexing or symbolization, searching a large volume of information takes time.

[0007] 2'. Use of correspondence with time codes. A time code is a symbol with no direct relation to the audio or video information, yet the user must constantly keep track of the relation between this inherently meaningless symbol and the corresponding video.
3'. Image index. Some kinds of video have few scene changes or little variation in motion, which makes change points hard to find. As a result, indexes may be assigned with unnecessarily fine granularity or, conversely, too coarsely.

[0008] The present invention solves these problems of conventional video retrieval methods and devices. It provides a video retrieval method and device using speech keywords that makes it possible to index video, which has been difficult until now, and that allows video retrieval to be carried out easily.

[0009]

[Means for Solving the Problems] For a database having video information and audio information corresponding to that video information, a video retrieval method using speech keywords is configured as follows. The audio information is feature-analyzed and converted into an audio feature parameter time series. The spoken keyword that is to serve as the video search key is feature-analyzed and converted into a keyword feature parameter time series. The audio feature parameter time series in the database is then compared with the feature parameter time series of the video search key, so that video information is retrieved using a word contained in the audio information as the keyword.

A video retrieval device using speech keywords, which retrieves video information using a word contained in the audio information as the keyword, is also configured. It comprises: an audio information analysis unit 6 that, for databases 2 and 3 having video information and the corresponding audio information, feature-analyzes the audio information and converts it into an audio feature parameter time series; a keyword feature analysis unit 8 that feature-analyzes the spoken keyword serving as the video search key and converts it into a keyword feature parameter time series; a keyword section extraction unit 10 that extracts the portion corresponding to the keyword from the audio information by comparing the audio feature parameter time series in the database with the feature parameter time series of the video search key; an index assignment unit 12 that stores the address of the video information corresponding to the extracted section; a search video display unit 14 that displays the assigned index and the video corresponding to that section; and an audio/video output unit 15 that plays back the video and audio of the search result simultaneously.

[0010]

[Embodiments] For a database holding audio information and video information, video retrieval is carried out by using a word contained in the audio information as a keyword and assigning indexes to the video information. Audio information, like video information, is continuous and changes over time, but in ordinary conversation the words that serve as keywords occur repeatedly. It is therefore possible to use a keyword contained in the audio information to extract the video at the points where that keyword occurs. With this method, retrieval is possible even when the scenes differ as video but their audio content is related. In the present invention, to search such a database for a predetermined keyword, a spoken sample of the keyword is used to extract the utterances of that keyword present in the audio information. The extracted keyword occurrences serve as indexes, and the video corresponding to each index is shown on a display device, so that video information is retrieved by speech.

[0011] An embodiment of the invention is first described with reference to FIG. 1. FIG. 1 is a flow diagram of the configuration that takes in audio and video information, associates the audio with the video, and analyzes the audio information. The invention divides broadly into two processes: one that takes in the audio and video information, associates the two, and analyzes the audio information to convert it into audio feature parameters; and one that carries out the video retrieval.

[0012] The audio/video input unit 1 takes in audio information and video information from outside; the audio information is stored in the audio data storage unit 2 and the video information in the video data storage unit 3. Input normally arrives through two independent media, such as a microphone for the audio and a camera for the video, but it can also come from an external storage medium, such as a video tape recorder, on which audio and video are recorded together.

[0013] So that a computer can process it, the audio information is converted to digital form and stored in the audio data storage unit 2. In view of speech quality, the sampling frequency for digitization is preferably 8 kHz or higher. The description below assumes an audio sampling frequency of 12 kHz. Each audio sample is assigned a sampling number and stored in the audio data storage unit 2.

[0014] The video information, by contrast, needs no computer processing after input. Because it must be searched later, however, it is stored in the video data storage unit 3 in association with time information so that it can be accessed at random. Concrete methods include digitizing the video, loading it into the computer and storing it with addresses as indexes, or storing it on a video tape recorder together with time codes. The description below covers the case in which frame numbers are used for retrieval.

[0015] When considering the input of audio and video information, note that at 12 kHz sampling one audio sample is taken about every 83.3 microseconds, whereas video normally consists of 30 frames per second. Because the audio information and the video information are stored independently, the two must be associated with each other before keywords in the audio can be used to index the video. The association can be held as a calculation formula, built as a correspondence table, or handled in some other way. The description below takes the creation of a correspondence table as the example.
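The timing figures used in this description can be checked with a few lines of arithmetic. The sketch below is illustrative only; the constant and variable names are assumptions introduced here, not part of the patent. At 12 kHz the sample period is 1/12000 s, about 83.3 microseconds, and a 30 frames-per-second video spans 400 audio samples per frame.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 12_000  # audio sampling frequency assumed in the description
FRAME_RATE_FPS = 30      # video frames per second assumed in the description

# Time between consecutive audio samples, in microseconds (exact fraction).
sample_period_us = Fraction(1_000_000, SAMPLE_RATE_HZ)

# Number of audio samples that fall within one video frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_FPS

print(round(float(sample_period_us), 1), "microseconds per sample")
print(samples_per_frame, "samples per video frame")
```

Using `Fraction` keeps the period exact (250/3 microseconds), which avoids rounding drift if the value is later multiplied by large sample counts.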

[0016] The audio/video correspondence unit 4 creates an audio/video correspondence table 5 that records the correspondence between audio and video. If the data in the audio data storage unit 2 and the video data storage unit 3 can both be accessed by time, the audio/video correspondence table is unnecessary. The example described here assumes that the video information consists of 30 frames per second and is accessed with the frame number as the index, and that the audio information is stored with the number of each audio sample (the sampling number) as the index. In this case, each video frame corresponds to 400 audio samples.
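A minimal sketch of how such a correspondence table might be built, assuming 400 samples per frame and 1-based numbering for both frames and samples (the numbering is inferred from the lookup example later in the description, where frame 3 spans samples 801 to 1200). The function name `build_av_table` and the dictionary layout are hypothetical.

```python
def build_av_table(num_frames, samples_per_frame=400):
    """Return {frame_number: (first_sample, last_sample)}, all numbers 1-based."""
    table = {}
    for frame in range(1, num_frames + 1):
        first = (frame - 1) * samples_per_frame + 1  # first sample in this frame
        last = frame * samples_per_frame             # last sample in this frame
        table[frame] = (first, last)
    return table


table = build_av_table(4)
for frame, (first, last) in table.items():
    print(frame, first, last)
```

Because the mapping is regular, it could equally be held as a formula rather than a table, which is the alternative the description mentions.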

[0017] FIG. 2 is an example of the audio/video correspondence table, which records the correspondence between video frame numbers and audio sampling numbers. In this example the table stores, for each video frame number, the first and last audio sampling numbers corresponding to that frame. The audio information analysis unit 6 analyzes the digital audio data stored in the audio data storage unit 2 and converts it into a form from which portions matching a keyword can later be extracted. The form depends on the algorithm used to detect the matching portions, but spectral analysis is commonly used. For spectral analysis, techniques such as the fast Fourier transform and LPC analysis can be used. The analysis is applied repeatedly to digital audio data of a predetermined frame length, shifting by a predetermined frame shift length each time. As a result, a time series of audio spectra is stored in the audio feature parameter storage unit 7.
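The frame-by-frame spectral analysis can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the frame length (240 samples, 20 ms at 12 kHz) and frame shift (120 samples) are hypothetical values, a naive discrete Fourier transform stands in for the FFT (it yields the same magnitudes, only more slowly), and no window function is applied. Note that "frame" here means an audio analysis frame, not a video frame.

```python
import cmath
import math


def frame_signal(samples, frame_len, frame_shift):
    """Split samples into overlapping analysis frames, advancing by frame_shift."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, frame_shift)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectral_time_series(samples, frame_len=240, frame_shift=120):
    """Return one magnitude spectrum per analysis frame."""
    return [magnitude_spectrum(f)
            for f in frame_signal(samples, frame_len, frame_shift)]


# Illustrative input: 40 ms of a 1 kHz tone sampled at 12 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 12000) for n in range(480)]
series = spectral_time_series(tone)
peak_bin = max(range(len(series[0])), key=series[0].__getitem__)
print(len(series), "frames; peak at", peak_bin * 12000 / 240, "Hz")
```

With a 240-sample frame at 12 kHz each bin is 50 Hz wide, so the 1 kHz tone lands in bin 20.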

[0018] FIG. 3 is a flow diagram of the configuration for retrieving video by keyword according to the invention. The keyword feature analysis unit 8 analyzes the spoken keyword that is to serve as the search key and converts it into a feature parameter time series. The analysis method is the same as that used for the audio information in the audio information analysis unit 6. The result is stored in the keyword feature parameter storage unit 9. A keyword is a word, a piece of conversation or some other utterance contained in the audio information, and there are two ways to input it. The first is to cut the keyword out of audio information that has already been input; the second is to input the keyword afresh. The first method requires an interface for indicating the relevant portion of the input audio, which can be implemented, for example, by displaying the audio power or plotting the spectral sequence. With the second method, it is desirable that the keyword be spoken by the same person as the speaker in the audio information, so that keyword sections can be extracted from the audio information accurately.

[0019] The keyword section extraction unit 10 extracts the keyword sections in the audio information by comparing the feature parameters 7 of the stored audio information with the feature parameters 9 of the keyword. Word spotting, an applied technique of speech recognition, can be used here. Various word spotting methods have been proposed, but basically the keyword's feature parameters are shifted step by step along the audio information to find the sections that match. To compensate for stretching and compression of speech in time, it is common to use DP matching or another matching method that allows nonlinear time warping. Various other speech recognition and word spotting methods can also be used, such as vector-quantizing the audio features and searching for the relevant sections with an HMM (hidden Markov model).
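One way to realize DP matching for word spotting is subsequence dynamic time warping, in which the match may start at any point in the audio and the keyword is allowed to stretch or compress in time. The sketch below is an assumption-laden illustration: the function names, the Euclidean local distance, the three-way step pattern and the toy one-dimensional features are all choices made here, and it returns only the single best match rather than every occurrence.

```python
import math


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def spot_keyword(keyword, audio):
    """Find the audio span that best matches keyword by subsequence DP matching.

    Returns (start, end, cost) with 0-based inclusive analysis-frame indices.
    """
    n, m = len(keyword), len(audio)
    # cost[j], start[j]: best alignment of keyword[0..i] that ends at audio[j].
    cost = [distance(keyword[0], a) for a in audio]  # a match may start anywhere
    start = list(range(m))
    for i in range(1, n):
        new_cost = [math.inf] * m
        new_start = [0] * m
        for j in range(m):
            # Predecessor cells: (i-1, j), (i-1, j-1) and (i, j-1).
            candidates = [(cost[j], start[j])]
            if j > 0:
                candidates.append((cost[j - 1], start[j - 1]))
                candidates.append((new_cost[j - 1], new_start[j - 1]))
            best_cost, best_start = min(candidates)
            new_cost[j] = best_cost + distance(keyword[i], audio[j])
            new_start[j] = best_start
        cost, start = new_cost, new_start
    end = min(range(m), key=cost.__getitem__)
    return start[end], end, cost[end]


# Toy features: the keyword 1, 2, 3 appears in the audio with the "2" stretched.
audio = [[0], [0], [1], [2], [2], [3], [0], [0]]
keyword = [[1], [2], [3]]
print(spot_keyword(keyword, audio))  # → (2, 5, 0.0)
```

To report every occurrence rather than the best one, a practical variant would scan the final cost row for local minima below a threshold; that step is omitted here.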

[0020] When the keyword section extraction unit 10 finds a match between the audio information and the keyword, the audio sampling number of the start of the matching section is stored in the keyword section storage unit 11. The comparison runs continuously over all of the audio information, and the audio sampling numbers for every extracted keyword section are stored in the keyword section storage unit 11.

[0021] The index assignment unit 12 uses each audio sampling number stored in the keyword section storage unit 11 to look up the audio/video correspondence table 5 and obtain the frame number of the video information. This frame number is stored as index position data 13. FIG. 4 shows an example of stored index position data. A lookup in the audio/video correspondence table 5 is explained with FIG. 5. Suppose the keyword section storage unit 11 holds audio sampling number 1025. In the audio/video correspondence table 5, this number lies between 801, the audio sampling number at the start of video frame number 3, and 1200, the audio sampling number at the end of that frame. Audio sampling number 1025 therefore belongs to video frame 3.
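The table lookup in this example can be sketched as a simple range search. The function name `frame_for_sample` and the dictionary layout are hypothetical; the table rows assume 400 samples per frame with 1-based numbering, which reproduces the 801 to 1200 range given for frame 3.

```python
def frame_for_sample(av_table, sample_number):
    """Return the video frame whose sample range contains sample_number."""
    for frame, (first, last) in av_table.items():
        if first <= sample_number <= last:
            return frame
    raise ValueError(f"sample {sample_number} is outside the table")


av_table = {1: (1, 400), 2: (401, 800), 3: (801, 1200), 4: (1201, 1600)}
print(frame_for_sample(av_table, 1025))  # → 3
```

For a long recording a binary search over the start numbers, or the closed-form expression `(sample_number - 1) // 400 + 1`, would avoid the linear scan.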

[0022] In this way, the video frame at which a keyword occurs in the audio information can be identified, and its frame number is stored as index position data 13. In the example of FIG. 5, frame number 3 is stored as index position data. The search video display unit 14 displays representative frames as the search result, based on the index position data 13. There are various ways to choose a representative frame: use the first frame of the video corresponding to the spoken keyword; treat the span from one keyword occurrence to the next as a segment and take the part of that segment with the least motion; take the part of the segment with the greatest change; or treat a continuous stretch of audio containing the keyword as a block and take the part of that block with the least motion. Which method to use is decided by the kind of video being searched. For sports, for example, choosing the part with the greatest change, and for scenery the part with the least change, gives an effective display of the search results. Whichever method is used, one representative video frame is selected, starting from the indexed video frame, and shown on the screen. To use the first frame of the video corresponding to the spoken keyword, the video frame number in the index position data 13 is used as is to fetch the frame from the video data storage unit 3. To take the part with the least motion in a segment between keywords, the frame with the smallest inter-frame difference (or, in some cases, the largest) is chosen from the video frames lying between two consecutive frame numbers in the index position data 13. To take a representative frame from a block of audio containing the keyword, the audio information is examined again: the continuous stretch of speech containing the keyword is treated as an audio block, and the frame with the smallest inter-frame difference is chosen from the video frames corresponding to that block.
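Selecting a representative frame by inter-frame difference might look like the sketch below. It is one possible reading, with several assumptions: frames are represented as flat lists of pixel intensities, "change" is measured as the sum of absolute differences from the preceding frame, and the names `frame_change` and `representative_frame` are hypothetical.

```python
def frame_change(prev, curr):
    """Sum of absolute pixel differences between two frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr))


def representative_frame(frames, first_frame_number, mode="least"):
    """Pick the frame number in a segment with the least or greatest change.

    frames holds the segment's frames in order; first_frame_number is the
    frame number of frames[0]. Change is measured against the previous frame.
    """
    if len(frames) < 2:
        return first_frame_number
    changes = [frame_change(frames[i - 1], frames[i])
               for i in range(1, len(frames))]
    pick = min if mode == "least" else max
    best = pick(range(len(changes)), key=changes.__getitem__)
    return first_frame_number + best + 1  # changes[0] belongs to frames[1]


# Toy segment starting at frame 3: changes between frames are 1, 9 and 0.
segment = [[0, 0], [0, 1], [5, 5], [5, 5]]
print(representative_frame(segment, 3, mode="least"))     # → 6
print(representative_frame(segment, 3, mode="greatest"))  # → 5
```

The `mode` switch mirrors the choice described above: least change suits scenery, greatest change suits sports.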

[0023] The selected representative frames can be displayed in several ways: one frame at a time on the full display, with the operator stepping through them from the keyboard; several frames side by side on the display; and so on. In any case, the search video display unit 14 is configured to present the search results so that the operator can pick the desired video from among the indexed video.

[0024] The audio/video output unit 15 plays back the video information and audio information together, starting from a representative frame displayed on the search video display unit 14. The material to play can be chosen in two ways: the operator selects the video to play from the representative frames displayed on the search video display unit 14, or playback of each displayed video is started and stopped in turn, triggered by the operator's keyboard operation, so that all the representative videos are played exhaustively. Starting from the index position data from which the representative frame on the search video display unit 14 was derived, the audio/video output unit 15 looks up the audio/video correspondence table to obtain the video frame number and the audio sampling number corresponding to that frame, fetches the corresponding audio and video information from the audio data storage unit 2 and the video data storage unit 3, and plays them back simultaneously. FIG. 6 shows an example in which the audio/video output unit starts playback with frame number 3 as the video position and, by looking up the audio/video correspondence table, audio sampling number 801 (the start of that frame) as the audio position.
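The playback lookup is the reverse of the indexing lookup: given a frame number, find the audio sample at which that frame starts. A minimal sketch, with the hypothetical name `playback_start` and the same assumed table layout of 400 samples per frame:

```python
def playback_start(av_table, frame_number):
    """Return (frame_number, first_audio_sample) where synchronized playback begins."""
    first_sample, _last_sample = av_table[frame_number]
    return frame_number, first_sample


av_table = {1: (1, 400), 2: (401, 800), 3: (801, 1200)}
print(playback_start(av_table, 3))  # → (3, 801)
```

Starting both streams from the frame boundary, rather than from the exact sample where the keyword begins, keeps the audio and video aligned without sub-frame seeking.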

[0025] When a person searches a database holding video and audio information, it is often because they remember some part of the material and want to find it. What they remember may be video, audio, or both. Of these three cases, the invention is particularly effective when video is to be retrieved on the basis of remembered audio.

[0026]

[Effects of the Invention] As described above, the required video can be retrieved from a multimedia database holding video and audio information by using a keyword contained in the audio information as the search key. This makes it possible to search and access video information, which is continuous information, efficiently.

[Brief description of drawings]

FIG. 1 is a flow diagram showing the configuration that takes in audio and video information, associates the audio with the video, and analyzes the audio information.

FIG. 2 is an audio/video correspondence table showing the correspondence between video frame numbers and audio sampling numbers.

FIG. 3 is a flow diagram showing the configuration for retrieving video by keyword.

FIG. 4 is a table of index position data.

FIG. 5 is an example of looking up the audio/video correspondence table when assigning an index.

FIG. 6 is an example of looking up the audio/video correspondence table for audio/video output.

[Explanation of symbols]

2: audio data storage unit
3: video data storage unit
6: audio information analysis unit
8: keyword feature analysis unit
10: keyword section extraction unit
12: index assignment unit
14: search video display unit
15: audio/video output unit

Claims

[Claims]

1. A video retrieval method using acoustic keywords, characterized in that, for a database having video information and acoustic information corresponding to the video information, the acoustic information is feature-analyzed and converted into an acoustic feature parameter time series, a keyword sound that is to serve as a video search key is feature-analyzed and converted into a feature parameter time series of the keyword sound, and the acoustic feature parameter time series in the database is compared with the feature parameter time series of the video search key, whereby video information is retrieved using a sound contained in the acoustic information as the keyword.

2. A video retrieval device using acoustic keywords, characterized in that it retrieves video information using a sound contained in acoustic information as the keyword and comprises: an acoustic information analysis unit that, for a database having video information and acoustic information corresponding to the video information, feature-analyzes the acoustic information and converts it into an acoustic feature parameter time series; a keyword feature analysis unit that feature-analyzes a keyword sound that is to serve as a video search key and converts it into a feature parameter time series of the keyword sound; a keyword section extraction unit that extracts the portion corresponding to the keyword from the acoustic information by comparing the acoustic feature parameter time series in the database with the feature parameter time series of the video search key; an index assignment unit that stores the address of the video information corresponding to the extracted section; a search video display unit that displays the assigned index and the video corresponding to that section; and an audio/video output unit that plays back the video and audio of the search result simultaneously.