JP2008134979A

JP2008134979A - Moving picture retrieval method and moving picture retrieval system

Info

Publication number: JP2008134979A
Application number: JP2006346724A
Authority: JP
Inventors: Seiya Takada; 靖也高田
Original assignee: INTELEGAL CO Ltd
Current assignee: INTELEGAL CO Ltd
Priority date: 2006-11-28
Filing date: 2006-11-28
Publication date: 2008-06-12

Abstract

PROBLEM TO BE SOLVED: To prevent search keywords from becoming obsolete with the passage of time or the like, and to retrieve moving pictures using keywords suited to the times, in keyword retrieval of the many moving pictures registered on a moving picture site on the Internet or the like.

SOLUTION: The moving picture retrieval system comprises a moving picture registration interface for entering a keyword's kana reading, a voice recognition part 7 that performs voice recognition on the moving picture files in a moving picture database on the basis of a phoneme string generated from the kana reading, an index database 5 to which new vocabulary obtained from the voice recognition result is added, and a data processing part 4 that controls the whole system by issuing SQL statements to the respective databases.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a search method and a search system for moving images registered on a site.

Thanks to advances in speech recognition technology, recognition accuracy of 90% or more for word speech recognition and 70% or more for continuous speech recognition can now be obtained in a noise-free environment. Using such technology, it has become possible to perform speech recognition on the audio signal accompanying a video signal, detect vocabulary, and associate that vocabulary with the video signal. As a result, it has become possible to extract keywords from the moving images on a moving image site by speech recognition, index them, and search on them.

In Patent Document 1, an audio signal contained in a received video signal is converted into character information by speech recognition, and the resulting character string is associated, as a keyword, with the video signal or the audio signal. By using a keyword, a search user can extract the video signal associated with it. However, the vocabulary that can serve as keyword candidates is limited to the contents of a static vocabulary database prepared in advance, so it is practically difficult to use as keywords the new words, coined words, and proper nouns that change from day to day.

Patent Document 1: JP 2006-54517 A

Thanks to advances in speech recognition technology, recognition accuracy of 90% or more for word speech recognition and 70% or more for continuous speech recognition can now be obtained in a noise-free environment. Using such technology, it has become possible to perform speech recognition on the audio signal accompanying a video signal, detect vocabulary, and associate that vocabulary with the video signal. However, the number of words that can be recognized depends on the number of words in the large vocabulary database. If the vocabulary is small, recognition accuracy improves, but most spoken words end up being treated as unknown words. Conversely, if the vocabulary is large, fewer words are treated as unknown, but words that are phonologically close to one another may be misrecognized. For this reason, it is desirable to keep the number of words in the large vocabulary database within a certain range while improving its quality, that is, to ensure that it always holds vocabulary in keeping with the times.

A static vocabulary database cannot support speech recognition of new words, coined words, proper nouns, and the like that change from day to day; all of them are processed as unknown words. For example, the names of politicians or entertainers who have suddenly become well known are not normally contained in the vocabulary database, yet precisely because they have suddenly become well known, many users are likely to want to search with them as keywords. However, since it is practically difficult to update the contents of the vocabulary database day by day to keep up with the times, there has been a problem that the very words users most want to use as keywords cannot be used.

When a registered user newly registers a moving image, the present invention prompts the user to enter keywords appropriate to that moving image together with their kana readings, adds the entered keywords to the large vocabulary database, performs speech recognition on the moving images already registered using the phoneme string obtained from each word's kana reading, and, when the phoneme string is detected, adds the word to the index database. The present invention is particularly effective on interactive media with many users, such as moving image sites on the Internet.

The present invention is further characterized in that words with a high appearance frequency in the index database are categorized, so that the search user's need for category search is also met.

In addition, to prevent the large vocabulary database from growing excessively through the addition of new vocabulary, the present invention deletes from the large vocabulary database keywords that have become obsolete, for example through the passage of time, as determined from statistics on the search keywords used by search users. The contents of the large vocabulary database are thereby optimized while the number of words is always kept within a certain range.

According to the present invention, the search user can always obtain appropriate moving image search results.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a basic configuration diagram for realizing an embodiment of the moving image search system of the present invention.

The moving image registration interface 2 provides a moving image file uploader and a keyword input field. The keyword input field consists of a character string input field and a kana reading input field.

The URI of the registered moving image file is inserted into the moving image database 3.

For the registered moving image file, the data processing unit 4 instructs the speech recognition unit 7 to perform keyword spotting based on the vocabulary in the large vocabulary database 6. The vocabulary ID in the large vocabulary database 6 of each word detected by keyword spotting is inserted into the index database 5 together with the moving image ID of the file in the moving image database 3. For function words such as particles, and for other words that are unsuitable as search keywords, the value of the importance field in the large vocabulary database 6 is set to False in advance so that they are not inserted into the index database 5.
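The indexing step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table names `vocabulary` and `video_index`, the column names, and the function `index_spotted_words` are all hypothetical, and the list of spotted words stands in for the output of the speech recognition unit 7.

```python
import sqlite3

def index_spotted_words(conn, video_id, spotted_words):
    """Insert (video_id, vocab_id) pairs for spotted words whose importance flag is set."""
    for word in spotted_words:
        row = conn.execute(
            "SELECT vocab_id, important FROM vocabulary WHERE surface = ?", (word,)
        ).fetchone()
        if row is None or not row[1]:
            continue  # word not in the vocabulary, or flagged False (e.g. a particle)
        conn.execute(
            "INSERT OR IGNORE INTO video_index (video_id, vocab_id) VALUES (?, ?)",
            (video_id, row[0]),
        )
    conn.commit()

# Hypothetical in-memory schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vocabulary (
    vocab_id INTEGER PRIMARY KEY, surface TEXT UNIQUE, important INTEGER);
CREATE TABLE video_index (
    video_id INTEGER, vocab_id INTEGER, PRIMARY KEY (video_id, vocab_id));
INSERT INTO vocabulary VALUES (1, '選挙', 1);
INSERT INTO vocabulary VALUES (2, 'は', 0);
""")
index_spotted_words(conn, 10, ["選挙", "は", "未知語"])
indexed = conn.execute("SELECT video_id, vocab_id FROM video_index").fetchall()
print(indexed)
```

Only the content word is indexed here; the particle is skipped because its flag is False, and the word absent from the vocabulary is ignored.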

In speech recognition, large vocabulary continuous speech recognition yields more vocabulary information than keyword spotting and is therefore more effective. In practice, however, the accuracy of continuous speech recognition for unspecified speakers is low, and the grammar of spoken language in particular is not well defined, so the present invention adopts keyword spotting. It is desirable that large vocabulary continuous speech recognition be used in the future.

The large vocabulary database 6 is also looked up for the character string entered in the character string input field, and the vocabulary ID returned is inserted into the index database 5 together with the moving image ID of the file in the moving image database 3.

If the character string entered in the character string input field does not exist in the large vocabulary database 6, the data processing unit 4 converts the string entered in the kana reading input field into a phoneme string in a form usable by the speech recognition unit 7, and inserts it into the large vocabulary database 6 together with the character string entered in the character string input field.
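As a rough sketch of this step, the snippet below converts a kana reading into a phoneme list and registers a word only if it is not already known. The mapping table is a deliberately tiny, hypothetical subset (one kana per entry, no handling of small kana, long vowels, or geminates), and the phoneme notation is an assumption; the patent does not specify the format expected by the speech recognition unit 7.

```python
# Hypothetical, heavily abbreviated kana-to-phoneme table (assumed notation).
KANA_TO_PHONEMES = {
    "あ": ["a"], "い": ["i"], "う": ["u"], "え": ["e"], "お": ["o"],
    "か": ["k", "a"], "き": ["k", "i"], "く": ["k", "u"],
    "け": ["k", "e"], "こ": ["k", "o"],
    "た": ["t", "a"], "だ": ["d", "a"], "ん": ["N"],
}

def kana_to_phonemes(reading):
    """Convert a hiragana reading into a flat list of phoneme symbols."""
    phonemes = []
    for ch in reading:
        if ch not in KANA_TO_PHONEMES:
            raise ValueError(f"no phoneme mapping for {ch!r}")
        phonemes.extend(KANA_TO_PHONEMES[ch])
    return phonemes

def register_keyword(vocabulary, surface, reading):
    """Add surface -> phoneme string if surface is new; return True when added."""
    if surface in vocabulary:
        return False
    vocabulary[surface] = kana_to_phonemes(reading)
    return True

vocabulary = {}
added_first = register_keyword(vocabulary, "高田", "たかだ")
added_again = register_keyword(vocabulary, "高田", "たかだ")
print(vocabulary["高田"], added_first, added_again)
```

A production system would need a complete kana table and rules for multi-character units; the dictionary stands in for the large vocabulary database 6.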

The data processing unit 4 then instructs the speech recognition unit 7 to perform keyword spotting for that phoneme string on the existing moving image files already inserted in the moving image database 3.

If the phoneme string is detected as a result of keyword spotting on an existing moving image file in the moving image database 3, the moving image ID in the moving image database 3 and the vocabulary ID for that phoneme string in the large vocabulary database 6 are inserted into the index database 5. Because the data processing unit 4 holds the vocabulary IDs for which keyword spotting has not yet been performed, homophones written with different characters are not mistakenly inserted into the index database 5.
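The retroactive spotting pass can be illustrated with the simplified sketch below. It assumes, purely for illustration, that each existing moving image has already been reduced to a phoneme list and that spotting is an exact sub-sequence match; real keyword spotting operates on audio with acoustic scoring. The names `spot_pending`, `contains_sequence`, and the data layout are hypothetical. The example also shows why tracking pending vocabulary IDs matters: two homophones share one phoneme string, but only the pending ID is indexed.

```python
def contains_sequence(haystack, needle):
    """Return True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )

def spot_pending(vocabulary, pending_ids, video_phonemes, index):
    """Spot each pending word in every existing video, then clear the queue."""
    for vocab_id in pending_ids:
        phonemes = vocabulary[vocab_id]["phonemes"]
        for video_id, track in video_phonemes.items():
            if contains_sequence(track, phonemes):
                index.add((video_id, vocab_id))
    pending_ids.clear()

# Two homophones (same phoneme string, different characters); only ID 2 is new.
kouen = ["k", "o", "u", "e", "N"]
vocabulary = {
    1: {"surface": "公園", "phonemes": kouen},
    2: {"surface": "講演", "phonemes": kouen},
}
pending_ids = [2]
video_phonemes = {
    100: ["a", "k", "o", "u", "e", "N", "o"],
    101: ["s", "a", "k", "u", "r", "a"],
}
index = set()
spot_pending(vocabulary, pending_ids, video_phonemes, index)
print(index, pending_ids)
```

The index is keyed by vocabulary ID rather than by phoneme string, so the match in video 100 is attributed only to the newly added word.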

The large vocabulary database 6 holds, as statistical information, the number of times search users 9 have searched for each word. When the addition of new vocabulary causes the number of records in the large vocabulary database 6 to exceed a certain value, words with low search counts are deleted on the basis of this information. Optimizing the vocabulary in this way helps prevent the set of search keywords from becoming obsolete, speeds up speech recognition processing, and reduces misrecognition.
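A minimal sketch of this pruning rule follows. The function name `prune_vocabulary`, the dictionary of search counts, and the alphabetical tie-break are assumptions made for illustration; the patent only states that words with low search counts are deleted once the record count exceeds a certain value.

```python
def prune_vocabulary(search_counts, max_size):
    """Delete the least-searched words until at most max_size remain.

    search_counts maps word -> number of searches and is modified in place.
    Returns the removed words, least-searched first.
    """
    excess = len(search_counts) - max_size
    if excess <= 0:
        return []
    # Sort by (count, word) so that ties are broken deterministically.
    victims = sorted(search_counts, key=lambda w: (search_counts[w], w))[:excess]
    for word in victims:
        del search_counts[word]
    return victims

counts = {"alpha": 50, "beta": 2, "gamma": 7, "delta": 0}
removed = prune_vocabulary(counts, max_size=2)
print(removed, counts)
```

One design question the source leaves open is how to treat words added very recently, which have had no time to accumulate searches; a grace period would be one option, but that is not described in the patent.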

For words with a high appearance frequency in the index database 5, the data processing unit 4 presents them as categories on the moving image search interface 8, thereby meeting the search user 9's needs not only for keyword search but also for category search.
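One way to read this step is sketched below: count how many distinct moving images each word is linked to in the index, and promote words at or above a threshold to categories. The threshold parameter `min_videos` and the function name are hypothetical; the patent does not define what counts as a high appearance frequency.

```python
from collections import Counter

def frequent_vocab_categories(index_pairs, min_videos):
    """Return words linked to at least min_videos distinct videos, most frequent first."""
    # set() removes duplicate (video_id, word) pairs before counting.
    counts = Counter(word for _, word in set(index_pairs))
    frequent = [word for word, n in counts.items() if n >= min_videos]
    return sorted(frequent, key=lambda w: (-counts[w], w))

pairs = [
    (1, "news"), (2, "news"), (3, "news"),
    (1, "music"), (2, "music"),
    (3, "cat"),
]
categories = frequent_vocab_categories(pairs, min_videos=2)
print(categories)
```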

When the number of moving image files subject to speech recognition is large and the speech recognition load is high, the load can be eased by running keyword spotting for multiple new words in batches, for example once a certain number of new words has accumulated or at fixed intervals such as once a week.
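The batching policy described above might look like the following sketch. The class `SpottingBatcher`, its parameters, and the use of plain numeric timestamps in seconds are assumptions for illustration; the returned batch is what would be handed to a single keyword spotting pass over the existing moving images.

```python
class SpottingBatcher:
    """Collect new vocabulary IDs and release them in batches.

    A batch is released when max_pending IDs have accumulated, or when
    max_age_seconds have passed since the last release and something is pending.
    """

    def __init__(self, max_pending, max_age_seconds, now):
        self.max_pending = max_pending
        self.max_age_seconds = max_age_seconds
        self.last_flush = now
        self.pending = []

    def add(self, vocab_id, now):
        """Queue a new vocabulary ID; return a batch if one is due, else []."""
        self.pending.append(vocab_id)
        return self.flush_if_due(now)

    def flush_if_due(self, now):
        """Return and clear the pending batch if either trigger has fired."""
        full = len(self.pending) >= self.max_pending
        stale = bool(self.pending) and now - self.last_flush >= self.max_age_seconds
        if not (full or stale):
            return []
        batch, self.pending = self.pending, []
        self.last_flush = now
        return batch

WEEK = 7 * 24 * 3600
batcher = SpottingBatcher(max_pending=3, max_age_seconds=WEEK, now=0)
results = [batcher.add(1, now=10), batcher.add(2, now=20), batcher.add(3, now=30)]
after_fourth = batcher.add(4, now=40)
weekly = batcher.flush_if_due(now=30 + WEEK)
print(results, after_fourth, weekly)
```

Passing the current time in explicitly keeps the sketch deterministic; a real deployment would more likely rely on a scheduler.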

The moving image registration interface 2 and the moving image search interface 8 exist for the convenience of users. If the registered user or search user of the present invention is not assumed to be a human, these interfaces are not necessarily required to realize the system. For example, a search robot that crawls the Web does not necessarily need them.

The present invention can be applied to industrial fields such as the provision of moving image search services on moving image sites.

FIG. 1 is a basic configuration diagram for realizing an embodiment of the moving image search system of the present invention.

FIG. 2 is a basic configuration diagram of the tables of each database in the present invention.

Explanation of symbols

1 Registered user
2 Moving image registration interface
3 Moving image database
4 Data processing unit
5 Index database
6 Large vocabulary database
7 Speech recognition unit
8 Moving image search interface
9 Search user

Claims

A moving picture search method and a moving picture search system equipped with a dynamic large vocabulary database, characterized in that a keyword entered by a user registering a moving picture on a moving picture site is added to a large vocabulary database, speech recognition is performed on the moving pictures already registered using a phoneme string obtained from the kana reading of the word, and, when the phoneme string is detected, the word is added to an index database, whereby the group of keywords usable for search is always kept in line with the times and the purpose, and situations in which a moving picture search user cannot obtain appropriate search results because of unknown words such as new words, coined words, and proper nouns are avoided as far as possible.

A moving picture search method and a moving picture search system characterized in that words with a high appearance frequency in the index database of claim 1 are categorized, so as to meet the need not only for the keyword search of claim 1 but also for category search.

A moving picture search method and a moving picture search system characterized in that, in order to prevent the large vocabulary database from growing excessively through the addition of new vocabulary, keywords that have become obsolete, for example through the passage of time, as determined from statistics on the search keywords used by search users, are deleted from the large vocabulary database, whereby the contents of the large vocabulary database are optimized while the number of words is always kept within a certain range.