JP6208631B2 - Voice document search device, voice document search method and program

Info

Publication number: JP6208631B2
Application number: JP2014138333A
Authority: JP
Inventors: 隆伸大庭; 記良鎌土
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2017-10-04
Anticipated expiration: 2034-07-04
Also published as: JP2016018229A

Description

The present invention relates to a technique for searching voice documents such as audio files and audio archives.

In recent years, devices such as smartphones and IC recorders have made audio recording easy, and large numbers of voice documents are being accumulated. Finding a desired voice document among a large number of them is not easy, so voice document search technology is needed.

Voice document search is often implemented with essentially the same techniques as text search. Voice documents are converted to text by speech recognition, and the user searches by giving text as a query. Voice documents whose utterance content is similar to the text query are returned as search results.

One characteristic of voice document search, however, is the need for speaker-specified search, that is, searching for "who said what". For example, given a large number of voice documents recorded at company meetings, a user who wants to listen again to a supervisor's remarks has to search with the supervisor designated as the speaker.

In conventional methods, speaker specification has also been implemented with the same techniques as text search. A speaker label (speaker name) is attached to each voice document, and the speaker label is passed as a query at search time. Because the speaker label is itself text, the speaker can be specified within the same framework as text search.

It is difficult, however, to attach speaker labels to every one of a large number of voice documents. Non-Patent Document 1 states that speaker name information obtained by speaker identification is expected to be applicable to metadata production and similar uses, and reports a system in which speaker names are registered continuously and repeatedly. Even so, not all speech is given a speaker name, and such a system is built at considerable cost by broadcasters and similar organizations that expect revenue from long-term use of their content. It is not something that can realistically be demanded of ordinary users.

Techniques for attaching speaker labels automatically are therefore used. One example is the technique of Non-Patent Document 2, in which speaker models combined with utterance tendencies are prepared in advance and each voice document is checked to determine which speaker model it fits. The speaker model that best fits the voice document (assumed here to be spoken by a single speaker) is selected, and the corresponding speaker label is attached to the voice document.

Non-Patent Document 1: Akio Kobayashi, Takahiro Oku, Shinichi Homma, Shoei Sato, Toru Imai, "Automatic transcription system for news programs for content utilization", IEICE Transactions, vol. J93-D(10), pp. 2085-2095, 2010.
Non-Patent Document 2: Keita Yamamuro, Katsunobu Ito, "Speaker annotation considering caption information of digital broadcasts and utterance tendencies", 74th National Convention of the Information Processing Society of Japan, 2012(1), pp. 619-620, 2012.

When searching a database of a large number of voice documents for a desired voice document, the ability to specify a speaker is extremely useful. To make speaker specification possible, a speaker label (speaker name) must be attached to each voice document in advance. In practice, however, attaching speaker labels to all voice documents is difficult.

Conventional voice document search methods that rely on speaker labels have the problem that voice documents without a speaker label cannot be properly retrieved by a speaker-specified search.

Conventional voice document search techniques that attach speaker labels automatically require speaker models for every speaker to be prepared in advance. In actual voice document search, however, when a large number of voice documents exist, unknown speakers are unavoidable and this assumption does not hold.

An object of the present invention is to provide a speaker-specified voice document search technique that does not require speaker labels.

To solve the above problem, the voice document search device of the present invention includes: a voice document storage unit that stores a plurality of voice documents by a plurality of speakers; a speaker feature vector space similarity calculation unit that calculates a speaker similarity from a target speaker feature vector, which is the speaker feature vector of the speaker to be searched for, and the speaker feature vector of the speaker who uttered each voice document; and a search result output unit that outputs voice documents with high speaker similarity.

The voice document search technique of the present invention realizes speaker-specified voice document search without speaker labels by imposing a constraint that uses a speaker feature vector as the query. Voice documents no longer need to be curated with appropriate speaker labels, so the labor and cost of maintaining them can be reduced.

Voice document search also demands speed, and text query input must be taken into account as well. In the present invention, the similarity between the query and each voice document is calculated in the vector space of speaker features and text using a similarity measure of low computational cost, which realizes fast speaker-specified voice document search.

FIG. 1 is a diagram illustrating the functional configuration of the voice document search device of the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the voice document search method of the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the voice document search device of the second embodiment.
FIG. 4 is a diagram illustrating the processing flow of the voice document search method of the second embodiment.
FIG. 5 is a diagram illustrating the functional configuration of the voice document search device of a modification of the first embodiment.
FIG. 6 is a diagram illustrating the functional configuration of the voice document search device of a modification of the second embodiment.
FIG. 7 is a diagram illustrating a method for generating a target speaker feature vector.
FIG. 8 is a diagram illustrating the processing flow of the voice document search method of the modification of the first embodiment.
FIG. 9 is a diagram illustrating the processing flow of the voice document search method of the modification of the second embodiment.
FIG. 10 is a diagram illustrating the functional configuration of the voice document search device of the third embodiment.
FIG. 11 is a diagram illustrating the processing flow of the voice document search method of the third embodiment.

The present invention is a voice document search device and method devised on the premise that the voice document database contains voice documents without speaker labels, with the aim of making even such voice documents retrievable by a search that specifies a speaker. The device and method also secure the search speed that voice document search requires.

In the present invention, the speaker to be searched for is specified by using a speaker feature vector, rather than a speaker label, as the query. Each voice document in the database is likewise represented by a speaker feature vector, and speaker-specified voice document search is realized through the similarity between speaker feature vectors.

The speaker feature is constructed as a single vector per voice document (hereinafter, the speaker feature vector). Similarity is then calculated as the similarity between two vectors in the vector space. Any similarity measure widely used in information processing, such as cosine similarity, may be used. Many vector similarity measures are cheap to compute, and adopting one of them enables fast search.
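The following is a minimal sketch of comparing one query speaker feature vector against the speaker feature vector of each voice document with cosine similarity. The function name, the document IDs, and the three-dimensional vectors are hypothetical and chosen only for illustration; actual speaker feature vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical speaker feature vectors: one for the query, one per voice document.
query_speaker = [1.0, 0.0, 1.0]
doc_speakers = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
}

speaker_similarity = {
    doc_id: cosine_similarity(query_speaker, vec)
    for doc_id, vec in doc_speakers.items()
}
```

Each comparison costs one pass over the vector, which is why a measure of this kind suits a search over many documents.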

The voice document search device and method according to the first embodiment of the present invention combine a similarity based on the text query with the speaker similarity to determine the voice documents that form the final search result.

The method of calculating text similarity is arbitrary and is not limited to vector representations. In web search engines, for example, queries are limited to a small set of words. Documents are retrieved by taking the AND or OR of the documents in which each word appears, and a quantity such as the number of query words that appear is treated as the similarity. The first embodiment also permits text similarities of this kind, which are not calculated as a similarity in a vector space.
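A non-vector text similarity of the kind described above might be sketched as follows. This is an illustration only: the function name, the `mode` parameter, and the sample transcript are hypothetical, and the score is simply the number of distinct query words found in the document.

```python
def keyword_match_score(query_words, document_words, mode="or"):
    """Score a document by how many distinct query words it contains.

    mode="or":  every matching query word adds to the score.
    mode="and": the score is 0 unless all query words appear.
    """
    query = set(query_words)
    present = query & set(document_words)
    if mode == "and" and present != query:
        return 0
    return len(present)

# Hypothetical recognition-result text, already split into words.
transcript = ["the", "budget", "review", "is", "postponed"]

or_score = keyword_match_score(["budget", "schedule"], transcript, mode="or")
and_score = keyword_match_score(["budget", "schedule"], transcript, mode="and")
```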

In document search that uses a text passage as the query, on the other hand, a vector whose elements are quantities such as the frequency of each word over the whole document collection (hereinafter, a word vector) is widely used as the feature vector. The query and each document in the database are each represented by one word vector, and search is performed after defining a measure such as cosine similarity.
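As a rough illustration of a word vector, the sketch below counts how often each vocabulary word occurs in a word sequence. Raw counts over a fixed vocabulary are an assumption made here for simplicity; the text leaves the exact element definition open, and the vocabulary and word lists are hypothetical.

```python
from collections import Counter

def word_vector(words, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(words)
    return [float(counts[term]) for term in vocabulary]

# Hypothetical vocabulary shared by the query and all documents.
vocabulary = ["budget", "meeting", "schedule"]

query_vec = word_vector(["budget", "meeting"], vocabulary)
doc_vec = word_vector(["meeting", "schedule", "meeting"], vocabulary)
```

Because the query and every document use the same vocabulary order, their vectors can be compared element by element with any vector similarity measure.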

The voice document search device and method according to the second embodiment of the present invention concatenate the speaker feature vector and the word vector to create a new vector (hereinafter, the speaker feature word vector) and calculate similarity in the speaker feature word vector space. Compared with the first embodiment, the second embodiment constrains the text-side processing, but the speaker similarity and the text similarity no longer need to be calculated separately. This can be expected to be faster than calculating the similarity between speaker feature vectors and the similarity between word vectors separately.

In the present invention the query is a speaker feature vector, but a speaker feature vector can evidently be calculated mechanically from audio data by predetermined steps. Embodiments in which audio data is used as the query therefore also fall within the scope of the present invention.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference number, and duplicate description is omitted.

[First embodiment]
As shown in FIG. 1, the voice document search device of the first embodiment includes, for example, a voice query input unit 10, a text query input unit 11, a speaker feature vector extraction unit 12, a word vector extraction unit 13, a speaker feature vector space similarity calculation unit 14, a word vector space similarity calculation unit 15, a similarity summation unit 16, a search result output unit 17, a voice document storage unit 18, and a similarity storage unit 19.

The voice document search device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The device executes each process under the control of the central processing unit, for example. Data input to the device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed and used in other processes. At least part of each processing unit of the device may be implemented in hardware such as an integrated circuit.

Each storage unit of the voice document search device can be implemented by, for example, a main storage device such as RAM (random access memory); an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory; or middleware such as a relational database or a key-value store. The storage units need only be logically separated and may be stored in a single physical storage device.

The voice document storage unit 18 stores voice documents spoken by a plurality of speakers.

The voice document search method of the first embodiment is described with reference to FIG. 2.

In step S10, audio data or a speaker feature vector of the speaker to be searched for is input to the voice query input unit 10 as a query (hereinafter, the voice query). When audio data is input, the voice query is sent to the speaker feature vector extraction unit 12. When a speaker feature vector is input, the voice query is sent to the speaker feature vector space similarity calculation unit 14; in this case, the voice document search device need not include the speaker feature vector extraction unit 12. The specific construction of the speaker feature vector is described later.

In step S11, text data or a word vector to be searched for is input to the text query input unit 11 as a query (hereinafter, the text query). When text data is input, the text query is sent to the word vector extraction unit 13. When a word vector is input, the text query is sent to the word vector space similarity calculation unit 15; in this case, the voice document search device need not include the word vector extraction unit 13. Details of the word vector are described later.

In step S12, the speaker feature vector extraction unit 12 extracts a speaker feature vector from the input voice query. The extracted speaker feature vector is sent to the speaker feature vector space similarity calculation unit 14.

In step S13, the word vector extraction unit 13 extracts a word vector from the input text query. The extracted word vector is sent to the word vector space similarity calculation unit 15.

In step S14, the speaker feature vector space similarity calculation unit 14 calculates a speaker similarity, according to a predetermined similarity measure, from the speaker feature vector of each voice document stored in the voice document storage unit 18 and the input speaker feature vector. The calculated speaker similarity is sent to the similarity summation unit 16.

Any similarity measure can be used for the similarity between speaker feature vectors. One typical choice is cosine similarity; another is the inner product.

In step S15, the word vector space similarity calculation unit 15 calculates a text similarity, according to a predetermined similarity measure, from the word vector of each voice document stored in the voice document storage unit 18 and the input word vector. The calculated text similarity is sent to the similarity summation unit 16.

As with speaker feature vectors, any similarity measure can be used for the similarity between word vectors.

In step S16, the similarity summation unit 16 combines the speaker similarity and the text similarity according to a predetermined method to calculate the similarity between the voice document and the query. The calculated similarity is stored in the similarity storage unit 19.

Methods of combining the speaker similarity and the text similarity include addition, multiplication, addition in the logarithmic domain, and weighted versions of these operations. The weights may be set manually, for example through preliminary experiments, to values judged best in terms of search accuracy and similar criteria.
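The combination methods listed above could be sketched as below. The function name, the `method` strings, and the default weights are hypothetical. Treating the weights as exponents in the multiplicative case is an interpretation made here, chosen so that weighted multiplication and weighted addition in the logarithmic domain agree; both assume strictly positive similarities.

```python
import math

def combine(speaker_sim, text_sim, method="add", w_speaker=1.0, w_text=1.0):
    """Combine a speaker similarity and a text similarity into one score."""
    if method == "add":
        return w_speaker * speaker_sim + w_text * text_sim
    if speaker_sim <= 0.0 or text_sim <= 0.0:
        raise ValueError("multiply and log_add assume positive similarities")
    if method == "multiply":
        # Weights act as exponents (an assumption of this sketch).
        return (speaker_sim ** w_speaker) * (text_sim ** w_text)
    if method == "log_add":
        return w_speaker * math.log(speaker_sim) + w_text * math.log(text_sim)
    raise ValueError(f"unknown method: {method}")

# Hypothetical similarities and weights.
added = combine(0.8, 0.5, "add", w_speaker=0.7, w_text=0.3)
multiplied = combine(0.8, 0.5, "multiply", w_speaker=0.7, w_text=0.3)
log_added = combine(0.8, 0.5, "log_add", w_speaker=0.7, w_text=0.3)
```

With this interpretation, `log_added` is the logarithm of `multiplied`, so the two produce the same ranking of documents.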

In step S17, the search result output unit 17 outputs, as the search result, voice documents whose similarity stored in the similarity storage unit 19 is high. The output may be the recognition result text obtained by speech recognition of the voice document, or the audio data contained in the voice document may be played back as a waveform.
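One way to select the documents with high similarity is to return the k highest-scoring entries. The text does not say whether a fixed count or a threshold is used, so the top-k rule, the function name, and the sample scores below are assumptions.

```python
import heapq

def top_k(similarity_by_document, k):
    """Return the k (document ID, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, similarity_by_document.items(), key=lambda item: item[1])

# Hypothetical contents of the similarity storage unit.
scores = {"doc_a": 0.42, "doc_b": 0.91, "doc_c": 0.67}

results = top_k(scores, 2)
```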

[Second embodiment]
As shown in FIG. 3, the voice document search device of the second embodiment includes, as in the first embodiment, a voice query input unit 10, a text query input unit 11, a speaker feature vector extraction unit 12, a word vector extraction unit 13, a search result output unit 17, a voice document storage unit 18, and a similarity storage unit 19, for example, and further includes a voice document speaker feature word vector creation unit 20, a query speaker feature word vector creation unit 21, and a speaker feature word vector space similarity calculation unit 22.

The voice document search method of the second embodiment is described with reference to FIG. 4. The description below focuses on the differences from the first embodiment.

In step S20, the voice document speaker feature word vector creation unit 20 concatenates the speaker feature vector extracted from a voice document stored in the voice document storage unit 18 with the word vector extracted from the recognition result text obtained by speech recognition of that voice document, thereby creating a voice document speaker feature word vector. The created voice document speaker feature word vector is sent to the speaker feature word vector space similarity calculation unit 22.

In step S21, the query speaker feature word vector creation unit 21 concatenates the speaker feature vector input from the speaker feature vector extraction unit 12 with the word vector input from the word vector extraction unit 13, thereby creating a query speaker feature word vector. The created query speaker feature word vector is sent to the speaker feature word vector space similarity calculation unit 22.

In step S22, the speaker feature word vector space similarity calculation unit 22 calculates the similarity between the voice document and the query, according to a predetermined similarity measure, from the voice document speaker feature word vector output by the voice document speaker feature word vector creation unit 20 and the query speaker feature word vector output by the query speaker feature word vector creation unit 21, and stores the result in the similarity storage unit 19.
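Steps S20 to S22 can be sketched as follows, using the inner product as the similarity measure. The document IDs, the vector values, and the helper names are hypothetical; the sketch only shows that one similarity calculation per document replaces the two separate calculations of the first embodiment.

```python
def dot(u, v):
    """Inner product of two vectors of equal dimension."""
    return sum(a * b for a, b in zip(u, v))

def speaker_feature_word_vector(speaker_vec, word_vec):
    """Concatenate a speaker feature vector and a word vector."""
    return list(speaker_vec) + list(word_vec)

# Hypothetical (speaker feature vector, word vector) pairs per voice document.
documents = {
    "doc_a": ([0.9, 0.1], [0.0, 1.0, 0.0]),
    "doc_b": ([0.1, 0.9], [1.0, 0.0, 0.0]),
}

# Step S20: one concatenated vector per voice document.
doc_vectors = {
    doc_id: speaker_feature_word_vector(s, w) for doc_id, (s, w) in documents.items()
}

# Step S21: concatenated vector for the query.
query_vector = speaker_feature_word_vector([1.0, 0.0], [0.0, 1.0, 0.0])

# Step S22: a single similarity calculation per document.
similarity_store = {doc_id: dot(vec, query_vector) for doc_id, vec in doc_vectors.items()}
```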

[Specific examples of feature vectors]
The specific construction of each feature vector used in the first and second embodiments is described in detail below.

One form of the speaker feature vector is, for example, the feature known as an i-vector. Details of i-vectors are given in H. Aronowitz and O. Barkan, "Efficient approximated i-vector extraction", Proceedings of ICASSP, pp. 4789-4792, 2012 (Reference 1). Another form of the speaker feature vector is the vector of speaker-dependent components extracted using joint factor analysis (JFA). The vector obtained by JFA is also described in Reference 1.

Both the i-vector and the vector obtained by JFA result from applying a predetermined matrix factorization to the supervector, that is, the single vector formed by concatenating the mean vectors of the Gaussians of a Gaussian mixture model (GMM) that has been adapted to the audio data. In view of this, another form of the speaker feature vector is a vector obtained by factorizing the GMM supervector by a predetermined method so that the speaker component can be extracted. Yet another form is the GMM supervector itself. The GMM supervector is one option in that it retains the speaker component. However, it also contains a large amount of non-speaker components and, depending on the settings, has far more dimensions than the other vectors, which raises concern about its effect on search speed. In addition, even a vector obtained without going through a GMM supervector falls within the category of speaker feature vectors as long as it is a feature vector that is effective for identifying speakers.
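The construction of a supervector by concatenating the component mean vectors can be sketched as follows. The means are hypothetical placeholder values; adapting a GMM to audio data is outside the scope of this sketch.

```python
def supervector(component_means):
    """Concatenate the mean vector of every Gaussian component into one vector."""
    return [value for mean in component_means for value in mean]

# Hypothetical adapted means: 3 components, each with a 2-dimensional mean.
means = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

sv = supervector(means)
dimension = len(sv)  # number of components x feature dimension
```

The dimension grows as the product of the component count and the feature dimension. With, say, 1024 components and 39-dimensional features (illustrative sizes, not taken from the text), the supervector would have 39,936 dimensions, which illustrates the concern about search speed.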

For the text similarity, a score used in existing text search may be used.

Another form of the speaker feature vector and the word vector is a normalized vector. Writing the normalized vector of a vector X as

$$\hat{X} = \frac{X}{\|X\|}$$

(where $\|X\|$ is the norm of X), the normalized speaker feature vector is

$$\hat{V} = \frac{V}{\|V\|}$$

and the normalized word vector is

$$\hat{W} = \frac{W}{\|W\|}.$$
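Normalization to unit length, that is, dividing a vector by its norm, can be sketched as follows. The Euclidean norm and the sample vector are assumptions of this sketch.

```python
import math

def normalize(x):
    """Divide a vector by its Euclidean norm so that the result has norm 1."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in x]

# Hypothetical speaker feature vector with norm 5.
v_hat = normalize([3.0, 4.0])
v_hat_norm = math.sqrt(sum(c * c for c in v_hat))
```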

Another form of the speaker feature vector and the word vector is a weighted vector. Writing the weighted vector of a vector X as aX (where a is a constant), the weighted speaker feature vector is aV or, when the normalized vector is weighted,

$$a\,\frac{V}{\|V\|},$$

and the weighted word vector is aW or, when the normalized vector is weighted,

$$a\,\frac{W}{\|W\|}.$$

The speaker feature word vector is the vector obtained by simply concatenating the speaker feature vector and the word vector. For example, for a speaker feature vector V and a word vector W, it may be set to

$$\begin{bmatrix} V \\ W \end{bmatrix}.$$

An advantage of normalization is that the calculation of cosine similarity can be replaced by an inner product, which requires less computation. Weighting can be used in the second embodiment to introduce a similarity equivalent to that of the first embodiment. For example, when a weighted sum of cosine similarities with constant weights a and b is used to combine the similarities in the first embodiment,

$$a\cos(V_d, V_q) + b\cos(W_d, W_q) = \left(a\,\frac{V_d}{\|V_d\|}\right)^{\top}\frac{V_q}{\|V_q\|} + \left(b\,\frac{W_d}{\|W_d\|}\right)^{\top}\frac{W_q}{\|W_q\|}$$

holds, so the computation can be reduced to just the two inner products on the right-hand side. Each of

$$a\,\frac{V_d}{\|V_d\|}, \qquad b\,\frac{W_d}{\|W_d\|}$$

is a single vector calculated in advance, so only two inner products are needed. The subscript d denotes a voice document and the subscript q denotes the query. Furthermore,

$$\left(a\,\frac{V_d}{\|V_d\|}\right)^{\top}\frac{V_q}{\|V_q\|} + \left(b\,\frac{W_d}{\|W_d\|}\right)^{\top}\frac{W_q}{\|W_q\|} = \begin{bmatrix} a\,V_d/\|V_d\| \\ b\,W_d/\|W_d\| \end{bmatrix}^{\top} \begin{bmatrix} V_q/\|V_q\| \\ W_q/\|W_q\| \end{bmatrix}.$$

The right-hand side is exactly the second embodiment. This equality shows that the same similarity can be used in the first and second embodiments.
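The equivalence described above, between a weighted sum of cosine similarities (first embodiment) and a single inner product of concatenated, normalized, and weighted vectors (second embodiment), can be checked numerically. The weights `a` and `b` and all vector values below are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two vectors of equal dimension."""
    return sum(x * y for x, y in zip(u, v))

def unit(x):
    """Normalize a nonzero vector to unit Euclidean norm."""
    n = math.sqrt(dot(x, x))
    return [c / n for c in x]

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

a, b = 0.7, 0.3                              # hypothetical weights
v_d, w_d = [1.0, 2.0, 2.0], [0.0, 3.0, 4.0]  # document speaker / word vectors
v_q, w_q = [2.0, 1.0, 2.0], [1.0, 0.0, 1.0]  # query speaker / word vectors

# First embodiment: weighted sum of two cosine similarities.
first = a * cosine(v_d, v_q) + b * cosine(w_d, w_q)

# Second embodiment: the document vector is normalized, weighted, and
# concatenated ahead of time; the query vector is normalized and concatenated.
doc_vec = [a * c for c in unit(v_d)] + [b * c for c in unit(w_d)]
query_vec = unit(v_q) + unit(w_q)
second = dot(doc_vec, query_vec)
```

Because `doc_vec` depends only on the document, it can be computed once and stored, leaving a single inner product per document at query time.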

[Modification]
In the first or second embodiment, the user must prepare a speaker feature vector or audio data of the speaker to be searched for. This requirement is realistic. The present invention does not rule out the use of speaker labels, so for data to which speaker labels have been attached, a voice document of the target speaker (though not the target document) can be obtained. Although its content differs from the utterance the user actually wants to find, such a voice document still represents the characteristics of that speaker. The user can therefore supply it as the query that specifies the speaker and enter the utterance content as a text query, thereby retrieving the target voice document that matches both the speaker and the text.

Audio data once used as a query may be managed like an e-mail address book. That is, if audio data usable as a query is given an easily identifiable name or ID, such as a speaker label, and managed in an address book, then from the next time onward a speaker-specified voice document search can be performed by recalling the target speaker's data from the address book. In that case it is more efficient to register the speaker feature vector itself in the address book, so that the feature vector calculation step can be skipped.
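An address book of this kind might be sketched as a mapping from a speaker label to a stored speaker feature vector. The class name, method names, label, and vector values are all hypothetical.

```python
class SpeakerAddressBook:
    """Maps a speaker label (name or ID) to a stored speaker feature vector."""

    def __init__(self):
        self._entries = {}

    def register(self, label, speaker_vector):
        # Store the vector itself so extraction need not be repeated later.
        self._entries[label] = list(speaker_vector)

    def lookup(self, label):
        try:
            return self._entries[label]
        except KeyError:
            raise KeyError(f"no speaker registered under {label!r}") from None

book = SpeakerAddressBook()
book.register("supervisor", [0.2, 0.7, 0.1])

# On later searches the label alone is enough to obtain the query vector.
query_speaker_vector = book.lookup("supervisor")
```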

The modification of the first or second embodiment is a voice document search device and method in which speaker labels and speaker feature vectors are associated with each other in address book form, and the user searches for voice documents by entering the name or ID, such as a speaker label, previously assigned to the target speaker.

As shown in FIG. 5, the voice document search device according to the modification of the first embodiment includes, as in the first embodiment, for example a text query input unit 11, a word vector extraction unit 13, a speaker feature vector space similarity calculation unit 14, a word vector space similarity calculation unit 15, a similarity summation unit 16, a search result output unit 17, a voice document storage unit 18, and a similarity storage unit 19, and further includes a speaker label input unit 30, a target speaker feature vector storage unit 31, and a speaker feature vector extraction unit 32.

As shown in FIG. 6, the voice document search device according to the modification of the second embodiment includes, as in the second embodiment, for example a text query input unit 11, a word vector extraction unit 13, a search result output unit 17, a voice document storage unit 18, a similarity storage unit 19, a voice document speaker feature word vector creation unit 20, a query speaker feature word vector creation unit 21, and a speaker feature word vector space similarity calculation unit 22, and further includes a speaker label input unit 30, a target speaker feature vector storage unit 31, and a speaker feature vector extraction unit 32.

The target speaker feature vector storage unit 31 stores speaker feature vectors to which speaker labels have been assigned. A speaker label is a name or ID that identifies a speaker, and is assigned so that each speaker feature vector stored in the target speaker feature vector storage unit 31 can be uniquely identified.

With reference to FIG. 7, a method of generating the speaker feature vectors stored in the target speaker feature vector storage unit 31 will be described. Arbitrary voice data is input to the voice data input unit 33. The speaker feature vector extraction unit 12 extracts a speaker feature vector from the input voice data in the same way as the speaker feature vector extraction unit of the first embodiment. The speaker feature vector may take any of the forms described above. The speaker label name assigning unit 34 assigns the speaker label entered by the user to the speaker feature vector output by the speaker feature vector extraction unit 12, thereby associating the speaker label name with the speaker feature vector, and stores the pair in the target speaker feature vector storage unit 31.

FIG. 8 shows the processing flow of the voice document search method according to the modification of the first embodiment, and FIG. 9 shows the processing flow of the voice document search method according to the modification of the second embodiment. The following description focuses on the differences from the first and second embodiments described above.

In step S30, one of the speaker labels previously assigned to target speakers is input to the speaker label input unit 30.

In step S32, the speaker feature vector extraction unit 32 uses the input speaker label to retrieve the corresponding speaker feature vector stored in the target speaker feature vector storage unit 31. In the subsequent processing, the retrieved speaker feature vector is handled in the same way as the speaker feature vector extracted from the voice query in the first or second embodiment.
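The address book idea can be sketched as a mapping from speaker labels to precomputed speaker feature vectors. This is only an illustration under stated assumptions: the class name `SpeakerAddressBook`, its methods, and the example vector are hypothetical, and the feature extraction itself is assumed to have been done elsewhere.

```python
class SpeakerAddressBook:
    """Hypothetical stand-in for the target speaker feature vector storage."""

    def __init__(self):
        self._vectors = {}  # speaker label -> speaker feature vector

    def register(self, label, vector):
        # Labels must identify stored vectors uniquely, so duplicates are rejected.
        if label in self._vectors:
            raise ValueError(f"speaker label already registered: {label}")
        self._vectors[label] = list(vector)

    def lookup(self, label):
        # Corresponds to retrieving the stored vector from an input label;
        # raises KeyError if the label has not been registered.
        return self._vectors[label]


book = SpeakerAddressBook()
book.register("supervisor", [0.12, -0.40, 0.88])  # made-up vector
query_speaker_vector = book.lookup("supervisor")
```

Storing the vector rather than the raw voice data means that a later search only needs a dictionary lookup before the similarity calculation.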

On the other hand, when searching for a speaker for whom the target speaker feature vector storage unit 31 holds no speaker-labeled voice document, voice data must be obtained by some other means. For a celebrity, voice data can probably be obtained through a web search or the like. For an acquaintance or someone else on relatively close terms, the user can ask to record the person's voice and register it in the address book. Alternatively, a query can be obtained by having an acquaintance who has already registered the target speaker share their address book.

The speaker feature vector used for a query is obtained as the output of the speaker feature vector extraction unit when speaker-labeled voice data, or voice data obtained from outside the database as described above, is input to it. The user then assigns an arbitrary speaker label name to this vector, and it is recorded in the target speaker feature vector storage unit, which associates speaker label names with speaker feature vectors.

[Comparison with similar prior art]
A method and device for selecting similar voices are described in Shigeo Morishima et al., "New Video Technology 'Dive Into the Movie'", Journal of the IEICE, Vol. 94, No. 3, pp. 250-268, March 2011 (Reference 2). In Reference 2, speech is given as input, and the speaker similarity to each pre-registered voice file (voice document) is calculated. In this respect Reference 2 resembles the present invention, but it cannot be used for voice document search because it gives no consideration to computational cost. In fact, it presupposes computationally expensive processing, such as calculating dynamic measures between features and GMM likelihoods. Nor does it consider any way of specifying text. Furthermore, the features in Reference 2 are extracted per frame (time segment), so a single matrix (feature dimension × number of frames) is obtained for each piece of voice data. This also differs from the present invention, which uses a single vector.

The i-vector, one example of a speaker feature vector, is a technique developed in the field of speaker verification. In that field, methods that verify speakers using the cosine similarity of i-vectors were proposed initially. Speaker verification computes the similarity between the input speech and the speech of every speaker registered in advance, and in this respect it has something in common with the present invention. However, i-vector cosine similarity did not yield sufficient verification and identification accuracy, and the field has since moved toward combining it with statistical classification techniques such as Probabilistic Linear Discriminant Analysis (for details, see Pavel Matejka, Ondrej Glembek, Fabio Castaldo, Md. Jahangir Alam, Oldrich Plchot, Patrick Kenny, Lukas Burget, Jan Cernocky, "Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification", ICASSP 2011, pp. 4828-4831 (Reference 3)). In other words, this indicates that identifying a speaker is difficult with a simple method that computes the similarity between two points in a vector space, such as i-vector cosine similarity. The difficulty becomes even more serious for a large-scale database such as the one the present invention targets.

However, by focusing on the following features of voice document search, the inventors found that even i-vector cosine similarity works with sufficient accuracy. The first feature is that a text query exists in voice document search. The text query greatly narrows the candidates, which in effect also limits the number of speakers.

The second feature is that voice document search places more importance on whether the target voice document is included among the top several candidates than on the accuracy of the first-ranked candidate. With i-vector cosine similarity it is often difficult to place the correct document first, but it is comparatively easy to place the target voice document among the top candidates.

The third feature is that voice document search handles longer speech than fields such as speaker verification. Because long speech is targeted, speaker characteristics can be extracted more accurately, and an improvement in accuracy can be expected.

Furthermore, because voice document search also requires fast retrieval, a method with low computational cost such as i-vector cosine similarity is well suited to it.
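As a minimal sketch of why cosine similarity is cheap, the function below normalizes two vectors and takes their inner product. The function names are illustrative, and the vectors are made up rather than real i-vectors. If stored vectors are normalized once in advance, each comparison at search time reduces to a single inner product.

```python
import math


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the inner product of unit-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Made-up example vectors.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Here `same_direction` is 1 up to rounding error and `orthogonal` is 0, which are the two extremes for vectors with non-negative inner product.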

Focusing on these various features specific to voice document search, the inventors recognized that supplying a speaker feature vector as the query, and calculating speaker similarity as a similarity between two points in a vector space, as with i-vector cosine similarity, are well suited to voice document search. This is not something that even a person working in the relevant fields (speaker verification or voice document search) could easily conceive.

[Third embodiment]
The voice document search device and method of the third embodiment are described as a configuration that exploits the first of the features of voice document search noted above.

As shown in FIG. 10, the voice document search device of the third embodiment includes, as in the modification of the first embodiment, for example a text query input unit 11, a word vector extraction unit 13, a word vector space similarity calculation unit 15, a search result output unit 17, a voice document storage unit 18, a similarity storage unit 19, a speaker label input unit 30, a target speaker feature vector storage unit 31, and a speaker feature vector extraction unit 32, and further includes a high similarity candidate storage unit 40 and a speaker feature vector space similarity calculation unit 41.

With reference to FIG. 11, the voice document search method of the third embodiment will be described. The following description focuses on the differences from the embodiments described above.

In step S15, the word vector space similarity calculation unit 15 calculates, according to a predetermined similarity measure, the text similarity between the word vector of each voice document stored in the voice document storage unit 18 and the input word vector, and stores candidate information on voice documents with high text similarity in the high similarity candidate storage unit 40.

In step S41, the speaker feature vector space similarity calculation unit 41 calculates the similarity to the speaker feature vector only for the top candidates with high similarity among the candidate information stored in the high similarity candidate storage unit 40, and outputs as the search result the top candidates that have high similarity for both the word vector and the speaker feature vector.

In the configuration of the third embodiment, speaker feature vector similarity is calculated only for a limited set of candidates, so search results can be obtained quickly. In addition, because the number of candidates can be controlled independently for the similarity based on word vectors and for the similarity based on speaker feature vectors, a search accuracy not obtainable with the first or second embodiment can be achieved.
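The two-stage flow of steps S15 and S41 can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the embodiment: cosine similarity is assumed for both stages, documents are plain dictionaries, and the names `two_stage_search`, `n_text`, and `n_out` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def two_stage_search(documents, query_word_vec, query_speaker_vec, n_text, n_out):
    """Return the ids of up to n_out documents.

    Stage 1 (cf. step S15): keep the n_text documents with highest text similarity.
    Stage 2 (cf. step S41): rank only those candidates by speaker similarity.
    """
    by_text = sorted(
        documents,
        key=lambda d: cosine(d["word_vec"], query_word_vec),
        reverse=True,
    )
    candidates = by_text[:n_text]
    by_speaker = sorted(
        candidates,
        key=lambda d: cosine(d["speaker_vec"], query_speaker_vec),
        reverse=True,
    )
    return [d["id"] for d in by_speaker[:n_out]]


# Made-up toy database: d1 and d2 share the text, d2 and d3 share the speaker.
docs = [
    {"id": "d1", "word_vec": [1.0, 0.0], "speaker_vec": [1.0, 0.0]},
    {"id": "d2", "word_vec": [1.0, 0.0], "speaker_vec": [0.0, 1.0]},
    {"id": "d3", "word_vec": [0.0, 1.0], "speaker_vec": [0.0, 1.0]},
]
result = two_stage_search(docs, [1.0, 0.0], [0.0, 1.0], n_text=2, n_out=1)
```

Because `n_text` and `n_out` are separate parameters, the number of candidates kept at each stage can be tuned independently, and the speaker similarity is computed for at most `n_text` documents.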

[Experimental results]
Speech recorded during use of voice search and voice question answering systems on smartphones was compiled into a database. Recording environments varied, and the speech contains a considerable amount of noise. The database holds approximately 110,000 voice document files. Each file corresponds to one utterance, so this amounts to 110,000 utterances. For each file, speech recognition automatically determined the start and end of the utterance and transcribed its content, and the document was then represented by a word vector whose elements are the frequencies of the content words in the recognition result. Likewise, an i-vector was extracted in advance for each file as its speaker feature vector and registered in the database. 5,000 queries were issued and the search accuracy was compared. Cosine similarity was used as the similarity measure for both speaker and text.

In the experiment, speaker labels are given to all voice documents in the database. The accuracy obtained when a speaker label is given as the query is compared with the accuracy obtained when a speaker feature vector (i-vector) is given in place of the speaker label. The query text is the correct (manually produced) transcription of the target voice document.

For the speaker feature vector of each query, a voice document by the same speaker was selected at random from the database, excluding the target voice document, and its speaker feature vector was used. In other words, the experiment was conducted under conditions in which the voice document with the highest speaker similarity is not the target voice document.

Table 1 shows the search accuracy obtained in the above experiment.

The queries cover 100 target speakers. When the speaker label is given in the query, conditions are ideal and the search accuracy is extremely high, exceeding 90% at 3-best (the proportion of queries for which the target voice document is among the top three candidates). The present invention falls short of that, but it still retrieves the correct document for about 70% of queries at 2-best and about 80% at 5-best. In addition, its 2-best accuracy is roughly equal to the 1-best accuracy obtained with speaker labels. These results indicate that the method operates with accuracy adequate for practical use.
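The N-best measure used here can be computed as sketched below. The function name `n_best_accuracy` and the toy rankings are hypothetical and have no connection to the experimental data. Each ranking lists document ids in descending order of similarity for one query.

```python
def n_best_accuracy(rankings, targets, n):
    """Fraction of queries whose target document is among the top n results."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:n]
    )
    return hits / len(rankings)


# Made-up rankings for three queries that all have target document "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]

one_best = n_best_accuracy(rankings, targets, 1)
two_best = n_best_accuracy(rankings, targets, 2)
three_best = n_best_accuracy(rankings, targets, 3)
```

By construction the measure never decreases as `n` grows, which is why a method that rarely ranks the target first can still score well at 2-best or 5-best.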

[Effects]
According to the present invention, in speaker-specified voice document search, voice documents of speakers whose speaker feature vectors are highly similar can be obtained as search results without assigning speaker labels to all the voice data in a large collection of voice documents. That is, it is no longer necessary to prepare the voice documents by assigning an appropriate speaker label to each one, which reduces the labor and cost of maintaining the voice documents.

In calculating the similarity of speaker feature vectors, using a similarity measure with low computational cost makes fast speaker-specified voice document search possible.

In the similarity calculation at search time, a combined value of the speaker feature vector similarity and the word vector (text query) similarity can be used, and the similarity calculation best suited to the required search accuracy and other criteria can be selected. Alternatively, the voice document candidates can first be limited on the basis of the word vector similarity, and the top candidates by speaker vector similarity can then be obtained as the search result, which improves search accuracy while also speeding up processing.

The present invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as required.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and the acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

DESCRIPTION OF SYMBOLS
10 Voice query input unit
11 Text query input unit
12 Speaker feature vector extraction unit
13 Word vector extraction unit
14 Speaker feature vector space similarity calculation unit
15 Word vector space similarity calculation unit
16 Similarity summation unit
17 Search result output unit
18 Voice document storage unit
19 Similarity storage unit
20 Voice document speaker feature word vector creation unit
21 Query speaker feature word vector creation unit
22 Speaker feature word vector space similarity calculation unit
30 Speaker label input unit
31 Target speaker feature vector storage unit
32 Voice data input unit
33 Speaker label name assigning unit
40 High similarity candidate storage unit
41 Speaker feature vector space similarity calculation unit

Claims

A voice document storage unit for storing a plurality of voice documents by a plurality of speakers;
A word vector space similarity calculating unit that calculates a text similarity from a word vector of a text to be searched and a word vector of a recognition result text obtained by voice recognition of the voice document;
A high similarity candidate storage unit for storing candidate information for specifying the voice document having the high text similarity;
A speaker feature vector space similarity calculation unit that, for the top candidates having high text similarity among the candidate information stored in the high similarity candidate storage unit, calculates a speaker similarity from a target speaker feature vector, which is the speaker feature vector of the speaker to be searched for, and the speaker feature vector of the speaker who uttered the voice document specified by the candidate information;
A search result output unit that outputs the voice document for which both the text similarity and the speaker similarity are high;
Only including,
The speaker feature vector is a feature vector obtained by performing matrix decomposition on a super vector of a mixed Gaussian distribution obtained by performing adaptive processing on speech data uttered by a speaker.
Voice document retrieval device.

A plurality of voice documents by a plurality of speakers are stored in the voice document storage unit,
A word vector space similarity calculating step in which a word vector space similarity calculation unit calculates a text similarity from a word vector of a text to be searched and a word vector of a recognition result text obtained by voice recognition of the voice document;
A high similarity candidate storing step for storing candidate information for specifying the voice document having a high text similarity in a high similarity candidate storage unit;
A speaker feature vector space similarity calculating step in which a speaker feature vector space similarity calculation unit, for the top candidates having high text similarity among the candidate information stored in the high similarity candidate storage unit, calculates a speaker similarity from a target speaker feature vector, which is the speaker feature vector of the speaker to be searched for, and the speaker feature vector of the speaker who uttered the voice document specified by the candidate information;
A search result output step in which a search result output unit outputs the voice document for which both the text similarity and the speaker similarity are high;
Only including,
The speaker feature vector is a feature vector obtained by performing matrix decomposition on a super vector of a mixed Gaussian distribution obtained by performing adaptive processing on speech data uttered by a speaker.
Voice document search method.

A program for causing a computer to function as the voice document search device according to claim 1 .