JP5210440B2

JP5210440B2 - Method, program and apparatus for high speed speech retrieval

Info

Publication number: JP5210440B2
Application number: JP2012000070A
Authority: JP
Inventors: チェン、ユロン
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2012-01-04
Filing date: 2012-01-04
Publication date: 2013-06-12
Anticipated expiration: 2026-07-03
Also published as: JP2012133371A

Abstract

PROBLEM TO BE SOLVED: To provide a method and apparatus for high-speed audio search.

SOLUTION: A large audio database in a multiprocessor system is searched to identify a target audio clip. The large audio database is divided into a plurality of smaller groups, and the groups are dynamically scheduled onto the available processors in the system. The processors process the scheduled groups in parallel: each divides its group into smaller segments, extracts audio features from each segment, and models the segment using a common component Gaussian mixture model (CCGMM). One processor also extracts audio features from the target audio clip and models them using the CCGMM. Whether a segment matches the target audio clip is determined on the basis of the KL distance between the target audio clip and that segment.

COPYRIGHT: (C)2012, JPO&INPIT

Description

The present disclosure relates generally to signal processing and multimedia applications. More specifically, but not exclusively, it relates to methods and apparatus for fast audio search and audio fingerprinting.

Audio search (for example, searching a large audio stream for an audio clip, even when the audio stream is corrupted or distorted) has many applications, such as analysis of broadcast music and commercials, copyright management on the Internet, and identification of metadata for unclassified audio clips. A typical audio search system is serial and designed for a single-processor system. Such a system usually takes a long time to find a target audio clip in a large audio stream. Yet audio search systems are often required to operate efficiently on large audio databases, for example to search a large database in a very short time (e.g., in near real time). In addition, part or all of an audio database may be distorted, corrupted, and/or compressed. An audio search system therefore needs to be robust enough to identify an audio segment that is the same as the target audio clip even when that segment is distorted, corrupted, and/or compressed. Accordingly, an audio search system that can search a large audio database for a target audio clip both quickly and robustly is desirable.

The features and advantages of the disclosed subject matter will become apparent from the following detailed description of the subject matter.

FIG. 1 is a diagram illustrating an example of a computing system in which robust and parallel audio search can be performed using an audio search module.

FIG. 2 is a diagram illustrating another example of a computing system in which robust and parallel audio search can be performed using an audio search module.

FIG. 3 is a diagram illustrating yet another example of a computing system in which robust and parallel audio search can be performed using an audio search module.

FIG. 4 is a block diagram illustrating an example of an audio search module that performs robust audio search.

FIG. 5 is a diagram illustrating an example of the operation of the robust audio search module shown in FIG. 4.

FIG. 6 is a block diagram illustrating an example of an audio search module that performs robust and parallel audio search in a multiprocessor system.

Three further drawings illustrate methods of dividing a large audio database into small groups in order to perform robust and parallel audio search in a multiprocessor system.

A further drawing shows pseudocode illustrating an example of a process for performing robust and parallel audio search in a multiprocessor system.

According to embodiments of the subject matter disclosed herein, a robust and parallel search method may be used to search a large audio stream or a large audio database in a multiprocessor system for a target audio clip. The large audio database may be divided into a plurality of small groups. These small groups may be dynamically scheduled to be processed by the processors or processing cores available in the multiprocessor system. The processors or processing cores may process the scheduled groups in parallel by dividing each group into smaller segments, extracting audio features from the segments, and modeling the segments using a common component Gaussian mixture model (CCGMM). The length of these segments may be the same as the length of the target audio clip. Before any group is processed, one processor or processing core may extract audio features from the target audio clip and model them using the CCGMM. A Kullback-Leibler (KL) distance or KL maximum distance between the model of the target audio clip and each segment of a group may then be calculated. If the distance is less than or equal to a predetermined value, the corresponding segment is identified as the target audio clip.

If the distance exceeds the predetermined value, the processor or processing core may skip a number of segments and continue searching for the target audio clip. When a processor or processing core finishes searching one group, it is given a new group to search, and this continues until all groups have been searched. The size of the groups may be chosen so as to reduce load imbalance and redundant computation. In addition, input/output (I/O) may be optimized to improve the efficiency with which the multiple processors or processing cores process the audio groups in parallel.
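The group-by-group scheduling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the group size, and the equality test standing in for the model-distance comparison are all hypothetical, and a thread pool stands in for the processors or cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(num_items, group_size):
    """Return (start, end) index ranges that together cover num_items items."""
    return [(start, min(start + group_size, num_items))
            for start in range(0, num_items, group_size)]


def search_group(database, target, bounds):
    """Hypothetical per-group search: plain equality stands in for the
    model-distance test described in the text."""
    start, end = bounds
    return [i for i in range(start, end) if database[i] == target]


def parallel_search(database, target, group_size=4, workers=3):
    """Search all groups, handing each pending group to the next free worker."""
    groups = split_into_groups(len(database), group_size)
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # A worker that finishes its group picks up the next pending one,
        # which approximates dynamic scheduling.
        for found in pool.map(lambda g: search_group(database, target, g), groups):
            hits.extend(found)
    return sorted(hits)
```

Smaller groups spread the work more evenly but add scheduling overhead, which is the trade-off behind choosing the group size. In CPython, threads do not run CPU-bound work in parallel, so a process pool would be the closer analogue for real feature extraction.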

In this specification, the expression "one embodiment" or "an embodiment of the disclosed subject matter" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed subject matter. Thus, although the phrase "one embodiment" appears many times in this specification, the occurrences do not necessarily all refer to the same embodiment.

FIG. 1 is a diagram illustrating an example of a computing system 100 in which robust and parallel audio search can be performed using an audio search module 120. The computing system 100 includes one or more processors 110 coupled to a system interconnect 115. The processor 110 may have multiple or many processing cores (for convenience of explanation, the term "multiple cores" is used hereinafter to cover both multiple processing cores and many processing cores). The processor 110 may include an audio search module 120 that performs robust and parallel audio search using the multiple cores. The audio search module may include several components, such as a partitioning mechanism, a scheduler, and a plurality of audio search units (described in more detail below with reference to FIGS. 4 to 6). One or more components of the audio search module may be located in one core while other components are located in another core.

The audio search module may first divide a large audio database into a plurality of small groups, or divide a large audio stream into smaller, partially overlapping substreams. Next, one core processes the audio clip to be searched for (the "target audio clip") and builds a model of the target audio clip. Meanwhile, the audio search module dynamically schedules the small audio groups/substreams onto the multiple cores. The cores, working in parallel, divide each group/substream into a plurality of segments and build a model for each audio segment. The size of each segment may be equal to the size of the target audio clip. Each audio segment and the target audio clip may be modeled using a Gaussian mixture model (GMM) whose Gaussian components are common to all audio segments, including both the target audio clip and the audio database/stream. Once the model of an audio segment has been built, the Kullback-Leibler (KL) distance or KL maximum distance between the model of that segment and the model of the target audio clip may be calculated. If the distance is less than or equal to a predetermined value, the audio segment may be identified as the target audio clip. The search process may continue until all audio groups/substreams have been processed.

The computing system 100 may further include a chipset 130 coupled to the system interconnect 115. The chipset 130 may include one or more integrated circuit packages or chips. The chipset 130 may include one or more device interfaces 135 that support data transfer to and from other components 160 of the computing system 100, such as BIOS firmware, a keyboard, a mouse, storage devices, and network interfaces. The chipset 130 may be coupled to a peripheral component interconnect (PCI) bus 170. The chipset 130 may include a PCI bridge 145 that provides an interface to the PCI bus 170. The PCI bridge 145 may provide a data path between the processor 110 and the other components 160 on one side and peripheral devices on the other, for example an audio device 180 and a disk drive 190. Although not shown in FIG. 1, other devices may also be coupled to the PCI bus 170.

The chipset 130 may also include a memory controller 125 coupled to a main memory 150. The main memory 150 may store data and sequences of instructions that are executed by the multiple cores of the processor 110 or by any other device in the system. The memory controller 125 may access the main memory 150 in response to memory transactions associated with the multiple cores of the processor 110 and with other devices in the computing system 100. In one embodiment, the memory controller 125 may be located in the processor 110 or in other circuitry. The main memory 150 may include various memory devices that provide addressable storage locations that the memory controller 125 reads data from and writes data to. The main memory 150 may include one or more different types of memory devices, such as dynamic random access memory (DRAM) devices, synchronous DRAM (SDRAM) devices, double data rate (DDR) SDRAM devices, or other memory devices.

FIG. 2 is a diagram illustrating another example, a computing system 200 in which robust and parallel audio search can be performed using an audio search module 240. The system 200 may include multiple processors, such as processor 0 220A. One or more processors in the system 200 may have many cores. The system 200 may include an audio search module 240 that performs robust and parallel audio search using multiple cores. The audio search module may include several components, such as a partitioning mechanism, a scheduler, and a plurality of audio search units (described in more detail below with reference to FIGS. 4 to 6). One or more components of the audio search module may be located in one core while other components are located in another core. The processors in the system 200 may be connected to one another by a system interconnect 210. The system interconnect 210 may be a front side bus (FSB). Each processor may be connected to input/output (I/O) devices and to a memory 230 through the system interconnect. All of the cores may receive audio data from the memory 230.

FIG. 3 is a diagram illustrating yet another example, a computing system 300 in which robust and parallel audio search can be performed using an audio search module 340. In the system 300, the system interconnect 310 that connects the multiple processors (e.g., 320A, 320B, 320C, and 320D) is a link-based point-to-point connection. Each processor may be connected to the system interconnect through a link hub (e.g., 330A, 330B, 330C, and 330D). In some embodiments, a link hub may be co-located with a memory controller, which coordinates traffic to and from system memory. One or more processors may include many cores. The system 300 may include an audio search module 340 that performs robust and parallel audio search using multiple cores. The audio search module may include several components, such as a partitioning mechanism, a scheduler, and a plurality of audio search units (described in more detail below with reference to FIGS. 4 to 6). One or more components of the audio search module may be located in one core while other components are located in another core. Each processor/core in the system 300 may be connected to a shared memory (not shown in FIG. 3) through the system interconnect. All of the cores may receive audio data from the shared memory.

In FIGS. 2 and 3, the audio search module (i.e., 240 or 340) may first divide a large audio database into a plurality of small groups, or divide a large audio stream into smaller, partially overlapping substreams. Next, one core processes the audio clip to be searched for (the "target audio clip") and builds a model of the target audio clip. Meanwhile, the audio search module dynamically schedules the small audio groups/substreams onto the multiple cores. The cores, working in parallel, divide each group/substream into a plurality of segments and build a model for each audio segment. The size of each segment may be equal to the size of the target audio clip. Each audio segment and the target audio clip may be modeled using a Gaussian mixture model (GMM) whose Gaussian components are common to all audio segments, including both the target audio clip and the audio database/stream. Once the model of an audio segment has been built, the Kullback-Leibler (KL) distance or KL maximum distance between the model of that segment and the model of the target audio clip may be calculated. If the distance is less than or equal to a predetermined value, the audio segment may be identified as the target audio clip. The search process may continue until all audio groups/substreams have been processed.

FIG. 4 is a block diagram illustrating an example of an audio search module 400 that performs robust audio search. The audio search module 400 includes a feature extraction unit 410, a modeling mechanism 420, and a determination unit 430. The feature extraction unit 410 may receive an input audio stream (e.g., the target audio clip, a substream of a large audio stream, etc.) and extract audio features from it. When the input audio stream is a stream to be searched for the target audio clip, the feature extraction unit may apply a sliding window to the stream to divide it into a plurality of mutually overlapping segments. The window has the same length as the target audio clip. Each segment of the input audio stream (the target audio clip has only one segment) is further divided into a plurality of frames. The frames may all have the same length and may overlap their neighbors. For example, in one embodiment, the frame length may be 20 milliseconds and the overlap between adjacent frames may be 10 milliseconds. A feature vector may be extracted for each frame. A feature vector may include features such as Fourier coefficients, mel-frequency cepstral coefficients, spectral flatness, and the means, variances, and other derivatives of these parameters. The feature vectors of all frames of an audio segment form a feature vector sequence.
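The framing step described above (20 ms frames with 10 ms overlap in one embodiment) can be sketched as follows. The function name and the choice to drop a trailing partial frame are assumptions made for this illustration; the text does not say how a trailing partial frame is handled.

```python
def frame_bounds(num_samples, sample_rate, frame_ms=20, overlap_ms=10):
    """Return (start, end) sample indices of fixed-length, overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * (frame_ms - overlap_ms) // 1000
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length must exceed the overlap")
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, hop)]
```

At an 8 kHz sampling rate, for instance, each frame covers 160 samples and consecutive frames start 80 samples apart.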

Adjacent segments overlap in order to reduce the chance of missing a target audio clip that falls between two adjacent segments. The longer the overlap, the lower the chance of a miss. In one embodiment, the overlap length may be set equal to the segment length minus the frame length so that no match is missed. A longer overlap, however, means more computation, so a balance must be struck between computational load and the chance of a miss (e.g., an overlap of no more than half the segment length). In either case, the feature vector of a frame shared by two overlapping segments needs to be extracted only once.
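The trade-off between overlap and computation can be made concrete by counting segments for a given overlap. The following sketch works in units of frames; the function name and the frame-index representation are assumptions for illustration.

```python
def segment_starts(num_frames, seg_frames, overlap_frames):
    """Return the starting frame index of each sliding-window segment.

    Segments index into one shared list of per-frame feature vectors, so a
    frame that lies in several segments has its features extracted once.
    """
    if not 0 <= overlap_frames < seg_frames:
        raise ValueError("overlap must be non-negative and shorter than a segment")
    hop = seg_frames - overlap_frames
    return list(range(0, num_frames - seg_frames + 1, hop))
```

With an overlap of one frame less than the segment length, the window advances one frame at a time and every alignment is examined. Halving the overlap roughly halves the number of segments to model, at the cost of possibly missing alignments between window positions.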

The modeling mechanism 420 may build a model of an audio segment from the feature vector sequence extracted by the feature extraction unit 410. The modeling mechanism estimates the parameters of whichever model is used. In one embodiment, audio segments may be modeled using a common component Gaussian mixture model ("CCGMM"). A CCGMM includes a plurality of Gaussian components that are common across all segments. For each segment, the modeling mechanism estimates a set of mixture weights for the common Gaussian components. In another embodiment, audio segments may be modeled using other models (e.g., hidden Markov models). In one embodiment, only the target audio clip is modeled, and the feature vector sequence of an audio segment is used directly to determine whether the audio segment is substantially the same as the target audio clip.

The determination unit 430 may determine whether an audio segment in the input audio stream is similar enough to the target audio clip to be identified as a copy of it. To do so, the determination unit may derive a similarity measure by comparing the model of the audio segment with the model of the target audio clip. In one embodiment, the similarity measure may be a distance computed between the two models. In another embodiment, the similarity measure may be the probability that the model of the audio segment and the model of the target audio clip are the same. In yet another embodiment, the similarity measure may be obtained by comparing the feature vector sequence of the audio segment with the model of the target audio clip. For example, when the target audio clip is modeled using a hidden Markov model (HMM), a Viterbi-based algorithm may be used to compute a likelihood score between the audio segment and the target audio clip from the feature vector sequence of the audio segment and the HMM of the target audio clip.

Based on the value of the similarity measure, the determination unit may determine whether the audio segment can be identified as the target audio clip. For example, if the value of the similarity measure is at or below a predetermined threshold (e.g., when the similarity measure is the distance between the audio segment model and the target audio clip), the audio segment may be identified as substantially the same as the target audio clip. Likewise, if the value of the similarity measure is at or above a predetermined threshold (e.g., when the similarity measure is a likelihood score that the audio segment is substantially the same as the target audio clip), the audio segment may be identified as substantially the same as the target audio clip. On the other hand, if the similarity measure shows that the audio segment differs greatly from the target audio clip, a number of segments immediately following that audio segment may be skipped. The number of segments actually skipped depends on the value of the similarity measure and/or on empirical data. When the similarity measure shows that the current segment is very different from the target audio clip, skipping a certain number of subsequent segments should not cause the target audio clip to be missed. This is because the window used to divide the input audio stream into segments slides forward gradually, so the similarity measure changes continuously from one segment to the next.
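The skipping rule described above can be sketched as a simple loop over segment indices. Everything named here is hypothetical: `distance_of` stands for whatever similarity computation is used, and the two thresholds and the fixed skip count are illustrative choices, whereas the text lets the skip count depend on the measured value and on empirical data.

```python
def search_with_skipping(num_segments, distance_of, match_thr, far_thr, skip):
    """Scan segments in order, skipping ahead after a clearly dissimilar one.

    Returns (matches, evaluated): indices identified as the target, and the
    indices whose distance was actually computed.
    """
    matches, evaluated = [], []
    i = 0
    while i < num_segments:
        d = distance_of(i)
        evaluated.append(i)
        if d <= match_thr:
            matches.append(i)
        # A large distance implies the next few segments are also far away,
        # because the window moves only slightly between segments.
        i += (1 + skip) if d > far_thr else 1
    return matches, evaluated
```

Because each skipped segment is one model and one distance computation saved, the saving grows with the skip count, but so does the risk of jumping past a match if the distance changes faster than assumed.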

FIG. 5 is a diagram illustrating an example of the operation of the robust audio search module shown in FIG. 4. The target audio clip 510 is provided to the feature extraction unit and divided into a plurality of frames. In block 530A, the feature extraction unit then generates a feature vector sequence (540) from the per-frame feature vectors. A feature vector may contain one or more parameters and may therefore be an x-dimensional vector (where x ≥ 1). In block 570A, the feature vector sequence 540 may be modeled using a GMM as follows:

$$P^{(k)}(x) = \sum_{i=1}^{M} w_i^{(k)} \, N\!\left(x \mid \mu_i^{(k)}, \Sigma_i^{(k)}\right) \qquad (1)$$

The GMM $P^{(k)}(x)$ contains M Gaussian components; $w_i^{(k)}$ is the component weight, $\mu_i^{(k)}$ is the mean, and $\Sigma_i^{(k)}$ is the covariance, with i = 1, 2, ..., M; k denotes segment k, and N() denotes a Gaussian distribution. The target audio clip has only one segment, so the index k is not needed to identify the segment. The input audio stream 520, however, usually has multiple segments, so it is desirable to distinguish the GMMs of different segments.
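As a concrete reading of the model just described, a weighted sum of M Gaussian densities, the following sketch evaluates such a mixture at one feature vector. Diagonal covariances are an assumption made here to keep the example dependency-free; the text does not restrict the covariance structure.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance; var holds the diagonal."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_pdf(x, weights, means, variances):
    """Mixture density: the sum over components of weight times Gaussian density."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

The weights are expected to be non-negative and to sum to one, so that the mixture is itself a probability density.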

In the example shown in FIG. 5, the Kullback-Leibler (KL) distance or KL maximum distance is used as the similarity measure. To simplify calculation of the KL maximum distance, the GMMs used for all audio segments are assumed to share a common set of Gaussian components; that is, for the i-th Gaussian component, the mean $\mu_i$ and the covariance $\Sigma_i$ are the same for every audio segment. Equation (1) then becomes:

$$P^{(k)}(x) = \sum_{i=1}^{M} w_i^{(k)} \, N\!\left(x \mid \mu_i, \Sigma_i\right) \qquad (2)$$

For each audio segment, only one set of weights $w_i^{(k)}$, i = 1, 2, ..., M, needs to be estimated for the common Gaussian components. For the feature vector sequence of segment k, which has T feature vectors $x_t$, t = 1, 2, ..., T, the weights may be estimated as follows:

$$w_i^{(k)} = \frac{1}{T} \sum_{t=1}^{T} \frac{w_i^{(u)} \, N\!\left(x_t \mid \mu_i, \Sigma_i\right)}{\sum_{j=1}^{M} w_j^{(u)} \, N\!\left(x_t \mid \mu_j, \Sigma_j\right)} \qquad (3)$$

where $w_i^{(u)}$ or $w_j^{(u)}$ is the universal weight for the i-th or j-th Gaussian component, which may be obtained experimentally from a number of sample audio files or initialized with a random value.
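One way to read the weight estimation described above is as a single averaging pass: for each of the T feature vectors, compute how much of its probability each common component accounts for under the universal weights, then average over the segment. The sketch below follows that reading. It is an assumption about the estimator's form rather than a confirmed reproduction of it, and the diagonal covariances and function names are likewise illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance; var holds the diagonal."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def estimate_weights(frames, universal_weights, means, variances):
    """Per-segment mixture weights over a fixed, shared set of components.

    Assumes every frame has non-zero density under at least one component;
    an extreme outlier whose densities all underflow would divide by zero.
    """
    num_components = len(universal_weights)
    totals = [0.0] * num_components
    for x in frames:
        joint = [universal_weights[i] * gaussian_pdf(x, means[i], variances[i])
                 for i in range(num_components)]
        norm = sum(joint)
        for i in range(num_components):
            totals[i] += joint[i] / norm  # share of this frame owned by component i
    return [total / len(frames) for total in totals]
```

Because the component means and covariances are fixed in advance, modeling a segment costs one pass over its frames, and the resulting weight vector is the segment's entire model.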

The input audio stream 520, which is to be searched for the target audio clip 510, may be provided to the feature extraction unit. In block 530B, the feature extraction unit divides the input audio stream into a plurality of partially overlapping segments. The feature extraction unit further divides each segment into a plurality of partially overlapping frames and extracts a feature vector for each frame. Block 560 shows the feature vector sequence of the input audio stream 520 and how the stream is divided into partially overlapping segments. For example, a window of the same size as the length of the target audio clip may be applied to the input audio stream 520. For purposes of illustration, a window is shown over the feature vector sequence of the target audio clip to obtain segment 560A; because the target audio clip has only one segment, however, a window normally does not need to be applied to it. Applying a shifting window to the input audio stream yields a plurality of partially overlapping segments such as 560B and 560C. The window shifts by a time τ between segment 560B and segment 560C, where τ is smaller than the window size.

Each audio segment is modeled using a CCGMM. For example, segment 560B is modeled at block 570B and segment 560C at block 570C. The model of each segment of the input audio stream 520 and the model of the target audio clip 510 share common Gaussian components but have different sets of weights. According to one embodiment, feature vectors may be extracted frame by frame from the entire input audio stream to produce one long feature vector sequence covering the whole stream. A window of length N × FL (where N is a positive integer and FL is the frame length) is then applied to this long sequence. The feature vectors inside the window form the feature vector sequence of one audio segment, and that sequence is used to build the segment's CCGMM. The window is then shifted forward by the time τ.
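The shifting window described above can be sketched as a generator of frame-index ranges. This is an illustration only: the function name and the choice to express the window and the shift in whole frames are assumptions, not part of the source.

```python
def sliding_segments(num_frames, window, shift):
    """Yield (start, end) frame-index ranges of partially overlapping segments.

    `window` is the segment length in frames (the target clip length) and
    `shift` plays the role of tau; it must be smaller than the window so that
    consecutive segments overlap.
    """
    if not 0 < shift < window:
        raise ValueError("shift must be positive and smaller than the window")
    start = 0
    while start + window <= num_frames:
        yield start, start + window
        start += shift

# A 10-frame stream, a 4-frame window, and a shift of 2 frames.
segments = list(sliding_segments(num_frames=10, window=4, shift=2))
```

Extracting the features once for the whole stream and then slicing them by these index ranges avoids recomputing features for the overlapping parts of adjacent segments.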

To determine whether a segment is substantially the same as the target audio clip, the KL maximum distance between the model of the segment and the model of the target audio clip may be calculated as follows.

If the KL maximum distance calculated in this way is less than a predetermined threshold, the audio clip may be considered detected. As the window applied to the input audio stream 520 is shifted forward in time, the distance usually shows a certain continuity from one time step to the next. That is, if the distance is very large, the one or more segments immediately following the current segment are unlikely to match the target audio clip. Depending on the value of the distance, the search may therefore be skipped for a predetermined number of immediately following segments in the same audio stream or substream.
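The distance equation is not reproduced in this text, so the sketch below rests on an explicit assumption: the KL maximum distance between two weight vectors over shared components is taken as the largest per-component term (a − b)·log(a / b). The function names, the small `eps` guard, and the threshold values are hypothetical.

```python
import math

def kl_max_distance(w_a, w_b, eps=1e-12):
    """ASSUMED form: max over components of (a - b) * log(a / b).

    Each term is non-negative and is zero when the two weights agree;
    eps only guards against taking the log of zero.
    """
    return max((a - b) * math.log((a + eps) / (b + eps))
               for a, b in zip(w_a, w_b))

def classify(distance, match_threshold, skip_threshold, skip_count):
    """Return (matched, number of following segments to skip)."""
    matched = distance < match_threshold
    skip = skip_count if distance > skip_threshold else 0
    return matched, skip

same = kl_max_distance([0.5, 0.5], [0.5, 0.5])
far = kl_max_distance([0.9, 0.1], [0.1, 0.9])
decision_same = classify(same, match_threshold=0.1, skip_threshold=1.0, skip_count=3)
decision_far = classify(far, match_threshold=0.1, skip_threshold=1.0, skip_count=3)
```

A small distance signals a match, while a large one both rejects the current segment and justifies skipping its immediate successors, since the distance changes gradually as the window moves.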

FIG. 6 is a block diagram showing an example of an audio search module 600 that performs robust, parallel audio search in a multiprocessor system. The audio search module 600 includes a splitting mechanism 610, a scheduler 620, an I/O optimization unit 630, and a plurality of audio search units (for example, 640A, ..., 640N). The splitting mechanism 610 may split a large audio stream into multiple smaller substreams and/or split a large audio database into multiple small groups. FIGS. 7A, 7B, and 7C illustrate how a large audio database is divided into small groups for robust, parallel audio search in a multiprocessor system. FIG. 7A shows an example of a database that contains a single large audio stream 710. The splitting mechanism may split the audio stream 710 into multiple smaller substreams 712, 714, and 716, each of which constitutes one group. The substreams may differ in length, but they are usually given a uniform length to simplify processing. So that no occurrence of the target audio clip is missed, each substream overlaps the substream that immediately follows it, and the overlap between two adjacent substreams (for example, 712 and 714, or 714 and 716) must be at least a minimum length that depends on the total number of frames in the target audio clip.
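Splitting one long stream into overlapping substreams can be sketched as follows. The exact overlap bound is not legible in this text; the example uses an overlap of one frame less than the clip length, which is the smallest overlap that keeps every clip-length window entirely inside some substream. Treat that value, and the function name, as assumptions.

```python
def split_stream(total_frames, group_frames, overlap_frames):
    """Split [0, total_frames) into (start, end) substreams that overlap."""
    if not 0 <= overlap_frames < group_frames:
        raise ValueError("overlap must be non-negative and smaller than the group")
    step = group_frames - overlap_frames
    groups = []
    start = 0
    while True:
        end = min(start + group_frames, total_frames)
        groups.append((start, end))
        if end == total_frames:
            break
        start += step
    return groups

clip_frames = 5
groups = split_stream(total_frames=100, group_frames=40,
                      overlap_frames=clip_frames - 1)

# Check that every possible clip position falls wholly inside one substream.
covered = all(
    any(a <= s and s + clip_frames <= b for a, b in groups)
    for s in range(0, 100 - clip_frames + 1)
)
```

With a smaller overlap, a clip that straddles the boundary between two substreams would not appear complete in either of them and could be missed.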

FIG. 7B shows another example database, one that contains multiple relatively small audio streams (for example, 720, 725, 730, 735, and 740). According to one embodiment, the splitting mechanism 610 may divide the database into small groups such that each group contains only one audio stream. According to another embodiment, as shown in FIG. 7B, the splitting mechanism may divide the database into small groups such that some groups each contain only one audio stream and other groups each contain several small audio streams. FIG. 7C shows yet another example database, one that contains multiple relatively small audio streams (for example, 750, 755, and 760) together with a large audio stream (for example, 770). The splitting mechanism may divide the relatively small audio streams into groups that each contain only one audio stream, or into groups of which some contain only one audio stream (for example, 750) and others contain several small audio streams (for example, 755 and 760 may be placed in the same group). A large audio stream such as 770 may be split into multiple smaller, partially overlapping substreams (for example, 772 and 774), each of which may constitute one group, following the method shown in FIG. 7A.

The splitting mechanism also divides a large audio database into groups of a suitable size so as to reduce both duplicated computation (which arises where a large audio stream is split into partially overlapping smaller substreams) and load imbalance in parallel processing by multiple processors. A smaller group size can increase the duplicated computation, whereas a larger group size can lead to significant load imbalance. According to one embodiment, the group size may be about 25 times the size of the target audio clip.

Returning to FIG. 6, the scheduler 620 may dynamically schedule the groups of a large database to the processors of the multiprocessor system so that each processor has one group to process at a time. The scheduler periodically checks which processors in the system are available and assigns an audio group to each available processor, which then processes the group and searches it for the target audio clip. When another processor later becomes available, the scheduler may assign a group to it. The scheduler also assigns a not-yet-searched audio group to a processor as soon as that processor finishes searching its previously assigned group, regardless of whether the other processors have finished their searches. In practice, even when the groups are the same size, the number of segments skipped during the search can differ, so the time needed to search for the same target audio clip may differ from processor to processor. Dynamic scheduling as described above can effectively reduce this load imbalance.
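Dynamic scheduling of this kind can be illustrated with a worker pool whose idle workers pull the next pending group as soon as they finish the previous one. This is a sketch rather than the patent's implementation: threads stand in for processors, and `search_one_group` is a hypothetical placeholder for the real per-group search.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_one_group(group_id, group):
    """Placeholder search: report positions of items equal to "target"."""
    return group_id, [i for i, item in enumerate(group) if item == "target"]

def search_database(groups, num_workers=4):
    """Hand out groups to whichever worker is free; collect hits per group."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Submitted groups wait in the pool's queue; a worker takes the next
        # one as soon as it is done, without waiting for the other workers.
        futures = [pool.submit(search_one_group, gid, g)
                   for gid, g in enumerate(groups)]
        for future in as_completed(futures):
            gid, hits = future.result()
            results[gid] = hits
    return results

groups = [["a", "target"], ["b"], ["target", "c", "target"]]
hits = search_database(groups, num_workers=2)
```

For CPU-bound work in CPython, a process pool would be the closer analogue of multiple processors; threads are used here only to keep the example short and self-contained.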

The I/O optimization unit 630 may optimize I/O traffic on the system interconnect (for example, a system bus connecting the system's processors to shared system memory). The I/O optimization unit may decide not to load the entire audio database to be searched from disk into memory at the outset, while the data range for each processor is being defined. It may also have each processor read only a portion of its assigned segments from memory at a time. By optimizing I/O traffic in this way, the I/O optimization unit may reduce I/O contention, overlap I/O operations with computation, and help improve computational efficiency. As a result, the scalability of the audio search can be greatly improved.
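One way to picture reading only part of the assigned data at a time is a generator that walks a processor's byte range in fixed-size chunks. The function name, the byte-oriented interface, and the in-memory stream are assumptions made for illustration.

```python
import io

def read_in_chunks(stream, start, length, chunk_size):
    """Yield the byte range [start, start + length) in pieces of chunk_size."""
    stream.seek(start)
    remaining = length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:          # reached the end of the stream early
            break
        remaining -= len(chunk)
        yield chunk

# An in-memory stand-in for an audio file: 20 bytes with values 0..19.
data = io.BytesIO(bytes(range(20)))
chunks = list(read_in_chunks(data, start=4, length=10, chunk_size=4))
```

Because each chunk is handed over as soon as it is read, a consumer can start computing on one chunk while the next is still being fetched, which is the overlap of I/O and computation the paragraph above refers to.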

The audio search module 600 further includes a plurality of audio search units 640A to 640N. Each audio search unit (for example, 640A) is placed on one processor, processes the group assigned to that processor, and searches it for the target audio clip. Like the audio search module 400 shown in FIG. 4, an audio search unit has a feature extraction unit (for example, 410), a modeling mechanism (for example, 420), and a determination unit (for example, 430). Each audio search unit performs a sequential active search of its assigned audio group to identify the target audio clip. It does so by dividing the audio streams of the group into partially overlapping segments, extracting a feature vector sequence for each segment, and modeling each segment based on a CCGMM as shown in equations (1) to (4). Here, the length of each segment is the same as the length of the target audio clip. The CCGMM for the target audio clip, which all audio search units use, needs to be estimated only once, by one of the audio search units. Each audio search unit calculates the KL maximum distance between the model of each segment and the model of the target audio clip and may decide, based on that distance, whether the target audio clip has been detected. In addition, each audio search unit may skip several segments following the current segment when the KL maximum distance of the current segment is greater than a threshold.

FIG. 8 shows pseudocode for an example process 800 for performing robust, parallel audio search in a multiprocessor system. At line 802, the audio search module may be initialized; for example, the target audio clip file and the audio database file may be opened and global parameters initialized. At line 804, the large audio database may be divided into NG small groups, as illustrated in FIGS. 7A, 7B, and 7C. At line 806, a model (for example, a CCGMM) may be built for the target audio clip. At line 808, the NG audio groups may be dynamically scheduled to the available processors, and parallel processing of the scheduled groups may begin. Line 808 uses one particular instruction to set up the parallel implementation; other parallel implementation instructions may be used instead.

Lines 810 to 846 show how the processors of the multiprocessor system process and search each of the NG groups in parallel to identify the target. Note that, for convenience of explanation, the processing of lines 812 to 846 is shown as an iteration from the first group to the last; in practice, when multiple processors are available, they process multiple groups in parallel. At line 814, some or all of the audio streams in each group may be further divided into NS partially overlapping segments if those streams are longer in time than the target audio clip. Line 816 starts an iterative process, shown in lines 818 to 832, over the segments of the group. At line 820, a feature vector sequence (one vector per frame) may be extracted from the segment. At line 822, a model (for example, a CCGMM as in equations (1) to (3)) may be built for the segment. At line 824, the distance between the segment's model and the target audio clip's model (for example, the KL maximum distance of equation (4)) may be calculated. At line 826, whether the segment matches the target audio clip may be decided from the distance calculated at line 824 and a predetermined threshold #1: if the distance is less than threshold #1, the segment matches. At line 828, whether to skip the search of a predetermined number of subsequent segments (for example, M segments) in the same audio stream or substream may be decided from the distance calculated at line 824 and a predetermined threshold #2: if the distance is greater than threshold #2, the search of M segments may be skipped. According to one embodiment, the number of segments skipped may vary with the value of the distance. At line 830, the search results (for example, the index or start time of each matching segment in the group) may be stored in an array local to the processor handling that group. At line 842, the search results held in the local arrays of all processors may be summarized and output to the user.
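The per-group loop with its two thresholds can be sketched as below. FIG. 8 itself is not reproduced in this text, so the structure, the function name `scan_group`, and the sample distances are assumptions; the distance computation is passed in as a callable so the sketch stays independent of how the models are built.

```python
def scan_group(num_segments, distance_of, match_threshold, skip_threshold, skip_count):
    """Scan one group's segments in order, recording matches and skipping ahead.

    distance_of(i) returns the model distance of segment i to the target clip.
    Returns (matching segment indices, indices that were actually evaluated).
    """
    matches = []      # local result array for this group
    evaluated = []
    index = 0
    while index < num_segments:
        distance = distance_of(index)
        evaluated.append(index)
        if distance < match_threshold:      # threshold #1: match found
            matches.append(index)
        if distance > skip_threshold:       # threshold #2: far from the target
            index += skip_count             # skip the next M segments
        index += 1
    return matches, evaluated

# Hypothetical precomputed distances for eight consecutive segments.
distances = [9.0, 8.0, 7.0, 0.1, 3.0, 9.5, 9.0, 9.0]
matches, evaluated = scan_group(len(distances), distances.__getitem__,
                                match_threshold=0.5, skip_threshold=5.0,
                                skip_count=2)
```

Here only four of the eight segments are modeled and compared. How many segments get skipped depends on the data, which is why the running time of equally sized groups can differ and why dynamic scheduling helps.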

Using the robust, parallel search strategy outlined in FIG. 8 together with other techniques such as I/O optimization can greatly improve the speed of searching for a target audio clip in a large audio database on a multiprocessor system. In one experiment, searching a 27-hour audio stream for a 15-second target audio clip on a 16-way Unisys system was found to be 11 times faster than searching the same audio stream for the same target audio clip serially.

According to one embodiment, a modified search strategy may be used. Under this strategy, a preliminary model (for example, a CCGMM) is built for the first K frames (K ≥ 1) of the target audio clip, in addition to a full model for the entire target audio clip. Correspondingly, a preliminary model (for example, a CCGMM) may first be built for the first K frames (K ≥ 1) of each audio segment. During the active search, the preliminary model of the first K frames of each audio segment is first compared with the preliminary model of the first K frames of the target audio clip to produce a preliminary similarity measure. If the preliminary similarity measure shows that the two preliminary models are very similar, a full model is built for the entire audio segment and compared with the full model of the entire target audio clip. Otherwise, no full model is built for the audio segment, and the search moves on to the next segment, again starting by building a preliminary model for its first K frames and comparing it with the preliminary model of the target audio clip. This modified search strategy can further reduce the computational load.
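The two-stage check can be sketched as follows. To keep the example small, the "model" is just the mean of the frame values and the "distance" is an absolute difference; both are stand-ins for the CCGMM and its distance, and every name here is hypothetical.

```python
def two_stage_match(segment, target, k, model_of, distance,
                    prelim_threshold, full_threshold):
    """Compare cheap models of the first k frames before building full models."""
    prelim = distance(model_of(segment[:k]), model_of(target[:k]))
    if prelim >= prelim_threshold:
        # Preliminary models differ: skip the full model for this segment.
        return False
    return distance(model_of(segment), model_of(target)) < full_threshold

def mean_model(frames):
    """Toy stand-in for a model: the mean of scalar frame values."""
    return sum(frames) / len(frames)

def abs_distance(a, b):
    """Toy stand-in for a model distance."""
    return abs(a - b)

target = [1.0, 1.0, 5.0, 5.0]
same_clip = two_stage_match([1.0, 1.0, 5.0, 5.0], target, 2,
                            mean_model, abs_distance, 0.5, 0.5)
other_clip = two_stage_match([9.0, 9.0, 5.0, 5.0], target, 2,
                             mean_model, abs_distance, 0.5, 0.5)
```

The saving comes from the early return: a segment whose opening frames already differ from the target never pays for a full model.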

Although example embodiments of the disclosed subject matter have been described with reference to the block diagrams and flowcharts of FIGS. 1 to 8, those skilled in the art will readily appreciate that the disclosed subject matter can be implemented in many other ways. For example, the order in which the blocks of a flowchart are executed may be changed, and/or some of the blocks in the block diagrams and flowcharts described may be changed, removed, or combined.

The foregoing description has covered various aspects of the disclosed subject matter. Specific numbers, systems, and configurations were set out in order to explain the subject matter thoroughly. However, it will be apparent to those skilled in the art having the benefit of this disclosure that the subject matter can be practiced without these specific details. In other instances, well-known features, components, or modules were omitted, simplified, combined, or split to avoid obscuring the disclosed subject matter.

Various embodiments of the disclosed subject matter may be implemented in hardware, firmware, software, or a combination of these. They may also be described by reference to, or in conjunction with, program code. Program code includes, for example, instructions, functions, procedures, data structures, logic, application programs, and design representations or formats for simulation, emulation, and fabrication of a design; when accessed by a machine, it causes the machine to perform tasks, define abstract data types or low-level hardware contexts, or produce a result.

For simulation, program code may represent hardware using a hardware description language or another functional description language that essentially provides a model of how the designed hardware is expected to behave. Program code may be assembly or machine language, or data that can be compiled and/or interpreted. It is also common in the art to speak of software, in one form or another, as taking an action or causing a result. Such expressions are merely a shorthand way of saying that execution of the program code by a processing system causes a processor to perform an action or produce a result.

Program code may be stored in, for example, volatile and/or non-volatile memory, such as storage devices and/or an associated machine-readable or machine-accessible medium. Such media include solid-state memory, hard drives, floppy disks, optical storage, tapes, flash memory, memory sticks, digital video disks, and digital versatile discs (DVDs), as well as more exotic media such as machine-accessible biological state-preserving storage. A machine-readable medium may include any mechanism for storing, transmitting, or receiving information in a form readable by a machine, and may include a tangible medium through which electrical, optical, acoustic, or other forms of propagated signals or carrier waves encoding the program code can pass, such as antennas, optical fibers, and communication interfaces. Program code may be transmitted in the form of packets, serial data, parallel data, propagated signals, and so on, and may be used in a compressed or encrypted format.

Program code may be implemented in programs that run on programmable machines such as mobile or stationary computers, personal digital assistants (PDAs), set-top boxes, mobile phones and pagers, and other electronic devices, each including a processor, volatile and/or non-volatile memory readable by the processor, at least one input device, and/or one or more output devices. Program code may be applied to data entered with the input device to carry out the described embodiments and generate output information. The output information may be applied to one or more output devices. Those skilled in the art will appreciate that embodiments of the disclosed subject matter can be practiced with various computer system configurations, including multiprocessor or multi-core processor systems, minicomputers, mainframe computers, and pervasive or miniature computers or processors that may be embedded in virtually any device. Embodiments of the disclosed subject matter can also be practiced in distributed computing environments in which tasks are performed by remote processing devices linked through a communications network.

Although operations have been described as a sequential process, some of them may in fact be performed in parallel, concurrently, and/or in a distributed environment, with program code stored locally and/or remotely for access by single-processor or multiprocessor machines. In addition, in some embodiments the order of operations may be rearranged without departing from the spirit of the disclosed subject matter. Program code may be used by, or in conjunction with, embedded controllers.

Although the disclosed subject matter has been described with reference to example embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the example embodiments, as well as other embodiments of the subject matter, will be apparent to those skilled in the art and are deemed to lie within the scope of the disclosed subject matter.

Claims

A method of searching an audio database to identify a target audio clip in a multiprocessor system, the method comprising:
dividing the audio database into a plurality of groups;
building a model for the target audio clip;
dynamically scheduling the plurality of groups to a plurality of processors of the multiprocessor system; and
processing the plurality of scheduled groups in parallel using the plurality of processors to search for the target audio clip,
wherein processing the plurality of scheduled groups in parallel includes:
dividing each of the plurality of scheduled groups into a plurality of segments;
building a preliminary model for an initial portion of the frames of the target audio clip, building a preliminary model for an initial portion of the frames of at least one of the segments, and obtaining a preliminary similarity measure between the two preliminary models;
if the preliminary similarity measure indicates that the two preliminary models are similar, building a model for each of the segments, determining a similarity measure between the model of the target audio clip and the model of each of the segments, and determining, based on the similarity measure, that the segment matches the target audio clip; and
omitting processing of the immediately following segments based on the similarity measure.

The method of claim 1, wherein omitting processing of the immediately following segments includes omitting processing of a number of the segments determined according to the value of the similarity measure.

The method according to claim 1 or 2, wherein building a model for the target audio clip includes extracting a feature vector sequence from the target audio clip and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components, and
building a model for each of the segments includes, for each of the segments, extracting a feature vector sequence of the segment and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components.

The method according to any one of claims 1 to 3, wherein the plurality of divided segments have portions that partially overlap each other.

The method according to any one of claims 1 to 4, wherein dividing the audio database includes determining a size for each of the plurality of groups so as to reduce load imbalance in the parallel processing of the plurality of groups and the amount of computation duplicated between the plurality of groups.

Based on the similarity measure, determining that the segment matches the target audio clip comprises:
The method according to any one of claims 1 to 5, comprising determining that the segment matches the target audio clip if the similarity measure is greater than or equal to a predetermined threshold.

A program that, when executed by a computer, causes the computer to execute:
dividing an audio database into a plurality of groups;
building a model for a target audio clip;
dynamically scheduling the plurality of groups to a plurality of processors of a multiprocessor system; and
processing the plurality of scheduled groups in parallel using the plurality of processors to search for the target audio clip,
wherein processing the plurality of scheduled groups in parallel includes:
dividing each of the plurality of scheduled groups into a plurality of segments;
building a preliminary model for an initial portion of the frames of the target audio clip, building a preliminary model for an initial portion of the frames of at least one of the segments, and obtaining a preliminary similarity measure between the two preliminary models;
if the preliminary similarity measure indicates that the two preliminary models are similar, building a model for each of the segments, determining a similarity measure between the model of the target audio clip and the model of each of the segments, and determining, based on the similarity measure, that the segment matches the target audio clip; and
omitting processing of the immediately following segments based on the similarity measure.

The program according to claim 7, wherein omitting processing of the immediately following segments includes omitting processing of a number of the segments determined according to the value of the similarity measure.

The program according to claim 7 or 8, wherein building a model for the target audio clip includes extracting a feature vector sequence from the target audio clip and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components, and
building a model for each of the segments includes, for each of the segments, extracting a feature vector sequence of the segment and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components.

The program according to claim 7, wherein the plurality of segments have portions that partially overlap each other.

The program according to any one of claims 7 to 10, wherein dividing the audio database includes determining a size for each of the plurality of groups so as to reduce load imbalance in the parallel processing of the plurality of groups and the amount of computation duplicated between the plurality of groups.

The program according to any one of claims 7 to 11, wherein determining, based on the similarity measure, that the segment matches the target audio clip comprises determining that the segment matches the target audio clip if the similarity measure is equal to or greater than a predetermined threshold.

An apparatus comprising:
A memory for receiving voice data from a voice database; and
A plurality of processor cores connected to the memory,
The apparatus identifying a target audio clip by searching the voice database, wherein the apparatus is configured for:
Dividing the voice database into a plurality of groups;
Building a model for the target audio clip;
Dynamically scheduling the plurality of groups for the plurality of processor cores;
Dividing, by each of the processor cores, each of the scheduled groups into a plurality of segments;
Building a temporary model for an initial portion of the frames included in the target audio clip and a temporary model for an initial portion of the frames included in at least one of the segments, and obtaining a temporary similarity measure between the two temporary models;
If the temporary similarity measure indicates that the two temporary models are similar, building, by each of the processor cores, a model for each of the segments, determining, by each of the processor cores, a similarity measure between the model of the target audio clip and the model of each of the segments, and determining, by each of the processor cores, that the segment matches the target audio clip based on the similarity measure; and
Omitting, by each of the processor cores, processing of the immediately following segment based on the similarity measure,
Wherein the plurality of divided segments have portions that partially overlap each other.
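
Dynamic scheduling of groups across processor cores can be sketched with a worker pool in which each idle worker takes the next pending group. The sketch uses Python's `concurrent.futures.ThreadPoolExecutor`; the function name `search_groups_in_parallel` and the `search_group` callback are hypothetical, and for CPU-bound work in CPython a `ProcessPoolExecutor` would be the closer analogue of multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def search_groups_in_parallel(
    groups: Sequence[Sequence[int]],
    search_group: Callable[[Sequence[int]], List[int]],
    num_workers: int = 4,
) -> List[Tuple[int, int]]:
    """Return (group index, position within group) for every reported hit.

    Groups are queued once; whichever worker becomes free takes the next
    group, so a slow group does not hold up the remaining ones.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_group = list(pool.map(search_group, groups))  # results in group order
    return [(g, hit) for g, hits in enumerate(per_group) for hit in hits]
```

Here `search_group` stands for the per-group work recited above (segmenting, modeling, comparing, skipping), and any callable that returns the matching positions within one group can be passed in.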

The apparatus according to claim 13, wherein each of the processor cores omits processing of a number of the immediately following segments, the number being determined according to the value of the similarity measure.

The apparatus according to claim 13 or 14, wherein
The model for the target audio clip is built by extracting a feature vector sequence from the target audio clip and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components, and
The model for each of the segments is built by, for each of the segments, extracting a feature vector sequence of the segment and modeling the feature vector sequence based on a Gaussian mixture model (GMM) including a plurality of Gaussian components.

The apparatus according to any one of claims 13 to 15, wherein a size is determined for each of the plurality of groups so as to reduce load imbalance and the amount of overlapping computation between the plurality of groups in the parallel processing of the plurality of groups, and the voice database is divided into a plurality of groups of the determined sizes.

The apparatus according to any one of claims 13 to 16, wherein the segment is determined to match the target audio clip if the similarity measure is greater than or equal to a predetermined threshold.

The apparatus according to claim 13, wherein the plurality of processor cores are included in a plurality of processors.