JP4877112B2

JP4877112B2 - Voice processing apparatus and program

Info

Publication number: JP4877112B2
Application number: JP2007183480A
Authority: JP
Inventors: 裕司久湊
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-07-12
Filing date: 2007-07-12
Publication date: 2012-02-15
Anticipated expiration: 2027-07-12
Also published as: JP2009020352A

Abstract

PROBLEM TO BE SOLVED: To suppress a decrease in the accuracy of speech recognition when a speaker's position changes.

SOLUTION: A position specifying unit 14 specifies the direction (j) of each speaker from a voice signal S. A speaker identifying unit 16 distinguishes the speaker of each voice represented by the voice signal S. An adaptive model generating unit 24 generates, based on a voice signal S1 for adaptation, an acoustic model M corresponding to the combination of the direction (j) specified by the position specifying unit 14 from the voice signal S1 and the speaker distinguished by the speaker identifying unit 16 from the voice signal S1. A speech recognizing unit 26 specifies the characters corresponding to the voice represented by a voice signal S2 for recognition, based on the acoustic model M corresponding to the combination of the direction (j) specified by the position specifying unit 14 from the voice signal S2 and the speaker distinguished by the speaker identifying unit 16 from the voice signal S2.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technology for recognizing speech.

Speech recognition techniques that recognize speech (and output characters corresponding to the speech) using an acoustic model such as a hidden Markov model have been proposed. For example, Patent Document 1 discloses a technique in which a plurality of acoustic model candidates, each corresponding to one of a plurality of feature amounts extracted from speech signals, are created in advance, and the candidate that is acoustically closest to the input speech signal is selected and used for speech recognition.
Patent Document 1: JP 2003-202891 A

Speech uttered by a speaker in a particular space is picked up by a sound collecting device only after the acoustic characteristics of that space (for example, the reflection and absorption characteristics of the walls) have been imposed on it. The acoustic characteristics imposed on the uttered sound vary with the speaker's position in the space. Consequently, the technique of Patent Document 1, in which none of the acoustic model candidates reflects the speaker's position, has the problem that recognition accuracy decreases when the speaker's position changes. In view of these circumstances, one object of the present invention is to suppress the decrease in speech recognition accuracy that occurs when a speaker's position changes.

To solve the above problem, a speech processing apparatus according to the present invention comprises: a storage device that stores a plurality of acoustic models, one for each combination of a speaker and a speaker position; position specifying means for specifying the position of each speaker from a speech signal for adaptation (for example, the speech signal S1 in FIGS. 1 and 2); speaker identification means for distinguishing the speakers of the speech represented by the adaptation speech signal; input means with which a user inputs characters corresponding to the speech represented by the adaptation speech signal; adaptive model generating means; and speech recognition means. The adaptive model generating means performs adaptation processing that updates, among the stored acoustic models, each acoustic model corresponding to a combination of a speaker distinguished by the speaker identification means and the position specified for that speaker by the position specifying means, based on the adaptation speech signal and the characters input through the input means, thereby generating the acoustic model for that combination and storing it in the storage device. It also updates each stored acoustic model that was not updated by the adaptation processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that model and to mutually different speakers, and that include an acoustic model updated by the adaptation processing. The speech recognition means specifies the characters corresponding to the speech represented by a speech signal for recognition (for example, the speech signal S2 in FIGS. 1 and 2), that is, recognizes the speech, based on the acoustic model that, among the acoustic models processed by the adaptive model generating means, corresponds to the combination of the speaker of that speech and the speaker's position. With this configuration, speech recognition is performed using an acoustic model adapted to the combination of the speaker and the speaker's position, so recognition accuracy can be improved compared with a configuration in which a common acoustic model is used regardless of the speaker and the speaker's position, or a configuration in which the acoustic model used for recognition is selected without regard to the speaker's position.

In a preferred aspect of the present invention, the adaptive model generating means operates in either a direction priority mode or a speaker priority mode. In the direction priority mode, each stored acoustic model that was not updated by the adaptation processing is updated to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that model and to mutually different speakers, and that include an acoustic model updated by the adaptation processing. In the speaker priority mode, each stored acoustic model that was not updated by the adaptation processing is updated to an acoustic model obtained by averaging two or more acoustic models that correspond to the same speaker as that model and to mutually different positions, and that include an acoustic model updated by the adaptation processing.

The speech processing apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to speech processing, or by cooperation between a general-purpose processing unit such as a CPU (Central Processing Unit) and a program. A program according to the present invention causes a computer having a storage device that stores a plurality of acoustic models, one for each combination of a speaker and a speaker position, to execute: position specifying processing that specifies the position of each speaker from a speech signal for adaptation; speaker identification processing that distinguishes the speakers of the speech represented by the adaptation speech signal; input processing in which a user inputs characters corresponding to the speech represented by the adaptation speech signal; adaptive model generation processing; and speech recognition processing. The adaptive model generation processing performs adaptation processing that updates, among the stored acoustic models, each acoustic model corresponding to a combination of a speaker distinguished in the speaker identification processing and the position specified for that speaker in the position specifying processing, based on the adaptation speech signal and the characters input in the input processing, thereby generating the acoustic model for that combination and storing it in the storage device. It also updates each stored acoustic model that was not updated by the adaptation processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that model and to mutually different speakers, and that include an acoustic model updated by the adaptation processing. The speech recognition processing specifies the characters corresponding to the speech represented by a speech signal for recognition, based on the acoustic model that, among the acoustic models remaining after the adaptive model generation processing, corresponds to the combination of the speaker of that speech and the speaker's position. This program provides the same operations and effects as the speech processing apparatus according to the present invention. The program may be provided to users stored on a computer-readable recording medium and installed on a computer, or distributed from a server device over a communication network and installed on a computer.

<A: First Embodiment>
FIG. 1 is a block diagram showing the configuration of a speech processing apparatus 100 according to the first embodiment of the present invention. As shown in the figure, the speech processing apparatus 100 is a computer system that includes a control device 10 and a storage device 30. An input device 42, an output device 44, and a sound emitting device 46 are connected to the control device 10. The input device 42 is a device (for example, a keyboard) with which a user inputs characters. The output device 44 is a display device that displays various images under the control of the control device 10. A printing device that prints images specified by the control device 10 may also be used as the output device 44. The sound emitting device 46 is a device (for example, a loudspeaker or headphones) that emits sound corresponding to a signal supplied from the control device 10.

The storage device 30 stores the program executed by the control device 10 and the various data used by the control device 10. Any known recording medium, such as a semiconductor storage device or a magnetic storage device, may be used as the storage device 30. As shown in FIG. 1, a speech signal S is stored in the storage device 30. The speech signal S consists of multiple channels, each representing the waveform of the sound that reached one of a plurality of sound collecting devices 62 arranged apart from one another in a speech input device 60 (a microphone array). In this embodiment, the speech represented by the speech signal S was recorded by the speech input device 60 during a meeting in which a plurality of participants speak from time to time in a space such as a conference room.

The storage device 30 also stores N acoustic model groups G (G1 to GN), each corresponding to a different speaker (N is an integer of 2 or more). The i-th (i = 1 to N) acoustic model group Gi consists of K acoustic models M (M[1,i] to M[K,i]), each of which models the characteristics of speech phoneme by phoneme (the first index, 1 to K, corresponds to the speaker direction j described later). A hidden Markov model is suitable as the acoustic model M. Initially, every acoustic model M in the N acoustic model groups G1 to GN is set to the same content. More specifically, the initial acoustic model M is an average model generated from speech collected from a sufficiently large number of speakers uttering close to the sound collecting device 62, that is, a standard model that depends very little on the speaker or on the environment at the time of utterance.

The control device 10 performs adaptation processing and recognition processing by executing the program stored in the storage device 30. The adaptation processing reflects the characteristics of the speech signal S in the acoustic models M, and the recognition processing specifies the characters corresponding to the speech represented by the speech signal S based on the acoustic models M after adaptation. As shown in FIG. 1, the control device 10 functions as a plurality of elements (a section specifying unit 12, a position specifying unit 14, a speaker identification unit 16, a selection unit 22, an adaptive model generation unit 24, and a speech recognition unit 26). The function of each element realized by the control device 10 (that is, the operation of the control device 10) is described in detail below, separately for adaptation processing and recognition processing. Each element of the control device 10 may also be realized by an electronic circuit such as a DSP dedicated to speech processing, and the control device 10 may be implemented in a distributed manner across a plurality of integrated circuits.

<During adaptation processing>
The section specifying unit 12 specifies a predetermined section of the speech signal S stored in the storage device 30 as the speech signal S1 for adaptation. The speech signal S1 is used to update the acoustic models M stored in the storage device 30. In this embodiment, the section specifying unit 12 extracts, as the speech signal S1, the section of the speech signal S from its start point until a predetermined time (for example, 5 minutes) has elapsed.
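The extraction of the leading section can be sketched as follows. This is a minimal illustration only; the function name `extract_adaptation_segment`, the array layout (one row per sound collecting device), and the sample rate are assumptions not taken from the specification.

```python
import numpy as np

def extract_adaptation_segment(signal, sample_rate, duration_s=300.0):
    """Return the leading part of a multichannel signal as the adaptation signal S1.

    signal: array of shape (channels, samples), one row per sound collecting device
            (assumed layout).
    duration_s: length of the section to keep, 5 minutes by default.
    """
    n = min(signal.shape[1], int(round(duration_s * sample_rate)))
    return signal[:, :n]

# Example: 4 channels, 10 s of audio at 16 kHz, first 2 s used for adaptation.
S = np.zeros((4, 160000))
S1 = extract_adaptation_segment(S, sample_rate=16000, duration_s=2.0)
```

If the recording is shorter than the requested duration, the whole signal is returned.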

The position specifying unit 14 specifies the position (direction) of each speaker in turn from the speech signal S1. In this embodiment, the position specifying unit 14 selects, from among K predetermined directions, the direction j of each speaker as seen from the speech input device 60 at the time the speech signal S1 was recorded. For example, for each utterance section obtained by dividing the speech signal S1 along the time axis by speaker (by utterance), that is, each section in which one speaker speaks continuously, the position specifying unit 14 specifies the speaker's direction j based on the relationship between the level differences or phase differences among the channels of the speech signal S1 and the positions of the sound collecting devices 62. Any known technique (for example, the technique disclosed in JP 2007-89058 A) may be used to specify the position of a speaker (sound source) from the speech signal S.
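As a rough illustration of choosing one of K candidate directions from inter-channel differences, the sketch below uses only the time lag between two channels, found by cross-correlation, and picks the candidate whose expected lag is closest. This is a simplified stand-in, not the method of the cited publication; the function names, the two-channel restriction, and the precomputed `expected_lags` table (which would follow from the microphone geometry) are all assumptions. Indices are 0-based here, whereas the text numbers directions from 1.

```python
import numpy as np

def estimate_lag(x0, x1, max_lag):
    """Lag in samples by which x1 trails x0, found by maximizing cross-correlation.

    x0 and x1 are assumed to have equal length.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            a, b = x0[:len(x0) - lag], x1[lag:]
        else:
            a, b = x0[-lag:], x1[:len(x1) + lag]
        scores.append(float(np.dot(a, b)))
    return int(lags[int(np.argmax(scores))])

def select_direction(x0, x1, expected_lags):
    """Index j of the candidate direction whose expected lag is closest to the measured lag."""
    max_lag = int(max(abs(l) for l in expected_lags))
    lag = estimate_lag(x0, x1, max_lag)
    return int(np.argmin([abs(lag - l) for l in expected_lags]))

# Example: channel 1 receives the same noise 3 samples later than channel 0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000)
x1 = np.concatenate([np.zeros(3), x0[:-3]])
j = select_direction(x0, x1, expected_lags=[-3, 0, 3])
```

A practical system would combine more than two channels and would likely also use level differences, as the text notes.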

The speaker identification unit 16 distinguishes the speaker of each voice represented by the speech signal S1. More specifically, the speaker identification unit 16 extracts an acoustic feature amount (for example, MFCC (Mel Frequency Cepstral Coefficients)) for each of the frames in each utterance section and classifies the feature amounts extracted from one utterance section into the same set (cluster). The speaker identification unit 16 then generates a table 32 (hereinafter "speaker information") that associates, for each speaker, a center vector representing the feature amounts in one set with an identifier i uniquely assigned to that set (speaker), and stores the table in the storage device 30. Each time it completes this processing for an utterance section, the speaker identification unit 16 outputs the identifier i of the corresponding speaker.
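One way the speaker information table might be built is sketched below. The specification does not say how utterance sections from the same speaker are merged, so the distance `threshold` rule, the Euclidean distance, and the function name `register_utterance` are assumptions made purely for illustration; the per-frame features are taken as already computed.

```python
import numpy as np

def register_utterance(speaker_info, features, threshold):
    """Assign a speaker identifier to one utterance section and update the table.

    speaker_info: dict mapping identifier i -> center vector (the "speaker information").
    features:     array (frames, dims) of per-frame features, e.g. MFCCs.
    threshold:    hypothetical distance below which the section is attributed
                  to an already registered speaker.
    """
    center = features.mean(axis=0)  # center vector representing this section
    if speaker_info:
        nearest = min(speaker_info,
                      key=lambda i: np.linalg.norm(center - speaker_info[i]))
        if np.linalg.norm(center - speaker_info[nearest]) < threshold:
            return nearest
    new_id = len(speaker_info) + 1  # identifiers start at 1, as in the text
    speaker_info[new_id] = center
    return new_id

# Example: the first and third sections are close in feature space, the second is not.
a = np.array([[0.0, 0.1], [0.1, 0.0]])
b = np.array([[10.0, 10.0], [10.2, 9.8]])
c = np.array([[0.2, 0.0], [0.0, 0.2]])
speaker_info = {}
ids = [register_utterance(speaker_info, f, threshold=1.0) for f in (a, b, c)]
```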

The selection unit 22 selects one of the plurality of (N × K) acoustic models M stored in the storage device 30 as the target of adaptation processing. In this embodiment, the selection unit 22 selects from the storage device 30 the acoustic model M[j,i] corresponding to the combination of the speaker identifier i specified by the speaker identification unit 16 and the direction j specified for that speaker by the position specifying unit 14.

Meanwhile, prior to the adaptation processing, the user inputs from the input device 42 a character string TIN corresponding to the speech represented by the adaptation speech signal S1. In this embodiment, the control device 10 supplies the speech signal S1 to the sound emitting device 46 before the adaptation processing is executed (before the character string TIN is input). The user listens to the sound output from the sound emitting device 46, determines the character string TIN, and inputs it on the input device 42.

The adaptive model generation unit 24 adapts (speaker adaptation and environment adaptation) the acoustic model M[j,i] selected by the selection unit 22 from among the acoustic models M stored in the storage device 30, based on the speech signal S1 supplied from the section specifying unit 12 and the character string TIN input from the input device 42. More specifically, within the acoustic model M[j,i], the phoneme model corresponding to each character of the character string TIN is changed to reflect the characteristics of the section of the speech signal S1 that corresponds to that character. The acoustic model M[j,i] stored in the storage device 30 is then replaced by the acoustic model M[j,i] created (changed) by the adaptive model generation unit 24. This processing is repeated for each utterance section. That is, the acoustic model M best suited to each combination of a speaker (identifier i) of the speech represented by the speech signal S1 and that speaker's direction j is generated in turn from each utterance section of the speech signal S1 and stored in the storage device 30 (the pre-adaptation acoustic model M is overwritten). However, because the adaptation speech signal S1 does not necessarily contain speech for every combination of speaker (identifier i) and direction j, some of the acoustic models M stored in the storage device 30 after the adaptation processing completes have not been updated and retain their initial content. The above is the operation of each element during adaptation processing.
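The per-section update loop can be sketched as follows. The specification does not fix the adaptation algorithm, so this sketch substitutes a simple MAP-style interpolation of per-phoneme mean vectors for a full hidden Markov model update. The array layout, the prior weight `tau`, the function name `adapt_models`, and the assumption that each frame already carries a phoneme label (obtained in practice by aligning the character string TIN to the audio) are all illustrative. Indices are 0-based.

```python
import numpy as np

def adapt_models(models, updated, sections, tau=10.0):
    """Update M[j, i] for each utterance section and record which models were touched.

    models:   array (K, N, P, D): K directions, N speakers, P phonemes, D feature dims
              (each model reduced to one mean vector per phoneme; an assumption).
    updated:  boolean array (K, N), set to True for every adapted model.
    sections: iterable of (j, i, frames, labels); frames is (T, D) and labels is (T,),
              giving the phoneme index of each frame (alignment assumed to be given).
    tau:      hypothetical weight of the existing mean in the MAP-style update.
    """
    for j, i, frames, labels in sections:
        for p in np.unique(labels):
            x = frames[labels == p]
            # Pull the phoneme mean toward the observed frames of this section.
            models[j, i, p] = (tau * models[j, i, p] + x.sum(axis=0)) / (tau + len(x))
        updated[j, i] = True
    return models, updated

# Example: K=2 directions, N=2 speakers, P=3 phonemes, D=2 dims, common initial content.
models = np.zeros((2, 2, 3, 2))
updated = np.zeros((2, 2), dtype=bool)
frames = np.ones((10, 2))
labels = np.zeros(10, dtype=int)  # all frames aligned to phoneme 0
models, updated = adapt_models(models, updated, [(0, 1, frames, labels)], tau=10.0)
```

The `updated` mask records which (j, i) combinations appeared in S1; every other model keeps its initial content, which is the situation the text describes.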

<During recognition processing>
During recognition processing, the entire speech signal S is output from the storage device 30 sequentially, from its start point to its end point, as the speech signal S2 for recognition. The speech signal S2 is the target of speech recognition by the speech recognition unit 26. The section used for the adaptation processing described above is thus a part of the speech signal S2 that is actually subjected to speech recognition. Following the same procedure as in adaptation processing, the position specifying unit 14 specifies the position (direction j) of each speaker in turn from the recognition speech signal S2.

The speaker identification unit 16 distinguishes the speaker of each voice represented by the speech signal S2 and specifies that speaker's identifier i. More specifically, as in adaptation processing, the speaker identification unit 16 extracts an acoustic feature amount (for example, MFCC) for each of the frames in each utterance section into which the speech signal S2 is divided, and specifies a center vector representing the feature amounts extracted from one utterance section. It then searches the speaker information 32 in the storage device 30 for the center vector closest to the one specified for the utterance section, and specifies the identifier i associated with that center vector.
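The lookup against the speaker information can be sketched as a nearest-center search. The use of Euclidean distance and the function name `identify_speaker` are assumptions; the specification only requires finding the closest stored center vector.

```python
import numpy as np

def identify_speaker(speaker_info, features):
    """Return the identifier whose stored center vector is closest to this section.

    speaker_info: dict mapping identifier i -> center vector (must not be empty).
    features:     array (frames, dims) of per-frame features for one utterance section.
    """
    center = features.mean(axis=0)
    return min(speaker_info,
               key=lambda i: np.linalg.norm(center - speaker_info[i]))

# Example with two registered speakers.
table = {1: np.array([0.0, 0.0]), 2: np.array([10.0, 10.0])}
section = np.array([[9.0, 11.0], [11.0, 9.0]])
speaker_id = identify_speaker(table, section)
```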

The selection unit 22 selects one of the plurality of acoustic models M stored in the storage device 30 for use in speech recognition. More specifically, the selection unit 22 selects from the storage device 30 the acoustic model M[j,i] corresponding to the combination of the speaker identifier i specified by the speaker identification unit 16 and the direction j specified for that speaker by the position specifying unit 14.

The speech recognition unit 26 specifies the character string TOUT corresponding to the speech represented by the speech signal S2, based on the acoustic model M[j,i] selected by the selection unit 22 from among the acoustic models M stored in the storage device 30. Any known technique may be used to specify the character string TOUT using the acoustic model M[j,i]. The character string TOUT is output (displayed or printed) from the output device 44. The above is the operation of each element during recognition processing.

As described above, the character string TOUT corresponding to the speech of the speech signal S2 is specified using the acoustic model M[j,i] that the adaptation processing has optimized for the speaker (identifier i) of that speech and the speaker's direction j. Recognition accuracy can therefore be improved compared with a configuration in which a common acoustic model M is used regardless of the speaker and the speaker's position, or with the configuration of Patent Document 1, in which the acoustic model used for recognition is selected without regard to the speaker's position (for example, according to the characteristics of the speech signal alone).

Because the speech signal S1 does not necessarily contain speech for every combination of speaker and direction j, an acoustic model M[j,i] that was not updated by the adaptation processing (hereinafter an "un-updated acoustic model") may be used to specify the character string TOUT during recognition processing. Since an un-updated acoustic model M is a standard model that does not depend on the speaker or on the environment at the time of utterance (direction j), the accuracy of recognizing the character string TOUT is lower than when an acoustic model M[j,i] updated by the adaptation processing is used. Nevertheless, compared with using an acoustic model M[j,i] that reflects the characteristics of speech uttered by a different speaker in a different environment, recognition accuracy can still be maintained at a certain level.

In addition, because the character string TIN input from the input device 42 is used to update the acoustic models M in the adaptation processing, the acoustic models M can be adapted more accurately than in a configuration in which adaptation is performed based on the speech signal S1 alone. In this configuration the user must listen to the speech of the speech signal S1 and input the character string TIN, but the effort required is far less than that of listening to the entire, lengthy speech signal S2 and transcribing it.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. Elements of this embodiment whose operation and function are the same as in the first embodiment are given the same reference numerals as above, and their detailed description is omitted as appropriate.

FIG. 2 is a block diagram showing the configuration of the speech processing apparatus 100. As shown in the figure, the speech processing apparatus 100 of this embodiment adds an auxiliary generation unit 28 to the adaptive model generation unit 24 of the first embodiment. After the adaptation processing based on the speech signal S1 has been executed, the auxiliary generation unit 28 updates the un-updated acoustic models M based on other acoustic models M. The auxiliary generation unit 28 operates in either a direction priority mode or a speaker priority mode, and the operation mode is selected according to input on the input device 42.

When the direction priority mode is selected, the auxiliary generation unit 28 generates an acoustic model Mnew[j] by averaging the N acoustic models M[j,1] to M[j,N] that correspond to direction j across the N acoustic model groups G1 to GN. That is, the processing performed by the auxiliary generation unit 28 is expressed by the following equation (1).

$$M_{\mathrm{new}}[j] = \frac{1}{N}\sum_{i=1}^{N} M[j,i] \qquad (1)$$

When the total number N of speakers is sufficiently large, the acoustic model Mnew[j] corresponds to a model of speech uttered from direction j by a speaker with a standard voice (that is, a model that depends on direction j but not on the speaker). The auxiliary generation unit 28 updates each un-updated acoustic model M corresponding to direction j in the N acoustic model groups G1 to GN to the acoustic model Mnew[j]. By executing this processing for each of the K directions in turn, every acoustic model M in the acoustic model groups G1 to GN ends up updated, either by the adaptation processing or by this averaging.

When the speaker priority mode is selected, on the other hand, the auxiliary generation unit 28 generates an acoustic model Mnew[i] by averaging the K acoustic models M[1,i] to M[K,i] in the acoustic model group Gi corresponding to the identifier i. That is, the processing performed by the auxiliary generation unit 28 is expressed by equation (2).
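Equation (2) is likewise not reproduced in this text. Under the same assumption that acoustic models can be averaged element by element, the averaging over the K directions for the speaker with identifier i would take a form such as:

```latex
M_{\mathrm{new}}[i] = \frac{1}{K} \sum_{j=1}^{K} M[j,i] \qquad (2)
```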

The acoustic model Mnew[i] corresponds to a model of speech recorded when the speaker with the identifier i speaks close to the sound collection device 62 (that is, a model that depends on the speaker but not on the direction j). The auxiliary generation unit 28 updates each unupdated acoustic model M in the acoustic model group Gi to the acoustic model Mnew[i]. By executing this processing for each of the N acoustic model groups G1 to GN in turn, every acoustic model M in the acoustic model groups G1 to GN ends up updated.
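The two modes can be illustrated with a minimal sketch. It assumes, purely for illustration, that each acoustic model is a flat list of numeric parameters and that the set of models updated by the adaptive processing is known; the names `fill_unupdated`, `models`, and `updated` are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Set, Tuple

Model = List[float]    # assumption: one acoustic model as a flat parameter vector
Key = Tuple[int, int]  # (direction j, speaker identifier i), 1-based as in the text


def average(pool: List[Model]) -> Model:
    """Element-wise mean of equally sized parameter vectors."""
    return [sum(values) / len(pool) for values in zip(*pool)]


def fill_unupdated(models: Dict[Key, Model], updated: Set[Key],
                   K: int, N: int, mode: str) -> Dict[Key, Model]:
    """Return a copy in which every unupdated model is replaced by an average.

    mode == "direction": average the N models M[j,1]..M[j,N] of direction j.
    mode == "speaker":   average the K models M[1,i]..M[K,i] of speaker i.
    """
    result = dict(models)
    for j in range(1, K + 1):
        for i in range(1, N + 1):
            if (j, i) in updated:
                continue  # models updated by the adaptive processing are kept
            if mode == "direction":
                pool = [models[(j, n)] for n in range(1, N + 1)]
            elif mode == "speaker":
                pool = [models[(k, i)] for k in range(1, K + 1)]
            else:
                raise ValueError("mode must be 'direction' or 'speaker'")
            # Averages are taken over the models as they stood before this pass.
            result[(j, i)] = average(pool)
    return result
```

The averages are computed from the input dictionary rather than from the partially filled result, so the outcome does not depend on the order in which directions and speakers are visited.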

In this embodiment, an unupdated acoustic model M is updated to the acoustic model Mnew[j], which reflects the direction j, in the direction priority mode, and to the acoustic model Mnew[i], which reflects the speaker (identifier i), in the speaker priority mode. The accuracy of speech recognition can therefore be improved compared with the first embodiment, in which unupdated acoustic models M are used for recognition processing with their initial contents. In other words, because the loss of recognition accuracy caused by unupdated acoustic models M is mitigated, recognition accuracy can be maintained even when the voice signal S1 is short (that is, when many acoustic models M are likely to remain unupdated).

In the direction priority mode described above, the acoustic model Mnew[j] is generated by averaging the N acoustic models M[j,1] to M[j,N] corresponding to the direction j, but the method of generating the acoustic model Mnew[j] and the acoustic models M used to generate it may be changed as appropriate. For example, the acoustic model Mnew[j] that replaces an unupdated acoustic model M[j,i] may be generated from the (N−1) acoustic models M that remain when M[j,i] is excluded from the N acoustic models M[j,1] to M[j,N] corresponding to the direction j. A configuration that generates the acoustic model Mnew[j] only from those of the N acoustic models M[j,1] to M[j,N] that have been updated by the adaptive processing is also suitable. In short, any configuration will do as long as acoustic models M of other speakers corresponding to the direction j are used to generate the updated acoustic model Mnew[j].

Similarly, in the speaker priority mode, the method of generating the acoustic model Mnew[i] and the acoustic models M used to generate it may be changed as appropriate. For example, the acoustic model Mnew[i] that replaces an unupdated acoustic model M[j,i] may be generated from the (K−1) acoustic models M that remain when M[j,i] is excluded from the K acoustic models M[1,i] to M[K,i] of the acoustic model group Gi, or only from those of the acoustic models M[1,i] to M[K,i] that have been updated by the adaptive processing. In short, any configuration will do as long as acoustic models M of other directions corresponding to the speaker with the identifier i are used to generate the updated acoustic model Mnew[i].
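The variant that averages only the models already updated by the adaptive processing can be sketched as follows. As before, representing a model as a flat parameter list is an assumption, and the names are hypothetical. Leaving a model at its initial contents when no updated model is available in its row or column is also a choice made for this sketch; the text does not specify that case.

```python
from typing import Dict, List, Set, Tuple

Model = List[float]    # assumption: one acoustic model as a flat parameter vector
Key = Tuple[int, int]  # (direction j, speaker identifier i), 1-based


def fill_from_updated(models: Dict[Key, Model], updated: Set[Key],
                      K: int, N: int, mode: str) -> Dict[Key, Model]:
    """Replace each unupdated model by the mean of the updated models only."""
    result = dict(models)
    for (j, i) in models:
        if (j, i) in updated:
            continue
        if mode == "direction":
            keys = [(j, n) for n in range(1, N + 1)]  # same direction, all speakers
        elif mode == "speaker":
            keys = [(k, i) for k in range(1, K + 1)]  # same speaker, all directions
        else:
            raise ValueError("mode must be 'direction' or 'speaker'")
        # The target itself is unupdated, so it is excluded automatically.
        pool = [models[key] for key in keys if key in updated]
        if pool:  # sketch assumption: keep the initial model if nothing to average
            result[(j, i)] = [sum(v) / len(pool) for v in zip(*pool)]
    return result
```

Compared with plain averaging over all N (or K) models, this keeps initial, unadapted parameters from diluting the replacement model, at the cost of averaging over fewer models.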

<C: Modifications>
Various modifications can be made to each of the above embodiments. Specific modifications are illustrated below. Two or more of the following examples may be selected and combined as desired.

(1) Modification 1
In the recognition processing of the first embodiment, an unupdated acoustic model M is used with its initial contents, but a configuration that selects another acoustic model M in its place may also be adopted. For example, a suitable configuration lets input to the input device 42 select between a direction priority mode, in which the acoustic model M is selected with priority on the direction of the speaker, and a speaker priority mode, in which the acoustic model M is selected with priority on the speaker. Suppose the acoustic model M[j,i] corresponding to the combination of the direction j specified by the position specifying unit 14 and the identifier i specified by the speaker identification unit 16 has not been updated. In the direction priority mode, the selection unit 22 then selects, from the acoustic models M corresponding to the direction j, the acoustic model M of the speaker whose voice is closest in feature amount to the voice of the speaker with the identifier i. The similarity between speakers' voices is determined, for example, from the distance between the center vectors included in the speaker information 32 (the smaller the distance, the more similar the voices). In the speaker priority mode, on the other hand, the selection unit 22 selects, from the acoustic model group Gi, the acoustic model M corresponding to the direction closest to the direction j. As in the second embodiment, this configuration can improve the accuracy of speech recognition compared with the first embodiment, in which an unupdated acoustic model M is used for recognition processing with its initial contents.
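The substitute selection in Modification 1 can be sketched as below. Several points are assumptions made only for this sketch: directions are represented as angles in degrees, the center vectors are compared by Euclidean distance, candidates are restricted to models already updated by the adaptive processing, and the initial model is used when no candidate exists. The names `select_model`, `centers`, and `angles` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Key = Tuple[int, int]  # (direction j, speaker identifier i)


def angular_gap(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_model(j: int, i: int, updated: Set[Key],
                 centers: Dict[int, List[float]],
                 angles: Dict[int, float], mode: str) -> Key:
    """Return the key of the acoustic model to use for direction j, speaker i."""
    if (j, i) in updated:
        return (j, i)  # the matching model was adapted; use it as is
    if mode == "direction":
        # Same direction j, speaker whose center vector is nearest to speaker i.
        speakers = sorted(n for (d, n) in updated if d == j)
        if not speakers:
            return (j, i)
        best = min(speakers, key=lambda n: math.dist(centers[n], centers[i]))
        return (j, best)
    if mode == "speaker":
        # Same speaker i, direction whose angle is nearest to direction j.
        directions = sorted(d for (d, n) in updated if n == i)
        if not directions:
            return (j, i)
        best = min(directions, key=lambda d: angular_gap(angles[d], angles[j]))
        return (best, i)
    raise ValueError("mode must be 'direction' or 'speaker'")
```

Unlike the second embodiment, nothing is rewritten in storage here; only the choice of model at recognition time changes.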

(2) Modification 2
In each of the above embodiments, the voice represented by the voice signal S1 is emitted before the adaptive processing is executed, but the method of letting the user know the character string TIN may be changed as appropriate. For example, the speech recognition unit 26 may perform speech recognition on the voice signal S1 using the initial acoustic models M as they are before the adaptive processing, and output the resulting character string from the output device 44. Because speech recognition using unupdated acoustic models M has low accuracy, the character string output from the output device 44 may be inaccurate. The user therefore corrects the character string output by the output device 44 and inputs it from the input device 42 as the character string TIN. Compared with a configuration in which the user must work out the entire character string TIN by listening to the voice, this configuration has the advantage of reducing the user's workload. Input of the character string TIN by the user is not essential to the present invention, however. For example, a configuration that executes the adaptive processing based only on the voice signal S1 may also be adopted.

(3) Modification 3
In each of the above embodiments, a section of predetermined length from the beginning of the voice signal S is extracted as the adaptation voice signal S1, but the section specifying unit 12 may specify the voice signal S1 by any method. For example, the section specifying unit 12 may specify, as the voice signal S1, a section of the voice signal S in which many speakers are present. Compared with a case in which few speakers appear in the section of the voice signal S1, more acoustic models M are then updated by the adaptive processing, so the accuracy of speech recognition by the speech recognition unit 26 can be improved. The voice signal S1 does not necessarily have to be part of the voice signal S (S2). That is, a configuration in which the voice signal S1 and the voice signal S2 are stored in the storage device 30 as separate files may also be adopted.
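One way to pick a section with many speakers is a sliding window over speaker-labeled utterance sections, sketched below. The sketch assumes that each utterance section already carries a start time, an end time, and a speaker identifier, and that the window length and step are given; these inputs and the name `best_adaptation_window` are hypothetical, and the specification does not prescribe this search.

```python
from typing import List, Tuple

Segment = Tuple[float, float, int]  # (start time, end time, speaker identifier)


def best_adaptation_window(segments: List[Segment], total_len: float,
                           win_len: float, step: float) -> Tuple[float, int]:
    """Return (start time, speaker count) of the window with most speakers.

    Ties are resolved in favor of the earliest window.
    """
    n_windows = max(0, int((total_len - win_len) // step)) + 1
    best_start, best_count = 0.0, -1
    for k in range(n_windows):
        start = k * step
        end = start + win_len
        # A speaker counts if any of their utterance sections overlaps the window.
        speakers = {spk for (s, e, spk) in segments if s < end and e > start}
        if len(speakers) > best_count:
            best_start, best_count = start, len(speakers)
    return best_start, best_count
```

The returned start time, together with the window length, would delimit the section used as the adaptation voice signal S1 in this sketch.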

(4) Modification 4
In each of the above embodiments, the direction j of the speaker relative to the voice input device 60 is specified, but a configuration in which the position specifying unit 14 specifies the position of the speaker is also suitable. A configuration may also be adopted in which a position specifying unit 14 that specifies the direction j from the adaptation voice signal S1 and a position specifying unit 14 that specifies the direction j from the recognition voice signal S2 are provided separately, or in which a speaker identification unit 16 that specifies the identifier i from the voice signal S1 and a speaker identification unit 16 that specifies the identifier i from the voice signal S2 are provided separately. However, the above embodiments, in which the position specifying unit 14 and the speaker identification unit 16 are shared between the adaptive processing and the recognition processing, have the advantage of simplifying the configuration and functions of the control device 10 (the contents of the program executed by the control device 10).

A configuration may also be adopted in which the user inputs the direction j and the identifier i from the input device 42 for each utterance section of the voice signal S1 during the adaptive processing, or for each utterance section of the voice signal S2 during the recognition processing. The position specifying unit 14 and the speaker identification unit 16 are therefore not essential to the present invention. However, the above embodiments, in which the control device 10 (the position specifying unit 14 and the speaker identification unit 16) specifies the direction j and the identifier i from the voice signal S, have the advantage of reducing the user's workload.

(5) Modification 5
A configuration in which a plurality (N × K) of acoustic models M corresponding to the identifiers i and the directions j are stored in the storage device 30 before the adaptive processing is not essential to the present invention. For example, besides the above embodiments, in which an acoustic model M stored in advance in the storage device 30 is updated to the acoustic model M[j,i] generated by the adaptive model generation unit 24, a configuration may be adopted in which the acoustic model M[j,i] generated by the adaptive model generation unit 24 is newly stored in the storage device 30. That is, it suffices that the adaptive model generation unit 24 generates an acoustic model M corresponding to a combination of an identifier i and a direction j; whether that acoustic model M is used to update an existing acoustic model M or is newly stored in the storage device 30 does not matter for the present invention.

FIG. 1 is a block diagram showing the configuration of the voice processing apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the voice processing apparatus according to the second embodiment of the present invention.

Explanation of symbols

100 ... voice processing apparatus, 10 ... control device, 12 ... section specifying unit, 14 ... position specifying unit, 16 ... speaker identification unit, 22 ... selection unit, 24 ... adaptive model generation unit, 26 ... speech recognition unit, 30 ... storage device, 42 ... input device, 44 ... output device, 46 ... sound emitting device, S ... voice signal, S1 ... voice signal for adaptation, S2 ... voice signal for recognition, M (M[j,i]) ... acoustic model, G (G1 to GN) ... acoustic model group.

Claims

A voice processing apparatus comprising:
a storage device that stores a plurality of acoustic models, each corresponding to a combination of a speaker and a position of the speaker;
position specifying means for specifying the position of each speaker from an adaptation voice signal;
speaker identification means for distinguishing the speaker of the voice represented by the adaptation voice signal;
input means with which a user inputs characters corresponding to the voice represented by the adaptation voice signal;
adaptive model generation means for generating an acoustic model for each combination of speaker and position and storing it in the storage device, by adaptive processing that updates, among the plurality of acoustic models stored in the storage device, each acoustic model corresponding to a combination of a speaker distinguished by the speaker identification means and a position of that speaker specified by the position specifying means, based on the adaptation voice signal and the characters input with the input means, and for updating each acoustic model, among the plurality of acoustic models stored in the storage device, that has not been updated by the adaptive processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that acoustic model and to different speakers and that include an acoustic model updated by the adaptive processing; and
voice recognition means for specifying characters corresponding to the voice represented by a recognition voice signal, based on the acoustic model that, among the plurality of acoustic models processed by the adaptive model generation means, corresponds to the combination of the speaker of the voice represented by the recognition voice signal and the position of that speaker.

The voice processing apparatus according to claim 1, wherein the adaptive model generation means:
operates in either a direction priority mode or a speaker priority mode;
in the direction priority mode, updates each acoustic model, among the plurality of acoustic models stored in the storage device, that has not been updated by the adaptive processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that acoustic model and to different speakers and that include an acoustic model updated by the adaptive processing; and
in the speaker priority mode, updates each acoustic model, among the plurality of acoustic models stored in the storage device, that has not been updated by the adaptive processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same speaker as that acoustic model and to different positions and that include an acoustic model updated by the adaptive processing.

A program that causes a computer having a storage device that stores a plurality of acoustic models, each corresponding to a combination of a speaker and a position of the speaker, to execute:
position specifying processing for specifying the position of each speaker from an adaptation voice signal;
speaker identification processing for distinguishing the speaker of the voice represented by the adaptation voice signal;
input processing in which a user inputs characters corresponding to the voice represented by the adaptation voice signal;
adaptive model generation processing for generating an acoustic model for each combination of speaker and position and storing it in the storage device, by adaptive processing that updates, among the plurality of acoustic models stored in the storage device, each acoustic model corresponding to a combination of a speaker distinguished in the speaker identification processing and a position of that speaker specified in the position specifying processing, based on the adaptation voice signal and the characters input in the input processing, and for updating each acoustic model, among the plurality of acoustic models stored in the storage device, that has not been updated by the adaptive processing to an acoustic model obtained by averaging two or more acoustic models that correspond to the same position as that acoustic model and to different speakers and that include an acoustic model updated by the adaptive processing; and
voice recognition processing for specifying characters corresponding to the voice represented by a recognition voice signal, based on the acoustic model that, among the plurality of acoustic models after the adaptive model generation processing, corresponds to the combination of the speaker of the voice represented by the recognition voice signal and the position of that speaker.