JP2008250059A - Voice recognition device, voice recognition system and voice recognition method

Info

Publication number: JP2008250059A
Application number: JP2007092414A
Authority: JP
Inventors: Takatoshi Sanehiro; 貴敏實廣; Futoshi Naya; 太納谷; Haruo Noma; 春生野間; Kiyoshi Kogure; 潔小暮; Tomoji Toriyama; 朋二鳥山; Tadashi Omura; 廉大村; Masaya Okada; 昌也岡田; Masakazu Miyamae; 雅一宮前
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2007-03-30
Filing date: 2007-03-30
Publication date: 2008-10-16

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, a voice recognition system and a voice recognition method that correctly recognize input voice containing noise.

SOLUTION: The voice recognition system includes a server. A database in the server stores a plurality of motion noise models, each generated for one of a plurality of motions, and a plurality of environment noise models, each generated for one of a plurality of places. A plurality of repeaters are connected to the server, and a mobile terminal is communicably connected to the repeaters. The mobile terminal acquires input voice, detects the behavior of a subject, and transmits voice data on the input voice and acceleration data on the behavior to a repeater. The repeater attaches its repeater ID to the voice data and the acceleration data and transmits them to the server. The server identifies the behavior of the subject and estimates the subject's current position. The voice of the subject included in the input voice is then recognized by using the motion noise model corresponding to the behavior and the environment noise model corresponding to the current position of the subject.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech recognition device, a speech recognition system, and a speech recognition method, and more particularly to a speech recognition device, system, and method that recognize a subject's speech from input speech containing noise.

An example of a conventional speech recognition device of this type is disclosed in Patent Document 1. The technique of Patent Document 1 discriminates between speech-section signals and noise-section signals in an input speech signal and learns a noise model from the observed noise-section signal. It then combines a noise-free speech model prepared in advance with the noise model to generate a noise-superimposed speech model. In addition, the noise-section signal is superimposed on a reference signal prepared in advance, and the long-term average of the resulting feature parameters is computed. These operations are performed before a speech-section signal is input. When a speech-section signal is input, the long-term average of its feature parameters is computed, and the difference between this average and the long-term average of the feature parameters of the reference signal with the noise-section signal superimposed is obtained. This difference is added to the noise-superimposed speech model to yield a CMN-processed noise-superimposed speech model. Finally, the model matching likelihood between the CMN-processed noise-superimposed speech model and the feature parameters of the speech-section signal is calculated, and a recognition result is output.
Patent Document 1: JP 2006-145694 A

However, the technique of Patent Document 1 learns the noise model from noise sections in the input speech signal, and because it is difficult to distinguish speech sections from noise sections accurately, the noise model may not be created appropriately. In addition, when the noise section is short, less data is available for estimating the noise model, and the reliability of the noise model decreases. The technique of Patent Document 1 may therefore fail to perform speech recognition appropriately.

Therefore, a main object of the present invention is to provide a novel speech recognition device, speech recognition system, and speech recognition method.

Another object of the present invention is to provide a speech recognition device, a speech recognition system, and a speech recognition method that can recognize speech accurately even when the input speech contains noise.

The present invention employs the following configurations to solve the above problems. The reference numerals in parentheses and the supplementary explanations indicate correspondence with the embodiments described later in order to aid understanding of the present invention, and do not limit the present invention in any way.

The invention of claim 1 is a speech recognition device comprising: motion noise model storage means for storing a plurality of motion noise models, each created for one of a plurality of motions, in association with the respective motions; input speech detection means for detecting input speech including the speech of a subject; motion specifying means for specifying the motion of the subject; motion noise model reading means for reading, from the motion noise model storage means, the motion noise model corresponding to the motion specified by the motion specifying means; and recognition means for recognizing the subject's speech included in the input speech detected by the speech detection means, using the motion noise model read by the motion noise model reading means.

In the invention of claim 1, the speech recognition device (12, 18) comprises motion noise model storage means (48), input speech detection means (38, S1), motion specifying means (S5), motion noise model reading means (S9), and recognition means (S13). The motion noise model storage means stores a plurality of motion noise models, each created for one of a plurality of motions, in association with the respective motions. A motion noise model is, for example, a noise model of noise caused by a motion, such as the sound of clothing rubbing or the sound of an instrument used in the motion, when the subject performs a predetermined motion. The input speech detection means detects input speech including the subject's speech, and the motion specifying means specifies the subject's motion. The motion noise model reading means reads the motion noise model corresponding to the subject's motion from the motion noise model storage means. The recognition means performs noise suppression processing using the read motion noise model and recognizes the subject's speech included in the input speech. As noise suppression processing using a noise model, for example, the PMC (Parallel Model Combination) method or noise suppression processing based on a GMM (Gaussian Mixture Model) can be used.

According to the invention of claim 1, motion noise models corresponding to a plurality of motions are prepared in advance, and speech recognition is performed by selecting the motion noise model that matches the subject's motion. Noise caused by the motion and contained in the input speech can therefore be suppressed appropriately, and speech can be recognized accurately.
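The storage and reading means described for claim 1 amount to a lookup keyed by the identified motion. The following is a minimal sketch under stated assumptions: the motion labels, the `NoiseModel` placeholder, and the function name are hypothetical and do not appear in the patent, and the actual model contents (for example GMM parameters) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseModel:
    """Placeholder for a stored noise model (contents are not specified here)."""
    name: str


# Hypothetical motion labels; the patent does not list concrete entries.
MOTION_NOISE_MODELS = {
    "walking": NoiseModel("walking_noise"),
    "bed_making": NoiseModel("bed_making_noise"),
}


def read_motion_noise_model(motion, store=MOTION_NOISE_MODELS):
    """Return the noise model stored for the identified motion, or None if absent."""
    return store.get(motion)
```

Returning `None` for an unknown motion is a design choice of this sketch; a recognizer could then fall back to recognition without motion noise compensation.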

The invention of claim 2 depends on the invention of claim 1 and further comprises: place noise model storage means for storing a plurality of place noise models, each created for one of a plurality of places, in association with the respective places; place specifying means for specifying the place where the subject is present; and place noise model reading means for reading, from the place noise model storage means, the place noise model corresponding to the place specified by the place specifying means. The recognition means recognizes the subject's speech included in the input speech detected by the speech detection means, using the motion noise model read by the motion noise model reading means and the place noise model read by the place noise model reading means.

In the invention of claim 2, the speech recognition device further comprises place noise model storage means (48), place specifying means (S7), and place noise model reading means (S9). The place noise model storage means stores a plurality of noise models, created from noise data corresponding to the noise collected at each of a plurality of places, in association with the respective places. For example, noise such as conversation arises in a place (room) where many people gather, and operating sounds arise in a room where electronic equipment is installed. Noise models of such ambient or environmental noise are stored. The place specifying means specifies the place where the subject is present, that is, the subject's current position. The place noise model reading means reads the place noise model corresponding to the specified current position from the place noise model storage means. The recognition means therefore suppresses noise using both the motion noise model and the place noise model, and recognizes the subject's speech included in the input speech.

According to the invention of claim 2, environmental noise is suppressed in addition to motion noise, so speech can be recognized more accurately.

The invention of claim 3 depends on the invention of claim 1 or 2 and further comprises: estimation means for estimating the signal-to-noise ratio of the speech signal corresponding to the input speech detected by the input speech detection means; and adjustment means for adjusting the synthesis ratio of the noise model according to the signal-to-noise ratio estimated by the estimation means.

The invention of claim 3 further comprises estimation means (S3) and adjustment means (S11). The estimation means estimates the signal-to-noise ratio (SNR) of the input speech, that is, the ratio of the relative magnitudes of speech and noise. The adjustment means adjusts the synthesis ratio of the noise model (either the motion noise model alone, or both the motion noise model and the environment noise model) according to the signal-to-noise ratio. For example, when a speech model and a noise model are combined to create a noise-superimposed model, the synthesis ratio is adjusted according to the estimated SNR.

According to the invention of claim 3, model synthesis takes the SNR of the input speech into account, so speech can be recognized more accurately.
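The text does not give a formula for the SNR-dependent synthesis ratio. As one possible illustration only, the sketch below uses the log-add approximation often associated with PMC: per-band log power means of the speech and noise models are combined in the linear power domain, with the noise scaled by a gain chosen so that the total speech-to-noise power ratio matches the estimated SNR. The function names, the per-band representation, and the way the gain is derived are all assumptions of this sketch, not the patent's method.

```python
import math


def estimate_snr_db(speech_power, noise_power):
    """SNR in decibels from (assumed known) speech and noise power estimates."""
    return 10.0 * math.log10(speech_power / noise_power)


def combine_log_spectra(speech_log, noise_log, snr_db):
    """Combine per-band log power means of a speech model and a noise model.

    The noise is scaled by a gain so that total speech power divided by
    total scaled noise power equals the target SNR (assumption of this sketch).
    """
    speech_lin = [math.exp(v) for v in speech_log]
    noise_lin = [math.exp(v) for v in noise_log]
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = sum(speech_lin) / (target_ratio * sum(noise_lin))
    # Log-add approximation: add powers in the linear domain, return to log.
    return [math.log(s + gain * n) for s, n in zip(speech_lin, noise_lin)]
```

With a high estimated SNR the gain becomes small and the combined model stays close to the clean speech model; with a low SNR the noise contribution dominates.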

The invention of claim 4 depends on the invention of claim 2 or 3, wherein the place specifying means detects identification information emitted by a repeater installed in the environment and specifies the installation place of the repeater that emitted the identification information as the place where the subject is present.

In the invention of claim 4, repeaters (16) are installed in the surroundings or environment, that is, at a plurality of places, and the place specifying means (S5) detects the identification information (repeater ID) emitted by a repeater within communication range. The place where that repeater is installed is then specified from the repeater ID, and the specified place is estimated (specified) as the place where the subject is present (the current position).

According to the invention of claim 4, the subject's current position can be specified easily, and speech can be recognized accurately using the place noise model corresponding to the specified place.

The invention of claim 5 is a speech recognition system comprising a portable terminal and a speech recognition device. The portable terminal comprises input speech detection means for detecting input speech including the speech of a subject, and transmission means for transmitting a speech signal of the input speech detected by the input speech detection means to the speech recognition device. The speech recognition device comprises: reception means for receiving the speech signal transmitted by the transmission means; motion noise model storage means for storing a plurality of motion noise models, each created for one of a plurality of motions, in association with the respective motions; motion specifying means for specifying the motion of the subject; motion noise model reading means for reading, from the motion noise model storage means, the motion noise model corresponding to the motion specified by the motion specifying means; and recognition means for recognizing the subject's speech included in the input speech detected by the speech detection means, using the motion noise model read by the motion noise model reading means.

In the invention of claim 5, the speech recognition system (10) comprises a portable terminal (18) and a speech recognition device (12). For example, the speech recognition device recognizes and records what a subject such as a nurse utters during work (in the embodiment, the content of the work). The portable terminal is carried or worn by the subject, and its transmission means (28) transmits to the speech recognition device a speech signal of the input speech, including the subject's speech, detected by the input speech detection means (38). The speech recognition device comprises reception means (S1), motion noise model storage means (48), motion specifying means (S5), motion noise model reading means (S9), and recognition means (S13). The reception means receives the speech signal transmitted from the portable terminal. The motion noise model storage means stores, for each motion, the noise caused by motions in nursing work (such as the sound of clothing rubbing). The motion specifying means specifies the subject's motion. The motion noise model reading means reads the motion noise model corresponding to the subject's motion from the motion noise model storage means. The recognition means performs noise suppression processing using the motion noise model corresponding to the subject's motion and recognizes the subject's speech included in the speech signal received by the reception means (the speech signal of the input speech).

According to the invention of claim 5, as in the invention of claim 1, noise models corresponding to a plurality of places are prepared in advance, and speech recognition is performed by selecting the noise model corresponding to the place where the subject is present. Noise can therefore be suppressed appropriately, and speech can be recognized accurately.

The invention of claim 6 depends on the invention of claim 5, wherein the speech recognition device further comprises: place noise model storage means for storing a plurality of place noise models, each created for one of a plurality of places, in association with the respective places; place specifying means for specifying the place where the subject is present; and place noise model reading means for reading, from the place noise model storage means, the place noise model corresponding to the place specified by the place specifying means. The recognition means recognizes the subject's speech included in the input speech detected by the speech detection means, using the motion noise model read by the motion noise model reading means and the place noise model read by the place noise model reading means.

In the invention of claim 6, the speech recognition device further comprises place noise model storage means (48), place specifying means (S7), and place noise model reading means (S9). The place noise model storage means stores a plurality of noise models, created from noise data corresponding to the noise collected at each of a plurality of places, in association with the respective places. The place specifying means specifies the place where the subject is present, that is, the subject's current position. The place noise model reading means reads the place noise model corresponding to the specified current position from the place noise model storage means. The recognition means therefore suppresses noise using both the motion noise model and the place noise model, and recognizes the subject's speech included in the input speech.

According to the invention of claim 6, environmental noise is suppressed in addition to motion noise, so speech can be recognized more accurately.

The invention of claim 7 depends on the invention of claim 6 and further comprises a plurality of repeaters that are arranged corresponding to each of the plurality of places and relay communication between the portable terminal and the server. Each repeater receives the speech signal transmitted from the portable terminal, adds its own identification information to the received speech signal, and transmits the result to the server. The place specifying means specifies, based on the identification information added to the speech signal received by the reception means, the installation place of the repeater that transmitted the speech signal as the place where the subject is present.

The invention of claim 7 further comprises a plurality of repeaters (16) arranged corresponding to each of the plurality of places. A repeater receives the speech signal of the input speech, including the subject's speech, transmitted from a portable terminal (18) within communication range, adds its own identification information (repeater ID) to the received speech signal, and transmits the result to the speech recognition device (12). The place specifying means (S7) specifies the installation place of the repeater based on the repeater ID, and specifies that installation place as the place where the portable terminal is present, that is, the subject's current position.

According to the invention of claim 7, the subject's current position can be specified easily, and speech can be recognized accurately by selecting the place noise model corresponding to the subject's current position.
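The repeater-based localization described for claim 7 can be pictured as two small steps: the repeater tags the forwarded data with its own ID, and the server maps that ID to an installation place and then to a place noise model. The sketch below is illustrative only; the repeater IDs, place names, model names, and the dictionary-based packet layout are hypothetical and are not taken from the patent.

```python
# Hypothetical tables: the patent gives no concrete IDs, places, or model contents.
REPEATER_PLACES = {1: "nurse_station", 2: "ward_entrance", 3: "corridor"}
PLACE_NOISE_MODELS = {
    "nurse_station": "conversation_noise",
    "ward_entrance": "door_noise",
    "corridor": "footstep_noise",
}


def relay(packet, repeater_id):
    """Repeater side: attach the repeater's own ID before forwarding to the server."""
    return {**packet, "repeater_id": repeater_id}


def place_noise_model_for(packet):
    """Server side: map the attached repeater ID to a place and its noise model."""
    place = REPEATER_PLACES.get(packet.get("repeater_id"))
    return place, PLACE_NOISE_MODELS.get(place)
```

For example, `place_noise_model_for(relay({"audio": b"..."}, 3))` yields the corridor entry of these made-up tables, while a packet without a known repeater ID yields `(None, None)`.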

The invention of claim 8 is a speech recognition method for a computer comprising motion noise model storage means for storing a plurality of motion noise models, each created for one of a plurality of motions, in association with the respective motions, the method comprising: (a) detecting input speech including the speech of a subject; (b) specifying the motion of the subject; (c) reading, from the motion noise model storage means, the motion noise model corresponding to the motion specified in step (b); and (d) recognizing the subject's speech included in the input speech detected in step (a), using the motion noise model read in step (c).

The invention of claim 8, like the speech recognition device of claim 1, also enables accurate speech recognition.

According to the present invention, motion noise models corresponding to a plurality of motions are prepared in advance, and speech recognition is performed using the motion noise model corresponding to the subject's motion. Noise caused by the motion and contained in the input speech can therefore be suppressed appropriately, and speech can be recognized accurately even when the input speech contains noise.

The above object, other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments given with reference to the drawings.

Referring to FIG. 1, a speech recognition system 10 according to an embodiment of the present invention includes a server 12 that also functions as a speech recognition device. The system is applied to an organization such as a hospital, and recognizes and records what a subject such as a nurse utters during work (for example, the content of the work).

The server 12 is connected to a plurality of repeaters 16 via a wired or wireless communication line (network) 14. Each of the repeaters 16 is placed at a predetermined position where nurses perform their work or duties, for example at the entrance of a hospital room, at or near a bed in a hospital room, in a corridor, or at a nurse station. A portable terminal 18 is connected to each of the repeaters 16 so that wireless communication is possible. The portable terminal 18 is carried (worn) by a nurse, and data transmitted from the portable terminal 18 is sent to the server 12 via a repeater 16 located within wireless communication range (for example, a radius of 1 to 3 meters).

Although FIG. 1 shows one portable terminal 18, the speech recognition system 10 may include a plurality of portable terminals 18, each assigned to one of a plurality of nurses. The portable terminal 18 may also be connected directly to the network 14 by a wireless LAN or the like.

FIG. 2 is a block diagram showing a specific configuration of the portable terminal 18, which includes a CPU 20. Connected to the CPU 20 are a memory 22, an encoder 24, a non-contact sensor 26, an interface 28, a timer 30, a DIP switch 32, a wireless transmitter 34, a wireless receiver 36, and a plurality of acceleration sensors 40a, 40b, 40c, 40d, 40e, and 40f (hereinafter sometimes referred to collectively as "acceleration sensors 40").

The memory 22 serves as a work memory or buffer memory and is used by the CPU 20. A headset microphone 38 is connected to the encoder 24, and the encoder 24 encodes the speech signal of the input speech from the headset microphone 38 into compressed audio data such as MP3. The compressed audio data is stored in the memory 22 in accordance with instructions from the CPU 20. The compressed audio data stored in the memory 22 is transmitted, in accordance with instructions from the CPU 20, to the server 12 via a repeater 16 and the network 14 at regular intervals (for example, every 10 to 30 seconds).

The speech signal is compressed in order to keep the required capacity of the memory 22 relatively small and to reduce the amount of data transmitted to the server 12.

The headset microphone 38 used in this embodiment is directional. This is so that input speech is detected with ambient noise (noise caused by the nurse's motion and environmental noise) removed as far as possible beforehand, which improves speech recognition accuracy. A headset microphone 38 is used because both of a nurse's hands are often occupied during work, and also to minimize the number of items other than work tools that the nurse has to hold. However, instead of the headset microphone 38, a directional lapel microphone may be worn, for example at the collar.

A pyroelectric sensor can be used as the non-contact sensor 26, and the CPU 20 turns the headset microphone 38 on and off in response to input from the non-contact sensor 26. In this embodiment, when the nurse moves a hand up and down twice in front of the non-contact sensor 26, that is, the pyroelectric sensor, the detection signal is input to the CPU 20, and the CPU 20 turns the headset microphone 38 on in response. When the nurse subsequently moves a hand up and down twice in front of the pyroelectric sensor again, the CPU 20 turns the headset microphone 38 off. The headset microphone 38 can be switched on and off in this way to protect the nurse's privacy. That is, the headset microphone 38 is turned on when the content of the work needs to be recognized and recorded, and is turned off when there is no need to record it, such as during a break.

The interface 28 is an interface such as a LAN (wireless LAN) adapter, through which the portable terminal 18 is connected to the network 14. The portable terminal 18 can therefore communicate with the server 12 via the network 14.

The timer 30 is a circuit that keeps the date and time, and the CPU 20 acquires time data from the timer 30. The DIP switch 32 consists of, for example, 8 bits, and a value between 0 and 255 can be set by switching each bit on or off. This value is the nurse's identification information (nurse ID), and a different value is set on each portable terminal 18. The CPU 20 attaches the time data and the nurse ID as labels to the audio data to be transmitted and sends it to the repeater 16. That is, the audio data, the time data, and data (numerical data) on the nurse ID are transmitted from the portable terminal 18 to the repeater 16.
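As a small worked example of the 8-bit DIP switch setting and the labeling step, the sketch below converts eight on/off positions into a value from 0 to 255 and bundles it with time data and audio data. The bit order (most significant bit first), the function names, and the dictionary layout of the transmitted data are assumptions made for illustration; the patent does not specify them.

```python
def dip_switch_value(bits):
    """Convert 8 on/off switch positions into an integer in the range 0..255.

    Assumes the first element is the most significant bit.
    """
    if len(bits) != 8:
        raise ValueError("expected exactly 8 switch positions")
    value = 0
    for bit in bits:
        value = (value << 1) | int(bool(bit))
    return value


def label_audio(audio, timestamp, nurse_id):
    """Bundle audio data with its time data and nurse ID labels (hypothetical layout)."""
    return {"audio": audio, "time": timestamp, "nurse_id": nurse_id}
```

With this bit order, the setting off-off-off-off-off-on-off-on corresponds to nurse ID 5, and all switches on corresponds to 255.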

In this embodiment, the nurse ID is set using the DIP switch 32, but the invention is not limited to this. For example, a ROM or the like storing the nurse ID may be provided in place of the DIP switch 32.

In accordance with instructions from the CPU 20, the wireless transmitter 34 transmits the above-described voice data, time data, and nurse ID data (hereinafter sometimes referred to collectively as transmission data) to the repeater 16. The wireless receiver 36 receives the weak radio waves emitted by a repeater 16 within wireless communication range and demodulates the repeater ID, and the data on the demodulated repeater ID is processed by the CPU 20.

Each of the acceleration sensors 40 is, for example, a multi-axis (3-axis) acceleration sensor and is used to detect the motion of the nurse who carries or wears the portable terminal 18. In this embodiment, acceleration is detected at the head, both hands, the waist (or abdomen), and both feet, and the detected acceleration data is compared with the per-motion acceleration data stored in advance in the motion DB 46 (see FIG. 5), described later, to identify a single motion. For example, the degree of similarity between the detected acceleration data and the acceleration data stored in the motion DB 46 can easily be obtained by well-known DP matching.
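DP matching between two acceleration sequences can be sketched as below. This is a generic dynamic-programming (dynamic time warping) distance, given as an illustration under the assumption that each sample is a tuple of axis values; the local cost, the allowed path steps, and the normalization are choices made here, not details taken from the embodiment. A smaller distance corresponds to a higher similarity.

```python
import math


def dp_matching_distance(a, b):
    """Length-normalized DP matching distance between two sample sequences.

    a, b: non-empty sequences of equal-dimension tuples, e.g. (x, y, z).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # sample of a repeated against b
                cost[i][j - 1],      # sample of b repeated against a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m] / (n + m)
```

Because the alignment path may repeat samples, a motion performed more slowly than the stored template still yields a small distance.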

The portable terminal 18 configured as described above is carried or worn by a subject such as a nurse. For example, as shown in FIG. 3, the circuit components other than the non-contact sensor 26, the headset microphone 38, and the acceleration sensors 40 are housed in a box (housing) 60, and the box 60 is placed in, for example, a front pocket of the nurse's white coat. The non-contact sensor 26 is housed in a pen-shaped case and is carried clipped into the breast pocket of the nurse's white coat. In the drawing, the box 60 and the non-contact sensor 26 are shown outside the pockets for clarity. The headset microphone 38 is worn on the nurse's head.

As described above, the acceleration sensors 40 are attached or fixed to predetermined parts of the nurse's body. For example, as shown in FIG. 3, the acceleration sensor 40a is attached to the nurse's head, the acceleration sensor 40b to the right wrist, the acceleration sensor 40c to the left wrist, the acceleration sensor 40d to the waist (or abdomen), the acceleration sensor 40e to the right ankle, and the acceleration sensor 40f to the left ankle.

Although omitted from FIG. 3, the non-contact sensor 26 is connected by a connecting wire to the CPU 20 in the box 60, the headset microphone 38 is electrically connected by a connecting wire to the encoder 24 in the box 60, and the acceleration sensors 40 are connected by connecting wires to the CPU 20 in the box 60. However, short-range wireless such as Bluetooth (registered trademark) may be used instead of connecting wires. In short, it suffices that the components are electrically connected.

As described above, the speech recognition system 10 recognizes and records what a nurse or other user says during work. When speech recognition is performed, noise suppression processing is applied as appropriate.

As the noise suppression processing, the PMC (Parallel Model Combination) method can be used, for example. In the PMC method, a noise-superimposed speech model is estimated (created) by combining a speech model and a noise model, and this noise-superimposed speech model is matched against the input speech. This allows even input speech containing noise to be recognized accurately. The PMC method can handle noisy input speech without requiring an actual noise-superimposed speech model. In the commonly used estimation method called the Log-Add approximation, the mean vector μ_x of the noise-superimposed speech model can be estimated as shown in Equation 1.

Here, μ_s and μ_n denote the mean vectors of the log spectral energy of the speech model and the noise model, respectively. The symbol "^" denotes an estimated value. The same applies hereinafter.
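For reference, the Log-Add approximation is commonly written in the following form, with the exponential and logarithm applied element-wise to the log spectral mean vectors. This is the standard formulation, given here as a sketch on the assumption that Equation 1 follows it.

```latex
\hat{\mu}_x = \log\bigl(\exp(\mu_s) + \exp(\mu_n)\bigr)
```

In words, the speech and noise means are converted to the linear spectral domain, added there, and converted back to the log domain.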

The PMC method is disclosed in detail in M. J. F. Gales, "Model-Based Techniques for Noise Robust Speech Recognition," Ph.D. Thesis, Cambridge University, 1995, and in M. J. F. Gales and S. J. Young, "A fast and flexible implementation of parallel model combination," Proc. of ICASSP, pp. 133-136, 1995, to which reference should be made.

Noise suppression processing based on a GMM (Gaussian Mixture Model) can also be used. Assuming that speech and noise are uncorrelated, and letting X(i), S(i), and N(i) be the log-value vectors of the mel filter bank outputs of the noise-superimposed speech (input speech), the clean speech, and the noise in frame i, respectively, their relationship can be expressed as in Equation 2.

Here, g(S(i), N(i)) is the mismatch function. The auxiliary function f_b for the b-th bank of the filter bank is defined by Equation 3.
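Under the stated assumption that speech and noise are uncorrelated, their filter bank outputs add in the linear domain, which leads to a mismatch function of the following standard form. This is a sketch assuming the usual log-domain formulation, with the exponential and logarithm applied element-wise; the original Equations 2 and 3 may be written differently.

```latex
X(i) = \log\bigl(\exp S(i) + \exp N(i)\bigr)
     = S(i) + \underbrace{\log\bigl(1 + \exp\bigl(N(i) - S(i)\bigr)\bigr)}_{g(S(i),\,N(i))}
```

When the noise is far below the speech in a given band, g approaches zero and the noisy observation approaches the clean speech.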

Here, s_b and n_b are the mel filter bank outputs of the clean speech and the noise. The mean and variance can be estimated by applying a first-order Taylor expansion to Equation 2. When the clean speech is represented by a K-mixture Gaussian model as in Equation 4 and the noise signal by a single Gaussian distribution N(μ_n, Σ_n), the mean and variance can be approximately estimated as in Equations 5 and 6, respectively.

In practice, little data is often available when estimating the noise variance. In addition, the gain in recognition accuracy from variance estimation is small compared with that from mean estimation, so the variance of the speech model is taken to be Σ_{x,k}(b,b) ≈ Σ_{s,k}(b,b). The clean speech can then be expressed as in Equation 7.

Thus, in GMM-based noise suppression processing, the noise model and the speech model are used to estimate only the speech from the input speech for each analysis frame. The estimated speech is then matched against the speech model. This allows even input speech containing noise to be recognized accurately.

GMM-based noise suppression processing is disclosed in detail in J. C. Segura, A. de la Torre, M. C. Benitez, and A. M. Peinado, "Model-based compensation of the additive noise for continuous speech recognition. Experiments using AURORA II database and tasks," Proc. of Eurospeech '01, vol. I, pp. 221-224, 2001, to which reference should be made.

In noise suppression processing such as that described above, recognition accuracy differs greatly depending on the noise model used, so the choice of noise model matters. As described above, the server 12 recognizes the voice of a subject such as a nurse; when the nurse is performing work (nursing work), however, sounds are produced by clothing rubbing and by instruments (such as a sphygmomanometer or drip infusion equipment) being used, adjusted, or set up. Such noise accompanying (caused by) the motions of nursing work (operation noise) is input from the headset microphone 38 together with the nurse's voice.

In addition to such operation noise, noise from the surroundings (environment) is also input from the headset microphone 38. For example, in a hospital waiting room there are the voices of patients and others and the sound of a television (or radio), and in the nurses' station there are the voices of other nurses, the sound of nurse calls, and noise accompanying the motions of other nurses. Such noise caused by the environment (environmental noise) is likewise input from the headset microphone 38 together with the nurse's voice.

FIG. 4 shows specific examples of the average spectra of environmental noise recorded at several locations in a hospital. The environmental noise was recorded for about 10 minutes each at "beside the laundry room," "corridor beside a patient room," "elevator hall," "inside the nurses' station," and "stairs." The microphone used was a DPA 4060 miniature condenser microphone, and the recording device was an M-AUDIO MICROTRACK 24/96. The average power spectrum of the environmental noise at each location was obtained by recording the noise at a sampling frequency of 48 kHz with 16 bits, downsampling it to 16 kHz, applying a short-time Fourier transform with an analysis window length of 20 ms, and averaging over all frames of the recorded data.
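The averaging procedure described above can be sketched with NumPy as follows. The text specifies only the sampling rates, the 20 ms window length, and averaging over all frames; the windowed-sinc anti-aliasing filter, the Hann analysis window, and the use of non-overlapping frames are assumptions made for this illustration.

```python
import numpy as np


def downsample_by_3(x, taps=63):
    """Low-pass filter and keep every third sample (48 kHz -> 16 kHz).

    Assumption: a Hamming-windowed sinc filter with its cutoff at the new
    Nyquist frequency (1/6 cycle per input sample).
    """
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / 3.0) / 3.0 * np.hamming(taps)
    h /= h.sum()  # unity gain at DC
    return np.convolve(x, h, mode="same")[::3]


def average_power_spectrum(x48k, fs_out=16000, window_ms=20):
    """Return (frequencies in Hz, power spectrum averaged over all frames)."""
    x = downsample_by_3(np.asarray(x48k, dtype=float))
    frame = int(fs_out * window_ms / 1000)  # 320 samples for 20 ms at 16 kHz
    n_frames = len(x) // frame
    if n_frames == 0:
        raise ValueError("recording is shorter than one analysis window")
    frames = x[: n_frames * frame].reshape(n_frames, frame) * np.hanning(frame)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs_out)
    return freqs, power.mean(axis=0)
```

With a 20 ms window at 16 kHz, the resulting spectrum has a frequency resolution of 50 Hz.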

As shown in FIG. 4, the environmental noise "beside the laundry room" consisted mainly of the operating sound of a washing machine. The "corridor beside a patient room," "elevator hall," and "stairs" were basically quiet apart from occasional conversation, and the environmental noise there was low. The environmental noise "inside the nurses' station" contained, in addition to occasional conversation, operating sounds emitted by equipment, and a peak was observed near 500 Hz.

Although not illustrated, operation noise is recorded for each nursing task by capturing the noise actually produced when a nurse performs that task.

As can be seen from FIG. 4, the surrounding environmental noise differs from place to place, so when noise suppression processing or the like is performed for speech recognition, using a noise model specific to the place allows noise suppression to be carried out more accurately. The same holds for operation noise, for which a noise model specific to each nursing task can be used.

In this embodiment, therefore, the operation noise detected for each of a plurality of nursing tasks and the environmental noise observed at each of a plurality of places are modeled in advance, the operation noise models and the environmental noise models are stored in a memory (a database in this embodiment), and noise suppression processing is executed using them.

Specifically, as shown in the block diagram of FIG. 5, a plurality of databases (DBs) are connected to the server 12: a nurse DB 42, a repeater DB 44, a motion DB 46, a noise model DB 48, and a speech model DB 50. These DBs 42 to 50 are used to apply noise suppression processing to, and recognize, input speech containing a nurse's utterance (voice).

The nurse DB 42 stores information for identifying each nurse, such as the nurse's name, in association with the nurse's identification information (nurse ID). The repeater DB 44 stores the place where each repeater 16 is installed in association with the identification information of that repeater 16 (repeater ID). The server 12 can therefore identify the nurse or the nurse's name from the nurse ID, and the place where a repeater 16 is installed from the repeater ID.

Referring here to FIG. 1, the repeater 16 receives, as described above, transmission data sent from a portable terminal 18 within wireless communication range. It then attaches its own repeater ID to the received transmission data and sends it to the server 12 via the network 14. That is, the voice data received by the server 12 has a nurse ID and a repeater ID attached. The server 12 can therefore estimate (identify) the nurse who input the speech corresponding to the received voice data and the place where it was input (the current position).
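The relaying and lookup described above can be sketched as follows. The dictionaries standing in for the nurse DB 42 and the repeater DB 44, their contents, and the function names are all hypothetical and serve only to illustrate the flow of identifiers.

```python
# Hypothetical stand-ins for the nurse DB 42 and the repeater DB 44.
NURSE_DB = {5: "Nurse A", 6: "Nurse B"}
REPEATER_DB = {101: "nurses' station", 102: "corridor beside a patient room"}


def relay(transmission, repeater_id):
    """Repeater side: attach the repeater's own ID to the received data."""
    tagged = dict(transmission)
    tagged["repeater_id"] = repeater_id
    return tagged


def resolve(tagged):
    """Server side: look up the speaker and the current position.

    Returns (nurse name, place); an entry is None if the ID is unknown.
    """
    nurse = NURSE_DB.get(tagged["nurse_id"])
    place = REPEATER_DB.get(tagged["repeater_id"])
    return nurse, place
```

For instance, data labeled with nurse ID 5 and relayed by repeater 101 resolves to "Nurse A" at the "nurses' station" in this hypothetical setup.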

Because the repeater 16 and the portable terminal 18 can communicate with each other, the portable terminal 18 can also detect the identification information (repeater ID) of a repeater 16 within wireless communication range. Accordingly, instead of the repeater 16 sending transmission data with the repeater ID attached to the server 12, the portable terminal 18 can send transmission data with the repeater ID attached to the server 12. That is, the portable terminal 18 can acquire the repeater ID from a repeater 16 within communication range, attach the data on the repeater ID to the transmission data, and send it to the server 12 via the interface 28 and the network 14.

Returning to FIG. 5, the motion DB 46 stores, in association with the name or identification information of each nurse motion (nursing task), the acceleration data of the acceleration sensors 40 obtained when that nursing task is performed. For example, the acceleration data stored in the motion DB 46 is obtained by averaging the acceleration data detected when a number of nurses perform the nursing task (motion). However, because motions and actions performed by people, such as nursing work, differ between individuals, acceleration data may instead be acquired for each nurse and stored in the motion DB 46. This is expected to improve the accuracy with which a nurse's motion (nursing task) is identified.

The noise model DB 48 stores operation noise models and environmental noise models. Specifically, each operation noise model is created (estimated) from operation noise data, that is, recordings of the operation noise produced while the corresponding nursing task is being performed, and is stored in association with that nursing task. Each environmental noise model is created (estimated) from environmental noise data recorded at the corresponding place and is stored in association with that place.

In this embodiment, as described later, the nurse's current position is estimated from the installation position of the repeater 16, so the noise model DB 48 is assumed to store an environmental noise model for each place where a repeater 16 is installed.

The speech model DB 50 stores speech models (speech data) created solely from speech that contains no noise. For example, voice data input in a noise-free place by nurses who use the speech recognition system 10 is stored (recorded). When several nurses use the speech recognition system 10, speech recognition can be performed more appropriately by storing a speech model for each nurse and using the model corresponding to the speaking nurse at recognition time.

An example of using the speech recognition system 10 to record work content spoken by a nurse during work is as follows. When a nurse at the nurses' station is about to go and give patient A a drip, the nurse turns on the headset microphone 38, says "I am going to give patient A a drip," and then turns the headset microphone 38 off. The portable terminal 18 then applies digital conversion and compression (modulation) to the audio signal of the input speech containing the nurse's utterance (voice), generating compressed voice data corresponding to that audio signal. In addition, acceleration data from the acceleration sensors 40 is detected from the time the headset microphone 38 is turned on until it is turned off. For example, acceleration data for the motion of preparing the instruments and medication needed for the drip is detected.

Transmission data, in which the time data and the nurse ID are attached to the compressed voice data and the acceleration data, is sent to the repeater 16 within communication range, in this case the repeater 16 located at the nurses' station. The transmission data, to which the repeater 16 has attached its repeater ID, is then sent to the server 12 via the network 14.

When the server 12 receives the transmission data with the repeater ID attached, it identifies from the time data the time at which the speech corresponding to the voice data was input, and refers to the nurse DB 42 to identify from the nurse ID the nurse who input that speech. The server 12 also refers to the motion DB 46 and identifies the nurse's motion (nursing task) from the acceleration data contained in the transmission data; for example, "the motion of preparing the instruments and medication needed for a drip" is identified. The server 12 further refers to the repeater DB 44 to identify from the repeater ID the repeater 16 that sent (relayed) the transmission data, and the place where that repeater 16 is installed (here, the nurses' station) is identified as the nurse's current position. In addition, the server 12 estimates the signal-to-noise ratio (SNR) of the voice data.

Once the nurse's motion has been identified, the server 12 reads the operation noise model corresponding to that motion from the noise model DB 48. Once the nurse's current position has been identified, the server 12 reads the environmental noise model corresponding to that position from the noise model DB 48. Using the operation noise model and the environmental noise model thus read, the server 12 applies noise suppression processing to the audio signal corresponding to the voice data (the audio signal of the input speech), and the nurse's voice contained in the input speech is recognized.

During speech recognition, noise suppression processing using noise models, such as the PMC method or the GMM-based method described above, is performed. If the model combination takes the SNR of the input speech into account, speech recognition can be performed more accurately.

In this way, the nurse's voice, that is, the utterance "I am going to give patient A a drip," is recognized using noise models of the noise caused by the motion and the place at the time the utterance was made. The recognition result is output as text and stored, for example, in a memory in the server 12. Because the server 12 can also identify the time at which the utterance was made and the nurse (nurse name) who made it, the recognition result is stored together with that time and that nurse's name.

Text data stored in the server 12, such as the content of a nurse's utterances (that is, the nurse's work content), can be checked and retrieved as needed, for example by accessing the server 12 from a computer such as a personal computer assigned to the nurse.

Although a detailed description is omitted, when, for example, a nurse moves into a patient room and what the nurse says there is to be recognized and recorded, the operation noise model corresponding to the motion at that place (here, the patient room) and the environmental noise model corresponding to that place are selected (read out), as in the example above, and accurate speech recognition is performed using them.

The processing by which the speech recognition system 10 recognizes a nurse's voice from input speech containing noise is described below with reference to a flowchart. Specifically, the server 12 executes the overall processing in accordance with the flowchart shown in FIG. 6. Although the case where noise suppression processing is executed using the PMC method is described here, noise suppression processing can also be executed using a GMM.

As shown in FIG. 6, when the server 12 starts the overall processing, it acquires the input speech in step S1. That is, it acquires the voice data of the input speech transmitted via the repeater 16 from the portable terminal 18 worn by the nurse. Specifically, the server 12 receives transmission data with a repeater ID attached. This transmission data contains the voice data, the acceleration data, the nurse ID, and the time data. In the following step S3, the SNR of the acquired input speech is estimated; that is, the ratio of the relative magnitudes of speech and noise is estimated.

In the following step S5, the nurse's motion is identified. Specifically, the server 12 compares the acceleration data contained in the transmission data with the per-motion acceleration data stored in the motion DB 46, and identifies the motion corresponding to the most similar acceleration data as the nurse's motion.

Although not illustrated, if the similarities between the acceleration data contained in the transmission data and the acceleration data stored in the motion DB 46 are all below a certain threshold, it is determined that there is no corresponding motion. This is because speech input is not necessarily accompanied by a motion.
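The identification in step S5, including the threshold check, can be sketched as follows. The similarity measure shown is a simple placeholder chosen for this illustration (the embodiment mentions DP matching), and the template format, names, and threshold value are hypothetical.

```python
def inverse_distance_similarity(a, b):
    """Placeholder similarity in (0, 1] for equal-length 1-D sequences."""
    distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + distance)


def identify_motion(observed, templates, threshold,
                    similarity=inverse_distance_similarity):
    """Return the name of the most similar stored motion, or None.

    templates: dict mapping a motion name to its stored acceleration data.
    None is returned when every similarity is below the threshold, that is,
    when the speech input was not accompanied by a known motion.
    """
    best_name, best_score = None, float("-inf")
    for name, template in templates.items():
        score = similarity(observed, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Returning `None` lets the later model selection step fall back to the environmental noise model alone, which is one possible way to handle an utterance made without any accompanying motion.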

Next, in step S7, the nurse's current position is identified. Specifically, the server 12 refers to the repeater DB 44 and identifies, as the nurse's current position, the place recorded there as the installation location of the repeater 16 corresponding to the repeater ID acquired in step S1.

Next, in step S9, the noise models are determined from the identified motion and current position of the nurse. That is, the server 12 reads from the noise model DB 48 the operation noise model corresponding to the nurse's motion and the environmental noise model corresponding to the nurse's current position. Then, in step S11, a noise-superimposed speech model is created. That is, the server 12 combines the noise models determined in step S9 with the speech model stored in the speech model DB 50 to create the noise-superimposed speech model. When the noise-superimposed speech model is created, the combination ratio is adjusted on the basis of the SNR of the input speech estimated in step S3.
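Steps S9 and S11 can be sketched for mean vectors in the log spectral domain as follows. The embodiment does not specify how the two noise models are merged or how the SNR sets the combination ratio, so both choices below are assumptions: the noise means are added in the linear domain, and the noise is rescaled so that the speech-to-noise power ratio of the models matches the estimated SNR before a Log-Add style combination.

```python
import numpy as np


def combine_noise_means(mu_operation, mu_environment):
    """Merge operation and environmental noise means (log spectral domain).

    Assumption: the two noises are additive in the linear domain. If no
    motion was identified (None), only the environmental noise is used.
    """
    if mu_operation is None:
        return np.asarray(mu_environment, dtype=float)
    return np.logaddexp(mu_operation, mu_environment)


def noisy_speech_mean(mu_s, mu_n, snr_db=None):
    """Log-Add style estimate of the noise-superimposed speech mean.

    Assumption: when snr_db is given, the noise is scaled so that the total
    speech-to-noise power ratio of the models equals the estimated SNR.
    """
    mu_s = np.asarray(mu_s, dtype=float)
    mu_n = np.asarray(mu_n, dtype=float)
    if snr_db is not None:
        target_ratio = 10.0 ** (snr_db / 10.0)
        gain = np.exp(mu_s).sum() / (np.exp(mu_n).sum() * target_ratio)
        mu_n = mu_n + np.log(gain)
    return np.logaddexp(mu_s, mu_n)  # log(exp(mu_s) + exp(mu_n))
```

At a high estimated SNR the result stays close to the clean speech mean, while at a low SNR it is dominated by the selected noise models.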

In the following step S13, speech recognition is executed. That is, the noise-superimposed speech model is matched against the input speech, and the nurse's voice is recognized from the input speech. In step S15, the recognition result is output. For example, the server 12 outputs the recognition result to its internal memory, records it as text data, and ends the overall processing.

According to this embodiment, an operation noise model corresponding to each of a plurality of motions and an environmental noise model corresponding to each of a plurality of places are stored in advance, and speech recognition is executed using the noise models that correspond to the subject's motion and location. The noise contained in the input speech can therefore be suppressed appropriately, and even input speech containing noise can be recognized accurately.

In addition, according to the embodiment described above, the installation position of the repeater 16 that relayed the transmission data from the portable terminal 18 is identified as the place where the subject is located, so the subject's current position can be identified easily.

However, the method of identifying the current position of the nurse (subject) is not limited to this, and any appropriate method may be used. For example, when the speech recognition system 10 is applied to a subject who is outdoors, the subject's current position can be detected using known GPS.

When the subject is indoors, as in the embodiment described above, the subject's current position can also be specified by a passage sensor or the like. For example, a tag that transmits the subject's own identification information (a wireless tag, an infrared LED tag, or the like) is attached to the subject, and tag readers that receive the identification information from the tag are installed at appropriate places such as the entrances of hospital rooms and the ceilings of corridors. In this case, the server 12 detects the current position of each subject by tracking, for example, the subject's entries into and exits from hospital rooms, and manages the current-position information in association with the subject's identification information. Then, when speech is input by a subject, the current position of that subject is acquired.
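The tag-based position management described above can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiment: the class name `PositionTracker`, the reader IDs, the subject IDs, and the place names are all hypothetical, and the sketch simply assumes that the last reader to receive a subject's tag indicates that subject's current position.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PositionTracker:
    """Keeps the most recent place reported for each subject's tag."""

    # Hypothetical mapping: tag reader ID -> place where the reader is installed.
    reader_places: Dict[str, str]
    # Subject ID -> last place at which the subject's tag was read.
    current: Dict[str, str] = field(default_factory=dict)

    def on_tag_read(self, reader_id: str, subject_id: str) -> None:
        """Record that a reader received the identification information of a tag."""
        place = self.reader_places.get(reader_id)
        if place is not None:  # ignore readers that are not registered
            self.current[subject_id] = place

    def position_of(self, subject_id: str) -> Optional[str]:
        """Return the last known place of the subject, or None if never seen."""
        return self.current.get(subject_id)


tracker = PositionTracker({"reader-01": "corridor", "reader-02": "patient room 301"})
tracker.on_tag_read("reader-01", "nurse-A")
tracker.on_tag_read("reader-02", "nurse-A")
print(tracker.position_of("nurse-A"))  # patient room 301
```

When speech arrives from a subject, the server would only need to call `position_of` with that subject's identification information to obtain the place used for noise model selection.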

Furthermore, the subject's action can be estimated from the subject's current position. For example, if a place (room) is fixed in advance for each nursing task performed by nurses, the nursing task (action) can be specified from the nurse's current position. In that case, the acceleration sensor 40 shown in the embodiment described above can be omitted, which also reduces the volume of transmission data.

In this case, as shown in FIG. 7, a location DB 52 is additionally connected to the server 12. The location DB 52 stores the name or identification information of the corresponding nurse action (nursing task) in association with the name or identification information of each of a plurality of places.

Specifically, the server 12 executes the overall processing according to the flowchart shown in FIG. 8. Because this processing is substantially the same as that described with reference to FIG. 6 in the embodiment above, only the differences are described and duplicate description is omitted. In addition, because the nurse's action is specified from the nurse's current position as described above, the processing of step S5 shown in FIG. 6 is omitted.

As shown in FIG. 8, after specifying the nurse's current position in step S7, the server 12 refers to the location DB 52 in step S21 and specifies the nurse's action (nursing task) stored in association with the specified current position.
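Steps S7 and S21 amount to two successive table lookups, which the following Python sketch illustrates. The contents of the two dictionaries, the repeater IDs, and the function names are hypothetical and chosen only for illustration; the actual record formats of the repeater DB 44 and the location DB 52 are not specified here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical contents of the repeater DB 44: repeater ID -> installation place.
REPEATER_DB: Dict[str, str] = {
    "R01": "nurse station",
    "R02": "patient room",
    "R03": "treatment room",
}

# Hypothetical contents of the location DB 52: place -> nursing task (action).
LOCATION_DB: Dict[str, str] = {
    "nurse station": "record keeping",
    "patient room": "vital-sign measurement",
    "treatment room": "injection preparation",
}


def specify_place(repeater_id: str) -> Optional[str]:
    """Step S7: treat the relaying repeater's installation place as the current position."""
    return REPEATER_DB.get(repeater_id)


def specify_action(place: Optional[str]) -> Optional[str]:
    """Step S21: look up the action registered for the place in the location DB."""
    if place is None:
        return None
    return LOCATION_DB.get(place)


def place_and_action(repeater_id: str) -> Tuple[Optional[str], Optional[str]]:
    """Run step S7 followed by step S21 for one received transmission."""
    place = specify_place(repeater_id)
    return place, specify_action(place)


print(place_and_action("R02"))
```

An unregistered repeater ID yields `(None, None)` in this sketch; how the server should behave in that case is a design decision that the description leaves open.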

Although not illustrated, when the location DB 52 is provided, it is also possible, conversely, to specify the place from the nurse's action. In that case, the repeater 16 need only relay the transmission data from the mobile terminal 18; that is, the repeater 16 does not need to add its repeater ID. Moreover, because the nurse's current position is not estimated (specified) from the repeater ID, the repeater DB 44 can be eliminated. In operation, after specifying the nurse's action in step S5 shown in FIG. 6, the server 12, instead of performing the processing of step S7, refers to the location DB 52 and specifies the nurse's place, that is, the current position, from the action. In other words, the processing of the server 12 is the same as that shown in FIG. 6 except for step S7.
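Looking up a place from an action is only well defined if each action is registered for exactly one place. The Python sketch below makes that assumption explicit by inverting a place-to-action table and rejecting ambiguous entries. The function name `invert_location_db` and the sample entries are hypothetical.

```python
from typing import Dict


def invert_location_db(location_db: Dict[str, str]) -> Dict[str, str]:
    """Build an action -> place table from a place -> action table.

    Raises ValueError if one action is registered for more than one place,
    because the place could not then be determined from the action alone.
    """
    action_to_place: Dict[str, str] = {}
    for place, action in location_db.items():
        if action in action_to_place:
            raise ValueError(f"action {action!r} is registered for more than one place")
        action_to_place[action] = place
    return action_to_place


# Hypothetical location DB contents: place -> nursing task (action).
location_db = {
    "nurse station": "record keeping",
    "patient room": "vital-sign measurement",
    "treatment room": "injection preparation",
}

action_to_place = invert_location_db(location_db)
print(action_to_place["injection preparation"])  # treatment room
```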

In the embodiment described above, the server 12 executes the speech recognition processing, but this is not a limitation: the CPU 20 of the mobile terminal 18 can execute it instead. That is, the mobile terminal 18 can itself function as the speech recognition apparatus. In this case, the mobile terminal 18 may hold the databases (42-50) shown in the embodiment described above. Alternatively, when executing the speech recognition processing, the mobile terminal 18 may refer as appropriate to databases connected to the network 14 or to an external computer (for example, the server 12).

Furthermore, the embodiment described above covers only the case where both operation noise and environmental noise are suppressed, but suppressing either one alone still improves speech recognition accuracy compared with suppressing neither. For example, to suppress only operation noise, the processing of step S7 in FIG. 6 is deleted and only the operation noise model is determined in step S9; no environmental noise model then needs to be built in the noise model DB 48. Conversely, to suppress only environmental noise, the processing of step S5 in FIG. 6 is deleted and only the environmental noise model is determined in step S9; no operation noise model then needs to be built in the noise model DB 48.
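The three variants of step S9 (both noise types, operation noise only, environmental noise only) can be sketched as a single selection function in Python. This is an illustrative assumption rather than the disclosed implementation: the function name `determine_noise_models` is hypothetical, and the noise models are represented by placeholder strings instead of actual acoustic models.

```python
from typing import Dict, List, Optional


def determine_noise_models(
    operation_models: Dict[str, str],
    environment_models: Dict[str, str],
    action: Optional[str] = None,
    place: Optional[str] = None,
) -> List[str]:
    """Sketch of step S9: pick the noise models to use for recognition.

    Passing action=None corresponds to omitting step S5 (environmental noise only);
    passing place=None corresponds to omitting step S7 (operation noise only).
    A KeyError is raised if no model is registered for the given action or place.
    """
    selected: List[str] = []
    if action is not None:
        selected.append(operation_models[action])
    if place is not None:
        selected.append(environment_models[place])
    return selected


# Placeholder models; a real system would hold acoustic noise models here.
operation_models = {"walking": "op-noise:walking"}
environment_models = {"corridor": "env-noise:corridor"}

both = determine_noise_models(operation_models, environment_models, "walking", "corridor")
operation_only = determine_noise_models(operation_models, {}, action="walking")
environment_only = determine_noise_models({}, environment_models, place="corridor")
print(both, operation_only, environment_only)
```

In the single-noise variants the unused table can be left empty, which mirrors the remark that the corresponding models need not be built in the noise model DB 48.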

Furthermore, the embodiment described above covers only the case where the speech recognition system 10 is applied to an organization such as a hospital and recognizes the speech of a subject such as a nurse, but its use need not be limited to this. For example, the system may recognize the speech of doctors, or it may be applied to a factory to recognize the speech of workers there.

FIG. 1 is an illustrative view showing one example of the speech recognition system of the present invention.
FIG. 2 is an illustrative view showing the electrical configuration of the mobile terminal of FIG. 1.
FIG. 3 is an illustrative view showing a subject (nurse) wearing the mobile terminal of FIG. 1.
FIG. 4 is a graph showing the average power spectrum of the environmental noise observed at each of a plurality of places.
FIG. 5 is a block diagram showing the databases connected to the server of FIG. 1.
FIG. 6 is a flowchart showing the overall processing of the server of FIG. 1.
FIG. 7 is a block diagram showing the databases connected to the server of another embodiment.
FIG. 8 is a flowchart showing the overall processing of the server of another embodiment.

Explanation of symbols

10 … Speech recognition apparatus
12 … Server
14 … Network
16 … Repeater
18 … Mobile terminal
38 … Headset microphone
40 (40a, 40b, 40c, 40d, 40e, 40f) … Acceleration sensor

Claims

A speech recognition apparatus comprising:
operation noise model storage means for storing a plurality of operation noise models, each created for one of a plurality of actions, in association with the respective actions;
input speech detection means for detecting input speech including the speech of a subject;
action specifying means for specifying the action of the subject;
operation noise model reading means for reading, from the operation noise model storage means, the operation noise model corresponding to the action specified by the action specifying means; and
recognition means for recognizing the speech of the subject included in the input speech detected by the input speech detection means, using the operation noise model read by the operation noise model reading means.

The speech recognition apparatus according to claim 1, further comprising:
place noise model storage means for storing a plurality of place noise models, each created for one of a plurality of places, in association with the respective places;
place specifying means for specifying the place where the subject is present; and
place noise model reading means for reading, from the place noise model storage means, the place noise model corresponding to the place specified by the place specifying means,
wherein the recognition means recognizes the speech of the subject included in the input speech detected by the input speech detection means, using the operation noise model read by the operation noise model reading means and the place noise model read by the place noise model reading means.

The speech recognition apparatus according to claim 1, further comprising: estimating means for estimating a signal-to-noise ratio of the speech signal of the input speech detected by the input speech detection means; and adjusting means for adjusting a synthesis ratio of the noise model according to the signal-to-noise ratio estimated by the estimating means.

The speech recognition apparatus, wherein the place specifying means detects identification information emitted by a repeater installed in the environment and specifies the installation place of the repeater that emitted the identification information as the place where the subject is present.

A speech recognition system comprising a mobile terminal and a speech recognition apparatus, wherein
the mobile terminal comprises:
input speech detection means for detecting input speech including the speech of a subject; and
transmission means for transmitting a speech signal of the input speech detected by the input speech detection means to the speech recognition apparatus, and
the speech recognition apparatus comprises:
receiving means for receiving the speech signal transmitted by the transmission means;
operation noise model storage means for storing a plurality of operation noise models, each created for one of a plurality of actions, in association with the respective actions;
action specifying means for specifying the action of the subject;
operation noise model reading means for reading, from the operation noise model storage means, the operation noise model corresponding to the action specified by the action specifying means; and
recognition means for recognizing the speech of the subject included in the input speech detected by the input speech detection means, using the operation noise model read by the operation noise model reading means.

The speech recognition system according to claim 5, wherein the speech recognition apparatus further comprises:
place noise model storage means for storing a plurality of place noise models, each created for one of a plurality of places, in association with the respective places;
place specifying means for specifying the place where the subject is present; and
place noise model reading means for reading, from the place noise model storage means, the place noise model corresponding to the place specified by the place specifying means,
and wherein the recognition means recognizes the speech of the subject included in the input speech detected by the input speech detection means, using the operation noise model read by the operation noise model reading means and the place noise model read by the place noise model reading means.

The speech recognition system, further comprising a plurality of repeaters that are arranged in correspondence with the respective places and relay communication between the mobile terminal and the speech recognition apparatus, wherein
each repeater receives the speech signal transmitted from the mobile terminal, adds its own identification information to the received speech signal, and transmits it to the speech recognition apparatus, and
the place specifying means specifies, on the basis of the identification information added to the speech signal received by the receiving means, the installation place of the repeater that transmitted the speech signal as the place where the subject is present.

A speech recognition method for a computer comprising operation noise model storage means for storing a plurality of operation noise models, each created for one of a plurality of actions, in association with the respective actions, the method comprising:
(a) detecting input speech including the speech of a subject;
(b) specifying the action of the subject;
(c) reading, from the operation noise model storage means, the operation noise model corresponding to the action specified in step (b); and
(d) recognizing the speech of the subject included in the input speech detected in step (a), using the operation noise model read in step (c).