JP2014119536A

JP2014119536A - Sound recognition device

Info

Publication number: JP2014119536A
Application number: JP2012273275A
Authority: JP
Inventors: Hiroshi Nagoshi; 啓名越
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2012-12-14
Filing date: 2012-12-14
Publication date: 2014-06-30

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of recognizing sound data. SOLUTION: A sound recognition device comprises: a voice processing unit 12 that, when sound data is input, outputs candidates for the type of voice included in the input sound data; a non-voice processing unit 13 that, when sound data is input, determines whether the input sound data is a non-voice sound lacking the features of voice and, if it is, associates an identifier with the non-voice sound; and an operation unit 143 that, when an identifier has been associated with a non-voice sound in the non-voice processing unit 13 and candidates for the type of voice are input from the voice processing unit 12 within a predetermined time after that association, weights the candidates based on a history database D3 and determines the type of the voice from the weighted candidates.

Description

The present invention relates to a sound recognition device that recognizes speech or non-speech sounds included in sound data.

Speech recognition, which converts speech input through a microphone or the like into text without using an input device such as a keyboard or a touch panel, is in widespread use. Techniques for identifying non-speech sounds and converting them into visual data are also being studied.

Because many words sound the same but are written differently, converting speech into character data is difficult, and improving the recognition rate is a standing issue in the field of speech recognition.

One way to improve the recognition rate is to switch the dictionary used for speech recognition according to the scene (for example, a dictionary for medical terms or one for meeting terms), so that speech is converted into the characters most frequently used in each scene.

JP 2012-128411 A

As described above, improving the accuracy of speech recognition has been an issue. Improving the recognition accuracy of non-speech sounds has likewise been an issue.

In view of the above issues, an object of the present invention is to provide a sound recognition device with improved accuracy in recognizing sound data.

To achieve the above object, a speech processing unit outputs, when sound data is input, candidates for the type of speech included in the input sound data. A non-speech processing unit determines, when sound data is input, whether the input sound data is a non-speech sound that lacks the features of speech and, if it is, associates an identifier with the non-speech sound. A calculation unit, when an identifier has been associated with a non-speech sound in the non-speech processing unit and candidates for the type of speech are input from the speech processing unit within a predetermined time after that association, weights the candidates based on a history database and determines the type of the speech from the weighted candidates.

According to the present invention, the accuracy of sound recognition can be improved.

FIG. 1 is a block diagram illustrating the sound recognition device according to the first embodiment.
FIG. 2 is an explanatory diagram of an example of data used by the sound recognition device of FIG. 1.
FIG. 3 is a flowchart illustrating an example of processing in the non-speech sound processing unit of the sound recognition device of FIG. 1.
FIG. 4 is a flowchart illustrating an example of processing in the weighting processing unit of the sound recognition device of FIG. 1.
FIG. 5 is a block diagram illustrating the sound recognition device according to the second embodiment.
FIG. 6 is an explanatory diagram of an example of data used by the sound recognition device of FIG. 5.
FIG. 7 is a flowchart illustrating an example of processing in the non-speech sound processing unit of the sound recognition device of FIG. 5.
FIG. 8 is a flowchart illustrating an example of processing in the weighting processing unit of the sound recognition device of FIG. 5.
FIG. 9 is a block diagram illustrating the sound recognition device according to the third embodiment.
FIG. 10 is an explanatory diagram of an example of data used by the sound recognition device of FIG. 9.
FIG. 11 is a flowchart illustrating an example of processing in the non-speech sound processing unit of the sound recognition device of FIG. 9.
FIG. 12 is a flowchart illustrating an example of processing in the weighting processing unit of the sound recognition device of FIG. 9.

A sound recognition device according to each embodiment of the present invention will be described with reference to the drawings. In the drawings, identical components are denoted by the same reference numerals, and their description is not repeated.

<First Embodiment>
The sound recognition device according to the first embodiment uses speech recognition to convert input speech into character data and output it. When determining the character data corresponding to the speech from among multiple character data candidates, the device uses a non-speech sound that was input within a predetermined time before the speech was input (for example, within the preceding 3 minutes). In the following, a sound uttered by a human voice is called “speech”, and any other sound (for example, a mechanical sound) is called a “non-speech sound”. Using non-speech sounds in speech recognition allows the sound recognition device to improve the accuracy of speech recognition.

As shown in FIG. 1, the sound recognition device 1a according to the first embodiment includes: an input unit 11, such as a microphone, to which sound data is input; a speech processing unit 12 that analyzes speech included in the sound data; a non-speech sound processing unit 13 that analyzes non-speech sounds included in the sound data; a weighting processing unit 14 that determines the character data corresponding to the speech using the processing results of the speech processing unit 12 and the non-speech sound processing unit 13; an output unit 15, such as a display, that outputs the character data; and a storage device 20.

To analyze non-speech sounds, the non-speech sound processing unit 13 includes an analysis unit 131, a search unit 132, and an adding unit 133. To determine character data, the weighting processing unit 14 includes a determination unit 141, a selection unit 142, a calculation unit 143, and an update unit 144.

The storage device 20 stores a sound recognition program P1, a speech database D1, a non-speech sound database D2, and a history database D3.

Specifically, the sound recognition device 1a is an information processing device including a CPU (central processing unit) 10, the input unit 11, the output unit 15, and the storage device 20. When the sound recognition program P1 stored in the storage device 20 is executed, the CPU 10 operates as the speech processing unit 12, the non-speech sound processing unit 13, and the weighting processing unit 14.

The speech database D1 is data that the speech processing unit 12 uses to convert speech included in the sound data into the corresponding character data; it is the kind of data used in general speech recognition. For example, the speech database D1 includes “acoustic model data”, which associates speech waveform data, frequency data, and the like with phonetic symbols; “dictionary data”, which associates phonetic symbols with the character data of words; and “language model data”, which is used to convert the character data of words into the character data of sentences. The “language model data” is, for example, data that links words to one another by probability. The speech processing unit 12 may convert the input speech into the character data of words or into the character data of sentences.

The non-speech sound database D2 is a database that associates feature data capable of identifying a sound, such as the waveform data or frequency data of a non-speech sound, with the identifier of that non-speech sound. For example, the non-speech sound database D2 associates a chime sound with the identifier assigned to it, a telephone ring with the identifier assigned to it, and so on.

The history database D3 is data that associates each combination of a non-speech sound identifier and character data with a weight value used to determine character data. Here, the weight value is the frequency with which a given speech was input within a predetermined time (for example, within 3 minutes) after a given non-speech sound was input.

In the history database D3 shown in FIG. 2(a), after a chime sound (a non-speech sound assigned the identifier “A sound”) was input, the speech “Thank you” was input with a frequency of 40% and the speech “Good morning” with a frequency of 35%. That is, the storage device 20 stores, as the history database D3, a frequency associated with each pair of (non-speech sound with an identifier) and (speech). The frequency stored in the history database D3 is calculated, for example, as “the number of times (speech) was detected within the predetermined time after (non-speech sound with an identifier) was detected / the number of times (non-speech sound with an identifier) was detected”. In the specific example, the frequency calculated as <number of times “Thank you” was recognized within 3 minutes after the chime was detected> / <number of times the chime was detected> is 40%. Because more than one speech may be uttered within 3 minutes of a single non-speech sound, each speech independently takes a value from 0 to 100%. For example, if both “Goodbye” and “Thank you” are recognized within 3 minutes after the chime sound is recognized for the first time, the history database at that point holds 100% for both.
Likewise, after a telephone ring (a non-speech sound assigned the identifier “B sound”) was input, the speech “Yes” was input with a frequency of 75%, the speech “This is ×× Corporation” with a frequency of 72%, and the speech “Hello” with a frequency of 48%. Further, after the non-speech sound assigned the identifier “C sound” was input, the speech “progress” was input with a frequency of 85%.
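The frequency definition above can be made concrete with a short sketch. The function name `cooccurrence_frequency`, the timestamp lists, and the use of seconds are assumptions made for illustration; the description itself only gives the ratio and the 3-minute example window.

```python
def cooccurrence_frequency(sound_times, speech_times, window_sec=180.0):
    """Fraction of non-speech detections followed by the speech within the window.

    sound_times:  times (seconds) at which one non-speech sound was detected.
    speech_times: times (seconds) at which one particular speech was recognized.
    Both lists and the seconds unit are illustrative assumptions.
    """
    if not sound_times:
        return 0.0
    hits = sum(
        1
        for s in sound_times
        if any(0 <= t - s <= window_sec for t in speech_times)
    )
    return hits / len(sound_times)


# Five chime detections; "Thank you" follows two of them within 3 minutes.
chimes = [0, 1000, 2000, 3000, 4000]
thank_you = [60, 2100]
print(cooccurrence_frequency(chimes, thank_you))  # 0.4
```

Because the ratio is computed separately for each speech, two different phrases recognized after the very first chime would each yield 1.0, which matches the “both 100 percent” case described above.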

When sound data is input from the input unit 11, the speech processing unit 12 analyzes the input sound data and identifies one or more character data candidates corresponding to the speech included in the sound data. Specifically, the speech processing unit 12 reads the speech database D1 stored in the storage device 20, compares the waveform data and frequency data of the speech included in the sound data with the waveform data and frequency data of each speech in the “acoustic model data” of the speech database D1, and derives the phonetic symbols with high similarity. It then applies the “dictionary data” or the “language model data” to the derived phonetic symbols to convert them into the character data of words or of sentences, and outputs that character data together with a score indicating its likelihood. For example, the speech processing unit 12 derives the phonetic symbols together with their similarity. It then weights the similarity according to whether the phonetic symbols are included in the “dictionary data” and, if they are, whether the connection between words included in the “acoustic model” is plausible, and derives the result as the score of the character data indicated by the phonetic symbols. Further, as illustrated in FIG. 2(b), the speech processing unit 12 extracts all speech character data whose score satisfies a predetermined condition, associates the extracted character data with the scores, and outputs them to the weighting processing unit 14 as the result of speech processing (speech processing result).

Conditions under which the speech processing unit 12 extracts character data may include: (1) extracting character data whose score is at or above a specific value (for example, 50 or more); (2) extracting character data whose score ranks within a predetermined number from the top (for example, third place or higher); and (3) extracting character data whose score is within a predetermined range of the top score (for example, within 5 points below the top score).
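The three extraction conditions can be sketched as three small filters over (character data, score) pairs. The function names and the sample scores are hypothetical; only the example limits (50, top 3, within 5 of the top score) come from the description.

```python
def by_min_score(candidates, min_score=50):
    """Condition (1): keep candidates scoring at or above a fixed value."""
    return [(text, score) for text, score in candidates if score >= min_score]


def by_top_n(candidates, n=3):
    """Condition (2): keep the n highest-scoring candidates."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:n]


def by_margin(candidates, margin=5):
    """Condition (3): keep candidates within `margin` points of the top score."""
    if not candidates:
        return []
    best = max(score for _, score in candidates)
    return [(text, score) for text, score in candidates if score >= best - margin]


# Invented scores, used only to show how the three conditions differ.
candidates = [("Thank you", 62), ("Good morning", 60), ("Goodbye", 48), ("Yes", 30)]
print(by_min_score(candidates))  # [('Thank you', 62), ('Good morning', 60)]
print(by_top_n(candidates))      # [('Thank you', 62), ('Good morning', 60), ('Goodbye', 48)]
print(by_margin(candidates))     # [('Thank you', 62), ('Good morning', 60)]
```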

When sound data is input from the input unit 11, the analysis unit 131 analyzes the input sound data and stores non-speech sounds in a memory (not shown). Specifically, the analysis unit 131 determines whether the volume of the input sound data is at or above a predetermined threshold. For sound data whose volume is at or above the threshold, the analysis unit 131 determines whether it is speech, and it judges sound data that is at or above the threshold and is not speech to be a non-speech sound. In one example of determining whether sound is speech, the analysis unit 131 cuts the input sound data into frames of a certain time width, performs frequency conversion on each frame, and makes the determination based on whether the ratio of the energy at each frequency to the energy time-averaged over a predetermined bandwidth exceeds a threshold. In another example, the analysis unit 131 cuts the input sound data into frames of a certain time width, performs frequency conversion, detects spectral peaks by, for example, comparing the spectral energy with a predetermined value, and determines that the sound is speech when it detects that the peaks are related as a fundamental and its harmonics. For the speech determination, the analysis unit 131 can use, for example, the techniques described in Japanese Patent Application No. 2011-254578 and Japanese Patent Application No. 2011-260036.
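As a rough illustration of the second example (peaks related as a fundamental and its harmonics), the sketch below assumes that the spectral peak frequencies of a frame have already been extracted and that the lowest peak is the fundamental. The function names, the 5% tolerance, and the minimum harmonic count are all assumptions; this is not the method of the cited applications.

```python
def peaks_are_harmonic(peak_freqs_hz, tolerance=0.05, min_harmonics=2):
    """Return True if enough peaks sit near integer multiples of the lowest peak."""
    peaks = sorted(f for f in peak_freqs_hz if f > 0)
    if len(peaks) < min_harmonics + 1:
        return False
    fundamental = peaks[0]  # assumption: the lowest peak is the fundamental
    harmonics = 0
    for f in peaks[1:]:
        ratio = f / fundamental
        nearest = round(ratio)
        # Count the peak if it lies close to the 2nd, 3rd, ... harmonic.
        if nearest >= 2 and abs(ratio - nearest) <= tolerance * nearest:
            harmonics += 1
    return harmonics >= min_harmonics


def classify(volume, peak_freqs_hz, volume_threshold):
    """Volume gate first, then the speech / non-speech decision."""
    if volume < volume_threshold:
        return "ignored"
    return "speech" if peaks_are_harmonic(peak_freqs_hz) else "non-speech"


print(classify(0.8, [120, 240, 361, 480], 0.5))  # speech
print(classify(0.8, [440, 587, 1013], 0.5))      # non-speech
print(classify(0.1, [120, 240, 360], 0.5))       # ignored
```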

When the analysis unit 131 extracts a non-speech sound from sound data whose volume is at or above the threshold, it starts accumulating the non-speech sound data in the memory. Thereafter, when the volume of the non-speech sound falls below the threshold, the analysis unit 131 stops the accumulation. The threshold used here is set in advance in the sound recognition device 1a for judging the volume of the sound data to be processed. As a result, sounds whose volume is below the threshold, such as noise, need not be processed, which reduces the processing load on the sound recognition device 1a.

Alternatively, when the analysis unit 131 starts accumulating a non-speech sound in the memory, it may start a timer from the start of accumulation and stop accumulating once a predetermined time (for example, 5 seconds) has elapsed. For long non-speech sounds, this allows only the opening portion of the sound to be detected and compared, which reduces the memory capacity needed for the database and improves detection response.
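The two stopping rules for accumulation (volume falling below the threshold, or a time limit such as 5 seconds) can be sketched together. The frame representation as (volume, data) pairs, the fixed frame duration, and the function name `accumulate` are assumptions for illustration, and the frames are assumed to have already been judged non-speech.

```python
def accumulate(frames, threshold, frame_sec, max_sec=5.0):
    """Collect frame data from the first loud frame until a stop condition.

    frames: iterable of (volume, data) pairs, assumed to be non-speech frames.
    Stops when the volume drops below `threshold` or when `max_sec` seconds
    of data have been accumulated.
    """
    buffer = []
    for volume, data in frames:
        if not buffer:
            # Not yet accumulating: wait for a frame at or above the threshold.
            if volume >= threshold:
                buffer.append(data)
        else:
            if volume < threshold or len(buffer) * frame_sec >= max_sec:
                break
            buffer.append(data)
    return buffer


frames = [(0.1, "a"), (0.6, "b"), (0.7, "c"), (0.2, "d"), (0.9, "e")]
print(accumulate(frames, threshold=0.5, frame_sec=1.0))  # ['b', 'c']
```

With a long, continuously loud sound, the same function returns only the first `max_sec` seconds, which corresponds to comparing just the opening portion of the sound.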

When a non-speech sound has been accumulated in the memory by the analysis unit 131, the search unit 132 searches the non-speech sound database D2 for the identifier of the non-speech sound stored in the memory.

Specifically, the search unit 132 reads the non-speech sound database D2 from the storage device 20 and, for each non-speech sound included in the non-speech sound database D2, obtains its similarity to the non-speech sound stored in the memory. For example, the search unit 132 compares the waveform data or frequency data of each non-speech sound in the non-speech sound database with the waveform data or frequency data of the non-speech sound stored in the memory and obtains the similarity for each.

After obtaining the similarity to the non-speech sound stored in the memory for every non-speech sound in the non-speech sound database D2, the search unit 132 determines whether the highest of the obtained similarities is at or above a predetermined threshold. The threshold used here is a value set in advance in the sound recognition device 1a for judging that a non-speech sound in the non-speech sound database D2 and the non-speech sound stored in the memory represent the same content.

When the highest similarity is at or above the threshold, the search unit 132 outputs the identifier of the non-speech sound with the highest similarity to the weighting processing unit 14 as a processing result (non-speech sound processing result). When the highest similarity is below the threshold, the search unit 132 has determined that the non-speech sound stored in the memory is not included in the non-speech sound database D2, and it outputs this non-speech sound to the adding unit 133.

When a non-speech sound is input from the search unit 132, the adding unit 133 assigns an identifier to it and adds it to the non-speech sound database D2. The adding unit 133 also outputs the identifier of this non-speech sound to the weighting processing unit 14 as a processing result (non-speech sound processing result).
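The search-then-add behavior of the search unit 132 and the adding unit 133 can be sketched as a single lookup. The description does not specify the similarity measure, so cosine similarity over feature vectors is used here purely as a stand-in; the dictionary layout, the 0.9 threshold, and the generated identifier format are likewise assumptions.

```python
import math


def cosine_similarity(a, b):
    """Stand-in similarity measure; the description does not name one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(database, features, threshold=0.9):
    """Return the identifier of the best match, registering a new sound if none.

    database: dict mapping identifier -> feature vector (hypothetical layout).
    """
    best_id, best_sim = None, -1.0
    for identifier, reference in database.items():
        sim = cosine_similarity(reference, features)
        if sim > best_sim:
            best_id, best_sim = identifier, sim
    if best_id is not None and best_sim >= threshold:
        return best_id
    # No sufficiently similar entry: add the sound under a new identifier.
    new_id = f"sound-{len(database) + 1}"
    database[new_id] = list(features)
    return new_id


db = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
print(identify(db, [0.99, 0.05, 0.0]))  # A
print(identify(db, [0.0, 0.0, 1.0]))    # sound-3
```

Either way an identifier is returned, so the downstream weighting step always receives one, as the description requires.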

The determination unit 141 judges the timing at which processing results are input from the speech processing unit 12 and the non-speech sound processing unit 13. Specifically, when a non-speech sound processing result is input from the non-speech sound processing unit 13, the determination unit 141 stores the non-speech sound identifier that constitutes that result in a memory (not shown) and starts a timer from the input timing. If a new non-speech sound identifier is subsequently input from the non-speech sound processing unit 13, the determination unit 141 replaces the identifier stored in the memory with the new identifier, stops the previous timer, and starts a new one.

When a speech processing result is input from the speech processing unit 12, the determination unit 141 determines whether the input falls within a predetermined time (for example, within 3 minutes) after the non-speech sound processing result was input from the non-speech sound processing unit 13. If it is within the predetermined time, the determination unit 141 outputs the speech processing result input from the speech processing unit 12 and the non-speech sound identifier stored in the memory to the selection unit 142. If the predetermined time has already elapsed, the determination unit 141 outputs only the speech processing result input from the speech processing unit 12 to the output unit 15.
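The timing logic of the determination unit 141 amounts to remembering the latest identifier and its arrival time. In the sketch below, the class name, the method names, and the explicit `now` argument (in seconds) are assumptions; returning `None` for the identifier stands in for the case where only the speech processing result is passed on.

```python
class DeterminationUnit:
    """Pairs a speech result with the most recent non-speech identifier, if recent."""

    def __init__(self, window_sec=180.0):
        self.window_sec = window_sec
        self.identifier = None
        self.started_at = None

    def on_non_speech(self, identifier, now):
        # A new identifier replaces the stored one and restarts the timer.
        self.identifier = identifier
        self.started_at = now

    def on_speech(self, speech_result, now):
        within = (
            self.identifier is not None
            and now - self.started_at <= self.window_sec
        )
        return speech_result, (self.identifier if within else None)


unit = DeterminationUnit()
unit.on_non_speech("A", now=10)
print(unit.on_speech(["Thank you"], now=100))  # (['Thank you'], 'A')
print(unit.on_speech(["Thank you"], now=400))  # (['Thank you'], None)
```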

When the speech processing result and the non-speech sound identifier are input from the determination unit 141, the selection unit 142 reads the history database D3 from the storage device 20. From the history database D3, the selection unit 142 selects the frequencies corresponding to the input non-speech sound identifier and each piece of character data in the speech processing result, and outputs this selection result, together with the non-speech sound identifier and the speech processing result, to the calculation unit 143.

On the other hand, a new non-speech sound, such as one that the adding unit 133 of the non-speech sound processing unit 13 has just added to the non-speech sound database D2, is not yet included in the history database D3. When the history database D3 contains no frequency associated with the combination of the non-speech sound identifier and character data in the speech processing result, the selection unit 142 cannot select a frequency for that combination. In that case, the selection unit 142 outputs the character data with the highest score in the speech processing result to the output unit 15 as the character data corresponding to the speech, and outputs the non-speech sound identifier, the speech processing result, and the character data determined to correspond to the speech to the update unit 144.

Likewise, when only the speech processing result is input from the determination unit 141 because the processing result from the speech processing unit 12 was input after the predetermined time had elapsed since the processing result from the non-speech sound processing unit 13, the selection unit 142 cannot select a frequency from the history database D3. In this case, the selection unit 142 outputs the character data with the highest score in the speech processing result to the output unit 15 as the character data corresponding to the speech, and outputs no data to the update unit 144.

When the non-speech sound identifier, the speech processing result, and the selection result are input from the selection unit 142, the calculation unit 143 weights the scores included in the speech processing result and determines the character data corresponding to the speech. Specifically, the calculation unit 143 weights the score of each character data candidate to compute a new score, and outputs the character data with the highest new score to the output unit 15 as the character data corresponding to the speech. The calculation unit 143 also outputs the determined character data, the non-speech sound identifier, and the speech processing result to the update unit 144.

For example, the calculation unit 143 obtains a new score for each combination of the non-speech sound identifier and character data by weighting with an expression such as Expression (1), where r1 is a predetermined coefficient.

New score = score + frequency × r1 … (1)

The update unit 144 updates the frequencies in the history database D3 according to the input data. Specifically, when the determined character data, the non-speech sound identifier, and the speech processing result are input from the calculation unit 143, the update unit 144 updates the history database D3 so that the frequency associated with the non-speech sound identifier and the determined character data becomes higher, and the frequency associated with the non-speech sound identifier and each other piece of character data in the speech processing result becomes lower. If the history database D3 has no frequency associated with the non-speech sound identifier and a piece of character data in the speech processing result, the update unit 144 associates a frequency with that identifier and character data and adds it to the history database D3.

Similarly, when the determined character data, the non-speech sound identifier, and the speech processing result are input from the selection unit 142, the update unit 144 updates the history database D3 so that the frequency associated with the non-speech sound identifier and the determined character data becomes higher, and the frequency associated with that identifier and each other piece of character data in the speech processing result becomes lower.
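The weighting of Expression (1) and the subsequent history update can be sketched together. The names `rescore` and `HistoryCounts`, the value r1 = 0.5, the sample scores, and the treatment of a missing frequency as 0 are assumptions. The running-ratio update is one possible way to make the decided text's frequency rise and the others fall; the description does not fix the update rule.

```python
def rescore(candidates, frequencies, r1=0.5):
    """Expression (1): new score = score + frequency * r1, per candidate."""
    new_scores = {
        text: score + frequencies.get(text, 0.0) * r1  # missing frequency -> 0 (assumption)
        for text, score in candidates.items()
    }
    best = max(new_scores, key=new_scores.get)
    return best, new_scores


class HistoryCounts:
    """Frequencies (in percent) kept as hit counts over observations (assumed scheme)."""

    def __init__(self):
        self.observations = {}  # identifier -> number of updates
        self.hits = {}          # (identifier, text) -> times decided

    def update(self, identifier, decided_text, candidate_texts):
        self.observations[identifier] = self.observations.get(identifier, 0) + 1
        for text in candidate_texts:
            self.hits.setdefault((identifier, text), 0)  # add missing entries
        key = (identifier, decided_text)
        self.hits[key] = self.hits.get(key, 0) + 1

    def frequency(self, identifier, text):
        n = self.observations.get(identifier, 0)
        return 100.0 * self.hits.get((identifier, text), 0) / n if n else 0.0


# Frequencies taken from the "A sound" row of FIG. 2(a); scores are invented.
candidates = {"Thank you": 58, "Good morning": 60, "Goodbye": 59}
frequencies = {"Thank you": 40, "Good morning": 35}
best, new_scores = rescore(candidates, frequencies)
print(best)  # Thank you
```

In this example the unweighted top candidate would be “Good morning” (60), but the history for the chime lifts “Thank you” to 78.0 against 77.5, so the weighted decision changes.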

(Processing in the non-speech sound processing unit)
Processing in the non-speech sound processing unit 13 will be described using the flowchart shown in FIG. 3.

The process of FIG. 3 starts when the device is powered on. In the non-speech sound processing unit 13, the analysis unit 131 analyzes the sound data input via the input unit 11 and determines whether the volume of the input sound data is at or above a threshold (S10). When the volume is at or above the threshold (YES in S10), the analysis unit 131 determines whether the sound data is a non-speech sound (S11). When no sound data at or above the threshold volume is input (NO in S10), or when the sound data contains no non-speech sound (NO in S11), the process returns to step S10 and repeats.

When the sound data at or above the threshold volume contains a non-speech sound (YES in S11), the analysis unit 131 starts accumulating the non-speech sound in memory. The search unit 132 computes the similarity between each non-speech sound in the non-speech sound database D2 and the non-speech sound stored in memory, and compares them (S12).

When the non-speech sound database D2 contains a non-speech sound similar to the input one (YES in S13), the analysis unit 131 stops accumulating the input non-speech sound in memory, and the search unit 132 stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in memory (S14). Once accumulation and comparison have ended and the identifier of the non-speech sound has been determined, the search unit 132 outputs that identifier to the weighting processing unit 14 as the processing result and discards the non-speech sound data accumulated in memory (S18).

In contrast, when the non-speech sound database D2 contains no non-speech sound similar to the input one (NO in S13), the analysis unit 131 determines whether the volume of the subsequently input sound data is below the threshold, or whether the time spent accumulating the non-speech sound in memory has reached a predetermined time (S15). When the volume of the new sound data is at or above the threshold and the accumulation time is still within the predetermined time, the process returns to step S12 and repeats the accumulation and comparison.

When the volume of the newly input sound data falls below the threshold, or when the accumulation time reaches the predetermined time (YES in S15), the analysis unit 131 stops accumulating the input non-speech sound in memory, and the search unit 132 stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in memory (S16). The adding unit 133 then assigns a new identifier to the non-speech sound and adds it to the non-speech sound database D2 (S17). Once the non-speech sound and its identifier have been added to the non-speech sound database D2, the adding unit 133 outputs the newly assigned identifier to the weighting processing unit 14 as the processing result and discards the non-speech sound data accumulated in memory (S18).
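Steps S10 to S18 can be sketched as a single loop over incoming frames. The frame format, the `similarity` callback, the thresholds, the frame-count limit standing in for the "predetermined time", and the way new identifiers are named are all assumptions made for illustration; the source does not specify them.

```python
def process_non_speech(frames, database, similarity,
                       volume_threshold=0.2, match_threshold=0.8,
                       max_frames=50):
    """Return the identifier passed to the weighting stage, or None.

    frames     -- iterable of (volume, is_non_speech, sample) tuples (assumed format)
    database   -- dict: identifier -> stored sound, standing in for database D2
    similarity -- callable(stored, buffered) -> float, higher means more alike
    """
    buffer = []  # the "memory" in which the non-speech sound accumulates
    for volume, is_non_speech, sample in frames:
        if not buffer:
            # S10 / S11: wait for a loud frame that is a non-speech sound.
            if volume < volume_threshold or not is_non_speech:
                continue
        elif volume < volume_threshold or len(buffer) >= max_frames:
            # S15 YES: the sound ended or accumulated for too long (S16).
            break
        buffer.append(sample)
        # S12 / S13: compare the buffered sound with every registered sound.
        for sound_id, stored in database.items():
            if similarity(stored, buffer) >= match_threshold:
                return sound_id  # S14, S18: the local buffer is dropped
    if not buffer:
        return None  # nothing loud enough was ever heard
    # S17: register the unknown sound under a new identifier (naming assumed).
    new_id = "sound-%d" % (len(database) + 1)
    database[new_id] = list(buffer)
    return new_id  # S18
```

In this sketch the end of the input is treated like a drop in volume, so an unmatched sound is still registered when the frames run out.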

(Processing in the weighting processing unit)
Processing in the weighting processing unit 14 will be described using the flowchart shown in FIG. 4. The flowchart of FIG. 4 starts when the device is powered on. The determination unit 141 of the weighting processing unit 14 waits for processing results from the voice processing unit 12 and the non-speech sound processing unit 13. When a non-speech sound identifier, which is the processing result of the non-speech sound processing unit 13, is input (YES in S20), the determination unit 141 starts timing (S21).

After that, when a processing result is input from the voice processing unit 12 (YES in S22), the determination unit 141 determines whether this input falls within a predetermined time after the non-speech sound identifier was input from the non-speech sound processing unit 13 (S23).

When it is within the predetermined time (YES in S23), the selection unit 142 determines whether the history database D3 contains frequencies associated with the input non-speech sound identifier and each piece of character data included in the voice processing result (S24).

When the history database D3 contains frequencies associated with the input non-speech sound identifier and each piece of character data included in the voice processing result (YES in S24), the calculation unit 143 performs the weighting process. That is, for the score of each character data candidate in the voice processing result, the calculation unit 143 computes a new score using the frequency as the weight value (S25). The computation uses equation (1) described above. The calculation unit 143 then determines the character data with the highest new score to be the character data corresponding to the voice and outputs it as the result (S26).

On the other hand, when it is not within the predetermined time (NO in S23), or when the history database D3 contains no frequency associated with the input non-speech sound identifier and the character data included in the voice processing result (NO in S24), the selection unit 142 determines the character data with the highest score in the voice processing result to be the character data corresponding to the voice and outputs it as the result (S26).

When the character data corresponding to the voice has been determined, the update unit 144 updates the frequencies in the history database D3 (S27).

When no processing result is input from the non-speech sound processing unit 13 and the voice processing unit 12 (NO in S20 and NO in S22), the process returns to step S20 and waits for a processing result from the non-speech sound processing unit 13 or the voice processing unit 12.
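The decision made in steps S23 to S26 can be sketched as one function. The 180-second window, the value of `r1`, the data layout, and the reading of S24 as "at least one candidate has a stored frequency" are assumptions; the source states only that a predetermined time and a predetermined coefficient are used.

```python
def decide_text(candidates, sound_id, elapsed_s, history,
                window_s=180.0, r1=1.0):
    """Pick the text for a voice input, weighting by a recent non-speech sound.

    candidates -- dict: candidate text -> recognition score
    sound_id   -- identifier of the last non-speech sound, or None
    elapsed_s  -- seconds since that identifier was received
    history    -- dict: (sound identifier, text) -> frequency (database D3)
    """
    freqs = {}
    if sound_id is not None and elapsed_s <= window_s:  # S23
        freqs = {text: history[(sound_id, text)]
                 for text in candidates
                 if (sound_id, text) in history}        # S24
    if freqs:
        # S25: new score = score + frequency * r1, as in equation (1).
        # Candidates without a stored frequency are weighted by 0 (assumed).
        scores = {text: score + freqs.get(text, 0.0) * r1
                  for text, score in candidates.items()}
    else:
        # NO in S23 or S24: fall back to the unweighted recognition scores.
        scores = dict(candidates)
    return max(scores, key=scores.get)                  # S26
```

With made-up numbers, a candidate that trails slightly on the raw score can win once the history favours it, which is the effect the embodiment is after.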

In the embodiment described above, the voice processing unit 12 outputs character data corresponding to the voice. The output is not limited to character data, however; any form that indicates the type of voice may be used, such as an identifier or an icon corresponding to the voice.

As described above, the sound recognition device 1a according to the first embodiment uses, at the time of voice recognition, a non-speech sound acquired within a predetermined time in the past. This makes it possible to predict the type of voice acquired within a certain time after the non-speech sound, which improves the accuracy of voice recognition.

In addition, the non-speech sound database D2 and the history database D3 used by the sound recognition device 1a are updated automatically through use, so the accuracy of voice recognition can be improved without registration work or similar effort by the user.

〈Second Embodiment〉
When converting a non-speech sound contained in sound data into visual data, the sound recognition device according to the second embodiment uses a voice that was input within a predetermined time before the non-speech sound (for example, up to 3 minutes earlier). Here too, a sound uttered by a human voice is called "voice", and any other sound is called a "non-speech sound". Using the voice to identify the non-speech sound improves the accuracy of converting the non-speech sound.

As shown in FIG. 5, the sound recognition device 1b according to the second embodiment differs from the sound recognition device 1a according to the first embodiment, described above with reference to FIG. 1, in that it has a voice processing unit 12b in place of the voice processing unit 12, a non-speech sound processing unit 13b in place of the non-speech sound processing unit 13, and a weighting processing unit 14b in place of the weighting processing unit 14. In addition, compared with the sound recognition device 1a, the storage device 20 of the sound recognition device 1b stores a sound recognition program P2 in place of the sound recognition program P1 and a history database D4 in place of the history database D3. That is, in the sound recognition device 1b, executing the sound recognition program P2 stored in the storage device 20 causes the CPU 10 to operate as the voice processing unit 12b, the non-speech sound processing unit 13b, and the weighting processing unit 14b.

The history database D4 is a database that associates a weight value, used to determine the non-speech sound, with each pair of character data and non-speech sound identifier. Here, the weight value is the frequency with which the non-speech sound was input within a predetermined time (for example, within 3 minutes) after the voice for that character data was input.

The history database D4 shown in FIG. 6(a) indicates that, after the voice "Thank you for your hard work" was input, the chime sound, a non-speech sound with the identifier "A sound", was input with a frequency of 40%, and the telephone sound with the identifier "B sound" was input with a frequency of 35%. It also indicates that, after the voice "Good morning" was input, the non-speech sound with the identifier "C sound" was input with a frequency of 30%, the non-speech sound with the identifier "D sound" with a frequency of 25%, and the non-speech sound with the identifier "E sound" with a frequency of 10%.
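One way to picture the history database D4 is as a mapping keyed by a (character data, identifier) pair. The sketch below uses the values given for FIG. 6(a); the dictionary layout, the English renderings of the phrases, and the helper `frequencies_for` are assumptions made for illustration.

```python
# History database D4 as a dict: (text of the voice, sound identifier) -> percent.
# The numbers are those described for FIG. 6(a); the layout is assumed.
HISTORY_D4 = {
    ("Thank you for your hard work", "A sound"): 40.0,  # chime
    ("Thank you for your hard work", "B sound"): 35.0,  # telephone
    ("Good morning", "C sound"): 30.0,
    ("Good morning", "D sound"): 25.0,
    ("Good morning", "E sound"): 10.0,
}


def frequencies_for(history, text, sound_ids):
    """Return the stored frequency for each candidate identifier after `text`.

    Identifiers with no entry for this text are simply left out.
    """
    return {sid: history[(text, sid)]
            for sid in sound_ids
            if (text, sid) in history}
```

Keying on the pair keeps lookup to a single dictionary access per candidate identifier.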

When sound data is input from the input unit 11, the voice processing unit 12b outputs character data corresponding to the voice contained in the sound data. Using the voice database D1, the voice processing unit 12b derives the character data corresponding to the voice in the sound data together with a score representing the likelihood of each piece of character data, and outputs only the top-scoring character data as the voice processing result. Even for the top-scoring character data, the voice processing unit 12b need not output it if its score is below a predetermined value, because an unreliable result does not need to be reflected in subsequent processing.

When a non-speech sound extracted by the analysis unit 131 from sound data at or above the threshold volume is accumulated in memory, the search unit 132b searches the non-speech sound database D2 for the identifier of the non-speech sound stored in memory and outputs the identifiers found to the weighting processing unit 14b as the non-speech sound processing result. Specifically, for each non-speech sound in the non-speech sound database D2, the search unit 132b computes its similarity to the non-speech sound accumulated in memory and uses that similarity as a score. The search unit 132b then extracts every non-speech sound identifier that meets a predetermined extraction condition as a candidate, and forms the non-speech sound processing result by pairing each identifier with its score, as in the example shown in FIG. 6(b).

Possible conditions under which the search unit 132b extracts non-speech sound identifiers include: (1) extracting identifiers whose score is at or above a specific value (for example, 50 or more); (2) extracting identifiers whose score ranks within a predetermined number from the top (for example, third place or higher); and (3) extracting identifiers whose score is within a predetermined range of the top score (for example, no more than 5 below the top score).
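The three extraction conditions can be sketched as three small filters over a score table. The function names and the dictionary layout are assumptions; the default values 50, 3, and 5 are the examples given in the text.

```python
def _ranked(scores):
    """Return (identifier, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def by_min_score(scores, min_score=50):
    """Condition (1): keep identifiers whose score is at least min_score."""
    return [sid for sid, score in _ranked(scores) if score >= min_score]


def by_top_n(scores, n=3):
    """Condition (2): keep the n best-scoring identifiers (ties not expanded)."""
    return [sid for sid, _ in _ranked(scores)[:n]]


def by_margin(scores, margin=5):
    """Condition (3): keep identifiers within `margin` of the best score."""
    ranked = _ranked(scores)
    if not ranked:
        return []
    best = ranked[0][1]
    return [sid for sid, score in ranked if best - score <= margin]
```

Each filter returns identifiers in descending score order, so the first element is always the unweighted best match.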

If the search unit 132b cannot find an identifier for the non-speech sound in the non-speech sound database D2, it outputs no non-speech sound processing result and waits until a new non-speech sound is stored in memory. That is, even when a non-speech sound not contained in the non-speech sound database D2 is newly input, the non-speech sound processing unit 13b does not add it to the non-speech sound database D2, unlike the non-speech sound processing unit 13 described above with reference to FIG. 1.

The determination unit 141b waits for processing results from the voice processing unit 12b and the non-speech sound processing unit 13b. Unlike the determination unit 141, however, it starts timing from the moment a processing result is input from the voice processing unit 12b. When a processing result is then input from the non-speech sound processing unit 13b, the determination unit 141b determines whether this is within a predetermined time after the processing result was input from the voice processing unit 12b. If so, it outputs the character data that is the processing result of the voice processing unit 12b, together with the processing result of the non-speech sound processing unit 13b, to the selection unit 142b.

When the character data and the non-speech sound processing result are input from the determination unit 141b, the selection unit 142b reads the history database D4 from the storage device 20, selects the frequencies corresponding to the input character data and each non-speech sound identifier in the non-speech sound processing result, and outputs this selection result to the calculation unit 143b together with the character data and the non-speech sound processing result.

When the character data, the non-speech sound processing result, and the selection result are input from the selection unit 142b, the calculation unit 143b weights the scores in the non-speech sound processing result and determines the input non-speech sound. Specifically, the calculation unit 143b weights the score of each candidate identifier to compute a new score, determines the identifier with the highest new score to be the identifier of the input non-speech sound, and outputs it to the output unit 15. The calculation unit 143b also outputs the determined non-speech sound identifier, the character data, and the non-speech sound processing result to the update unit 144b.

For example, for each combination of the character data and a non-speech sound identifier, the calculation unit 143b computes a new weighted score using a formula such as equation (2), where r2 is a predetermined coefficient.

New score = score + frequency × r2 ... (2)
Here, when character data is associated with the non-speech sound identifier, that character data may be output. For example, when the character data "chime" is associated with the identifier "A sound", the calculation unit 143b outputs "chime" to the output unit 15, and when the character data "telephone" is associated with the identifier "B sound", the calculation unit 143b outputs "telephone" to the output unit 15.
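A small worked example of equation (2) may help. The frequencies 40 and 35 are the values given for FIG. 6(a); the candidate scores 60 and 63, the coefficient `r2 = 1.0`, and the names `rescore` and `LABELS` are made up for illustration and do not come from the source.

```python
# Labels associated with identifiers, as in the "chime" / "telephone" example.
LABELS = {"A sound": "chime", "B sound": "telephone"}


def rescore(candidates, freqs, r2):
    """Apply equation (2): new score = score + frequency * r2.

    candidates -- dict: identifier -> similarity score
    freqs      -- dict: identifier -> frequency for the preceding voice
    r2         -- predetermined coefficient (its value is assumed here)
    """
    return {sid: score + freqs.get(sid, 0.0) * r2
            for sid, score in candidates.items()}


# Hypothetical scores: "B sound" is slightly ahead before weighting.
candidates = {"A sound": 60.0, "B sound": 63.0}
# Frequencies after the voice "Thank you for your hard work" (FIG. 6(a)).
freqs = {"A sound": 40.0, "B sound": 35.0}

new_scores = rescore(candidates, freqs, r2=1.0)
best = max(new_scores, key=new_scores.get)
print(best, LABELS[best])
```

With these numbers the weighted scores are 100 for "A sound" and 98 for "B sound", so the history reverses the unweighted ranking and "chime" is output.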

The data output to the output unit 15 is not limited to character data as long as it is visual data. For example, when the output unit 15 consists of a plurality of lamps and each non-speech sound is associated with a lamp, the recognition result may be output by lighting the lamp corresponding to the input non-speech sound once its identifier has been determined.

When the determined non-speech sound identifier, the character data, and the non-speech sound processing result are input from the calculation unit 143b, the update unit 144b updates the frequencies in the history database D4. Specifically, the update unit 144b updates the history database D4 so that the frequency associated with the character data and the non-speech sound identifier determined by the calculation unit 143b is raised, and the frequencies associated with the character data and the other non-speech sound identifiers included in the non-speech sound processing result are lowered.

(Processing in the non-speech sound processing unit)
Processing in the non-speech sound processing unit 13b will be described using the flowchart shown in FIG. 7. In the non-speech sound processing unit 13b, the analysis unit 131 analyzes the sound data input via the input unit 11 and determines whether the volume of the input sound data is at or above a threshold (S30). When the volume is at or above the threshold (YES in S30), the analysis unit 131 determines whether the sound data is a non-speech sound (S31). When no sound data at or above the threshold volume is input (NO in S30), or when the sound data contains no non-speech sound (NO in S31), the process returns to step S30 and repeats.

When the sound data at or above the threshold volume contains a non-speech sound (YES in S31), the analysis unit 131 starts accumulating the non-speech sound in memory. The search unit 132b computes the similarity between each non-speech sound in the non-speech sound database D2 and the non-speech sound stored in memory, and compares the similarities as scores (S32).

When the non-speech sound database D2 contains a non-speech sound similar to the input one (YES in S33), the analysis unit 131 stops accumulating the input non-speech sound in memory, and the search unit 132b stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in memory (S34). Once accumulation and comparison have ended and the non-speech sound identifiers that satisfy the condition have been extracted, the search unit 132b outputs the non-speech sound processing result, containing all extracted identifiers and their scores, to the weighting processing unit 14b (S35) and discards the data accumulated in memory (S38). The process then returns to step S30 and repeats.

In contrast, when the non-speech sound database D2 contains no non-speech sound similar to the input one (NO in S33), the analysis unit 131 determines whether the volume of the subsequently input sound data is below the threshold, or whether the time spent accumulating the non-speech sound in memory has reached a predetermined time (S36). When the volume of the new sound data is at or above the threshold and the accumulation time is still within the predetermined time, the process returns to step S32 and repeats the accumulation and comparison.

When the volume of the newly input sound data falls below the threshold, or when the accumulation time reaches the predetermined time (YES in S36), the analysis unit 131 stops accumulating the input non-speech sound in memory, and the search unit 132b stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in memory (S37). The search unit 132b then discards the data accumulated in memory (S38).

(Processing in the weighting processing unit)
Processing in the weighting processing unit 14b will be described using the flowchart shown in FIG. 8. The determination unit 141b of the weighting processing unit 14b waits for processing results from the voice processing unit 12b and the non-speech sound processing unit 13b. When character data, which is the processing result of the voice processing unit 12b, is input (YES in S40), the determination unit 141b starts timing (S41).

After that, when a processing result is input from the non-speech sound processing unit 13b (YES in S42), the determination unit 141b determines whether this input falls within a predetermined time after the character data was input from the voice processing unit 12b (S43).

When it is within the predetermined time (YES in S43), the selection unit 142b determines whether the history database D4 contains frequencies associated with the input character data and the non-speech sound identifiers included in the non-speech sound processing result (S44).

When the history database D4 contains frequencies associated with the input character data and the non-speech sound identifiers included in the non-speech sound processing result (YES in S44), the calculation unit 143b performs the weighting process. That is, for the score of each candidate non-speech sound identifier in the non-speech sound processing result, the calculation unit 143b computes a new score using the frequency as the weight value (S45). The calculation unit 143b then determines the identifier with the highest new score to be the identifier of the input non-speech sound and outputs the character data corresponding to that identifier as the result (S46).

On the other hand, when it is not within the predetermined time (NO in S43), or when the history database D4 contains no frequency associated with the input character data and the non-speech sound identifiers included in the non-speech sound processing result (NO in S44), the selection unit 142b outputs, as the result data, the character data corresponding to the non-speech sound identifier with the highest score in the non-speech sound processing result (S46).

When the character data corresponding to the voice has been determined, the update unit 144b updates the frequencies in the history database D4 (S47).

As described above, the sound recognition device 1b according to the second embodiment uses a voice acquired within a certain time when converting a non-speech sound into visual data. This makes it possible to predict the non-speech sound acquired within a certain time after the voice, which improves the accuracy of non-speech sound conversion. In addition, the history database D4 used by the sound recognition device 1b is updated automatically through use, so the accuracy of voice recognition improves without registration work or similar effort by the user.

〈Third Embodiment〉
When converting a non-speech sound contained in sound data into visual data, the sound recognition device according to the third embodiment uses the recognition result of a non-speech sound that was input within a predetermined time before it (for example, up to 3 minutes earlier). Here too, any sound other than a voice uttered by a human is called a "non-speech sound". Using the recognition result of a previously input non-speech sound improves the accuracy of converting the non-speech sound into visual data.

As shown in FIG. 9, the sound recognition device 1c according to the third embodiment differs from the sound recognition device 1a according to the first embodiment, described above with reference to FIG. 1, in that it has no voice processing unit 12, has a non-speech sound processing unit 13c in place of the non-speech sound processing unit 13, and has a weighting processing unit 14c in place of the weighting processing unit 14. In addition, compared with the sound recognition device 1a, the storage device 20 of the sound recognition device 1c stores a sound recognition program P3 in place of the sound recognition program P1 and a history database D5 in place of the history database D3. That is, in the sound recognition device 1c, executing the sound recognition program P3 stored in the storage device 20 causes the CPU 10 to operate as the non-speech sound processing unit 13c and the weighting processing unit 14c.

The history database D5 is data that associates a pair of identifiers, that of a first non-speech sound and that of a second non-speech sound, with a weight value used to determine the non-speech sound. Here, the weight value is the frequency with which the second non-speech sound was input within a predetermined time (for example, within three minutes) after the first non-speech sound was input.

The history database D5 shown in FIG. 10(a) indicates that, after a non-speech sound with the identifier "F sound" is input, the frequency at which a chime sound, which is the non-speech sound with the identifier "A sound", is input is 40%, and the frequency at which a telephone sound with the identifier "B sound" is input is 35%. It also indicates that, after a non-speech sound with the identifier "D sound" is input, the frequency at which a non-speech sound with the identifier "C sound" is input is 30%, the frequency for the identifier "D sound" is 25%, and the frequency for the identifier "E sound" is 10%.

When non-speech sound extracted by the analysis unit 131 from sound data with a volume at or above the threshold is accumulated in the memory, the search unit 132c searches the non-speech sound database D2 for the identifier of the non-speech sound stored in the memory and outputs the retrieved identifier to the weighting processing unit 14c as the non-speech sound processing result. Specifically, for each non-speech sound in the non-speech sound database D2, the search unit 132c computes its similarity to the non-speech sound accumulated in the memory and uses this similarity as a score. The search unit 132c then extracts every non-speech sound identifier that meets a predetermined extraction condition as a candidate and, as in the example shown in FIG. 10(b), pairs each identifier with its score to form the non-speech sound processing result.

Possible conditions under which the search unit 132c extracts non-speech sound identifiers include: (1) extracting the identifiers whose score is at or above a specific value (for example, 50 or higher); (2) extracting the identifiers whose score ranks within a predetermined number from the top (for example, third place or higher); and (3) extracting the identifiers whose score lies within a predetermined range of the top score (for example, no more than 5 below the top score).
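The three extraction conditions can be illustrated with a minimal Python sketch. The function name `extract_candidates`, its parameters, and the example scores are assumptions introduced here for illustration; the specification does not prescribe a data layout or say how the conditions are selected.

```python
def extract_candidates(scores, min_score=None, top_n=None, margin=None):
    """Return (identifier, score) pairs meeting one extraction condition.

    scores: dict mapping a non-speech sound identifier to its similarity score.
    Set exactly one of min_score, top_n, or margin (hypothetical parameters).
    """
    # Rank candidates from the highest score to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    if min_score is not None:   # condition (1): score at or above a fixed value
        return [(i, s) for i, s in ranked if s >= min_score]
    if top_n is not None:       # condition (2): top-N scores
        return ranked[:top_n]
    if margin is not None:      # condition (3): within `margin` of the top score
        best = ranked[0][1]
        return [(i, s) for i, s in ranked if s >= best - margin]
    return ranked


# Hypothetical similarity scores for four database entries.
example = {"A": 70, "B": 66, "C": 52, "D": 40}
by_threshold = extract_candidates(example, min_score=50)
by_rank = extract_candidates(example, top_n=3)
by_margin = extract_candidates(example, margin=5)
```

With these example scores, conditions (1) and (2) both keep A, B, and C, while condition (3) keeps only A and B.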

On the other hand, if no non-speech sound identifier can be retrieved from the non-speech sound database D2, the search unit 132c outputs no non-speech sound processing result and waits until a new non-speech sound is stored in the memory. That is, even when a non-speech sound that is not in the non-speech sound database D2 is newly input, the non-speech sound processing unit 13c does not add it to the non-speech sound database D2, unlike the non-speech sound processing unit 13 described above with reference to FIG. 1.

The determination unit 141c waits for a non-speech sound processing result from the non-speech sound processing unit 13c and holds the identifier of the previously input non-speech sound in the memory. When a new non-speech sound processing result is input, the determination unit 141c determines whether it arrived within a predetermined time after the previous non-speech sound processing result was input from the non-speech sound processing unit 13c. If so, it outputs the non-speech sound identifier stored in the memory (the first non-speech sound identifier) and the newly input non-speech sound processing result to the selection unit 142c. The determination unit 141c also restarts timing from the moment the new non-speech sound processing result is input from the non-speech sound processing unit 13c.

When the previously input non-speech sound identifier and the non-speech sound processing result are input from the determination unit 141c, the selection unit 142c reads the history database D5 from the storage device 20, selects the frequency corresponding to the previously input non-speech sound identifier and each non-speech sound identifier included in the non-speech sound processing result (the second non-speech sound identifier), and outputs this selection result to the calculation unit 143c together with the previously input non-speech sound identifier and the non-speech sound processing result.

When the previously input non-speech sound identifier, the non-speech sound processing result, and the selection result are input from the selection unit 142c, the calculation unit 143c weights the scores included in the non-speech sound processing result and determines the input non-speech sound. Specifically, the calculation unit 143c weights the score of each candidate identifier to compute a new score, determines the identifier with the highest new score to be the identifier of the input non-speech sound, and outputs it to the output unit 15. The calculation unit 143c also outputs the determined non-speech sound identifier, the previously input non-speech sound identifier, and the non-speech sound processing result to the update unit 144c. In addition, the calculation unit 143c overwrites the first non-speech sound identifier stored in the memory, that is, the previously input non-speech sound identifier, with the determined non-speech sound identifier.

For example, for the previously input non-speech sound identifier and each non-speech sound identifier included in the non-speech sound processing result, the calculation unit 143c obtains a new weighted score using a formula such as Expression (3). In Expression (3), r3 is a predetermined coefficient.

New score = score + frequency × r3 ... (3)
When character data is associated with a non-speech sound identifier, this character data may be output instead. For example, when the character data "chime" is associated with the identifier "A sound", the calculation unit 143c outputs the character data "chime" to the output unit 15, and when the character data "telephone" is associated with the identifier "B sound", the calculation unit 143c outputs the character data "telephone" to the output unit 15.
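The weighting of Expression (3) can be sketched in Python as follows. The frequencies for the pairs ("F", "A") and ("F", "B") are the values given for FIG. 10(a); the coefficient value 0.5, the candidate scores, the use of percentages as the frequency unit, and the names `HISTORY_D5`, `R3`, and `decide` are assumptions made only for this illustration.

```python
R3 = 0.5  # hypothetical value of the predetermined coefficient r3

# (first identifier, second identifier) -> frequency in percent, per FIG. 10(a).
HISTORY_D5 = {
    ("F", "A"): 40.0,
    ("F", "B"): 35.0,
}


def decide(previous_id, candidates, history=HISTORY_D5, r3=R3):
    """Apply Expression (3) to each candidate and return the best identifier.

    candidates: dict mapping a candidate identifier to its similarity score.
    A pair missing from the history contributes no weight (an assumption).
    """
    weighted = {
        ident: score + history.get((previous_id, ident), 0.0) * r3
        for ident, score in candidates.items()
    }
    best = max(weighted, key=weighted.get)
    return best, weighted


# Hypothetical scores: "B" leads on raw similarity, "A" wins after weighting.
best_id, new_scores = decide("F", {"A": 60.0, "B": 62.0})
```

In this example the raw scores favor "B sound", but because "A sound" follows "F sound" more often in the history, the weighted score of "A sound" (80.0) exceeds that of "B sound" (79.5).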

When the non-speech sound identifier determined for the currently input non-speech sound, the previously input non-speech sound identifier, and the non-speech sound processing result are input from the calculation unit 143c, the update unit 144c updates the frequencies in the history database D5. Specifically, the update unit 144c updates the history database D5 so as to raise the frequency associated with the pair of the previously input non-speech sound identifier and the currently determined non-speech sound identifier, and to lower the frequency associated with each pair of the previously input non-speech sound identifier and a non-speech sound identifier that is included in the non-speech sound processing result but was not the one determined this time.
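The update rule can be sketched in Python as below. The specification states only that one frequency is raised and the others are lowered; the fixed step of 1.0, the clamping to the range 0 to 100, and the name `update_history` are assumptions for illustration.

```python
def update_history(history, previous_id, decided_id, candidate_ids, step=1.0):
    """Raise the frequency of the decided pair and lower the other candidates.

    history: dict mapping (first identifier, second identifier) to a frequency.
    step: hypothetical adjustment amount; the source does not specify one.
    """
    for ident in candidate_ids:
        key = (previous_id, ident)
        delta = step if ident == decided_id else -step
        # Keep the frequency within 0..100 percent (an assumption).
        history[key] = min(100.0, max(0.0, history.get(key, 0.0) + delta))
    return history


history = {("F", "A"): 40.0, ("F", "B"): 35.0}
update_history(history, "F", "A", ["A", "B"])
```

After this call, the frequency for the pair ("F", "A") rises to 41.0 and the frequency for ("F", "B") falls to 34.0.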

(Processing in the non-speech sound processing unit)
Processing in the non-speech sound processing unit 13c is described using the flowchart shown in FIG. 11. In the non-speech sound processing unit 13c, the analysis unit 131 analyzes the sound data input via the input unit 11 and determines whether the volume of the input sound data is at or above a threshold (S50). When the volume of the sound data is at or above the threshold (YES in S50), the analysis unit 131 determines whether the sound data is a non-speech sound (S51). If no sound data with a volume at or above the threshold is input (NO in S50), or if the sound data contains no non-speech sound (NO in S51), the process returns to step S50 and repeats.

When sound data with a volume at or above the threshold contains a non-speech sound (YES in S51), the analysis unit 131 starts accumulating the non-speech sound in the memory. The search unit 132c computes the similarity between each non-speech sound in the non-speech sound database D2 and the non-speech sound stored in the memory, and compares them using the similarity as a score (S52).

If the non-speech sound database D2 contains a non-speech sound similar to the input non-speech sound (YES in S53), the analysis unit 131 stops accumulating the input non-speech sound in the memory, and the search unit 132c stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in the memory (S54). Once accumulation and comparison have ended and the non-speech sound identifiers that satisfy the condition have been extracted, the search unit 132c outputs a non-speech sound processing result containing all the extracted identifiers and their scores to the weighting processing unit 14c (S55), discards the data accumulated in the memory (S58), and returns to step S50 to repeat the process.

In contrast, if the non-speech sound database D2 contains no non-speech sound similar to the input non-speech sound (NO in S53), the analysis unit 131 determines whether the volume of the subsequently input sound data is below the threshold or whether the non-speech sound has been accumulating in the memory for a predetermined time or longer (S56). If the volume of the new sound data is at or above the threshold, or if the accumulation time of the input non-speech sound is still within the predetermined time, the process returns to step S52 and the accumulation and comparison are repeated.

When the volume of the newly input sound data falls below the threshold, or when the accumulation time reaches the predetermined time or longer (YES in S56), the analysis unit 131 stops accumulating the input non-speech sound in the memory, and the search unit 132c stops comparing the non-speech sounds in the non-speech sound database D2 with the non-speech sound stored in the memory (S57). The search unit 132c then discards the data accumulated in the memory (S58).

(Processing in the weighting processing unit)
Processing in the weighting processing unit 14c is described using the flowchart shown in FIG. 12. The determination unit 141c of the weighting processing unit 14c waits for a non-speech sound processing result from the non-speech sound processing unit 13c, and when one is input (YES in S60), it starts timing (S61). The determination unit 141c also determines whether the input arrived within a predetermined time after the previous non-speech sound processing result was input (S62).

When it is within the predetermined time (YES in S62), the selection unit 142c determines whether the history database D5 contains a frequency associated with the previously input non-speech sound identifier and each non-speech sound identifier included in the newly input non-speech sound processing result (S63). When the history database D5 contains such a frequency (YES in S63), the calculation unit 143c executes the weighting process. That is, for the score of each candidate identifier included in the newly input non-speech sound processing result, the calculation unit 143c obtains a new score using the frequency as the weight value (S64). The calculation unit 143c then determines the identifier with the highest score to be the identifier of the input non-speech sound and outputs the character data corresponding to this non-speech sound as the result (S65).

On the other hand, when it is not within the predetermined time (NO in S62), or when the history database D5 contains no frequency associated with the previously input non-speech sound identifier and a non-speech sound identifier included in the newly input non-speech sound processing result (NO in S63), the selection unit 142c outputs, as the result data, the character data corresponding to the non-speech sound identifier with the highest score in the non-speech sound processing result (S65).

When the non-speech sound has been determined, the update unit 144c updates the frequencies in the history database D5 (S66).
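Steps S60 to S65 can be sketched in Python as below. The class name `WeightingProcessor`, the 180-second window (taken from the "three minutes" example), the coefficient value, and the injectable `clock` are assumptions for illustration; the history update of S66 is omitted to keep the sketch short.

```python
import time


class WeightingProcessor:
    """Hypothetical sketch of the weighting processing unit 14c (S60 to S65)."""

    def __init__(self, history, window_sec=180.0, r3=0.5, clock=time.monotonic):
        self.history = history        # (first id, second id) -> frequency
        self.window_sec = window_sec  # the "predetermined time" (assumed 3 min)
        self.r3 = r3                  # coefficient r3 of Expression (3), assumed
        self.clock = clock
        self.prev_id = None           # previously determined identifier
        self.prev_time = None         # when the previous result arrived

    def on_result(self, candidates):
        """Handle one non-speech sound processing result (identifier -> score)."""
        now = self.clock()                                        # S61
        within = (self.prev_time is not None
                  and now - self.prev_time <= self.window_sec)    # S62
        scores = dict(candidates)
        if within:
            for ident in scores:
                key = (self.prev_id, ident)
                if key in self.history:                           # S63
                    scores[ident] += self.history[key] * self.r3  # S64
        decided = max(scores, key=scores.get)                     # S65
        # Remember the decision and restart timing for the next input.
        self.prev_id, self.prev_time = decided, now
        return decided


# Usage with a fake clock so that the timing is deterministic.
fake_now = [0.0]
proc = WeightingProcessor({("F", "A"): 40.0, ("F", "B"): 35.0},
                          clock=lambda: fake_now[0])
first = proc.on_result({"F": 90.0})
fake_now[0] = 60.0     # one minute later: inside the window
second = proc.on_result({"A": 60.0, "B": 62.0})
fake_now[0] = 1000.0   # long after: outside the window
third = proc.on_result({"A": 60.0, "B": 62.0})
```

Inside the window, the history shifts the decision from the raw top candidate "B" to "A"; outside the window, the raw top candidate "B" is chosen.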

As described above, the sound recognition device 1c according to the third embodiment uses previously acquired non-speech sounds when converting a non-speech sound into visual data. This makes it possible to predict the non-speech sound that will be acquired within a certain time after a given non-speech sound is acquired, which improves the accuracy of non-speech sound conversion. In addition, because the history database D5 used by the sound recognition device 1c is updated automatically through use, the accuracy of voice recognition improves without any processing such as registration work by the user.

Although the present invention has been described in detail above using embodiments, the present invention is not limited to the embodiments described in this specification. The scope of the present invention is determined by the claims and by the range of equivalents to the claims.

DESCRIPTION OF SYMBOLS
1a, 1b, 1c ... Sound recognition device
10 ... CPU
11 ... Input unit
12, 12b ... Voice processing unit
13, 13b, 13c ... Non-speech sound processing unit
131 ... Analysis unit
132, 132b, 132c ... Search unit
133 ... Addition unit
14, 14b, 14c ... Weighting processing unit
141, 141b, 141c ... Determination unit
142, 142b, 142c ... Selection unit
143, 143b, 143c ... Calculation unit
144, 144b, 144c ... Update unit
15 ... Output unit
20 ... Storage device
D1 ... Voice database
D2 ... Non-speech sound database
D3, D4, D5 ... History database

Claims

A sound recognition apparatus using a history database in which an identifier for identifying a non-speech sound is associated with a frequency at which a predetermined type of voice has appeared within a predetermined time from the appearance of the non-speech sound, the sound recognition apparatus comprising:
a non-speech sound processing unit that, when sound data is input, determines whether or not the input sound data is a non-speech sound having no voice characteristics, and when the input signal is a non-speech sound, associates the identifier with the non-speech sound;
a voice processing unit that, when sound data is input, outputs candidates for the type of voice included in the input sound data; and
a calculation unit that, when an identifier is associated with the non-speech sound in the non-speech sound processing unit and candidates for the type of voice are input from the voice processing unit within a predetermined time after the identifier is associated with the non-speech sound, weights the candidates based on the history database and determines the type of the voice based on the weighted candidates.

The sound recognition apparatus according to claim 1, further comprising an update unit that updates the history database based on the type of the voice determined by the calculation unit.

The sound recognition apparatus according to claim 1, wherein the history database stores the type of voice as character data.

The sound recognition apparatus according to claim 1, wherein the non-speech sound processing unit comprises:
a non-speech sound database that stores the identifier of the non-speech sound and feature data by which the sound can be identified; and
a search unit that searches the feature data of the non-speech sounds for feature data matching the determined non-speech sound,
and wherein an identifier is associated with the non-speech sound based on a search result of the search unit.

When the input signal is a non-speech sound, the non-speech sound processing unit starts storing the non-speech sound in a memory, and the signal input by the search unit and the sound data If the search unit cannot retrieve the feature amount data that matches the determined non-speech sound within a predetermined period, the stored non-speech sound is associated with a new identifier of the non-speech sound. And having an additional part to add to the non-voice sound database;
The sound recognition apparatus according to claim 3.

A sound recognition apparatus using a history database that associates a type of voice with a frequency at which a predetermined non-speech sound has appeared within a predetermined time from the voice, and a non-speech sound database that associates a non-speech sound identifier with non-speech sound data, the sound recognition apparatus comprising:
a voice processing unit that, when sound data is input, determines whether or not the input sound data is a voice having voice characteristics, and when the input signal is a voice, associates a type with the voice;
a non-speech sound processing unit that, when sound data is input, determines whether a non-speech sound having no voice characteristics has been input, and if the determination result is positive, outputs candidates for the identifier of the non-speech sound based on the non-speech sound database; and
a calculation unit that, when the type is associated with the voice in the voice processing unit and candidates for the identifier of the non-speech sound are input from the non-speech sound processing unit within a predetermined time after the type is associated with the voice, weights the candidates based on the history database and identifies the non-speech sound based on the weighted candidates.

A sound recognition apparatus comprising:
a history database that associates a first non-speech sound identifier with a frequency at which a second non-speech sound identifier has appeared within a predetermined time from the first non-speech sound;
a non-speech sound processing unit that, when there is an input, determines whether or not the input signal is a non-speech sound having no voice characteristics, and when the input signal is a non-speech sound, associates the identifier with the non-speech sound; and
a calculation unit that, when an identifier is associated with the first non-speech sound, determines whether it is within a predetermined time since the identifier was associated with the second non-speech sound, and determines the identifier of the first non-speech sound based on the history database.