JP6517670B2

JP6517670B2 - Speech recognition apparatus, speech recognition method and speech recognition program

Info

Publication number: JP6517670B2
Application number: JP2015223492A
Authority: JP
Inventors: 健太小合; 明小島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-13
Filing date: 2015-11-13
Publication date: 2019-05-22
Anticipated expiration: 2035-11-13
Also published as: JP2017090789A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program.

Conventionally, Internet browsers, smart TVs, tablets, and smartphones have performed speech recognition using cloud-based speech recognition systems: the terminal (hereinafter also called the client) compresses the speech uttered by the user and sends it as-is to the cloud, the cloud-side system performs the recognition processing, and the terminal receives and uses the result over the network. Speech recognition is also performed with cloud-based systems that use DSR (Distributed Speech Recognition), in which the terminal does not send compressed speech but performs only the acoustic analysis and sends only the feature values, and the cloud side performs the recognition.

There is also a technique in which a Web application running in an HTML5 browser captures speech and sends the compressed speech to the cloud over WebSocket communication, the cloud side performs speech recognition, and the terminal uses the result (see, for example, Non-Patent Document 1).

These mechanisms use the abundant CPU resources and statistical data on the cloud side to remove background noise, separate speech segments from non-speech segments by sound pressure and acoustic models, perform phoneme analysis, and perform statistical analysis with a language model, and they achieve highly accurate speech recognition. In recent years, speech recognition methods have also been proposed that use deep learning with a combination of CPU and GPU resources to make each process more accurate, or that use deep learning to build an integrated model that converts acoustic features directly into sentences.

Furthermore, accurate and practical speech recognition is achieved by improving the language model day by day using big data obtained on the cloud side, for example by increasing the weights of words that users currently use frequently, such as new words, buzzwords, and words from current events.

Conventional techniques for improving speech recognition accuracy include, for example, (1) to (3) below, each of which has the problems described.

(1) Echo canceller
Echo and howling occur when a loop is formed in which a signal reproduced by a loudspeaker is picked up by a microphone and the picked-up acoustic signal is included in the signal reproduced by that loudspeaker. The echo canceller is a conventional technique for removing or reducing this echo and howling. It applies a filter that removes, from the acoustic signal obtained by the microphone, the acoustic signal output by the loudspeaker immediately before, thereby eliminating the effects of echo and howling within the device. However, when several independent devices are combined and each has its own loudspeaker, as in a home living room, an echo canceller cannot remove or reduce the voices coming from the other loudspeaker devices. Likewise, when a smartphone application or Web API sends speech directly to the cloud, the audio of a video is output directly by a separate video playback application, so the other application or API cannot easily determine which acoustic signal should be cancelled, and an echo canceller cannot remove or reduce it.
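As an illustration of the filtering described in (1), the following minimal sketch uses a normalized least-mean-squares (NLMS) adaptive filter, a common choice for echo cancellation. The source does not name a specific algorithm, so the technique, the function name `nlms_echo_cancel`, and all parameter values are assumptions. The filter needs the loudspeaker reference signal `ref`, which is exactly what is unavailable in the multi-device and multi-application cases described above.

```python
import random


def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic`.

    mic: microphone samples (echo plus any near-end speech)
    ref: samples that the device itself sent to its loudspeaker
    Returns the residual signal after echo removal.
    """
    w = [0.0] * taps  # estimated echo path (hypothetical 4-tap model)
    residual = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples, newest first.
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_estimate
        norm = eps + sum(xi * xi for xi in x)
        # NLMS update: step toward the weights that cancel the echo.
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual


# Toy setup: the echo is the reference delayed by one sample and halved.
rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [0.5 * ref[n - 1] if n >= 1 else 0.0 for n in range(2000)]
residual = nlms_echo_cancel(mic, ref)
```

In this toy setup the residual shrinks toward zero once the filter has adapted. If the sound came from a different device, `ref` would not be known and the same filter could not be applied.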

(2) Difference in sound pressure
A conventional technique for distinguishing background speech from speech uttered by a speaker toward the microphone uses the difference in sound pressure, that is, the power of the acoustic signal. However, when the playback volume of a television or radio is high, when the television or radio loudspeaker is close to the microphone, or when the speaker is far from the microphone, there is no large difference in sound pressure between the background speech and the speaker's speech, so the two sometimes cannot be distinguished well.
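The limitation in (2) can be seen in a minimal sketch of power-based discrimination. The threshold of -20 dB, the helper names, and the test tones standing in for speech are all illustrative assumptions and do not come from the source.

```python
import math


def rms_db(frame):
    """Level of a frame in dB relative to full scale (amplitude 1.0)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def is_foreground(frame, threshold_db=-20.0):
    """Treat a frame as the speaker's speech if it is loud enough."""
    return rms_db(frame) >= threshold_db


def tone(amplitude, samples=1600, freq=440.0, rate=16000):
    """Stand-in signal: a sine tone with the given peak amplitude."""
    return [amplitude * math.sin(2 * math.pi * freq * n / rate)
            for n in range(samples)]


near_speaker = tone(0.5)   # speaker close to the microphone
quiet_tv = tone(0.02)      # television at low volume
loud_tv = tone(0.4)        # television at high volume or near the microphone
```

With these values, `is_foreground` accepts `near_speaker` and rejects `quiet_tv`, but it also accepts `loud_tv`, so a loud background voice passes through to recognition as if it were the speaker's.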

(3) Separation by acoustic model
Conventional techniques for separating human speech from environmental noise such as air-conditioning or engine noise include sound source separation techniques that focus on differences between the generative models of the source sounds. In recent years, sound source separation based on advanced feature discrimination using deep learning has also been under development, and separation accuracy has been improving rapidly (see, for example, Non-Patent Document 2). However, live television and radio broadcasts, announcements, and the like are themselves "human speech", so when the audio device reproducing the background broadcast has high playback quality or the playback volume is high, the background speech and the speaker's speech sometimes cannot be separated sufficiently.

Non-Patent Document 1: [online], [retrieved October 13, 2015], Internet <http://www.ntt.co.jp/news2014/1409/140911a.html>
Non-Patent Document 2: [online], [retrieved October 15, 2015], Internet <http://www.ntt.co.jp/journal/1509/files/jn201509017.pdf>

However, when speech at a relatively high volume is present in the background of the speech that the speaker wants recognized, the acoustic signal to be recognized contains the acoustic signal of the background speech in addition to that of the speaker's speech. The speech recognition result therefore includes recognition results for the background acoustic signal that the speaker did not intend, and a result different from the one the speaker wants is obtained.

For example, when a television or radio broadcast is constantly playing in the background at a relatively high volume in the place where the speech to be recognized is uttered, voices (announcements or dialogue) from the television or radio loudspeakers can be unintentionally mixed in during the recognition period, so that the speech is not recognized correctly (Problem 1).

Also, when speech recognition is performed on acoustic signals picked up in environments where announcements or public-address broadcasts frequently play at high volume, such as outdoor stadiums, lecture halls, public viewing venues, and trains, the voices (announcements or dialogue) of those broadcasts can cut in during the recognition period, so that their recognition results are unintentionally mixed into the voice search results (Problem 2).

Also, when a recorded movie is being played or VOD is being played while downloading, and the video playback application and the speech recognition application run separately as multitasking processes, the voices (announcements or dialogue) in the video can be unintentionally mixed into the speech recognition result during the recognition period (Problem 3).

The present invention has been made in view of these circumstances. Its object is to provide a speech recognition apparatus, a speech recognition method, and a speech recognition program that can improve the recognition rate for a speaker's speech by recognizing an input acoustic signal containing the speaker's speech to obtain a recognition result, also recognizing another input acoustic signal to obtain a recognition result, and removing what the two results have in common from the recognition result of the acoustic signal containing the speaker's speech, thereby reducing the likelihood that unwanted recognition results are included.

To solve the above problems, the present invention recognizes an input acoustic signal containing the speaker's speech to obtain a recognition result, also recognizes another input acoustic signal to obtain a recognition result, and removes what these results have in common from the recognition result of the acoustic signal containing the speaker's speech.

One aspect of the present invention is a speech recognition apparatus comprising: speech recognition means for recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and second to N-th acoustic signals, which are acoustic signals picked up respectively by second to N-th sound pickup means (N being an integer of 2 or more), that is, one or more sound pickup means different from the first sound pickup means, to obtain first to N-th speech recognition results, which are the speech recognition results for the respective acoustic signals; and speech recognition result processing means for, when a partial speech recognition result included in at least one of the second to N-th speech recognition results and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.
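The result processing in this aspect can be sketched as follows. The `Segment` type, the function name `remove_shared_segments`, and the 0.5-second tolerance standing in for "substantially the same time" are illustrative assumptions; the text at this point does not fix a data format or a tolerance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One partial speech recognition result (hypothetical layout)."""
    text: str     # recognized content
    start: float  # start time of the corresponding audio, in seconds
    end: float    # end time of the corresponding audio, in seconds


def remove_shared_segments(first, others, tolerance=0.5):
    """Drop segments of `first` that also occur in any other result.

    first:  segments recognized from the first acoustic signal
    others: list of segment lists (second to N-th recognition results)
    A segment is dropped only if the content is identical AND the
    timing agrees within `tolerance` seconds (an assumed value).
    """
    def same_utterance(a, b):
        return (a.text == b.text
                and abs(a.start - b.start) <= tolerance
                and abs(a.end - b.end) <= tolerance)

    return [seg for seg in first
            if not any(same_utterance(seg, other)
                       for result in others
                       for other in result)]


first = [Segment("tomorrow's weather", 0.0, 1.2),
         Segment("find a ramen shop", 1.5, 2.8)]
second = [Segment("tomorrow's weather", 0.1, 1.3)]  # heard by another device
cleaned = remove_shared_segments(first, [second])
```

Here `cleaned` keeps only the "find a ramen shop" segment, because the other phrase was also heard at nearly the same time by a different device. Requiring both conditions means that a phrase the speaker happens to say at a different time from the background audio is not deleted.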

One aspect of the present invention is a speech recognition apparatus comprising: speech recognition means for recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and first to M-th broadcast acoustic signals (M being an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results, which are the speech recognition results for the respective acoustic signals; and speech recognition result processing means for, when a partial speech recognition result included in at least one of the first to M-th broadcast speech recognition results and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.

One aspect of the present invention is a speech recognition apparatus comprising: speech recognition means for recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and second to N-th acoustic signals, which are acoustic signals picked up respectively by second to N-th sound pickup means (N being an integer of 2 or more), that is, one or more sound pickup means different from the first sound pickup means, to obtain first to N-th speech recognition results, which are the speech recognition results for the respective acoustic signals; and speech recognition result processing means for, when a partial speech recognition result included in at least one speech recognition result that, among the second to N-th speech recognition results, corresponds to an acoustic signal picked up in the vicinity of the first sound pickup means and a partial speech recognition result included in the first speech recognition result have the same content and are partial speech recognition results at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.
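This aspect restricts the comparison to recognition results from sound pickup means near the first one. A minimal sketch of that selection step follows, assuming that each device reports a planar position in metres and that "vicinity" means within a fixed radius. Both assumptions, the 10-metre default, and the name `nearby_results` are hypothetical and not taken from the source.

```python
import math


def nearby_results(first_position, others, radius_m=10.0):
    """Select recognition results picked up near the first device.

    first_position: (x, y) of the first sound pickup means, in metres
    others: list of (position, recognition_result) pairs for the
            second to N-th sound pickup means
    """
    return [result for position, result in others
            if math.dist(first_position, position) <= radius_m]


others = [((3.0, 4.0), "result from the device in the same room"),
          ((30.0, 40.0), "result from a device far away")]
selected = nearby_results((0.0, 0.0), others)
```

Only the results in `selected` would then be compared against the first speech recognition result, which avoids deleting a phrase merely because a distant device happened to record the same words at the same moment.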

One aspect of the present invention is a speech recognition apparatus comprising: speech recognition means for recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and first to M-th broadcast acoustic signals (M being an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results, which are the speech recognition results for the respective acoustic signals; and speech recognition result processing means for, when a partial speech recognition result included in at least one speech recognition result that, among the first to M-th broadcast speech recognition results, belongs to a broadcast whose reception target area contains the first sound pickup means and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.
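The restriction to broadcasts receivable at the location of the first sound pickup means can be sketched as a simple lookup. The area codes, the dictionary layout, and the name `receivable_broadcasts` are illustrative assumptions; the source does not specify here how reception target areas are represented.

```python
def receivable_broadcasts(device_area, broadcasts):
    """Keep broadcast recognition results whose target area covers the device.

    device_area: area code where the first sound pickup means is located
    broadcasts:  dict mapping a channel name to a pair
                 (set of area codes served, recognition result)
    """
    return {channel: result
            for channel, (areas, result) in broadcasts.items()
            if device_area in areas}


broadcasts = {
    "channel-A": ({"tokyo", "kanagawa"}, "result for channel A"),
    "channel-B": ({"osaka"}, "result for channel B"),
}
candidates = receivable_broadcasts("tokyo", broadcasts)
```

In this example only channel A is compared against the first speech recognition result, since a device in the assumed "tokyo" area could not be picking up channel B.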

One aspect of the present invention is a speech recognition method comprising: a speech recognition step of recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and second to N-th acoustic signals, which are acoustic signals picked up respectively by second to N-th sound pickup means (N being an integer of 2 or more), that is, one or more sound pickup means different from the first sound pickup means, to obtain first to N-th speech recognition results, which are the speech recognition results for the respective acoustic signals; and a speech recognition result processing step of, when a partial speech recognition result included in at least one of the second to N-th speech recognition results and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.

One aspect of the present invention is a speech recognition method comprising: a speech recognition step of recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and first to M-th broadcast acoustic signals (M being an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results, which are the speech recognition results for the respective acoustic signals; and a speech recognition result processing step of, when a partial speech recognition result included in at least one of the first to M-th broadcast speech recognition results and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.

One aspect of the present invention is a speech recognition method comprising: a speech recognition step of recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and second to N-th acoustic signals, which are acoustic signals picked up respectively by second to N-th sound pickup means (N being an integer of 2 or more), that is, one or more sound pickup means different from the first sound pickup means, to obtain first to N-th speech recognition results, which are the speech recognition results for the respective acoustic signals; and a speech recognition result processing step of, when a partial speech recognition result included in at least one speech recognition result that, among the second to N-th speech recognition results, corresponds to an acoustic signal picked up in the vicinity of the first sound pickup means and a partial speech recognition result included in the first speech recognition result have the same content and are partial speech recognition results at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.

One aspect of the present invention is a speech recognition method comprising: a speech recognition step of recognizing each of a first acoustic signal, which is an acoustic signal picked up by first sound pickup means and containing the speech of a first speaker, and first to M-th broadcast acoustic signals (M being an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results, which are the speech recognition results for the respective acoustic signals; and a speech recognition result processing step of, when a partial speech recognition result included in at least one speech recognition result that, among the first to M-th broadcast speech recognition results, belongs to a broadcast whose reception target area contains the first sound pickup means and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, obtaining, as the speech recognition result of the first speaker, the first speech recognition result with that partial speech recognition result deleted.

One aspect of the present invention is a speech recognition program for causing a computer to operate as the speech recognition apparatus described above.

According to the present invention, the portion shared with the speech recognition result of another input acoustic signal is removed from the speech recognition result of the input acoustic signal containing the speaker's speech. This reduces the likelihood that unwanted recognition results are included and thereby improves the recognition rate for the speaker's speech.

Brief description of the drawings

A block diagram showing the configuration of the entire system including the speech recognition system of the first and third embodiments of the present invention.
A block diagram showing the configuration formed by one client-side device and the cloud-side device of the speech recognition system of the first embodiment.
A diagram showing the processing flow of the speech recognition result processing unit in the first operation example of the first embodiment.
A diagram explaining a specific example of the speech recognition result processing unit in the first operation example of the first embodiment.
A diagram explaining a specific example of the speech recognition result processing unit in the second operation example of the first embodiment.
A diagram explaining a specific example of the speech recognition result processing unit in the third operation example of the first embodiment.
A diagram showing the processing flow of the speech recognition result processing unit in the fourth operation example of the first embodiment.
A diagram showing the processing flow of a modification of the speech recognition result processing unit in the fourth operation example of the first embodiment.
A diagram showing the processing flow of the speech recognition result processing unit in the sixth operation example of the first embodiment.
A block diagram showing the configuration of the entire system including the speech recognition system of the second embodiment.
A block diagram showing the portion formed by one client-side device and the cloud-side device of the speech recognition system of the second embodiment.
A diagram showing the processing flow of the speech recognition result processing unit in the first operation example of the second embodiment.
A diagram explaining a specific example of the speech recognition result processing unit in the first operation example of the second embodiment.
A diagram showing the processing flow of the speech recognition result processing unit in the third operation example of the second embodiment.
A block diagram showing the portion formed by one client-side device and the cloud-side device of the third embodiment.
A diagram showing the processing flow of the speech recognition result processing unit of the third embodiment.
A block diagram showing the portion formed by one client-side device and the cloud-side device of the fourth embodiment.
A block diagram showing the portion formed by one client-side device and the cloud-side device of a modification of the fourth embodiment.
A diagram explaining a specific example of the speech recognition result processing unit of the modification of the fourth embodiment.
A block diagram showing the configuration of the speech recognition apparatus of the present invention.

A speech recognition system according to an embodiment of the present invention is described below with reference to the drawings.

First, the usage scenario assumed by the present invention is described. The invention is intended for cases in which the speech recognition microphone is relatively far from the speaker rather than close, specifically at a distance of about 45 cm to several meters.

The assumed surroundings are cases in which voices from television, radio, announcements, and the like irregularly enter the acoustic signal input to the client-side device during the recognition period, for example because voices (announcements or dialogue) of television or radio broadcasts outside this system play at intervals of a few seconds to a dozen or so seconds, or because announcements play at irregular times.

<First Embodiment>
First, as the first embodiment of the present invention, a form is described in which the portion shared with the speech recognition result of an acoustic signal input to another client-side device is removed from the speech recognition result of an acoustic signal, containing the speaker's speech, that is input to a client-side device. FIG. 1 is a block diagram showing the configuration of the speech recognition system in the first embodiment. In this figure, reference numeral 100 denotes the speech recognition system, 1_1 to 1_N denote a plurality of client-side devices (N devices, N being an integer of 2 or more), and 2 denotes the cloud-side device. The client-side devices 1_1 to 1_N are devices used by users, for example smartphones, smart TVs, HDMI (registered trademark) dongle STBs (STB stands for set-top box), small STBs, advanced home appliance devices, and game consoles. The cloud-side device 2 is connected to the client-side devices 1_1 to 1_N via a network 3. The network 3 allows the client-side devices 1_1 to 1_N and the cloud-side device 2 of the speech recognition system 100 to exchange information according to Internet connection protocols, and is, for example, the Internet. Because the minimum configuration is the same for all client-side devices 1_1 to 1_N, the following description uses FIG. 2, a block diagram detailing the portion of the speech recognition system 100 of the first embodiment formed by the client-side device 1_1 and the cloud-side device 2.

The client-side device 1_1 includes at least a voice input unit 11_1, a user information acquisition unit 12_1, a voice sending unit 13_1, a search result receiving unit 14_1, and a screen display unit 15_1.

The cloud-side device 2 includes at least a voice receiving unit 21, a speech recognition unit 22, a speech recognition result holding unit 23, a speech recognition result processing unit 24, a search processing unit 25, and a search result transmission unit 26.

Next, the operation of the speech recognition system of the first embodiment is described.

[[First Operation Example of the First Embodiment]]
As a first operation example, consider the case in which the first to N-th users each use one of the client-side devices 1_1 to 1_N, the first user speaks to the client-side device 1_1 a sentence for which search results are wanted, and the search results corresponding to that utterance are displayed on the screen display unit 15_1 of the client-side device 1_1. As a more specific case, assume N = 3: the first user is in a place where television or radio sound is playing or where announcements, such as those in a station or department store, are played at irregular intervals; the second user is in a place where the same television, radio, or announcements as for the first user are playing; and the third user is in a place where those are not playing and a different environmental sound is present.

The voice input unit 11_1 of the client-side device 1_1 acquires the acoustic signal emitted around the client-side device 1_1 and outputs it to the voice sending unit 13_1. When the first user speaks to the client-side device 1_1 a sentence for which search results are wanted, the unit acquires and outputs an acoustic signal that includes the first user's voice. When environmental sound such as television, radio, or announcements is present around the client-side device 1_1, the unit acquires and outputs an acoustic signal that includes that environmental sound. In the specific case above, therefore, it acquires and outputs an acoustic signal made up of the first user's voice and environmental sound such as television, radio, or announcements.

The user information acquisition unit 12_1 of the client-side device 1_1 obtains time information indicating when the voice input unit 11_1 acquired the acoustic signal, and outputs that time information together with identification information that identifies the client-side device 1_1 (hereinafter, "ID") to the voice sending unit 13_1 as user information. The time information is, for example, absolute time. If the client-side device 1_1 is a smartphone with a built-in GPS receiver, the absolute time received by the GPS receiver when the smartphone's microphone (the voice input unit 11_1) acquired the acoustic signal may be used. Time information obtained from a base station of a mobile carrier network or from a communication server, or the time of the local clock kept by the smartphone's OS, may also be used. In the first embodiment, the speech recognition result processing unit 24 uses the time information to determine whether the acoustic signals acquired by the different client-side devices were emitted at the same time, so the time need not be absolute time as long as it is common to the client-side devices.
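As an illustration only, the user information described above could be modeled as a small record that pairs the device ID with the acquisition time. The names `UserInfo` and `make_user_info`, and the default use of the OS clock via `time.time`, are assumptions made for this sketch and do not appear in the specification.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class UserInfo:
    device_id: str    # ID that identifies the client-side device
    timestamp: float  # time at which the acoustic signal was acquired, in seconds


def make_user_info(device_id: str, clock: Callable[[], float] = time.time) -> UserInfo:
    """Pair the device ID with the acquisition time read from `clock`.

    The clock may be GPS time, carrier-network time, or the local OS clock.
    The only requirement is that all client-side devices share one time base.
    """
    return UserInfo(device_id=device_id, timestamp=clock())
```

Passing the clock as a parameter keeps the sketch neutral about which of the time sources named in the text is used.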

The voice sending unit 13_1 of the client-side device 1_1 sends a transmission signal containing the acoustic signal output by the voice input unit 11_1 and the user information output by the user information acquisition unit 12_1 to the cloud-side device 2. More precisely, the voice sending unit 13_1 sends the transmission signal to the network 3 so that it is conveyed to the cloud-side device 2. The transmission signal is sent at predetermined time intervals, for example every 10 ms. The voice sending unit 13_1 may instead encode the acoustic signal with a predetermined encoding method and send a transmission signal containing the resulting code string and the user information. It may also perform part of the speech recognition processing, such as feature extraction, on the acoustic signal and send a transmission signal containing the resulting features and the user information. Many known techniques exist for sending a transmission signal to another device over a network and for distributing speech recognition processing between a client-side device and a cloud-side device, so a detailed description is omitted.
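The interval-based sending described above can be sketched as follows. This is a minimal illustration, assuming raw samples at a 16 kHz sampling rate and a dictionary as the user information; the function name `frame_transmission_signals`, the sampling rate, and the payload layout are all hypothetical. Only the 10 ms interval comes from the text.

```python
from typing import Iterator, List, Tuple


def frame_transmission_signals(
    samples: List[int],
    device_id: str,
    start_time: float,
    sample_rate: int = 16000,  # assumed sampling rate, not given in the text
    frame_ms: int = 10,        # interval given as an example in the text
) -> Iterator[Tuple[dict, List[int]]]:
    """Split the acoustic signal into fixed-length frames, each with user information."""
    frame_len = sample_rate * frame_ms // 1000
    for offset in range(0, len(samples), frame_len):
        # Each frame carries the device ID and the time of its first sample.
        user_info = {"id": device_id, "time": start_time + offset / sample_rate}
        yield user_info, samples[offset:offset + frame_len]
```

The last frame may be shorter than the others when the signal length is not a multiple of the frame length.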

The voice input units 11_2 to 11_N, user information acquisition units 12_2 to 12_N, and voice sending units 13_2 to 13_N of the client-side devices 1_2 to 1_N operate in the same way as the voice input unit 11_1, user information acquisition unit 12_1, and voice sending unit 13_1 of the client-side device 1_1, respectively. In the specific case above, therefore, the client-side device 1_2 acquires an acoustic signal made up of the second user's voice and the same environmental sound as for the first user, and sends a transmission signal containing that acoustic signal and user information. The client-side device 1_3 acquires an acoustic signal made up of the third user's voice and an environmental sound different from that of the first user, and sends a transmission signal containing that acoustic signal and user information.

The voice receiving unit 21 of the cloud-side device 2 receives the transmission signals sent by the voice sending units 13_1 to 13_N of the client-side devices 1_1 to 1_N, extracts the pair of acoustic signal and user information from each received transmission signal, and outputs it. Transmission signals are received at predetermined time intervals, for example every 10 ms. When the voice sending units 13_1 to 13_N have encoded the acoustic signal with a predetermined encoding method and sent a transmission signal containing the resulting code string, the voice receiving unit 21 decodes the code string in the received transmission signal with the decoding method corresponding to that encoding method to obtain the acoustic signal, and outputs the pair of the obtained acoustic signal and the user information. When the voice sending units 13_1 to 13_N have performed part of the speech recognition processing, such as feature extraction, and sent a transmission signal containing the resulting features and the user information, the voice receiving unit 21 extracts the features instead of an acoustic signal and outputs the pair of the extracted features and the user information.

The speech recognition unit 22 of the cloud-side device 2 performs speech recognition processing on each acoustic signal output by the voice receiving unit 21, obtains a speech recognition result, which is a character string corresponding to the speech contained in the acoustic signal, and outputs a set consisting of the speech recognition result, the time information corresponding to it, and the ID corresponding to it. If the time information is missing or has an inappropriate value, the received time information may be ignored and the result may instead be managed using the approximate time at which the server received the data.
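The fallback to the server's receive time might look like the sketch below. The specification does not say what counts as an "inappropriate value"; treating a timestamp that differs from the receive time by more than `max_skew_s` seconds as inappropriate is an assumption of this sketch, as are the function and parameter names.

```python
from typing import Optional


def effective_time(
    client_time: Optional[float],
    server_receive_time: float,
    max_skew_s: float = 60.0,  # hypothetical threshold for an "inappropriate" value
) -> float:
    """Return the time to manage a result under, falling back to the receive time."""
    if client_time is None:
        # No time information was sent by the client-side device.
        return server_receive_time
    if abs(client_time - server_receive_time) > max_skew_s:
        # The client time is implausibly far from when the server got the data.
        return server_receive_time
    return client_time
```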

Speech recognition processing is performed on each predetermined unit of the acoustic signal. For example, the speech recognition unit 22 sequentially stores the acoustic signal output by the voice receiving unit 21 in a storage unit (not shown) within the speech recognition unit 22, performs utterance section detection on the stored acoustic signal to obtain a unit of acoustic signal for each utterance section, and performs speech recognition processing on each such unit to obtain a speech recognition result, a character string, for it. Alternatively, the speech recognition unit 22 may perform speech recognition processing on a unit of acoustic signal spanning a plurality of utterance sections and obtain a speech recognition result for that unit.

In the specific case above, therefore, the speech recognition unit 22 obtains, as the speech recognition result for the acoustic signal of the client-side device 1_1, a character string made up of the recognition results of the first user's voice and of environmental sound such as television, radio, or announcements. As the speech recognition result for the acoustic signal of the client-side device 1_2, it obtains a character string made up of the recognition results of the second user's voice and of the same environmental sound as for the first user. As the speech recognition result for the acoustic signal of the client-side device 1_3, it obtains a character string made up of the recognition results of the third user's voice and of an environmental sound different from that of the first user.

Known speech recognition techniques may be used for the speech recognition processing. That is, known approaches may be used for details such as the acoustic model, the language model, and the way features are extracted. An integrated speech recognition process based on deep learning may also be used. In these processes, an analysis unit (not shown) or the like may first analyze the acoustic signal so that it is divided into prescribed sections before recognition is performed. When the voice receiving unit 21 outputs features instead of an acoustic signal, the speech recognition unit 22 performs speech recognition processing using those features. In any case, many known techniques exist for speech recognition processing itself, so a detailed description is omitted.

Because speech recognition processing is performed on each predetermined unit of the acoustic signal, the character string of a speech recognition result corresponds to one such unit. The speech recognition unit 22 therefore obtains, for example, a representative time from the time information contained in the multiple pieces of user information corresponding to the unit of acoustic signal that was recognized, and pairs time information representing that representative time with the speech recognition result. The representative time need only represent the time at which the acoustic signal corresponding to the character string was emitted; for example, the start time of the character string may be used. A single speech recognition result may also have multiple representative times. For example, for each substring in the result, such as a word, the start time at which that word was uttered may be used as a representative time.
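One way to picture a speech recognition result with its representative times is the record below. It is a sketch only: the class name `RecognitionResult`, the word-level granularity, and the use of seconds are assumptions, and the recognizer that would produce the word start times is outside its scope.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RecognitionResult:
    device_id: str                  # ID taken from the user information
    words: List[Tuple[str, float]]  # (substring, start time in seconds), non-empty

    @property
    def text(self) -> str:
        """The recognized character string as a whole."""
        return " ".join(word for word, _ in self.words)

    @property
    def representative_time(self) -> float:
        """Start time of the whole string: the earliest substring start time."""
        return min(start for _, start in self.words)
```

Keeping a start time per substring covers both options in the text: a single representative time for the whole result, or one per word.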

The ID paired with a speech recognition result is the ID corresponding to that result, that is, the ID contained in the user information that was input from the voice receiving unit 21 together with the acoustic signal from which the result was obtained.

The speech recognition result holding unit 23 of the cloud-side device 2 stores the sets of speech recognition result, time information, and ID output by the speech recognition unit 22. The stored content is used by the speech recognition result processing unit 24 both to determine whether there are substrings, such as words, that share the same time and, when there are, to remove them from the speech recognition result to obtain a processed speech recognition result. The speech recognition result holding unit 23 therefore keeps the sets output by the speech recognition unit 22 only for as long as the processing of the speech recognition result processing unit 24 requires. Stored content may be deleted once the processing that uses it has finished.
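A minimal in-memory sketch of such a holding unit is shown below, assuming results are kept per device as (substring, time) pairs and discarded by a time cutoff. The class name `ResultStore`, its methods, and the cutoff-based deletion policy are hypothetical; the specification only requires that content be kept for as long as the processing needs it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ResultStore:
    """Holds (substring, time) pairs per device ID for a limited period."""

    def __init__(self) -> None:
        self._by_device: Dict[str, List[Tuple[str, float]]] = defaultdict(list)

    def put(self, device_id: str, substring: str, time_s: float) -> None:
        self._by_device[device_id].append((substring, time_s))

    def get(self, device_id: str) -> List[Tuple[str, float]]:
        # Return a copy so callers cannot modify the stored content.
        return list(self._by_device.get(device_id, []))

    def discard_before(self, cutoff_s: float) -> None:
        """Delete entries older than the cutoff, i.e. no longer needed for matching."""
        for device_id in list(self._by_device):
            kept = [(s, t) for s, t in self._by_device[device_id] if t >= cutoff_s]
            if kept:
                self._by_device[device_id] = kept
            else:
                del self._by_device[device_id]
```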

For at least one set of speech recognition result, time information, and ID stored in the speech recognition result holding unit 23, the speech recognition result processing unit 24 of the cloud-side device 2 checks each substring contained in the character string of that result. If another set of speech recognition result, time information, and ID contains the same pair of substring and time, the matching substring is removed; the result after removal is the processed speech recognition result, which is output paired with the ID. A processed speech recognition result is therefore output for at least one client-side device. In judging whether times match, times may be treated as the same after allowing for errors in the absolute time of each client-side device and timing errors in speech recognition processing. In other words, for at least one client-side device being processed, if another client-side device has the same substring (a common substring) at substantially the same time, the processed speech recognition result is obtained by removing the common substring from the character string of the speech recognition result of the device being processed.
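The core comparison between one device's result and another's can be sketched as below. This is an illustration under assumptions: substrings are represented as (text, start time) pairs, exact text equality is used for the substring match, and `tolerance_s` stands in for "substantially the same time". The function name and the 2-second default are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float]  # (substring, start time in seconds)


def remove_common(target: List[Word], other: List[Word], tolerance_s: float = 2.0) -> List[Word]:
    """Drop every substring of `target` that `other` also contains at about the same time."""

    def shared(word: Word) -> bool:
        text, start = word
        return any(
            text == other_text and abs(start - other_start) <= tolerance_s
            for other_text, other_start in other
        )

    return [word for word in target if not shared(word)]
```

A substring that appears in both results but at clearly different times is kept, which is what separates shared environmental sound from two users who happen to say the same word.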

This processing may be performed against all of the other client-side devices or against at least one of them. In the latter case, only the speech recognition results needed by the speech recognition result processing unit 24 may be obtained in the preceding stages. That is, the operations of the voice receiving unit 21, the speech recognition unit 22, and the speech recognition result holding unit 23 that would produce speech recognition results not needed by the speech recognition result processing unit 24 may be omitted.

Here, for the N = 3 example above, a case in which the at least one client-side device is the client-side device 1_1 is described with reference to FIGS. 3 and 4. FIG. 3 illustrates the processing flow of the speech recognition result processing unit 24 in this operation example, and FIG. 4 illustrates an example of the speech recognition results and the processed speech recognition result in this example. In the example of FIG. 3, every client-side device other than the client-side device 1_1 is examined in turn, starting with the client-side device 1_2: the unit searches for pairs of substring and time that match those in the speech recognition result of the client-side device 1_1, and whenever a match is found it removes the common substring from the character string of the speech recognition result of the client-side device 1_1.

The speech recognition result processing unit 24 first reads the set of speech recognition result, time information, and ID of the client-side device 1_1 from the speech recognition result holding unit 23 (step S241). It then sets x to the initial value 2 (step S242). Next, it reads the set of speech recognition result, time information, and ID of the client-side device 1_x from the speech recognition result holding unit 23 (step S243). It then searches for substrings whose text and time both match between the set for the client-side device 1_1 and the set for the client-side device 1_x (step S244). Because the times may contain errors, the match is judged on approximate time, for example within a few seconds. Unless otherwise noted, "time match" in the flow descriptions below is treated in the same way. If step S244 finds substrings whose text and time match, the unit removes all of them from the character string of the speech recognition result of the client-side device 1_1 (step S245). If step S244 finds none, processing proceeds to step S246. The unit then determines whether any client-side device remains that has not yet been subjected to steps S243 to S245 (step S246). If one remains, x is replaced with x + 1 (step S247) and the processing from step S243 is repeated. If none remains, the character string of the speech recognition result of the client-side device 1_1 as last processed in step S245 is output, paired with the ID, as the processed speech recognition result of the client-side device 1_1 (step S248).

Next, an example of the speech recognition results and the processed speech recognition result in this example is described with reference to FIG. 4. The horizontal axis of FIG. 4 is time. The three items above the arrow are the speech recognition results of the client-side devices 1_1 to 1_3, which are the inputs to the speech recognition result processing unit 24, and the one item below the arrow is the processed speech recognition result of the client-side device 1_1. The speech recognition result of the client-side device 1_1 contains the substrings for utterance 1 and utterance 2, spoken by the first user (the user of the client-side device 1_1), and the substrings for TV audio 1 and TV audio 2, emitted by a television near the client-side device 1_1. The speech recognition result of the client-side device 1_2 contains the substrings for utterance 3 and utterance 4, spoken by the second user (the user of the client-side device 1_2), and the substrings for TV audio 1 and TV audio 2, emitted by a television near the client-side device 1_2. The speech recognition result of the client-side device 1_3 contains the substrings for utterance 1 and utterance 2, spoken by the third user (the user of the client-side device 1_3), and the substrings for TV audio 3 and TV audio 4, emitted by a television near the client-side device 1_3. Here, the substrings for utterance 1 and utterance 2 spoken by the first user are assumed to be identical to those for utterance 1 and utterance 2 spoken by the third user, respectively.
To make the figure easier to understand, example strings for the utterances and the TV audio are shown above the speech recognition results in FIG. 4. Utterance examples are set in regular type and TV audio examples in italics. In the figure the substrings are written word by word, but in practice they may be shorter, for example a single phoneme.

First, steps S244 and S245 of FIG. 3 for x = 2 are described. Of the substrings in the speech recognition result of the client-side device 1_1, those for TV audio 1 and TV audio 2 also appear at the same times in the speech recognition result of the client-side device 1_2, so they are removed from the speech recognition result of the client-side device 1_1. The substrings for utterance 1 and utterance 2 do not appear at the same times in the speech recognition result of the client-side device 1_2, so they are not removed. The substrings for utterance 1 and utterance 2 thus remain in the speech recognition result of the client-side device 1_1, and processing proceeds to x = 3.

Next, steps S244 and S245 of FIG. 3 for x = 3 are described. Of the substrings in the speech recognition result of the client-side device 1_1, those for utterance 1 and utterance 2 do appear in the speech recognition result of the client-side device 1_3, but not at the same times, so they are not removed from the speech recognition result of the client-side device 1_1. The substrings for utterance 1 and utterance 2 thus remain in the speech recognition result of the client-side device 1_1.

When the processing of steps S244 and S245 in FIG. 3 for x = 3 is finished, step S246 determines that no client-side device remains for which steps S243 to S245 have not been completed. In step S248, the speech recognition result in which the partial character strings for utterance 1 and utterance 2 remain is output as the processed speech recognition result.
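The walk-through above for x = 2 and x = 3 can be sketched as follows. This is a minimal illustration, not the implementation of the speech recognition result processing unit 24: the representation of a recognition result as a list of (partial character string, time) pairs, the names `Segment`, `same_time` and `remove_common`, the 0.5-second tolerance and all sample strings and times are assumptions introduced only for this example.

```python
from typing import List, Tuple

# Assumed representation: (partial character string, time in seconds).
Segment = Tuple[str, float]


def same_time(t1: float, t2: float, tolerance: float = 0.5) -> bool:
    """Treat two times as 'substantially the same' within an assumed tolerance."""
    return abs(t1 - t2) <= tolerance


def remove_common(target: List[Segment], others: List[List[Segment]]) -> List[Segment]:
    """Visit the other devices one by one (x = 2, 3, ...) and drop every segment
    of the target that another device also recognized at substantially the same time."""
    remaining = list(target)
    for other in others:
        remaining = [
            (text, t)
            for (text, t) in remaining
            if not any(text == o_text and same_time(t, o_t) for (o_text, o_t) in other)
        ]
    return remaining


# Hypothetical results for client-side devices 1_1, 1_2 and 1_3.
device_1 = [("utterance 1", 0.0), ("tv audio 1", 2.0), ("utterance 2", 4.0), ("tv audio 2", 6.0)]
device_2 = [("tv audio 1", 2.0), ("utterance 3", 3.0), ("tv audio 2", 6.0)]
# Device 1_3 contains the same strings as utterances 1 and 2, but at different times.
device_3 = [("utterance 2", 9.0), ("utterance 1", 11.0)]

processed = remove_common(device_1, [device_2, device_3])
print(processed)
```

With these sample values, the two TV audio segments are dropped when device 1_2 is visited, while utterance 1 and utterance 2 survive the visit to device 1_3 because their times do not coincide.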

The search processing unit 25 of the cloud-side device 2 uses, as a search query, the processed speech recognition result contained in each of the at least one pair of processed speech recognition result and ID output by the speech recognition result processing unit 24. It runs a search against a predetermined search database or a predetermined information search site, obtains a search result, pairs the search result with the ID, and outputs the pair to the search result sending unit 26. In the example above, it uses the processed speech recognition result of client-side device 1_1 output by the speech recognition result processing unit 24 as the search query, runs the search, obtains the search result corresponding to the processed speech recognition result, pairs it with the ID of client-side device 1_1, and outputs the pair to the search result sending unit 26. Search processing is a well-known technique, so a detailed description is omitted.

The search result sending unit 26 of the cloud-side device 2 sends a second transmission signal, which is a transmission signal containing the search result, to the client-side device corresponding to the ID contained in the pair of search result and ID output by the search processing unit 25. In the example above, it sends the second transmission signal containing the search result to client-side device 1_1. More precisely, the search result sending unit 26 sends the second transmission signal to the network 3 so that it is delivered to client-side device 1_1.

The search result receiving unit 14_1 of client-side device 1_1 receives the second transmission signal sent by the cloud-side device 2, extracts the search result from the received second transmission signal, and outputs it to the screen display unit 15_1. That is, the search result output by the search result receiving unit 14_1 is the search result corresponding to the processed speech recognition result.

The screen display unit 15_1 of client-side device 1_1 displays the search result output by the search result receiving unit 14_1 on the screen of client-side device 1_1. That is, the search result displayed by the screen display unit 15_1 is the search result corresponding to the processed speech recognition result.

Using the speech recognition system according to the first operation example of the first embodiment makes it possible to solve Problem 1. Compared with the prior art, it lowers the likelihood of obtaining a speech recognition result different from the one the speaker wants, and therefore lowers the likelihood that a search returns results different from those the speaker wants.

The effect of using the speech recognition system according to the first operation example of the first embodiment is described in more detail below, using the specific case above.

When environmental sounds such as television, radio, or public-address announcements are present around client-side device 1_1, the same environmental sounds are also present around client-side device 1_2.

In this case, with the prior art, the speech recognition result of client-side device 1_1 contains not only the recognition result of the first user's speech but also the recognition result of environmental sounds such as television, radio, or public-address announcements. There is prior art that applies noise suppression to the acoustic signal obtained by client-side device 1_1 before speech recognition, but if some environmental sound is not fully suppressed, the speech recognition result of client-side device 1_1 still contains the recognition result of that residual environmental sound.

There is also prior art that removes the acoustic signal acquired by client-side device 1_2 from the acoustic signal acquired by client-side device 1_1. However, even for the same environmental sound, the transfer characteristic to client-side device 1_1 differs from the transfer characteristic to client-side device 1_2, so the environmental sound is contained as different signal components in the acoustic signals acquired by the two devices. Consequently, removing the acoustic signal acquired by client-side device 1_2 from the acoustic signal acquired by client-side device 1_1 cannot remove all of the environmental sound. If some environmental sound remains, the speech recognition result of client-side device 1_1 contains the recognition result of that remaining environmental sound.
Moreover, removing the acoustic signal acquired by client-side device 1_2 from the acoustic signal acquired by client-side device 1_1 also removes, from the components corresponding to the speech that the first user uttered to client-side device 1_1, the components corresponding to the acoustic signal of client-side device 1_2. The components corresponding to the first user's speech become distorted as a result, and speech recognition of the speech that the first user uttered to client-side device 1_1 may no longer be performed correctly.
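The first limitation described above, that signal-level subtraction leaves part of the environmental sound behind when the transfer characteristics differ, can be illustrated numerically. The sketch below is a deliberate simplification: each transfer characteristic is reduced to a single gain (a real transfer characteristic would also involve delay and frequency-dependent filtering), and all sample values and names are hypothetical.

```python
# Environmental sound and the first user's speech as short sample sequences
# (hypothetical values chosen only for illustration).
env = [1.0, -2.0, 3.0, -1.0]
speech = [0.5, 0.5, -0.5, 0.0]

# Assumption: each transfer characteristic is modeled as one scalar gain.
gain_1, gain_2 = 0.75, 0.5

# Device 1_1 picks up the environmental sound plus the first user's speech.
signal_1 = [gain_1 * e + s for e, s in zip(env, speech)]
# Device 1_2 picks up the same environmental sound with a different gain.
signal_2 = [gain_2 * e for e in env]

# Signal-level subtraction, as in the prior art discussed above.
difference = [a - b for a, b in zip(signal_1, signal_2)]

# What is left of the environmental sound after subtraction.
residual_env = [d - s for d, s in zip(difference, speech)]
print(residual_env)
```

In this simplified model the residual equals (gain_1 - gain_2) times the environmental sound, so it vanishes only when the two gains are identical.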

Even when the transfer characteristic to client-side device 1_1 differs from the transfer characteristic to client-side device 1_2, if the same environmental sound is present at a relatively high volume, then both the speech recognition result for the acoustic signal acquired by client-side device 1_1 and the speech recognition result for the acoustic signal acquired by client-side device 1_2 contain, as partial character strings at the same time, the partial character strings that are the recognition results of sounds such as television, radio, or public-address announcements. Therefore, the speech recognition system according to the first operation example of the first embodiment can remove the recognition results of such environmental sounds by removing, from the speech recognition result for the acoustic signal acquired by client-side device 1_1, the partial character strings that are contained at substantially the same time in the speech recognition result for the acoustic signal acquired by another client-side device.

In contrast, the speech that the first user uttered to client-side device 1_1 is contained in the acoustic signal acquired by client-side device 1_1 but not in the acoustic signal acquired by client-side device 1_2. Therefore, removing from the speech recognition result for the acoustic signal acquired by client-side device 1_1 the partial character strings that are contained at substantially the same time in the speech recognition result of another client-side device does not remove the recognition result of the first user's speech.

As described above, the speech recognition system according to the first operation example of the first embodiment keeps low the likelihood of losing the recognition result of the speaker's own speech, which is the result the speaker wants, while lowering, relative to the prior art, the likelihood that the result contains recognition results of environmental sounds such as television, radio, or public-address announcements, which are not what the speaker wants.

[[Second operation example of the first embodiment]]
As a second operation example, the following describes a case in which, for one processing-target client-side device, a partial character string is removed from the character string of its speech recognition result only when a predetermined plural number of other client-side devices contain the same partial character string (a common partial character string) at substantially the same time. The result after removal is obtained as the processed speech recognition result. The second operation example differs from the first operation example in the operation of the speech recognition result processing unit 24 of the cloud-side device 2. Only the differences from the first operation example are described below.

The speech recognition result processing unit 24 of the cloud-side device 2 processes at least one set of speech recognition result, time information, and ID stored in the speech recognition result holding unit 23. For each partial character string contained in the character string of that speech recognition result, if a predetermined plural number (K, where K is an integer of 2 or more) of the other sets of speech recognition result, time information, and ID contain a matching pair of partial character string and time, the matching partial character string is removed. The result after removal is taken as the processed speech recognition result, which is paired with the ID and output.

Next, an example of the speech recognition results and the processed speech recognition result in this operation example is described with reference to FIG. 5. The horizontal axis of FIG. 5 is time. The three rows above the arrow are the speech recognition results of client-side devices 1_1 to 1_3, which are the inputs to the speech recognition result processing unit 24, and the one row below the arrow is the processed speech recognition result of client-side device 1_1. As a more specific case, assume N = 3 (the number of client-side devices) and K = 2 (the permitted number of devices whose partial character strings coincide at the same time). The first user is in a place where television or radio sound is playing, or where announcements such as those in a station or department store are broadcast at irregular intervals, and the second and third users are in a place where the same television, radio, or announcements can be heard.

The speech recognition result of client-side device 1_1 contains the partial character strings for utterance 1 and utterance 2, which were uttered by the first user (the user of client-side device 1_1), and the partial character strings for TV audio 1 and TV audio 2, which were emitted by a television near client-side device 1_1. The speech recognition result of client-side device 1_2 contains the partial character strings for utterance 3 and utterance 4, which were uttered by the second user (the user of client-side device 1_2), and the partial character strings for TV audio 1 and TV audio 2, which were emitted by a television near client-side device 1_2. The speech recognition result of client-side device 1_3 contains the partial character strings for utterance 5 and utterance 2, which were uttered by the third user (the user of client-side device 1_3), and the partial character strings for TV audio 1 and TV audio 2, which were emitted by a television near client-side device 1_3. Here, the partial character string for utterance 2 uttered by the first user and the partial character string for utterance 2 uttered by the third user are assumed to be identical.

Among the partial character strings contained in the speech recognition result of client-side device 1_1, the one for utterance 1 is contained at the same time in neither the speech recognition result of client-side device 1_2 nor that of client-side device 1_3, so it is not removed from the speech recognition result of client-side device 1_1. The partial character string for TV audio 1 is contained at the same time in the speech recognition results of both client-side device 1_2 and client-side device 1_3, that is, in those of two other client-side devices. The match count is therefore 3, which is larger than K, and the partial character string is removed from the speech recognition result of client-side device 1_1.
The partial character string for utterance 2 is not contained at the same time in the speech recognition result of client-side device 1_2 but is contained at the same time in that of client-side device 1_3, that is, in that of one other client-side device. The match count is therefore 2, which does not exceed K, and the partial character string is not removed from the speech recognition result of client-side device 1_1. The partial character string for TV audio 2 is contained at the same time in the speech recognition results of both client-side device 1_2 and client-side device 1_3, that is, in those of two other client-side devices, so it is removed from the speech recognition result of client-side device 1_1. Accordingly, the speech recognition result of client-side device 1_1 in which the partial character strings for utterance 1 and utterance 2 remain is output as the processed speech recognition result.
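The FIG. 5 case can be sketched as follows. This is an illustrative sketch only: the data layout, the names `Segment`, `same_time` and `remove_if_shared_by_k`, the time tolerance and the sample values are assumptions. The function counts matches among the other devices and removes a segment when that count reaches K, which for K = 2 gives the same outcome as the walk-through above, where the count includes the processing-target device itself and must exceed K.

```python
from typing import List, Tuple

# Assumed representation: (partial character string, time in seconds).
Segment = Tuple[str, float]


def same_time(t1: float, t2: float, tolerance: float = 0.5) -> bool:
    """Treat two times as 'substantially the same' within an assumed tolerance."""
    return abs(t1 - t2) <= tolerance


def remove_if_shared_by_k(
    target: List[Segment], others: List[List[Segment]], k: int
) -> List[Segment]:
    """Remove a segment only if at least k other devices contain the same
    partial character string at substantially the same time."""
    processed = []
    for text, t in target:
        matches = sum(
            1
            for other in others
            if any(text == o_text and same_time(t, o_t) for (o_text, o_t) in other)
        )
        if matches < k:
            processed.append((text, t))
    return processed


# Hypothetical results for client-side devices 1_1, 1_2 and 1_3 (N = 3).
device_1 = [("utterance 1", 0.0), ("tv audio 1", 2.0), ("utterance 2", 4.0), ("tv audio 2", 6.0)]
device_2 = [("tv audio 1", 2.0), ("utterance 3", 3.0), ("tv audio 2", 6.0), ("utterance 4", 8.0)]
device_3 = [("utterance 5", 0.0), ("tv audio 1", 2.0), ("utterance 2", 4.0), ("tv audio 2", 6.0)]

processed_k2 = remove_if_shared_by_k(device_1, [device_2, device_3], k=2)
processed_k1 = remove_if_shared_by_k(device_1, [device_2, device_3], k=1)
print(processed_k2)
print(processed_k1)
```

With K = 2, utterance 2 is kept even though the third user happened to say the same thing at the same time. Setting k = 1 reproduces the behavior of the first operation example, in which that coincidence causes utterance 2 to be dropped.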

This processing may be performed on all of the other client-side devices or on only some (but a plurality) of them. In the latter case, only the speech recognition results needed by the speech recognition result processing unit 24 may be obtained in the preceding stages. That is, the operations of the speech receiving unit 21, the speech recognition unit 22, and the speech recognition result holding unit 23 for obtaining speech recognition results that the speech recognition result processing unit 24 does not need may be omitted.

In the first operation example, if two users happen to utter the same content at the same time, the recognition result of the user's speech is removed. In the second operation example, by contrast, the recognition result of the user's speech is not removed unless three or more people (K + 1 or more) utter the same content at the same time. Environmental sounds such as television, radio, or public-address announcements always have the same content at the same time, whereas the likelihood that utterances by multiple users have the same content at the same time is extremely low, and the likelihood that this happens for three or more people is lower still. Therefore, the speech recognition system according to the second operation example of the first embodiment keeps the likelihood of losing the recognition result of the speaker's own speech even lower than in the first operation example, while lowering, relative to the prior art, the likelihood that the result contains recognition results of environmental sounds such as television, radio, or public-address announcements.

[[Third operation example of the first embodiment]]
As a third operation example, the following describes a case in which, for one processing-target client-side device, only those other client-side devices in which the same partial character string as in the processing-target device appears at the same time on multiple occasions are considered. When such another client-side device contains the same partial character string (a common partial character string) as the processing-target device at the same time, the common partial character string is removed from the character string of the speech recognition result of the processing-target device, and the result is obtained as the processed speech recognition result. The third operation example differs from the first operation example in the operation of the speech recognition result processing unit 24 of the cloud-side device 2. Only the differences from the first operation example are described below.

The speech recognition result processing unit 24 of the cloud-side device 2 processes at least one set of speech recognition result, time information, and ID stored in the speech recognition result holding unit 23. If another set of speech recognition result, time information, and ID contains a predetermined plural number (L, where L is an integer of 2 or more) of matching pairs of partial character string and time, the matching partial character strings are removed. The result after removal is taken as the processed speech recognition result, which is paired with the ID and output.

Next, an example of the speech recognition results and the processed speech recognition result in this operation example is described with reference to FIG. 6. The horizontal axis of FIG. 6 is time. The three rows above the arrow are the speech recognition results of client-side devices 1_1 to 1_3, which are the inputs to the speech recognition result processing unit 24, and the one row below the arrow is the processed speech recognition result of client-side device 1_1. As a more specific case, assume N = 3 and L = 2. The first user is in a place where television or radio sound is playing, or where announcements such as those in a station or department store are broadcast at irregular intervals. The second user is in a place where the same television, radio, or announcements can be heard, and the third user is in a place where different television, radio, or announcements can be heard.

The speech recognition result of client-side device 1_1 contains the partial character strings for utterance 1 and utterance 2, which were uttered by the first user (the user of client-side device 1_1), and the partial character strings for TV audio 1 and TV audio 2, which were emitted by a television near client-side device 1_1. The speech recognition result of client-side device 1_2 contains the partial character strings for utterance 3 and utterance 4, which were uttered by the second user (the user of client-side device 1_2), and the partial character strings for TV audio 1 and TV audio 2, which were emitted by a television near client-side device 1_2. The speech recognition result of client-side device 1_3 contains the partial character strings for utterance 5 and utterance 2, which were uttered by the third user (the user of client-side device 1_3), and the partial character strings for TV audio 3 and TV audio 4, which were emitted by a television near client-side device 1_3.
Here, the partial character string for utterance 2 uttered by the first user and the partial character string for utterance 2 uttered by the third user are assumed to be identical.

The speech recognition result of client-side device 1_2 contains two partial character strings that also appear at the same time in the speech recognition result of client-side device 1_1: the one for TV audio 1 and the one for TV audio 2. Because the same partial character string as in client-side device 1_1 appears at the same time on multiple occasions in client-side device 1_2, client-side device 1_2 is included in the removal processing. The speech recognition result of client-side device 1_3 contains one partial character string that also appears at the same time in the speech recognition result of client-side device 1_1: the one for utterance 2. Because the same partial character string as in client-side device 1_1 does not appear at the same time on multiple occasions in client-side device 1_3, client-side device 1_3 is excluded from the removal processing.
Then, only for client-side device 1_2, which is included in the removal processing, all partial character strings that appear at the same time in both the speech recognition result of client-side device 1_2 and that of client-side device 1_1 are searched for. This yields the partial character strings for TV audio 1 and TV audio 2. All partial character strings found in this way, namely those for TV audio 1 and TV audio 2, are removed from the speech recognition result of client-side device 1_1. The result, in which the partial character strings for utterance 1 and utterance 2 remain, is obtained as the processed speech recognition result.
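The FIG. 6 case can be sketched in the same style. Again this is only an illustration under assumptions: the data layout, the names `coincident` and `remove_using_recurring_devices`, the parameter name `min_shared` (standing in for L), the time tolerance and the sample values are all hypothetical.

```python
from typing import List, Set, Tuple

# Assumed representation: (partial character string, time in seconds).
Segment = Tuple[str, float]


def same_time(t1: float, t2: float, tolerance: float = 0.5) -> bool:
    """Treat two times as 'substantially the same' within an assumed tolerance."""
    return abs(t1 - t2) <= tolerance


def coincident(target: List[Segment], other: List[Segment]) -> List[Segment]:
    """Segments of the target that the other device also has at the same time."""
    return [
        (text, t)
        for (text, t) in target
        if any(text == o_text and same_time(t, o_t) for (o_text, o_t) in other)
    ]


def remove_using_recurring_devices(
    target: List[Segment], others: List[List[Segment]], min_shared: int
) -> List[Segment]:
    """Use only devices that share at least min_shared (L) segments with the
    target, and remove every segment shared with such a device."""
    to_remove: Set[Segment] = set()
    for other in others:
        shared = coincident(target, other)
        if len(shared) >= min_shared:
            to_remove.update(shared)
    return [seg for seg in target if seg not in to_remove]


# Hypothetical results for client-side devices 1_1, 1_2 and 1_3 (N = 3, L = 2).
device_1 = [("utterance 1", 0.0), ("tv audio 1", 2.0), ("utterance 2", 4.0), ("tv audio 2", 6.0)]
device_2 = [("tv audio 1", 2.0), ("utterance 3", 3.0), ("tv audio 2", 6.0), ("utterance 4", 8.0)]
device_3 = [("utterance 5", 0.0), ("tv audio 3", 2.0), ("utterance 2", 4.0), ("tv audio 4", 6.0)]

processed = remove_using_recurring_devices(device_1, [device_2, device_3], min_shared=2)
print(processed)
```

Device 1_2 shares two segments with device 1_1 and is therefore used for removal, while device 1_3 shares only the coincidental utterance 2 and is ignored, so utterance 2 is kept.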

This processing may be performed on all of the other client-side devices or on only some of them. In the latter case, only the speech recognition results needed by the speech recognition result processing unit 24 may be obtained in the preceding stages. That is, the operations of the speech receiving unit 21, the speech recognition unit 22, and the speech recognition result holding unit 23 for obtaining speech recognition results that the speech recognition result processing unit 24 does not need may be omitted.

In the first operation example, if two users happen to utter the same content at the same time, the recognition result of the user's speech is removed. In the third operation example, by contrast, the recognition result of the user's speech is not removed unless two users utter the same content at the same time on multiple occasions. Environmental sounds such as television, radio, or public-address announcements always have the same content at the same time, whereas the likelihood that two users' utterances have the same content at the same time is extremely low, and the likelihood that this happens multiple times is lower still. Therefore, the speech recognition system according to the third operation example of the first embodiment keeps the likelihood of losing the recognition result of the speaker's own speech even lower than in the first operation example, while lowering, relative to the prior art, the likelihood that the result contains recognition results of environmental sounds such as television, radio, or public-address announcements.

[[Fourth operation example of the first embodiment]]
As a fourth operation example, an example that uses position information in addition to the time information of the first operation example will be described. The fourth operation example differs from the first operation example in the operations of the user information acquisition units 12_1 to 12_N of the client-side devices 1_1 to 1_N, and of the speech recognition unit 22, the speech recognition result holding unit 23, and the speech recognition result processing unit 24 of the cloud-side device 2. Only the differences from the first operation example are described below.

The user information acquisition unit 12_1 of the client-side device 1_1 obtains time information and position information for the moment at which the voice input unit 11_1 acquired the acoustic signal, and outputs the time information and position information as user information to the voice transmission unit 13_1. The position information is, for example, information representing an absolute position such as latitude and longitude. If the client-side device is a smartphone with a built-in GPS receiver, the latitude and longitude measured by the GPS receiver when the microphone serving as the voice input unit 11_1 acquired the acoustic signal may be used as the position information. If the smartphone has an auxiliary positioning function based on Wi-Fi base stations or beacons, the latitude and longitude measured by the auxiliary positioning unit may be used as the position information. Because the speech recognition result processing unit 24 uses the position information to determine whether the acoustic signals acquired by the respective client-side devices were emitted at nearby positions, the position information may instead be information representing the relative positional relationship among the client-side devices. For example, in the case of a smart TV or a set-top box (STB), a region-specific ID that takes the same value throughout a mesh cell of latitude and longitude, such as a region code, a postal code, a beacon code received from a nearby beacon, or a geohash ID, may be used as the information representing the relative positional relationship. The user information acquisition units 12_2 to 12_N of the client-side devices 1_2 to 1_N operate in the same way as the user information acquisition unit 12_1 of the client-side device 1_1.

The speech recognition unit 22 of the cloud-side device 2 performs speech recognition processing on each acoustic signal output by the voice receiving unit 21, obtains a speech recognition result, which is a character string corresponding to the speech contained in the acoustic signal, and outputs a set consisting of the speech recognition result, the time information corresponding to it, the position information corresponding to it, and the ID corresponding to it. The speech recognition processing, the speech recognition result, the time information corresponding to the speech recognition result, and the ID corresponding to the speech recognition result are the same as in the first operation example. The position information paired with a speech recognition result is the position information corresponding to that result, that is, the position information contained in the user information that was input from the voice receiving unit 21 together with the acoustic signal from which the result was obtained. If, for one speech recognition result, that user information contains multiple pieces of position information, one piece of position information representative of them is paired with the speech recognition result. The representative position information may be anything that makes it possible to roughly identify the position at which the acoustic signal corresponding to the speech recognition result was emitted. For example, it may be any one of the multiple pieces of position information, or it may be position information giving the mean of their latitudes and the mean of their longitudes.
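The two choices of representative position named above (any one of the reported positions, or the mean latitude and mean longitude) can be sketched as follows. This is an illustrative sketch only: the function name `representative_position` and the `use_mean` flag are hypothetical and do not appear in the patent, and positions are assumed to be (latitude, longitude) pairs in degrees.

```python
from statistics import mean


def representative_position(positions, use_mean=True):
    """Pick one (latitude, longitude) pair to stand for several.

    Hypothetical helper: 'positions' is assumed to be a non-empty list of
    (latitude, longitude) tuples in degrees.
    """
    if not positions:
        raise ValueError("at least one position is required")
    if not use_mean:
        # Option 1 in the text: any one of the reported positions.
        return positions[0]
    # Option 2 in the text: mean of the latitudes and mean of the longitudes.
    return (
        mean(lat for lat, _ in positions),
        mean(lon for _, lon in positions),
    )
```

A plain arithmetic mean of longitudes is only reasonable when the positions are close together and do not straddle the 180-degree meridian, which fits the use here because the positions belong to a single utterance.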

The speech recognition result holding unit 23 of the cloud-side device 2 stores the sets of speech recognition result, time information, position information, and ID output by the speech recognition unit 22. The stored content is used by the speech recognition result processing unit 24 both to determine whether there is a partial character string, such as a word, whose time and position are shared, and, when there is one, to remove it from the speech recognition result to obtain the processed speech recognition result. The content held in the speech recognition result holding unit 23 may therefore be deleted once a fixed time has elapsed after the speech recognition result processing unit 24 has finished the processing that uses it. This fixed time is, for example, a little over ten seconds, taking into account the time required for internal processing in the client-side device, the time and variation involved in transmitting data to the cloud-side device, the temporal length of each partial character string, and so on.
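As a rough illustration of the deletion rule above, the following sketch drops stored entries once a retention period has elapsed after their processing finished. The entry layout (a dict with a `processed_at` timestamp in seconds, or `None` while processing is still pending), the function name, and the 15-second default are all assumptions made for this example; the patent only says the period is on the order of ten-odd seconds.

```python
def purge_expired(entries, now, retention_seconds=15.0):
    """Return the entries that must still be kept.

    Hypothetical layout: each entry is a dict whose 'processed_at' value is
    the time (in seconds) at which the processing unit finished using it,
    or None if it has not been processed yet.
    """
    return [
        entry
        for entry in entries
        if entry["processed_at"] is None
        or now - entry["processed_at"] < retention_seconds
    ]
```

Entries that have not been processed yet are always kept, so that late-arriving results from other devices can still be compared against them.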

For at least one set of speech recognition result, time information, position information, and ID stored in the speech recognition result holding unit 23, the speech recognition result processing unit 24 of the cloud-side device 2 examines each combination of a partial character string in the character string of that speech recognition result, its time, and its position. If any other set of speech recognition result, time information, position information, and ID contains a matching combination of partial character string, time, and position, the unit removes the matched partial character string, takes what remains as the processed speech recognition result, and outputs the processed speech recognition result paired with the ID. A processed speech recognition result is thus output for at least one client-side device. The position match does not test whether the client-side devices are at exactly the same position. It tests whether they are at positions where they could acquire the same environmental sound, such as the sound of the same television or radio or the same guidance announcement, as an acoustic signal. Whether the devices are near each other, judged for example by whether they lie within a predetermined distance, is therefore used as the position match. In other words, for at least one client-side device, when another client-side device located nearby at substantially the same time has the same partial character string (a common partial character string), the common partial character string is removed from the character string of that device's speech recognition result to obtain the processed speech recognition result.

The processing flow of the speech recognition result processing unit 24 of the cloud-side device 2 described above is shown in FIG. 7. The flow of FIG. 7 differs from that of FIG. 3 in that step S241A is performed in place of step S241, step S243A in place of step S243, step S244A in place of step S244, and step S245A in place of step S245.

The speech recognition result processing unit 24 first reads the set of speech recognition result, time information, position information, and ID of the client-side device 1_1 from the speech recognition result holding unit 23 (step S241A). It then sets x to its initial value of 2 (step S242). Next, it reads the set of speech recognition result, time information, position information, and ID of the client-side device 1_x from the speech recognition result holding unit 23 (step S243A). It then searches the set for the client-side device 1_1 and the set for the client-side device 1_x for partial character strings that match in content, in approximate time (for example, within a few seconds), and in position (step S244A). If step S244A finds such matches, the unit removes every partial character string that matches in time and position from the character string of the speech recognition result of the client-side device 1_1 (step S245A). If step S244A finds no match, the flow proceeds to step S246. The unit then determines whether any client-side device remains that has not yet been subjected to steps S243A, S244A, and S245A (step S246). If step S246 determines that such a device remains, x is replaced with x + 1 (step S247). If step S246 determines that no such device remains, the character string of the speech recognition result of the client-side device 1_1 as processed by the last execution of step S245A is output, paired with the ID, as the character string of the processed speech recognition result of the client-side device 1_1 (step S248).
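The loop over the other client-side devices in the FIG. 7 flow can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: each device's result is assumed to be a dict with an `id`, a `position` given as (latitude, longitude) in degrees, and `words` given as (text, time in seconds) pairs; the names `distance_m` and `process_fig7`, the 3-second time tolerance, and the 100-metre nearness threshold are all hypothetical. Nearness is judged here by great-circle distance, which is one possible reading of "within a predetermined distance".

```python
import math


def distance_m(p, q):
    """Haversine distance in metres between two (lat, lon) pairs in degrees."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def process_fig7(target, others, time_tol=3.0, max_dist_m=100.0):
    """Remove words that a nearby device also recognised at about the same time."""
    remaining = list(target["words"])                      # S241A
    for other in others:                                   # S242, S243A, S246, S247
        if distance_m(target["position"], other["position"]) > max_dist_m:
            continue                                       # position does not match
        remaining = [                                      # S244A and S245A
            (word, t)
            for (word, t) in remaining
            if not any(word == ow and abs(t - ot) <= time_tol
                       for ow, ot in other["words"])
        ]
    return target["id"], [word for word, _ in remaining]   # S248
```

For example, if the target device recognised "weather tomorrow rain" and a device about ten metres away recognised "tomorrow" and "rain" within a second of the same times, only "weather" survives; a device 100 km away that also recognised "weather" has no effect.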

Steps S244A11 and S244A12 shown in (1) of FIG. 8 may be performed in place of step S244A of FIG. 7. In that case, whether each of the client-side devices 1_2 to 1_N is near the client-side device 1_1 needs to be checked only when the pair of partial character string and time of step S244A2 matches, so the amount of computation can be reduced when few partial character strings match.

Alternatively, steps S244A21 and S244A22 shown in (2) of FIG. 8 may be performed in place of step S244A of FIG. 7. In that case, whether the pair of partial character string and time of step S244A2 matches needs to be checked only when the client-side device among 1_2 to 1_N is near the client-side device 1_1, so the amount of computation can be reduced when few client-side devices are nearby.

Using the speech recognition system according to the fourth operation example of the first embodiment makes it possible to solve Problem 2. It reduces, compared with the conventional art, the possibility of obtaining a speech recognition result different from the one the speaker wants, and likewise the possibility of obtaining, in a search, a search result different from the one the speaker wants.

[[Fifth operation example of the first embodiment]]
The fourth operation example can be modified in the same way that the first operation example was modified into the second operation example. This is described as the fifth operation example. That is, in the fifth operation example, for one client-side device to be processed, when a predetermined plural number of other client-side devices located nearby at substantially the same time have the same partial character string as the device to be processed (a common partial character string), the common partial character string is removed from the character string of that device's speech recognition result to obtain the processed speech recognition result. The fifth operation example differs from the fourth operation example in the operation of the speech recognition result processing unit 24 of the cloud-side device 2. Only the differences from the fourth operation example are described below.

For at least one set of speech recognition result, time information, position information, and ID stored in the speech recognition result holding unit 23, the speech recognition result processing unit 24 of the cloud-side device 2 examines each partial character string contained in the character string of that speech recognition result. If a predetermined plural number (K, where K is an integer of 2 or more) of the other sets of speech recognition result, time information, position information, and ID contain a matching pair of partial character string and time, the unit removes the matched partial character string, takes what remains as the processed speech recognition result, and outputs the processed speech recognition result paired with the ID.
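The K-device threshold described above can be sketched as follows. The sketch assumes that the caller has already restricted `others_words` to the devices judged to be nearby, as in the fourth operation example; the function name `remove_if_k_devices_match`, the (text, time in seconds) word layout, and the 3-second tolerance are hypothetical choices for illustration.

```python
def remove_if_k_devices_match(target_words, others_words, k=2, time_tol=3.0):
    """Drop a word only when at least k other devices have it at about the same time.

    target_words: list of (text, time_seconds) pairs for the device being processed.
    others_words: one such list per other device (assumed pre-filtered to
    nearby devices by the caller).
    """
    kept = []
    for word, t in target_words:
        # Count devices, not occurrences: a device counts once per word.
        matching_devices = sum(
            1
            for words in others_words
            if any(word == ow and abs(t - ot) <= time_tol for ow, ot in words)
        )
        if matching_devices < k:
            kept.append(word)
    return kept
```

With k=2, a word heard by two other devices at about the same time is treated as environmental sound and removed, while a word shared with only one other device is kept as a possible coincidence between two speakers.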

In the fourth operation example, if two users happen to utter the same content at the same time at nearby positions, the speech recognition result of the speech uttered by the user is removed. In the fifth operation example, by contrast, the speech recognition result of the user's speech is not removed unless three or more people (K + 1 or more) utter the same content at the same time at nearby positions. Whereas environmental sounds such as television, radio, and guidance announcements necessarily have the same content at the same time, the likelihood that multiple users' utterances have the same content at the same time at nearby positions is extremely low, and the likelihood that three or more users do so is lower still. Therefore, the speech recognition system according to the fifth operation example of the first embodiment keeps the possibility of losing the speech recognition result of the speaker's own speech (the result the speaker wants) lower than in the second operation example, while reducing, compared with the conventional art, the possibility that speech recognition results of environmental sounds such as television, radio, and guidance announcements (results the speaker does not want) are included.

[[Sixth operation example of the first embodiment]]
The first to third operation examples, which do not use position information, and the fourth and fifth operation examples, which do, may be combined. One such combination is described as the sixth operation example. In the sixth operation example, for one client-side device to be processed, common partial character strings are removed from the character string of that device's speech recognition result with respect to two kinds of other client-side devices: those determined from the position information to be near the device to be processed, and those not determined to be nearby but in whose speech recognition results a plurality of partial character strings identical to those of the device to be processed appear at the same times. What remains is obtained as the processed speech recognition result. The sixth operation example differs from the fourth operation example in the operation of the speech recognition result processing unit 24 of the cloud-side device 2. The operation of the speech recognition result processing unit 24 that differs from the fourth operation example is described below with reference to its processing flow, FIG. 9.

The speech recognition result processing unit 24 first reads the set of speech recognition result, time information, position information, and ID of the client-side device 1_1 from the speech recognition result holding unit 23 (step S241). It then sets x to its initial value of 2 (step S242). Next, it reads the set of speech recognition result, time information, position information, and ID of the client-side device 1_x from the speech recognition result holding unit 23 (step S243). It then searches the set for the client-side device 1_1 and the set for the client-side device 1_x to determine whether there are a plurality of partial character strings that match in content and time (step S244B1). If step S244B1 finds a plurality of such matches, the unit removes every partial character string that matches in content and time from the character string of the speech recognition result of the client-side device 1_1 (step S245B1). If step S244B1 does not find a plurality of matches, that is, if there is exactly one match or no match, the flow proceeds to step S244B2. The unit then searches the same two sets for a partial character string that matches in content, time, and position (step S244B2). If step S244B2 finds such a match, the unit removes every partial character string that matches in content, time, and position from the character string of the speech recognition result of the client-side device 1_1 (step S245B2). If step S244B2 finds no match, the flow proceeds to step S246. The unit then determines whether any client-side device remains that has not yet been subjected to any of steps S243, S244B1, S244B2, S245B1, and S245B2 (step S246). If step S246 determines that such a device remains, x is replaced with x + 1 (step S247). If step S246 determines that no such device remains, the character string of the speech recognition result of the client-side device 1_1 as processed by the last execution of step S245B1 or S245B2 is output, paired with the ID, as the character string of the processed speech recognition result of the client-side device 1_1 (step S248).
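The two-stage test of the FIG. 9 flow can be sketched as below. This is a sketch under several assumptions: each device's result is a dict with an `id`, an `area` (a region-specific ID such as a geohash cell, used here as the relative position information), and `words` given as (text, time in seconds) pairs; two devices are treated as nearby when their `area` values are equal; and matches are computed against the target's original word list rather than the partially processed one, which is a design choice of this example and not something the patent specifies. The function name `process_fig9` and the 3-second tolerance are hypothetical.

```python
def process_fig9(target, others, time_tol=3.0):
    """Combine the time-only rule (two or more matches) with the nearby rule."""
    words = target["words"]                                    # S241
    to_remove = set()
    for other in others:                                       # S242, S243, S246, S247
        matched = {
            i
            for i, (word, t) in enumerate(words)
            if any(word == ow and abs(t - ot) <= time_tol
                   for ow, ot in other["words"])
        }
        if len(matched) >= 2:                                  # S244B1
            to_remove |= matched                               # S245B1
        elif matched and target["area"] == other["area"]:      # S244B2
            to_remove |= matched                               # S245B2
    kept = [word for i, (word, _) in enumerate(words) if i not in to_remove]
    return target["id"], kept                                  # S248
```

A distant device that shares two or more words at matching times is taken to be hearing the same broadcast, so those words are removed even without a position match; a single shared word is removed only when the other device is in the same area.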

Using the speech recognition system according to the sixth operation example of the first embodiment makes it possible to remove the character strings of recognized environmental sound in both of two cases: when multiple client-side devices are not near each other but the same television or radio program is playing, and when multiple client-side devices are near each other and the same television or radio program is playing. This reduces, compared with the conventional art, the possibility of obtaining a speech recognition result different from the one the speaker wants, and likewise the possibility of obtaining, in a search, a search result different from the one the speaker wants.

Second Embodiment
Next, as a second embodiment of the present invention, a form is described in which the portions in common with the speech recognition result of the acoustic signal of a public broadcast are removed from the speech recognition result of an acoustic signal, containing the speaker's speech, that was input to a client-side device. FIG. 10 is a block diagram showing the configuration of the speech recognition system in the second embodiment. Components in FIG. 10 that are the same as in FIG. 1 are given the same reference numerals. Reference numeral 100 denotes the speech recognition system, reference numerals 1_1 to 1_N denote one or more client-side devices (N devices, where N is an integer of 1 or more), and reference numeral 2 denotes the cloud-side device. Reference numerals 5_1 to 5_M denote one or more public broadcast stations (M stations, where M is an integer of 1 or more). The client-side devices 1_1 to 1_N are devices used by users, for example those given as examples in the description of the first embodiment. The public broadcast stations 5_1 to 5_M exist outside the speech recognition system 100. The cloud-side device 2 is connected to the client-side devices 1_1 to 1_N via the network 3. The network 3 allows the client-side devices 1_1 to 1_N and the cloud-side device 2 of the speech recognition system 100 to send and receive information in accordance with Internet connection protocols, and is, for example, the Internet.
The cloud-side device 2 is connected to the public broadcast stations 5_1 to 5_M via the network 4. The network 4 allows the cloud-side device 2 of the speech recognition system 100 and the public broadcast stations 5_1 to 5_M to send and receive information in accordance with Internet connection protocols or a dedicated protocol for video relay, and is, for example, a closed, dedicated-line Internet connection.

In the configuration of FIG. 10, the public broadcast stations 5_1 to 5_M and the cloud-side device 2 are connected via the network 4 in accordance with Internet connection protocols, so that the cloud-side device 2 can receive the public broadcast signals of the public broadcast stations 5_1 to 5_M. However, the cloud-side device 2 may instead be provided with a receiver (not shown) so that it can receive the public broadcast signals without going through the network 4. In addition, because the program lineup and broadcast times of public broadcasts vary by region (for example, Tokyo, Osaka, Nagoya, and Fukuyama), the acoustic signals also differ by region. Since installing many cloud-side devices 2 across the country is costly, receivers for public broadcast signals may be installed across the country, with the processed signals or recognition results sent to the cloud-side device 2 via the network 3 and the network 4.

The public broadcast stations 5_1 to 5_M are, for example, all or the main public broadcast stations in the regions where a client-side device may be located. Examples are satellite, terrestrial, and major IP multicast, streaming, CATV, and cable broadcasts that are broadcast in regions covering the places of residence and ranges of movement of the users of the client-side devices. The cloud-side device 2 is connected to the necessary facilities, such as the network 3, the network 4, receiving equipment, and receiving devices, so that it can receive all of the desired public broadcast stations.

Since the minimum configuration of the client-side devices 1_1 to 1_N is the same for all of them, the following description uses FIG. 11, a block diagram detailing the portion of the speech recognition system 100 of the second embodiment that consists of the client-side device 1_1 and the cloud-side device 2. Components in FIG. 11 that bear the same reference numerals as in FIG. 2 perform the same operations as in FIG. 2.

The cloud-side device 2 includes at least a voice receiving unit 21, a speech recognition unit 22, a broadcast receiving unit 41, a broadcast speech recognition unit 42, a speech recognition result holding unit 43, a speech recognition result processing unit 44, a search processing unit 25, and a search result sending unit 26. The voice receiving unit 21, speech recognition unit 22, search processing unit 25, and search result sending unit 26 of the cloud-side device 2 operate in the same way as the voice receiving unit 21, speech recognition unit 22, search processing unit 25, and search result sending unit 26, respectively, of the cloud-side device 2 of the first embodiment.

Next, the operation of the speech recognition system of the second embodiment will be described.

[[First operation example of the second embodiment]]
As a first operation example, consider the case where the first to N-th users are using the client-side devices 1_1 to 1_N, respectively, the first user utters to the client-side device 1_1 a sentence for which a search result is desired, and the search result corresponding to that utterance is displayed on the screen display unit 15_1 of the client-side device 1_1. As a more concrete case, the description assumes that the first user is in a place where only the broadcast of the public broadcast station 5_1 is playing, within an area where the broadcasts of two public broadcast stations 5_1 and 5_2 can be received.

The voice input unit 11_1 of the client-side device 1_1 acquires the acoustic signal emitted around the client-side device 1_1 and outputs the acquired acoustic signal to the voice sending unit 13_1. When the first user utters to the client-side device 1_1 a sentence for which a search result is desired, the unit acquires and outputs an acoustic signal containing the first user's speech. When environmental sounds such as television, radio, or public-address announcements are present around the client-side device 1_1, it acquires and outputs an acoustic signal containing those environmental sounds. In the concrete case above, therefore, it acquires and outputs an acoustic signal made up of the speech uttered by the first user and the sound of the broadcast of the public broadcast station 5_1.

The user information acquisition unit 12_1 of the client-side device 1_1 obtains time information indicating when the voice input unit 11_1 of the client-side device 1_1 acquired the acoustic signal, and outputs that time information, together with identification information that can identify the client-side device 1_1 (hereinafter "ID"), to the voice sending unit 13_1 as user information. The time information is, for example, an absolute time. For example, when the client-side device 1_1 is a smartphone with a built-in GPS receiver, the absolute time received by the GPS receiver when the smartphone's microphone, which serves as the voice input unit 11_1, acquired the acoustic signal may be used as the time information. Alternatively, the time information may be obtained from a base station of a mobile network or from a communication server, or may be the time of the local clock maintained by the OS.
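
As a rough illustration of the user information described above, the following Python sketch builds the pair of time information and ID. The names `UserInfo` and `acquire_user_info`, and the order in which the time sources are tried (GPS, then network, then the local OS clock), are assumptions made for this example; the description lists these sources only as alternatives.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UserInfo:
    """User information: when the acoustic signal was acquired, and by which device."""
    device_id: str      # ID that identifies the client-side device
    captured_at: float  # acquisition time in seconds since the Unix epoch


def acquire_user_info(
    device_id: str,
    gps_time: Optional[float] = None,
    network_time: Optional[float] = None,
    os_clock: Callable[[], float] = time.time,
) -> UserInfo:
    """Build the user information from whichever time source is available.

    The priority order is an assumption of this sketch, not something the
    description prescribes.
    """
    if gps_time is not None:
        captured_at = gps_time        # absolute time from a GPS receiver
    elif network_time is not None:
        captured_at = network_time    # time from a base station or communication server
    else:
        captured_at = os_clock()      # local clock maintained by the OS
    return UserInfo(device_id=device_id, captured_at=captured_at)
```

For instance, `acquire_user_info("device-1", gps_time=100.0)` would yield user information stamped with the GPS time, while omitting both optional arguments falls back to the local clock.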

The voice input units 11_2 to 11_N, user information acquisition units 12_2 to 12_N, and voice sending units 13_2 to 13_N of the client-side devices 1_2 to 1_N operate in the same way as the voice input unit 11_1, user information acquisition unit 12_1, and voice sending unit 13_1 of the client-side device 1_1, respectively. When no search result is needed for the utterances of some of the second to N-th users, the client-side devices of those users need not be provided; even if they are provided, their voice input units, user information acquisition units, and voice sending units need not be operated.

The voice receiving unit 21 of the cloud-side device 2 receives the transmission signals sent by the voice sending units 13_1 to 13_N of the client-side devices 1_1 to 1_N, extracts a pair of an acoustic signal and user information from each received transmission signal, and outputs the pair. Transmission signals are received at predetermined time intervals, for example every 10 ms. When the voice sending units 13_1 to 13_N encode the acoustic signal with a predetermined encoding method and send the resulting code string, the voice receiving unit 21 of the cloud-side device 2 obtains the acoustic signal by decoding the code string contained in the received transmission signal with the decoding method corresponding to that encoding method, and outputs the pair of the obtained acoustic signal and the user information. When the voice sending units 13_1 to 13_N perform part of the speech recognition processing on the acoustic signal, such as feature extraction, and send a transmission signal containing the resulting features and the user information, the voice receiving unit 21 extracts the features instead of an acoustic signal from the transmission signal and outputs the pair of the extracted features and the user information. When no search result is needed for the utterances of some of the second to N-th users, the acoustic signals of those users need not be received; even if they are received, the pairs of those acoustic signals and user information need not be output. Furthermore, when no search result is needed for the utterances of any of the second to N-th users and no pair of acoustic signal and user information is output for any user other than the first user, the user information of the client-side device 1_1 may be output without the ID.

The speech recognition unit 22 of the cloud-side device 2 performs speech recognition processing on each acoustic signal output by the voice receiving unit 21, obtains a speech recognition result, which is a character string corresponding to the speech contained in the acoustic signal, and outputs a set consisting of the speech recognition result, the time information corresponding to that result, and the ID corresponding to that result. When the time information is missing or has an inappropriate value, the received time information may be disregarded and the result managed using the approximate time at which the server received the data. When nothing is output for users other than the first user, a set consisting of the speech recognition result and the corresponding time information may be output without the ID of the client-side device 1_1.
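
The fallback for missing or inappropriate time information can be sketched as follows. The description does not say what counts as an inappropriate value, so the test used here (a deviation of more than `max_skew_s` seconds from the server's receive time) and the function name `effective_time` are assumptions made for illustration.

```python
from typing import Optional


def effective_time(
    reported_time: Optional[float],
    server_receive_time: float,
    max_skew_s: float = 60.0,
) -> float:
    """Return the time to attach to a speech recognition result.

    Falls back to the approximate time at which the server received the data
    when the client sent no time information or an implausible value.
    """
    if reported_time is None:
        return server_receive_time  # no time information was sent
    if abs(reported_time - server_receive_time) > max_skew_s:
        return server_receive_time  # reported value looks inappropriate
    return reported_time
```

A threshold of this kind is only one possible criterion; a deployment could equally reject timestamps lying in the future or outside the session's lifetime.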

In the concrete case above, therefore, the speech recognition unit 22 obtains and outputs, as the speech recognition result for the acoustic signal of the client-side device 1_1, a character string consisting of the recognition result of the speech uttered by the first user and the recognition result of the speech contained in the sound of the broadcast of the public broadcast station 5_1.

As in the first embodiment, a known speech recognition technique may be used for the speech recognition processing. When the voice receiving unit 21 outputs features instead of an acoustic signal, the speech recognition unit 22 performs the speech recognition processing using those features.

The broadcast receiving unit 41 of the cloud-side device 2 receives the public broadcast signals sent by the public broadcast stations 5_1 to 5_M, extracts from each received public broadcast signal a pair of an acoustic signal and the time information corresponding to that acoustic signal, and outputs the pair. Identification information that can identify the public broadcast station (hereinafter "broadcast station ID") may also be output together with the acoustic signal and the time information. Public broadcast signals are received at predetermined time intervals, for example every 10 ms. When the public broadcast stations 5_1 to 5_M encode the acoustic signal with a predetermined encoding method and send the resulting code string, the broadcast receiving unit 41 of the cloud-side device 2 obtains the acoustic signal by decoding the code string contained in the received public broadcast signal with the decoding method corresponding to that encoding method, and outputs the pair of the obtained acoustic signal and the corresponding time information. The broadcast receiving unit 41 may be equipped with a clock (not shown) capable of outputting the absolute time; when, for example, the public broadcast signal is an analog broadcast and time information cannot be extracted from it, the absolute time obtained from that clock may be output together with the acoustic signal obtained from the public broadcast signal. Because the set of public broadcast acoustic signals differs by region (for example, Tokyo, Osaka, Nagoya, and Fukuyama), the broadcast receiving unit 41 may be located in a region different from that of the cloud-side device 2 and connected to the broadcast speech recognition unit 42 via a network. When signals can be obtained directly from a public broadcast station via a network, the public broadcast acoustic signal obtained in that way may be used directly as input.

The broadcast speech recognition unit 42 of the cloud-side device 2 performs speech recognition processing on each acoustic signal output by the broadcast receiving unit 41, obtains a speech recognition result, which is a character string corresponding to the speech contained in the acoustic signal, and outputs a set consisting of the speech recognition result and the time information corresponding to that result. The broadcast station ID may also be output together with the speech recognition result and the time information. The broadcast receiving unit 41 and the broadcast speech recognition unit 42 may be located in a region different from that of the cloud-side device 2, with their output connected to the speech recognition result holding unit 43 via a network.
The method of performing speech recognition processing on each predetermined segment of the acoustic signal and attaching time information and an ID to the character string of the resulting speech recognition result, the speech recognition technique used, and so on are the same as for the speech recognition unit 22, so a detailed description is omitted.

The speech recognition result holding unit 43 of the cloud-side device 2 stores the sets of speech recognition result, time information, and ID output by the speech recognition unit 22 and the sets of speech recognition result and time information output by the broadcast speech recognition unit 42. When the broadcast speech recognition unit 42 outputs sets of speech recognition result, time information, and broadcast station ID, those sets are stored instead of the sets of speech recognition result and time information. The stored contents are used by the speech recognition result processing unit 44 to determine whether there are partial character strings, such as words, that share the same time and, when there are, to remove them from the speech recognition result to obtain the processed speech recognition result. The speech recognition result holding unit 43 therefore keeps the sets output by the speech recognition unit 22 and the sets output by the broadcast speech recognition unit 42 only for as long as the processing of the speech recognition result processing unit 44 requires. Stored contents may be deleted once the processing of the speech recognition result processing unit 44 that uses them has finished.
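
One way to picture the holding unit is as a small in-memory store that keeps client and broadcast results side by side and discards old ones. The class below is a minimal sketch under assumptions: the name `RecognitionResultStore`, the tuple layout, and the purely time-based retention window are illustrative choices, whereas the description only requires that entries be kept for as long as the processing unit needs them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (recognized text, time in seconds, client ID or broadcast station ID; None if omitted)
Entry = Tuple[str, float, Optional[str]]


@dataclass
class RecognitionResultStore:
    retention_s: float  # assumed: how long the processing unit needs the entries
    client_entries: List[Entry] = field(default_factory=list)
    broadcast_entries: List[Entry] = field(default_factory=list)

    def store_client(self, text: str, time_s: float,
                     client_id: Optional[str] = None) -> None:
        """Store a result from the speech recognition unit."""
        self.client_entries.append((text, time_s, client_id))

    def store_broadcast(self, text: str, time_s: float,
                        station_id: Optional[str] = None) -> None:
        """Store a result from the broadcast speech recognition unit."""
        self.broadcast_entries.append((text, time_s, station_id))

    def discard_before(self, now_s: float) -> None:
        """Delete entries that fall outside the retention window."""
        cutoff = now_s - self.retention_s
        self.client_entries = [e for e in self.client_entries if e[1] >= cutoff]
        self.broadcast_entries = [e for e in self.broadcast_entries if e[1] >= cutoff]
```

An implementation closer to the text could instead delete entries explicitly when the processing that uses them completes; the retention window is simply the easiest variant to show.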

For the set of speech recognition result, time information, and ID of at least one client-side device stored in the speech recognition result holding unit 43, the speech recognition result processing unit 44 of the cloud-side device 2 checks whether the sets of speech recognition result and time information of each public broadcast stored in the speech recognition result holding unit 43 contain a matching pair of partial character string and time. If so, it removes the matching partial character strings to form the processed speech recognition result, and outputs the processed speech recognition result paired with the ID. A processed speech recognition result is thus output for at least one client-side device. In judging whether times match, times may be treated as the same after allowing for errors in absolute time between the client-side device and the public broadcast station or cloud-side device, timing errors in the speech recognition processing, and the like. In other words, for at least one client-side device, when any public broadcast contains the same partial character string as that client-side device (a common partial character string) at substantially the same time, the processed speech recognition result is obtained by removing the common partial character string from the character string of that client-side device's speech recognition result.
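
The matching rule (same partial character string at substantially the same time) can be sketched for a single broadcast as follows. The representation of a recognition result as a list of `(text, time)` pairs, the function names, and the tolerance of 0.5 seconds are assumptions made for this example; the description does not fix a tolerance value.

```python
from typing import Iterable, List, Tuple

Word = Tuple[str, float]  # (partial character string, time in seconds)


def same_time(t_client: float, t_broadcast: float, tolerance_s: float = 0.5) -> bool:
    """Treat two times as identical if they differ by at most tolerance_s.

    The tolerance absorbs clock differences and recognition timing errors.
    """
    return abs(t_client - t_broadcast) <= tolerance_s


def remove_common(client: List[Word], broadcast: Iterable[Word],
                  tolerance_s: float = 0.5) -> List[Word]:
    """Drop client words that the broadcast also contains at substantially the same time."""
    broadcast = list(broadcast)
    return [
        (text, t) for text, t in client
        if not any(text == b_text and same_time(t, b_t, tolerance_s)
                   for b_text, b_t in broadcast)
    ]
```

With `client = [("tomorrow", 1.0), ("goal", 1.6), ("weather", 2.2)]` and `broadcast = [("goal", 1.8), ("weather", 9.0)]`, only "goal" is removed: "weather" appears in both, but not at substantially the same time.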

This processing may cover all public broadcast stations in the areas where the client-side device may be located, or at least one public broadcast. In the latter case, only the speech recognition results needed by the speech recognition result processing unit 44 need be obtained in the preceding stages. That is, the operations of the broadcast receiving unit 41, broadcast speech recognition unit 42, and speech recognition result holding unit 43 that would yield speech recognition results not needed by the speech recognition result processing unit 44 may be omitted.

An example with M = 2 as above, in which the at least one client-side device is the client-side device 1_1, is described with reference to FIGS. 12 and 13. FIG. 12 illustrates the processing flow of the speech recognition result processing unit 44 in this operation example, and FIG. 13 illustrates an example of the speech recognition results and the processed speech recognition result in this example. In the example of FIG. 12, the speech recognition results of all public broadcast stations are examined in order, starting with that of the public broadcast station 5_1, to search for pairs of partial character string and time that match the speech recognition result of the client-side device 1_1; whenever such matches are found, the matching common partial character strings are removed from the character string of the speech recognition result of the client-side device 1_1.

The speech recognition result processing unit 44 first reads the set of speech recognition result, time information, and ID of the client-side device 1_1 from the speech recognition result holding unit 43 (step S441). It then sets the initial value of y to 1 (step S442). Next, it reads the set of speech recognition result and time information of the public broadcast station 5_y from the speech recognition result holding unit 43 (step S443). It then searches for partial character strings whose text and time match between the set of the client-side device 1_1 and the set of the public broadcast station 5_y (step S444). If step S444 finds matches, it removes all partial character strings whose text and time match from the character string of the speech recognition result of the client-side device 1_1 (step S445). If step S444 finds no match, processing proceeds to step S446. The unit then determines whether any public broadcast station remains that has not yet been subjected to steps S443 to S445 (step S446). If step S446 determines that such a station remains, y is replaced with y + 1 (step S447) and the processing returns to step S443. If step S446 determines that no such station remains, the character string of the speech recognition result of the client-side device 1_1 as last processed in step S445 is output, paired with the ID, as the processed speech recognition result of the client-side device 1_1 (step S448).
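
The flow of steps S441 to S448 can be sketched as a loop over the broadcast stations. This is an illustrative sketch, not the patented implementation: the function name, the dictionary keyed by station index y, the `(text, time)` word representation, and the 0.5-second tolerance are all assumptions, and at least one station (M >= 1) is assumed to be present.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float]  # (partial character string, time in seconds)


def process_client_result(
    client_words: List[Word],
    client_id: str,
    station_words: Dict[int, List[Word]],  # keyed by station index y = 1..M
    tolerance_s: float = 0.5,
) -> Tuple[List[Word], str]:
    """Remove broadcast speech from one client's recognition result."""
    remaining = list(client_words)          # S441: read the client result
    y = 1                                   # S442: initialise y
    while True:
        broadcast = station_words[y]        # S443: read the result of station 5_y
        matched = [                         # S444: search for text-and-time matches
            w for w in remaining
            if any(w[0] == b[0] and abs(w[1] - b[1]) <= tolerance_s
                   for b in broadcast)
        ]
        if matched:                         # S445: remove every matching word
            remaining = [w for w in remaining if w not in matched]
        if y >= len(station_words):         # S446: is any station left?
            return remaining, client_id     # S448: output the result with the ID
        y += 1                              # S447: move on to the next station
```

Because each station only ever removes words, the order in which stations are visited does not change the final result in this sketch.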

Next, an example of the speech recognition results and the processed speech recognition result in this example is described with reference to FIG. 13. The horizontal axis of FIG. 13 is time. The three items above the arrow are the inputs to the speech recognition result processing unit 44, namely the speech recognition results of the client-side device 1_1, the public broadcast station 5_1, and the public broadcast station 5_2; the one item below the arrow is the processed speech recognition result of the client-side device 1_1. The speech recognition result of the client-side device 1_1 contains the partial character strings of the recognition results of utterance 1 and utterance 2, which were uttered by the first user, the user of the client-side device 1_1, and the partial character strings of the recognition results of TV audio 1 and TV audio 2, which were emitted by a television near the client-side device 1_1. The speech recognition result of the public broadcast station 5_1 contains the partial character strings of the recognition results of TV audio 1 and TV audio 2, which are speech contained in the acoustic signal broadcast by the public broadcast station 5_1. The speech recognition result of the public broadcast station 5_2 contains the partial character strings of the recognition results of TV audio 3 and TV audio 4, which are speech contained in the acoustic signal broadcast by the public broadcast station 5_2.

First, the processing of steps S444 and S445 of FIG. 12 for y = 1 is described. Of the partial character strings contained in the speech recognition result of the client-side device 1_1, those for TV audio 1 and TV audio 2 also appear at the same times in the speech recognition result of the public broadcast station 5_1, and are therefore removed from the speech recognition result of the client-side device 1_1. Those for utterance 1 and utterance 2 do not appear at the same times in the speech recognition result of the public broadcast station 5_1, and are therefore not removed. The speech recognition result of the client-side device 1_1 thus retains only the partial character strings for utterance 1 and utterance 2, and processing proceeds to y = 2.

Next, the processing of steps S444 and S445 of FIG. 12 for y = 2 is described. The partial character strings for utterance 1 and utterance 2 in the speech recognition result of the client-side device 1_1 do not appear at the same times in the speech recognition result of the public broadcast station 5_2, and are therefore not removed. The speech recognition result of the client-side device 1_1 thus still retains the partial character strings for utterance 1 and utterance 2.

When steps S444 and S445 of FIG. 12 have been completed for y = 2, step S446 determines that no public broadcast station remains for which steps S443 to S445 have not been completed, and in step S448 the speech recognition result retaining the partial character strings for utterance 1 and utterance 2 is output as the processed speech recognition result.
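
The walk-through of FIG. 13 can be reproduced with a toy timeline. In the sketch below the labels ("utterance 1", "TV audio 1", and so on) stand in for the recognized partial character strings, and the times are invented for illustration; for brevity the sketch also requires times to match exactly rather than within a tolerance.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float]  # (label of a partial character string, time in seconds)

# Hypothetical timeline standing in for FIG. 13; labels and times are invented.
client_1_1: List[Word] = [
    ("utterance 1", 0.0), ("TV audio 1", 1.0),
    ("utterance 2", 2.0), ("TV audio 2", 3.0),
]
stations: Dict[str, List[Word]] = {
    "5_1": [("TV audio 1", 1.0), ("TV audio 2", 3.0)],
    "5_2": [("TV audio 3", 1.0), ("TV audio 4", 3.0)],
}

remaining = list(client_1_1)
trace: Dict[str, List[str]] = {}
for name, broadcast in stations.items():   # y = 1, then y = 2
    # Steps S444/S445: drop words the station also has at the same time.
    remaining = [w for w in remaining if w not in broadcast]
    trace[name] = [label for label, _ in remaining]

print(trace)
```

After station 5_1 the trace holds only the two utterances, and station 5_2 leaves it unchanged, which mirrors the y = 1 and y = 2 passes described above.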

Using the speech recognition system according to the first operation example of the second embodiment makes it possible to solve Problem 1: it reduces, compared with the conventional art, the likelihood of obtaining a speech recognition result different from the one the speaker wants, and hence the likelihood of obtaining a search result different from the one the speaker wants.

[[Second operation example of the second embodiment]]
As a second operation example, a case is described in which, for one client-side device to be processed, only those public broadcast stations in which the same partial character string as in the client-side device appears at the same time on multiple occasions are targeted. When such a public broadcast station contains the same partial character string as the client-side device to be processed (a common partial character string) at the same time, the processed speech recognition result is obtained by removing the common partial character string from the character string of the speech recognition result of the client-side device to be processed. The second operation example differs from the first in the operation of the speech recognition result processing unit 44 of the cloud-side device 2. Only the differences from the first operation example are described below.

For each set of speech recognition result, time information, and ID of at least one client-side device stored in the speech recognition result holding unit 43, the speech recognition result processing unit 44 of the cloud-side device 2 considers only those public broadcasts for which the sets of speech recognition result and time information stored in the speech recognition result holding unit 43 contain multiple pairs of partial character string and time that match the client-side device. From the character string of the speech recognition result of the client-side device, it removes every partial character string whose pair of partial character string and time matches the speech recognition result of such a public broadcast, takes the remainder as the processed speech recognition result, and outputs the processed speech recognition result paired with the ID. As in the first operation example, two times may be judged to be the same after allowing for absolute-time errors between the client-side device and the public broadcasting station or the cloud-side device, time errors in the speech recognition processing, and the like.

That is, in the second operation example, for at least one client-side device, when some public broadcast contains multiple partial character strings that are the same as those of the client-side device (common partial character strings) at substantially the same times, the common partial character strings of each such public broadcast are removed from the character string of the speech recognition result of the client-side device, and the remainder is obtained as the processed speech recognition result.
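The selection rule of the second operation example can be illustrated with a minimal Python sketch. It is not the implementation of the speech recognition result processing unit 44. The function names, the representation of a recognition result as a list of `(partial character string, time)` pairs, the reading of "multiple" as two or more, and the one-second tolerance are all assumptions made for illustration.

```python
# Hypothetical sketch: tokens are (partial_string, time_in_seconds) pairs.
Token = tuple[str, float]


def times_match(t1: float, t2: float, tolerance: float = 1.0) -> bool:
    """Treat two times as 'substantially the same' within an assumed tolerance."""
    return abs(t1 - t2) <= tolerance


def process_second_example(
    client_tokens: list[Token],
    broadcasts: dict[str, list[Token]],
    min_matches: int = 2,
    tolerance: float = 1.0,
) -> list[Token]:
    """Remove common partial strings, but only for broadcasts with >= min_matches matches."""
    to_remove: set[int] = set()
    for station_tokens in broadcasts.values():
        # Indices of client tokens whose (string, time) pair also occurs in this broadcast.
        matched = [
            i
            for i, (text, t) in enumerate(client_tokens)
            if any(
                text == b_text and times_match(t, b_t, tolerance)
                for b_text, b_t in station_tokens
            )
        ]
        # A single coincidental match is ignored; multiple matches mark the
        # broadcast as environmental sound picked up by the client.
        if len(matched) >= min_matches:
            to_remove.update(matched)
    return [tok for i, tok in enumerate(client_tokens) if i not in to_remove]
```

With this rule, a broadcast that shares only one `(partial character string, time)` pair with the client leaves the client result untouched, which is the behavior that distinguishes the second operation example from the first.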

In the first operation example, if the user happens to utter the same content at the same time as a television or radio broadcast that is not present as environmental sound around the user, the speech recognition result of the user's utterance is removed. In the second operation example, by contrast, the speech recognition result of the user's utterance is not removed unless the user utters the same content as such a broadcast at the same time on multiple occasions. The possibility that the user happens to utter the same content at the same time as a television or radio broadcast that is not present as environmental sound is extremely low, and the possibility of this happening multiple times is lower still. Therefore, the speech recognition system according to the second operation example of the second embodiment keeps the possibility of losing the speech recognition result of the speaker's own utterance, which is the result the speaker wants, lower than in the first operation example, while reducing, compared with the conventional art, the possibility that speech recognition results of environmental sounds such as television, radio, and announcement broadcasts, which the speaker does not want, are included.

[[Third operation example of the second embodiment]]
As a third operation example, an example that uses position information in addition to the time information of the first operation example is described.
The third operation example differs from the first operation example in the operations of the user information acquisition units 12_1 to 12_N of the client-side devices 1_1 to 1_N, and of the speech recognition unit 22, the broadcast receiving unit 41, the broadcast speech recognition unit 42, the speech recognition result holding unit 43, and the speech recognition result processing unit 44 of the cloud-side device 2. Hereinafter, only the differences from the first operation example are described.

The user information acquisition unit 12_1 of the client-side device 1_1 in the third operation example obtains the time information and the position information at which the voice input unit 11_1 acquired the acoustic signal, and outputs them as user information to the voice sending unit 13_1. The position information is, for example, information representing an absolute position such as latitude and longitude. When the client-side device is a smartphone with a built-in GPS receiver, the latitude and longitude measured by the GPS receiver when the microphone serving as the voice input unit 11_1 acquired the acoustic signal may be used as the position information. When the client-side device is a smartphone with an auxiliary positioning function based on Wi-Fi base stations or beacons, the latitude and longitude measured by the auxiliary positioning unit may be used as the position information. The position information may instead be information representing the relative positional relationship among multiple client-side devices. For example, in the case of a smart TV or STB, a region code, a postal code, a beacon code received from a nearby beacon, or a region-specific ID that takes the same value throughout a mesh cell of latitude and longitude, such as a geohash ID, may be used as the information representing the relative positional relationship. The user information acquisition units 12_2 to 12_N of the client-side devices 1_2 to 1_N operate in the same manner as the user information acquisition unit 12_1 of the client-side device 1_1.

The speech recognition unit 22 of the cloud-side device 2 in the third operation example performs speech recognition processing on each acoustic signal output by the voice receiving unit 21, obtains a speech recognition result that is a character string corresponding to the speech contained in the acoustic signal, and outputs a set consisting of the speech recognition result, the time information corresponding to it, the position information corresponding to it, and the ID corresponding to it. The speech recognition processing, the speech recognition result, the time information corresponding to the speech recognition result, and the ID corresponding to the speech recognition result are the same as in the first operation example. The position information paired with a speech recognition result is the position information corresponding to that result, that is, the position information contained in the user information that was input from the voice receiving unit 21 together with the acoustic signal from which the result was obtained. When the user information input together with the acoustic signal contains multiple pieces of position information for one speech recognition result, one piece of position information representing them is paired with the speech recognition result. The representative position information may be anything that makes it possible to approximately identify the position where the acoustic signal corresponding to the speech recognition result was emitted. For example, it may be any one of the multiple pieces of position information, or it may be position information expressing the average of the latitudes and the average of the longitudes contained in the multiple pieces of position information.
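As an illustration of the averaging option described above, the following Python sketch computes one representative position from several latitude and longitude pairs. The function name and the tuple representation are assumptions, not part of the described apparatus.

```python
def representative_position(positions: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (mean latitude, mean longitude) of the given (lat, lon) pairs.

    Assumption: all positions are close together and do not straddle the
    +/-180 degree meridian, where a plain arithmetic mean of longitudes
    would be misleading.
    """
    if not positions:
        raise ValueError("at least one position is required")
    mean_lat = sum(lat for lat, _ in positions) / len(positions)
    mean_lon = sum(lon for _, lon in positions) / len(positions)
    return (mean_lat, mean_lon)
```

Picking any single element of `positions` would satisfy the description equally well; averaging is only one of the two options the text names.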

Among the operations performed by the broadcast receiving unit 41 of the cloud-side device 2 in the third operation example, the only difference from the first operation example is that the broadcasting station ID is always output; that is, a set of acoustic signal, time information, and broadcasting station ID is output. The other operations are the same as in the first operation example.

Among the operations performed by the broadcast speech recognition unit 42 of the cloud-side device 2 in the third operation example, the only difference from the first operation example is that the broadcasting station ID is always input and output; that is, a set of acoustic signal, time information, and broadcasting station ID is input, and a set of the character string of the speech recognition result, time information, and broadcasting station ID is output. The other operations are the same as in the first operation example.

The speech recognition result holding unit 43 of the cloud-side device 2 in the third operation example stores the sets of speech recognition result, time information, position information, and ID output by the speech recognition unit 22, and the sets of speech recognition result, time information, and broadcasting station ID output by the broadcast speech recognition unit 42. The stored contents are used by the speech recognition result processing unit 44 in three processes: determining whether a client-side device is in the reception target area of a public broadcast, determining whether there is a partial character string such as a word with a common time, and, when there is one, removing it from the speech recognition result to obtain the processed speech recognition result. The speech recognition result holding unit 43 therefore keeps both kinds of sets for as long as the processing of the speech recognition result processing unit 44 requires. The stored contents may be deleted once the processing of the speech recognition result processing unit 44 that uses them has finished.

For each set of speech recognition result, time information, position information, and ID of at least one client-side device stored in the speech recognition result holding unit 43, the speech recognition result processing unit 44 of the cloud-side device 2 in the third operation example examines the sets of speech recognition result, time information, and broadcasting station ID of each public broadcast stored in the speech recognition result holding unit 43. When the client-side device is in the reception target area of a public broadcast and a pair of partial character string and time matches, it removes the matching partial character string, takes the remainder as the processed speech recognition result, and outputs the processed speech recognition result paired with the ID. A processed speech recognition result is therefore output for at least one client-side device.

Whether the client-side device is in the reception target area of a public broadcast may be determined by any known method of matching pieces of information that each identify an absolute position. For example, a storage unit (not shown) in the speech recognition result processing unit 44 may store in advance, for each broadcasting station ID of a public broadcast, information on the countries, prefectures, municipalities, and the like that the broadcast targets, together with a map annotated with latitude and longitude. The country, prefecture, municipality, or the like in which the client-side device is located is then obtained from the latitude and longitude specified by the position information of the client-side device, and the determination is made according to whether it corresponds to the countries, prefectures, municipalities, and the like targeted by the public broadcast. Alternatively, for example, the storage unit (not shown) in the speech recognition result processing unit 44 may store in advance the latitude and longitude range of the reception target area for each broadcasting station ID, and the determination may be made according to whether the latitude and longitude of the position where the acoustic signal corresponding to the speech recognition result was emitted, as specified by the position information of the client-side device, falls within that range.
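The second determination method, a stored latitude and longitude range per broadcasting station ID, can be sketched in Python as follows. The class name, the rectangular shape of the area, the station ID, and the coordinate values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceptionArea:
    """Assumed rectangular reception target area in degrees."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.min_lat <= lat <= self.max_lat and self.min_lon <= lon <= self.max_lon


def client_in_reception_area(
    station_id: str,
    lat: float,
    lon: float,
    areas: dict[str, ReceptionArea],
) -> bool:
    """True if the station is known and the client position lies inside its area."""
    area = areas.get(station_id)
    return area is not None and area.contains(lat, lon)


# Hypothetical table that would be stored in advance for each broadcasting station ID.
EXAMPLE_AREAS = {"station-east": ReceptionArea(34.9, 37.2, 138.4, 140.9)}
```

A real reception area is unlikely to be a rectangle; a polygon test or the administrative-region lookup described first in the paragraph above would replace `contains` in that case.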

Note that, in determining whether times match, two times may be judged to be the same after allowing for absolute-time errors between the client-side device and the public broadcasting station or the cloud-side device, time errors in the speech recognition processing, and the like.

That is, for at least one client-side device, when a public broadcast whose reception target area contains the client-side device includes, at substantially the same time, the same partial character string as the client-side device (a common partial character string), the speech recognition result processing unit 44 of the cloud-side device 2 in the third operation example removes the common partial character string from the character string of the speech recognition result of the client-side device and obtains the remainder as the processed speech recognition result.

Note that the process of removing common partial character strings need not be performed for every public broadcasting station whose reception target area contains the client-side device; it may be performed for at least one public broadcast. In that case, only the speech recognition results needed by the speech recognition result processing unit 44 may be obtained in the preceding stages. That is, the operations of the broadcast receiving unit 41, the broadcast speech recognition unit 42, and the speech recognition result holding unit 43 for obtaining speech recognition results that the speech recognition result processing unit 44 does not need may be omitted.

Here, an example in which the at least one client-side device is the client-side device 1_1 is described using the processing flow of FIG. 14. In the example of FIG. 14, the speech recognition results of all public broadcasting stations are examined in order, starting from the public broadcasting station 5_1. For each station, it is determined whether the client-side device 1_1 is in the reception target area of that station. If it is, the speech recognition result of that station is searched for pairs of partial character string and time that match the speech recognition result of the client-side device 1_1, and any partial character string whose pair matches (a common partial character string) is removed from the character string of the speech recognition result of the client-side device 1_1.

The speech recognition result processing unit 44 first reads the set of speech recognition result, time information, position information, and ID of the client-side device 1_1 from the speech recognition result holding unit 43 (step S441A). Next, it sets the initial value of y to 1 (step S442). Next, it reads the set of speech recognition result, time information, and broadcasting station ID of the public broadcasting station 5_y from the speech recognition result holding unit 43 (step S443A). Next, comparing the set for the client-side device 1_1 with the set for the public broadcasting station 5_y, it checks whether the position information of the client-side device 1_1 is included in the reception target area of the public broadcasting station 5_y and whether there is a partial character string whose string and time both match (step S444A). If the condition of step S444A is satisfied, it removes all partial character strings whose string and time match from the character string of the speech recognition result of the client-side device 1_1 (step S445A). If the condition of step S444A is not satisfied, the processing proceeds to step S446. Next, it determines whether any public broadcasting station remains that has not yet been subjected to steps S443A, S444A, and S445A (step S446). If such a station remains, y is replaced with y+1 (step S447), and the processing is repeated from step S443A for the next station. If no such station remains, the character string of the speech recognition result of the client-side device 1_1 as processed by the last executed step S445A is output, paired with the ID, as the character string of the processed speech recognition result of the client-side device 1_1 (step S448).
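The loop of steps S441A to S448 can be sketched in Python as follows. This is an illustrative reading of the flow, not the patented implementation. The dictionary layout, the rectangular reception area given as `(min_lat, max_lat, min_lon, max_lon)`, and the one-second time tolerance are assumptions.

```python
def process_fig14(client_result: dict, stations: list[dict], tolerance: float = 1.0):
    """Remove broadcast speech from one client's result, station by station.

    client_result: {"id": str, "position": (lat, lon), "tokens": [(string, time), ...]}
    stations:      [{"station_id": str,
                     "area": (min_lat, max_lat, min_lon, max_lon),
                     "tokens": [(string, time), ...]}, ...]
    """
    tokens = list(client_result["tokens"])       # S441A: read the client's set
    lat, lon = client_result["position"]
    for station in stations:                     # S442 / S446 / S447: y = 1, 2, ...
        min_lat, max_lat, min_lon, max_lon = station["area"]   # S443A: read station set
        in_area = min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        if not in_area:                          # S444A not satisfied: go to next station
            continue
        # S444A / S445A: drop every client token whose string and time both match.
        tokens = [
            (text, t)
            for text, t in tokens
            if not any(
                text == b_text and abs(t - b_t) <= tolerance
                for b_text, b_t in station["tokens"]
            )
        ]
    return client_result["id"], tokens           # S448: processed result paired with ID
```

If no station matches, the list comprehension leaves `tokens` unchanged, which corresponds to skipping step S445A when no matching pair is found.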

By using the speech recognition system according to the third operation example of the second embodiment, Problem 1 can be solved while keeping the possibility of losing the speech recognition result of the speaker's own utterance, which is the result the speaker wants, lower than in the first operation example. This reduces, compared with the conventional art, the possibility of obtaining a speech recognition result different from the one the speaker wants, and likewise the possibility of obtaining search results different from those the speaker wants.

Third Embodiment
Next, as a third embodiment of the present invention, a form is described in which the portion common to the speech recognition result of the acoustic signal being played back on the client-side device is removed from the speech recognition result of the acoustic signal, containing the speaker's voice, that is input to the client-side device. The configuration of the speech recognition system in the third embodiment is the same as that of the first embodiment, and FIG. 1 is the block diagram showing it. Reference numeral 100 denotes the speech recognition system, reference numerals 1_1 to 1_N denote one or more client-side devices (N devices, where N is an integer of 1 or more), and reference numeral 2 denotes the cloud-side device. In the third embodiment, the client-side devices 1_1 to 1_N have a function of playing back content stored in the client-side device and/or a function of playing back content while the client-side device downloads it via the network 3. "Stored content" may also take the form of inserted media. The content includes an acoustic signal containing at least speech in Japanese or a foreign language, such as dialogue. Examples are audiovisual signals such as a movie recorded on the client-side device, a packaged program purchased by download, or VOD that the client-side device plays back while downloading.

Since the minimum configuration of the client-side devices 1_1 to 1_N is the same for all of them, the following description uses FIG. 15, a detailed block diagram of the part of the speech recognition system 100 of the third embodiment formed by the client-side device 1_1 and the cloud-side device 2. Components in FIG. 15 that bear the same reference numerals as in FIG. 11 perform the same operations as in FIG. 11.

The client-side device 1_1 includes at least a voice input unit 11_1, a user information acquisition unit 12_1, a voice sending unit 13_1, a search result receiving unit 14_1, a screen display unit 15_1, a content information acquisition unit 16_1, and a content information sending unit 17_1. The voice input unit 11_1, the user information acquisition unit 12_1, the voice sending unit 13_1, the search result receiving unit 14_1, and the screen display unit 15_1 of the client-side device 1_1 each operate in the same way as the corresponding unit of the client-side device 1_1 in the first operation example of the second embodiment.

The cloud-side device 2 includes at least a voice receiving unit 21, a speech recognition unit 22, a content speech recognition result storage unit 60, a content information receiving unit 61, a content speech recognition result acquisition unit 62, a speech recognition result holding unit 63, a speech recognition result processing unit 64, a search processing unit 25, and a search result sending unit 26. The voice receiving unit 21, the speech recognition unit 22, the search processing unit 25, and the search result sending unit 26 of the cloud-side device 2 each operate in the same way as the corresponding unit of the cloud-side device 2 in the first operation example of the second embodiment.

Hereinafter, the parts that differ from the first operation example of the second embodiment are described.

For the content that the client-side device 1_1 is currently playing back, the content information acquisition unit 16_1 of the client-side device 1_1 acquires identification information that can identify the content (hereinafter referred to as the "content ID") and a relative time indicating the point in the content currently being played back (the so-called playback position), and outputs them to the content information sending unit 17_1. Even for the same movie, the dialogue differs depending on the selected language, such as Japanese or a foreign language. In the following description, content that supports audio in multiple languages is treated as having a different content ID for each language, and further explanation of this point is omitted. The relative time indicating the point currently being played back is, for example, the number of seconds needed to reach that point when the content is played back at standard speed from its beginning, or a time stamp assigned to the content in advance.

The content information sending unit 17_1 of the client-side device 1_1 combines identification information that can identify the client-side device 1_1 (hereinafter referred to as the "ID"), the content ID and relative time output by the content information acquisition unit 16_1, and the time information output by the user information acquisition unit 12_1, and sends to the cloud-side device 2 a third transmission signal, which is a transmission signal containing the set of ID, content ID, relative time, and time information. The identification information may be determined using, for example, an optical media information database or an existing external cloud service that can obtain a content ID and a relative time (playback position) from fragmentary audio data.

The content speech recognition result storage unit 60 of the cloud-side device 2 stores in advance, for content such as movies, the content ID, the character string of the speech recognition result obtained by performing speech recognition on the acoustic signal of the content, and the relative time, in association with one another. The character string of the speech recognition result and the relative time are stored by pairing each partial character string contained in the character string with its relative time.

The content information receiving unit 61 of the cloud-side device 2 receives the third transmission signal sent by the content information sending unit 17_1 of the client-side device 1_1, obtains the set of ID, content ID, relative time, and time information contained in the third transmission signal, and outputs it to the content speech recognition result acquisition unit 62.

The content speech recognition result acquisition unit 62 of the cloud-side device 2 searches the content speech recognition result storage unit 60 using the content ID and relative time output by the content information receiving unit 61, and obtains, for each partial character string contained in the character string of the speech recognition result of the content corresponding to that content ID, the pair of the partial character string and its relative time. It then replaces each relative time with the corresponding time information output by the content information receiving unit 61, thereby generating, for each partial character string, a pair of the partial character string and time information, and outputs the generated pairs together with the ID to the speech recognition result holding unit 63.
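The lookup and time replacement described above might look like the following Python sketch. The in-memory dictionary standing in for the content speech recognition result storage unit 60, the function name, and the 0.5-second tolerance used to match a reported relative time to a stored one are all assumptions made for illustration.

```python
def content_tokens_with_time(
    content_db: dict[str, list[tuple[str, float]]],
    content_id: str,
    reports: list[tuple[float, float]],
    tolerance: float = 0.5,
) -> list[tuple[str, float]]:
    """Map stored (partial_string, relative_time) pairs to (partial_string, time_info).

    content_db: content ID -> [(partial_string, relative_time), ...]
    reports:    [(relative_time, time_info), ...] as received from the client
    """
    entries = content_db.get(content_id, [])
    result = []
    for reported_rel, time_info in reports:
        for text, stored_rel in entries:
            # The stored relative time is replaced by the time information that
            # the client reported for (approximately) the same playback position.
            if abs(stored_rel - reported_rel) <= tolerance:
                result.append((text, time_info))
    return result
```

The returned pairs use the same time base as the client's own speech recognition result, so they can be compared with it directly in the later removal step.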

The speech recognition result holding unit 63 of the cloud-side device 2 stores the sets of speech recognition result, time information, and ID output by the speech recognition unit 22, and the sets output by the content speech recognition result acquisition unit 62, each of which combines the ID with the pairs of partial character string and time information for the partial character strings contained in the character string of a speech recognition result. The stored contents are used by the speech recognition result processing unit 64 to determine whether there is a partial character string such as a word with a common time and, when there is one, to remove it from the speech recognition result and obtain the processed speech recognition result. The stored contents may therefore be deleted once the processing of the speech recognition result processing unit 64 that uses them has finished.

For each set of speech recognition result, time information, and ID of at least one client-side device stored in the speech recognition result holding unit 63, the speech recognition result processing unit 64 of the cloud-side device 2 checks the sets stored in the speech recognition result holding unit 63 that combine an ID with the pairs of partial character string and time information for each partial character string of a speech recognition result character string. If any pair matches in both partial character string and time, it removes the matching partial character strings to obtain the processed speech recognition result, and outputs the processed speech recognition result together with the ID. When judging whether times match, two times may be treated as the same with allowance for, for example, timing errors in the speech recognition processing. FIG. 16 shows the processing flow for the case where the at least one client-side device is the client-side device 1_1.
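The removal step described above can be sketched as follows, assuming each recognition result is a list of `(text, time)` tuples. The function name and the `tolerance` parameter, which stands in for the allowance for timing errors, are hypothetical.

```python
def remove_content_speech(client_result, content_result, tolerance=0.5):
    """Drop every client-side substring that also appears in the content's
    recognition result at (approximately) the same time.

    client_result, content_result: lists of (text, time_in_seconds) tuples.
    tolerance: assumed allowance, in seconds, for timing error.
    """
    def matches_content(item):
        text, time = item
        return any(
            text == c_text and abs(time - c_time) <= tolerance
            for c_text, c_time in content_result
        )

    return [item for item in client_result if not matches_content(item)]


# Hypothetical data: the user says "play" and "next" while a film line is audible.
client_result = [("play", 10.0), ("hello", 10.2), ("next", 11.0)]
content_result = [("hello", 10.0), ("next", 20.0)]
processed = remove_content_speech(client_result, content_result)
```

Here "hello" is removed because both text and time agree within the tolerance, while "next" is kept because the content contains the same word only at a different time.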

Thus, according to the third embodiment, when a movie or VOD is being played, for example, the content ID and the playback position in seconds are sent to the cloud-side device 2 together with the acoustic signal acquired by the client-side device. The cloud-side device 2 then searches the DB in which content speech recognition results are accumulated, using the content ID and the playback position, and obtains the speech recognition result of the content's audio (character strings of announcements and dialogue). By treating this result as noise and excluding it from the speech recognition result character string of the acoustic signal acquired by the client-side device, the probability of misrecognizing the utterance made by the user of the client-side device can be reduced.

That is, using the speech recognition system according to the third embodiment makes it possible to solve Problem 3.

<Fourth Embodiment>
Next, as a fourth embodiment of the present invention, a form in which the input of a search instruction is made explicit will be described. Here, the form in which the input of a search instruction is made explicit in the second embodiment of FIG. 11 is described with reference to FIG. 17. FIG. 17 is a detailed block diagram of the portion of the speech recognition system 100 of the fourth embodiment, corresponding to the second embodiment, that is constituted by the client-side device 1_1 and the cloud-side device 2. The configuration shown in FIG. 17 differs from that shown in FIG. 11 in that the client-side device 1_1 additionally includes at least a search instruction input unit 18_1. The components of FIG. 17 other than the search instruction input unit 18_1 are the same as in FIG. 11. Only the differences from the description of the second embodiment are described below.

[[Operation Example of the Fourth Embodiment]]
As an operation example of the fourth embodiment, the following describes the case in which a first user utters, to the client-side device 1_1, a sentence for which search results are desired, and the search results corresponding to that utterance are displayed on the screen display unit 15_1 of the client-side device 1_1.

The search instruction input unit 18_1 of the client-side device 1_1 accepts the input of a search start instruction when the first user utters a sentence for which search results are desired, and outputs the accepted search start instruction to the voice input unit 11_1, the user information acquisition unit 12_1, and the voice sending unit 13_1. The search start instruction can also be regarded as an instruction to start both speech recognition and search. For example, when the client-side device 1_1 is a smartphone, the search instruction input unit 18_1 consists of a voice search start button displayed on the screen and detection means for detecting that the button has been touched.

The voice input unit 11_1 of the client-side device 1_1 acquires an acoustic signal in accordance with the search start instruction output by the search instruction input unit 18_1, and outputs the acquired acoustic signal to the voice sending unit 13_1. For example, the voice input unit 11_1 starts acquiring the acoustic signal when the search start instruction is input, and ends acquisition when a predetermined time has elapsed from the time the instruction was input. Alternatively, for example, the voice input unit 11_1 includes utterance presence detection means (not shown), starts acquiring the acoustic signal when the search start instruction is input, and ends acquisition when the utterance presence detection means judges that the utterance has ceased.
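The two ways of ending acquisition described above can be sketched as follows. This is only an illustration under stated assumptions: audio is modeled as a list of `(timestamp_ms, samples)` frames, `is_speech` is a hypothetical stand-in for the utterance presence detection means, and the `hangover` count of consecutive non-speech frames used to decide that the utterance has ceased is an assumed detail.

```python
def capture_fixed(frames, start_ms, duration_ms):
    """Keep frames from the search start instruction until a fixed time has elapsed."""
    return [f for f in frames if start_ms <= f[0] < start_ms + duration_ms]


def capture_until_silence(frames, start_ms, is_speech, hangover=2):
    """Keep frames from the search start instruction until speech, once begun,
    has been absent for `hangover` consecutive frames."""
    captured = []
    heard_speech = False
    silent_run = 0
    for timestamp, samples in frames:
        if timestamp < start_ms:
            continue  # before the search start instruction
        captured.append((timestamp, samples))
        if is_speech(samples):
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover:
                break  # utterance judged to have ceased
    return captured


# Hypothetical frames: the second element is a toy energy value, not real audio.
frames = [(0, 0), (100, 0), (200, 5), (300, 6), (400, 0), (500, 0), (600, 7)]
fixed = capture_fixed(frames, start_ms=100, duration_ms=300)
until_silence = capture_until_silence(frames, start_ms=100, is_speech=lambda s: s > 1)
```

The second function waits until speech has actually been observed before counting silence, so that a pause between pressing the button and starting to speak does not end the capture immediately.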

In accordance with the search start instruction output by the search instruction input unit 18_1, the user information acquisition unit 12_1 of the client-side device 1_1 obtains time information indicating when the voice input unit 11_1 acquired the acoustic signal, and outputs this time information, together with identification information capable of identifying the client-side device 1_1 (hereinafter referred to as the "ID"), to the voice sending unit 13_1 as user information. For example, the user information acquisition unit 12_1 obtains time information while the voice input unit 11_1 is acquiring and outputting the acoustic signal, and outputs the obtained time information and the ID to the voice sending unit 13_1 as user information.

In accordance with the search start instruction output by the search instruction input unit 18_1, the voice sending unit 13_1 of the client-side device 1_1 sends to the cloud-side device 2 a transmission signal containing the acoustic signal output by the voice input unit 11_1 and the user information output by the user information acquisition unit 12_1.

The subsequent operation of the speech recognition system 100 in this operation example of the fourth embodiment is the same as in the first operation example of the second embodiment.

With this configuration, triggered by the search instruction input unit 18_1 of the client-side device 1_1 accepting the input of a search start instruction, the search results corresponding to the sentence uttered by the first user can be displayed on the screen display unit 15_1 of the client-side device 1_1.

For the operation examples of the second embodiment other than the first, and for the first and third embodiments, the operation of the speech recognition system 100 with an explicit search instruction input is the same as above, so a detailed description is omitted. In each case, triggered by the search instruction input unit of the client-side device accepting the input of a search start instruction, the search results corresponding to the sentence uttered by the user can be displayed on the screen display unit of the client-side device.

<Modification of the Fourth Embodiment>
Next, as a modification of the fourth embodiment of the present invention, a form that uses the acoustic signal from before the search instruction was input is described with reference to FIG. 18. FIG. 18 is a detailed block diagram of the portion, constituted by the client-side device 1_1 and the cloud-side device 2, of the speech recognition system 100 according to the modification of the fourth embodiment shown in FIG. 17. The configuration shown in FIG. 18 differs from that shown in FIG. 17 in that the client-side device 1_1 additionally includes at least a voice holding unit 19_1. The components of FIG. 18 other than the voice holding unit 19_1 are the same as in FIG. 11. Only the differences from the fourth embodiment are described below.

[[Operation Example of the Modification of the Fourth Embodiment]]
As an operation example of the modification of the fourth embodiment, the same case as in the operation example of the fourth embodiment is described: a first user utters, to the client-side device 1_1, a sentence for which search results are desired, and the search results corresponding to that utterance are displayed on the screen display unit 15_1 of the client-side device 1_1.

The voice input unit 11_1 of the client-side device 1_1 acquires the acoustic signal at all times. When a search start instruction is input from the search instruction input unit 18_1, the voice input unit 11_1 outputs the acquired acoustic signal to the voice sending unit 13_1 in accordance with that instruction. For example, when a search start instruction is input from the search instruction input unit 18_1, the voice input unit 11_1 outputs to the voice sending unit 13_1 the acoustic signal from the time the instruction was input until a predetermined time has elapsed from that time. The voice input unit 11_1 also outputs all acquired acoustic signals to the voice holding unit 19_1.

The user information acquisition unit 12_1 of the client-side device 1_1 continuously obtains time information indicating when the voice input unit 11_1 acquired the acoustic signal. When a search start instruction is input from the search instruction input unit 18_1, the user information acquisition unit 12_1, in accordance with that instruction, outputs to the voice sending unit 13_1, as user information, the time information of the acoustic signal that the voice input unit 11_1 outputs to the voice sending unit 13_1, together with identification information capable of identifying the client-side device 1_1 (hereinafter referred to as the "ID"). The user information acquisition unit 12_1 also outputs all acquired time information to the voice holding unit 19_1.

The voice holding unit 19_1 of the client-side device 1_1 stores, in a storage unit (not shown), each acoustic signal input from the voice input unit 11_1 paired with the time information input from the user information acquisition unit 12_1, and deletes from the storage unit any pair for which a predetermined time has elapsed relative to the latest pair. In other words, the voice holding unit 19_1 retains the acoustic signals input from the voice input unit 11_1 and their corresponding time information only for the most recent predetermined time. The predetermined time is set in advance and is, for example, on the order of ten-odd seconds to several minutes. When a search start instruction is input from the search instruction input unit 18_1, the voice holding unit 19_1 outputs the pairs of acoustic signal and time information stored in the storage unit to the voice sending unit 13_1. That is, when a search start instruction is input from the search instruction input unit 18_1, the voice holding unit 19_1 outputs the most recent predetermined time's worth of acoustic signal and its time information to the voice sending unit 13_1.
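The retention behavior of the voice holding unit described above resembles a time-bounded buffer. The following is a minimal sketch under that reading; the class name `VoiceHoldingBuffer`, its methods, and the use of plain numbers for timestamps and strings for audio chunks are assumptions made for illustration.

```python
from collections import deque


class VoiceHoldingBuffer:
    """Keeps (timestamp, samples) pairs for only the most recent `retention` seconds."""

    def __init__(self, retention):
        self.retention = retention
        self._items = deque()

    def push(self, timestamp, samples):
        """Store a new pair and delete pairs older than the retention period,
        measured from the newest pair."""
        self._items.append((timestamp, samples))
        while self._items and timestamp - self._items[0][0] > self.retention:
            self._items.popleft()

    def snapshot(self):
        """Return the retained pairs, oldest first, e.g. when a search start
        instruction arrives and the pairs are handed to the sending side."""
        return list(self._items)


# Hypothetical usage: one chunk every 10 seconds, 30 seconds retained.
buffer = VoiceHoldingBuffer(retention=30)
for t in (0, 10, 20, 30, 40, 50):
    buffer.push(t, f"chunk@{t}")
held = buffer.snapshot()
```

A deque is used here because both appending at the newest end and discarding from the oldest end are cheap, which suits a buffer that is updated continuously.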

In accordance with the search start instruction input from the search instruction input unit 18_1, the voice sending unit 13_1 of the client-side device 1_1 sends to the cloud-side device 2 a transmission signal containing the acoustic signal input from the voice input unit 11_1, the user information input from the user information acquisition unit 12_1, and the acoustic signal and its time information input from the voice holding unit 19_1. That is, the voice sending unit 13_1 sends to the cloud-side device 2 a transmission signal containing the following: the acoustic signal, with its time information, from the time the search start instruction was input until a predetermined time has elapsed from that time; the acoustic signal, with its time information, for the predetermined time preceding the input of the search start instruction; and the ID of the client-side device 1_1.

The operations of the voice receiving unit 21, the speech recognition unit 22, the broadcast receiving unit 41, and the broadcast speech recognition unit 42 of the cloud-side device 2 are the same as those of the corresponding units in the second embodiment.

The speech recognition result processing unit 44 of the cloud-side device 2 first takes each set of speech recognition result, time information, and ID of at least one client-side device stored in the speech recognition result holding unit 43, and considers only those public broadcasts for which, among the sets of public broadcast speech recognition result and time information stored in the speech recognition result holding unit 43, there are multiple matching pairs of partial character string and time. For those broadcasts, it removes from the client-side device's speech recognition result character string each partial character string whose pair of partial character string and time matches the broadcast's speech recognition result, obtaining the post-removal character string. The speech recognition result processing unit 44 then removes from this post-removal character string the partial character strings that precede the time at which the search start instruction was input, takes the result as the processed speech recognition result, and outputs the processed speech recognition result together with the ID.

Next, an example of the speech recognition results and the processed speech recognition result in this operation example is described with reference to FIG. 19. The horizontal axis of FIG. 19 is time: T_0 is the time at which the search instruction was input, T_A is the time at which a predetermined time has elapsed from T_0, and T_B is the time a predetermined time before T_0. The three items above the upper thick arrow are the inputs to the speech recognition result processing unit 44, namely the speech recognition results of the client-side device 1_1, the public broadcasting station 5_1, and the public broadcasting station 5_2. The single item below the lower thick arrow is the processed speech recognition result of the client-side device 1_1.

The speech recognition result of the client-side device 1_1 contains, as the result for the period from T_0 to T_A, the partial character string of the speech recognition result of utterance 2, an utterance made by the first user (the user of the client-side device 1_1), and the partial character string of the speech recognition result of TV audio 2, audio emitted by a television near the client-side device 1_1. It also contains, as the result for the period from T_B to T_0, the partial character string of the speech recognition result of utterance 1, an utterance made by the first user, and the partial character string of the speech recognition result of TV audio 1, audio emitted by a television near the client-side device 1_1.

The speech recognition result of the public broadcasting station 5_1 contains, as the result for the period from T_0 to T_A, the partial character string of the speech recognition result of TV audio 2, which is speech contained in the acoustic signal broadcast by the public broadcasting station 5_1. It also contains, as the result for the period from T_B to T_0, the partial character string of the speech recognition result of TV audio 1, which is speech contained in the acoustic signal broadcast by the public broadcasting station 5_1.

The speech recognition result of the public broadcasting station 5_2 contains, as the result for the period from T_0 to T_A, the partial character string of the speech recognition result of TV audio 4, which is speech contained in the acoustic signal broadcast by the public broadcasting station 5_2. It also contains, as the result for the period from T_B to T_0, the partial character string of the speech recognition result of TV audio 3, which is speech contained in the acoustic signal broadcast by the public broadcasting station 5_2.

Between T_B and T_A, the speech recognition result character strings of the client-side device 1_1 and the public broadcasting station 5_1 have two partial character strings that match in both content and time: the speech recognition results of TV audio 1 and TV audio 2. Because multiple partial character strings match in both content and time, the public broadcasting station 5_1 is subject to removal. The speech recognition result processing unit 44 therefore removes from the character string of the client-side device 1_1 every partial character string that also appears at the same time in the character string of the public broadcasting station 5_1, namely the partial character strings of TV audio 1 and TV audio 2. Between T_B and T_A, the character strings of the client-side device 1_1 and the public broadcasting station 5_2 have no partial character string that matches in both content and time. Because it does not have multiple matching partial character strings, the public broadcasting station 5_2 is not subject to removal. The result of applying the removal processing up to this point to the character string of the client-side device 1_1 is what is illustrated between the upper and lower thick arrows in FIG. 19.

Next, the speech recognition result processing unit 44 removes from the speech recognition result character string of the client-side device 1_1 the partial character string lying between T_B and T_0, namely that of the speech recognition result of utterance 1. As a result, only the partial character string of the speech recognition result of utterance 2 remains, and this is output as the processed speech recognition result of the client-side device 1_1.
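The two-stage processing walked through above for FIG. 19 can be sketched as follows. This is an illustrative sketch only: results are modeled as lists of `(text, time)` tuples, and the function name, the `tolerance` for time comparison, the `min_matches` threshold of two, and the concrete times (T_B = 0, T_0 = 10, T_A = 20) are assumptions chosen for the example.

```python
def process_result(client, broadcasts, t0, tolerance=0.5, min_matches=2):
    """Stage 1: for each broadcast sharing at least `min_matches` substrings
    (same text, approximately same time) with the client result, remove those
    substrings. Stage 2: remove what precedes the search start instruction t0.
    """
    def same(a, b):
        return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance

    remaining = list(client)
    for station_result in broadcasts.values():
        matched = [c for c in client if any(same(c, s) for s in station_result)]
        if len(matched) >= min_matches:
            remaining = [c for c in remaining if c not in matched]
    return [c for c in remaining if c[1] >= t0]


# Hypothetical data laid out like FIG. 19.
client = [("utterance 1", 2), ("tv audio 1", 5), ("utterance 2", 12), ("tv audio 2", 15)]
broadcasts = {
    "station 5_1": [("tv audio 1", 5), ("tv audio 2", 15)],
    "station 5_2": [("tv audio 3", 5), ("tv audio 4", 15)],
}
processed = process_result(client, broadcasts, t0=10)
```

With this data, station 5_1 shares two substrings with the client and is therefore used for removal, station 5_2 shares none, and utterance 1 is dropped in the second stage because it precedes T_0, leaving only utterance 2. Including the audio before T_0 is what gives the first stage enough material to find multiple matches.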

The subsequent operation of the speech recognition system 100 in this operation example of the modification of the fourth embodiment is the same as in the operation example of the fourth embodiment.

As in the modification of the fourth embodiment, the speech recognition system 100 may also be operated using the acoustic signal from before the search start instruction in all of the first to third embodiments and their operation examples, for example by providing the voice holding unit 19_1.

Configuring the system to use the acoustic signal from before the search start instruction, as in the modification of the fourth embodiment, makes the response faster than a configuration that does not use it. This applies particularly to configurations in which removal processing targets other client-side devices or public broadcasting stations that share multiple partial character strings, as in the third operation example of the first embodiment and the second operation example of the second embodiment.

<Embodiment of the Speech Recognition Device>
The speech recognition system described above has a configuration in which the client-side devices 1_1 to 1_N and the cloud-side device 2 are connected via the network 3, but the cloud-side device 2 may consist of a plurality of server devices or the like. The speech recognition system also need not be a cloud-type system and may instead be a stand-alone speech recognition device, that is, a speech recognition device in which the configuration of the cloud-side device 2 is provided within the client-side devices 1_1 to 1_N.

In the description above, examples were given in which the speech recognition result is applied to information retrieval, but the speech recognition result may be used in any way. That is, a speech recognition device may be constituted by only the essential parts of the portion formed by the client-side device 1_1 and the cloud-side device 2 shown in FIG. 2 and FIG. 11. Such speech recognition devices are described with reference to FIG. 20.

[[First Example of the Speech Recognition Device]]
FIG. 20(A) is a block diagram showing a first example of the speech recognition device. The speech recognition device 700 of the first example includes at least a speech recognition unit 710 and a speech recognition result processing unit 720.

The speech recognition unit 710 of the speech recognition device 700 corresponds to the speech recognition unit 22 of FIG. 2. For example, the speech recognition unit 710 performs speech recognition on each of the following acoustic signals: a first acoustic signal, collected so as to include the first speaker's voice by first sound collecting means such as the microphone of the smartphone of the first speaker, who is the speech recognition target; and second to N-th acoustic signals, collected respectively by second to N-th sound collecting means that differ from the first sound collecting means, such as the microphones of the smartphones of second to N-th users (N is an integer of 2 or more) different from the first speaker. It thereby obtains first to N-th speech recognition results, one for each acoustic signal. Here, the first to N-th acoustic signals are, for example, acoustic signals that include the same time. For example, the first acoustic signal includes speech uttered by the first speaker as the speech recognition target, and the second to N-th acoustic signals are acoustic signals whose start and end are at absolute times identical or close to those of the first acoustic signal.

The speech recognition result processing unit 720 of the speech recognition device 700 corresponds to the speech recognition result holding unit 23 and the speech recognition result processing unit 24 of FIG. 2. For example, when a partial speech recognition result contained in at least one of the second to N-th speech recognition results and a partial speech recognition result contained in the first speech recognition result have identical content and correspond to acoustic signals at substantially the same time, the speech recognition result processing unit 720 deletes that partial speech recognition result from the first speech recognition result and takes what remains as the speech recognition result of the first speaker. The speech recognition result processing unit 720 may also be configured to delete a partial speech recognition result from the first speech recognition result only when, in addition to the content being identical and the times being substantially the same, the second to N-th sound collecting means are located near the first sound collecting means.
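The deletion rule of the first example, including the optional proximity condition, might look like the following sketch. Positions as `(x, y)` coordinates in meters, Euclidean distance as the measure of "near", and the names `filter_first_result`, `tolerance`, and `max_distance` are all assumptions introduced here; the text above does not specify how proximity is determined.

```python
import math


def filter_first_result(first, others, tolerance=0.5, first_pos=None, max_distance=None):
    """Delete from the first speaker's result every partial result that another
    sound collector also produced with identical content at about the same time.

    first: list of (text, time) tuples for the first acoustic signal.
    others: list of (position, results) for the second to N-th signals, where
            position is an (x, y) tuple and results is a list of (text, time).
    first_pos, max_distance: if both are given, only collectors within
            max_distance of first_pos are considered (optional proximity rule).
    """
    def nearby(pos):
        if first_pos is None or max_distance is None:
            return True
        return math.dist(first_pos, pos) <= max_distance

    def heard_elsewhere(item):
        return any(
            nearby(pos) and text == item[0] and abs(time - item[1]) <= tolerance
            for pos, results in others
            for text, time in results
        )

    return [item for item in first if not heard_elsewhere(item)]


# Hypothetical data: one collector is 1 m away, another is 500 m away.
first = [("search ramen", 3.0), ("breaking news", 3.2)]
others = [
    ((1.0, 0.0), [("breaking news", 3.0)]),
    ((500.0, 0.0), [("search ramen", 3.0)]),
]
without_proximity = filter_first_result(first, others)
with_proximity = filter_first_result(first, others, first_pos=(0.0, 0.0), max_distance=10.0)
```

In this example the proximity condition prevents a coincidental match from a distant device from deleting the first speaker's own utterance.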

[[Second example of speech recognition device]]
FIG. 20(B) is a block diagram showing a second example of the speech recognition apparatus. The speech recognition apparatus 701 of the second example includes at least a speech recognition unit 711 and a speech recognition result processing unit 721.

The speech recognition unit 711 of the speech recognition apparatus 701 corresponds to the speech recognition unit 22 and the broadcast speech recognition unit 42 in FIG. 11. For example, the speech recognition unit 711 performs speech recognition on each of a first acoustic signal and first to M-th broadcast acoustic signals (M is an integer of 1 or more), and obtains a first speech recognition result and first to M-th broadcast speech recognition results as the speech recognition results for the respective acoustic signals. The first acoustic signal is an acoustic signal that includes the voice of a first speaker, who is the speech recognition target, and is collected by a first sound collecting means such as the microphone of the first speaker's smartphone. The first to M-th broadcast acoustic signals are the acoustic signals of one or more broadcasts. Here, the first acoustic signal and the first to M-th broadcast acoustic signals are, for example, acoustic signals that include the same time. For example, the first acoustic signal includes the voice uttered by the first speaker as the speech recognition target, and the start and end of each of the first to M-th broadcast acoustic signals are at absolute times equal or close to those of the first acoustic signal.

The speech recognition result processing unit 721 of the speech recognition apparatus 701 corresponds to the speech recognition result holding unit 43 and the speech recognition result processing unit 44 in FIG. 11. For example, when a partial speech recognition result included in at least one of the first to M-th broadcast speech recognition results and a partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time, the speech recognition result processing unit 721 deletes that partial speech recognition result from the first speech recognition result and obtains the remainder as the speech recognition result of the first speaker. The speech recognition result processing unit 721 may also be configured to consider only the speech recognition results of those broadcasts whose reception target area contains the first sound collecting means, and to delete a partial speech recognition result from the first speech recognition result when its content is the same and its time is substantially the same.
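The optional reception-area restriction can be sketched in the same spirit. Everything concrete here is an assumption made for illustration: partial speech recognition results are plain `(text, start, end)` tuples, each broadcast is described by a hypothetical set of area names together with its recognition result, the station and area names are invented, and the 0.5 second tolerance is arbitrary.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical partial speech recognition result: (text, start_seconds, end_seconds)
Segment = Tuple[str, float, float]
# Hypothetical broadcast entry: (reception areas, recognition result of that broadcast)
Broadcast = Tuple[Set[str], List[Segment]]


def receivable(broadcasts: Dict[str, Broadcast], client_area: str) -> List[List[Segment]]:
    """Keep only the results of broadcasts whose reception area covers the client."""
    return [segs for areas, segs in broadcasts.values() if client_area in areas]


def matches(a: Segment, b: Segment, tol: float) -> bool:
    """Same content, with start and end each within tol seconds."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= tol and abs(a[2] - b[2]) <= tol


def remove_broadcast_segments(first: List[Segment],
                              broadcasts: Dict[str, Broadcast],
                              client_area: str,
                              tol: float = 0.5) -> List[Segment]:
    """Delete from the first result the segments shared with receivable broadcasts."""
    candidates = receivable(broadcasts, client_area)
    return [s for s in first
            if not any(matches(s, b, tol) for segs in candidates for b in segs)]


broadcasts = {
    "station_a": ({"tokyo"}, [("the next train", 5.0, 6.0)]),
    "station_b": ({"osaka"}, [("play music", 6.5, 7.5)]),
}
first = [("the next train", 5.1, 6.1), ("play music", 6.5, 7.5)]

print(remove_broadcast_segments(first, broadcasts, "tokyo"))
```

With the client located in the hypothetical "tokyo" area, only station_a is considered, so "play music" is kept even though a broadcast that cannot be received there happened to contain the same words at the same time.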

With these speech recognition techniques, even when a user speaks in an environment where environmental sound such as television, radio, or announcement broadcasts is present at relatively high volume, the speech recognition results for the environmental sound can be removed with high accuracy. Reducing the chance that unneeded speech recognition results are included improves the speech recognition rate for the speaker's voice.

The above description assumes that the speech recognition result is a character string. When the speech recognition result is expressed as, for example, a sequence of symbols representing phonemes, a phoneme symbol sequence may be used instead of the character string. In other words, the character strings and partial character strings of the speech recognition results in the above description are, like phoneme symbol sequences and partial phoneme symbol sequences, one example of the content of a speech recognition result or of a part of it.

All or part of the speech recognition system in the embodiments described above may be implemented by a computer. In that case, a program for implementing the functions may be recorded on a computer-readable recording medium, and the program recorded on the medium may be read into and executed by a computer system. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period, such as the volatile memory inside the computer system serving as the server or client in that case. The program may implement only some of the functions described above, may implement them in combination with a program already recorded in the computer system, or may be realized using hardware such as a PLD (Programmable Logic Device) or an FPGA (Field Programmable Gate Array).

Embodiments of the present invention have been described above with reference to the drawings, but they are merely examples of the present invention, and it is clear that the present invention is not limited to them. Accordingly, components may be added, omitted, replaced, or otherwise changed without departing from the technical idea and scope of the present invention.

Even when speech of relatively high volume is present in the background of the speech that the speaker wants recognized, the present invention performs speech recognition on the input acoustic signal containing the speaker's voice to obtain a recognition result, performs speech recognition on another input acoustic signal to obtain a further recognition result, and removes what the recognition results have in common from the speech recognition result of the input acoustic signal. Reducing the chance that unneeded speech recognition results are included in this way improves the speech recognition rate for the speaker's voice. The present invention is therefore applicable to a variety of uses in which it is essential to obtain a speech recognition result for the speaker's voice alone, even when speech of relatively high volume is present in the background.

For example, in the living room of an ordinary home, the present invention removes the need to keep the voice input unit away from speaker sound sources such as a TV or an AV amplifier, makes a remote control with a speech recognition microphone unnecessary, and allows speech recognition using only a high-sensitivity microphone inside the device housing. Convenience therefore improves through lower costs for terminal systems that are sensitive to device cost, a lighter remote control, and lower remote control power consumption.

The present invention is also useful for applying speech recognition where television or radio with few silent or unvoiced intervals is being broadcast, and at venues where on-site announcements are repeated in multiple languages, such as Olympic venues, stations, airports, lecture halls, trains, and public viewing venues.

According to the present invention, even inside an automobile, speech recognition commands can be uttered and related information can be searched for by speech recognition at any time, without concern for the voices (announcements, dialogue) from a stand-alone car TV, traffic information, or car radio. This is expected, for example, to make cloud-linked voice agent services for automobiles more convenient.

Likewise, in voice command recognition by an automated telephone response system at a corporate call center, the influence of everyday speech noise such as television or radio audio that often plays in the background at the user's home can be suppressed. This improves accuracy and, as a secondary benefit, can be expected to reduce costs by reducing the work of operator intervention.

1_1, 1_2, 1_3, 1_N ... client-side device, 2 ... cloud-side device, 3 ... network, 100 ... speech recognition system, 11_1 ... voice input unit, 12_1 ... user information acquisition unit, 13_1 ... voice transmission unit, 14_1 ... search result receiving unit, 15_1 ... screen display unit, 21 ... voice receiving unit, 22 ... speech recognition unit, 23 ... speech recognition result holding unit, 24 ... speech recognition result processing unit, 25 ... search processing unit, 26 ... search result transmission unit, 5_1, 5_M ... public broadcasting station, 41 ... broadcast receiving unit, 42 ... broadcast speech recognition unit, 43 ... speech recognition result holding unit, 44 ... speech recognition result processing unit, 16_1 ... content information acquisition unit, 17_1 ... content information transmission unit, 60 ... content speech recognition result storage unit, 61 ... content information receiving unit, 62 ... content speech recognition result acquisition unit, 63 ... speech recognition result holding unit, 64 ... speech recognition result processing unit, 18_1 ... search instruction input unit, 19_1 ... voice holding unit, 700 ... speech recognition apparatus, 710 ... speech recognition unit, 720 ... speech recognition result processing unit, 701 ... speech recognition apparatus, 711 ... speech recognition unit, 721 ... speech recognition result processing unit

Claims

A speech recognition apparatus comprising:
speech recognition means for performing speech recognition on each of a first acoustic signal, which is an acoustic signal that includes the voice of a first speaker and is collected by first sound collecting means, and second to N-th acoustic signals, which are acoustic signals collected respectively by second to N-th sound collecting means (N is an integer of 2 or more), being one or more sound collecting means different from the first sound collecting means, to obtain first to N-th speech recognition results as the speech recognition results for the respective acoustic signals; and
speech recognition result processing means for obtaining, as the speech recognition result of the first speaker, the result of deleting a partial speech recognition result from the first speech recognition result when a partial speech recognition result included in at least one of the second to N-th speech recognition results and the partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time,
wherein the speech recognition result processing means
targets only those of the second to N-th speech recognition results for which there are a plurality of partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time, obtains all partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time, and obtains, as the speech recognition result of the first speaker, the result of deleting the obtained partial speech recognition results from the first speech recognition result.

A speech recognition apparatus comprising:
speech recognition means for performing speech recognition on each of a first acoustic signal, which is an acoustic signal that includes the voice of a first speaker and is collected by first sound collecting means, and first to M-th broadcast acoustic signals (M is an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results as the speech recognition results for the respective acoustic signals; and
speech recognition result processing means for obtaining, as the speech recognition result of the first speaker, the result of deleting a partial speech recognition result from the first speech recognition result when a partial speech recognition result included in at least one of the first to M-th broadcast speech recognition results and the partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time,
wherein the speech recognition result processing means
targets only those of the first to M-th broadcast speech recognition results for which there are a plurality of partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time,
obtains all partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time, and obtains, as the speech recognition result of the first speaker, the result of deleting the obtained partial speech recognition results from the first speech recognition result.

The speech recognition apparatus according to claim 1 or 2, further comprising:
speech recognition instruction input means for receiving an instruction to start speech recognition processing; and
voice holding means for holding acoustic signals for a predetermined time,
wherein the speech recognition means,
when the speech recognition instruction input means receives the instruction to start speech recognition processing, performs speech recognition on acoustic signals that include the acoustic signals held in the voice holding means, and obtains speech recognition results that include past speech, and
the speech recognition result processing means
uses the speech recognition results that include past speech to delete partial speech recognition results, including past ones, from the first speech recognition result, and obtains the speech recognition result of the first speaker.

A speech recognition method comprising:
a speech recognition step of performing speech recognition on each of a first acoustic signal, which is an acoustic signal that includes the voice of a first speaker and is collected by first sound collecting means, and second to N-th acoustic signals, which are acoustic signals collected respectively by second to N-th sound collecting means (N is an integer of 2 or more), being one or more sound collecting means different from the first sound collecting means, to obtain first to N-th speech recognition results as the speech recognition results for the respective acoustic signals; and
a speech recognition result processing step of obtaining, as the speech recognition result of the first speaker, the result of deleting a partial speech recognition result from the first speech recognition result when a partial speech recognition result included in at least one of the second to N-th speech recognition results and the partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time,
wherein, in the speech recognition result processing step,
only those of the second to N-th speech recognition results for which there are a plurality of partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time are targeted, all partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time are obtained, and the result of deleting the obtained partial speech recognition results from the first speech recognition result is obtained as the speech recognition result of the first speaker.

A speech recognition method comprising:
a speech recognition step of performing speech recognition on each of a first acoustic signal, which is an acoustic signal that includes the voice of a first speaker and is collected by first sound collecting means, and first to M-th broadcast acoustic signals (M is an integer of 1 or more), which are the acoustic signals of one or more broadcasts, to obtain a first speech recognition result and first to M-th broadcast speech recognition results as the speech recognition results for the respective acoustic signals; and
a speech recognition result processing step of obtaining, as the speech recognition result of the first speaker, the result of deleting a partial speech recognition result from the first speech recognition result when a partial speech recognition result included in at least one of the first to M-th broadcast speech recognition results and the partial speech recognition result included in the first speech recognition result have the same content and correspond to acoustic signals at substantially the same time,
wherein, in the speech recognition result processing step,
only those of the first to M-th broadcast speech recognition results for which there are a plurality of partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time are targeted,
all partial speech recognition results that have the same content as a partial speech recognition result included in the first speech recognition result and correspond to acoustic signals at substantially the same time are obtained, and the result of deleting the obtained partial speech recognition results from the first speech recognition result is obtained as the speech recognition result of the first speaker.

A speech recognition program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 3.