JP6322125B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP6322125B2
Application number: JP2014241123A
Authority: JP
Inventors: 滋藤村; 大喜渡邊; 智広山田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-28
Filing date: 2014-11-28
Publication date: 2018-05-09
Anticipated expiration: 2034-11-28
Also published as: JP2016102899A

Description

The present invention relates to a technique for performing speech recognition that takes background audio into account.

It goes without saying that, for services on the web, providing a highly accurate means of voice input is important for user convenience. In recent years, the terminals that users browse the web with have diversified, and their processing capabilities vary widely; some terminals are not suited to heavy computation. For this reason, speech recognition is often performed on the server side of a server-client model. Methods that strengthen robustness to noise and improve accuracy when speech recognition is performed in a server-client model have been studied.

Furthermore, the speech recognition methods that have become mainstream in recent years are based on statistical machine learning and output the most probable hypothesis as the recognition result. In other words, the recognition result is only ever an estimate (see Non-Patent Document 1).

Non-Patent Document 1: Tatsuya Kawahara, "A Study on the Methodology of Speech Recognition: Toward a Generational Change" (「音声認識の方法論に関する考察-世代交代に向けて-」), IPSJ SIG Technical Report (SLP), Spoken Language Information Processing, 2014-SLP-100(3), pp. 1-5, 2014.

Japanese has many words that are pronounced identically but written with different characters. For example, the reading "sendai" (せんだい) corresponds to two place names, 仙台 (Sendai, Miyagi Prefecture) and 川内 (Sendai, Kagoshima Prefecture). The two cannot be distinguished by sound alone, so the intended one is difficult to estimate when audio data is the only input to speech recognition. English likewise has words that are pronounced identically but spelled differently.

The present invention has been made in view of the above problem, and its object is to provide a technique that improves the accuracy of speech recognition and presents the speech recognition result the user intended.

To solve the above problem, the present invention is a speech recognition apparatus comprising: an identification unit that separates audio data into utterance audio data spoken by a user and background audio data by determining the distance to each sound source using phase difference information, extracts audio feature information from the separated background audio data, and identifies the content the user is using by means of an audio feature storage unit in which audio feature information of a plurality of contents is stored; and a speech recognition unit that performs speech recognition on the utterance audio data to convert it into at least one piece of text candidate data, and determines, using related information about the identified content, the text data to be presented to the user from among the text candidate data.

The present invention is also a speech recognition method performed by a speech recognition apparatus, comprising: a separation step of separating audio data into utterance audio data spoken by a user and background audio data by determining the distance to each sound source using phase difference information; an identification step of extracting audio feature information from the separated background audio data and identifying the content the user is using by means of an audio feature storage unit in which audio feature information of a plurality of contents is stored; a conversion step of performing speech recognition on the utterance audio data to convert it into at least one piece of text candidate data; and a determination step of determining, using related information about the identified content, the text data to be presented to the user from among the text candidate data.

The present invention is also a speech recognition program that causes a computer to function as the speech recognition apparatus.

According to the present invention, it is possible to provide a technique that improves the accuracy of speech recognition and presents the speech recognition result the user intended.

FIG. 1 is a block diagram showing the configuration of the speech recognition system of an embodiment of the present invention.
FIG. 2 is a sequence diagram showing the processing of the speech recognition system.
FIG. 3 is a flowchart showing the content identification process.
FIG. 4 is a diagram showing an example of the related information DB.
FIG. 5 is a diagram showing a specific example of the speech recognition process.
FIG. 6 is a block diagram showing the configuration of an information distribution system that is a modification of this embodiment.

Embodiments of the present invention are described below with reference to the drawings.

It is now common to search the web for information related to a piece of content (for example, a broadcast program on television) while viewing that content. If voice input is used for such a search, the input audio data contains not only the speech that constitutes the user's search request (the utterance audio) but also the audio of the content playing in the background (the background audio). In this embodiment, the background audio is used to identify the content the user is currently using, and information related to that content serves as an information source for recognizing the spoken search request with high accuracy.

FIG. 1 is a configuration diagram showing the configuration of the speech recognition system of this embodiment. The illustrated speech recognition system comprises a client 1 and a server 2 (the speech recognition apparatus). In this embodiment, the client 1 has no speech recognition function of its own and is assumed to use the speech recognition function on the server 2 side (through a web page).

At present, web pages are generally viewed through a browser, so the client 1 is a browser. A web page normally consists of HTML, CSS, and JavaScript, and in the client 1 each functional unit described below is implemented by a program such as JavaScript.

The client 1 is a user terminal used by the user; a smartphone, tablet, PC, or the like can be used. The illustrated client 1 comprises an audio acquisition unit 11, a position acquisition unit 12, a communication unit 13, and a result display unit 14.

The audio acquisition unit 11 acquires audio data using a microphone (not shown) of the client 1. Conventionally, acquiring audio from a microphone in a browser generally required special software, called a plug-in, to be installed in the browser beforehand. In recent years, however, the APIs (Application Programming Interfaces) usable from JavaScript in the browser, which belong to HTML5 in the broad sense and are maturing rapidly, have made it possible to acquire audio from the microphone without installing any special software.

Being achievable with these JavaScript APIs means that audio can be acquired from the microphone using nothing more than what is written in the web page delivered by the web server. Specifically, by using getUserMedia and the Web Audio API, the audio acquisition unit 11 can pass the acquired audio data to the communication unit 13 as a stream.

The position acquisition unit 12 acquires position information (location information) of the client 1. Like the audio acquisition unit 11, it can obtain this information through an API available from JavaScript. Specifically, the current position can be obtained with the navigator.geolocation.getCurrentPosition function of the Geolocation API.

The communication unit 13 sends the audio data acquired by the audio acquisition unit 11 and the position information acquired by the position acquisition unit 12 to the server 2 over the network. The audio data is sent to the server 2 as a stream, using WebSocket, one of the features of HTML5. The position information is sent to the server 2 as soon as the position acquisition unit 12 has obtained it. The communication unit 13 also receives information sent from the server and passes it to the result display unit 14.

The result display unit 14 displays the recognition result and other information received from the server 2 via the communication unit 13 on a display (not shown).

In this embodiment, the audio acquisition unit 11 acquires the audio data of the content playing in the background (background audio data) even before the user's utterance (utterance audio data) is input, and the communication unit 13 keeps sending that audio data to the server 2. This increases the amount of audio data available for identifying the content the user is viewing, which improves the accuracy of content identification.

The server 2 performs speech recognition on the audio data sent from the client 1 and provides the recognition result to the client 1. There are no constraints on the implementation language of the server 2. The illustrated server 2 comprises a communication unit 21, a content identification unit 22, a speech recognition unit 23, an audio feature DB (database) 24, and a related information DB 25.

The communication unit 21 receives the audio data and position information sent from the client 1, passes both to the content identification unit 22, and passes the audio data to the speech recognition unit 23. It also sends the recognition result of the speech recognition unit 23 to the client 1.

The content identification unit 22 extracts audio feature information from the background audio data that accompanies the audio data input by the user, and identifies the content the user is using by means of the audio feature DB 24. In this embodiment, the content identification unit 22 also acquires the user's position information, narrows the content stored in the audio feature DB 24 down to the broadcast programs corresponding to that position, and identifies the content the user is viewing by matching the feature information of each remaining content item against the feature information of the background audio data. In other words, position information is used to raise confidence before the audio data is matched against the audio feature DB 24. The content identification unit 22 passes information identifying the content (for example, a content ID) to the speech recognition unit 23.

The speech recognition unit 23 performs speech recognition on the audio data to convert it into text candidate data, that is, at least one recognition result candidate, and uses related information about the identified content to determine the text data to present to the user from among the text candidate data. Specifically, the speech recognition unit 23 recognizes the audio data of the user's utterance received from the communication unit 21, obtains from the related information DB 25 the related information of the content identified by the content identification unit 22, and raises the probability that text candidate data related to the content is selected. The speech recognition unit 23 then sends the selected text candidate data to the client 1 via the communication unit 21.

The audio feature DB 24 stores audio feature information of a plurality of contents. The related information DB 25 stores related information about a plurality of contents.

The client 1 and the server 2 described above can each be implemented on, for example, a general-purpose computer system comprising a CPU, memory, an external storage device such as a hard disk, an input device, and an output device. In such a computer system, the functions of each unit are realized by the CPU executing a predetermined program loaded into memory. For example, the functions of the client 1 are realized by the CPU of the client 1 (the user terminal) executing the client program, and those of the server 2 by the CPU of the server 2 executing the server program.

The program for the client 1 and the program for the server 2 can be stored on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD-ROM, or distributed over a network.

The processing of this embodiment is described below.

FIG. 2 is a sequence diagram showing the processing of the speech recognition system. The audio acquisition unit 11 of the client 1 acquires audio data from the microphone (S1) and sends it to the server 2 using the communication unit 13 (S2). The audio data contains the utterance audio data input by the user speaking and the background audio data of the content the user is using (viewing). The audio data is sent to the server 2 as a stream.

The position acquisition unit 12 of the client 1 acquires position information (S3) and sends it to the server 2 using the communication unit 13 (S4). Steps S1 and S2 run asynchronously with steps S3 and S4: S3 and S4 may run before S1 and S2, or between S1 and S2, so the order of processing is not limited to the example shown in FIG. 2.

The content identification unit 22 of the server 2 extracts audio feature information from the background audio data in the audio data sent from the client 1 and identifies the content the user is using by means of the audio feature DB 24 (S5). The speech recognition unit 23 of the server 2 then performs speech recognition on the utterance audio data in the audio data sent from the client 1, converts it into at least one recognition result candidate in text form, and, referring to the related information DB 25, uses the related information about the identified content to determine the text data to present to the user from among the text candidate data (S6). The processing of S5 and S6 is described later.

The speech recognition unit 23 then sends the determined text data to the client 1 using the communication unit 21 (S7). The result display unit 14 of the client 1 displays the recognition result received from the server 2 through the communication unit 13 on the display (S8).

The processing of S5 and S6 is described in detail below.

In the content identification process of S5, the audio data and position information obtained via the communication unit 21 are used to identify the content that the user of the client 1 is using. A known method for identifying content from audio data is audio fingerprinting, one of the automatic content recognition (ACR) techniques.

Audio fingerprinting is described in Japanese Unexamined Patent Application Publication No. 2004-326050 (hereinafter "Document 1"). Specifically, the audio data is divided into segments of fixed length, for example about 20 to 40 milliseconds; feature information that characterizes each segment is extracted as a multidimensional vector; and the extracted feature information is matched against the feature information of each content item stored in an audio feature database to identify the content. Mel-frequency cepstral coefficients, for example, can be used as the multidimensional vector.
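The framing and matching steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Document 1: the two-element feature vector (log energy and zero-crossing rate) is a simple stand-in for mel-frequency cepstral coefficients, and the function names, the 25 ms frame length, and the mean Euclidean distance used for matching are all hypothetical choices made for this example.

```python
import math


def split_frames(samples, sample_rate, frame_ms=25):
    """Split a sample sequence into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 2:
        raise ValueError("frame too short for the given sample rate")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_features(frame):
    """Toy feature vector for one frame: (log energy, zero-crossing rate).

    Assumption: a stand-in for MFCCs, chosen only to keep the sketch short.
    """
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-12), crossings / (len(frame) - 1))


def fingerprint(samples, sample_rate, frame_ms=25):
    """Feature information for a whole signal: one vector per frame."""
    return [frame_features(f) for f in split_frames(samples, sample_rate, frame_ms)]


def distance(fp_a, fp_b):
    """Mean Euclidean distance between frame-aligned feature vectors."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return math.inf
    return sum(math.dist(a, b) for a, b in zip(fp_a, fp_b)) / n
```

A query fingerprint would be compared against each stored fingerprint with `distance`, and the content with the smallest distance taken as the match. A practical system would also have to search over time offsets, which this sketch omits.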

The following description takes as an example the case where the content is a broadcast program such as a television program. In this case, the audio feature DB 24 stores feature information of the audio data of programs currently being broadcast. The server 2 (or an external system, not shown) receives the broadcast program carried on each broadcast wave, divides the audio data of each program into fixed-length segments as described above, extracts the feature information that characterizes each segment, and stores it in the audio feature DB 24 as it arrives. A certain time lag arises between the feature information obtained from the audio data sent by the client 1 and the feature information of each broadcast program stored in the audio feature DB 24. Even so, because the audio of the program the user is viewing at the client 1 is sent to the server 2 in near real time, and its feature information is extracted and matched against the feature information of each program in the audio feature DB 24, this embodiment can identify the program the user is viewing even when it is currently on air.

The audio feature DB 24 stores, for each broadcast program, a program ID in association with the feature information of its audio data.

Alternatively, the audio feature DB 24 may store the audio data of each broadcast program itself rather than its feature information. In that case, in the content identification process of S5, the content identification unit 22 extracts the feature information of each broadcast program from the audio data stored in the audio feature DB 24 and matches it against the feature information of the audio data sent from the client 1.

FIG. 3 is a flowchart showing the flow of the process (S5) of identifying the content the user is viewing. The case where the content is a broadcast program is described as an example.

First, the content identification unit 22 determines whether the position information sent from the client 1 has been acquired (S11). If it has (S11: YES), the unit uses the position information to narrow down (that is, to identify) the broadcast programs currently on air in the region to which the position belongs. The content identification unit 22 then extracts audio feature information from the audio data (background audio data) sent from the client 1, matches it against the feature information of each of the narrowed-down broadcast programs in the audio feature DB 24, and estimates the program the user is currently viewing (S12).

That is, for each narrowed-down broadcast program in the audio feature DB 24, the content identification unit 22 matches the stored feature information against the feature information of the audio data and calculates a probability value (also called a confidence) as the matching result. The content identification unit 22 can also improve identification accuracy by using the signal strength of the audio data received from the client 1 via the communication unit 21 over a fixed period: portions with low signal strength are treated as containing only background audio data, and only those portions are used. Furthermore, when the audio data received from the client 1 via the communication unit 21 consists of multiple channels (stereo audio data), sound sources can be separated by using phase difference information to determine the distance to each source. Treating the components judged to be far away as background audio improves the accuracy of content identification, and performing speech recognition on the nearby components as the utterance to be recognized improves the accuracy of speech recognition.
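The signal-strength heuristic described above can be sketched as follows. The text only says that low-strength portions are treated as background-only; the concrete rule used here (mean power below a fixed fraction of the loudest frame) and the names `background_only_frames` and `ratio` are assumptions made for illustration.

```python
def background_only_frames(frames, ratio=0.5):
    """Keep the frames assumed to contain background audio only.

    Assumption: a frame counts as background-only when its mean power is
    below `ratio` times the mean power of the loudest frame in the window.
    """
    if not frames:
        return []
    powers = [sum(x * x for x in frame) / len(frame) for frame in frames]
    threshold = ratio * max(powers)
    return [frame for frame, power in zip(frames, powers) if power < threshold]
```

Only the frames returned here would be passed on to feature extraction and matching. The phase-difference separation for stereo input is a separate technique and is not covered by this sketch.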

If there is a broadcast program whose probability value exceeds a predetermined threshold (S13: YES), the content identification unit 22 identifies that program as the one the user is currently viewing and outputs the program ID identifying it to the speech recognition unit 23 (S15).

If, on the other hand, none of the programs being broadcast in that region has a probability value above the threshold (S13: NO), the content identification unit 22 matches the feature information of all broadcast programs stored in the audio feature DB 24 against the feature information of the audio data sent from the client 1 and calculates a probability value for each (S14). It then identifies the program with the highest probability value as the one the user is currently viewing and outputs its program ID to the speech recognition unit 23 (S15).
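The S11 to S15 flow can be summarized in a short sketch. The `Program` record, its `region` field, the `score` callback that returns a probability value, and the 0.8 threshold are hypothetical. Two further assumptions are made where the text is silent: when no position information is available the flow goes straight to S14, and when several regional programs exceed the threshold the highest-scoring one is taken.

```python
from dataclasses import dataclass


@dataclass
class Program:
    program_id: str
    region: str        # hypothetical field: area where the program is on air
    features: object   # stored audio feature information (any representation)


def best_match(query, programs, score):
    """Return (probability, program) for the best-scoring program, or None."""
    scored = [(score(query, p.features), p) for p in programs]
    return max(scored, key=lambda pair: pair[0]) if scored else None


def identify_program(query, programs, score, region=None, threshold=0.8):
    """Return the ID of the program judged to be the one being viewed."""
    programs = list(programs)
    if region is not None:                                   # S11: YES
        local = [p for p in programs if p.region == region]  # narrow by region
        hit = best_match(query, local, score)                # S12
        if hit is not None and hit[0] > threshold:           # S13: YES
            return hit[1].program_id                         # S15
    hit = best_match(query, programs, score)                 # S14: all programs
    return hit[1].program_id if hit else None                # S15
```

Narrowing by region first keeps the number of comparisons small in the common case, while the fallback to the full database covers users whose reported position does not match the broadcast they are watching.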

The user is not necessarily watching a program as it is broadcast; the user may be playing back a recording, so-called time-shifted viewing. If the audio feature DB 24 also holds feature information of past broadcast programs, matching against the feature information of all stored programs, past ones included, makes it possible to identify the program being viewed even when it was broadcast in the past.

Next, in the speech recognition process of S6, the speech recognition unit 23 performs speech recognition on the utterance audio data, that is, the part of the audio data sent from the client 1 that the user input by voice, and returns (sends) the recognition result to the client 1 via the communication unit 21. In this process, multiple recognition results with probability values (N-best solutions) generally arise in the following two respects.

First, when the utterance in the audio data is recognized as a sequence of hiragana (in the case of Japanese), multiple candidates with probability values arise if the audio is unclear, for example because noise is mixed in. Second, when appropriate kanji are assigned to the recognized hiragana to produce the final result, multiple candidates again arise because of homophones.

In this embodiment, the content identified by the content identification unit 22 and the related information (words, named entities, and so on) stored for that content in the related information DB 25 are used as the information source for inferring the user's intent when choosing among the multiple recognition result candidates with probability values.

FIG. 4 is a diagram showing an example of the related information DB 25. The illustrated related information DB 25 stores, in table form and for each broadcast program, a program ID, a program name, a broadcast date and time, a broadcast station name, and at least one item of related information in association with one another. The related information holds, for example, performer names, place names, and other keywords related to the program. The program name, broadcast date and time, and broadcast station name are also treated as part of the related information.

When there are multiple candidate recognition results, the speech recognition unit 23 reads the related information of the broadcast program identified by the content identification unit 22 from the related information DB 25 and uses it to select, from the candidates, the recognition result the user intended.
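One way to picture this selection is as a rescoring of the N-best list. The sketch below is an assumption for illustration only: the text says that the probability of content-related candidates is raised but does not give a formula, so the multiplicative `boost` per matching term, the substring test, and the name `choose_candidate` are all hypothetical.

```python
def choose_candidate(candidates, related_terms, boost=2.0):
    """Pick the text to present from (text, probability) candidates.

    Assumption: each related term found in a candidate multiplies that
    candidate's probability by `boost`; the highest adjusted value wins.
    """
    def adjusted(item):
        text, probability = item
        hits = sum(1 for term in related_terms if term in text)
        return probability * boost ** hits

    return max(candidates, key=adjusted)[0]
```

For instance, given the homophone pair from the background discussion, 「川内で人気の居酒屋」 with probability 0.55 and 「仙台で人気の居酒屋」 with probability 0.45 (both values invented for this example), related terms such as 「宮城」 and 「仙台」 would lift the second candidate above the first.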

FIG. 5 shows a specific example of the processing in which the speech recognition unit 23 selects the recognition result intended by the user from among multiple candidates.

In the illustrated example, the user inputs an operation instruction to the client 1 by saying "仙台で人気の居酒屋" ("popular izakaya in Sendai") to the client 1. The utterance speech data input to the client 1 is transmitted to the server 2 together with the background audio data. Here, it is assumed that the user hesitates when pronouncing "仙台" (Sendai), so that the word is input to the client 1 with unclear pronunciation.

The speech recognition unit 23 of the server 2 extracts the utterance speech data from the audio data transmitted from the client 1 (S21), performs speech recognition on the utterance speech data to convert it into text, and obtains the multiple hiragana recognition result candidates (text candidate data) shown in the figure (S22). That is, recognition result candidate 1, "せんだいでにんきのいざかや" (sendai de ninki no izakaya), and recognition result candidate 2, "せんないでにんきのいざかや" (sennai de ninki no izakaya), are generated as recognition results.

Meanwhile, in the illustrated example, it is assumed that the content specifying unit 22 specifies the program with program ID "200002" (ぶらり宮城旅) shown in FIG. 3 as the program being viewed. The speech recognition unit 23 then reads, from the related information DB 25 of FIG. 3, the related information stored in association with that program ID (for example, the program name, performer names, and place names) (S23).

Then, the speech recognition unit 23 assigns appropriate kanji to hiragana recognition result candidates 1 and 2 recognized in S22 (S24). In the illustrated example, three kanji recognition result candidates are generated: recognition result candidate 1, "仙台で人気の居酒屋" ("popular izakaya in Sendai"); recognition result candidate 2, "川内で人気の居酒屋" ("popular izakaya in 川内", a homophone of Sendai); and recognition result candidate 3, "船内で人気の居酒屋" ("popular izakaya on board the ship").
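The kanji assignment in S24 can be pictured with the toy Python sketch below, which expands each hiragana candidate into every homophone spelling. The dictionary `HOMOPHONES`, the fixed suffix constants, and the function `expand_candidates` are hypothetical and cover only this one example; practical kana-to-kanji conversion is considerably more involved.

```python
# Hypothetical homophone dictionary: reading -> possible kanji spellings.
HOMOPHONES = {
    "せんだい": ["仙台", "川内"],
    "せんない": ["船内"],
}

# The remainder of the utterance is assumed to convert unambiguously.
SUFFIX_KANA = "でにんきのいざかや"
SUFFIX_KANJI = "で人気の居酒屋"


def expand_candidates(kana_candidates):
    """Expand hiragana candidates into kanji candidates, one per homophone."""
    results = []
    for kana in kana_candidates:
        if not kana.endswith(SUFFIX_KANA):
            continue  # outside the scope of this toy example
        head = kana[: -len(SUFFIX_KANA)]
        for spelling in HOMOPHONES.get(head, []):
            results.append(spelling + SUFFIX_KANJI)
    return results


kanji_candidates = expand_candidates(
    ["せんだいでにんきのいざかや", "せんないでにんきのいざかや"]
)
```

With the two hiragana candidates of the example, the sketch yields the three kanji candidates listed in the text, in the same order.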

Then, from among the candidate words "仙台", "川内", and "船内", the speech recognition unit 23 finds that "仙台" matches (or is related to) one of the pieces of related information read from the related information DB 25, and determines that recognition result candidate 1, "仙台で人気の居酒屋", which was converted using "仙台", is the recognition result intended by the user. The speech recognition unit 23 then transmits the determined text data "仙台で人気の居酒屋" to the client 1 via the communication unit 21.
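A minimal Python sketch of this selection step follows. It assumes that "matches" means simple substring containment; the "or is related to" case is not modeled. The function name `select_candidate` and the keyword set for the program are hypothetical.

```python
def select_candidate(candidates, related_information):
    """Return the first candidate containing any related-information term, else None."""
    for text in candidates:
        if any(term in text for term in related_information):
            return text
    return None


candidates = ["仙台で人気の居酒屋", "川内で人気の居酒屋", "船内で人気の居酒屋"]
# Assumed related information for the program being viewed.
related_information = {"ぶらり宮城旅", "仙台", "宮城"}

result = select_candidate(candidates, related_information)
print(result)  # 仙台で人気の居酒屋
```

Returning `None` when nothing matches leaves room for a fallback, such as choosing by probability value alone.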

In the above specific example, the speech recognition unit 23 was described as selecting the recognition result candidate that matches one of the pieces of related information read from the related information DB 25, but the recognition result to be presented to the user can also be determined using the probability values. For example, suppose that the probability value of recognition result candidate 1 is α, that of recognition result candidate 2 is β, and that of recognition result candidate 3 is γ. The probability value α of recognition result candidate 1, which matches (or is related to) one of the pieces of related information read from the related information DB 25, is adjusted so that it becomes larger. For example, using a predetermined coefficient n, the probability value of recognition result candidate 1 may be set to α × n or α + n. The speech recognition unit 23 may then sort the recognition result candidates in descending order of probability value and present them to the user. Alternatively, the speech recognition unit 23 may determine the candidate with the largest probability value as the recognition result to be presented to the user.
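The adjustment described above might be sketched in Python as follows. The function name `rescore`, the `mode` parameter, the value of n, and the numeric probability values are all assumptions chosen for illustration; the text only states that α may become α × n or α + n.

```python
def rescore(candidates, related_information, n=2.0, mode="multiply"):
    """Boost candidates that contain a related term, then sort by descending score.

    candidates: list of (text, probability) pairs.
    mode: "multiply" applies alpha * n, anything else applies alpha + n.
    """
    rescored = []
    for text, prob in candidates:
        if any(term in text for term in related_information):
            prob = prob * n if mode == "multiply" else prob + n
        rescored.append((text, prob))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("仙台で人気の居酒屋", 0.30),  # alpha (value invented for illustration)
    ("川内で人気の居酒屋", 0.25),  # beta
    ("船内で人気の居酒屋", 0.45),  # gamma
]

ranked = rescore(candidates, {"仙台"}, n=2.0)
best_text, best_score = ranked[0]
```

Note that the adjusted values are scores rather than normalized probabilities, which is sufficient for sorting or for picking the maximum. The choice of n matters: in this sketch an additive boost of 0.1 would leave candidate 3 ranked first.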

<Modification>
In the above embodiment, a method of improving speech recognition accuracy by taking into account the content being used by the user was described, focusing on the case where the content is a broadcast program. A modification is also conceivable in which this speech recognition system is applied to distribute information relevant to the user.

FIG. 6 is a diagram illustrating an example of the information distribution system according to the modification.

The information distribution system shown in FIG. 6 is a system that applies the server 2 (speech recognition apparatus) shown in FIG. 1. Specifically, although FIG. 1 shows a single server 2, in actual system construction and operation it can be more efficient to divide the servers by coherent functional unit than to consolidate all functions in one server. For this reason, in the modification shown in FIG. 6, the information distribution apparatus is assumed to be composed of three servers 2A, 2B, and 2C.

Here, the differences from the server 2 of FIG. 1 are the information presentation unit 26 and the information DB 27 of the server 2A. The information presentation unit 26 presents and distributes information useful to the user by using the content specified by the content specifying unit 22, the speech recognition result from the speech recognition unit 23, and the user's position information transmitted from the client 1. The information DB 27 stores the various kinds of information to be provided to the user.

That is, compare this with the case where an information DB storing various kinds of information is searched by ordinary speech input and the results are presented to the user. In the information distribution system of the modification, specifying the content that the user is viewing yields a speech recognition result in which the homophone problem is resolved, so that information better reflecting the user's intention can be presented. In addition, by using the user's position information, the information to be presented can be edited or its priority adjusted, which further improves the accuracy and quality of the presented information.

As an example, if the information distribution system shown in FIG. 6 mainly handles information on accommodation facilities such as hotels and inns, the information DB 27 naturally contains accommodation facility information. Suppose, as in the example of FIG. 5 described above, that the broadcast program being viewed is a travel program about Miyagi Prefecture and that the speech recognition result is "宮城県の旅館" ("ryokan in Miyagi Prefecture"). The information presentation unit 26 then edits the results retrieved from the information DB 27 using the user's position information. For example, if the user's position is in Miyagi Prefecture, the information presentation unit 26 judges that it is more effective to preferentially present accommodation facilities that offer day-trip plans than to present overnight plans of the accommodation facilities in Miyagi Prefecture retrieved from the information DB 27, and sets a high priority for information on facilities with day-trip plans. Conversely, if the user's position is in Osaka, the information presentation unit 26 may judge that presenting only overnight plans is more appropriate and remove day-trip plans from the search results.
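The editing of search results by user position could look roughly like the Python sketch below. The `Plan` class, its fields, the labels "day_trip" and "overnight", the function `edit_results`, and the sample facilities are all hypothetical; comparing prefecture names directly is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    facility: str
    prefecture: str
    kind: str  # "day_trip" or "overnight" (labels assumed for this sketch)


def edit_results(plans, user_prefecture, target_prefecture):
    """Edit search results for target_prefecture using the user's position.

    A user already in the target prefecture sees day-trip plans first;
    a user elsewhere sees overnight plans only.
    """
    hits = [p for p in plans if p.prefecture == target_prefecture]
    if user_prefecture == target_prefecture:
        # sorted() is stable, so the original order is kept within each group.
        return sorted(hits, key=lambda p: p.kind != "day_trip")
    return [p for p in hits if p.kind == "overnight"]


# Sample data invented for illustration.
plans = [
    Plan("Ryokan A", "宮城県", "overnight"),
    Plan("Ryokan B", "宮城県", "day_trip"),
    Plan("Hotel C", "大阪府", "overnight"),
]

local = edit_results(plans, user_prefecture="宮城県", target_prefecture="宮城県")
remote = edit_results(plans, user_prefecture="大阪府", target_prefecture="宮城県")
```

Here `local` lists Ryokan B before Ryokan A, while `remote` contains only Ryokan A.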

In the present embodiment described above, the content being used by the user is specified using the background audio data that accompanies the speech input by the user, and the input speech is recognized and converted into text data in consideration of the related information associated with the specified content. As a result, the present embodiment can improve the accuracy of speech recognition and provide the recognition result that the user intended.

Further, in the present embodiment, the position information of the client 1 is used when specifying the content in use from the background audio in the audio data. This makes content identification both more accurate and more efficient, and in turn further improves the accuracy of speech recognition.
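One way to picture the use of position information is to narrow the candidate programs by region before comparing audio features, as in the Python sketch below. The source does not specify the feature representation or the matching method; the feature vectors, the use of cosine similarity, the region labels, the function names, and the program IDs "200001" and "300001" are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_program(background_feature, feature_db, user_region):
    """Narrow programs to the user's region, then return the best-matching program ID."""
    narrowed = [e for e in feature_db if user_region in e["regions"]]
    if not narrowed:
        return None
    best = max(narrowed, key=lambda e: cosine(background_feature, e["feature"]))
    return best["program_id"]


# Hypothetical audio feature DB; vectors and regions are invented.
FEATURE_DB = [
    {"program_id": "200001", "regions": {"宮城県"}, "feature": [0.9, 0.1, 0.0]},
    {"program_id": "200002", "regions": {"宮城県"}, "feature": [0.1, 0.8, 0.3]},
    {"program_id": "300001", "regions": {"大阪府"}, "feature": [0.1, 0.8, 0.3]},
]

program_id = identify_program([0.1, 0.7, 0.4], FEATURE_DB, "宮城県")
```

In this sketch, programs "200002" and "300001" have identical features, so only the region filter distinguishes them; the filter also reduces the number of comparisons performed.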

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims. For example, although the above embodiment was described using a broadcast program as the background audio, what can be used as background audio is not limited to the audio data of broadcast programs. For example, when a user is browsing a web page about music while listening to a piece of music, the piece being listened to can be identified and, taking the user's position information into account, information on nearby events by the performer of that piece can be presented if such information exists.

DESCRIPTION OF SYMBOLS
1: Client
11: Speech acquisition unit
12: Position acquisition unit
13: Communication unit
14: Result display unit
2: Server
21: Communication unit
22: Content specifying unit
23: Speech recognition unit
24: Audio feature DB
25: Related information DB
26: Information construction unit

Claims

A speech recognition apparatus comprising:
a specifying unit that separates audio data into utterance speech data uttered by a user and background audio data by identifying the distance to the sound source using phase difference information, extracts audio feature information from the separated background audio data, and specifies the content being used by the user using an audio feature storage unit in which audio feature information of a plurality of contents is stored; and
a speech recognition unit that converts the utterance speech data into at least one piece of text candidate data by speech recognition, and determines, from the text candidate data, the text data to be presented to the user using related information about the specified content.

The speech recognition apparatus according to claim 1,
wherein the content is a broadcast program.

The speech recognition apparatus according to claim 2,
wherein the specifying unit acquires position information of the user, narrows down, from the contents stored in the audio feature storage unit, the broadcast programs corresponding to the position information, and specifies the broadcast program being viewed by the user by collating the feature information of each narrowed-down broadcast program with the feature information of the background audio data.

A speech recognition method performed by a speech recognition apparatus, the method comprising:
a separation step of separating audio data into utterance speech data uttered by a user and background audio data by identifying the distance to the sound source using phase difference information;
a specifying step of extracting audio feature information from the separated background audio data and specifying the content being used by the user using an audio feature storage unit in which audio feature information of a plurality of contents is stored;
a conversion step of converting the utterance speech data into at least one piece of text candidate data by speech recognition; and
a determination step of determining, from the text candidate data, the text data to be presented to the user using related information about the specified content.

A speech recognition program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 3.