JP2010175708A - System and method for retrieval of speech recognition

Info

Publication number: JP2010175708A
Application number: JP2009016641A
Authority: JP
Inventors: Kazunari Ouchi; 一成大内; Miwako Doi; 美和子土井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-01-28
Filing date: 2009-01-28
Publication date: 2010-08-12

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition retrieval system that keeps the vocabulary size within a range that guarantees recognition accuracy adequate for practical use, while improving the coverage ratio for utterances.

SOLUTION: The speech recognition retrieval system includes an EPG DB 29 for accumulating program information data on broadcast programs, a speech input section 11, a speech recognition section 21 for recognizing input speech and converting it into text, a speech recognition dictionary 22 referred to by the speech recognition section 21, a retrieval section 28 for searching a retrieval range in the EPG DB 29 using the text recognized by the speech recognition section 21 as a retrieval keyword, a display section 27 for displaying the speech recognition result, the retrieval result and the like, an operation section 12 for selecting among the displayed speech recognition candidates and switching the display, an utterance habit extracting section 23 for extracting the user's utterance habit from the text recognized by the speech recognition section 21, an utterance habit DB 24 for accumulating the utterance habit as data, and a speech recognition dictionary creating section 25 for creating the speech recognition dictionary 22 based on the data of the retrieval range and the contents of the utterance habit DB 24.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech recognition search system and a speech recognition search method for searching a target search range for desired information by speech recognition input.

Efforts are being made to search for and operate on desired information by speech recognition input in situations where manual operation is not possible, such as with car navigation systems. In isolated-word speech recognition, in which the beginning and the end of speech are each detected only once within a given period, vocabulary size and recognition rate are in a trade-off relationship. One way to secure recognition accuracy is therefore to switch the speech recognition dictionary appropriately according to an attribute of the input speech (for example, genre). For example, a method has been proposed in which the input attribute is specified first and speech is input after the appropriate speech recognition dictionary has been selected (see, for example, Patent Document 1).

A method has also been proposed in which speech recognition is performed over the entire vocabulary and, when there are many speech search key candidates, a question related to determining the search key is presented, the user is asked to utter related information, and the speech search key candidate is identified from the search key recognition likelihood and the related information recognition likelihood (see, for example, Patent Document 2).

However, in applications where manual operation is possible (for example, reserving television programs (content)), when speech recognition input is used to reduce the burden of operating a remote control device or the like, overall usability is considered to improve when speech input is combined appropriately with key operations rather than used for all input. From this viewpoint, an approach to reserving programs by speech recognition using an EPG (Electronic Program Guide) has been proposed (see, for example, Patent Document 3).

Patent Document 1: JP 2007-264198 A
Patent Document 2: Japanese Patent No. 3420965
Patent Document 3: JP 2000-316128 A

When speech recognition input is used in an application that can also be operated by hand, the conventional techniques use a fixed speech recognition dictionary prepared in advance. However, when searching information that changes daily, such as program information or information on the Internet, the dictionary must be updated according to the contents of the search target range in order to maintain recognition accuracy. In addition, vocabulary size and recognition accuracy trade off against each other: increasing the vocabulary to raise the coverage of utterances lowers recognition accuracy, while reducing the vocabulary improves recognition accuracy but lowers the coverage of utterances.

The present invention has been made in view of the above problems. Its object is to provide a speech recognition search system and a speech recognition search method that keep the vocabulary size within a range that ensures recognition accuracy adequate for practical use, and that improve the coverage of utterances within that range.

According to one aspect of the present invention, there is provided a speech recognition search system comprising: a content information database that stores content information data; a voice input unit for inputting speech; a speech recognition unit that recognizes speech input from the voice input unit and converts it into text; a speech recognition dictionary that the speech recognition unit refers to for recognition processing; a search unit that searches a search target range in the content information database using the text recognized by the speech recognition unit as a search keyword; a display unit that displays search results; an operation unit for selecting among the speech recognition candidates displayed on the display unit; an utterance tendency extraction unit that extracts the user's utterance tendency from the text recognized by the speech recognition unit; an utterance tendency database that stores the utterance tendency extracted by the utterance tendency extraction unit; and a speech recognition dictionary generation unit that generates the speech recognition dictionary based on the data of the search target range and the contents of the utterance tendency database.

According to another aspect of the present invention, there is provided a speech recognition search method for a speech recognition search system that recognizes a speech search key, input by a user's voice, to be used for searching, the method comprising: inputting a search keyword by speech; performing speech recognition on the input speech signal to convert it into text; presenting a plurality of speech recognition candidates in descending order of likelihood; searching a content information database using a character string selected from the presented candidates as the search keyword; selecting content from a content candidate list; and, when the content is selected, extracting an utterance tendency by comparing the official name of the selected content with what was uttered, extracting the features of the utterance, and storing them in an utterance tendency database.

According to the present invention, the vocabulary size can be kept within a range that ensures recognition accuracy adequate for practical use, and the coverage of utterances can be improved within that range.

FIG. 1 is a block diagram showing an example of a speech recognition search system according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an implementation example of the remote control device of the speech recognition search system according to the embodiment of the present invention.
FIG. 3 is a block diagram showing an example in which a wireless communication function is installed in the speech recognition search system according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the processing operation of the speech recognition search system according to the embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of presenting speech recognition candidates.
FIG. 6 is a diagram illustrating a display example of search results.
FIG. 7 is a diagram illustrating an example of assigning readings to a program name.
FIG. 8 is a diagram illustrating an example of EPG data.
FIG. 9 is a flowchart showing the processing operation when the speech recognition dictionary generation unit generates the speech recognition dictionary.

Embodiments of the present invention are described below with reference to the drawings. In the drawings, identical parts are denoted by identical reference numerals, and redundant description is omitted.

FIG. 1 is a block diagram showing a configuration example of a speech recognition search system 100 according to an embodiment of the present invention, and FIG. 2 is a schematic diagram showing an implementation example of a remote control device 10.

As shown in FIG. 1, the speech recognition search system 100 includes a remote control device 10 and a recording device 20. As shown in FIG. 2, the remote control device 10 includes a voice input unit 11 and an operation unit 12. The voice input unit 11 is, for example, a microphone, and can be built into the remote control device 10 at any position or attached externally. FIG. 2 shows an example with a built-in voice input unit 11. The operation unit 12 receives the operator's operations. In the example shown in FIG. 2, a cross key, used for example to move among the recognized speech candidates, and two push buttons are provided at arbitrary positions on the remote control device 10. The configuration of the operation unit 12 is not limited to this; it may also be configured so that a pointer can be operated with a pointing device.

The recording device 20 includes a speech recognition unit 21, a speech recognition dictionary 22, an utterance tendency extraction unit 23, an utterance tendency database (hereinafter, utterance tendency DB) 24, a speech recognition dictionary generation unit 25, a speech recognition candidate display unit 26, a display unit 27, a search unit 28, and an EPG database (hereinafter, EPG DB) 29, as well as a recording/control unit (not shown). The recording/control unit is a device with a recording function, such as a video hard disk recorder, a television with a recording function, or a personal computer. When the recording/control unit is a personal computer with a recording function, the voice input unit 11 may be connected to the personal computer, and an input device of the personal computer such as a mouse may serve as the operation unit 12.

The speech recognition unit 21 recognizes the speech acquired through the voice input unit 11 of the remote control device 10. The speech recognition unit 21 is equipped with, for example, a speech recognition engine that includes morphological analysis and the like.

The speech recognition dictionary 22 stores a large amount of data for speech recognition and accepts access from the speech recognition unit 21. The speech recognition dictionary 22 can be built on, for example, an HDD or an EEPROM.

The utterance tendency extraction unit 23 extracts the utterance tendency (described later) of the speaker whose speech is captured through the voice input unit 11. The utterance tendency extraction unit 23 can be implemented, for example, in software.

The utterance tendency DB 24 stores the utterance tendency data extracted by the utterance tendency extraction unit 23. The utterance tendency DB 24 can be built on, for example, a volatile memory.

The speech recognition dictionary generation unit 25 reads data from the utterance tendency DB 24 and the EPG DB 29 and generates a speech recognition dictionary. The generated dictionary is sent to the speech recognition dictionary 22.

The speech recognition candidate display unit 26 lists the speech recognition candidates, ordered by likelihood, that result from recognition by the speech recognition unit 21. The speech recognition candidate display unit 26 can be built on, for example, a volatile memory.

The display unit 27 handles the various displays used during speech recognition and search. A display such as an LCD is suitable as the display unit 27.

The search unit 28 searches the EPG DB 29 using the character string selected from the speech recognition candidates as the search keyword. The search unit 28 can be implemented, for example, in software.

The EPG DB 29 stores EPG (Electronic Program Guide) data and can be built on, for example, an HDD or an EEPROM. An EPG is a system that provides the television program guide as digital data so that it can be displayed on screen and recordings can be scheduled easily. EPG data is delivered by several means: "ADAMS-EPG", carried on terrestrial broadcasts; the "EPG" of the ARIB (Association of Radio Industries and Businesses) standard, carried on BS digital broadcasts; and "ADAMS-EPG+" and "iEPG", provided over the Internet. FIG. 8 shows an example of EPG data. As shown in FIG. 8, the program name is given in <TITLE> and performer information in <TEXT>, in an XML (Extensible Markup Language) based format.

As shown in FIG. 1, the remote control device 10 and the recording device 20 can be connected by wire, but the connection is not limited to this. As shown in FIG. 3, the remote control device 10 may be provided with a communication unit 13 and the recording device 20 with a communication unit 30 so that the two can communicate wirelessly. Apart from the parts related to wireless communication, the same configuration applies, so the following description follows the configuration shown in FIG. 1.

Next, the speech recognition and search processing of the speech recognition search system configured as described above is explained with reference to the flowchart of FIG. 4.

First, the speech recognition search system waits for a speech recognition start instruction from the user (step S10). The user gives the instruction by pressing the button on the operation unit 12 of the remote control device 10 that is assigned to the speech recognition start function. Alternatively, the user may use the operation unit 12 to press an on-screen button shown on the display unit 27 of the recording device 20. The system keeps waiting until the instruction is given (step S11). Once it has been given, the user speaks the desired search keyword into the voice input unit 11 (step S12). After recognition has started, the speech recognition unit 21 detects the silent interval that follows the speech input and ends recognition automatically (step S13). As another way of determining when recognition ends, recognition may instead be performed only while the button is held down.

Next, the speech recognition unit 21 performs speech recognition on the speech signal input to the voice input unit 11 using the speech recognition dictionary 22, and presents a plurality of speech recognition candidates in descending order of likelihood (step S14).

FIG. 5 is a diagram illustrating an example of presenting speech recognition candidates. In the example shown in FIG. 5, three candidates are listed as the result of recognizing the input speech "えーびーしーこうこうこうざ" (ee-bii-shii koukou kouza). The candidates are presented on the speech recognition candidate display unit 26. The list is easier for the user to work with when it is kept as short as possible, so the system can be set, for example, to list only candidates whose speech recognition score is 80 points or higher. For each candidate, the part in parentheses is its reading and the part before the parentheses is the character string itself. Displaying both the candidate and its reading lets the user see why each candidate was listed, which makes the list easier to understand. The number and order of the candidates presented depend chiefly on factors such as the recognition performance of the speech recognition engine.

If the presented candidates include the desired character string, for example a program name (or a partial word of one) or a performer name, the user operates the operation unit 12 to select it. If the desired candidate is not among them, the user performs speech recognition input again (step S15).

When the desired character string has been selected, the search unit 28 searches the EPG DB 29 using that character string as the first search keyword (step S16). If the character string is a performer name, the search targets the <TEXT> tag that follows <ITEM>出演者</ITEM> (performers) in the EPG data, an example of which is shown in FIG. 8. If the character string is a program name (or part of one), the search targets the <TITLE> tag. The broadcast date and time, channel, program name and so on are extracted from the EPG data of each matching program and listed on the display unit 27 (step S17).
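The tag-dependent search of step S16 can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiment: the `<EPG>` and `<PROGRAM>` wrapper elements, the sample titles and names, and the function names `performer_text` and `search_epg` are all assumptions, since the text only states that the program name is in `<TITLE>` and that performer information is in the `<TEXT>` element following `<ITEM>出演者</ITEM>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical EPG fragment; the actual layout of FIG. 8 may differ.
SAMPLE_EPG = """
<EPG>
  <PROGRAM>
    <TITLE>ABC High School Course: Integrated Science</TITLE>
    <ITEM>出演者</ITEM>
    <TEXT>Taro Yamada</TEXT>
  </PROGRAM>
  <PROGRAM>
    <TITLE>Evening News</TITLE>
    <ITEM>出演者</ITEM>
    <TEXT>Hanako Suzuki</TEXT>
  </PROGRAM>
</EPG>
"""


def performer_text(program):
    """Return the text of the <TEXT> element directly after <ITEM>出演者</ITEM>."""
    children = list(program)
    for i, child in enumerate(children[:-1]):
        if child.tag == "ITEM" and (child.text or "").strip() == "出演者":
            following = children[i + 1]
            if following.tag == "TEXT":
                return following.text or ""
    return ""


def search_epg(xml_text, keyword, kind):
    """Return titles of programs whose title or performer text contains keyword.

    kind is "title" for a program-name keyword, anything else for a performer name.
    """
    root = ET.fromstring(xml_text.strip())
    hits = []
    for program in root.iter("PROGRAM"):
        title = program.findtext("TITLE", default="")
        haystack = title if kind == "title" else performer_text(program)
        if keyword in haystack:
            hits.append(title)
    return hits
```

Calling `search_epg(SAMPLE_EPG, "Suzuki", "performer")` searches only the performer text, while `search_epg(SAMPLE_EPG, "High School", "title")` searches only the titles, which mirrors the identifier-based choice of tag described above.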

FIG. 6 shows a display example of the search results. In this example, the user has selected "ABC高校講座" (ABC High School Course) from the character strings presented in the speech recognition candidate list of FIG. 5. The EPG DB 29 is searched accordingly, and a list of ten programs matching "ABC高校講座" is displayed as program candidates.

As described above, the speech recognition candidates shown in FIG. 5 can be limited, for example, to candidates whose speech recognition score is 80 points or higher. If the likelihood of the first candidate is clearly higher than that of the second and subsequent candidates, however, the search may be performed directly, so that the program candidates shown in FIG. 6 are listed without first presenting the speech recognition candidate list of FIG. 5.
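The candidate handling described above can be sketched as a small decision function. The 80-point threshold comes from the text; the `margin` value that decides when the first candidate is "clearly higher", the mode names, and the function name `select_candidates` are assumptions made only for illustration.

```python
def select_candidates(scored, min_score=80, margin=15):
    """Decide how to proceed after recognition.

    scored: list of (text, score) pairs from the recognizer.
    Returns (mode, candidates) where mode is:
      "retry"  - nothing reached min_score, ask the user to speak again
      "search" - the top candidate clearly leads, search with it directly
      "choose" - show the shortlist and let the user pick
    The margin of 15 points is a hypothetical value, not from the source.
    """
    ranked = sorted(scored, key=lambda c: c[1], reverse=True)
    shortlist = [c for c in ranked if c[1] >= min_score]
    if not shortlist:
        return ("retry", [])
    top = shortlist[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if top[1] - runner_up_score >= margin:
        return ("search", [top])
    return ("choose", shortlist)
```

With two close candidates above the threshold the function returns the `"choose"` mode and the user picks from the list; with a single dominant candidate it returns `"search"` and the candidate list is skipped.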

From the search results, the user selects the desired program in the program candidate list shown in FIG. 6 (step S18). When a program is selected, the details of its EPG data, such as those in FIG. 8, can be viewed (step S19).

In addition, after the desired program has been selected, the utterance tendency is extracted. As described later, when the speech recognition dictionary 22 is generated, readings are created not only for the official program name but also for partial program names obtained by dividing it according to predetermined rules. In the example of FIG. 7, such rules include dividing before and after the parentheses in the program name, dividing at spaces, dividing the text inside the parentheses, and dividing before and after the particle "の" (no). Dividing indiscriminately is not appropriate, however, because it increases the number of search keywords more than necessary.
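The division rules above can be illustrated with a short sketch. Several points are assumptions: the example title is a hypothetical reconstruction (FIG. 7 is not reproduced in the text), the function name `partial_names` is invented, and splitting on every "の" character is a simplification, since the embodiment identifies the particle by morphological analysis rather than by plain string matching.

```python
import re

# Delimiters: whitespace (including the full-width space), round brackets,
# Japanese quotation brackets, and the character "の" as a stand-in for the particle.
_DELIMITERS = re.compile(r"[\s（）()「」『』の]+")


def partial_names(title):
    """Return the official title followed by its distinct partial names."""
    entries = [title]
    for part in _DELIMITERS.split(title):
        if part and part not in entries:
            entries.append(part)
    return entries


# Hypothetical program title used only for illustration.
EXAMPLE_TITLE = "ABC高校講座 理科総合「落下物体の物理」"
EXAMPLE_ENTRIES = partial_names(EXAMPLE_TITLE)
```

A real dictionary generator would also need to attach a reading to each entry and would apply further limits so that the vocabulary does not grow unnecessarily, as the text cautions.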

The utterance tendency extraction unit 23 compares the official name of the selected program with what the user actually uttered, extracts the features of the utterance using, for example, either or both of the two approaches described below (step S20), and stores them in the utterance tendency DB 24 (step S21).

<Utterance feature extraction based on utterance length>
FIG. 7 is a diagram illustrating an example of assigning readings to a program name. In the example of FIG. 7, the reading of the official program name is "えーびーしーこうこうこうざりかそうごうらっかぶったいのぶつり" (ee-bii-shii koukou kouza rika sougou rakka buttai no butsuri), which is 30 kana characters long. The length of the reading that was actually uttered is evaluated against this. An utterance of "えーびーしーこうこうこうざりかそうごう" is 19 characters, "らっかぶったいのぶつり" is 11 characters, and "ぶつり" is 3 characters. Each uttered length is normalized by the length of the official reading, and the resulting values 19/30, 11/30 and 3/30 are stored in the utterance tendency DB 24. Alternatively, thresholds relative to the length of the official reading may be set in advance, for example less than 1/3, at least 1/3 but less than 2/3, and at least 2/3, and the number of utterances falling in each range may be counted.
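The length-based feature can be worked through in a few lines. The kana strings and the 1/3 and 2/3 thresholds come from the text; the function names and the bucket labels "short", "medium" and "long" are assumptions introduced for this sketch.

```python
import unicodedata
from collections import Counter
from fractions import Fraction


def length_ratio(uttered, official):
    """Uttered reading length normalized by the official reading length."""
    u = unicodedata.normalize("NFC", uttered)   # count each kana as one character
    o = unicodedata.normalize("NFC", official)
    return Fraction(len(u), len(o))


def length_bucket(ratio):
    """Map a ratio onto the three ranges named in the text (labels are hypothetical)."""
    if ratio < Fraction(1, 3):
        return "short"
    if ratio < Fraction(2, 3):
        return "medium"
    return "long"


OFFICIAL = "えーびーしーこうこうこうざりかそうごうらっかぶったいのぶつり"
UTTERANCES = [
    "えーびーしーこうこうこうざりかそうごう",
    "らっかぶったいのぶつり",
    "ぶつり",
]

ratios = [length_ratio(u, OFFICIAL) for u in UTTERANCES]
histogram = Counter(length_bucket(r) for r in ratios)
```

Here `ratios` holds 19/30, 11/30 and 3/30 (the last reduced to 1/10 by `Fraction`), and `histogram` is the per-range count that the alternative scheme in the text would store instead of the raw ratios.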

<Utterance feature extraction based on utterance position>
In the example of FIG. 7, the reading of the official program name is "えーびーしーこうこうこうざりかそうごうらっかぶったいのぶつり", which is 30 kana characters long. The position of the reading that was actually uttered, that is, which part of the official program name was spoken, is evaluated against this. An utterance that starts at the beginning of the official program name, such as "えーびーしーこうこうこうざ", is classified as "first half". An utterance that neither starts at the beginning nor runs to the end, such as "りかそうごう", is classified as "middle". An utterance that runs to the end of the official program name, such as "ぶつり", is classified as "second half". These classifications are stored in the utterance tendency DB 24.
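The position-based feature can be sketched in the same way. The three classes follow the text; the function name `utterance_position`, the English labels, and the choice to use the first occurrence of the utterance in the official reading are assumptions of this sketch.

```python
def utterance_position(uttered, official):
    """Classify where an uttered reading sits within the official reading.

    Returns "first half" if it starts at the beginning, "second half" if it
    runs to the end, "middle" otherwise, and None if it is not a substring.
    Only the first occurrence of the utterance is considered.
    """
    start = official.find(uttered)
    if start < 0:
        return None
    if start == 0:
        return "first half"
    if start + len(uttered) == len(official):
        return "second half"
    return "middle"


OFFICIAL_READING = "えーびーしーこうこうこうざりかそうごうらっかぶったいのぶつり"
POSITIONS = {
    u: utterance_position(u, OFFICIAL_READING)
    for u in ("えーびーしーこうこうこうざ", "りかそうごう", "ぶつり")
}
```

An utterance of the whole official name starts at the beginning and is therefore classified as "first half" here; the text does not say how that case should be treated, so this is a design choice of the sketch.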

Next, the method of generating the speech recognition dictionary 22 is described. EPG data distributed by terrestrial digital broadcasting and the like is transmitted together with the broadcast signal and stored in the EPG DB 29 by a well-known method. The speech recognition dictionary generation unit 25 analyzes the EPG data in the EPG DB 29, for example once a day, and generates the speech recognition dictionary 22 according to the utterance tendency data in the utterance tendency DB 24. An example of the generation method is described below with reference to FIG. 9, a flowchart showing the processing performed when the speech recognition dictionary generation unit 25 generates the speech recognition dictionary 22.

FIG. 8 shows an example of the EPG data for one program. The data in this example is in XML format, but data in other formats, such as iEPG, may also be used. The EPG data is accumulated in the EPG DB 29. For XML data the EPG DB 29 is preferably built as an XML database, but another kind of database (such as an RDB) may be used. From the EPG data stored in the EPG DB 29, the program name enclosed in the <TITLE> tag and the performer names enclosed in the <TEXT> tag that follows <ITEM>出演者</ITEM> (performers) are extracted (step S30).

Next, because some program names are quite long or include a subtitle, the character string is divided using cues such as spaces and parentheses in the program name and particles extracted by morphological analysis (for example, "の"). At this point, the past utterance tendency stored in the utterance tendency DB 24 is read out and reflected in how the program name is divided, which changes the readings that are generated and registered in the speech recognition dictionary 22 (steps S31 and S32). For example, if the user's utterances tend to be short, short readings derived from the official program name are registered in the dictionary preferentially; conversely, if the utterances tend to be long, long readings are registered preferentially. Likewise, if the user tends to utter the first half of the official program name, readings of the first half are registered preferentially; if the user tends to utter the middle or the second half, readings of the middle or the second half are registered preferentially.

In this way, multiple readings are assigned to the same program name in order to improve the coverage of utterances. Adding partial words indiscriminately would inflate the vocabulary stored in the speech recognition dictionary 22. By preferentially storing the readings that are most likely to be uttered, judged from the user's past utterance tendency, the coverage can be improved while recognition accuracy is maintained.
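One way to picture the tendency-based prioritization is a ranking of partial readings under a fixed budget. This is only a sketch under stated assumptions: the source does not specify how the preference is computed, so the per-position counts, the `max_partials` budget, and the function name `choose_readings` are all hypothetical. The partial readings are assumed to be substrings of the official reading.

```python
from collections import Counter


def choose_readings(official, partials, position_counts, max_partials=2):
    """Keep the official reading plus the partial readings the user favors.

    position_counts: how often past utterances fell in each position class
    ("first half", "middle", "second half"), e.g. read from the tendency DB.
    max_partials: hypothetical cap that keeps the vocabulary small.
    """
    def position(partial):
        start = official.find(partial)
        if start == 0:
            return "first half"
        if start + len(partial) == len(official):
            return "second half"
        return "middle"

    # sorted() is stable, so partials with equal counts keep their input order.
    ranked = sorted(
        partials,
        key=lambda p: position_counts.get(position(p), 0),
        reverse=True,
    )
    return [official] + ranked[:max_partials]


OFFICIAL_NAME_READING = "えーびーしーこうこうこうざりかそうごうらっかぶったいのぶつり"
PARTIAL_READINGS = [
    "えーびーしーこうこうこうざ",
    "りかそうごう",
    "らっかぶったいのぶつり",
    "ぶつり",
]
# Hypothetical history: this user mostly utters the end of program names.
HISTORY = Counter({"second half": 5, "first half": 1})
SELECTED = choose_readings(OFFICIAL_NAME_READING, PARTIAL_READINGS, HISTORY)
```

For this history the two readings that run to the end of the official name are registered ahead of the others, so the dictionary grows by only two entries per program while still covering the forms this user is most likely to say. A length-based preference could be combined with this by adding the length ratio to the sort key.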

Readings may be assigned using a general-purpose kana-kanji conversion dictionary, or alternative readings such as nicknames may be generated from information on the Internet; the means is not restricted (step S33). It is also convenient for searching if each reading carries an identifier indicating whether it relates to a program name or to a performer name.

The above update processing of the speech recognition dictionary 22 is preferably performed periodically, for example once a day, late at night.

According to this embodiment, when information that changes daily, such as program information or information on the Internet, is searched by speech recognition input, a speech recognition dictionary suited to the search target range is generated dynamically, and the generated vocabulary can be changed appropriately according to the user's utterance tendency. The vocabulary size can therefore be kept within a range that ensures recognition accuracy adequate for practical use, while the coverage of utterances within that range is improved.

Note that the present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

DESCRIPTION OF SYMBOLS 100 ... speech recognition search system, 10 ... remote control device, 11 ... voice input unit, 12 ... operation unit, 20 ... recording device, 21 ... speech recognition unit, 22 ... speech recognition dictionary, 23 ... utterance tendency extraction unit, 24 ... utterance tendency DB, 25 ... speech recognition dictionary generation unit, 26 ... speech recognition candidate display unit, 27 ... display unit, 28 ... search unit, 29 ... EPG DB.

Claims

A content information database for storing content information data;
A voice input unit for inputting voice;
A speech recognition unit that recognizes the voice input from the voice input unit and converts it into text;
A speech recognition dictionary referred to by the speech recognition unit for recognition processing;
A search unit for searching a search target range in the content information database using the text recognized by the speech recognition unit as a search keyword;
A display unit for displaying the search results;
An operation unit for selecting a speech recognition candidate displayed on the display unit;
An utterance tendency extraction unit that extracts a user's utterance tendency from the text recognized by the speech recognition unit;
An utterance tendency database for accumulating the utterance tendency extracted by the utterance tendency extraction unit; and
A speech recognition dictionary generation unit that generates the speech recognition dictionary based on the data of the search target range and the contents of the utterance tendency database; a speech recognition search system comprising the above.

The utterance tendency extraction unit comprises utterance length extraction means for extracting the utterance length, represented by the number of characters of the recognized text,
The utterance length is accumulated in the utterance tendency database, and
The speech recognition dictionary generation unit changes the speech recognition dictionary to be generated based on the tendency of the utterance length; the speech recognition search system according to claim 1.

The utterance tendency extraction unit comprises partial word position extraction means for extracting, from among a plurality of partial words for the same keyword, a feature of the position of the partial word relative to the formal content name,
The feature of the partial word position is accumulated in the utterance tendency database, and
The speech recognition dictionary generation unit changes the speech recognition dictionary to be generated based on the tendency of the partial word position; the speech recognition search system according to claim 1.

The speech recognition search system according to claim 2, wherein the tendency of the utterance length is determined by normalizing the utterance length with a reading length of a formal content name.

The speech recognition search system according to claim 3, wherein the feature of the partial word position is determined by whether the utterance starts from the beginning of the formal content name, covers a middle portion of the formal content name, or runs to the end of the formal content name.

The speech recognition search system according to claim 1, wherein the speech recognition result or the search result is displayed on the display unit in a descending order of likelihood.

A speech recognition search method in a speech recognition search system that recognizes, by speech recognition, a search key for a search target input by a user by voice, the method comprising:
Inputting a search term by voice;
Performing speech recognition on the input speech signal to convert it into text, and presenting a plurality of speech recognition candidates in descending order of likelihood;
Searching a content information database using a character string selected from the presented candidates as a search keyword;
Selecting content from a content candidate list;
Extracting the utterance tendency when the content is selected; and
Comparing the formal name of the selected content with the content of the utterance, extracting the features of the utterance, and storing them in an utterance tendency database.
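
The comparison between the formal content name and the utterance recited in the claims above could be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the names `UtteranceFeature`, `extract_feature`, and `summarize` are hypothetical, the inputs are assumed to be readings given as plain strings, and the extra labels "full" and "other" are added here to cover cases the claims do not enumerate.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class UtteranceFeature:
    length_ratio: float  # utterance length normalized by the formal name's length
    position: str        # "head", "middle", "tail", "full" or "other"


def extract_feature(uttered: str, formal: str) -> UtteranceFeature:
    """Compare an uttered reading with the reading of the formal content name."""
    if not formal:
        raise ValueError("formal name reading must not be empty")
    ratio = len(uttered) / len(formal)
    if uttered == formal:
        position = "full"      # whole name uttered (label is an assumption)
    elif formal.startswith(uttered):
        position = "head"      # uttered from the beginning
    elif formal.endswith(uttered):
        position = "tail"      # uttered to the end
    elif uttered in formal:
        position = "middle"    # uttered a middle portion
    else:
        position = "other"     # e.g. a nickname (label is an assumption)
    return UtteranceFeature(ratio, position)


def summarize(features: list[UtteranceFeature]) -> tuple[float, str]:
    """Reduce stored features to a mean length ratio and the most common position."""
    if not features:
        raise ValueError("no features to summarize")
    mean_ratio = sum(f.length_ratio for f in features) / len(features)
    position = Counter(f.position for f in features).most_common(1)[0][0]
    return mean_ratio, position
```

Normalizing by the length of the formal name makes utterances for long and short titles comparable, so a single tendency value can be applied when the dictionary is regenerated.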