JP5847290B2 - Document search apparatus and document search method

Info

Publication number: JP5847290B2
Application number: JP2014504643A
Authority: JP
Inventors: 洋一藤井; 石井　純; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2012-03-13
Filing date: 2012-12-27
Publication date: 2016-01-20
Anticipated expiration: 2032-12-27
Also published as: WO2013136634A1; US20150112683A1; CN104221012A; JPWO2013136634A1; DE112012006633T5

Description

The present invention relates to a document search apparatus and a document search method for searching an electronic document in fine-grained units such as chapters, sections, and subsections.

Many devices, such as home appliances and in-vehicle equipment, come with a paper instruction manual that describes how to operate the device and what to do when trouble occurs. For information devices that have a screen, the manual is often digitized so that it can be searched and browsed directly on the device, without the user having to carry a paper document around. On the other hand, an electronic document is hard to take in at a glance, which makes it difficult for users to find the content they want to check, so a search function is essential.

The simplest of the typical conventional search functions is the GREP search method, which searches by keyword and displays the hits in order of appearance from the beginning of the document. There is also the logical search method, in which a search index is created in advance from the document and the keywords extracted from it, and the index is searched with a logical expression to display candidates. Because the logical search method cannot define a score that expresses the degree of relevance between the input keywords and the search index, there is also the best-matching search method, which determines a score simply by counting how often the input keywords appear. Finally, there is the statistical search method, which creates a search index in which keywords carry statistical weights such as tf-idf (term frequency and inverse document frequency) and ranks candidates by the vector distance (inner product) to the input keywords. These search methods make it possible to search electronic documents and, to some extent, to browse the parts the user is looking for.

The logical search method returns only items that exactly match the search conditions. Its advantage is that a user who makes full use of complex search conditions can readily find items that match the search intent; its disadvantage is that even a slightly inappropriate condition easily leads to missed results. Constructing a complex search expression is also a high hurdle for general users. For this reason, the most common form of logical search has the user enter several keywords and presents the results obtained by combining them with a logical OR.
The best-matching and statistical search methods, by contrast, have the advantage that no logical structure needs to be imposed on the keywords. Their disadvantage is that the score is either a simple count of keyword occurrences in the document or a value weighted according to occurrence tendencies, so it is difficult for the user to control the results.

In view of these advantages and disadvantages, methods that integrate several search engines have been proposed as a way to exploit the strengths of each. For example, Patent Document 1 discloses a method that executes a logical search and a statistical search, or a best-matching search and a statistical search, separately and then logically integrates the results.

Specifically, a logical search engine yields only search result candidates, whereas best-matching and statistical search engines yield search result candidates together with their scores.
When the logical search method is combined with the statistical search method, for example, either only the document IDs that appear in both the logical search result and the statistical search result are taken as final result candidates, or all document IDs from both results are taken as final result candidates and the scores of the statistical search result are used to rank them.

When the best-matching search method is combined with the statistical search method, the final results are ranked by the average of the two scores.
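The combination rules attributed to Patent Document 1 above can be sketched roughly as follows. This is a minimal illustration, not the method of that publication: the function names, the representation of results as lists and dictionaries, the sample IDs and scores, and the choice to treat a missing score as 0 are all assumptions made for the example.

```python
def combine_logical_statistical(logical_ids, statistical_scores, mode="intersection"):
    """Merge an unscored logical result with a scored statistical result.

    logical_ids: iterable of document IDs (hypothetical representation).
    statistical_scores: dict mapping document ID to score.
    """
    if mode == "intersection":
        # Keep only IDs returned by both engines.
        candidates = set(logical_ids) & set(statistical_scores)
    else:
        # "union": keep every ID returned by either engine.
        candidates = set(logical_ids) | set(statistical_scores)
    # Rank by statistical score (assumed 0.0 when absent), then by ID for a stable order.
    return sorted(candidates, key=lambda d: (-statistical_scores.get(d, 0.0), d))


def combine_by_average(best_match_scores, statistical_scores):
    """Rank IDs by the average of two scores (a missing score is assumed to be 0.0)."""
    ids = set(best_match_scores) | set(statistical_scores)
    averaged = {
        d: (best_match_scores.get(d, 0.0) + statistical_scores.get(d, 0.0)) / 2
        for d in ids
    }
    return sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample results.
logical = ["d1", "d2", "d3"]
statistical = {"d2": 0.9, "d3": 0.4, "d4": 0.7}
best_match = {"d2": 0.5, "d4": 1.0}

both = combine_logical_statistical(logical, statistical, "intersection")
either = combine_logical_statistical(logical, statistical, "union")
averaged_ids = [d for d, _ in combine_by_average(best_match, statistical)]
```

Averaging only makes sense when the two scores are on comparable scales, which this sketch simply assumes.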

In addition, to reduce the number of cases in which a search fails because of superficial differences between keywords, conventional search methods have been proposed in which a table of synonyms and near-synonyms is prepared in advance and the keywords in the search condition are expanded into their synonyms and near-synonyms before searching.
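Synonym expansion of this kind can be sketched as below. The function name `expand_query` and the table contents are hypothetical; a real system would load a curated table of synonyms and near-synonyms.

```python
def expand_query(keywords, synonym_table):
    """Return the keywords plus their synonyms, in order and without duplicates."""
    expanded = []
    for keyword in keywords:
        for term in [keyword, *synonym_table.get(keyword, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


# Hypothetical synonym table for illustration only.
synonyms = {"map": ["chart"], "direction": ["heading", "orientation"]}
expanded_query = expand_query(["map", "direction"], synonyms)
```

As the description goes on to note, expansion like this helps only when the user's wording is at least a near-synonym of the manual's wording.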

Japanese Patent Laid-Open No. 10-143530

Because conventional document search apparatuses and document search methods are configured as described above, they make it easier to obtain the search results the user wants than searching with a single search method. However, in all of these search methods the keywords used to build the search index are extracted from the search target document itself. Whether a single search method is used or several are combined, the search is therefore fundamentally a search for keywords that appear in the document.

Furthermore, in an actual search the user has to enter search conditions without knowing which keywords the document uses, so the desired item sometimes cannot be found. Searching with synonym and near-synonym expansion is used to address this, and some improvement can be expected from it. However, documents such as instruction manuals are often written, for the sake of accuracy, with technical terms and with special terms for product-specific functions. General users, and beginners who want to learn how to use the product, therefore often do not know which keywords to search for to reach the explanation they need. For example, terms for the map orientation in car navigation such as "north reference" and "own-vehicle reference" are keywords that a car navigation beginner would never think of. Such a user tries to search with a condition like "I want a map where the direction I am driving in is always at the top", and because the condition contains no appropriate keyword, the desired search result is not obtained.

The present invention has been made to solve the problems described above, and its object is to present, in response to a user's natural-language input, search results that are more appropriate than those obtained by a simple search method.

A document search apparatus according to the present invention comprises: a search index created from a document prepared in advance; a document search unit that receives an input from a user and uses the search index to search the document for items related to the user input; an utterance estimation model that has learned the correspondence between assumed questions asking about the content of the document and the items in the document that answer those questions; an utterance content estimation unit that, based on the utterance estimation model, estimates the item in the document that corresponds to the answer to the user input; and a result integration unit that integrates the document search result obtained from the document search unit and the document estimation result obtained from the utterance content estimation unit to generate a final search result.

A document search method according to the present invention comprises: a user input step in which an input analysis unit receives an input from a user; a document search step in which a document search unit uses a search index created from a document prepared in advance to search the document for items related to the user input; an utterance content estimation step in which an utterance content estimation unit estimates the item in the document that corresponds to the answer to the user input, based on an utterance estimation model that has learned the correspondence between assumed questions asking about the content of the document and the items in the document that answer those questions; and a result integration step in which a result integration unit integrates the document search result obtained in the document search step and the document estimation result obtained in the utterance content estimation step to generate a final search result.

According to the present invention, an utterance estimation model that has learned the correspondence between questions, written on the assumption of how users would actually ask, and the document items that answer them is used to estimate the item in the document that corresponds to the answer to the user input, and the estimation result is integrated with the result of the index search. As a result, search results that are more appropriate than those of a simple search method can be presented in response to a user's natural-language input.

FIG. 1 is a block diagram showing the configuration of a document search apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of a document in the document search apparatus according to Embodiment 1.
FIG. 3 is a diagram showing an example of a document analysis result and a keyword list for the search index in the document search apparatus according to Embodiment 1.
FIG. 4 is a diagram showing an example of collected utterance data in the document search apparatus according to Embodiment 1.
FIG. 5 is a diagram showing an example of a collected utterance analysis result and a keyword list for the utterance estimation model in the document search apparatus according to Embodiment 1.
FIG. 6 is a flowchart showing the operation of creating a search index from a document in the document search apparatus according to Embodiment 1.
FIG. 7 is a flowchart showing the operation of creating an utterance estimation model from collected utterance data in the document search apparatus according to Embodiment 1.
FIG. 8 is a flowchart showing the operation of creating a final search result from a user input in the document search apparatus according to Embodiment 1.
FIG. 9 is a diagram showing an example of the transitions of a user input in the document search apparatus according to Embodiment 1.
FIG. 10 is a diagram showing the continuation of the user input transition example of FIG. 9.
FIG. 11 is a block diagram showing the configuration of a document search apparatus according to Embodiment 2 of the present invention.
FIG. 12 is a diagram showing the document hierarchy in the document search apparatus according to Embodiment 2.
FIG. 13 is a flowchart showing the operation of creating a final search result from a user input in the document search apparatus according to Embodiment 2.
FIG. 14 is a diagram showing an example of the transitions of a user input in the document search apparatus according to Embodiment 2.
FIG. 15 is a diagram showing an example of a document in a document search apparatus according to Embodiment 3 of the present invention.
FIG. 16 is a diagram showing an example of a document analysis result and a keyword list for the search index in the document search apparatus according to Embodiment 3.
FIG. 17 is a diagram showing an example of collected utterance data in the document search apparatus according to Embodiment 3.
FIG. 18 is a diagram showing an example of a collected utterance analysis result and a keyword list for the utterance estimation model in the document search apparatus according to Embodiment 3.
FIG. 19 is a diagram showing an example of the transitions of a user input in the document search apparatus according to Embodiment 3.
FIG. 20 is a diagram showing the continuation of the user input transition example of FIG. 19.
FIG. 21 is a diagram showing an example of a document in a document search apparatus according to Embodiment 4 of the present invention.
FIG. 22 is a diagram showing an example of a document analysis result and a keyword list for the search index in the document search apparatus according to Embodiment 4.
FIG. 23 is a diagram showing an example of collected utterance data in the document search apparatus according to Embodiment 4.
FIG. 24 is a diagram showing an example of a collected utterance analysis result and a keyword list for the utterance estimation model in the document search apparatus according to Embodiment 4.
FIG. 25 is a diagram showing an example of the transitions of a user input in the document search apparatus according to Embodiment 4.
FIG. 26 is a diagram showing the continuation of the user input transition example of FIG. 25.

To explain the present invention in more detail, modes for carrying out the invention are described below with reference to the accompanying drawings.
Embodiment 1.
An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the document search apparatus according to Embodiment 1.
The document 1 is text data obtained by digitizing, for example, the instruction manual of a product. The document 1 is assumed to be organized, to some extent, into a hierarchy of items such as chapters, sections, and subsections that follows the product functions. The input analysis unit 2 divides text such as that of the document 1 into morphemes by a known technique such as morphological analysis. The document analysis result 3 is the data obtained when the input analysis unit 2 divides the document 1 into morphemes.

The search index creation unit 4 creates the search index 5 from the document analysis result 3. Given keywords from the document search unit 12, the search index 5 returns items of the document 1, such as specific chapters, sections, and subsections, as search results.
The collected utterance data 6 is utterance data gathered in advance, for example through user questionnaires, on what users would want to ask when using the document 1. It is assumed to be created by generating questions from the product functions described in the document 1 and collecting them beforehand in a form such as a questionnaire.
The collected utterance analysis result 7 is the data obtained when the input analysis unit 2 divides the collected utterance data 6 into morphemes.

The utterance estimation model creation unit 8 performs statistical learning using the morphemes of the collected utterance analysis result 7 as learning units (features) and creates the utterance estimation model 9. The utterance estimation model 9 is the learned data that takes the morpheme sequence of the collected utterance analysis result 7 as input and returns, as the utterance content estimation result, the items corresponding to the answer to the question together with their scores.

The user input 10 is data representing an input from the user to the document search apparatus. In the following description the user input 10 is assumed to be text. The user input analysis result 11 is the data obtained when the input analysis unit 2 divides the user input 10 into morphemes.

The document search unit 12 takes the user input analysis result 11 as input, searches with the search index 5, and creates the document search result 13.
The utterance content estimation unit 14 takes the user input analysis result 11 as input, uses the utterance estimation model 9 to estimate the item that corresponds to this input, and obtains the document ID of that item. The document estimation result 15 is data containing the document IDs estimated by the utterance content estimation unit 14 and their scores (described later).

The result integration unit 16 integrates the document search result 13 and the document estimation result 15 into a single search result and outputs it as the final search result 17.

FIG. 2 shows an example of the document 1. The document 1 has a hierarchical structure of chapters, sections, and subsections, and each level has a document ID that indicates the position returned as a search result. In the example of FIG. 2, the document 1-1 with the document ID "Id_10_1" also contains the text of the lower levels of the data structure. For example, the document 1-2 with the ID "Id_10_1_1" is also contained in the document 1-1 with the ID "Id_10_1".
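One way to picture this containment is sketched below. It assumes, based only on the example IDs, that the hierarchy can be read off the ID prefix; the names `own_text`, `is_descendant`, and `full_text`, the ID "Id_10_1_2", and the placeholder texts are hypothetical and are not taken from FIG. 2.

```python
# Text attached directly to each document ID (placeholder strings, not the real manual text).
own_text = {
    "Id_10": "(text of chapter Id_10)",
    "Id_10_1": "(text of section Id_10_1)",
    "Id_10_1_1": "(text of subsection Id_10_1_1)",
    "Id_10_1_2": "(text of subsection Id_10_1_2)",
}


def is_descendant(child_id, parent_id):
    """True if child_id lies below parent_id, judged by ID prefix (an assumption)."""
    return child_id.startswith(parent_id + "_")


def full_text(doc_id, texts):
    """Text of an item including the text of every item below it in the hierarchy."""
    parts = [texts[doc_id]]
    parts += [texts[c] for c in sorted(texts) if is_descendant(c, doc_id)]
    return " ".join(parts)


section_text = full_text("Id_10_1", own_text)
```

With this representation, a search hit on the text of "Id_10_1_1" can also be attributed to "Id_10_1", which mirrors the containment described for FIG. 2.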

FIG. 3 shows an example of the document analysis result 3 and of a keyword list for the search index 5. The entry "Id_10_1_1" is an example of a document analysis result 3-1 and shows the result of analyzing the document 1-2 with the ID "Id_10_1_1" in FIG. 2 by morphological analysis. In the document analysis result 3-1, morpheme boundaries are marked with "/".
The search index data 3-2 is an example of the data used by the search index creation unit 4, derived from the document analysis result 3-1 for "Id_10_1_1". It consists of the document ID and a list of the base forms of the independent-word morphemes (the keywords).
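The extraction of a (document ID, keyword list) pair can be sketched as follows. The morpheme tuples stand in for the output of a real morphological analyzer and are hand-written, English, and hypothetical; which parts of speech count as independent words is likewise an assumption of this sketch.

```python
# Parts of speech treated as independent (content) words; an assumed set.
CONTENT_POS = {"noun", "verb", "adjective", "adverb"}


def extract_keywords(doc_id, morphemes):
    """Build a (document ID, keyword list) pair from analyzed morphemes.

    morphemes: list of (surface form, base form, part of speech) tuples.
    Repeated keywords are kept so that term frequencies can be counted later.
    """
    keywords = [base for _surface, base, pos in morphemes if pos in CONTENT_POS]
    return doc_id, keywords


# Hypothetical analysis of "A map is displayed".
analysis = [
    ("A", "a", "determiner"),
    ("map", "map", "noun"),
    ("is", "be", "auxiliary"),
    ("displayed", "display", "verb"),
]
pair = extract_keywords("Id_10_1_1", analysis)
```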

FIG. 4 shows an example of the collected utterance data 6. The collected utterance data 6-1 is an example of a question corresponding to the document "Id_10", the collected utterance data 6-2 is an example of a question corresponding to the document "Id_10_1", and the collected utterance data 6-3 is an example of a question corresponding to the document "Id_10_1_1". The collected utterance data 6-4 is a question from a user who wants to know a specific way of changing the map type. Because it asks for a map type that the product assumed here cannot provide, no document ID at the same level as "Id_10_1_1" can be selected for it.
The collected utterance data 6-1 to 6-4 are example question sentences written on the assumption of how users would ask when they want to check a function of the product.

FIG. 5 shows an example of the collected utterance analysis result 7 and of a keyword list for the utterance estimation model 9. The entry "Id_10_1_1" is an example of a collected utterance analysis result 7-1 and shows the result of analyzing the text of the collected utterance data 6-3 for "Id_10_1_1" in FIG. 4 by morphological analysis.
The utterance estimation model data 7-2 is an example of the data used by the utterance estimation model creation unit 8, derived from the collected utterance analysis result 7-1 for "Id_10_1_1". It consists of the document ID and a list of the base forms of the independent-word morphemes (the keywords).

Next, the operation of the document search apparatus is described.
The operation is divided broadly into two processes. One is the creation process, which creates the search index 5 from the document 1 and the utterance estimation model 9 from the collected utterance data 6. The other is the search process, which receives the user input 10 and creates the final search result 17. The creation process is described first.

First, the method for creating the search index 5 is described. Here, the index is assumed to be weighted by tf-idf, as disclosed in the prior art.
FIG. 6 is a flowchart showing the operation of creating the search index 5 from the document 1. As shown in FIG. 2, the document 1 is assumed to consist of pairs that each associate a document ID with a text. For example, in the document 1-2 the document ID "Id_10_1_1" is associated with the text "Own-vehicle reference. A map is displayed with the vehicle's direction of travel at the top." In step ST1, the input analysis unit 2 reads the document 1 with this structure item by item and divides each text into a morpheme sequence by morphological analysis, which is a known technique. The result of morphological analysis of the document 1-2 is the document analysis result 3-1 in FIG. 3. Although the document analysis result 3-1 shows only the morpheme boundaries "/", it is assumed that it actually also contains part-of-speech information, the base forms of inflected words, readings, and so on.

Once the document analysis result 3 has been generated for all document IDs, in the following step ST2 the search index creation unit 4 extracts from all document analysis results 3 the morphemes (keywords) needed to create the search index 5, forms (document ID, keyword list) pairs, and creates from all the pairs the search index 5 weighted by tf-idf. The (document ID, keyword list) pair extracted from the document analysis result 3-1 in FIG. 3 is shown as the search index data 3-2 in the same figure.

The index creation procedure is not described in detail here, but in outline it is as follows. In tf-idf, the number of distinct keywords contained in all document IDs is taken as the dimension of a vector, each keyword is assigned to one vector element, and the value of each element is the frequency of that keyword (the tf part). These values are then weighted according to the heuristic that keywords appearing in many documents (common words) are of low importance, whereas keywords appearing only in particular documents are of high importance (the idf part). The resulting weighted table is the search index 5.
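A minimal sketch of such a weighted table is given below. The description does not state the exact weighting formula, so the sketch assumes one common variant, weight = tf × log(N / df), and ranks by the inner product with the query vector. The function names, the English keywords, and the IDs "Id_10_1_2" and "Id_20" are hypothetical.

```python
import math
from collections import Counter


def build_index(pairs):
    """Build {doc_id: {keyword: tf-idf weight}} from (doc_id, keyword_list) pairs."""
    n_docs = len(pairs)
    # Document frequency: in how many documents each keyword appears.
    df = Counter()
    for _doc_id, keywords in pairs:
        df.update(set(keywords))
    index = {}
    for doc_id, keywords in pairs:
        tf = Counter(keywords)  # term frequency within this document
        index[doc_id] = {k: tf[k] * math.log(n_docs / df[k]) for k in tf}
    return index


def search(index, query_keywords):
    """Rank documents by the inner product of query counts and tf-idf weights."""
    query = Counter(query_keywords)
    scores = {
        doc_id: sum(weights.get(k, 0.0) * count for k, count in query.items())
        for doc_id, weights in index.items()
    }
    hits = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical (document ID, keyword list) pairs.
pairs = [
    ("Id_10_1_1", ["own-vehicle", "reference", "map", "display"]),
    ("Id_10_1_2", ["north", "reference", "map", "display"]),
    ("Id_20", ["volume", "adjust"]),
]
index = build_index(pairs)
results = search(index, ["own-vehicle", "map"])
```

With this variant, a keyword that appears in every document receives a weight of zero, which is the extreme case of the "common words are unimportant" heuristic.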

Next, the process of creating the utterance estimation model 9 is described.
FIG. 7 is a flowchart showing the operation of creating the utterance estimation model 9 from the collected utterance data 6. As shown by the collected utterance data 6-1 to 6-4 in FIG. 4, the collected utterance data 6 is data in which utterances collected from users in advance are assigned to the document IDs that answer them. The collected utterance data 6 is created by presenting, in a questionnaire or the like, a description of the function covered by each document ID and collecting the sentences that respondents would say when looking for that function. For example, when the specific content of "Id_10_1_1" in FIG. 4, "Own-vehicle reference. A map is displayed with the vehicle's direction of travel at the top.", is presented, utterances like the collected utterance data 6-3 can be expected. When a higher-level concept such as "Id_10" is presented, utterances like the collected utterance data 6-1 can be expected, and also utterances like the collected utterance data 6-2 to 6-4. The collected utterance data 6-4 is an utterance about something outside the functions of the product described in the document 1; in this case it is assigned to the intermediate document ID "Id_10_1". This work is done manually in advance, so that data with the structure shown in FIG. 4 is available.

In step ST3, the input analysis unit 2 performs morphological analysis of the collected utterance data 6 in the same way as for the document 1 in step ST1. For example, the result of morphological analysis of the collected utterance data 6-3 in FIG. 4 is the collected utterance analysis result 7-1 in FIG. 5. In the following step ST4, the utterance estimation model creation unit 8 extracts, as in step ST2, the document ID and the keyword list as the utterance estimation model data 7-2 and performs the processing for creating the utterance estimation model 9. Here, the utterance estimation model 9 is assumed to be learned by the maximum entropy method (hereinafter, the ME method).

The ME method is not described in detail here, but is outlined briefly. The ME method takes pairs of (document ID, keyword list) as learning data and estimates a document ID when a keyword list is given as input. The weights of the (document ID, keyword list) pairs are calculated so that, when a document ID is estimated from a keyword list, the learning data becomes most likely (that is, the number of correct answers is maximized); the stored weights constitute the utterance estimation model 9.
Keywords are extracted from all the collected utterance analysis results 7 and learned by the ME method to create the utterance estimation model 9. Specifically, the utterance estimation model data 7-2 in FIG. 5 is extracted from the collected utterance analysis result 7-1 in the same FIG. 5, and the above learning is performed on the basis of this utterance estimation model data 7-2.
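The learning step outlined above can be illustrated with a minimal sketch. It assumes that the ME method is realized as multinomial logistic regression over binary (document ID, keyword) features trained by plain gradient ascent; the description does not specify the feature design or the optimizer, so the function names, the learning rate, the number of epochs, and the English keywords below are all hypothetical.

```python
import math
from collections import defaultdict

def predict(weights, labels, keywords):
    """Return P(doc_id | keywords) for every doc_id (softmax over summed weights)."""
    feats = set(keywords)
    scores = {d: sum(weights.get((d, k), 0.0) for k in feats) for d in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {d: math.exp(s - top) for d, s in scores.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def train_maxent(pairs, epochs=200, lr=0.5):
    """pairs: list of (doc_id, keyword_list). Returns (weights, labels)."""
    labels = sorted({d for d, _ in pairs})
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for doc_id, keywords in pairs:
            probs = predict(weights, labels, keywords)
            for d in labels:
                # observed count minus expected count of each (doc_id, keyword) feature
                diff = (1.0 if d == doc_id else 0.0) - probs[d]
                for k in set(keywords):
                    grad[(d, k)] += diff
        for feat, g in grad.items():
            weights[feat] += lr * g / len(pairs)
    return dict(weights), labels

# Hypothetical learning data in the (document ID, keyword list) form.
pairs = [
    ("Id_10_1_1", ["map", "heading", "up"]),
    ("Id_10_1_2", ["map", "north", "up"]),
    ("Id_10", ["map", "display"]),
]
weights, labels = train_maxent(pairs)
probs = predict(weights, labels, ["heading"])
best = max(probs, key=probs.get)  # document ID with the highest estimated probability
```

The stored `weights` play the role of the utterance estimation model 9 in this sketch, and the probabilities returned by `predict` correspond to the scores of a document estimation result.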

Next, the search process will be described.
FIG. 8 is a flowchart showing the operation from the user input 10 to the creation of the final search result 17. FIGS. 9 and 10 show an example of the transitions in the search process for the user input 10-1, which is an example of the user input 10. Here, the user input 10 is assumed to be text, and the description assumes that the user input 10-1 in FIG. 9 has been entered. In step ST11, the input analysis unit 2 first receives the user input 10-1, performs morphological analysis to generate the user input analysis result 11-1, and extracts independent words from the user input analysis result 11-1 to create the keyword list 11-2. In the subsequent step ST12, the utterance content estimation unit 14 uses this keyword list 11-2 as input and obtains the document estimation result 15-1 in FIG. 10 from the utterance estimation model 9. As shown in FIG. 10, the document estimation result 15-1 is sorted by score. This score is calculated from the weights of the (document ID, keyword list) pairs stored in the utterance estimation model 9; a high score is given to a document ID that is highly related to the user input 10, that is, a document ID suitable as an answer to the question in the user input 10.

When the document estimation result 15-1 has been obtained, in the subsequent step ST13 the document search unit 12 uses the keyword list 11-2 as input and obtains the document search result 13-1 in FIG. 10 from the search index 5. As shown in FIG. 10, the document search result 13-1 is also sorted by score. This score is calculated from the tf·idf weights stored in the search index 5, and a high score is given to a document ID that is highly related to the user input 10.
Known techniques may be used to calculate the scores of the document estimation result 15 and the document search result 13, so their description is omitted here.
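As one illustration of such a known technique, the following sketch builds a tf·idf-weighted index from (document ID, keyword list) pairs and scores each document by the inner product with a binary query vector. The smoothed idf formula, the function names, and the sample keywords are assumptions made for the example and are not taken from the description.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: dict doc_id -> keyword list. Returns dict doc_id -> {keyword: tf*idf}."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each keyword
    df = Counter(k for kws in docs.values() for k in set(kws))
    index = {}
    for doc_id, kws in docs.items():
        tf = Counter(kws)
        # idf variant log(N/df) + 1 is an assumption; other variants are equally valid
        index[doc_id] = {k: c * (math.log(n_docs / df[k]) + 1.0) for k, c in tf.items()}
    return index

def search(index, query_keywords):
    """Score = inner product of the tf*idf vector with a binary query vector."""
    query = set(query_keywords)
    scores = {}
    for doc_id, kw_weights in index.items():
        s = sum(w for k, w in kw_weights.items() if k in query)
        if s > 0:
            scores[doc_id] = s
    # highest score first, as in a document search result
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical keyword lists per document ID.
docs = {
    "Id_10_1_1": ["map", "heading", "up", "direction"],
    "Id_10_1_2": ["map", "north", "up"],
    "Id_10_2": ["route", "search"],
}
results = search(build_index(docs), ["heading", "map"])
```

Documents that share no keyword with the query are left out of the result in this sketch; a real index would typically also normalize scores so that they are comparable with the estimation scores.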

When the process of step ST13 ends, the process proceeds to step ST14, where the result integration unit 16 determines whether the maximum score of the document estimation result 15-1 is equal to or greater than a threshold X set here (for example, X = 0.9). In the document estimation result 15-1, the maximum score is smaller than the threshold X (step ST14 “NO”), so the result integration unit 16 proceeds to step ST16. In step ST16, for each document ID, the score of the document search result 13-1 and the score of the document estimation result 15-1 are added with weights to create the final search result 17-1. In FIG. 10, the final search result 17-1 is the result of adding the scores at a ratio of (score of document estimation result 15-1) : (score of document search result 13-1) = 1 : 1.

On the other hand, when the maximum score of the document estimation result 15-1 exceeds the threshold X in step ST14 (step ST14 “YES”), in the subsequent step ST15 the result integration unit 16 discards the document search result 13-1 and uses the document estimation result 15-1 as the final search result (not shown).
When the search is completed, the document search apparatus displays the titles and other information of the resulting document IDs on the screen and lets the user select one, thereby presenting the desired position in the document.
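The branching of steps ST14 to ST16 can be sketched as follows. The function name, the dictionary representation of the two results, and the sample scores are hypothetical; the scores are not the values of FIG. 10. The comparison uses “equal to or greater than”, following the description of step ST14.

```python
def integrate(estimated, retrieved, threshold_x=0.9, w_est=1.0, w_ret=1.0):
    """estimated / retrieved: dict doc_id -> score. Returns a list sorted by score."""
    if estimated and max(estimated.values()) >= threshold_x:
        # ST14 "YES" -> ST15: discard the search result, keep the estimation result
        merged = dict(estimated)
    else:
        # ST14 "NO" -> ST16: weighted addition per document ID (1:1 by default)
        merged = {}
        for doc_id in set(estimated) | set(retrieved):
            merged[doc_id] = (w_est * estimated.get(doc_id, 0.0)
                              + w_ret * retrieved.get(doc_id, 0.0))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores: the maximum estimation score is below X, so both results are combined.
combined = integrate({"Id_10_1_1": 0.5, "Id_10_1_2": 0.3},
                     {"Id_10_1_2": 0.4, "Id_10_2": 0.2})

# Hypothetical scores: the maximum estimation score reaches X, so the search result is ignored.
estimated_only = integrate({"Id_10_1_1": 0.95, "Id_10_1_2": 0.03},
                           {"Id_10_2": 0.8})
```

A document ID that appears in only one of the two results is treated here as having a score of 0 in the other, which is an assumption; the description does not state how missing entries are handled.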

As described above, according to the first embodiment, the document search apparatus includes: the search index 5 created from the document 1 prepared in advance; the document search unit 12, which receives the user input analysis result 11 obtained by analyzing the user input 10 and uses the search index 5 to search the document 1 for document IDs related to the user input analysis result 11; the utterance estimation model 9, which has learned the collected utterance data 6 defining the correspondence between assumed questions (user utterances) about the contents of the document 1 and the document IDs that answer them; the utterance content estimation unit 14, which estimates, on the basis of the utterance estimation model 9, the document IDs in the document 1 that correspond to the answer to the user input analysis result 11; and the result integration unit 16, which integrates the document search result 13 obtained from the document search unit 12 with the document estimation result 15 obtained from the utterance content estimation unit 14 to generate the final search result 17. Because utterance content estimation based on the collected utterance data 6 is performed in addition to a simple document search function, searches become possible using the phrasing and general terms that general and novice users enter but that do not appear in the document 1, which conventional document search functions could not handle. Therefore, for a user's natural-language input, search results more appropriate than those of a simple search method can be presented.

Further, according to the first embodiment, the utterance content estimation unit 14 gives each estimated document ID a score corresponding to its degree of relevance to the user input 10, and the result integration unit 16 ignores the document search result 13 obtained from the document search unit 12 when generating the final search result 17 if the score of the document estimation result 15 obtained from the utterance content estimation unit 14 is larger than the predetermined threshold X. Therefore, when the input of a general or novice user consists of phrasing or general terms that do not appear in the document 1, the many inappropriate candidates that a simple search method would return are avoided, and more appropriate search results can be presented for the user's input.

In the first embodiment, the document estimation result 15 is used as the final search result 17 as it is when its maximum score is larger than the predetermined threshold X; however, the scores of the document estimation result 15 and the document search result 13 may instead always be added with weights at a predetermined ratio. The score of the document estimation result 15 is calculated for documents estimated directly from the user's utterance, whereas the score of the document search result 13 is calculated from the presence or absence of keywords in the documents. Each approach therefore has advantages and disadvantages, and weighted addition makes it possible to present documents that score well under both methods.

Further, according to the first embodiment, the document search apparatus includes: the input analysis unit 2, which analyzes the document 1 prepared in advance and the collected utterance data 6 defining the correspondence between user utterances asking about the contents of the document 1 and the document IDs that answer them; the search index creation unit 4, which creates the search index 5 from the document analysis result 3 output from the input analysis unit 2; and the utterance estimation model creating unit 8, which uses the collected utterance analysis result 7 output from the input analysis unit 2 to learn the correspondence between user utterances and document IDs and creates the utterance estimation model 9. This enables searches using the phrasing and general terms that general and novice users enter but that do not appear in the document 1, which conventional document search functions could not handle.

Embodiment 2.
FIG. 11 is a block diagram showing the configuration of the document search apparatus according to the second embodiment. In FIG. 11, parts that are the same as or equivalent to those in FIG. 1 are denoted by the same reference numerals, and their description is omitted.
The two major differences from the first embodiment are as follows.
(1) The utterance estimation model 9 is created with the collected utterance data 6 assigned to document IDs of larger units rather than fine-grained units.
(2) The document estimation result 15 is used to limit the range searched with the search index 5.

In FIG. 11, the search target limiting unit 18 limits the search target of the document search unit 12 to the document IDs below those in the document estimation result 15. The document limitation list 19 holds the document IDs to which the search is limited.

FIG. 12 is a diagram showing the hierarchy of the document IDs of the document 1. In the example of FIG. 12, the collected utterance data 6 is assigned only to document IDs in the first and second layers, and not to document IDs in layers below the second layer (the second layer being the document IDs enclosed in squares).

Next, the operation of the document search apparatus will be described.
The operation in the creation process is basically the same as in the first embodiment, except that the collected utterance data 6 is assigned to document IDs in the second layer or higher, as shown in FIG. 12. Accordingly, in FIG. 4, the collected utterance data 6-1 is assigned to the document ID “Id_10”, and the other collected utterance data 6-2 to 6-4 are all assigned to “Id_10_1”.
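The reassignment to the second layer or higher can be sketched as a truncation of the document ID. This assumes that a document ID encodes its position in the hierarchy as underscore-separated levels after an “Id” prefix, as the IDs quoted in the description suggest; the function name and the sample utterance texts are hypothetical.

```python
def coarsen(doc_id, max_layer=2, sep="_"):
    """Map a document ID to its ancestor at max_layer (IDs already above it are unchanged)."""
    prefix, *levels = doc_id.split(sep)
    return sep.join([prefix] + levels[:max_layer])

# Hypothetical collected utterances labelled at a fine granularity.
fine_pairs = [
    ("Id_10", "utterance about the map in general"),
    ("Id_10_1_1", "utterance about heading-up display"),
    ("Id_10_1", "utterance about an unsupported map type"),
]
# Labels used for training in the second embodiment: at most the second layer.
coarse_pairs = [(coarsen(doc_id), text) for doc_id, text in fine_pairs]
```

With this mapping, “Id_10” stays as it is and every ID under “Id_10_1” collapses to “Id_10_1”, which matches the assignment described for the collected utterance data 6-1 to 6-4.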

Next, the search process will be described.
FIG. 13 is a flowchart showing the operation from the user input 10 to the creation of the final search result 17. FIG. 14 is a diagram explaining the operation of the search target limiting unit 18. As in the first embodiment, the user input 10 is assumed to be text, and the description assumes that the user input 10-1 in FIG. 9 has been entered. In step ST11, the input analysis unit 2 analyzes the user input 10-1 in the same manner as in FIG. 8. Next, in step ST12, the utterance content estimation unit 14 estimates the utterance content. The estimation result is the document estimation result 15-2 (document ID, score) in FIG. 14. As described above, because the collected utterance data 6 is assigned only to document IDs in the second layer or higher, the result contains no document IDs in the third layer or lower.

In the subsequent step ST21, the search target limiting unit 18 checks whether the document estimation result 15-2 contains at least one document ID whose score is equal to or greater than a threshold Y (for example, Y = 0.6). In the document estimation result 15-2, the score of “Id_10_1” is 0.6 or more (step ST21 “YES”), so the process moves to step ST22, where the lower layers of each document ID whose score is equal to or greater than the threshold Y are expanded and each expanded document ID is given the same score. In the document estimation result 15-2, only “Id_10_1” is at or above the threshold Y, so the search target limiting unit 18 selects “Id_10_1_1” to “Id_10_1_7”, which lie below “Id_10_1”, as search targets and sets them as the document limitation list 19-1.

In the subsequent step ST23, the document search unit 12 searches the search index 5 using the keyword list 11-2 in FIG. 14 and obtains the document search result 13-1. Then, in step ST24, the scores in the document limitation list 19-1 are added to the scores of the document search result 13-1, and the result is output as the final search result 17-2.

On the other hand, when the document estimation result 15-2 contains no score exceeding the threshold Y in step ST21 (step ST21 “NO”), the search target limiting unit 18 discards the document estimation result 15-2 (step ST25), and in the subsequent step ST26 the document search unit 12 obtains a document search result (not shown) with all document IDs as search targets and outputs it as is as the final search result (not shown).
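The flow of steps ST21 to ST26 can be sketched as follows. The function name, the `children` mapping from a document ID to its lower-layer IDs, and the sample scores are hypothetical, and only direct children are expanded here; the description does not state whether deeper layers are expanded recursively.

```python
def limit_and_search(estimated, retrieved, children, threshold_y=0.6):
    """estimated / retrieved: dict doc_id -> score; children: dict doc_id -> lower-layer IDs."""
    by_score = lambda kv: kv[1]
    # ST21: document IDs whose estimation score is at or above the threshold Y
    hits = {d: s for d, s in estimated.items() if s >= threshold_y}
    if not hits:
        # ST25, ST26: discard the estimation result and search all document IDs
        return sorted(retrieved.items(), key=by_score, reverse=True)
    # ST22: expand the lower layer and give each expanded ID the parent's score
    limited = {}
    for doc_id, score in hits.items():
        for child in children.get(doc_id, []):
            limited[child] = max(score, limited.get(child, 0.0))
    # ST23, ST24: keep only the limited IDs and add the two scores
    final = {d: retrieved.get(d, 0.0) + s for d, s in limited.items()}
    return sorted(final.items(), key=by_score, reverse=True)

# Hypothetical hierarchy and scores.
children = {"Id_10_1": ["Id_10_1_1", "Id_10_1_2"]}
retrieved = {"Id_10_1_1": 0.3, "Id_20_1_1": 0.9}
limited_result = limit_and_search({"Id_10_1": 0.7, "Id_10": 0.2}, retrieved, children)
fallback_result = limit_and_search({"Id_10_1": 0.4}, retrieved, children)
```

In the first call, “Id_20_1_1” is dropped even though its search score is high, because it lies outside the limited range; in the second call no estimation score reaches Y, so the plain search result is returned unchanged.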

As described above, according to the second embodiment, the document search apparatus includes the search target limiting unit 18, which extracts, from the document estimation result 15 obtained from the utterance content estimation unit 14, the document IDs whose scores are equal to or greater than the predetermined threshold Y together with the document IDs in their lower layers. The utterance content estimation unit 14 performs estimation on the basis of the utterance estimation model 9, which has learned the correspondence between the collected utterance data 6 and document IDs in layers above the layer that is the minimum unit of search in the search index 5, and the result integration unit 16 integrates the document IDs extracted by the search target limiting unit 18 from the document estimation result 15 with the document search result 13 obtained from the document search unit 12. Because the collected utterance data 6 is assigned to document IDs in higher layers, it can be associated with document IDs without taking into account fine differences in function between product models. This makes it easier to associate document IDs with the collected utterance data 6 and suppresses the loss of search accuracy caused by data sparseness. In addition, because product functions can be defined at a general-purpose level, the same collected utterance data 6 can be shared in product development involving many models, which makes it easier to support new products.

In the first and second embodiments, a search index of the statistical search method is used as the search index 5; however, a search index of the logical search method may be used instead, with probabilities set on the basis of the total number of occurrences of the search keywords. In that case, one possible method is to let N be the largest total number of search keyword occurrences in any document and to use, as the score of each document, the total number of search keyword occurrences in that document divided by N. Another is to let M be the total number of search keyword occurrences over all documents in the search results and to use, as the score of each document, the total number of search keyword occurrences in that document divided by M.
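The two normalizations described above, division by N and division by M, can be sketched as follows. The function names and the sample keyword lists are hypothetical.

```python
def occurrence_counts(docs, query_keywords):
    """Total number of occurrences of the query keywords in each document."""
    query = set(query_keywords)
    return {doc_id: sum(kws.count(k) for k in query) for doc_id, kws in docs.items()}

def score_by_max(counts):
    """Score = count / N, where N is the largest per-document count."""
    n = max(counts.values(), default=0)
    return {d: c / n for d, c in counts.items() if c > 0} if n else {}

def score_by_total(counts):
    """Score = count / M, where M is the sum of counts over all hit documents."""
    m = sum(counts.values())
    return {d: c / m for d, c in counts.items() if c > 0} if m else {}

# Hypothetical documents as keyword lists.
docs = {
    "A": ["map", "map", "heading", "heading"],
    "B": ["map", "north", "heading"],
    "C": ["map", "scale", "map"],
    "D": ["route"],
}
counts = occurrence_counts(docs, ["map", "heading"])  # A: 4, B: 2, C: 2, D: 0
```

For these counts, division by N = 4 gives 1.0, 0.5 and 0.5, while division by M = 8 gives 0.5, 0.25 and 0.25; only the second variant yields scores that sum to 1 over the hit documents.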

Further, in the first and second embodiments, the search index 5 and the utterance estimation model 9 are created in units of independent words; however, they may instead be created in units such as phoneme n-grams or syllable n-grams. The search index 5 and the utterance estimation model 9 may also be created by combining high-frequency words with phoneme n-grams, or high-frequency words with syllable n-grams. In this case, the sizes of the search index 5 and the utterance estimation model 9 can be reduced.
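The combination of high-frequency words with n-grams can be sketched as follows. Character n-grams are used here as a stand-in for phoneme or syllable n-grams, since producing phonemes or syllables would require a pronunciation dictionary that the description does not specify; the function names, the value of n, and the sample words are hypothetical.

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n (stand-in for phoneme/syllable n-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def index_units(tokens, frequent_words, n=3):
    """Keep high-frequency words whole and break every other token into n-grams."""
    units = []
    for tok in tokens:
        if tok in frequent_words:
            units.append(tok)
        else:
            # tokens shorter than n yield no n-gram, so keep them as they are
            units.extend(char_ngrams(tok, n) or [tok])
    return units

# "map" is treated as a high-frequency word; "heading" is decomposed into trigrams.
units = index_units(["map", "heading"], frequent_words={"map"}, n=3)
```

The resulting units can be fed to the index and model construction in place of independent words.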

In the first and second embodiments, for utterances that cannot be assigned anywhere in the document 1 because the product has no corresponding function and the document has no suitable explanation, such as the collected utterance data 6-4 in FIG. 4, a special document ID may be assigned when the utterance estimation model 9 is created. When the document ID with the maximum score in the document estimation result 15 for the user input 10 is that special document ID, the result integration unit 16 may create the final search result 17 without using the document search result 13. In this case, the document search apparatus may also be configured to present a message corresponding to the special document ID.
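The handling of a special document ID can be sketched as follows. The ID “Id_none”, the message text, the function name, and the choice to return an empty result list together with the message are all assumptions; the description only states that the document search result 13 is not used and that a message may be presented.

```python
SPECIAL_ID = "Id_none"  # hypothetical ID for utterances with no matching product function

def finalize(estimated, retrieved, messages, special_id=SPECIAL_ID):
    """Return (final_results, message); message is None in the normal case."""
    top = max(estimated, key=estimated.get) if estimated else None
    if top == special_id:
        # The search result is not used; only the message for the special ID is presented.
        return [], messages.get(special_id)
    # Normal case: 1:1 addition of both scores, leaving the special ID out of the list.
    merged = {d: estimated.get(d, 0.0) + retrieved.get(d, 0.0)
              for d in set(estimated) | set(retrieved) if d != special_id}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True), None

messages = {SPECIAL_ID: "This product does not provide that function."}
results, message = finalize({SPECIAL_ID: 0.8, "Id_10_1": 0.1},
                            {"Id_10_1_1": 0.5}, messages)
```

Here the special ID has the maximum estimation score, so `results` is empty and `message` holds the text registered for it.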

Further, in the first and second embodiments, the user input 10 is text; however, speech recognition may be used as the input means. In that case, possible methods include processing the text of the first candidate of the speech recognition result as the user input 10, and processing the candidates up to the N-th as the user input 10. When the speech recognition result is generated in units of morphemes, the processing in the input analysis unit 2 may be omitted and the result handled directly as the user input analysis result 11.

The first and second embodiments are described using Japanese input examples, but the language is not limited; the same effect can be obtained for English, German, Chinese, and other languages by replacing the input analysis unit 2 for each language.

Embodiment 3.
An English input example is described below.
The document search apparatus according to the third embodiment has the same configuration in the drawings as the document search apparatus shown in FIG. 1, so the following description refers to FIG. 1.

FIG. 15 is an English example of the document 1 input to the document search apparatus according to the third embodiment. The document 1 has a hierarchical structure of chapters, sections, and items, and each level of the hierarchy has a document ID indicating a search result position. In the example of FIG. 15, the document 1-11 with the document ID “Id_10_1” also includes the text contained in the data structures below it. For example, the document 1-12 of “Id_10_1_1” is also included in the document 1-11 of “Id_10_1”.

FIG. 16 shows examples of the document analysis result 3 and of a keyword list for the search index 5. The document analysis result 3-11 of “Id_10_1_1” is an example of a document analysis result, and shows the result of input analysis by morphological analysis of the document 1-12 of “Id_10_1_1” in FIG. 15. This document analysis result 3-11 presents only the morpheme boundaries, marked with “/”, but in practice information such as parts of speech is also generated.
The search index data 3-12 is an example of the data used by the search index creation unit 4, based on the document analysis result 3-11 of “Id_10_1_1”. Here, the document ID and the independent-word morphemes, excluding prepositions, articles, be verbs, and pronouns, are extracted.
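The extraction of independent words from English text can be sketched with a simple exclusion list. A real input analysis unit would use part-of-speech information from morphological analysis; the fixed word lists, the regular-expression tokenizer, and the function name below are assumptions, and the output is not claimed to match FIG. 16.

```python
import re

# Hypothetical, deliberately small word lists for the four excluded categories.
ARTICLES = {"a", "an", "the"}
BE_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her",
            "us", "them", "this", "that", "which", "what"}
PREPOSITIONS = {"to", "of", "in", "on", "at", "for", "with", "from", "by"}
EXCLUDED = ARTICLES | BE_VERBS | PRONOUNS | PREPOSITIONS

def extract_keywords(text):
    """Lower-case the text, split it into alphabetic tokens, and drop excluded words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in EXCLUDED]

pair = ("Id_10_1_1",
        extract_keywords("Display the map in the direction you are travelling"))
```

The resulting (document ID, keyword list) pair has the same shape as the search index data 3-12, although a list-based filter like this cannot distinguish, for example, “to” as a preposition from “to” as an infinitive marker.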

FIG. 17 shows examples of the collected utterance data 6. The collected utterance data 6-11 is an example of a question corresponding to the document “Id_10”, the collected utterance data 6-12 is an example of a question corresponding to the document “Id_10_1”, and the collected utterance data 6-13 is an example of a question corresponding to the document “Id_10_1_1”. The collected utterance data 6-14 is a question asking for a specific way to change the map type; because the map type in question cannot be provided by the product assumed here, it is an example of collected utterance data for which no document ID in the same layer as “Id_10_1_1” can be selected.

FIG. 18 shows examples of the collected utterance analysis result 7 and of a keyword list for the utterance estimation model 9. The collected utterance analysis result 7-11 of “Id_10_1_1” is an example of the analysis result for the collected utterance data 6-13 of “Id_10_1_1” in FIG. 17, and the utterance estimation model data 7-12 is an example of the data used by the utterance estimation model creating unit 8, based on the collected utterance analysis result 7-11 of “Id_10_1_1”. Here, the document ID and the independent-word morphemes, excluding prepositions, articles, and be verbs, are extracted.

Next, the operation of the document search apparatus will be described.
The operation (creation process and search process) of the document search apparatus according to the third embodiment is basically the same as that of the first embodiment shown in FIGS. 6 to 8. Therefore, only the differences are described here. First, the creation process will be described.

First, within the creation process, the method of creating the search index 5 will be described. Here, weighting by tf·idf, as disclosed in the prior art, is assumed.
As shown in FIG. 15, the document 1 consists of pairs in which a document ID is associated with text. For example, in the document 1-12, the text “Heading up. Display the map which rotated to always face the direction you are travelling” is associated with the document ID “Id_10_1_1”. In step ST1 of FIG. 6, the input analysis unit 2 sequentially reads the document 1 having this structure and divides it into morpheme strings by morphological analysis, which is a known technique. The result of morphological analysis of the document 1-12 is the document analysis result 3-11 in FIG. 16. Although this document analysis result 3-11 shows only the morpheme boundaries, it actually also contains part-of-speech information, the base forms of inflected words, and the like.

When the document analysis results 3 have been generated for all document IDs, in the subsequent step ST2 the search index creation unit 4 extracts the morphemes (keywords) needed to create the search index 5 from all the document analysis results 3, creates (document ID, keyword list) pairs, and creates the search index 5 weighted by tf·idf on the basis of all the pairs. The (document ID, keyword list) pair extracted from the document analysis result 3-11 in FIG. 16 is shown as the search index data 3-12 in the same FIG. 16.

The specific procedure for creating the search index is the same as in the first embodiment, so its description is omitted.

Next, the process of creating the utterance estimation model 9 will be described.
The collected utterance data 6 is data in which utterances collected in advance from users are assigned to the document IDs that answer them, as shown by the collected utterance data 6-11 to 6-14 in FIG. 17. The method of creating the collected utterance data 6 is the same as in the first embodiment, so its description is omitted.

In step ST3 shown in FIG. 7, the input analysis unit 2 performs morphological analysis on the collected utterance data 6 in the same manner as for the document 1 in step ST1 described above. For example, the result of morphological analysis of the collected utterance data 6-13 in FIG. 17 is the collected utterance analysis result 7-11 in FIG. 18. In the subsequent step ST4, the utterance estimation model creating unit 8 extracts a document ID and a keyword list as the utterance estimation model data 7-12, in the same manner as in step ST2 described above, and learns the utterance estimation model 9 by the ME method as in the first embodiment. Keywords are extracted from all the collected utterance analysis results 7 and learned by the ME method to create the utterance estimation model 9. Specifically, the utterance estimation model data 7-12 in FIG. 18 is extracted from the collected utterance analysis result 7-11 in the same FIG. 18, and the above learning is performed on the basis of this utterance estimation model data 7-12.

Next, the search process will be described.
FIGS. 19 and 20 show an example of the transitions in the search process for a user input 10-11, which is an example of the user input 10. Here the user input 10 is assumed to be text, and the description assumes that the user input 10-11 in FIG. 19 has been entered. In step ST11 shown in FIG. 8, the input analysis unit 2 first receives the user input 10-11 and performs morphological analysis to generate the user input analysis result 11-11. It then extracts the independent words (content words) from the user input analysis result 11-11 by excluding prepositions, articles, be-verbs, and pronouns, and creates the keyword list 11-12. In the subsequent step ST12, the utterance content estimation unit 14 uses this keyword list 11-12 as input and obtains the document estimation result 15-11 in FIG. 20 from the utterance estimation model 9. As shown in FIG. 20, the document estimation result 15-11 is sorted in order of score.
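The keyword extraction in step ST11 can be sketched as follows. This is a simplification under stated assumptions: a regular-expression tokenizer and a small hand-written stop list stand in for real morphological analysis with part-of-speech tags, and the sample sentence is hypothetical rather than the actual user input 10-11 of FIG. 19.

```python
import re

# Hypothetical stop list approximating the excluded parts of speech.
STOP_WORDS = {
    # prepositions
    "of", "in", "on", "at", "to", "for", "from", "with", "by",
    # articles
    "a", "an", "the",
    # be-verbs
    "be", "is", "am", "are", "was", "were", "been", "being",
    # pronouns
    "i", "you", "he", "she", "it", "we", "they", "me", "my", "your",
    "this", "that",
}

def extract_keywords(user_input):
    """Lowercase, tokenize, and drop words on the stop list."""
    tokens = re.findall(r"[a-z0-9']+", user_input.lower())
    return [token for token in tokens if token not in STOP_WORDS]

keyword_list = extract_keywords("I want to change the direction of the map")
```

A production system would filter on the part-of-speech tags produced by the morphological analyzer instead of a fixed word list, since the same surface form can belong to different parts of speech.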

Once the document estimation result 15-11 has been obtained, in the subsequent step ST13 the document search unit 12 uses the keyword list 11-12 as input and obtains the document search result 13-11 in FIG. 20 from the search index 5. As shown in FIG. 20, the document search result 13-11 is also sorted in order of score.

In the subsequent step ST14, the result integration unit 16 determines whether the maximum score in the document estimation result 15-11 is equal to or greater than a threshold value X defined here (for example, X = 0.9). In the document estimation result 15-11 the maximum score is smaller than the threshold value X (step ST14 "NO"), so the result integration unit 16 proceeds to step ST16. In step ST16, for each document ID, it computes a weighted sum of the score in the document search result 13-11 and the score in the document estimation result 15-11 to create the final search result 17-11. In FIG. 20, the final search result 17-11 is the result of adding the two scores in the ratio (score of the document estimation result 15-11) : (score of the document search result 13-11) = 1 : 1.

On the other hand, if the maximum score in the document estimation result 15-11 exceeds the threshold value X in step ST14 (step ST14 "YES"), then in the subsequent step ST15 the result integration unit 16 discards the document search result 13-11 and uses the document estimation result 15-11 as the final search result (not shown).
When the search is finished, the document search apparatus displays the titles and the like of the document IDs on the screen and lets the user select one, thereby presenting the desired position in the document.
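Steps ST14 to ST16 can be summarized in a short sketch. The function name `integrate`, the dictionary representation of the two results, and all scores are hypothetical; the threshold 0.9 and the 1 : 1 weighting are the example values given in the text. The comparison uses "equal to or greater than", following the wording of step ST14.

```python
def integrate(estimation, search, threshold=0.9, w_estimation=1.0, w_search=1.0):
    """Combine a document estimation result and a document search result.

    Both arguments map document IDs to scores. Returns (doc_id, score)
    pairs sorted by descending score.
    """
    if estimation and max(estimation.values()) >= threshold:
        # ST15: the estimation is confident, so the search result is discarded.
        merged = dict(estimation)
    else:
        # ST16: weighted addition per document ID; a missing score counts as 0.
        merged = {
            doc_id: w_estimation * estimation.get(doc_id, 0.0)
            + w_search * search.get(doc_id, 0.0)
            for doc_id in set(estimation) | set(search)
        }
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores; the maximum estimation score is below the threshold.
estimation_result = {"Id_10_1_1": 0.5, "Id_10_1_2": 0.25, "Id_10_2": 0.25}
search_result = {"Id_10_1_2": 0.75, "Id_10_1_1": 0.125}
final_result = integrate(estimation_result, search_result)

# Hypothetical confident estimation: the search result is ignored.
confident_result = integrate({"Id_10_1_1": 0.95, "Id_10_2": 0.05}, search_result)
```

Treating a document ID that appears in only one of the two results as having a score of 0 in the other is an assumption of this sketch; the text does not state how such IDs are handled.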

As described above, according to the third embodiment, the document search apparatus can apply the same processing as in the first embodiment to English documents 1 as well as Japanese ones, and the same effects as in the first embodiment are obtained for English input.
Although the description is omitted, the configuration of the third embodiment may also be applied to the second embodiment.

Embodiment 4.
An example of Chinese input is described below.
The document search apparatus according to the fourth embodiment has the same configuration in the drawings as the document search apparatus shown in FIG. 1, so the following description refers to FIG. 1.

FIG. 21 is a Chinese example of the document 1 input to the document search apparatus according to the fourth embodiment. The document 1 has a hierarchical structure of chapters, sections, and items, and each level has a document ID that indicates a search result position. In the example of FIG. 21, the document 1-21 with the document ID "Id_10_1" also contains the text included in the data structures below it. For example, the document 1-22 with the ID "Id_10_1_1" is also contained in the document 1-21 with the ID "Id_10_1".
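The containment relationship between levels can be sketched as follows, assuming that the hierarchy is encoded in the document ID itself, so that "Id_10_1_1" lies below "Id_10_1" because it extends that ID with a further "_"-separated component. This encoding is inferred from the example IDs and is not stated explicitly; the function name `full_text` and the English placeholder texts are hypothetical stand-ins for the Chinese text of FIG. 21.

```python
def full_text(doc_id, own_text):
    """Return the text of doc_id together with the text of every item below it."""
    parts = [
        text
        for other_id, text in sorted(own_text.items())
        # The trailing "_" keeps "Id_10_1" from matching an ID such as "Id_10_11".
        if other_id == doc_id or other_id.startswith(doc_id + "_")
    ]
    return " ".join(parts)

# Hypothetical text belonging directly to each level.
own_text = {
    "Id_10": "Map settings.",
    "Id_10_1": "Changing the map type.",
    "Id_10_1_1": "Switching to north-up display.",
    "Id_10_2": "Changing the map scale.",
}
section_text = full_text("Id_10_1", own_text)
```

With this arrangement a query can match either a fine-grained item or the larger section that encloses it, which is what allows search results to be returned at any level of the hierarchy.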

FIG. 22 shows an example of the document analysis result 3 and of the keyword list for the search index 5. "Id_10_1_1" is an example of a document analysis result: it shows the result of input analysis by morphological analysis of the document 1-22 with the ID "Id_10_1_1" in FIG. 21. The document analysis result 3-21 shows only the morpheme boundaries, marked with "/", but in practice information such as part of speech is also generated.
The search index data 3-22 is an example of the data used by the search index creation unit 4, derived from the document analysis result 3-21 for "Id_10_1_1". Here the document ID and the independent-word morphemes, excluding pronouns (代詞), particles (助詞), and prepositions (介詞), are extracted.

FIG. 23 shows an example of the collected utterance data 6. The collected utterance data 6-21 is an example of questions corresponding to the document "Id_10", the collected utterance data 6-22 is an example of questions corresponding to the document "Id_10_1", and the collected utterance data 6-23 is an example of questions corresponding to the document "Id_10_1_1". The collected utterance data 6-24 is a question from a user who wants to know the specific way to change to a particular map type. Because the product assumed here does not support that map type, no document ID at the same level as "Id_10_1_1" can be selected for it.

FIG. 24 shows an example of the collected utterance analysis result 7 and of the keyword list for the utterance estimation model 9. The collected utterance analysis result 7-21 for "Id_10_1_1" is an example of the result of analyzing the collected utterance data 6-23 for "Id_10_1_1" in FIG. 23. The utterance estimation model data 7-22 is an example of the data used by the utterance estimation model creation unit 8, derived from the collected utterance analysis result 7-21 for "Id_10_1_1". Here the document ID and the independent-word morphemes, excluding pronouns (代詞), particles (助詞), and prepositions (介詞), are extracted.

Next, the operation of the document search apparatus will be described.
The operations (creation process and search process) of the document search apparatus according to the fourth embodiment are basically the same as those shown in FIGS. 6 to 8 for the first embodiment. Therefore, only the differences are described here. First, the creation process is described.

Within the creation process, the method of creating the search index 5 is described first. Here, the weighting is assumed to be the tf·idf weighting disclosed in the prior art.
As shown in FIG. 21, the document 1 is assumed to consist of pairs in which a document ID and text are associated with each other.

In step ST1 of FIG. 6, the input analysis unit 2 sequentially reads the document 1 having this structure and divides it into morpheme sequences by morphological analysis, which is a known technique. The result of morphological analysis of the document 1-22 is the document analysis result 3-21 in FIG. 22. The document analysis result 3-21 shows only the morpheme boundaries, but in practice it is assumed to include part-of-speech information and the like.

Once the document analysis results 3 have been generated for all document IDs, in the subsequent step ST2 the search index creation unit 4 extracts from all of the document analysis results 3 the morphemes (keywords) needed to create the search index 5, creates (document ID, keyword list) pairs, and creates the search index 5 weighted by tf·idf on the basis of all of the pairs. The (document ID, keyword list) pair extracted from the document analysis result 3-21 in FIG. 22 is the search index data 3-22 in the same figure.
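Step ST2 can be sketched as follows. The exact tf·idf formula is not given in the text, so this sketch assumes one common variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the keyword. The function name `build_index` and the English placeholder keywords are hypothetical; in this embodiment the keywords would be Chinese morphemes such as those in the search index data 3-22.

```python
import math
from collections import Counter

def build_index(pairs):
    """Build {doc_id: {keyword: tf-idf weight}} from (doc_id, keyword list) pairs."""
    n_docs = len(pairs)
    document_frequency = Counter()
    for _, keywords in pairs:
        document_frequency.update(set(keywords))  # count each keyword once per document
    index = {}
    for doc_id, keywords in pairs:
        term_frequency = Counter(keywords)
        index[doc_id] = {
            kw: count * math.log(n_docs / document_frequency[kw])
            for kw, count in term_frequency.items()
        }
    return index

# Hypothetical (document ID, keyword list) pairs.
index_pairs = [
    ("Id_10_1_1", ["map", "north", "up", "map"]),
    ("Id_10_1_2", ["map", "heading", "up"]),
    ("Id_10_2", ["map", "scale"]),
]
search_index = build_index(index_pairs)
```

Under this variant a keyword that occurs in every document, such as "map" here, receives a weight of 0 and does not help to distinguish between items, while a keyword confined to one item receives the largest weight.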

Next, the process for creating the utterance estimation model 9 will be described.
The collected utterance data 6 is data in which utterances collected from users in advance are assigned to the document ID of the item that answers them, as shown in the collected utterance data 6-21 to 6-24 in FIG. 23. The method of creating the collected utterance data 6 is the same as in the first embodiment, so its description is omitted.

In step ST3 shown in FIG. 7, the input analysis unit 2 performs morphological analysis on the collected utterance data 6 in the same way as it did for the document 1 in step ST1 described above. For example, the result of morphological analysis of the collected utterance data 6-23 in FIG. 23 is the collected utterance analysis result 7-21 in FIG. 24. In the subsequent step ST4, the utterance estimation model creation unit 8 extracts a document ID and a keyword list as the utterance estimation model data 7-22, in the same way as in step ST2 described above, and trains the utterance estimation model 9 by the ME method as in the first embodiment. Keywords are extracted from all of the collected utterance analysis results 7, and training by the ME method produces the utterance estimation model 9. Specifically, the utterance estimation model data 7-22 in FIG. 24 is extracted from the collected utterance analysis result 7-21 in the same figure, and the training is performed on the basis of this data.

Next, the search process will be described.
FIGS. 25 and 26 show an example of the transitions in the search process for a user input 10-21, which is an example of the user input 10. Here the user input 10 is assumed to be text, and the description assumes that the user input 10-21 in FIG. 25 has been entered. In step ST11 shown in FIG. 8, the input analysis unit 2 first receives the user input 10-21 and performs morphological analysis to generate the user input analysis result 11-21. It then extracts the independent words (content words) from the user input analysis result 11-21 by excluding pronouns, particles, and prepositions, and creates the keyword list 11-22. In the subsequent step ST12, the utterance content estimation unit 14 uses this keyword list 11-22 as input and obtains the document estimation result 15-21 in FIG. 26 from the utterance estimation model 9. As shown in FIG. 26, the document estimation result 15-21 is sorted in order of score.

Once the document estimation result 15-21 has been obtained, in the subsequent step ST13 the document search unit 12 uses the keyword list 11-22 as input and obtains the document search result 13-21 in FIG. 26 from the search index 5. As shown in FIG. 26, the document search result 13-21 is also sorted in order of score.
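The lookup in step ST13 can be sketched as an inner product between the query and each indexed item, in line with the statistical search method described in the background. The sketch assumes that the query is a binary vector (each distinct keyword counts once) and that items with a score of 0 are left out of the result; neither detail is stated in the text. The function name `search_documents`, the index weights, and the placeholder keywords are hypothetical.

```python
def search_documents(index, keywords):
    """Score each item by the inner product of its weights with a binary query vector."""
    query = set(keywords)
    scores = {}
    for doc_id, weights in index.items():
        score = sum(weights.get(kw, 0.0) for kw in query)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical tf-idf weights per document ID.
example_index = {
    "Id_10_1_1": {"north": 1.0, "up": 0.5},
    "Id_10_1_2": {"heading": 1.0, "up": 0.5},
    "Id_10_2": {"scale": 1.0},
}
document_search_result = search_documents(example_index, ["map", "north", "up"])
```

The returned list has the same shape as a document estimation result, a list of document IDs with scores in descending order, so the two can be combined per document ID in the integration step.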

In the subsequent step ST14, the result integration unit 16 determines whether the maximum score in the document estimation result 15-21 is equal to or greater than a threshold value X defined here (for example, X = 0.9). In the document estimation result 15-21 the maximum score is smaller than the threshold value X (step ST14 "NO"), so the result integration unit 16 proceeds to step ST16. In step ST16, for each document ID, it computes a weighted sum of the score in the document search result 13-21 and the score in the document estimation result 15-21 to create the final search result 17-21. In FIG. 26, the final search result 17-21 is the result of adding the two scores in the ratio (score of the document estimation result 15-21) : (score of the document search result 13-21) = 1 : 1.

On the other hand, if the maximum score in the document estimation result 15-21 exceeds the threshold value X in step ST14 (step ST14 "YES"), then in the subsequent step ST15 the result integration unit 16 discards the document search result 13-21 and uses the document estimation result 15-21 as the final search result (not shown).
When the search is finished, the document search apparatus displays the titles and the like of the document IDs on the screen and lets the user select one, thereby presenting the desired position in the document.

As described above, according to the fourth embodiment, the document search apparatus can apply the same processing as in the first embodiment to Chinese documents 1 as well as Japanese ones, and the same effects as in the first embodiment are obtained for Chinese input.
Although the description is omitted, the configuration of the fourth embodiment may also be applied to the second embodiment.

In addition to the above, within the scope of the invention, the embodiments may be freely combined, any component of any embodiment may be modified, and any component of any embodiment may be omitted.

As described above, the document search apparatus according to the present invention uses an utterance estimation model that has learned the correspondence between questions that anticipate how users will phrase their queries and the document items that answer them, and presents in-document search results for the user's natural-language input. It is therefore suitable for use in, for example, information devices that search and display digitized instruction manuals for home appliances, in-vehicle devices, and the like.

1 document, 2 input analysis unit, 3 document analysis result, 4 search index creation unit, 5 search index, 6 collected utterance data, 7 collected utterance analysis result, 8 utterance estimation model creation unit, 9 utterance estimation model, 10 user input, 11 user input analysis result, 12 document search unit, 13 document search result, 14 utterance content estimation unit, 15 document estimation result, 16 result integration unit, 17 final search result, 18 search target limiting unit, 19 document limited list.

Claims

A document search apparatus comprising:
a search index created from a document prepared in advance;
a document search unit that receives input from a user and uses the search index to search the document for items related to the user input;
an utterance estimation model that has learned the correspondence between assumed questions asking about the content of the document and the items in the document that answer those assumed questions;
an utterance content estimation unit that estimates, based on the utterance estimation model, the item in the document that corresponds to the answer to the user input; and
a result integration unit that integrates the document search result obtained from the document search unit and the document estimation result obtained from the utterance content estimation unit to generate a final search result.

The document search apparatus according to claim 1, wherein
the utterance content estimation unit assigns, to each estimated item in the document, a score according to its degree of association with the user input, and
the result integration unit generates the final search result by ignoring the document search result obtained from the document search unit when the score of the document estimation result obtained from the utterance content estimation unit is larger than a predetermined value.

The document search apparatus according to claim 1, wherein
the document search unit assigns, to each retrieved item in the document, a score according to its degree of association with the user input,
the utterance content estimation unit assigns, to each estimated item in the document, a score according to its degree of association with the user input, and
the result integration unit integrates the results by adding the score of the document search result obtained from the document search unit and the score of the document estimation result obtained from the utterance content estimation unit at a fixed ratio.

The document search apparatus according to claim 1, further comprising
a search target limiting unit that extracts, from the document estimation results obtained from the utterance content estimation unit, the items that satisfy a predetermined condition, wherein
the utterance content estimation unit performs the estimation based on an utterance estimation model that has learned the correspondence between the assumed questions and items of a unit larger than the minimum search unit of the search index, and
the result integration unit integrates the items that the search target limiting unit extracted from the document estimation results obtained from the utterance content estimation unit with the document search results obtained from the document search unit.

The document search apparatus according to claim 1, further comprising:
an input analysis unit that analyzes a document prepared in advance and collected utterance data that defines the correspondence between assumed questions asking about the content of the document and the items in the document that answer those questions;
a search index creation unit that creates the search index from the analysis result of the document output from the input analysis unit; and
an utterance estimation model creation unit that uses the analysis result of the collected utterance data output from the input analysis unit to learn the correspondence between the assumed questions and the items in the document and creates the utterance estimation model.

A document search method using a document search apparatus, the method comprising:
a user input step in which an input analysis unit receives input from a user;
a document search step in which a document search unit uses a search index created from a document prepared in advance to search the document for items related to the user input;
an utterance content estimation step in which an utterance content estimation unit estimates the item in the document that corresponds to the answer to the user input, based on an utterance estimation model that has learned the correspondence between assumed questions asking about the content of the document and the items in the document that answer those assumed questions; and
a result integration step in which a result integration unit generates a final search result by integrating the document search result obtained in the document search step and the document estimation result obtained in the utterance content estimation step.