JP6387328B2

JP6387328B2 - Procedure expression extraction method, procedure expression extraction device, and procedure expression extraction program

Info

Publication number: JP6387328B2
Application number: JP2015123787A
Authority: JP
Inventors: 西川　仁; 牧野　俊朗; 松尾　義博
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2018-09-05
Anticipated expiration: 2035-06-19
Also published as: JP2017010201A

Description

The present invention relates to a procedure expression extraction method, a procedure expression extraction device, and a procedure expression extraction program, and in particular to a method, device, and program for extracting procedure expressions from text.

A system in which a person asks a machine a question and the machine returns an appropriate response is called a question answering system. Because such a system lets users obtain the information they want quickly, its importance grows daily in an age flooded with information.

In many conventional question answering systems, the range of questions the system could answer was limited to factoids. A factoid is a named entity, such as a person's name or a place, expressed as a word or noun phrase. For example, such a system answers "Ito Hirobumi" to the question "Who was the first prime minister of Japan?" and "Tokyo" to the question "Which Japanese prefecture has the largest population?"

Recently, however, systems have also been developed that accept and answer questions that can only be answered with a sentence, such as questions asking for a reason. For example, given the question "Why is the sky blue?", such a system answers "It is caused by the wavelength of light."

For question answering systems to handle a wider variety of questions and become more practical, they must also be able to answer questions that ask for a procedure. For example, given the question "Please tell me how to make instant coffee", the system must answer with multiple sentences in order to explain a series of procedures. An answer to that question would need to read something like: "First, open the lid of the instant coffee jar and put the instant coffee powder into a cup or other container. Next, pour hot water into the container and dissolve the powder in the hot water."

Non-Patent Document 1 is a prior art document on generating answers to such questions. In the technique it discloses, a document search is first performed using the given question; the answer to the procedure question is then generated using a manually prepared set of keywords that are likely to appear in passages describing a series of procedures.

Satoshi Nakakura and Junichi Fukumoto, "Question Answering System beyond Factoid Type Questions", In Proceedings of the 23rd International Conference on Circuits/Systems, Computers and Communications (ITC-CSCC 2008), pp. 617-620, 2008.

The method disclosed in Non-Patent Document 1 has two problems. First, each keyword must be set manually, so there is no way to verify whether the keyword set provides adequate coverage. Second, when generating an answer, the method merely extracts multiple sentences that appear to contain the specified keywords. Consequently, even when a procedure is explained step by step across multiple sentences, the method cannot capture the structure of that ordering.

The present invention has been made in view of these circumstances, and its object is to provide a procedure expression extraction method, a procedure expression extraction device, and a procedure expression extraction program that extract procedure expressions from text with high accuracy.

To achieve the above object, the procedure expression extraction method of the present invention is a method performed in a procedure expression extraction device comprising a parameter estimation unit, a text input unit, and a procedure expression extraction unit. The method includes: a step in which the parameter estimation unit estimates, as parameters, a weight for each of predetermined feature amounts that are extracted from text and represent features relating to a tag, based on a plurality of training case data items, each of which is text to which a tag indicating a procedure expression has been given in advance; a step in which the text input unit accepts input of text from which procedure expressions are to be extracted; and a step in which the procedure expression extraction unit extracts the procedure expressions from the text, based on the feature amounts extracted from the text accepted by the text input unit and the weight for each of the feature amounts.

The text of the training case data may be given in advance a tag indicating a procedure and a tag indicating a procedure title. In that case, the step in which the parameter estimation unit estimates the weights may include a process of estimating, as the parameters and based on the plurality of training case data items, a weight for each feature amount that is extracted from a sentence and represents a feature relating to each of the tags. The step in which the procedure expression extraction unit extracts the procedure expressions may then include a process of extracting the title and the procedure from the text accepted by the text input unit, based on the feature amounts extracted from that text and the weight for each feature amount.

The feature amounts may include at least one of the following: whether a sentence given the tag contains a predetermined word; whether a sentence given the tag contains a predetermined part of speech; whether a sentence given the tag begins with a numeral; and whether the similarity between sentences given the tag is at or above a predetermined threshold.

The procedure expression extraction device of the present invention comprises: a parameter estimation unit that estimates, as parameters, a weight for each of predetermined feature amounts that are extracted from text and represent features relating to a tag, based on a plurality of training case data items, each of which is text to which a tag indicating a procedure expression has been given in advance; a text input unit that accepts input of text from which procedure expressions are to be extracted; and a procedure expression extraction unit that extracts the procedure expressions from the text, based on the feature amounts extracted from the text accepted by the text input unit and the weight for each of the feature amounts.

The procedure expression extraction program of the present invention is a program for causing a computer to execute each step of the above procedure expression extraction method.

According to the present invention, responses that take the order of procedures into account become possible without manually describing individual procedure expressions.

FIG. 1 is a block diagram showing the configuration of the procedure expression extraction device according to the embodiment.
FIG. 2 is a flowchart showing the flow of the procedure expression extraction process executed by the procedure expression extraction device according to the embodiment.
FIG. 3 is a schematic diagram showing an example of training case data according to the embodiment.
FIG. 4 is a schematic diagram showing an example of parameter data according to the embodiment.
FIG. 5 is a schematic diagram showing an example of text from which procedure expressions are to be extracted according to the embodiment.
FIG. 6 is a schematic diagram showing an example of that text after annotation.
FIG. 7 is a schematic diagram showing an example of procedure expression data according to the embodiment.

The procedure expression extraction device according to this embodiment is described in detail below with reference to the drawings.

As shown in FIG. 1, the procedure expression extraction device 10 according to this embodiment has a parameter estimation unit 20 that estimates, as parameters, a weight for each of predetermined feature amounts representing features relating to each tag, from a plurality of training case data items, each of which is text to which a tag indicating a procedure expression and a tag indicating a title have been given in advance. The device 10 also has a text input unit 22 that accepts input of text from which procedure expressions are to be extracted. It further has a procedure expression extraction unit 24 that extracts procedure expressions from the text accepted by the text input unit 22, based on the feature amounts extracted from that text and the weight for each feature amount.

In addition, the procedure expression extraction device 10 has a storage unit 18 that stores the various data needed for the procedure expression extraction process, which extracts procedure expressions from the target text. The storage unit 18 has a training case data storage unit 18a that stores the training case data described later, a parameter data storage unit 18b that stores the parameter data described later, and a procedure expression storage unit 18c that stores the procedure expressions described later.

The procedure expression extraction device 10 according to this embodiment is configured as a computer including, for example, a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores various programs, including the program for the procedure expression extraction process described later. The computer constituting the device 10 may also include a storage unit such as a hard disk drive or a nonvolatile memory. This embodiment describes the case where a nonvolatile memory is provided in place of the ROM. The program executed by the CPU may be stored in a storage unit such as the hard disk drive or the nonvolatile memory. When the CPU reads and executes the program stored in a storage unit such as the ROM or the hard disk, the hardware resources and the program cooperate to realize the functions described above.

Next, with reference to FIG. 2, the operation of the procedure expression extraction process executed by the procedure expression extraction device 10 is described, taking as an example the case where a user has entered an execution instruction.

In step S101, the parameter estimation unit 20 reads the plurality of training case data items described above from the training case data storage unit 18a. Training case data represents the training cases used when learning procedure expressions. As shown by way of example in FIG. 3, each item associates a training case number, which identifies the training case, with a training case containing procedure expressions annotated with tags. A training case contains multiple sentences that are procedure expressions; it may be part or all of the sentences describing a series of procedures, or an entire document such as one published on a single web page.

In this embodiment, a training case contains sentences unrelated to the series of procedures, a sentence representing the title of the series of procedures, and sentences representing the series of procedures. A sentence representing the series of procedures may describe either part of the series or all of it.

In training case number 1 of the training case data shown in FIG. 3, the untagged sentences "There are various ways to make cup ramen." and "In this way, cup ramen can be made easily." correspond to sentences unrelated to the series of procedures. The sentence "An example of a procedure for making cup ramen is shown below.", which carries the <title> tag as a title tag, corresponds to the sentence representing the title. The six sentences carrying the <procedure> tag as a procedure tag, such as "1. Open the lid of the cup ramen.", correspond to the sentences representing the series of procedures.

Thus, in this embodiment a training case contains sentences unrelated to the series of procedures, a sentence representing the title of the series of procedures, and sentences representing the series of procedures, but this is not a limitation. A training case may contain only sentences representing the series of procedures, or only a sentence representing the title together with sentences representing the series of procedures.

Also, in this embodiment, sentences representing the series of procedures are enclosed in <procedure> and </procedure> tags, and the sentence representing the title of the series of procedures is enclosed in <title> and </title> tags, but this is not a limitation. It suffices to attach to each procedure sentence a tag from which it can be identified as a procedure sentence, and to each title sentence a tag from which it can be identified as a title sentence.
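As an illustration of the tagging scheme just described, the following minimal sketch converts a tagged training case into (sentence, tag) pairs. It assumes one sentence per line and uses the hypothetical label "none" for untagged sentences; the function name and the English sample text are illustrative and do not come from the patent.

```python
import re

# Matches a line fully wrapped in <title>...</title> or <procedure>...</procedure>.
TAGGED_LINE = re.compile(r"^<(title|procedure)>(.*)</\1>$")


def parse_training_case(text):
    """Return a list of (sentence, tag) pairs, one per non-empty line."""
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAGGED_LINE.match(line)
        if match:
            pairs.append((match.group(2).strip(), match.group(1)))
        else:
            # "none" is an assumed label for sentences without a tag.
            pairs.append((line, "none"))
    return pairs


sample = """There are various ways to make cup ramen.
<title>An example of a procedure for making cup ramen is shown below.</title>
<procedure>1. Open the lid of the cup ramen.</procedure>
In this way, cup ramen can be made easily."""

pairs = parse_training_case(sample)
```

Each pair supplies one sentence x_i and its tag y_i for the estimation step that follows.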

When part of a series of procedures itself contains sentences describing finer-grained procedures, hierarchical tags may be attached to the procedure sentences. In that case, for example, the sentences representing the series of procedures are enclosed in <main-procedure> and </main-procedure> tags, and the sentences representing the finer-grained procedures are enclosed in <sub-procedure> and </sub-procedure> tags. This makes it possible to extract hierarchically structured procedure expressions.

In step S103, the parameter estimation unit 20 estimates, as parameters, a weight for each feature amount relating to each tag from the training case data items it has read. In this embodiment, the parameters are estimated with a conditional random field, a known technique, using equations (1) and (2) below.
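The equations themselves are not reproduced in this text. For orientation, a conditional random field is conventionally written in the following form, which is consistent with the symbols w, f(x, y), and P(y | x) used in the description. This is the standard textbook formulation and may differ in detail from the patent's equations (1) and (2); the normalizer Z(x) is notation introduced here.

```latex
P(y \mid x) = \frac{\exp\bigl(w \cdot f(x, y)\bigr)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \exp\bigl(w \cdot f(x, y')\bigr)
```

Here the sum in Z(x) runs over all candidate tag sequences y' for the sentence sequence x, so that the probabilities sum to one.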

In equations (1) and (2) above, x = {x_1, x_2, ..., x_m} is the sequence of sentences contained in a training case. In the example shown in FIG. 3, x_1 = "There are various ways to make cup ramen.", x_2 = "An example of a procedure for making cup ramen is shown below.", and x_3 = "1. Open the lid of the cup ramen."

In equations (1) and (2) above, y = {y_1, y_2, ..., y_n} is the sequence of tags of the sentences contained in a training case. In the example shown in FIG. 3, the tag of each sentence is one of "no tag", "title", and "procedure".

In equations (1) and (2) above, the function f(x, y) returns a feature vector that depends on the arguments x and y. In this embodiment, the feature vector f(x, y) is a sequence of elements corresponding to a plurality of predetermined feature amounts that represent features of the sentences relating to each tag. These feature amounts are defined as combinations of the argument x and the argument y. In the example shown in FIG. 3, writing the sequence of feature amounts as z = {z_1, z_2, ..., z_k}, we have z_1 = "procedure - the sentence begins with a numeral", z_2 = "procedure - contains 'after that'", and z_3 = "title - contains 'procedure'". The feature vector f(x, y) is a vector in which, for a given combination of values of x and y, the elements corresponding to that combination are 1 and all other elements are 0.

The feature amounts include, for example, at least one of the following: whether a sentence given the <procedure> tag contains a predetermined word (for example, "after that" or "procedure"); whether a sentence given the <procedure> tag contains a specific part of speech (for example, an adjective or an adverb); whether a sentence given the <procedure> tag begins with a numeral; and whether the similarity between sentences given the <procedure> tag is at or above a predetermined threshold.
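The sentence-level tests listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the English cue words, the Jaccard word-overlap similarity, and the 0.2 threshold are all hypothetical choices, since the description does not specify how similarity is measured. The part-of-speech test is omitted because it would require a morphological analyzer.

```python
def starts_with_digit(sentence):
    """True if the first character is a digit (str.isdigit also accepts full-width digits)."""
    return sentence[:1].isdigit()


def contains_cue_word(sentence, cue_words=("after that", "procedure")):
    """True if any of the predetermined words occurs in the sentence."""
    lowered = sentence.lower()
    return any(word in lowered for word in cue_words)


def jaccard_similarity(a, b):
    """Word-overlap similarity between two sentences, in the range [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def sentence_features(sentence, neighbor=None, threshold=0.2):
    """Collect the binary observations for one sentence."""
    feats = {
        "starts_with_digit": starts_with_digit(sentence),
        "contains_cue_word": contains_cue_word(sentence),
    }
    if neighbor is not None:
        # Assumed reading of the similarity feature: compare with an adjacent sentence.
        feats["similar_to_neighbor"] = jaccard_similarity(sentence, neighbor) >= threshold
    return feats
```

Pairing each observation with a candidate tag, for example ("procedure", "starts_with_digit"), gives one element of the feature vector f(x, y).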

In training case number 1 shown in FIG. 3, based on the entry "<procedure>1. Open the lid of the cup ramen.</procedure>", the element corresponding to the feature amount "procedure - the sentence begins with a numeral" is set to 1. Likewise, based on the entry "<title>An example of a procedure for making cup ramen is shown below.</title>", the element corresponding to the feature amount "title - contains 'procedure'" is set to 1.

In equations (1) and (2) above, w is a parameter vector representing the weight for each feature amount; w·f(x, y) is the inner product of w and f(x, y); and the probability P(y | x) is the probability of the tag sequence y given the sentence sequence x.

In this embodiment, when estimating the parameters from the training case data it has read, the parameter estimation unit 20 supplies the feature vector f(x, y) of each training case to equation (1) above and finds the parameter vector w that maximizes the probability P(y | x).
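Written out, one common way to state this estimation is as maximization of the conditional log-likelihood over the N training cases (x^(i), y^(i)). The index i, the count N, and the use of the logarithm are assumptions made here for illustration; the description states only that w is chosen to maximize P(y | x) for the training cases.

```latex
\hat{w} = \arg\max_{w} \sum_{i=1}^{N} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr)
```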

A known technique can be used to find the parameter vector w that maximizes the probability P(y | x); for example, the technique disclosed in Reference 1 below can be applied.

[Reference 1] John Lafferty, Andrew McCallum and Fernando C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pages 282-289, 2001.

The parameter estimation unit 20 stores the resulting parameter vector w in the parameter data storage unit 18b as parameter data. In this embodiment, as shown by way of example in FIG. 4, the stored parameter data associates each feature amount with its weight.

In the example shown in FIG. 4, the weight for the feature amount "procedure - the sentence begins with a numeral" indicates how likely a sentence is to receive the <procedure> tag when it begins with a numeral.
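To make the role of the parameter data concrete, the sketch below stores weights keyed by (tag, feature name) and computes the inner product w·f for a single sentence by summing the weights of the features that are active for a candidate tag. The weight values and the names `PARAMETERS` and `score` are invented for illustration and are not the values of FIG. 4.

```python
# Hypothetical parameter data: (tag, feature name) -> weight. The values are invented.
PARAMETERS = {
    ("procedure", "starts_with_digit"): 1.8,
    ("procedure", "contains 'after that'"): 1.2,
    ("title", "contains 'procedure'"): 0.9,
}


def score(tag, active_features, parameters=PARAMETERS):
    """Inner product w . f for one sentence: sum the weights of the active (tag, feature) pairs."""
    return sum(parameters.get((tag, name), 0.0) for name in active_features)


# A sentence that begins with a numeral and contains "after that".
active = ["starts_with_digit", "contains 'after that'"]
procedure_score = score("procedure", active)
title_score = score("title", active)
```

With these invented weights the sentence scores higher under the procedure tag than under the title tag, which is the behavior the weight in FIG. 4 is described as capturing.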

In step S105, the text input unit 22 accepts input of the text from which procedure expressions are to be extracted and stores it in the storage unit 18. As shown by way of example in FIG. 5, the text contains multiple sentences, such as "To withdraw money from an ATM, proceed as follows."

In step S107, the procedure expression extraction unit 24 extracts procedure expressions from the text accepted by the text input unit 22, based on the parameter data estimated by the parameter estimation unit 20. In this embodiment, the procedure expressions are extracted according to equation (3) below: the Viterbi algorithm is used to find y, the sequence of tags for the sentences, that maximizes the inner product w·f(x, y).
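The decoding step, choosing the tag sequence y that maximizes w·f(x, y), can be sketched with a small Viterbi implementation. This is a sketch under stated assumptions rather than the patent's implementation: the tag set, the feature names, the English cue phrases, the weights, and the use of tag-to-tag transition features (typical of linear-chain conditional random fields) are all hypothetical.

```python
TAGS = ("none", "title", "procedure")

# Hypothetical weights: (tag, feature) for observations, ("trans", prev, tag) for transitions.
WEIGHTS = {
    ("none", "bias"): 1.0,
    ("procedure", "starts_with_digit"): 2.0,
    ("procedure", "has_order_word"): 2.0,
    ("title", "has_as_follows"): 2.0,
    ("trans", "title", "procedure"): 1.0,
    ("trans", "procedure", "procedure"): 0.5,
}


def active_features(sentence):
    """Names of the binary features that fire for one sentence (illustrative cues only)."""
    lowered = sentence.lower()
    names = ["bias"]
    if sentence[:1].isdigit():
        names.append("starts_with_digit")
    if "first" in lowered or "then" in lowered:
        names.append("has_order_word")
    if "as follows" in lowered:
        names.append("has_as_follows")
    return names


def viterbi(sentences, weights=WEIGHTS, tags=TAGS):
    """Return the tag sequence with the highest total weight over the sentence sequence."""
    if not sentences:
        return []

    def emit(tag, sentence):
        return sum(weights.get((tag, name), 0.0) for name in active_features(sentence))

    def trans(prev, tag):
        return weights.get(("trans", prev, tag), 0.0)

    best = [{t: emit(t, sentences[0]) for t in tags}]  # best score ending in each tag
    back = []                                          # back-pointers per position
    for sentence in sentences[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + trans(p, t))
            scores[t] = best[-1][prev] + trans(prev, t) + emit(t, sentence)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


text = [
    "To withdraw money from an ATM, proceed as follows.",
    "First, insert your cash card into the ATM slot.",
    "Then enter your PIN.",
    "ATMs are convenient.",
]
tags_for_text = viterbi(text)
```

Because the search keeps only the best-scoring path into each tag at each sentence, its cost grows linearly with the number of sentences instead of enumerating every possible tag sequence.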

For example, if the text contains four sentences and the extracted y is {title, procedure, procedure, procedure}, then, as shown by way of example in FIG. 6, the <title>, <procedure>, <procedure>, and <procedure> tags can be attached in order to the sentences of the text.

Then, by applying the extracted y to the text, only the sentences carrying a <title> tag or a <procedure> tag are extracted as procedure expressions. Untagged sentences are therefore not extracted.

In step S109, the procedure expression extraction unit 24 stores the extracted procedure expressions in the procedure expression storage unit 18c and ends execution of the program for the procedure expression extraction process. In this embodiment, as shown by way of example in FIG. 7, a procedure expression is split into the sentence carrying the <title> tag (the title) and the sentences carrying the <procedure> tag (the procedures), and these are stored in association with each other. If the text contains no sentence carrying the <title> tag, only the sentences carrying the <procedure> tag are stored.
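The separation into a title and its procedures might be sketched as follows. The record layout (a dictionary with "title" and "procedures" keys) and the function name are assumptions for illustration; the sketch also assumes at most one title per text and keeps only the first if several are present.

```python
def build_procedure_expression(sentences, tags):
    """Split tagged sentences into an optional title and an ordered list of procedure steps."""
    titles = [s for s, t in zip(sentences, tags) if t == "title"]
    steps = [s for s, t in zip(sentences, tags) if t == "procedure"]
    record = {"procedures": steps}  # untagged sentences are dropped
    if titles:
        record["title"] = titles[0]  # assumption: one title per text
    return record


sentences = [
    "To withdraw money from an ATM, proceed as follows.",
    "First, insert your cash card into the ATM slot.",
    "ATMs are convenient.",
]
record = build_procedure_expression(sentences, ["title", "procedure", "none"])
untitled = build_procedure_expression(sentences[1:], ["procedure", "none"])
```

Keeping the steps in a list preserves the order in which they appear in the text, which is what lets a later answer present the procedure step by step.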

In the example shown in FIG. 7, the procedure expression stored in the procedure expression storage unit 18c associates the title "To withdraw money from an ATM, proceed as follows.", extracted from the text shown in FIG. 5, with multiple procedures such as "First, insert your cash card into the ATM slot."

In this way, the procedure expression extraction device 10 according to this embodiment estimates, as parameters, the weights for predetermined feature amounts representing features of procedure expressions from training case data containing multiple procedure expressions, and extracts procedure expressions from the target text based on those weights. This allows procedure expressions to be extracted from text with high accuracy.

The parameter estimation unit 20, the text input unit 22, and the procedure expression extraction unit 24 of the procedure expression extraction device 10 according to this embodiment may each be realized by dedicated hardware, or by a memory and a microprocessor. Each of these units may also consist of a memory and a CPU (central processing unit), with its function realized by loading a program for that function into the memory and executing it.

The procedure expression extraction process may also be performed by recording a program that realizes the functions of the processing units of the procedure expression extraction device 10 on a computer-readable recording medium, and having a computer system read and execute the program recorded on that medium. The term "computer system" here includes an OS and hardware such as peripheral devices. It also includes a WWW system equipped with a homepage providing environment (or display environment). The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. It further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system acting as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in the transmission medium. Here, the “transmission medium” that transmits the program is a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication circuit) like a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes those functions in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS

10 Procedure expression extraction device
18 Storage unit
18a Training case data storage unit
18b Parameter data storage unit
18c Procedure expression storage unit
20 Parameter estimation unit
22 Text input unit
24 Procedure expression extraction unit

Claims

A procedure expression extraction method in a procedure expression extraction device comprising a parameter estimation unit, a text input unit, and a procedure expression extraction unit, the method comprising:
a step in which the parameter estimation unit estimates, as parameters, a weight for each of predetermined feature amounts that are extracted from text and represent features related to a tag, based on a plurality of training case data that are texts to which the tag indicating a procedure expression has been given in advance;
a step in which the text input unit accepts input of a text from which a procedure expression is to be extracted; and
a step in which the procedure expression extraction unit extracts the procedure expression from the text received by the text input unit, based on the feature amounts extracted from that text and the weight for each of the feature amounts,
wherein a tag indicating a procedure and a tag indicating a title of the procedure are given in advance to the text of the training case data,
the step in which the parameter estimation unit estimates the weights as parameters includes estimating, as the parameters, a weight for each of the feature amounts that are extracted from a sentence and represent features related to each of the tags, based on the plurality of training case data, and
the step in which the procedure expression extraction unit extracts the procedure expression includes extracting the title and the procedure from the text received by the text input unit, based on the feature amounts extracted from that text and the weight for each of the feature amounts.

The procedure expression extraction method according to claim 1, wherein the feature amounts include at least one of: whether a sentence to which the tag is given contains a predetermined word; whether the sentence to which the tag is given contains a predetermined part of speech; whether the sentence to which the tag is given begins with a number; and whether the similarity between sentences to which the tag is given is equal to or greater than a predetermined threshold.

A procedure expression extraction device comprising:
a parameter estimation unit that estimates, as parameters, a weight for each of predetermined feature amounts that are extracted from text and represent features related to a tag, based on a plurality of training case data that are texts to which the tag indicating a procedure expression has been given in advance;
a text input unit that accepts input of a text from which a procedure expression is to be extracted; and
a procedure expression extraction unit that extracts the procedure expression from the text received by the text input unit, based on the feature amounts extracted from that text and the weight for each of the feature amounts,
wherein a tag indicating a procedure and a tag indicating a title of the procedure are given in advance to the text of the training case data,
the parameter estimation unit estimates, as the parameters, a weight for each of the feature amounts that are extracted from a sentence and represent features related to each of the tags, based on the plurality of training case data, and
the procedure expression extraction unit extracts the title and the procedure from the text received by the text input unit, based on the feature amounts extracted from that text and the weight for each of the feature amounts.

A procedure expression extraction program for causing a computer to execute each step of the procedure expression extraction method according to claim 1 or 2.
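For illustration only, the extraction step recited in the claims might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the cue words, the Jaccard similarity measure, the threshold of 0.2, the tag names, and every weight value are hypothetical. In the claimed method the weights are estimated from training case data by the parameter estimation unit, whereas here they are set by hand. The part-of-speech feature is omitted because it would require a tagger, and each sentence is scored independently, which is also an assumption.

```python
import re

# Hypothetical cue words; the claims only say "a predetermined word".
CUE_WORDS = {"click", "select", "open", "press", "enter"}
TAGS = ("TITLE", "PROCEDURE", "OTHER")

# Hand-set weights for illustration. In the claimed method these would be
# estimated from tagged training case data, not written by hand.
WEIGHTS = {
    "TITLE": {"contains_how_to": 2.0, "starts_with_number": -1.0},
    "PROCEDURE": {"starts_with_number": 1.5, "has_cue_word": 1.0,
                  "similar_to_previous": 0.5},
    "OTHER": {"bias": 0.5},
}


def tokens(sentence):
    """Return the set of lowercased alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_features(sentences, i, threshold=0.2):
    """Binary feature amounts for sentence i, of the kinds named in claim 2."""
    sent = sentences[i]
    toks = tokens(sent)
    prev = tokens(sentences[i - 1]) if i > 0 else set()
    return {
        "bias": True,
        "has_cue_word": bool(toks & CUE_WORDS),
        "contains_how_to": "how to" in sent.lower(),
        "starts_with_number": bool(re.match(r"\s*\d", sent)),
        "similar_to_previous": jaccard(toks, prev) >= threshold,
    }


def score(tag, feats):
    """Sum the weights of the features that fire for the given tag."""
    return sum(w for name, w in WEIGHTS[tag].items() if feats.get(name))


def label_sentences(sentences):
    """Assign each sentence the tag with the highest weighted score."""
    labels = []
    for i in range(len(sentences)):
        feats = extract_features(sentences, i)
        labels.append(max(TAGS, key=lambda tag: score(tag, feats)))
    return labels


def extract_procedure(sentences):
    """Return (title, steps): the first TITLE sentence and all PROCEDURE ones."""
    labels = label_sentences(sentences)
    titles = [s for s, t in zip(sentences, labels) if t == "TITLE"]
    steps = [s for s, t in zip(sentences, labels) if t == "PROCEDURE"]
    return (titles[0] if titles else None), steps


text = [
    "How to reset your password",
    "1. Open the settings page.",
    "2. Click the reset button.",
    "Thank you for using our service.",
]
title, steps = extract_procedure(text)
print(title)
print(steps)
```

With these hand-set weights, the first sentence is labeled as the title and the two numbered sentences as the procedure. Replacing the `WEIGHTS` table with values learned from tagged sentences would correspond to the role of the parameter estimation unit.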