JP5436484B2

JP5436484B2 - Factor trivial word acquisition device, method and program

Info

Publication number: JP5436484B2
Application number: JP2011070941A
Authority: JP
Inventors: 翔川中; 章裕宮田; 高秀星出; 考藤村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-03-28
Filing date: 2011-03-28
Publication date: 2014-03-05
Anticipated expiration: 2031-03-28
Also published as: JP2012203870A

Description

The present invention relates to a factor trivial word acquisition apparatus, method, and program, and in particular to an apparatus, method, and program that use language patterns containing experience words, where the experience words are acquired from each word's appearance-position tendency, its part of speech, and the frequency distribution of the particle that immediately precedes it.

Acquiring pairs that consist of an expression describing some result and an expression describing the factor that caused it (causal relation acquisition) is important in many fields, such as economic forecasting, marketing, and behavioral navigation. Causal relations include those between events in the world and those between an individual's actions and the factors behind them. Conventional techniques exist that acquire such relations automatically from a large number of articles (see, for example, Non-Patent Document 1).

In methods such as that of Non-Patent Document 1, newspaper articles or similar texts serve as the information source. Within the text of each article, expressions that match specific language patterns keyed on connective markers indicating a factor, such as "ため" (tame) and "ので" (node), are identified as the places where a factor or a result is written. Each factor expression and result expression is then normalized by word-form processing, factor-result pairs are extracted, and by collecting a large number of such pairs the strength of the relation between factor and result is acquired as knowledge.

Non-Patent Document 1: Takashi Inui, Kentaro Inui, Yuji Matsumoto, "Automatic acquisition of causal knowledge from document collections based on the connective marker 'tame'" (接続標識「ため」に基づく文書集合からの因果関係知識の自動獲得), IPSJ Journal (情報処理学会論文誌), Vol. 45, No. 3, pp. 919-933, 2004.

The present invention addresses the following two problems.

Problem 1) Appropriate factor estimation for words that are factors but are not accompanied by a language pattern indicating a factor;
Problem 2) Exhaustive acquisition of a factor trivial word list for an experience article corpus.
Each is described in detail below.

1) Appropriate factor estimation for words that are factors but are not accompanied by a language pattern indicating a factor;
Consider the task of estimating where factors appear in an experience article. A method that relies on language patterns appearing near each expression, as described above, cannot estimate that a word is a factor when the word is a factor but is not accompanied by a language pattern indicating one. For example, in experience articles that describe visits to restaurants, the word "二次会" (nijikai, an after-party) is the factor in Example Sentences 1 and 2, but because it is not accompanied by a language pattern indicating a factor, it cannot be estimated to be a factor.

2) Exhaustive acquisition of a factor trivial word list for an experience article corpus;
To solve the above problem, a factor trivial word list must be acquired exhaustively from the experience article corpus. One conceivable way to acquire factor trivial words automatically is to use connective markers indicating a factor to collect a large number of (result expression, factor expression) pairs from the corpus and to treat the factor expressions as factor trivial words. At present, however, it is not possible to select only the factor expressions that relate to the experience described in the experience articles, because among the many result expressions there is no way to tell whether a given one is the experience the article is about or some other result. For example, given an experience article corpus about restaurant visits, a large number of result-factor pairs can be collected by looking at the text before and after specific connective markers, but the factor expressions for the restaurant visit itself cannot be collected separately, because no list of expressions denoting a restaurant visit has existed. (Without this selection step, expressions such as those in Example Sentences 3 and 4 would also be acquired as factor trivial words.)

The present invention has been made in view of the above points. Its object is to provide a factor trivial word acquisition apparatus, method, and program that enable appropriate factor estimation for words that are factors but are not accompanied by a language pattern indicating a factor, and that can exhaustively acquire a factor trivial word list for an experience article corpus.

To solve the above problems, the present invention is a factor trivial word acquisition apparatus that extracts factor trivial words, which are words that represent factors, from among the words that appear at least once in a corpus, the apparatus comprising:
storage means that stores experience articles, in which people have written about their own experiences, and experience words;
input means that accepts input of experience words (hereinafter, "experience word seeds");
experience word acquisition means that refers to the storage means on the basis of the input experience word seeds and, given a corpus whose elements are experience articles, extracts experience words by using, for each word that appears at least once in the corpus, its appearance-position tendency within each experience article, its part of speech, and the frequency distribution of the particle immediately preceding it; and
factor word acquisition means that refers to the storage means and, by using a language pattern that contains an experience word acquired by the experience word acquisition means and a factor particle, extracts from the corpus the words that occur at least a predetermined number of times as factor trivial words.

Further, in the present invention, the experience word acquisition means of claim 1 includes means that, when determining whether each word that appears at least once is an experience word, uses the following similarity as a feature for the determination: the similarity between the frequency distribution of immediately preceding particles of one or more appropriate experience words input in advance by the user, and the frequency distribution of immediately preceding particles of each word that appears at least once in the corpus and whose part of speech is "名詞-サ変接続" (noun, suru-verb connection) or "動詞-自立" (verb, independent).

Further, in the present invention, the experience word acquisition means of claim 1 includes means that, when determining whether each word that appears at least once is an experience word, uses as a feature for the determination the distribution of the frequency with which one or more particles input in advance by the user appear immediately before each word.

Further, in the present invention, the factor word acquisition means of claim 1 includes means that extracts, as factor trivial words, those words in the corpus that match, at least a fixed number of times, a language pattern containing a pre-specified connective marker indicating a factor and an experience word.

In the present invention, experience words that are automatically acquired beforehand are used to acquire factor trivial words for the experience described in experience articles. This solves the problem of "exhaustive acquisition of a factor trivial word list for an experience article corpus", which must be solved in order to address the factor estimation problem described above. Conventionally, in that factor estimation problem, a word that is a factor in an experience article but is not accompanied by a language pattern indicating a factor could not be properly estimated to be a factor. With the present invention, if the word is contained in the factor trivial word list (built using the set of experience words extracted from each word's part of speech, immediately preceding particle, and frequency distribution), it can be properly estimated to be a factor, which improves the accuracy of factor estimation.

Specifically, in Example Sentences 1 and 2 above, if the word "二次会" (after-party) is contained in the factor trivial word list, the word "二次会" in those sentences can be estimated to be a factor.

FIG. 1 is a block diagram of the factor trivial word acquisition apparatus in the first embodiment of the present invention.
FIG. 2 is an example of the experience article word table in the first embodiment of the present invention.
FIG. 3 is an example of an experience article in the first embodiment of the present invention.
FIG. 4 is an example of the experience word table in the first embodiment of the present invention.
FIG. 5 is an example of the factor trivial word table in the first embodiment of the present invention.
FIG. 6 is a flowchart of the processing of the factor trivial word acquisition apparatus in the first embodiment of the present invention.
FIG. 7 is a flowchart of the action expression acquisition process in the first embodiment of the present invention.
FIG. 8 is an example program for the factor trivial word extraction process in the first embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

First, the terms used in this specification are explained.

・Experience article: an article in which a person describes some experience of their own.
・Experience word: among the words that appear at least once in the experience article corpus, a word that authors use to express that they carried out the experience.
(Example: in an experience article corpus describing restaurant visits, words such as "行く" (go), "訪問" (visit), and "来店" (come to the store) are normally used to express the experience of visiting a restaurant, so they are experience words.)
・Factor: an event, state, or condition that positively influenced a person's decision to carry out the experience.
・Factor word: each expression in an article that represents a factor.
・Factor trivial word: among the words that appear at least once in the corpus, a word that represents a factor with high probability regardless of the context in which it appears.
(Example: in experience articles about restaurant visits, the word "二次会" (after-party) has a high probability of representing the factor behind the author's visit to the restaurant, so it is a factor trivial word in that corpus.)

FIG. 1 shows a block diagram of the factor trivial word acquisition apparatus in the first embodiment of the present invention. The apparatus in the figure consists broadly of an experience word seed input unit 10, an experience word acquisition unit 20, a factor trivial word acquisition unit 30, a recording unit 40, and an external device 50.

The tables of the recording unit 40 are described below. The recording unit 40 is a storage medium such as a hard disk and consists of an experience article word table 41, an experience word table 42, and a factor trivial word table 43.

<Experience article word table 41>
As shown in FIG. 2, the experience article word table 41 contains an experience article ID field, an order-from-the-top field, a character string field, and a part-of-speech field. The table stores the experience articles word by word: each word of each article is stored together with its position counted from the top of the article, and the part-of-speech field gives the word's part of speech. An experience article is an article, such as the one shown in FIG. 3, in which a person describes their own experience in a specific field (FIG. 3 shows an experience in the field of restaurant visits).
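The layout of the experience article word table 41 can be pictured with a small sketch. The class name `WordRow`, the field names, the 1-based position, and the sample rows are all assumptions made for illustration; the source only names the four fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRow:
    """One row of a hypothetical in-memory version of table 41."""
    article_id: int  # experience article ID field
    position: int    # order from the top of the article (1-based here, an assumption)
    text: str        # character string field
    pos: str         # part-of-speech field


def article_words(rows, article_id):
    """Return the rows of one article in reading order."""
    return sorted(
        (r for r in rows if r.article_id == article_id),
        key=lambda r: r.position,
    )


# Illustrative rows only; they are not taken from FIG. 2.
rows = [
    WordRow(1, 2, "で", "助詞-格助詞"),
    WordRow(1, 1, "二次会", "名詞-一般"),
    WordRow(1, 3, "行く", "動詞-自立"),
    WordRow(2, 1, "ランチ", "名詞-一般"),
]
```

Storing the position explicitly is what lets later steps look at "the word one position before" an experience word without re-parsing the article.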

<Experience word table 42>
As shown in FIG. 4, the experience word table 42 consists of an experience word field and stores the acquired experience words.

<Factor trivial word table 43>
As shown in FIG. 5, the factor trivial word table 43 consists of a factor trivial word field and stores the acquired factor trivial words.

The processing of the factor trivial word acquisition apparatus with the above configuration is described below.

FIG. 6 is a flowchart of the processing of the factor trivial word acquisition apparatus in the first embodiment of the present invention.

The processing of this apparatus consists broadly of the experience word seed input process S100 performed by the experience word seed input unit 10, the experience word acquisition process S200 performed by the experience word acquisition unit 20, and the factor trivial word acquisition process S300 performed by the factor trivial word acquisition unit 30. Each process is described in detail below.

<Experience word seed input process S100>
In the experience word seed input process, the experience word seed input unit 10 accepts experience words entered by the user. (The accepted experience words are expected to be the two or three that readily come to the user's mind.) Once the user has finished entering experience words, the unit requests the thresholds θ_1, θ_2, θ_3, and θ_4 used in the experience word acquisition process S200. When the user has entered these thresholds, the experience word seed input unit 10 outputs the experience word list V = {v_i} entered by the user, together with θ_1, θ_2, θ_3, and θ_4, to the experience word acquisition unit 20.

<Experience word acquisition process S200>
The experience word acquisition unit 20 takes as input the experience word list V (the set of experience word seeds) and θ_1, θ_2, θ_3, and θ_4 passed from the experience word seed input process S100, refers to the recording unit 40, and acquires (outputs) experience words. The detailed processing is as follows.

FIG. 7 is a flowchart of the experience word acquisition process in the first embodiment of the present invention.

Step 201) On the basis of the experience word seeds, the experience word acquisition unit 20 refers to the experience article word table 41 and creates a list of all words z_i that appear at least once. (Here, a word is counted as a distinct word when its pair of base-form label and part of speech is distinct. That is, <行く: 動詞-自立> (verb, independent) and <行く: 動詞-非自立> (verb, non-independent) are counted as different words.)
From the words z_i, a list is created of the words a_i whose part of speech is "名詞-サ変接続" (noun, suru-verb connection) or "動詞-自立" (verb, independent).
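Step 201 can be sketched as follows. This is a minimal illustration that assumes each token is already available as a (base form, part of speech) pair; the function name `candidate_words` and the sample tokens are hypothetical.

```python
# Part-of-speech tags named in the text for candidate experience words.
TARGET_POS = {"名詞-サ変接続", "動詞-自立"}


def candidate_words(tokens):
    """Return (z, a) for step 201.

    tokens: iterable of (base_form, pos) pairs from the whole corpus.
    z: every distinct (base_form, pos) pair, in first-seen order.
    a: the members of z whose part of speech is in TARGET_POS.
    """
    z = []
    seen = set()
    for key in tokens:
        if key not in seen:  # a word is distinct when its (base form, POS) pair is
            seen.add(key)
            z.append(key)
    a = [key for key in z if key[1] in TARGET_POS]
    return z, a


# Illustrative tokens only.
tokens = [
    ("行く", "動詞-自立"),
    ("行く", "動詞-非自立"),
    ("訪問", "名詞-サ変接続"),
    ("行く", "動詞-自立"),
    ("店", "名詞-一般"),
]
z, a = candidate_words(tokens)
```

With these tokens, the two readings of 行く stay separate in `z`, and only the independent verb and the suru-verb noun survive into `a`.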

Step 202) For every word a_i, the following score p_1(a_i) is calculated.

Part(a_i) in equation (1) above is the frequency distribution vector of the particles that appear in the one-word position immediately before the word a_i when a_i appears in the experience articles. As an example of Part(a_i), suppose that in the experience article corpus the word "行く" (go) is immediately preceded by the particle "へ" 30 times, by the particle "に" 50 times, and by the particle "と" 20 times. Then
Part("行く") = [30, 50, 20]
(where the components of the vector hold the number of times the particles "へ", "に", and "と", in that order, appear immediately before the word).

cossim(x_1, x_2) is the function that calculates the cosine similarity of the input vectors x_1 and x_2, that is, cossim(x_1, x_2) = (x_1 · x_2) / (‖x_1‖ ‖x_2‖).

A list is created of the words b_i for which the value of p_1(a_i) in equation (2) above is higher than the threshold θ_1.
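The two building blocks of step 202, the preceding-particle vector Part and the cosine similarity, can be sketched as below. The way the seed similarities are combined into p_1 is not recoverable from this text, so `p1` here simply averages the similarity over the seeds; that averaging, the function names, and the "助詞" prefix test for particles are assumptions made for illustration.

```python
import math
from collections import Counter


def particle_vector(sentences, word, particles):
    """Count which particle immediately precedes `word` (matched by base form).

    sentences: list of token lists, each token a (base_form, pos) pair.
    particles: fixed particle order that defines the vector components.
    """
    counts = Counter()
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            if cur[0] == word and prev[1].startswith("助詞"):
                counts[prev[0]] += 1
    return [counts[p] for p in particles]


def cossim(x1, x2):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def p1(candidate_vec, seed_vecs):
    """ASSUMED form of the score: mean similarity to the seed vectors."""
    return sum(cossim(candidate_vec, s) for s in seed_vecs) / len(seed_vecs)


# Illustrative corpus only.
sentences = [
    [("店", "名詞-一般"), ("に", "助詞-格助詞"), ("行く", "動詞-自立")],
    [("駅", "名詞-一般"), ("へ", "助詞-格助詞"), ("行く", "動詞-自立")],
    [("店", "名詞-一般"), ("に", "助詞-格助詞"), ("行く", "動詞-自立")],
]
vec = particle_vector(sentences, "行く", ["へ", "に", "と"])
```

Because cosine similarity ignores vector length, a rare candidate word with the same particle preferences as a frequent seed such as 行く still scores close to 1.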

Step 203) For every word b_i obtained in step 202, the following score p_2(b_i) is calculated. Here, D(b_i) is the set of experience articles that contain the word b_i.

Note that d_pos(b_i, d) in equation (4) gives a higher score the closer b_i appears to the beginning of d. This reflects an assumed model in which an experience article first describes the process leading up to the experience and then describes the experience itself, together with the assumption that experience words often appear in the former part, the process leading up to the experience (the part near the beginning). The 1/2 is subtracted in equation (4) to keep a word from scoring highly simply because it occurs many times.

After the score p_2(b_i) has been calculated for every word b_i, a list is created of the words c_i for which the value of p_2(b_i) is higher than the threshold θ_2.
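The equations for p_2 and d_pos are not reproduced in this text, so the following is only a sketch of one form that is consistent with the description: a position score that is highest at the start of the article, with 1/2 subtracted so that occurrences at arbitrary positions cancel out on average. The linear shape of `d_pos`, the use of one position per article, and the summation over D(b_i) are all assumptions.

```python
def d_pos(position, size):
    """ASSUMED position score: 1.0 for the first word, decreasing linearly.

    position: 1-based index of the word in the article.
    size: number of words in the article.
    """
    return 1.0 - (position - 1) / size


def p2(occurrences):
    """ASSUMED form of the score: sum of (d_pos - 1/2) over the articles in D(b).

    occurrences: list of (position, article_size), one entry per article
    that contains the word.
    """
    return sum(d_pos(pos, size) - 0.5 for pos, size in occurrences)
```

Under this assumed form, a word that always sits in the middle of its articles scores about zero no matter how many articles contain it, while a word that tends to appear early accumulates a positive score.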

Step 204) For every word c_i, the following score p_3(c_i) is calculated.

Here, df(c_i) in equation (5) is the number of documents in the experience article corpus in which c_i appears.

A list is created of the words for which the value of p_3(c_i) is higher than the threshold θ_3, the list is stored in the experience word table 42, and processing moves to the factor trivial word acquisition process S300.

As an example of the output of the experience word acquisition process S200, when the following experience word seeds are given as input, experience words with the following labels can be output.

・Example input experience word seeds:
行く (go), 来店 (come to the store)
・Example output experience words:
行く (go), 立ち寄る (stop by), 寄る (drop in), 伺う (call on), 並ぶ (line up), 連れる (bring along), 向かう (head to), 着く (arrive), 訪れる (visit), 利用 (use), 訪問 (visit), 来店 (come to the store)

<Factor trivial word acquisition process S300>
The factor trivial word acquisition unit 30 refers to the experience word table 42 and retrieves all experience words e_j.
It then refers to the experience article word table 41 and retrieves all of its information.

Next, the following steps (1), (2), and (3) are carried out for each experience article d.
(1) Among the words that appear in experience article d, for every word whose base form matches one of the e_j, all of its appearance positions w_1 (the word count from the top of the article) are obtained.

(2) Next, for each appearance position w_1, if the word located one word before w_1 is one of {と, で, として, ということで, ので, ため}, the appearance position w_2 of that word is obtained; all such positions are collected.

(3) Next, for each appearance position w_2, the word located one word before w_2 is recorded in a factor trivial word list held in memory (not shown).

After steps (1), (2), and (3) have been carried out for all experience articles, every word f_i that has been recorded θ_4 or more times in the factor trivial word list is stored in the factor trivial word table 43.

Steps (1), (2), and (3) above can be written in the form of a program, as shown in FIG. 8. In the program shown in the figure, D is the set of experience articles d, size(d) is the number of words in experience article d, w(d, g) is the g-th word of experience article d, E is the set of experience words, and j = {と, で, として, ということで, ので, ため} is the set of factor particle words. "count(w_i)" is the operation that counts the number of occurrences of the word w_i.
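The program of FIG. 8 is not reproduced in this text. Steps (1) to (3) and the θ_4 cutoff can nevertheless be sketched from the description as follows. The function name and the sample articles are hypothetical, and the sketch assumes each article is a list of base-form tokens in which every factor particle word, including multi-morpheme ones such as ということで, is a single token.

```python
from collections import Counter

# Factor particle words listed in the text (the set j).
FACTOR_PARTICLES = {"と", "で", "として", "ということで", "ので", "ため"}


def acquire_factor_trivial_words(articles, experience_words, theta4):
    """Return the words that precede '<factor particle> <experience word>'
    at least theta4 times across all articles.

    articles: list of articles, each a list of base-form tokens (assumption).
    experience_words: the set E of acquired experience words.
    """
    counts = Counter()
    for d in articles:
        for g in range(2, len(d)):
            is_experience = d[g] in experience_words       # step (1): position w_1
            has_particle = d[g - 1] in FACTOR_PARTICLES    # step (2): position w_2
            if is_experience and has_particle:
                counts[d[g - 2]] += 1                      # step (3): word before w_2
    return [w for w, c in counts.items() if c >= theta4]


# Illustrative articles only.
articles = [
    ["二次会", "で", "行く"],
    ["二次会", "として", "利用"],
    ["ランチ", "で", "訪問"],
    ["店", "に", "行く"],
]
E = {"行く", "利用", "訪問"}
result = acquire_factor_trivial_words(articles, E, theta4=2)
```

In this toy corpus 店 is never counted, because に is not in the factor particle set, and ランチ is dropped by the θ_4 = 2 cutoff.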

The factor trivial word acquisition process focuses on language patterns that contain an experience word and a factor particle word. When an expression such as the following occurs, the occurrences of the underlined word are counted, and words counted at least a fixed number of times are recorded as factor trivial words.

The factor trivial word acquisition process uses the connection to an experience word and to a specific factor particle word in order to acquire many factor trivial words with few errors.

As an example of the output of the factor trivial word acquisition process S300, factor trivial words such as the following can be output.

・Example factor trivial words:
安い (cheap), ランチ (lunch), 一人 (alone), 二人 (two people), 友人 (friend), 家族 (family), 車 (car), 二次会 (after-party), 宴会 (banquet), 上司 (boss), デート (date), 友達 (friend), 同僚 (colleague), ディナー (dinner), 昼食 (midday meal), 良い評判 (good reputation), 接待 (business entertainment), 大勢 (large group), 思いつき (whim), ブランチ (brunch), 結婚式 (wedding), 接待 (business entertainment), 家族連れ (family outing), 忘年会 (year-end party), 誕生日祝い (birthday celebration), 夕食 (evening meal), 居酒屋感覚 (izakaya-like feel), 紹介 (referral), 記念日 (anniversary), グループ (group), 仲間 (companions), カップル (couple), ２名 (party of two), 送別会 (farewell party), 仕事 (work), テイクアウト (takeout), タクシー (taxi), 二次会 (after-party), 電車 (train), モーニング (morning set), 宴会 (banquet), プライベート (private), 合コン (group date), 平日ランチ (weekday lunch), 子連れ (with children), お祝い (celebration), 女性同士 (women together), 誕生日 (birthday), 女性 (woman), 会食 (dinner meeting), 新年会 (New Year's party), 観光 (sightseeing)

[Second embodiment]
In this embodiment, the processing of step 202 in the experience word acquisition process of the first embodiment is changed as follows.

Step 202) First, a set Q = {q_i} of one or more particles and a threshold θ_5 are accepted as input from the user. When the user's input is complete, processing moves to the following steps.

For every word a_i obtained in step 201, the score p_4(a_i) is calculated by equation (6). Here, tf(a_i) is the frequency with which the word a_i appears in the corpus.

Finally, a list is created of the words b_i for which the value of p_4(a_i) is higher than the threshold θ_5.
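Equation (6) is not reproduced in this text. One form that is consistent with the description (a score built from how often the user-specified particles in Q immediately precede the word, relative to the word's corpus frequency tf) is the simple ratio sketched below. The ratio itself, the function name, and the sample counts are assumptions, not the patent's stated formula.

```python
def p4(preceding_counts, tf, Q):
    """ASSUMED form of the score: the share of the word's occurrences
    that are immediately preceded by a particle in Q.

    preceding_counts: dict mapping particle -> number of times it
    immediately precedes the word in the corpus.
    tf: total number of occurrences of the word in the corpus.
    Q: set of particles supplied by the user.
    """
    if tf == 0:
        return 0.0
    return sum(preceding_counts.get(q, 0) for q in Q) / tf


# Illustrative counts only: the word occurs 200 times in total.
score = p4({"へ": 30, "に": 50, "と": 20}, 200, {"へ", "に"})
```

Compared with the cosine-similarity scoring of the first embodiment, this variant needs no seed vectors: the user states directly which particles are expected before an experience word.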

The first and second embodiments described above provide the following effects.

Effect 1 corresponds to Problem 1, and Effect 2 corresponds to Problem 2.
・Effect 1) Appropriate factor estimation for words that are factors but are not accompanied by a language pattern indicating a factor;
・Effect 2) Exhaustive acquisition of a factor trivial word list for an experience article corpus.
Details are given below.

As the first effect of the present invention, in the task of obtaining expressions that represent factors from experience articles, the factor trivial word list makes it possible to evaluate whether a word is a factor even when the word is not accompanied by a language pattern indicating a factor.

An example follows. When an experience article contains a sentence such as the following, the prior art was able to evaluate the underlined word as a word that represents a factor, because the word is accompanied by a connective marker indicating a factor.

However, the prior art was in some cases unable to evaluate words such as the following as factors, because they are not accompanied by a language pattern indicating a factor.

With the present invention, in cases such as Example Sentence 3 above, a new piece of information, the factor-likeness of the word (whether or not it is contained in the factor trivial words), can be used when determining whether the word is a factor, which contributes to highly accurate factor determination.

As the second effect of the present invention, given an experience article corpus, only those expressions that reliably represent factors for the experience can be selected and obtained. By showing marketers these high-confidence factors, marketers can grasp the trends in the factors behind experiences across the whole field concerned, and can use them as knowledge when building marketing hypotheses and executing strategies.

For example, by using a collection of review articles about restaurants together with this apparatus, a marketer in the food and beverage industry can grasp the factors that occur frequently in that industry and, when planning the advertising strategy of a particular restaurant, can write advertising copy with catchphrases that contain the factor words obtained by this apparatus.

Although FIG. 8 of the first embodiment shows a program for the factor trivial word acquisition unit 30, the invention is not limited to this example. The processing of the experience word seed input unit 10 and the experience word acquisition unit 20 of the factor trivial word acquisition apparatus can also be built as programs, which can be installed and executed on a computer used as the factor trivial word acquisition apparatus or distributed over a network.

The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

DESCRIPTION OF SYMBOLS
10 Experience word seed input unit
20 Experience word acquisition unit
30 Factor trivial word acquisition unit
40 Storage unit
41 Experience article word table
42 Experience word table
43 Factor trivial word table

Claims

A factor trivial word acquisition device for extracting, from among the words appearing at least once in a corpus, factor trivial words, which are words representing factors, the device comprising:
storage means storing experience articles that people wrote about their own experiences, and experience words;
input means for receiving input of experience words (hereinafter referred to as “experience word seeds”);
experience word acquisition means for referring to the storage means based on the input experience word seeds and, given a corpus whose elements are the experience articles, extracting experience words using, for each word appearing at least once in the corpus, its appearance position tendency within each experience article, its part of speech, and the frequency distribution of the particles appearing immediately before it; and
factor word acquisition means for referring to the storage means and, using a language pattern that includes the experience words acquired by the experience word acquisition means and a factor particle, extracting from the corpus, as factor trivial words, words whose number of occurrences is equal to or greater than a predetermined number.

The factor trivial word acquisition device according to claim 1, wherein the experience word acquisition means includes
means for using, as a determination feature when determining whether each word appearing at least once is an experience word,
the similarity between the frequency distribution of the immediately preceding particles of one or more appropriate experience words input in advance by the user and the frequency distribution of the immediately preceding particles of each word that appears at least once in the corpus and whose part of speech is “noun (sa-hen connection)” or “verb (independent)”.
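
As a rough illustration of the similarity feature recited above, the following Python sketch compares the distribution of particles that immediately precede a seed experience word with the distribution for a candidate word. The claim does not name a similarity measure; cosine similarity is assumed here purely for illustration, and the function names and sample particle lists are hypothetical. Morphological analysis and part-of-speech filtering are assumed to have been done beforehand.

```python
from collections import Counter
from math import sqrt


def particle_distribution(preceding_particles):
    """Relative frequency of each particle seen immediately before a word."""
    counts = Counter(preceding_particles)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def cosine_similarity(dist_a, dist_b):
    """Cosine similarity between two sparse distributions (assumed measure)."""
    dot = sum(v * dist_b.get(p, 0.0) for p, v in dist_a.items())
    norm_a = sqrt(sum(v * v for v in dist_a.values()))
    norm_b = sqrt(sum(v * v for v in dist_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical observations: particles found immediately before each word.
seed_dist = particle_distribution(["を", "を", "に", "を"])       # seed experience word
candidate_dist = particle_distribution(["を", "に", "を"])        # candidate verb
unrelated_dist = particle_distribution(["の", "の", "は"])        # unrelated word

similar = cosine_similarity(seed_dist, candidate_dist)
dissimilar = cosine_similarity(seed_dist, unrelated_dist)
```

A candidate whose particle profile resembles that of the seeds scores close to 1, while a word preceded by entirely different particles scores 0; such a score would be one feature among several in the experience word determination.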

The factor trivial word acquisition device according to claim 1, wherein the experience word acquisition means includes
means for using, as a determination feature when determining whether each word appearing at least once is an experience word,
the distribution of the frequency with which one or more particles input in advance by the user appear immediately before each word.

The factor trivial word acquisition device according to claim 1, wherein the factor word acquisition means includes
means for extracting, as a factor trivial word, each word in the corpus whose number of matches with a language pattern that includes a pre-designated connection indicator representing a factor and an experience word reaches a predetermined number.
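
The pattern-count extraction recited above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pattern defined in the specification: the language pattern is simplified to "a word immediately followed by a connection indicator, with an experience word later in the same sentence", sentences are assumed to be already tokenized, and the function names, sample sentences, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def count_pattern_matches(sentences, experience_words, indicators):
    """Count, per word, matches of '<word> <indicator> ... <experience word>'."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - 1):
            followed_by_indicator = tokens[i + 1] in indicators
            # Assumed simplification: the experience word may appear anywhere
            # after the connection indicator within the same sentence.
            has_experience = any(t in experience_words for t in tokens[i + 2:])
            if followed_by_indicator and has_experience:
                counts[tokens[i]] += 1
    return counts


def extract_factor_words(sentences, experience_words, indicators, min_count):
    """Keep words whose match count is at least min_count (assumed threshold rule)."""
    counts = count_pattern_matches(sentences, experience_words, indicators)
    return {word for word, c in counts.items() if c >= min_count}


# Hypothetical pre-tokenized review sentences.
sentences = [
    ["安い", "ので", "ランチ", "を", "食べる"],
    ["安い", "ので", "また", "行く"],
    ["近い", "ので", "行く"],
    ["ランチ", "を", "食べる"],
]
experience_words = {"食べる", "行く"}
indicators = {"ので", "ため"}

match_counts = count_pattern_matches(sentences, experience_words, indicators)
factor_words = extract_factor_words(sentences, experience_words, indicators, min_count=2)
```

With these sample sentences, only a word that matches the pattern at least twice survives the threshold; raising `min_count` trades recall for higher confidence that the word really expresses a factor.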

A factor trivial word acquisition method in which a computer extracts, from among the words appearing at least once in a corpus, factor trivial words, which are words representing factors, the method comprising:
an input step in which input means receives input of experience words (hereinafter referred to as “experience word seeds”);
an experience word acquisition step in which experience word acquisition means, based on the input experience word seeds, refers to storage means storing experience articles that people wrote about their own experiences and experience words and, given a corpus whose elements are the experience articles, extracts experience words using, for each word appearing at least once in the corpus, its appearance position tendency within each experience article, its part of speech, and the frequency distribution of the particles appearing immediately before it; and
a factor word acquisition step in which factor word acquisition means refers to the storage means and, using a language pattern that includes the experience words acquired in the experience word acquisition step and a factor particle, extracts from the corpus, as factor trivial words, words whose number of occurrences is equal to or greater than a predetermined number.

The factor trivial word acquisition method according to claim 5, wherein, in the experience word acquisition step,
when determining whether each word appearing at least once is an experience word,
the similarity between the frequency distribution of the immediately preceding particles of one or more appropriate experience words input in advance by the user and the frequency distribution of the immediately preceding particles of each word that appears at least once in the corpus and whose part of speech is “noun (sa-hen connection)” or “verb (independent)” is used as a determination feature.

The factor trivial word acquisition method according to claim 5, wherein, in the experience word acquisition step,
when determining whether each word appearing at least once is an experience word,
the distribution of the frequency with which one or more particles input in advance by the user appear immediately before each word is used as a determination feature.

The factor trivial word acquisition method according to claim 5, wherein, in the factor word acquisition step,
each word in the corpus whose number of matches with a language pattern that includes a pre-designated connection indicator representing a factor and an experience word reaches a predetermined number is extracted as a factor trivial word.

A factor trivial word acquisition program for causing a computer to function as each means of the factor trivial word acquisition device according to any one of claims 1 to 4.