JP5705710B2

JP5705710B2 - Feature word extraction device, feature word extraction method, and program

Info

Publication number: JP5705710B2
Application number: JP2011262395A
Authority: JP
Inventors: 小林 のぞみ; 牧野 俊朗; 松尾 義博
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-30
Filing date: 2011-11-30
Publication date: 2015-04-22
Anticipated expiration: 2031-11-30
Also published as: JP2013114586A

Description

The present invention relates to a feature word extraction device, a feature word extraction method, and a program, and in particular to a feature word extraction device, method, and program that extract feature words relating to an explanation target from explanatory sentences.

Various studies have addressed the extraction of technical terms in different fields. For example, Non-Patent Literature 1 describes a term extraction technique that focuses on nouns that cannot be divided any further (hereinafter, "single nouns") and scores candidates based on how single nouns are concatenated.

Non-Patent Literature 1: Nakagawa, H., Mori, T., Yumoto, H., "Terminology extraction based on appearance frequency and connection frequency" (出現頻度と連接頻度に基づく専門用語抽出), Journal of Natural Language Processing (自然言語処理), Vol. 10, No. 1, pp. 27-45, 2003.

The technique described in Non-Patent Literature 1 targets noun sequences in text. However, a text also contains noun sequences that are unrelated to the product for which feature words are to be extracted.

The technique described in Non-Patent Literature 1 also takes sequences of nouns segmented by morphological analysis as candidates, but fields such as fashion contain many katakana words that are unknown to the analyzer. A katakana unknown word is acquired as a single word however long it is, so an inappropriate word such as "ノースリーブフラワープリントワンピース" (sleeveless flower-print dress) is extracted by mistake.

The present invention has been made in view of the above circumstances, and its object is to provide a feature word extraction device, a feature word extraction method, and a program that can extract feature words accurately and at an appropriate length.

To achieve the above object, the feature word extraction method according to the present invention is a method performed in a feature word extraction device that extracts feature words relating to an explanation target from explanatory sentences about the explanation target. The method includes: a step in which candidate acquisition means acquires feature word candidates from an input set of explanatory sentences that have been morphologically analyzed and in which katakana words that are unknown words have been split, based on a previously obtained list of category words for the explanation target and previously obtained appearance patterns of a category word and a feature word that modifies it, and, when a feature word candidate is composed of a plurality of morphemes obtained by splitting a katakana unknown word, decomposes that candidate and acquires each word obtained by the decomposition as a feature word candidate; a step in which score calculation means calculates, for each acquired feature word candidate, a score indicating the degree to which it is a feature word, based on the appearance frequency of the candidate in the set of explanatory sentences and the number of explanatory sentences in which the candidate appears; and a step in which candidate output means outputs, as feature words, the feature word candidates whose calculated score is equal to or greater than a threshold.

The feature word extraction device according to the present invention is a device that extracts feature words relating to an explanation target from explanatory sentences about the explanation target. The device includes: candidate acquisition means that acquires feature word candidates from an input set of explanatory sentences that have been morphologically analyzed and in which katakana words that are unknown words have been split, based on a previously obtained list of category words for the explanation target and previously obtained appearance patterns of a category word and a feature word that modifies it, and that, when a feature word candidate is composed of a plurality of morphemes obtained by splitting a katakana unknown word, decomposes that candidate and acquires each word obtained by the decomposition as a feature word candidate; score calculation means that calculates, for each acquired feature word candidate, a score indicating the degree to which it is a feature word, based on the appearance frequency of the candidate in the set of explanatory sentences and the number of explanatory sentences in which the candidate appears; and candidate output means that outputs, as feature words, the feature word candidates whose calculated score is equal to or greater than a threshold.

According to the feature word extraction method and the feature word extraction device of the present invention, the candidate acquisition means acquires feature word candidates from an input set of explanatory sentences that have been morphologically analyzed and in which katakana unknown words have been split, based on a previously obtained list of category words for the explanation target and previously obtained appearance patterns of a category word and a feature word that modifies it.

The score calculation means then calculates, for each acquired feature word candidate, a score indicating the degree to which it is a feature word, based on the appearance frequency of the candidate in the set of explanatory sentences and the number of explanatory sentences in which the candidate appears. The candidate output means outputs, as feature words, the candidates whose calculated score is equal to or greater than a threshold.

In this way, feature word candidates are acquired from a set of explanatory sentences in which katakana unknown words have been split, and a score is calculated from the appearance frequency of each candidate and the number of explanatory sentences in which it appears, so that feature words can be extracted accurately and at an appropriate length.

The program according to the present invention is a program for causing a computer to execute each step of the above feature word extraction method.

As described above, the feature word extraction method, the feature word extraction device, and the program of the present invention acquire feature word candidates from a set of explanatory sentences in which katakana unknown words have been split, and calculate a score from the appearance frequency of each candidate and the number of explanatory sentences in which it appears. This has the effect that feature words can be extracted accurately and at an appropriate length.

FIG. 1 is a schematic diagram showing the configuration of the feature word extraction device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of Web text.
FIG. 3 is a table showing the appearance frequency and site frequency of each candidate word.
FIG. 4 is a table showing the score of each candidate word.
FIG. 5 is a flowchart showing the feature word extraction processing routine in the feature word extraction device according to the embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. The description uses as an example a feature word extraction device that extracts feature words of products from a large volume of Web text representing product information.

<Configuration of the feature word extraction device>
As shown in FIG. 1, the feature word extraction device 100 according to the embodiment of the present invention is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the feature word extraction processing routine described later. Functionally, it is configured as follows: the feature word extraction device 100 includes an input unit 10 and a calculation unit 20.

The input unit 10 accepts the set of Web texts to be processed, entered through a known input device such as a keyboard, mouse, or storage device. As shown in FIG. 2, each input Web text contains the text portion of a Web page on which a product description is written, together with the URL or domain of that Web page. Here, a product is one example of an explanation target, and the set of Web texts is one example of a set of explanatory sentences.

The calculation unit 20 includes a Web text storage unit 21, a morphological analysis unit 22, a katakana unknown word splitting unit 23, a candidate acquisition unit 24, a category word list database 25, an extraction pattern database 26, a score calculation unit 27, a candidate output unit 28, and a feature word dictionary database 29.

The Web text storage unit 21 stores the set of Web texts accepted by the input unit 10.

The morphological analysis unit 22 applies morphological analysis, a well-known technique, to the text portion of each Web text stored in the Web text storage unit 21, breaking it into words and assigning each word a part of speech, reading, and similar information. For example, the input "ノースリーブフラワープリントワンピース" (sleeveless flower-print dress) is output as the single token "ノースリーブフラワープリントワンピース (unknown word)".

The katakana unknown word splitting unit 23 takes the output of the morphological analysis unit 22 and splits the katakana words that have been assigned the part of speech "unknown word". Katakana splitting can be realized with known techniques, for example the one in the reference (Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi, "Automatic segmentation of Japanese katakana compound nouns for Japanese dictionary development", 11th Annual Meeting of the Association for Natural Language Processing, pp. 588-591, March 2005): a method using a Japanese-English dictionary, a method using an English corpus together with a Japanese-English dictionary, a method using relations within the base data, or a combination of these.

For "ノースリーブフラワープリントワンピース (unknown word)", the output is "ノー (unknown word) / スリーブ (unknown word) / フラワー (unknown word) / プリント (unknown word) / ワンピース (unknown word)".
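The splitting step can be illustrated with a minimal sketch. The text only points to known techniques (dictionary- and corpus-based methods), so the plain vocabulary lookup with a fewest-pieces dynamic program below is an assumed stand-in, and `split_katakana` and `VOCABULARY` are hypothetical names:

```python
def split_katakana(word, vocabulary):
    """Split `word` into the fewest vocabulary entries; return [word] if no split exists."""
    n = len(word)
    # best[i] holds the shortest known segmentation of word[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [word]


# Hypothetical vocabulary; a real system would derive it from dictionaries or corpora.
VOCABULARY = {"ノー", "スリーブ", "フラワー", "プリント", "ワンピース"}

parts = split_katakana("ノースリーブフラワープリントワンピース", VOCABULARY)
print(" / ".join(parts))
```

A word that cannot be covered by the vocabulary is returned unsplit, which keeps the downstream steps working on the original token.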

The principle of extracting feature word candidates based on category words is explained next.

A feature word of a product typically appears in an expression that modifies a word denoting the product (a category word).

For example, "ワンピース" (dress) and "シャツ" (shirt) are category words, while "フェミニン" (feminine) in "フェミニンなワンピース" and "ノースリーブ" (sleeveless) in "ノースリーブワンピース" are feature words.

Using category words prepared in advance and appearance patterns prepared in advance, such as "X な <category word>" and "X <category word>", candidate words considered to be the feature word X are acquired from product descriptions.

The appearance patterns of category words and feature words can be of the following kinds (1) to (3).

(1) Appearance patterns in which the feature word appears as an adjective, adjectival noun (na-adjective), or adverb that modifies the category word

Examples are "X とした <category word>" (e.g. ふわっとしたワンピース, a fluffy dress) and "X な <category word>" (e.g. フェミニンなワンピース, a feminine dress).

(2) Appearance patterns in which the feature word appears as part of a compound word that contains the category word

An example is "X <category word>" (e.g. ノースリーブワンピース, a sleeveless dress; 2way バッグ, a 2-way bag).

(3) Appearance patterns in which the feature word appears as a noun phrase (including unknown words) that modifies the category word

An example is "X の <category word>" (e.g. チェックのワンピース, a checked dress).

In the present embodiment, therefore, a list of product category words is stored in the category word list database 25. The category words may be prepared by hand, or acquired automatically from the topic path of each product page (a representation of a page's position within a Web site as links to the higher-level pages of the hierarchy, e.g. "top > レディース > トップス > シャツ").
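As a rough illustration of the automatic route, the sketch below takes the last element of a topic path as the category word. The text does not say which element is used, so that choice, the handling of the ">" separator, and the function name are all assumptions:

```python
def category_word_from_topic_path(topic_path, separator=">"):
    """Return the last non-empty element of a topic path, or None if the path is empty."""
    parts = [part.strip() for part in topic_path.split(separator)]
    parts = [part for part in parts if part]
    return parts[-1] if parts else None


# Collect category words from several (made-up) product page topic paths.
paths = [
    "top > レディース > トップス > シャツ",
    "top > レディース > ワンピース",
]
category_words = {category_word_from_topic_path(p) for p in paths}
print(sorted(category_words))
```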

The extraction pattern database 26 stores the appearance patterns (1) to (3) above as the extraction patterns used to extract feature words.
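One way to picture such a pattern store is as a list of connector sequences that may stand between X and the category word. The sketch below is only an illustration: the tokenization of "とした" into three morphemes, the restriction of X to a single morpheme, and the names `EXTRACTION_PATTERNS` and `match_patterns` are assumptions, and the part-of-speech checks a real system would need are omitted:

```python
# Connector morphemes between X and the category word.
# The empty tuple stands for the compound pattern "X <category word>".
EXTRACTION_PATTERNS = [("と", "し", "た"), ("な",), ("の",), ()]


def match_patterns(tokens, category_words, patterns=EXTRACTION_PATTERNS):
    """Return (X, category word) pairs found in a list of morpheme surfaces."""
    hits = []
    for i, token in enumerate(tokens):
        if token not in category_words:
            continue
        for connector in patterns:
            k = len(connector)
            x_index = i - k - 1  # position of X, just before the connector
            if x_index >= 0 and tuple(tokens[i - k:i]) == connector:
                hits.append((tokens[x_index], token))
                break  # longer connectors are listed first, so stop at the first match
    return hits


print(match_patterns(["フェミニン", "な", "ワンピース"], {"ワンピース"}))
```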

The candidate acquisition unit 24 extracts candidate words, that is, feature word candidates, from the text portions of the Web texts output by the katakana unknown word splitting unit 23, which have been morphologically analyzed and in which katakana unknown words have been split. It does so using the category word list stored in the category word list database 25 and the feature word extraction patterns stored in the extraction pattern database 26.

For example, if the category word list contains "ワンピース" (dress) and the extraction patterns include the appearance pattern "X <category word>", then from the portion "ノー (unknown word) / スリーブ (unknown word) / フラワー (unknown word) / プリント (unknown word) / <ワンピース>" of an explanatory sentence, "ノー/スリーブ/フラワー/プリント" is acquired as candidate word x.
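The compound-pattern example can be sketched as follows. The rule used here, collecting the run of unknown-word morphemes that directly precedes a category word, is an assumed simplification of the pattern "X <category word>", and `compound_candidates` and the "unknown" tag are hypothetical names:

```python
def compound_candidates(tokens, category_words):
    """tokens: list of (surface, part_of_speech) pairs.

    For each category word, return the run of unknown-word morphemes
    immediately before it as one candidate (a tuple of surfaces).
    """
    results = []
    for i, (surface, _pos) in enumerate(tokens):
        if surface not in category_words:
            continue
        start = i
        while (
            start > 0
            and tokens[start - 1][1] == "unknown"
            and tokens[start - 1][0] not in category_words
        ):
            start -= 1
        if start < i:
            results.append(tuple(s for s, _ in tokens[start:i]))
    return results


tokens = [
    ("ノー", "unknown"),
    ("スリーブ", "unknown"),
    ("フラワー", "unknown"),
    ("プリント", "unknown"),
    ("ワンピース", "unknown"),
]
print(compound_candidates(tokens, {"ワンピース"}))
```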

When an acquired candidate word is composed of a plurality of morphemes, the candidate acquisition unit 24 decomposes it and treats each word obtained by the decomposition as a candidate word. A candidate word consisting of a single word is not decomposed.

For example, the acquired candidate word "ノー/スリーブ/フラワー/プリント" consists of several morphemes, so it is decomposed into "ノー", "ノー/スリーブ", "ノー/スリーブ/フラワー", "ノー/スリーブ/フラワー/プリント", "スリーブ", "スリーブ/フラワー", "スリーブ/フラワー/プリント", "フラワー", "フラワー/プリント", and "プリント", and each of these is treated as a candidate word.
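The decomposition in this example amounts to enumerating every contiguous sub-sequence of the morphemes. A minimal sketch (the function name `decompose` is hypothetical):

```python
def decompose(morphemes):
    """Return every contiguous sub-sequence of `morphemes` as a tuple."""
    n = len(morphemes)
    return [tuple(morphemes[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


subcandidates = decompose(["ノー", "スリーブ", "フラワー", "プリント"])
for sub in subcandidates:
    print("/".join(sub))
```

For a candidate of m morphemes this yields m(m+1)/2 candidates, ten in the four-morpheme example, and a single-morpheme candidate yields only itself.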

For each candidate word obtained, the candidate acquisition unit 24 also calculates its appearance frequency in the input set of Web texts and its site frequency, which indicates on how many Web sites the candidate word appeared. The site frequency is obtained by cutting out, for example, the part of each input Web text's URL up to the first '/' as the site name and counting over those names. If the input is simply a set of explanatory sentences, the number of explanatory sentences is used in place of the site frequency.
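A minimal sketch of this counting step follows. It assumes that "up to the first '/'" means the host part of the URL, which is obtained here with `urllib.parse.urlsplit`; the function names and the example URLs are made up for illustration:

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def site_name(url):
    """Host part of the URL, used as the site name (an assumed reading of the text)."""
    return urlsplit(url).netloc


def count_frequencies(occurrences):
    """occurrences: iterable of (candidate, url) pairs, one per extracted occurrence.

    Returns (appearance frequency n(x), site frequency sf(x)) as two mappings.
    """
    freq = Counter()
    sites = defaultdict(set)
    for candidate, url in occurrences:
        freq[candidate] += 1
        sites[candidate].add(site_name(url))
    site_freq = {candidate: len(names) for candidate, names in sites.items()}
    return freq, site_freq


occurrences = [
    (("ノー", "スリーブ"), "http://shop-a.example/item/1"),
    (("ノー", "スリーブ"), "http://shop-a.example/item/2"),
    (("ノー", "スリーブ"), "http://shop-b.example/p/9"),
    (("プリント",), "http://shop-a.example/item/1"),
]
freq, site_freq = count_frequencies(occurrences)
```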

As shown in FIG. 3, the output of the candidate acquisition unit 24 consists of the candidate words (with morpheme boundaries retained), the calculated appearance frequency n(x) of each candidate word, and its site frequency sf(x).

For each acquired candidate word, the score calculation unit 27 calculates a score indicating the degree to which it is a feature word, as described below.

In the present embodiment, scoring is based on the unithood of a word so that the appropriate length of an acquired word can be measured as part of the degree to which it is a feature word. Unithood indicates the degree to which a given linguistic unit is used stably in a corpus.

Specifically, a formula based on the Modified C-value (see Non-Patent Literature 1) is used; the Modified C-value adapts the measure known as C-value so that it also handles words consisting of a single word. The score is calculated with equation (1), which further refines this so that a low weight is given to words, such as "エクストラファインコットン" (extra fine cotton), that occur only in the products of one particular site. This is because the feature words to be registered in the dictionary should preferably be words that appear widely across many sites.

Here, sf(x) is the site frequency, that is, the number of sites of the Web texts in which candidate word x appeared; length(x) is the number of morphemes that make up candidate word x; and n(x) is the appearance frequency of candidate word x. t(x) is the total appearance frequency of the other candidate words that contain candidate word x, and c(x) is the number of distinct other candidate words that contain candidate word x.

Equation (1) adapts the original C-value to handle a frequency of 1, in the same way as Non-Patent Literature 1, and further multiplies it by the logarithm of the per-site appearance frequency, so that the weight of a specific word that appears on only one site is lowered.
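Equation (1) itself is not reproduced in this text. Based on the symbol definitions above and the Modified C-value of Non-Patent Literature 1, one plausible reconstruction is the following. It is an assumption, not the confirmed formula of the patent, and it reads "the logarithm of the per-site appearance frequency" as log sf(x):

```latex
\mathrm{score}(x) = \log\bigl(sf(x)\bigr) \cdot \mathrm{length}(x) \cdot \left( n(x) - \frac{t(x)}{c(x)} \right) \tag{1}
```

Under this reading, the term t(x)/c(x) is taken as 0 when c(x) = 0, and a candidate seen on a single site scores 0 because log 1 = 0, which matches the stated aim of down-weighting site-specific words.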

For each acquired candidate word, the score calculation unit 27 calculates the score according to equation (1), using the candidate word, appearance frequency, and site frequency obtained by the candidate acquisition unit 24.

Specifically, the following steps 1 to 6 are carried out for every input candidate word to calculate its score. To avoid confusion, the candidate word being processed at each turn is written x.

1. Obtain the appearance frequency (n(x)) and site frequency (sf(x)) associated with candidate word x.
2. Split candidate word x at its morpheme boundaries and obtain the number of parts (corresponding to length(x)).
3. Using candidate word x as the key, search the candidate words input to the score calculation unit 27, retrieve every candidate word that contains x together with its associated appearance frequency, and build the candidate word list pairs, each pairing a candidate word that contains x with its appearance frequency.
4. Sum the appearance frequencies in the candidate word list pairs to obtain t(x).
5. Obtain the number of distinct candidate words in the candidate word list pairs (corresponding to c(x)).
6. From the obtained n(x), sf(x), length(x), t(x), and c(x), calculate score(x) according to equation (1).
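Steps 1 to 6 can be sketched as below. Because equation (1) is not reproduced in this text, the code uses an assumed form, log(sf(x)) * length(x) * (n(x) - t(x)/c(x)) with the last term dropped when c(x) = 0, modelled on the Modified C-value; the function names and the toy data are likewise hypothetical:

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside the strictly longer tuple."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def score_candidates(freq, site_freq):
    """freq, site_freq: mappings from candidate (tuple of morphemes) to n(x) and sf(x)."""
    scores = {}
    for x in freq:
        n_x, sf_x = freq[x], site_freq[x]           # step 1
        length_x = len(x)                           # step 2
        pairs = [(y, freq[y]) for y in freq if contains(y, x)]  # step 3
        t_x = sum(f for _, f in pairs)              # step 4
        c_x = len(pairs)                            # step 5
        nested = t_x / c_x if c_x else 0.0
        # Step 6, using the ASSUMED form of equation (1).
        scores[x] = math.log(sf_x) * length_x * (n_x - nested)
    return scores


# Toy data, not taken from the patent.
freq = {("a", "b"): 10, ("a",): 12, ("b",): 10}
site_freq = {("a", "b"): 4, ("a",): 4, ("b",): 1}
scores = score_candidates(freq, site_freq)
```

In the toy data, ("b",) never occurs outside ("a", "b") and is seen on one site only, so it scores 0 under the assumed formula, while ("a", "b") scores highest.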

In step 3 above, the search for candidate words that contain candidate word x is expected to involve an enormous number of candidates and to take a very long time, so it is preferable to shorten the search time by holding the candidate words and appearance frequencies in a structure such as a suffix array.
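The idea of indexing the candidates can be illustrated with a simpler structure than a suffix array. The sketch below uses an inverted index from morpheme to candidates, which is a swapped-in technique chosen for brevity rather than the structure named in the text; all names are hypothetical:

```python
from collections import defaultdict


def build_index(candidates):
    """Map each morpheme to the set of candidates (tuples) that contain it."""
    index = defaultdict(set)
    for candidate in candidates:
        for morpheme in candidate:
            index[morpheme].add(candidate)
    return index


def is_contiguous_part(longer, shorter):
    """True if `shorter` is a contiguous run inside the strictly longer tuple."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def containing_candidates(x, index):
    """Candidates that contain x, found without scanning the whole candidate list."""
    pools = [index.get(morpheme, set()) for morpheme in x]
    shared = set.intersection(*pools) if pools else set()
    return {y for y in shared if is_contiguous_part(y, x)}


candidates = [
    ("ノー", "スリーブ"),
    ("ノー", "スリーブ", "フラワー"),
    ("スリーブ",),
    ("フラワー", "プリント"),
]
index = build_index(candidates)
found = containing_candidates(("ノー", "スリーブ"), index)
```

Only candidates that share every morpheme of x are checked for contiguity, which avoids comparing x against the full candidate list.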

The output of the score calculation unit 27 is the set of pairs of each candidate word and its calculated score, as shown in FIG. 4. The candidate words in the example of FIG. 4 are shown with morpheme boundaries removed.

The candidate output unit 28 outputs, as feature words, the candidate words whose calculated score is equal to or greater than a threshold, and registers them in the feature word dictionary database 29. For example, if, among the candidate words "ノー", "ノー/スリーブ", "ノー/スリーブ/フラワー", "ノー/スリーブ/フラワー/プリント", "スリーブ", "スリーブ/フラワー", "スリーブ/フラワー/プリント", "フラワー", "フラワー/プリント", and "プリント", the scores of "ノースリーブ" and "フラワープリント" are at or above the threshold, only these two candidate words are output and the other eight candidate words are rejected.

Any suitable value can be chosen as the threshold. One approach, for example, is to test on development data in advance and set the value that performed best. The threshold also needs to be reconsidered according to factors such as the size of the input data. As one example, with tens of thousands of Web texts as input, the threshold was set to about 60 to 100.
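The thresholding step can be sketched as follows. The scores below are invented for illustration and are not taken from FIG. 4; the threshold of 60 is the lower end of the range mentioned in the text, and the function name is hypothetical:

```python
def select_feature_words(scores, threshold):
    """Return candidates scoring at or above `threshold`, best first, with boundaries removed."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ["".join(candidate) for candidate, score in ranked if score >= threshold]


# Made-up scores keyed by candidate (tuple of morphemes).
example_scores = {
    ("ノー", "スリーブ"): 120.0,
    ("フラワー", "プリント"): 85.5,
    ("ノー",): 3.2,
    ("スリーブ", "フラワー"): 0.0,
}
feature_words = select_feature_words(example_scores, threshold=60)
print(feature_words)
```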

<Operation of the feature word extraction device>
Next, the operation of the feature word extraction device 100 according to the present embodiment is described. First, when a set of Web texts containing product descriptions is input to the feature word extraction device 100, it is stored in the Web text storage unit 21. The feature word extraction device 100 then executes the feature word extraction processing routine shown in FIG. 5.

First, in step S101, the set of Web texts is read from the Web text storage unit 21. In step S102, the morphological analysis unit 22 performs morphological analysis on each Web text read in step S101.

In the next step S103, the katakana unknown word splitting unit 23 splits the katakana unknown words in each Web text based on that text's morphological analysis result. In step S104, the candidate acquisition unit 24 refers to the category word list database 25 and the extraction pattern database 26 and extracts candidate words, that is, feature word candidates, from each Web text that has been morphologically analyzed and whose katakana unknown words were split in step S103.

In step S105, the candidate acquisition unit 24 calculates the appearance frequency and site frequency of each candidate word extracted in step S104. In step S106, the score calculation unit 27 obtains, for each candidate word extracted in step S104, the number of morphemes it splits into, as well as the other candidate words that contain it and their appearance frequencies.

In step S107, the score calculation unit 27 calculates the score of each candidate word extracted in step S104 according to equation (1), using the appearance frequency and site frequency calculated in step S105 and the number of morphemes, the other candidate words, and their appearance frequencies obtained in step S106.

In step S108, the candidate words whose score calculated in step S107 is equal to or greater than a predetermined threshold are registered as feature words in the feature word dictionary database 29, and the feature word extraction processing routine ends.

As described above, the feature word extraction device according to the present embodiment acquires feature word candidates from a set of Web texts that have been morphologically analyzed and in which katakana unknown words have been split, and calculates a score from the appearance frequency of each candidate and the number of Web texts in which it appears. Feature words can therefore be extracted accurately and at an appropriate length. Extracting feature words also yields clue words for analyzing which products are being bought or attracting attention, enabling more detailed analysis in marketing and similar activities.

In addition, narrowing down the feature word candidates with the category word list and the appearance patterns allows feature words to be extracted with high precision.

Furthermore, katakana unknown words are split, and the score is calculated with a formula that is based on the unithood of a word and gives a low weight to words that occur only in the products of one particular site. As a result, even feature words that are katakana unknown words can be acquired at an appropriate length.

なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、商品以外のものを説明対象として、当該説明対象の特徴語を、説明文から抽出するようにしてもよい。 For example, the feature word of the description target may be extracted from the description sentence with the object other than the product as the description target.

また、形態素解析部を、外部の装置に設けてもよい。この場合には、特徴語抽出装置に、形態素解析済みのＷｅｂテキスト（説明文)の集合が入力されるようにし、形態素解析済みのＷｅｂテキスト（説明文)の集合に対して、カタカナ未知語分割部によって、カタカナ未知語が分割されるようにすればよい。 Further, the morphological analysis unit may be provided in an external device. In this case, a set of Web texts (descriptions) that have been subjected to morphological analysis is input to the feature word extraction device, and Katakana unknown word segmentation is performed on the set of Web texts (descriptions) that have been subjected to morphological analysis. The katakana unknown word may be divided by the part.

また、形態素解析部及びカタカナ未知語分割部を外部の装置に設けてもよい。この場合には、特徴語抽出装置に、カタカナ未知語が分割され、かつ、形態素解析済みのＷｅｂテキスト（説明文)の集合が入力されるようにし、当該Ｗｅｂテキスト（説明文)の集合から、候補獲得部によって、特徴語の候補が獲得されるようにすればよい。 Moreover, you may provide an external apparatus with a morphological analysis part and a katakana unknown word division | segmentation part. In this case, a set of Web texts (descriptions) that have been divided into katakana unknown words and have been subjected to morphological analysis are input to the feature word extraction device, and from the set of Web texts (descriptions), The candidate acquisition unit may acquire feature word candidates.

Although the case where Web text is the input has been described as an example, the invention is not limited to this; explanatory sentences of the explanation target other than Web text may be used as the input.

The present invention can also be realized by installing a program on a well-known computer via a medium or a communication line.

The feature word extraction device described above contains a computer system. When a WWW system is used, the term "computer system" also includes the homepage providing environment (or display environment).

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
21 Web text storage unit
22 Morphological analysis unit
23 Katakana unknown word division unit
24 Candidate acquisition unit
25 Category word list database
26 Extraction pattern database
27 Score calculation unit
28 Candidate output unit
29 Feature word dictionary database
100 Feature word extraction device

Claims

A feature word extraction method in a feature word extraction device that extracts, from explanatory sentences of an explanation target, feature words relating to the explanation target, the method comprising:
a step in which candidate acquisition means acquires feature word candidates from an input set of explanatory sentences in which Katakana unknown words have been divided and which have been morphologically analyzed, based on a previously obtained list of category words of the explanation target and a previously obtained appearance pattern of a category word and a feature word that modifies the category word, and, when a feature word candidate is composed of a plurality of morphemes obtained by dividing a Katakana unknown word, decomposes the feature word candidate and acquires each of the words obtained by the decomposition as a feature word candidate;
a step in which score calculation means calculates, for each of the acquired feature word candidates, a score indicating the degree to which the candidate is a feature word, based on the appearance frequency of the feature word candidate in the set of explanatory sentences and the number of explanatory sentences in which the feature word candidate appears; and
a step in which candidate output means outputs, as the feature words, the feature word candidates whose calculated score is equal to or greater than a threshold.

The feature word extraction method according to claim 1, wherein the appearance pattern of the category word and the feature word that modifies the category word is:
a pattern in which the category word appears together with a feature word that is an adjective, adjectival verb, or adverb modifying the category word;
a pattern in which the category word and the feature word appear as a compound word containing both; or
a pattern in which the category word appears together with a feature word that is a noun phrase relating to the category word.

The feature word extraction method according to claim 1 or 2, wherein, in the step of calculating the score, the score calculation means calculates the score for each of the acquired feature word candidates based on the appearance frequency of the feature word candidate, the number of explanatory sentences in which the feature word candidate appears, the number of morphemes constituting the feature word candidate, the total appearance frequency of other candidates that contain the feature word candidate, and the number of types of other candidates that contain the feature word candidate.

A feature word extraction device that extracts, from explanatory sentences of an explanation target, feature words relating to the explanation target, the device comprising:
candidate acquisition means for acquiring feature word candidates from an input set of explanatory sentences in which Katakana unknown words have been divided and which have been morphologically analyzed, based on a previously obtained list of category words of the explanation target and a previously obtained appearance pattern of a category word and a feature word that modifies the category word, and for, when a feature word candidate is composed of a plurality of morphemes obtained by dividing a Katakana unknown word, decomposing the feature word candidate and acquiring each of the words obtained by the decomposition as a feature word candidate;
score calculation means for calculating, for each of the acquired feature word candidates, a score indicating the degree to which the candidate is a feature word, based on the appearance frequency of the feature word candidate in the set of explanatory sentences and the number of explanatory sentences in which the feature word candidate appears; and
candidate output means for outputting, as the feature words, the feature word candidates whose calculated score is equal to or greater than a threshold.

A program for causing a computer to execute each step of the feature word extraction method according to any one of claims 1 to 3.