JP2009026195A

JP2009026195A - Article classification apparatus, article classification method and program

Info

Publication number: JP2009026195A
Application number: JP2007190834A
Authority: JP
Inventors: Tatsunori Mori; 辰則森; Jun Nishimura; 純西村; Rintaro Miyazaki; 林太郎宮崎; Naoto Maeda; 直人前田; Shorei O; 松齢翁; Yusuke Ishikawa; 雄介石川; Hiroyuki Kobayashi; 寛之小林
Original assignee: Yahoo Japan Corp; Yokohama National University NUC
Current assignee: Yahoo Japan Corp; Yokohama National University NUC
Priority date: 2007-07-23
Filing date: 2007-07-23
Publication date: 2009-02-05

Abstract

PROBLEM TO BE SOLVED: To make article search easier by accurately extracting attribute information on articles from article information. SOLUTION: A feature expansion part 321 expands into features a learning corpus in which attribute tags and attribute value tags have been attached in advance before and after the attributes of articles and the attribute values representing the contents of those attributes, and an article classification part 322 stores each expanded feature in an attribute learning DB 323 in association with an attribute tag or attribute value tag. When a new article information document is added, a feature expansion part 341 expands the article information document into features, and a tagging part 342 extracts data on attributes and/or attribute values from the feature-expanded article information document according to the classification results of the article classification part 322, that is, the attribute learning DB 323. The features include classification information that classifies the word containing each character by semantic similarity. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a product classification device, a product classification method, and a program.

In recent years, opportunities to purchase products through online auctions, online shopping, and similar services that use communication networks such as the Internet have been increasing. In online auctions and online shopping, products are searched for by following a hierarchically structured set of product categories, or by using product attribute information as keywords. For example, a user who wants to buy clothes searches the product information based on the attributes of the desired product ("color", "size", etc.) or on attribute values indicating the content of those attributes ("red", "yellow", etc. for "color"; "L", "M", etc. for "size").

The following techniques are known for product search. For example, as a system that searches by product color, an Internet shopping system has been proposed that stores and manages the color of each product as attribute data of that product and searches for matching products based on the color entered by the user on the purchase screen (see Patent Document 1).

In addition, a product search system has been proposed in which the company operating an online shop has product manufacturers report product attribute information, creates a database in which that attribute information is registered in advance, and performs searches based on the product attributes (see Patent Document 2).
Japanese Patent Laid-Open No. 2002-92020
Japanese Patent Laid-Open No. 2002-259401

In general, in online auctions and online shopping, the categories to which products belong are organized hierarchically, and product information is classified and managed by category. Accordingly, a user typically narrows down the category to which the desired product belongs and then searches within that category.

For example, when a user who wants a red product searches with the keyword "red" (赤), product information containing "red" is retrieved. If the product information contains personal names, store names, or product names that include the character for "red" (such as "Akai" (赤井) or "Akanishi" (赤西)), that product information may be retrieved as well, and the more categories are involved, the more pronounced this effect can become. That is, when a product search is performed with attributes desired by the user, such as color or size, a simple keyword search retrieves any product information containing the keyword, which makes it difficult to obtain the desired product information.

With the techniques described in Patent Documents 1 and 2, attribute information is registered in advance for each piece of product information, so accurate searches are possible. However, the attribute information must be entered and registered manually in advance, which is laborious, and adopting these techniques in a product search system was not realistic from an operational standpoint.

The present invention has been made in view of the above problems in the prior art, and its object is to improve the usability of product search by accurately extracting product attribute information from product information.

In order to solve the above problem, the invention described in claim 1 is a product classification device that classifies a product by attribute and/or by attribute value indicating the content of the attribute, based on a product description, the device comprising: first feature expansion means for expanding into features a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of the products described in the product descriptions; and classification means for classifying the learning documents by associating each feature expanded by the first feature expansion means with the attribute tag or the attribute value tag.

The invention described in claim 2 is the product classification device according to claim 1, further comprising: second feature expansion means for expanding into features an input product information document to be subjected to extraction; and extraction means for extracting attribute and/or attribute value data from the product information document expanded by the second feature expansion means, based on the classification result of the classification means.

The invention described in claim 3 is the product classification device according to claim 2, further comprising: storage means for storing the attribute and/or attribute value data extracted by the extraction means in association with the product information document from which the data was extracted; and search means for retrieving from the storage means the product information documents associated with attribute and/or attribute value data received from a client terminal, and transmitting them to the client terminal.

The invention described in claim 4 is the product classification device according to any one of claims 1 to 3, wherein one of the features used in the feature expansion is classification information in which the word containing each expansion unit is classified according to semantic similarity.

The invention described in claim 5 is the product classification device according to any one of claims 1 to 4, wherein an SVM (Support Vector Machine) is used as the classification method of the classification means.

The invention described in claim 6 is a product classification method for classifying a product by attribute and/or by attribute value indicating the content of the attribute, based on a product description, the method comprising: a first feature expansion step of expanding into features a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of the products described in the product descriptions; and a classification step of classifying the learning documents by associating each feature expanded in the first feature expansion step with the attribute tag or the attribute value tag.

The invention described in claim 7 is a program for causing a computer that classifies a product by attribute and/or by attribute value indicating the content of the attribute, based on a product description, to function as: first feature expansion means for expanding into features a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of the products described in the product descriptions; and classification means for classifying the learning documents by associating each feature expanded by the first feature expansion means with the attribute tag or the attribute value tag.

According to the inventions described in claims 1, 6, and 7, each feature obtained by expanding the learning documents into features is associated with an attribute tag or attribute value tag, so product attribute information can be extracted accurately from product information documents, and the extracted attribute information can be used to improve the usability of product search.

According to the invention described in claim 2, product attribute information can be extracted accurately from an input product information document based on the classification result of the classification means.

According to the invention described in claim 3, the attribute and/or attribute value data extracted from a product information document is associated with the product information document from which it was extracted, so product attribute information can be extracted accurately from product information documents and the usability of product search can be improved.

According to the invention described in claim 4, using classification information based on semantic similarity as a feature makes it possible to extract words with similar meanings as words indicating the same attribute or attribute value.

According to the invention described in claim 5, an SVM can be used as the classification method.

Hereinafter, an embodiment of the product classification device according to the present invention will be described with reference to the drawings.

[System configuration]
FIG. 1 shows the system configuration of the auction system 100. As shown in FIG. 1, the auction system 100 is configured by connecting a server device 10, which serves as the product classification device, and PCs (Personal Computers) 20a, 20b, 20c, ..., 20n (hereinafter referred to as PC 20) via a communication network N. The communication network N is a communication network such as the Internet or a telecommunications carrier's network, and connects the devices attached to it so that they can exchange data.

The server device 10 is an information processing device such as a PC or WS (Work Station), and classifies products by attribute and/or attribute value based on their product descriptions. The server device 10 also includes a product information DB (Data Base) 14 that manages product information. In communication sessions with the PC 20 using HTTP (HyperText Transfer Protocol) or the like, the server device 10 functions as a Web server. For example, the server device 10 accepts from the PC 20 product information on a product to be listed in the online auction and stores it in the product information DB 14 as a product information document. In addition, in response to a request from the PC 20 of a user who wishes to purchase a product, the server device 10 searches the product information documents stored in the product information DB 14 and returns the search results.

The PC 20 is a client terminal used by each user. The PC 20 functions as a Web browser. When a user lists a product in the auction, the PC 20 accepts input of product information on an input screen provided by the server device 10 and transmits the entered product information to the server device 10. When a user purchases a product, the PC 20 accepts input of a keyword on a search screen provided by the server device 10 and transmits the entered keyword to the server device 10.

[Configuration of server device]
FIG. 2 shows the configuration of the server device 10. As shown in FIG. 2, the server device 10 includes a CPU (Central Processing Unit) 11, a RAM (Random Access Memory) 12, a storage unit 13, the product information DB 14, an operation unit 15, a display unit 16, a communication unit 17, and the like, and these units are connected by a bus 18.

The CPU 11 performs overall control of the processing operations of each unit of the server device 10. Specifically, in response to an operation signal input from the operation unit 15 or an instruction signal received by the communication unit 17, the CPU 11 reads the various processing programs stored in the storage unit 13, loads them into a work area in the RAM 12, and performs various processes in cooperation with those programs.

The RAM 12 temporarily stores the various programs, data, parameters, and the like read from the storage unit 13 during the various processes executed by the CPU 11.

The storage unit 13 consists of a hard disk, nonvolatile semiconductor memory, or the like, and stores the various processing programs executed by the CPU 11, various data, and so on. For example, the learning documents are stored in the storage unit 13.

The product information DB 14 is a database for managing product information on the products that users have listed in the auction. For each listed product, the product information DB 14 stores a product information document, attributes, attribute values, and the like in association with one another. A product information document is text data containing a description of the product. An attribute is a characteristic of a product, for example "color" or "size". An attribute value is a value indicating the content of an attribute, for example "yellow" for the attribute "color" or "M" for the attribute "size". The attribute and attribute value fields of the product information DB 14 store the attribute information (attributes and attribute values) extracted from the product information document in the attribute information extraction process described later (see FIG. 7). During product search, product information documents can be searched using an attribute or attribute value as a keyword.
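The association between documents and extracted attribute information can be illustrated with a small in-memory index. This is only a sketch under stated assumptions: the class `ProductIndex`, its methods, and the two sample listings are hypothetical and are not taken from the patent, which does not specify the storage layout of the product information DB 14.

```python
from collections import defaultdict


class ProductIndex:
    """Hypothetical in-memory stand-in for the product information DB 14."""

    def __init__(self):
        self.documents = {}            # doc_id -> product information document
        self.index = defaultdict(set)  # (kind, term) -> set of doc_ids

    def add(self, doc_id, text, extracted):
        """Store a document with its extracted (kind, term) pairs.

        kind is "attr" for an attribute or "val" for an attribute value.
        """
        self.documents[doc_id] = text
        for kind, term in extracted:
            self.index[(kind, term)].add(doc_id)

    def search(self, kind, term):
        """Return ids of documents whose extracted data contains (kind, term)."""
        return sorted(self.index.get((kind, term), set()))


# Hypothetical sample listings (not from the patent).
index = ProductIndex()
index.add(1, "色は赤です。", [("attr", "色"), ("val", "赤")])
index.add(2, "出品者は赤井です。サイズはM。", [("attr", "サイズ"), ("val", "M")])

# Plain substring search also matches the seller name 赤井 in document 2.
keyword_hits = sorted(d for d, t in index.documents.items() if "赤" in t)

# Searching the extracted attribute values matches only the red product.
value_hits = index.search("val", "赤")
```

Here `keyword_hits` contains both documents while `value_hits` contains only document 1, which is the kind of false match on names such as 赤井 that the description aims to avoid.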

The operation unit 15 includes a keyboard with cursor keys, numeric keys, various function keys, and the like, and a pointing device such as a mouse, and outputs to the CPU 11 the instruction signals entered by key operations on the keyboard or by mouse operations.

The display unit 16 consists of an LCD (Liquid Crystal Display) and displays screens in accordance with the display signals input from the CPU 11.

The communication unit 17 is a communication interface for connecting to the communication network N and, under the control of the CPU 11, transmits data to and receives data from external devices.

FIG. 3 is a functional block diagram of the server device 10. As shown in FIG. 3, the server device 10 includes a learning corpus input unit 31, a machine learner 32, a product information input unit 33, an automatic extractor 34, the product information DB 14, a query reception unit 35, a product information search unit 36, and a search result transmission unit 37. The learning corpus input unit 31, machine learner 32, product information input unit 33, automatic extractor 34, and product information search unit 36 are each realized by the CPU 11 in cooperation with the various processing programs stored in the storage unit 13. The query reception unit 35 and the search result transmission unit 37 are realized by the communication unit 17.

The learning corpus input unit 31 is a functional unit that reads in and acquires the learning documents stored in the storage unit 13 (hereinafter referred to as the learning corpus). The learning corpus is text data for use in machine learning, in which attribute tags and attribute value tags have been attached in advance before and after the product attributes and attribute values that appear in the product descriptions.

The annotation of product descriptions performed when creating the learning corpus is described next. Annotation is the process of attaching tags in advance to the attribute information (attributes and attribute values) in the learning corpus, as a preparatory stage for the attribute information learning process (see FIG. 6).

Table 1 shows examples of the attributes and attribute values to be annotated.

As shown in Table 1, the attributes adopted are color, material, size, shape, condition, list price, place of manufacture, season/model, design, and other. These were chosen because they reduce disagreement among the annotators who create the learning corpus and because users feel they are needed as search targets. This embodiment is described using the fashion category as an example.

FIG. 4 shows example data from a learning corpus to which attribute tags (<attr>, </attr>) and attribute value tags (<val>, </val>) have been attached. As shown in FIG. 4, attributes such as "color" (色) and "size" (サイズ) are enclosed in attribute tags, as in <attr>色</attr> and <attr>サイズ</attr>, and attribute values such as "yellow" (黄) for "color" and "M" for "size" are enclosed in attribute value tags, as in <val>黄</val> and <val>Ｍ</val>.

When annotating, an attribute or attribute value that does not appear as part of an attribute/attribute-value pair is still annotated on its own. When an attribute and an attribute value form a single compound word, the compound is split and each part is annotated separately (for example, 黄色 "yellow color" becomes <val>黄</val><attr>色</attr>). In compound words, the attribute value usually comes first, followed by the attribute. When attributes have a hierarchical structure, each is annotated individually without regard to the hierarchy (for example, <attr>サイズ</attr> (size), <attr>肩幅</attr> (shoulder width), <attr>着丈</attr> (garment length), <attr>身幅</attr> (body width), etc.).
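As an illustration of how such annotated text might be read, the following sketch strips the tags and records where each attribute and attribute value lies in the plain text. The function name `parse_annotated`, the regular expression, and the first sample sentence (in particular the `<attr>` tag around 素材) are assumptions made for this example; the patent does not prescribe a parsing procedure.

```python
import re

# Matches <attr>...</attr> or <val>...</val>; nesting is assumed not to occur.
TAG_PATTERN = re.compile(r"<(attr|val)>(.*?)</\1>", re.DOTALL)


def parse_annotated(text):
    """Return (plain_text, spans), where each span is (start, end, kind).

    start and end are character offsets into plain_text (end exclusive),
    and kind is "attr" or "val".
    """
    plain_parts = []
    spans = []
    pos = 0       # read position in the annotated input
    out_len = 0   # number of plain-text characters emitted so far
    for match in TAG_PATTERN.finditer(text):
        before = text[pos:match.start()]
        plain_parts.append(before)
        out_len += len(before)
        content = match.group(2)
        spans.append((out_len, out_len + len(content), match.group(1)))
        plain_parts.append(content)
        out_len += len(content)
        pos = match.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), spans


plain, spans = parse_annotated("<attr>素材</attr>は<val>レーヨン</val>。")
compound_plain, compound_spans = parse_annotated("<val>黄</val><attr>色</attr>")
```

The second call shows the compound-word case: the two adjacent tags yield the plain text 黄色 with one value span followed by one attribute span.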

The machine learner 32 is a functional unit that performs machine learning based on the plurality of learning corpora acquired by the learning corpus input unit 31, and includes a feature expansion unit 321, a product classification unit 322, and an attribute learning DB 323.

The feature expansion unit 321 expands the plurality of learning corpora into features. Specifically, the feature expansion unit 321 splits the learning corpus into individual characters and extracts features for each character. In this embodiment, the features used are the surface character, the character type, the part of speech, and the classification number in a thesaurus.

FIG. 5 shows an example of feature expansion. In the example shown in FIG. 5, the character string 「素材はレーヨン。」 ("The material is rayon.") is analyzed right to left (from the end of the sentence toward the beginning), using five characters in total for each target character: the target character itself and the two characters on either side of it.
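The five-character window and the end-to-start processing order can be sketched as follows. The function `context_window`, the padding symbol `<PAD>` used beyond the ends of the string, and the variable names are assumptions for illustration; the patent does not say how positions outside the string are represented.

```python
def context_window(chars, index, width=2, pad="<PAD>"):
    """Return the target character with `width` characters on each side.

    Positions outside the string are filled with `pad` (an assumed
    convention, not specified in the source).
    """
    window = []
    for offset in range(-width, width + 1):
        j = index + offset
        window.append(chars[j] if 0 <= j < len(chars) else pad)
    return window


chars = list("素材はレーヨン。")

# Process positions from the end of the sentence toward the beginning.
order = list(range(len(chars) - 1, -1, -1))
windows = {i: context_window(chars, i) for i in order}
```

For the target character レ (index 3), the window is 材, は, レ, ー, ヨ; near the start of the string the missing left context is padded.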

The surface character is the character itself, obtained by splitting the text into individual characters.
The character type indicates whether the character is kanji (KANJI), hiragana (HIRAG), katakana (KATAK), or something else (OTHER).
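One way to assign these four character types is by Unicode block, as in the sketch below. The function `char_type` and the specific code point ranges are assumptions; the patent only names the four categories, and it does not say, for instance, how the prolonged sound mark ー is treated (here it falls in the Katakana block).

```python
def char_type(ch):
    """Classify one character as KANJI, HIRAG, KATAK, or OTHER.

    The ranges are the basic Unicode blocks and are an assumption:
    CJK Unified Ideographs, Hiragana, and Katakana.
    """
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "KANJI"
    if 0x3041 <= code <= 0x309F:
        return "HIRAG"
    if 0x30A0 <= code <= 0x30FF:
        return "KATAK"
    return "OTHER"


types = [char_type(c) for c in "素材はレーヨン。"]
```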

The part of speech is that of the word to which the character belongs; for nouns, it includes finer subcategories such as "common noun" and "proper noun". The feature expansion unit 321 refers to a word-to-part-of-speech conversion dictionary stored in advance in the storage unit 13 and extracts the part of speech of the word to which the character belongs. The feature expansion unit 321 also prefixes the feature with a symbol indicating the position of the target character within the word: "B" for the first character of the word, "E" for the last character, and "I" for the characters in between. A word consisting of a single character is given "S".
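The position prefix can be sketched as below, assuming the text has already been segmented into (word, part of speech) pairs. The function `pos_features`, the hyphen used to join the prefix and the part of speech, the English part-of-speech labels, and the sample segmentation are all assumptions for illustration; in the patent the part of speech comes from a dictionary in the storage unit 13.

```python
def pos_features(tokens):
    """Expand (word, part_of_speech) pairs into one feature per character.

    Each feature is the position mark (B, I, E, or S) joined to the part
    of speech with a hyphen; the joining format is an assumption.
    """
    features = []
    for word, pos in tokens:
        n = len(word)
        for i in range(n):
            if n == 1:
                mark = "S"   # single-character word
            elif i == 0:
                mark = "B"   # first character of the word
            elif i == n - 1:
                mark = "E"   # last character of the word
            else:
                mark = "I"   # character in between
            features.append(f"{mark}-{pos}")
    return features


# Hypothetical segmentation of 「素材はレーヨン。」.
tokens = [("素材", "noun"), ("は", "particle"), ("レーヨン", "noun"), ("。", "symbol")]
features = pos_features(tokens)
```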

The classification number is a kind of classification information in which the word containing each character is classified according to semantic similarity. The feature expansion unit 321 refers to a word-to-classification-number conversion dictionary stored in advance in the storage unit 13 and extracts the classification number of the word to which the character belongs. In this embodiment, the numbers assigned to each word in the Kadokawa Ruigo Shinjiten (角川類語新辞典; Kadokawa Shoten (registered trademark)) are used as classification numbers. The vocabulary classification of this thesaurus is decimal, and the classification number is a three-digit number formed by concatenating the item numbers at its three levels: major category, middle category, and minor category. For example, words related to color, such as "purple" (紫), "red" (赤), "green" (グリーン), and "color" (カラー), are assigned the classification number "143". When the classification number is used as a feature, words with similar meanings receive the same classification number, so they are treated as instances sharing the same feature even if their surface characters differ.
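The lookup can be sketched with a small dictionary. Only the four color words and the number "143" come from the description above; the function `class_number_features`, the table name, and the placeholder `UNK` for words missing from the dictionary are assumptions.

```python
# Word -> classification number. The four entries are the examples given
# in the description; a real table would come from the thesaurus.
CLASS_NUMBERS = {"紫": "143", "赤": "143", "グリーン": "143", "カラー": "143"}


def class_number_features(words, table=CLASS_NUMBERS, unknown="UNK"):
    """Give every character the classification number of its word.

    Words not in the table get the `unknown` placeholder (an assumption).
    """
    features = []
    for word in words:
        features.extend([table.get(word, unknown)] * len(word))
    return features


red = class_number_features(["赤"])
green = class_number_features(["グリーン"])
mixed = class_number_features(["赤", "の", "グリーン"])
```

Although 赤 and グリーン share no surface characters, every character of both words carries the same feature value, which is what lets the learner generalize across color words.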

The product classification unit 322 classifies the learning corpus by associating each feature expanded by the feature expansion unit 321 with an attribute tag or attribute value tag. Specifically, based on the attribute tags and attribute value tags attached to the learning corpus, the product classification unit 322 assigns to each character of the corpus a classification tag indicating whether that character is part of an attribute or an attribute value. A classification tag consists of a symbol representing the position of the character within its chunk and the type of the chunk ("attr" for an attribute, "val" for an attribute value), joined by a hyphen. In the IOE2 scheme, the chunk encoding method used in this embodiment, the last character of a chunk is given "E" and the characters before it are given "I". Characters that are not part of any chunk are given "O". In the example shown in FIG. 5, the characters 「レ」, 「ー」, and 「ヨ」 of 「レーヨン」 (rayon) are given "I-val", and 「ン」 is given "E-val".
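The IOE2 encoding described here can be sketched as a function from chunk spans to per-character tags. The function `ioe2_labels` and the (start, end, kind) span representation with an exclusive end offset are assumptions for illustration.

```python
def ioe2_labels(length, spans):
    """Encode chunk spans as one IOE2 classification tag per character.

    spans holds (start, end, kind) tuples with end exclusive; kind is
    "attr" or "val". Characters outside every span are tagged "O".
    """
    labels = ["O"] * length
    for start, end, kind in spans:
        if end <= start:
            continue  # ignore empty spans
        for i in range(start, end - 1):
            labels[i] = f"I-{kind}"      # characters before the last one
        labels[end - 1] = f"E-{kind}"    # last character of the chunk
    return labels


text = "素材はレーヨン。"
labels = ioe2_labels(len(text), [(3, 7, "val")])   # レーヨン occupies 3..6
single = ioe2_labels(2, [(0, 1, "val"), (1, 2, "attr")])  # 黄 + 色
```

Note that a one-character chunk receives only the "E" tag, since there are no earlier characters to mark with "I".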

The product classification unit 322 stores each feature obtained for each character by expanding the learning corpus, in association with the assigned classification tag, in the attribute learning DB 323. This association is what the machine learner 32 learns, and it corresponds to the classification of the learning corpus. Specifically, the product classification unit 322 uses an SVM (Support Vector Machine) or the like as the classification method to generate a classifier that, from the information enclosed by the broken line in FIG. 5, yields the classification tag "I-val" corresponding to the target character 「レ」. The classification method of the product classification unit 322 is not limited to an SVM; a neural network or the like may be used instead.

The attribute learning DB 323 is a database in which the features obtained from the plurality of learning corpora are associated with classification tags. The attribute learning DB 323 is stored in the storage unit 13.

The product information input unit 33 is a functional unit that reads in and acquires the product information documents for products newly listed in the online auction, which are transmitted from each PC 20 and received by the communication unit 17.

The automatic extractor 34 is a functional unit that automatically extracts attributes and attribute values from newly input product information documents based on the attribute learning DB 323 obtained by the machine learner 32, and includes a feature expansion unit 341 and a tagging unit 342.

The feature expansion unit 341 expands the product information document acquired by the product information input unit 33 into features. Specifically, the feature expansion unit 341 splits the product information document into individual characters and extracts the features (surface character, character type, part of speech, and classification number) for each character. The details of the feature expansion process are the same as for the feature expansion unit 321 and are therefore omitted.

The tagging unit 342 extracts attribute and/or attribute value data from the product information document feature-expanded by the feature expansion unit 341, based on the classification result of the product classification unit 322, that is, on the attribute learning DB 323. Specifically, based on the features obtained for each character of the feature-expanded product information document and on the attribute learning DB 323, the tagging unit 342 assigns to those features the classification tag associated with them. For example, when the tagging unit 342 estimates that the target character belongs to an attribute, it assigns "I-attr" or "E-attr" according to the character's position within the chunk; when it estimates that the target character belongs to an attribute value, it assigns "I-val" or "E-val" according to the position within the chunk; and when it estimates that the target character belongs to neither, it assigns "O". An SVM or the like is used as the tagging unit 342.

The tagging unit 342 also extracts attributes and/or attribute values from the product information document based on the assigned classification tags. Specifically, the tagging unit 342 extracts words tagged "I-attr" or "E-attr" as attributes and words tagged "I-val" or "E-val" as attribute values, and stores the extracted attributes and/or attribute values in the product information DB 14 in association with the product information document from which they were extracted.
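
The step from per-character tags to extracted words can be sketched as follows. This is a minimal illustration that assumes every chunk ends with an "E-" tag, as in the IOE2 encoding used in the experimental example. The function name `extract_chunks` and the choice to discard a chunk that never reaches an "E-" tag are assumptions of this sketch.

```python
def extract_chunks(chars, tags):
    """Collect (kind, text) pairs from characters tagged I-/E-/O.

    kind is "attr" or "val". A chunk is emitted when its E- tag is seen;
    a run of I- tags that is interrupted before an E- tag is discarded
    (an assumption of this sketch).
    """
    chunks = []
    buf, kind = [], None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buf, kind = [], None
            continue
        prefix, label = tag.split("-", 1)
        if kind is not None and label != kind:
            buf = []  # label changed without an E- tag: drop the partial chunk
        kind = label
        buf.append(ch)
        if prefix == "E":
            chunks.append((kind, "".join(buf)))
            buf, kind = [], None
    return chunks


# Example: サイズ：Ｌ ("size: L")
chars = list("サイズ：Ｌ")
tags = ["I-attr", "I-attr", "E-attr", "O", "E-val"]
print(extract_chunks(chars, tags))
```

The resulting pairs are what would be stored alongside the product information document.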

Note that the processing up to the point where the tagging unit 342 assigns classification tags, making attributes and attribute values identifiable within the product description, may itself be treated as extraction of attributes and attribute values.

The query receiving unit 35 receives a search query transmitted from a PC 20. The search query contains attributes and/or attribute values as keywords for searching product information documents.

The product information search unit 36 searches the product information documents stored in the product information DB 14 based on the attribute or attribute value contained in the search query received by the query receiving unit 35, extracts from the product information DB 14 the product information documents associated with the search keyword (attribute or attribute value), and outputs the search result to the search result transmission unit 37.
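
One way to picture this lookup is an inverted index from extracted words to document identifiers. The sketch below is illustrative only: the class name `ProductInfoDB`, its methods, and the rule that multiple keywords are combined with AND are assumptions, not details taken from the embodiment.

```python
from collections import defaultdict


class ProductInfoDB:
    """Toy stand-in for the product information DB 14 (hypothetical API)."""

    def __init__(self):
        self.docs = {}                 # doc_id -> document text
        self.index = defaultdict(set)  # extracted word -> set of doc_ids

    def store(self, doc_id, text, extracted):
        """Store a document with its extracted (kind, word) pairs."""
        self.docs[doc_id] = text
        for _kind, word in extracted:
            self.index[word].add(doc_id)

    def search(self, keywords):
        """Return ids of documents associated with every keyword (AND, assumed)."""
        hits = None
        for kw in keywords:
            ids = self.index.get(kw, set())
            hits = set(ids) if hits is None else hits & ids
        return sorted(hits or [])


db = ProductInfoDB()
db.store(1, "色：赤 サイズ：Ｌ", [("attr", "色"), ("val", "赤"), ("attr", "サイズ"), ("val", "Ｌ")])
db.store(2, "色：青", [("attr", "色"), ("val", "青")])
print(db.search(["赤"]))
```

Because the index is keyed by extracted words rather than by category, the same query can reach documents in any category.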

The search result transmission unit 37 transmits the search result obtained by the product information search unit 36 to the PC 20.

[Operation of server device]
Next, the operation of the server device 10 will be described.
FIG. 6 is a flowchart showing the attribute information learning process. The attribute information learning process is performed in advance, before the server device 10 provides the auction system 100, and is realized as software processing through cooperation between the CPU 11 and a program stored in the storage unit 13.

First, the learning corpus input unit 31 inputs and acquires the learning corpus stored in the storage unit 13 (step S1). Next, the feature expansion unit 321 feature-expands the learning corpus character by character (step S2). Specifically, the feature expansion unit 321 divides the learning corpus into individual characters (step S21) and, for each character, extracts the surface character, character type, part of speech, and classification number as features (step S22).

Next, the product classification unit 322 classifies the learning corpus by associating each expanded feature with an attribute tag or attribute value tag (step S3). Specifically, based on the attribute tags or attribute value tags attached to the learning corpus, the product classification unit 322 assigns a classification tag to each character obtained by dividing the learning corpus into characters.
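
The conversion from an annotated corpus to per-character classification tags can be sketched as below. The inline markup `<attr>...</attr>` and `<val>...</val>` is a hypothetical stand-in for the attribute tags and attribute value tags of the learning corpus, whose actual notation is shown in FIG. 4 and not reproduced here. The function name `encode_ioe2` is likewise an assumption.

```python
import re

# Hypothetical inline markup for attribute and attribute value spans.
TAGGED = re.compile(r"<(attr|val)>(.*?)</\1>")


def encode_ioe2(annotated):
    """Return (chars, tags): the last character of each span gets E-, the rest I-."""
    chars, tags = [], []
    pos = 0
    for m in TAGGED.finditer(annotated):
        for ch in annotated[pos:m.start()]:  # text outside any span
            chars.append(ch)
            tags.append("O")
        label, body = m.group(1), m.group(2)
        for i, ch in enumerate(body):
            prefix = "E" if i == len(body) - 1 else "I"
            chars.append(ch)
            tags.append(f"{prefix}-{label}")
        pos = m.end()
    for ch in annotated[pos:]:  # trailing text after the last span
        chars.append(ch)
        tags.append("O")
    return chars, tags


chars, tags = encode_ioe2("<attr>サイズ</attr>：<val>Ｌ</val>")
print(list(zip(chars, tags)))
```

Pairing each tag with the features of its character yields the training examples that are stored in the attribute learning DB 323.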

Next, the product classification unit 322 stores each feature obtained for each character by feature expansion of the learning corpus in the attribute learning DB 323, in association with its classification tag (step S4).
This completes the attribute information learning process.

Next, the attribute information extraction process will be described with reference to FIG. 7. The attribute information extraction process is performed when product information for an item is transmitted from a PC 20, through an operation by a user who wishes to list the item in the auction, and is received by the communication unit 17. It is realized as software processing through cooperation between the CPU 11 and a program stored in the storage unit 13.

First, the product information input unit 33 inputs and acquires the product information document to be processed (step S5). Next, the feature expansion unit 341 feature-expands the product information document character by character (step S6). Specifically, the feature expansion unit 341 divides the product information document into individual characters (step S61) and, for each character, extracts the surface character, character type, part of speech, and classification number as features (step S62).

Next, the tagging unit 342 extracts attribute and/or attribute value data from the feature-expanded product information document based on the classification result of the product classification unit 322, that is, on the attribute learning DB 323 (step S7). Specifically, based on the features obtained for each character of the feature-expanded product information document and on the attribute learning DB 323, the tagging unit 342 assigns to those features the classification tag associated with them, and attributes and/or attribute values are extracted from the product information document based on the assigned tags. The words extracted as attributes or attribute values are stored by the tagging unit 342 in the product information DB 14 in association with the product information document (step S8).
This completes the attribute information extraction process.

As described above, the server device 10 associates each feature obtained by feature expansion of the learning corpus with an attribute tag or attribute value tag (the classification tags "I-attr" and "E-attr" corresponding to attribute tags, or "I-val" and "E-val" corresponding to attribute value tags). Attribute information can therefore be extracted accurately from product information documents, and the extracted attribute information can be used to make product search easier to use.

In addition, based on the classification result produced by the machine learner 32 from the learning corpus, the automatic extractor 34 can accurately extract product attribute information from product information documents. Products can also be classified automatically by applying the classification method to product descriptions written by users in online auctions and online shopping.

Furthermore, by using the classification number as a feature, words with similar meanings can be extracted as words indicating the same attribute or attribute value. Extracting attribute names and attribute values through machine-learned sequence labeling over character strings, combined with information from a thesaurus, reduces the drop in extraction accuracy that occurs when the product information document to be processed belongs to a category different from the one used for learning.

In addition, because the attribute and/or attribute value data extracted from a product information document is associated with that document, product search becomes easier to use.

For example, extracting attributes and attribute values from product information documents makes it possible to search attribute information across multiple product categories, and the accurately extracted attributes and attribute values can be registered in the search system as an index and used for search.

Furthermore, the attribute information of a listed item, which is determined by its category (color, material, size, shape, and so on), is treated as a pair of an attribute name (such as "color") and an attribute value (such as "red"), and these pairs are extracted in advance from each item's product information using a machine learning method. As a result, a search with the query "red", for example, can retrieve product information whose extracted attribute and attribute value indicate the color red. Likewise, a search with the queries "red" and "shirt" can retrieve red shirts from multiple categories, so product information can be searched across categories.

The description in the above embodiment is an example of the product classification device according to the present invention, and the invention is not limited to it. The detailed configuration and detailed operation of each part of the device may be changed as appropriate without departing from the spirit of the invention.

For example, in the above embodiment the learning documents and the product information documents to be processed are feature-expanded character by character, but they may instead be feature-expanded by morpheme, the smallest grammatical unit.

In addition, although the attributes shown in Table 1 were given as examples of attributes describing product characteristics, the category to which a product belongs may also be treated as an attribute, and the attribute name and/or attribute value extracted accordingly. In that case, learning corpora from various categories can be input to learn category-related attributes and attribute values, so that a category such as attribute name "category" with attribute value "fashion" can be associated with the product information. This enables product search with a category specified as an attribute, and product information belonging to multiple categories can be retrieved by specifying categories.

In addition, although the above embodiment describes the case where the product classification device is applied to the auction system 100, it may also be applied to an online shopping system that offers products over a communication network.

[Experimental example]
Experiments 1 to 4 were conducted to examine the accuracy of extracting attribute information from product information documents.
For chunking, YamCha, a general-purpose chunker based on SVM, was used. Feature expansion was performed with the character as the unit, and chunking was analyzed in the leftward direction. The IOE2 method was used to encode attribute and attribute value chunks, and the context length was five characters in total: the target character plus two characters on each side.
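
The five-character context can be pictured as a sliding window over the feature-expanded characters. The sketch below is not YamCha's interface. It is a plain-Python illustration in which the function name `window_features`, the feature key format, and the "PAD" value for positions beyond the text are all assumptions. A chunker may also feed in tags already decided for neighboring characters; that is omitted here.

```python
def window_features(rows, i, width=2):
    """Gather the features of character i and of `width` characters on each side.

    rows is a list of per-character feature dicts with the keys
    "surface", "ctype", "pos" and "class_no". Positions outside the
    text are filled with the placeholder "PAD".
    """
    feats = {}
    for off in range(-width, width + 1):
        j = i + off
        row = rows[j] if 0 <= j < len(rows) else None
        for key in ("surface", "ctype", "pos", "class_no"):
            feats[f"{key}[{off:+d}]"] = row[key] if row else "PAD"
    return feats


# Toy input with dummy feature values
rows = [{"surface": c, "ctype": "alphabet", "pos": "UNK", "class_no": "UNK"} for c in "ABCDE"]
print(window_features(rows, 0)["surface[+2]"])
```

With four features per character and five positions, each target character is described by 20 feature values, which a classifier such as an SVM would map to a tag.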

As experimental data, product information from the fashion category was taken from items listed on Yahoo! (registered trademark) Auctions. To eliminate the influence of description styles specific to individual sellers, care was taken not to use more than one item of product information from the same seller. Details of the data used are shown below.

(A) Apparel (men's) - Tops - Shirts - Short sleeve: 150 pages
Attributes: 1,422 in total / 149 distinct
Attribute values: 1,794 in total / 512 distinct
(B) Apparel (women's) - Tops - Tank tops, camisoles: 150 pages
Attributes: 723 in total / 91 distinct
Attribute values: 1,245 in total / 381 distinct

When used as learning documents (hereinafter, learning data), the above product information (experimental data (A) or (B)) was annotated according to the method described earlier, and the resulting documents (text data) were used.

<Experiment 1>
In Experiment 1, experimental data (A) "Apparel (men's) - Tops - Shirts - Short sleeve" was used, with the features other than the classification number, namely surface character, character type, and part of speech. For evaluation, five-fold cross-validation was performed with product information as the unit, and the average precision and recall were obtained, where
Precision = number of attribute information items correctly extracted / number of attribute information items extracted by the server device
Recall = number of attribute information items correctly extracted / number of attribute information items in the data
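
The two measures can be computed directly from the set of extracted items and the set of correct items. In the sketch below, the function name `precision_recall`, the representation of each item as a (kind, text) pair, and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against the correct ones."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0  # 0.0 by convention
    recall = correct / len(gold) if gold else 0.0               # 0.0 by convention
    return precision, recall


gold = {("attr", "色"), ("val", "赤"), ("attr", "サイズ"), ("val", "Ｌ")}
extracted = {("attr", "色"), ("val", "赤")}
print(precision_recall(extracted, gold))  # (1.0, 0.5)
```

In this toy case everything extracted is correct, but only half of the attribute information in the data was found.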

Table 2 shows the extraction accuracy in Experiment 1.

As shown in Table 2, precision and recall for attributes were both 80% or higher, so attribute information could be extracted with reasonable accuracy. For attribute values, the figures stayed in the 70% range. This is attributed to the larger variety of targets to extract compared with attributes, which held extraction accuracy down. Increasing the amount of learning data is considered likely to resolve this.

<Experiment 2>
Experiment 2 examined, for attribute information extraction by machine learning, the influence of using the surface character as a feature and the effect of using the classification number of the Kadokawa New Thesaurus (角川類語新辞典) as a feature. Experimental data (A) "Apparel (men's) - Tops - Shirts - Short sleeve" was used, with features chosen according to the following conditions.

Condition (a): surface character, character type, part of speech
Condition (b): surface character, character type, part of speech, classification number
Condition (c): character type, part of speech
Condition (d): character type, part of speech, classification number

For evaluation, five-fold cross-validation was performed with product information as the unit, and the average precision and recall were obtained.
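
Five-fold cross-validation with product information as the unit means that whole documents, not individual characters, are assigned to folds. A minimal sketch follows. The function name `five_fold` and the round-robin assignment of documents to folds are assumptions, since the text does not say how the folds were formed.

```python
def five_fold(doc_ids, k=5):
    """Yield (train, test) lists of document ids for k-fold cross-validation.

    Documents are dealt to folds round-robin (an assumption of this sketch).
    """
    folds = [doc_ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# 150 pages of product information, as in experimental data (A)
for train, test in five_fold(list(range(150))):
    print(len(train), len(test))
```

Precision and recall would be computed on each test fold and then averaged over the five runs.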

Table 3 shows the extraction accuracy in Experiment 2.

Comparing the results of conditions (a) and (b) confirms a small benefit from using the classification number as a feature. At the same time, it shows that the benefit of the classification number is limited because the classifier depends heavily on the surface character.

Comparing the results of conditions (c) and (d) shows that when the surface character is not used as a feature, the classification number contributes very effectively to accuracy. In other words, a reasonable level of extraction accuracy can be maintained using only features that do not depend on the surface form, so new attributes and attribute values that do not appear in the learning data can be expected to be extractable as long as they are expressions found in an existing thesaurus. This is considered particularly useful for extracting attributes and attribute values from product information on products in new fields.

<Experiment 3>
Experiment 3 examined the relationship between the amount of learning data and extraction accuracy. Experimental data (A) "Apparel (men's) - Tops - Shirts - Short sleeve" was used, with surface character, character type, part of speech, and classification number as features. Evaluation used five-fold cross-validation with product information as the unit. Learning data was drawn incrementally from all the data in the four groups used for learning, increasing the amount used step by step, and the average precision and recall were obtained.

FIG. 8 shows the relationship between data amount and accuracy in attribute extraction, and FIG. 9 shows the relationship between data amount and accuracy in attribute value extraction. In FIGS. 8 and 9, the horizontal axis is the number of pages of product information documents used as learning data, and the vertical axis is precision or recall.

As shown in FIGS. 8 and 9, precision reaches a reasonable level with a small amount of data, whereas recall improves as the amount of data increases. Because the rise in recall has not yet saturated, adding more data may improve it further.

<Experiment 4>
Experiment 4 examined the accuracy of automatic extraction when the learning data and the product information documents used as new extraction targets (hereinafter, test data) belong to different groups. The average precision and recall were obtained under the following conditions.

Condition (a): Learning data: experimental data (A)
Test data: experimental data (B)
Features used: surface character, character type, part of speech
Condition (b): Learning data: experimental data (A)
Test data: experimental data (B)
Features used: character type, part of speech, classification number
Condition (c): Learning data: experimental data (B)
Test data: experimental data (A)
Features used: surface character, character type, part of speech
Condition (d): Learning data: experimental data (B)
Test data: experimental data (A)
Features used: character type, part of speech, classification number

Note that, within "Apparel - Tops", experimental data (A) "Apparel (men's) - Tops - Shirts - Short sleeve" and experimental data (B) "Apparel (women's) - Tops - Tank tops, camisoles" have low similarity in the attribute information that appears.

Table 4 shows the extraction accuracy in Experiment 4.

Compared with automatic extraction using product information from the same group (Apparel (men's) - Tops - Shirts - Short sleeve), as in Experiments 1 and 2, accuracy is lower overall. This is because the attribute information appearing in the documents is less similar.

Comparing condition (a) with (b) and condition (c) with (d) also shows that extraction accuracy is higher with the surface character than with the classification number. However, the gap in accuracy between using the surface character and using the classification number is smaller than it was with product information from the same group. This suggests that the classification number becomes more beneficial as the similarity of the attribute information that appears decreases.

FIG. 1 is a system configuration diagram of an auction system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the server device.
FIG. 3 is a functional block diagram of the server device.
FIG. 4 is a data example of a learning corpus to which attribute tags and attribute value tags have been attached.
FIG. 5 is a diagram for explaining feature expansion.
FIG. 6 is a flowchart showing the attribute information learning process.
FIG. 7 is a flowchart showing the attribute information extraction process.
FIG. 8 is a diagram showing the relationship between data amount and accuracy in attribute extraction.
FIG. 9 is a diagram showing the relationship between data amount and accuracy in attribute value extraction.

Explanation of symbols

10 Server device
11 CPU
12 RAM
13 Storage unit
14 Product information DB
15 Operation unit
16 Display unit
17 Communication unit
18 Bus
20 PC
31 Learning corpus input unit
32 Machine learner
321 Feature expansion unit
322 Product classification unit
323 Attribute learning DB
33 Product information input unit
34 Automatic extractor
341 Feature expansion unit
342 Tagging unit
35 Query receiving unit
36 Product information search unit
37 Search result transmission unit
100 Auction system
N Communication network

Claims

1. A product classification device that classifies products, based on product descriptions, by attribute and/or by attribute value indicating the content of the attribute, comprising:
first feature expansion means for feature-expanding a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of products described in product descriptions; and
classification means for classifying the learning documents by associating each feature expanded by the first feature expansion means with the attribute tag or the attribute value tag.

2. The product classification device according to claim 1, further comprising:
second feature expansion means for feature-expanding an input product information document to be subjected to extraction; and
extraction means for extracting attribute and/or attribute value data from the product information document feature-expanded by the second feature expansion means, based on the classification result of the classification means.

3. The product classification device according to claim 2, further comprising:
storage means for storing the attribute and/or attribute value data extracted by the extraction means in association with the product information document from which the data was extracted; and
search means for searching the storage means for product information documents associated with attribute and/or attribute value data received from a client terminal, and transmitting them to the client terminal.

4. The product classification device as described, wherein one of the features used in the feature expansion is classification information that classifies the word containing each expansion unit subjected to the feature expansion according to semantic similarity.

5. The product classification device according to any one of claims 1 to 4, wherein an SVM (Support Vector Machine) is used as the classification method of the classification means.

6. A product classification method for classifying products, based on product descriptions, by attribute and/or by attribute value indicating the content of the attribute, comprising:
a first feature expansion step of feature-expanding a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of products described in product descriptions; and
a classification step of classifying the learning documents by associating each feature expanded in the first feature expansion step with the attribute tag or the attribute value tag.

7. A program for causing a computer that classifies products, based on product descriptions, by attribute and/or by attribute value indicating the content of the attribute, to function as:
first feature expansion means for feature-expanding a plurality of learning documents in which attribute tags and attribute value tags have been attached in advance to the attributes and attribute values of products described in product descriptions; and
classification means for classifying the learning documents by associating each feature expanded by the first feature expansion means with the attribute tag or the attribute value tag.