JP2013522719A

JP2013522719A - Product category classification

Info

Publication number: JP2013522719A
Application number: JP2012557037A
Authority: JP
Inventors: ジョーン・リーン; リウ・ホワレイ
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2010-03-09
Filing date: 2011-03-02
Publication date: 2013-06-13

Abstract

[Solution] Product categorization includes: extracting the titles of a plurality of products from obtained data; dividing the titles into phrases; determining a respective score for each phrase; constructing a first word sequence for a first product of the plurality of products using at least one of the phrases, based at least in part on the determined scores; comparing the first word sequence with a second word sequence for a second product of the plurality of products; and, based at least in part on the comparison, merging the first product and the second product into one product category.
[Selected drawing] FIG. 1

Description

[Cross-reference to related applications]
This application claims priority to Chinese Patent Application No. 201010122141.2, entitled "METHOD AND DEVICE FOR CATEGORIZING DATA" and filed on March 9, 2010, which is incorporated herein by reference for all purposes.

The present invention relates to data processing technology and, more particularly, to methods and systems for categorizing product data.

E-commerce websites generally store the various data describing their products in forms such as text and data tables. Because such websites carry a large amount of data, the descriptive data for all products forms a massive body of information. This raises the problem of how to manage the data effectively, especially for similar products.

E-commerce websites commonly use clustering techniques to categorize product data. A typical clustering technique sorts product data into categories based on a predetermined set of rules and conditions (e.g., similar products are placed in the same category).

One commonly used clustering method is hierarchical clustering, which follows a bottom-up policy. In a typical bottom-up policy, each object to be categorized is initially treated as a separate atomic cluster. These atomic clusters are then merged to form new clusters at higher levels until all objects belonging to the same category are grouped together or until a termination condition is met.
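As a rough illustration of the bottom-up policy described above, the following Python sketch starts with one cluster per object and repeatedly merges the closest pair of clusters. The function name `agglomerate`, the one-dimensional sample points, the single-linkage distance, and the `max_distance` termination condition are all assumptions chosen for illustration; none of them comes from this document.

```python
from itertools import combinations


def agglomerate(points, max_distance):
    """Bottom-up clustering of numbers using single-linkage distance (assumed)."""
    # Each object starts out as its own atomic cluster.
    clusters = [[p] for p in points]

    def distance(c1, c2):
        # Single linkage: distance between the closest members of two clusters.
        return min(abs(x - y) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters by comparing every pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Assumed termination condition: stop when the closest pair is too far apart.
        if distance(clusters[i], clusters[j]) > max_distance:
            break
        # Merge the pair into a new, higher-level cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


clusters = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], max_distance=0.5)
```

Every pass of this sketch compares all remaining pairs of clusters, which hints at why such methods become expensive on large product catalogs.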

However, classifying e-commerce website data with the clustering method described above requires extensive data processing, which is likely to use system resources inefficiently.

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a diagram illustrating an embodiment of a system for categorizing products.

FIG. 2 is a flowchart illustrating an embodiment of a process for categorizing products.

FIG. 3 is a flowchart illustrating another embodiment of a process for categorizing products.

FIG. 4 is a diagram illustrating an embodiment of a system for categorizing and using product data.

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer-readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of the disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or as a specific component that is manufactured to perform the task. As used herein, the term "processor" refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Product categorization is disclosed. In some embodiments, product data is obtained and the titles of the products mentioned in the product data are extracted. In some embodiments, attribute information for the products is also extracted from the product data. The extracted information is divided into phrases. A score is determined for each phrase based at least in part on the phrase's historical frequency of occurrence. For each product, a set of one or more phrases is selected and assembled into a word sequence. The word sequence constructed for each product is compared with the word sequences of the other products. Products with similar word sequences are merged into a set of products under one category.

In some embodiments, merging products with similar word sequences into a set of products under one category also includes merging the associated data of the products in that category (e.g., accompanying product data that describes the category of the products).

FIG. 1 is a diagram illustrating an embodiment of a system for categorizing products. In the example shown, system 100 includes extraction unit 10, division unit 11, selection unit 12, integration unit 13, and processing unit 14.

System 100 can be implemented using any suitable computing device, such as a personal computer, a server computer, a handheld or portable device such as a smartphone, a flat panel device, a multiprocessor system, a microprocessor-based system, a set-top box, programmable consumer electronics, a network PC, a minicomputer, a mainframe computer, a special-purpose device, a distributed computing environment that includes any of the above systems or devices, or another hardware/software/firmware combination that includes one or more processors and memory coupled to the processors and configured to provide the processors with instructions.

The units can be implemented as software components executing on one or more general-purpose processors, as hardware such as programmable logic devices and/or application-specific integrated circuits designed to perform certain functions, or as a combination thereof. In some embodiments, the units can be embodied in the form of software products that can be stored in a non-volatile storage medium (such as an optical disk, a flash storage device, or a mobile hard disk) and that include a number of instructions for causing a computing device (such as a personal computer, a server, or network equipment) to carry out the methods described in the embodiments of the invention. The units may be implemented on a single device or distributed across multiple devices. The functions of the units may be merged into one another or further split into multiple subunits.

Extraction unit 10 is configured to obtain data associated with the products to be categorized. Extraction unit 10 is also configured to extract the titles of the products from the obtained data. In some embodiments, extraction unit 10 is configured to also extract attribute information for the products from the obtained data.

Division unit 11 is configured to divide each product title into one or more phrases, each of which contains one or more words. Division unit 11 is further configured to determine, for each phrase, a score that represents the phrase's historical frequency of occurrence.

Selection unit 12 is configured to select, for each product, the phrases whose scores satisfy a predetermined condition and to combine those phrases into a word sequence for that product.

Integration unit 13 is configured to compare the word sequences constructed for the products with one another. In some embodiments, integration unit 13 is configured to determine which products have similar word sequences and to merge the products with similar corresponding word sequences into one product category. In some embodiments, for products with similar word sequences, integration unit 13 also merges the associated data (e.g., attribute information and other descriptive data) of the products in the same category (e.g., so that the merged data describes that product category).

Processing unit 14 is configured to set and store an identifier for each of the product categories determined by integration unit 13.

FIG. 2 is a flowchart illustrating an embodiment of a process for categorizing products. In some embodiments, process 200 is performed on a guided search system such as system 100 of FIG. 1.

At step 202, data associated with the products to be categorized is obtained, and the titles and other attribute information of those products are extracted.

In some embodiments, data associated with a product is entered into an e-commerce website manually (e.g., by an operator or a registered user of the website). For example, a user can access a web page of the website that features fields in which the user can enter data associated with a product. The content of that web page can then be transmitted to the user. The server then extracts the title and other attribute information from the content. The server also divides the extracted title into phrases.

In some embodiments, product data is obtained periodically and/or automatically in order to perform product categorization (e.g., to update the categorization stored for an e-commerce website). In some embodiments, the product data is obtained by a server associated with the e-commerce website (e.g., a server that supports the website's platform and stores at least some of the website's content). For example, the server can obtain the product data after such data has been uploaded to the website.

In various embodiments, extracting the title of a product is desirable because the title includes keywords that accurately describe the product. Examples of data associated with a product include its title, its price, and other information such as its model, year, and manufacturer. For example, the title of a hair dryer product might be "Model D3506 hair dryer by the HairShine brand".

In various embodiments, the attribute information of a product includes a detailed description of the product. For example, the attribute information of a hair dryer can include when the product was brought to market, the model and color of the hair dryer, and its rating (rating score). In some embodiments, an attribute and its corresponding attribute value are denoted by identifiers that represent them. In some embodiments, an attribute and its corresponding attribute value are expressed as a pair of the form attribute identifier:attribute value identifier. For example, if the color attribute of a product is green, it can be written as A:2000, where A is the identifier of the color attribute and 2000 is the identifier of the attribute value green. In some embodiments, the similarity of both the titles and the attribute information of different products is considered when merging products into one or more groups (e.g., each associated with one category). Therefore, in some embodiments, both the title and the attribute information of each product are extracted at step 202.
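The identifier-pair notation can be illustrated with a small Python sketch. It parses text such as `A:2000` into a mapping and resolves the identifiers through lookup tables. The function name `parse_attribute_pairs`, the `;` separator between pairs, and the two lookup tables are hypothetical; the text only specifies the attribute identifier:attribute value identifier form.

```python
def parse_attribute_pairs(encoded):
    """Parse 'A:2000;B:17' style text into {attribute id: value id}."""
    pairs = {}
    for item in encoded.split(";"):  # ';' as a pair separator is an assumption
        item = item.strip()
        if not item:
            continue  # skip empty segments such as a trailing ';'
        attribute_id, value_id = item.split(":", 1)
        pairs[attribute_id.strip()] = value_id.strip()
    return pairs


# Hypothetical lookup tables that map identifiers back to readable names.
ATTRIBUTE_NAMES = {"A": "color"}
VALUE_NAMES = {("A", "2000"): "green"}

pairs = parse_attribute_pairs("A:2000")
readable = {ATTRIBUTE_NAMES[a]: VALUE_NAMES[(a, v)] for a, v in pairs.items()}
```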

At step 204, the product titles are divided into phrases.

In some embodiments, the extracted product title and/or attribute information is divided into one or more phrases, each of which includes at least one word. In some embodiments, a title is divided into one or more phrases based at least on the discernible meanings of the phrases. In some embodiments, the division of a title is performed based on a predetermined set of rules that determine which individual words and which groups of words can be regarded as phrases. For example, the product title "Model D3506 hair dryer by the HairShine brand" is divided into "HairShine brand", "Model D3506", and "hair dryer".

In some embodiments, dividing the title and/or attribute information into phrases also includes discarding certain phrases. For example, phrases that indicate the brand and the product type (e.g., "HairShine brand" and "Model D3506") are retained at the end of the division process. Conversely, phrases that tend not to be closely tied to product categorization (e.g., "certified product", "sale", and "special price") are eliminated at the end of the division process. In some embodiments, which phrases are discarded is determined using historical reference information stored in a database.
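The division and discarding of phrases at step 204 might look like the following Python sketch. It uses a greedy longest-match lookup against a phrase list, which is only one possible reading of "a predetermined set of rules"; the function name `split_title`, the rule list, and the discard list are hypothetical stand-ins for the rule set and the historical reference information mentioned above.

```python
def split_title(title, phrase_rules, discard_phrases):
    """Greedy longest-match split of a title into known phrases."""
    vocabulary = set(phrase_rules) | set(discard_phrases)
    longest = max(len(p.split()) for p in vocabulary)
    words = title.lower().split()
    phrases, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocabulary:
                if candidate not in discard_phrases:
                    phrases.append(candidate)  # keep phrases useful for categorization
                i += n
                break
        else:
            i += 1  # the word matches no rule; skip it (a simplification)
    return phrases


PHRASE_RULES = ["hairshine brand", "model d3506", "hair dryer"]  # hypothetical
DISCARD_PHRASES = ["certified product", "sale", "special price"]  # hypothetical

phrases = split_title(
    "Special price Model D3506 hair dryer by the HairShine brand",
    PHRASE_RULES,
    DISCARD_PHRASES,
)
```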

In some embodiments, product titles and product information are divided into phrases using a tool implemented on a platform such as a Hadoop distributed computing system. In some embodiments, the Hadoop program is executed on a Hadoop distributed architecture (e.g., on a computing cluster of 50 to 300 machines).

At step 206, a respective score is determined for each phrase. In some embodiments, a score is determined for each phrase that was produced by the division and was not discarded. In some embodiments, the score of a phrase represents the phrase's historical frequency of occurrence. The historical frequency of occurrence of a phrase includes one or more of: the number of times users of the associated e-commerce website have searched for the phrase, the number of times the phrase has been included in title information entered by users, and a distribution probability.
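A minimal Python sketch of the scoring at step 206 follows. It assumes that a score is simply the sum of two counts, search occurrences and title occurrences; the text does not say how the counts are combined, and the distribution probability is left out. The function name `score_phrases` and the sample logs are hypothetical.

```python
from collections import Counter


def score_phrases(phrases, search_log, title_log):
    """Score each phrase by how often it appeared in past searches and titles."""
    search_counts = Counter(search_log)  # phrase -> number of past searches
    title_counts = Counter(title_log)    # phrase -> number of past title uses
    # Assumed combination: an unweighted sum of the two counts.
    return {p: search_counts[p] + title_counts[p] for p in phrases}


search_log = ["hair dryer", "hair dryer", "hairshine brand"]
title_log = ["hair dryer", "model d3506"]
scores = score_phrases(
    ["hairshine brand", "model d3506", "hair dryer"], search_log, title_log
)
```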

At step 208, a word sequence is determined for each product. In some embodiments, the word sequence is formed from the phrases into which the product's information was divided. In some embodiments, the phrases to include in the word sequence are selected according to a predetermined condition based on their determined scores. For example, the predetermined condition might require selecting the two highest-scoring phrases from the product's title and the five highest-scoring words from its attribute information.
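The example condition above (two title phrases, five attribute words) can be sketched in Python as follows. The function names, the alphabetical tie-break, and the sample scores are assumptions made for illustration.

```python
def top_scored(scores, k):
    """Return the k highest-scoring phrases; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:k]]


def build_word_sequence(title_scores, attribute_scores, title_k=2, attribute_k=5):
    """Combine the best title phrases and the best attribute words into one sequence."""
    return top_scored(title_scores, title_k) + top_scored(attribute_scores, attribute_k)


title_scores = {"hairshine brand": 40, "model d3506": 25, "hair dryer": 90}
attribute_scores = {"red": 12, "1800w": 7}
word_sequence = build_word_sequence(title_scores, attribute_scores)
```

When fewer than `k` candidates exist, as with the two attribute words here, the sketch simply keeps all of them.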

At step 210, the word sequences corresponding to the products are compared. The word sequences constructed for the products at step 208 are compared with one another. In some embodiments, the word sequence of a product is compared with the word sequences of all the other products in the obtained product data. In some embodiments, each comparison yields a match rate, which indicates how similar the two word sequences (and their respective products) are. In some embodiments, two products are considered similar if the match rate of their comparison exceeds a certain threshold.

For example, if two word sequences are identical (e.g., each word sequence contains exactly the same phrases), the match rate is 100%. If the match rate threshold is 95%, the two word sequences and their respective products are considered similar.
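The text does not define how the match rate is computed. The Python sketch below assumes one plausible measure, the share of phrases that two word sequences have in common (Jaccard overlap), so that identical sequences score 100%. The function names and the choice of measure are assumptions.

```python
def match_rate(seq_a, seq_b):
    """Assumed match rate: shared phrases divided by all distinct phrases."""
    a, b = set(seq_a), set(seq_b)
    if not a and not b:
        return 1.0  # two empty sequences are treated as identical
    return len(a & b) / len(a | b)


def is_similar(seq_a, seq_b, threshold=0.95):
    """Two products are considered similar if the match rate exceeds the threshold."""
    return match_rate(seq_a, seq_b) > threshold


same = is_similar(
    ["hairshine brand", "model d3506"], ["model d3506", "hairshine brand"]
)
partial = match_rate(["x", "y", "z"], ["x", "y", "w"])
```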

At step 212, at least two products are merged into one product category based at least in part on the comparison. Based on the comparison of step 210, similar products are merged and placed in the same category. In some embodiments, a product category is a set of products whose word sequences are similar to one another. These products are considered similar to one another because their word sequences are similar; in other words, a word sequence is regarded as an accurate representation of its product. In some embodiments, the set of products merged into one category is stored together in one database.

For example, suppose that, based on the comparison of step 210, the word sequences of 15 products are considered similar (e.g., the word sequence of each product is considered similar to the word sequences of all the others). In this example, these 15 products are placed in one category.
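One way to sketch the merging at step 212 in Python is a greedy pass that places a product into the first category whose members are all similar to it, matching the example in which every product's word sequence is similar to all the others. The greedy strategy, the function names, and the exact-match similarity test used in the demonstration are assumptions.

```python
def group_products(word_sequences, similar):
    """Group product ids so that every member of a category is similar to all others."""
    categories = []
    for product_id, sequence in word_sequences.items():
        for category in categories:
            # Join a category only if similar to every product already in it.
            if all(similar(sequence, word_sequences[other]) for other in category):
                category.append(product_id)
                break
        else:
            categories.append([product_id])  # no fit found; start a new category
    return categories


def same_phrases(seq_a, seq_b):
    """Stand-in similarity test: the sequences contain exactly the same phrases."""
    return set(seq_a) == set(seq_b)


word_sequences = {
    "p1": ["hairshine brand", "model d3506"],
    "p2": ["model d3506", "hairshine brand"],
    "p3": ["othershine brand", "model x1"],
}
categories = group_products(word_sequences, same_phrases)
```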

In some embodiments, for products that have been merged into the same category, their respective product data is also merged (e.g., into unified descriptive data) and stored for that product category. For example, the merged product data for the products of a category can be used to describe all the products in that category. The products merged into the same category and their merged product data can be stored, for example, in the same text file or data table.

In some embodiments, in managing a product category, the merged product data for that category is used to characterize the category. For example, the merged product data can be used to present the products of the associated category visually. Alternatively, the merged product data can be modified to change the description of the products in the associated category. The merged product data can also be returned in response to a search for products in the associated product category.

In some embodiments, a unique category identifier is set for each identified product category. Each product category is stored with its identifier so that it can be looked up by its unique category identifier. For example, each unique category identifier can be stored together with the corresponding set of products (e.g., using product titles or other product identification information) and their merged product data.
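Storing a category under a unique identifier could be sketched in Python as below. The in-memory dictionary, the `cat-N` identifier format, and the record layout are hypothetical; the text mentions only that the identifier is stored with the products and their merged data.

```python
from itertools import count

_next_id = count(1)  # hypothetical source of unique, increasing identifier numbers


def store_category(store, product_ids, merged_data):
    """Save a category record under a new unique identifier and return that identifier."""
    category_id = f"cat-{next(_next_id)}"
    store[category_id] = {"products": list(product_ids), "data": dict(merged_data)}
    return category_id


store = {}
category_id = store_category(
    store, ["p1", "p2"], {"brand": "HairShine", "type": "hair dryer"}
)
record = store[category_id]  # look the category up by its identifier
```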

FIG. 3 is a flowchart illustrating another embodiment of a process for categorizing products. In some embodiments, steps 302 through 306 take place after an iteration of process 200 of FIG. 2.

Process 300 can be performed to improve the accuracy of the categorization results of process 200. Process 300 can help merge the categories of products that are similar but were placed in different categories by process 200 because the data relied upon contained different titles for the same product (e.g., as entered by users). Process 300 can be performed any number of times to improve the overall accuracy of the categorization process.

The following embodiments of steps 302 through 306 assume that at least two product categories have been created after an iteration of process 200.

At step 302, a word combination is determined for each product category.

A word combination for a product category refers to a set of phrases that represents the product category, together with the respective scores determined for those phrases. The word combination for a product category can be selected in various ways. In one example, if all the products of a category correspond to the same word sequence, that word sequence is used as the word combination for the category. For example, products that all correspond to a word sequence containing the phrases "HairShine brand", "red", and "DF0753" are placed in the same category, so "HairShine brand, red, and DF0753" can be taken as the word combination for that product category.

In another example, the products of a category do not all correspond to the same word sequence, but they all correspond to word sequences that contain some of the same phrases. In this situation, the set of phrases common to all the products of the category can be taken as the word combination for that product category.
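Both examples above reduce to taking the phrases shared by every word sequence in a category, which the following Python sketch does. The function name `word_combination`, the alphabetical ordering of the result, and the sample scores are assumptions.

```python
def word_combination(sequences, scores):
    """Return (phrase, score) pairs for phrases common to every sequence of a category."""
    common = set(sequences[0])
    for sequence in sequences[1:]:
        common &= set(sequence)  # keep only phrases present in every sequence
    return sorted((phrase, scores[phrase]) for phrase in common)


scores = {"hairshine brand": 40, "red": 12, "df0753": 30, "1800w": 7}
category_sequences = [
    ["hairshine brand", "red", "df0753"],
    ["hairshine brand", "df0753", "1800w"],
]
combination = word_combination(category_sequences, scores)
```

If all sequences are identical, the intersection is the full word sequence, which corresponds to the first example.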

At step 304, the similarity between two product categories is determined.

In some embodiments, the similarity between two categories is determined using the word combinations of the two product categories. For example, the similarity can be determined by the following equation.

[Equation image not reproduced in the text.]

In the above equation, TD1 and TD2 represent the respective word combinations of the two product categories. For example,
TD1 = (phrase 11, score 11), (phrase 12, score 12), (phrase 13, score 13)
TD2 = (phrase 21, score 21), (phrase 22, score 22), (phrase 23, score 23)
where "phrase XX" denotes a phrase and "score YY" denotes the corresponding score.

Further, feature 1 and feature 2 represent the respective values of the main attributes corresponding to the two product categories. As used herein, main attributes are the important attributes of a particular product. For example, the main attributes of a mobile phone include its brand and model, whereas its color and weight are general (e.g., non-main) attributes. In some embodiments, the main attributes for particular products are stored and are accessed in process 300 to determine which values to use as feature 1 and feature 2. In some embodiments, the similarity is calculated based on a cosine calculation. The greater the calculated similarity, the more similar the two product categories are.
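As a hedged illustration of a cosine calculation over word combinations, the Python sketch below treats each word combination as a sparse vector that maps phrases to scores and computes the cosine of the angle between two such vectors. The function name `cosine_similarity` and the sample data are assumptions, and this is not presented as the document's equation, which is not reproduced in the text.

```python
import math


def cosine_similarity(td1, td2):
    """Cosine similarity between two lists of (phrase, score) pairs."""
    v1, v2 = dict(td1), dict(td2)
    # Dot product over the phrases that the two word combinations share.
    dot = sum(score * v2.get(phrase, 0.0) for phrase, score in v1.items())
    norm1 = math.sqrt(sum(s * s for s in v1.values()))
    norm2 = math.sqrt(sum(s * s for s in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty or all-zero combination matches nothing
    return dot / (norm1 * norm2)


td1 = [("hairshine brand", 3.0), ("df0753", 4.0)]
td2 = [("df0753", 4.0), ("hairshine brand", 3.0)]
td3 = [("other brand", 2.0)]
identical = cosine_similarity(td1, td2)
disjoint = cosine_similarity(td1, td3)
```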

Further, λ1 and λ2 are coefficients selected to assign weights to the titles and the attributes. λ1 and λ2 indicate how important the titles and the attributes, respectively, are to the similarity calculation (e.g., because TD1 and TD2 are formed from phrases divided from title information, whereas feature 1 and feature 2 are attribute values). For example, λ1 = 2 and λ2 = 1 indicate that the titles are twice as important as the attributes.
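The effect of the two weights can be sketched in Python under a loudly stated assumption: the title similarity and the attribute similarity are combined as a normalized weighted average. The actual equation is not reproduced in the text, so this form, the function name `weighted_similarity`, and the sample values are hypothetical, and the adjustment that involves a, b, n1, and n2 is left out entirely.

```python
def weighted_similarity(title_sim, attribute_sim, lambda1=2.0, lambda2=1.0):
    """Assumed combination: weighted average of title and attribute similarity."""
    return (lambda1 * title_sim + lambda2 * attribute_sim) / (lambda1 + lambda2)


# With lambda1 = 2 and lambda2 = 1, the title similarity counts twice as much.
combined = weighted_similarity(title_sim=0.9, attribute_sim=0.3)
```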

Further, a and b represent predetermined parameters, and n1 and n2 represent the numbers of products in the two product categories being compared. Parameters a and b control the value of the similarity and therefore affect whether the two product categories are merged. For example, when the two product categories each contain a large number of products, the values of a and b can be changed so that the similarity computed from the expression (an image not reproduced in the text) becomes smaller. This lowers the probability that the two product categories are merged.

For example, if a = 50, b = 20, n1 = 100, and n2 = 10, the expression evaluates to [value not reproduced in this text].

In step 306, whether the two product categories should be merged is determined by comparing the determined similarity between them to a predetermined threshold. If the determined similarity exceeds the predetermined threshold, the two product categories are merged into one category in step 308. If the determined similarity does not exceed the predetermined threshold, the two product categories are not merged.

In some embodiments, the predetermined threshold is used to determine whether two categories are similar enough to be merged into one category. The predetermined threshold can be stored and accessed for the determination in step 304.

Returning to the above example, suppose the determined similarity between the two product categories is approximately 7%. If the predetermined threshold for merging two categories is 97%, the determined similarity is far below the threshold, so the two categories are not merged.

In some embodiments, merging two categories includes creating a new category identifier and storing that identifier with all products of both categories (e.g., identifying information for those products) and the associated product data of both categories. In some embodiments, merging two categories includes storing all products of both categories and the associated product data of both categories with one of the category identifiers of the two categories.
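The threshold comparison of step 306 and the two storage options for step 308 can be sketched as below. The dictionary-based `store`, the identifier format, and the function names are assumptions made for illustration; the specification does not prescribe a storage layout.

```python
import itertools

_new_ids = itertools.count(1)  # hypothetical source of fresh category identifiers


def should_merge(similarity, threshold):
    """Step 306: merge only when the similarity exceeds the threshold."""
    return similarity > threshold


def merge_categories(store, id_a, id_b, reuse_existing_id=False):
    """Step 308: store the products of both categories under one identifier.

    `store` maps a category identifier to a list of product records.
    With reuse_existing_id=True the identifier of the first category is kept;
    otherwise a new identifier is created.
    """
    merged = store.pop(id_a) + store.pop(id_b)
    target = id_a if reuse_existing_id else f"merged-{next(_new_ids)}"
    store[target] = merged
    return target


store = {
    "cat-1": [{"id": "p1", "title": "brandx m1 phone"}],
    "cat-2": [{"id": "p2", "title": "brandx m1 mobile phone"}],
}
# The example from the text: 7% similarity against a 97% threshold, so no merge.
if should_merge(0.07, 0.97):
    merge_categories(store, "cat-1", "cat-2")
```

Because 0.07 does not exceed 0.97, `store` still holds both categories after this code runs.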

FIG. 4 is a diagram showing an embodiment of a system for categorizing and using product data. System 400 includes user 402, network 404, and server 406. Network 404 includes various high-speed data networks and/or telecommunications networks. Server 406 is configured to communicate with user 402 over network 404.

In some embodiments, process 200 is implemented using system 400. In some embodiments, process 300 is also implemented using system 400. In some embodiments, the units of system 100 (extraction unit 10, split unit 11, selection unit 12, integration unit 13, and processing unit 14) are components of server 406.

In some embodiments, server 406 is configured to support a platform for an e-commerce website. For example, server 406 stores information for the website and serves the website's web pages. In some embodiments, server 406 is configured to acquire data from users (e.g., user 402) who upload information (e.g., product data) to the website.

Server 406 is configured to extract the titles of the products for which product data has been acquired. In some embodiments, server 406 is also configured to extract attribute information about the products from the acquired data. For example, server 406 can extract title and/or attribute information from the title and attribute fields, respectively, of data uploaded to the website. Server 406 is configured to divide the extracted information (e.g., title and/or attribute information) into phrases. For example, a product title can be divided based on a set of rules that splits a series of alphanumeric words into one or more phrases. Server 406 is configured to determine scores for the phrases. In some embodiments, the score for a phrase is based on the past frequency of occurrence of that phrase (e.g., within product data stored for the website). Server 406 is configured to construct word sequences for the products in the acquired data. For example, a word sequence is constructed for each product. In some embodiments, the word sequence for a product is determined based on at least one phrase selected from that product's phrases. Phrases can be selected based on a predetermined condition (e.g., the three phrases with the highest scores are selected). Server 406 is configured to compare the word sequence for a product with the word sequences of other products. In some embodiments, the word sequence for a product is compared with the word sequences of all other products in the acquired data. In some embodiments, comparing two word sequences yields a determination of whether the word sequences (and their corresponding products) are similar. Server 406 is configured to integrate at least two products into the same category based at least in part on the result of the comparison. In some embodiments, products whose word sequences were found similar in the comparison are integrated into the same category. For example, products integrated into the same category are stored under the same category identifier. In some embodiments, the product data (e.g., title and attribute information) of products in the same category is also stored under the same category identifier.
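The sequence of operations described for server 406 can be sketched end to end as follows. Every concrete choice here is an assumption made for illustration: whitespace splitting stands in for the rule set, raw occurrence counts stand in for the scores (with a higher count treated as a higher score), the match rate is computed as the share of common phrases, and the greedy grouping and the 0.6 cutoff are hypothetical.

```python
from collections import Counter


def split_into_phrases(title):
    """Hypothetical splitting rule: lowercase and split on whitespace."""
    return title.lower().split()


def score_phrases(titles):
    """Score each phrase by how many times it occurs across all titles."""
    return Counter(p for t in titles for p in split_into_phrases(t))


def word_sequence(title, scores, top_n=3):
    """Keep the top_n highest-scoring distinct phrases of a title."""
    phrases = set(split_into_phrases(title))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(phrases, key=lambda p: (-scores[p], p))
    return tuple(ranked[:top_n])


def match_rate(seq_a, seq_b):
    """Fraction of shared phrases relative to the longer sequence."""
    if not seq_a or not seq_b:
        return 0.0
    return len(set(seq_a) & set(seq_b)) / max(len(seq_a), len(seq_b))


def categorize(titles, min_match_rate=0.6):
    """Greedily place each product in the first category whose seed matches."""
    scores = score_phrases(titles)
    categories = []  # each entry: (seed word sequence, list of titles)
    for title in titles:
        seq = word_sequence(title, scores)
        for seed, members in categories:
            if match_rate(seq, seed) >= min_match_rate:
                members.append(title)
                break
        else:
            categories.append((seq, [title]))
    return categories


titles = [
    "BrandX M1 phone black",
    "BrandX M1 phone white",
    "BrandY tablet 10 inch",
]
categories = categorize(titles)
```

In this sample the two phone titles reduce to the same three-phrase word sequence and land in one category, while the tablet title shares no phrases with them and starts its own.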

In some embodiments, server 406 is configured to merge product categories. In some embodiments, server 406 is configured to determine a word combination for a product category. For example, a word combination is determined for each existing product category. The word combination can be selected from the word sequences associated with the products of that category. Server 406 is configured to determine a similarity between two product categories. In some embodiments, the similarity is determined using the word combinations of the two categories. Server 406 is configured to compare the determined similarity between the two categories to a predetermined threshold to decide whether to merge them. If the determined similarity is above the predetermined threshold, the two categories are merged into one category (and, for example, the products from both categories are stored with the same category identifier). Otherwise, if the determined similarity is below the predetermined threshold, the two product categories are not merged.
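One way to select a word combination from the word sequences of a category's products is sketched below. The selection rule (the phrases that appear in the most member word sequences) and the name `word_combination` are assumptions; the text above only states that the combination is selected from those word sequences.

```python
from collections import Counter


def word_combination(word_sequences, size=3):
    """Pick the phrases that appear in the most word sequences of a category."""
    # Count each phrase once per word sequence so long sequences do not dominate.
    counts = Counter(p for seq in word_sequences for p in set(seq))
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda p: (-counts[p], p))
    return tuple(ranked[:size])


# Hypothetical word sequences of three products in one category.
sequences = [
    ("brandx", "m1", "phone"),
    ("brandx", "m1", "black"),
    ("brandx", "phone", "m1"),
]
combination = word_combination(sequences)
```

The resulting combinations of two categories could then be passed to whatever similarity calculation is in use, and the result compared with the predetermined threshold.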

Server 406 is configured to store and maintain product category information. Such information can be used to present each category of similar products on the e-commerce website. For example, a visual representation (i.e., a table) containing the title and attribute information of a product category can be displayed in response to a user's search for products of that category. Suppose that, on the e-commerce website, a user enters "mobile phone" in the search box. The server supporting the website returns a set of search results that includes the products sold on the website that relate to "mobile phone". The returned search results can display the information stored for the product categories related to "mobile phone" (e.g., in the form of product price, model, cost, and manufacturer).

User 402 is a device through which a user accesses the e-commerce website. Although user 402 is shown as a laptop computer in FIG. 4, user 402 can be any computer, mobile device, or tablet, among others. In some embodiments, user 402 is configured to allow the user to upload product data to the e-commerce website. In some embodiments, user 402 is configured to receive search results. In some embodiments, user 402 is configured to display search results.

Those skilled in the art will recognize that various modifications and substitutions can be made to the embodiments of the present invention without departing from its spirit and scope. Accordingly, to the extent that such modifications and substitutions fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include all of them.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims

A method for categorizing products, comprising:
extracting titles of a plurality of products from acquired data;
dividing the titles into phrases;
determining a respective score for each of the phrases;
constructing a first word sequence corresponding to a first product of the plurality of products using at least one of the phrases, selected based at least in part on the respective determined scores;
comparing the first word sequence to a second word sequence corresponding to a second product of the plurality of products; and
integrating the first product and the second product into one product category based at least in part on the comparison.

The method of claim 1, further comprising:
determining a similarity between a first product category and a second product category; and
merging the first product category with the second product category if the determined similarity at least meets a merge threshold.

The method of claim 1, wherein determining the respective score for each of the phrases is based at least in part on a past frequency of occurrence of the phrase.

The method of claim 1, further comprising:
extracting attribute information about the plurality of products from the acquired data; and dividing the attribute information into phrases.

The method of claim 1, wherein comparing the first word sequence to the second word sequence includes determining whether the first word sequence is similar to the second word sequence.

The method of claim 5, wherein determining whether the first word sequence is similar to the second word sequence is based at least in part on a match rate.

The method of claim 1, wherein integrating the first product and the second product into one product category includes integrating data associated with the first product and the second product.

The method of claim 1, wherein integrating the first product and the second product into one product category includes storing both the first product and the second product with a category identifier.

The method of claim 2, wherein determining the similarity includes calculating a value based on a determined score corresponding to the first product category and a determined score corresponding to the second product category.

The method of claim 2, wherein merging the first product category with the second product category includes storing the first product category and the second product category with the same category identifier.

A system for categorizing products, comprising:
one or more processors; and
a memory coupled to the one or more processors and configured to provide the one or more processors with instructions,
wherein the one or more processors are configured to:
extract titles of a plurality of products from acquired data;
divide the titles into phrases;
determine a respective score for each of the phrases;
construct a first word sequence corresponding to a first product of the plurality of products using at least one of the phrases, selected based at least in part on the respective determined scores;
compare the first word sequence to a second word sequence corresponding to a second product of the plurality of products; and
integrate the first product and the second product into one product category based at least in part on the comparison.

The system of claim 11, wherein the one or more processors are further configured to:
determine a similarity between a first product category and a second product category; and
merge the first product category with the second product category based on whether the determined similarity exceeds a merge threshold.

The system of claim 11, wherein the one or more processors are configured to determine the respective score for each of the phrases based at least in part on a past frequency of occurrence of the phrase.

The system of claim 11, wherein the one or more processors are further configured to extract attribute information about the plurality of products from the acquired data and to divide the attribute information into phrases.

The system of claim 11, wherein the one or more processors are configured to compare the first word sequence to the second word sequence, including by determining whether the first word sequence is similar to the second word sequence.

The system of claim 15, wherein the one or more processors are configured to determine whether the first word sequence is similar to the second word sequence based at least in part on a match rate.

The system of claim 11, wherein the one or more processors are configured to integrate the first product and the second product into one product category, including by integrating data associated with the first product and the second product.

The system of claim 11, wherein the one or more processors are configured to integrate the first product and the second product into one product category, including by storing both the first product and the second product with the same category identifier.

The system of claim 11, wherein the one or more processors are configured to determine the similarity, including by calculating a value based on a determined score corresponding to the first product category and a determined score corresponding to the second product category.

The system of claim 12, wherein the one or more processors are configured to merge the first product category with the second product category, including by storing the first product category and the second product category with a category identifier.

A computer program product for categorizing products, embodied in a computer-readable storage medium and comprising:
computer instructions for extracting titles of a plurality of products from acquired data;
computer instructions for dividing the titles into phrases;
computer instructions for determining a respective score for each of the phrases;
computer instructions for constructing a first word sequence corresponding to a first product of the plurality of products using at least one of the phrases, selected based at least in part on the respective determined scores;
computer instructions for comparing the first word sequence to a second word sequence corresponding to a second product of the plurality of products; and
computer instructions for integrating the first product and the second product into one product category based at least in part on the comparison.