JP2010231761A

JP2010231761A - Multi-language object hierarchy extraction method and system from multi-language web site

Info

Publication number: JP2010231761A
Application number: JP2009281197A
Authority: JP
Inventors: Yu Zhao; Jianqiang Li
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2009-03-18
Filing date: 2009-12-11
Publication date: 2010-10-14
Anticipated expiration: 2029-12-11
Also published as: CN101840402B; JP4986085B2; CN101840402A

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system for extracting a multilingual object hierarchy from a multilingual web site.

SOLUTION: The method includes the steps of: inputting the web pages of the multilingual web site; decomposing the web site by language into a set of sub-web sites so that all web pages of each sub-web site are in the same language; extracting a monolingual object hierarchy from each sub-web site and recording the correspondence between each object in the hierarchy and its web page; determining parallel relationships between web pages in different languages in different sub-web sites; and generating the multilingual object hierarchy of the multilingual web site from the extracted monolingual object hierarchies, the recorded object-to-page correspondences, and the determined parallel relationships between the web pages in different languages.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates generally to information extraction, and more particularly to web mining and to a method and system for extracting a multilingual object hierarchy from a multilingual web site.

With the arrival of the Internet era, an ever larger and more varied body of information is accumulating on the Web. Computers have become an indispensable tool for finding information in everyday life. They perform calculation, storage, and retrieval at very high speed, but they lack the ability to understand information, and this is an obstacle to intelligent information processing.

In recent years, research on semantic processing for intelligent information processing has become active as a way to address this problem. Such techniques are described, for example, in Non-Patent Document 1 (T. Berners-Lee, J. Hendler, O. Lassila, "The Semantic Web", Scientific American, May 2001, pp. 28-37), Non-Patent Document 2 (Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall, "The Semantic Web Revisited", IEEE Intelligent Systems 21(3), pp. 96-101, May/June 2006), and Non-Patent Document 3 (E. Hyvonen (ed.), "Semantic Web Kick-Off in Finland - Vision, Technologies, Research, and Applications", HIIT Publications 2002-001, Helsinki Institute for Information Technology (HIIT), Helsinki, Finland, pp. 304). The entire disclosures of these documents are incorporated herein by reference for all purposes. These documents focus on formats and techniques that help computers understand information. To provide a foundation that promotes the adoption of semantic technologies, standardization bodies such as the World Wide Web Consortium (W3C) are actively defining standards such as XML, RDF (Resource Description Framework), OWL (Web Ontology Language), and rule languages (e.g., Web Rule Language, Rule Markup Language). These standards build on the mathematical logics for knowledge representation (e.g., description logic, frame logic) long used in artificial intelligence (AI) and on widely adopted web information processing technologies. Many developers, entrepreneurs, and practitioners are already creating and deploying tool sets, products, and case studies aimed at realizing meaning-based intelligent use of information, and some of these have reached practical use.

However, when the computing power of computers and semantic standards are applied to provide intelligent information services to web users, domain knowledge plays the central internal role (ontologies are currently the mainstream form of knowledge representation on the Web). The construction of domain knowledge has therefore become an important problem awaiting a solution. At the same time, the spread of the Web has interconnected the world more closely. To establish seamless, smooth communication channels between people who use different languages, the constructed domain knowledge must include versions in multiple languages, and accurate correspondences must be set between those versions. This adds a new difficulty to the domain knowledge construction problem: how to establish multilingual domain knowledge together with the correspondences between its languages.

An ontology is used as a means of formalizing documents that express domain knowledge. An ontology defines the concepts/objects in a domain and the relationships between them. The relationships defined in an ontology are of many kinds, such as "belonging to" and "located in". In practical applications the most common relationship is a hierarchical one, such as "A belongs to B" or "A is a sub-concept of B". For example, the concept "personal computer" is a sub-concept of the concept "computer". A lightweight ontology that defines only hierarchical relationships is called a "hierarchy" and, in practical applications, is usually embodied as a classification system or a directory structure.
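As an illustration only, a hierarchy in the sense just described can be held as a tree in which each node records nothing but its sub-concepts. The following minimal sketch assumes a hypothetical `Concept` class; the class name, its methods, and the "laptop" node are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    """A node of a lightweight ontology that stores only sub-concept links."""
    name: str
    children: List["Concept"] = field(default_factory=list)

    def add(self, child: "Concept") -> "Concept":
        """Attach `child` as a sub-concept and return it for chaining."""
        self.children.append(child)
        return child

    def is_ancestor_of(self, name: str) -> bool:
        """True if a concept called `name` appears anywhere below this node."""
        return any(c.name == name or c.is_ancestor_of(name)
                   for c in self.children)


# "personal computer" is a sub-concept of "computer" (example from the text).
computer = Concept("computer")
personal = computer.add(Concept("personal computer"))
personal.add(Concept("laptop"))  # hypothetical extra level for illustration
```

A directory structure or a product classification maps onto the same shape: each directory or category is a `Concept`, and containment is the only relationship stored.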

Several techniques for extracting hierarchies have already been proposed in existing papers and patents. Most of them, however, only extract a monolingual hierarchy from a monolingual data source, and very few address the problem of multilingual hierarchy extraction. A "multilingual hierarchy" is a hierarchy whose concepts/objects are defined or described in more than one language. The existing papers and patents related to multilingual hierarchies are introduced first.

Non-Patent Document 4 (H.-C. Yang, D.-W. Chen, C.-H. Lee, "A multilingual hierarchy mapping method based on GHSOM", Proceedings of the 2008 ICICIC conference) (hereinafter "Reference 1") introduces the following method of constructing a multilingual hierarchy. First, a set of parallel multilingual documents is collected ("parallel multilingual" means that one document exists in versions in several languages). Next, the parallel relationships between these documents are marked manually, that is, separate documents that are in fact versions of the same document in different languages are marked as such. A hierarchy is then extracted for each monolingual subset of the document set. Finally, correspondences between these monolingual hierarchies are established according to the parallel relationships marked in advance.

Non-Patent Document 5 (J. Daude, L. Padro, G. Rigau, "Mapping Multilingual Hierarchies Using Relaxation Labeling", Proceedings of the 1999 EMNLP/VLC conference) (hereinafter "Reference 2") proposes a method of establishing correspondences between monolingual hierarchies that have already been extracted. The method uses an external multilingual dictionary and language analysis techniques to determine correspondences between concept/object names in different languages, and establishes the correspondences between the hierarchies on that basis.

Further, Patent Document 1 (European Patent No. EP887748B1, "Multilingual terminology extraction system") (hereinafter "Reference 3") discloses a method of extracting multilingual concept terms from multilingual documents. In this method, one language version of a term is used as the input, and each document is treated as a word network. By analyzing the similarity between the word networks of the multilingual documents, the other language versions of the term can be derived.

Non-Patent Document 6 (P. Resnik, N. A., "The Web as a parallel corpus", Computational Linguistics, March 2003) (hereinafter "Reference 4") is not concerned with the extraction of hierarchies or knowledge, but it introduces a method of automatically establishing parallel relationships between multilingual documents. In this method, web pages on the Web are treated as the document set, and the similarity of the pages' HTML structures is used to identify parallel relationships between web pages in different languages.

Patent Document 1: European Patent No. EP887748B1

Non-Patent Document 1: T. Berners-Lee, J. Hendler, O. Lassila, "The Semantic Web", Scientific American, May 2001, pp. 28-37
Non-Patent Document 2: Nigel Shadbolt, Tim Berners-Lee, and Wendy Hall, "The Semantic Web Revisited", IEEE Intelligent Systems 21(3), pp. 96-101, May/June 2006
Non-Patent Document 3: E. Hyvonen (ed.), "Semantic Web Kick-Off in Finland - Vision, Technologies, Research, and Applications", HIIT Publications 2002-001, Helsinki Institute for Information Technology (HIIT), Helsinki, Finland, pp. 304
Non-Patent Document 4: H.-C. Yang, D.-W. Chen, C.-H. Lee, "A multilingual hierarchy mapping method based on GHSOM", Proceedings of the 2008 ICICIC conference
Non-Patent Document 5: J. Daude, L. Padro, G. Rigau, "Mapping Multilingual Hierarchies Using Relaxation Labeling", Proceedings of the 1999 EMNLP/VLC conference
Non-Patent Document 6: P. Resnik, N. A., "The Web as a parallel corpus", Computational Linguistics, March 2003

Among the related solutions described above, Reference 1 requires the parallel relationships between documents to be constructed manually. This is inefficient, costs considerable time and labor, and scales poorly, so the method cannot be adopted for building a large multilingual hierarchy. In the method of Reference 2, hierarchy extraction and the mapping between language versions are separated into two independent processes. The context of the hierarchy's source is therefore lost at mapping time, and high accuracy is difficult to achieve with an external multilingual dictionary alone. The method of Reference 3 requires that the multilingual documents from which concept terms are extracted be determined in advance to be versions of the same document in different languages, and that at least one term expressed in some language be given as the input or driver. The method therefore cannot be applied when it is not known whether a parallel relationship exists among the multilingual documents, and it cannot be used to extract new concept terms. Furthermore, because it does not extract relationships between concepts, it cannot be used to construct a hierarchy. The method of Reference 4 can be used during hierarchy extraction to determine whether a parallel relationship exists between multilingual documents. However, it determines only correspondences between whole documents, not correspondences between the internal elements of documents. A concept object in a hierarchy is very likely to correspond to only part of a document rather than to the whole document, so this method cannot be applied directly to the extraction and mapping of multilingual hierarchies.

In summary, the existing methods for multilingual hierarchy extraction have many drawbacks, which come down to two points: multilingual hierarchy extraction and mapping cannot be fully automated, and efficiency and scalability are insufficient. In particular, these methods cannot respond quickly when a new knowledge domain or a new language must be handled, and they require extensive preparatory work such as document labeling and dictionary construction.

An object of the present invention is to solve the above problems of the multilingual hierarchy extraction methods of the related art. The present invention provides a method and system for automatically extracting a multilingual object hierarchy from a multilingual web site. The method extracts a monolingual hierarchy from each monolingual sub-web site of the multilingual web site, automatically identifies the internal parallel correspondences between these monolingual sub-web sites, and derives the correspondences between the monolingual hierarchies, thereby generating the multilingual hierarchy of the multilingual web site.

According to a first aspect of the present invention, there is provided a method of extracting a multilingual object hierarchy from a multilingual web site, comprising the steps of: inputting the web pages of the multilingual web site; decomposing the web site by language into a set of sub-web sites so that the web pages of each sub-web site are in the same language; extracting a monolingual object hierarchy from each sub-web site and recording a mapping relationship between each object in the hierarchy and its corresponding web page; determining parallel relationships between web pages in different languages in different sub-web sites; and generating the multilingual object hierarchy of the multilingual web site according to the extracted monolingual object hierarchy of each sub-web site, the recorded mapping relationships between objects and their corresponding web pages, and the determined parallel relationships between the web pages in different languages.

According to a second aspect of the present invention, there is provided a system for extracting a multilingual object hierarchy from a multilingual web site, comprising: input means for inputting the web pages of the multilingual web site; monolingual sub-web site decomposition means for decomposing the web site by language into a set of sub-web sites so that the web pages of each sub-web site are in the same language; monolingual object hierarchy extraction means for extracting a monolingual object hierarchy from each sub-web site and recording a mapping relationship between each object in the hierarchy and its corresponding web page; parallel relationship determination means for determining parallel relationships between web pages in different languages in different sub-web sites; and multilingual object hierarchy generation means for generating the multilingual object hierarchy of the multilingual web site according to the extracted monolingual object hierarchy of each sub-web site, the recorded mapping relationships between objects and their corresponding web pages, and the determined parallel relationships between the web pages in different languages.

The multilingual hierarchy extraction method provided by the present invention is fully automatic and requires no manual labeling of documents, and its operating parameters depend on neither the domain nor the language. Compared with existing methods, the present invention greatly improves extraction efficiency and scalability. In addition, because the method and system exploit the parallel relationships between languages that are inherent in a multilingual web site, highly accurate results are ensured.

The above advantages and features of the present invention will become apparent from the following detailed description taken together with the drawings. Note, however, that the present invention is not limited to the examples shown in the drawings or to any specific embodiment.

The present invention will be understood more clearly from the following detailed description of its embodiments and from the accompanying drawings, in which like parts are denoted by the same reference numerals.

FIG. 1 is a block diagram showing a multilingual object hierarchy extraction system 100 according to the present invention.
FIG. 2 is a flowchart explaining the operation of the system 100 shown in FIG. 1.
FIG. 3 is a block diagram showing details of the parallel relationship determination means and the parallel relationship complementing means of the system 100 shown in FIG. 1.
FIG. 4 is a schematic diagram explaining the multilingual object hierarchy generation process according to the present invention.

FIG. 1 is a block diagram showing a multilingual object hierarchy extraction system 100 according to the present invention. In FIG. 1, the system 100 is shown as comprising a multilingual object hierarchy extraction unit and a storage unit. The multilingual object hierarchy extraction process according to the present invention is implemented with the multilingual object hierarchy extraction unit acting as the processing unit. The extraction unit extracts an object hierarchy from a multilingual web site, in which the object names contained in the hierarchy and the documents (web pages) related to the objects may exist in versions in several languages. As shown in FIG. 1, the multilingual object hierarchy extraction unit comprises input means 101, page block set generation means 102 (optional), monolingual sub-web site decomposition means 103, monolingual object hierarchy extraction means 104, parallel relationship determination means 105, parallel relationship complementing means 106 (optional), and multilingual object hierarchy generation means 107. The storage unit works together with the processing unit and stores the various processing results. As illustrated, the storage unit may include a multilingual web site page storage device 108, a monolingual sub-web site storage device 109, a monolingual object hierarchy storage device 110, a multilingual sub-web site parallel relationship storage device 111, and a multilingual object hierarchy storage device 112.

FIG. 2 is a flowchart explaining the operation of the system 100 shown in FIG. 1. The principle and operation of the present invention are described in detail below with reference to FIG. 1 and FIG. 2.

As shown in FIG. 2, process 200 begins at step 201, in which the input means 101 inputs all web pages contained in a multilingual web site. The multilingual web site page storage device 108 stores all web pages of one or more multilingual web sites collected automatically from the Internet by crawling, and records items such as the web page ID, web page content, and web page links. In step 202, the page block set generation means 102 preprocesses each input web page and generates a page block set for it. A page block is a portion of a web page, for which the visual size and the position within the page are stored. Layout relationships between page blocks, such as nesting and adjacency, are also extracted. To further improve the accuracy of the results, it is desirable to use the similarity between the page block sets of different web pages as reference information in the multilingual object hierarchy extraction process.
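The page block description above can be sketched as a small data structure. The following is a minimal illustration, assuming that a block is an axis-aligned rectangle in pixel coordinates; the `PageBlock` class, the `gap` tolerance, and the sample layout are hypothetical and are not specified in the source.

```python
from dataclasses import dataclass
from itertools import combinations, permutations


@dataclass(frozen=True)
class PageBlock:
    """A rectangular region of a rendered page (assumed pixel coordinates)."""
    block_id: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, other: "PageBlock") -> bool:
        """Nesting: `other` lies entirely inside this block."""
        return (self is not other
                and self.x <= other.x
                and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)

    def adjacent_to(self, other: "PageBlock", gap: int = 10) -> bool:
        """Adjacency: blocks sit side by side or stacked, at most `gap` apart."""
        # Signed distance between the two intervals; negative means overlap.
        h_gap = max(other.x - (self.x + self.width),
                    self.x - (other.x + other.width))
        v_gap = max(other.y - (self.y + self.height),
                    self.y - (other.y + other.height))
        side_by_side = 0 <= h_gap <= gap and v_gap < 0
        stacked = 0 <= v_gap <= gap and h_gap < 0
        return side_by_side or stacked


def layout_relations(blocks):
    """List nesting and adjacency relations among the blocks of one page."""
    relations = []
    for a, b in permutations(blocks, 2):
        if a.contains(b):
            relations.append(("nests", a.block_id, b.block_id))
    for a, b in combinations(blocks, 2):
        if a.adjacent_to(b):
            relations.append(("adjacent", a.block_id, b.block_id))
    return relations


# Hypothetical layout: header on top, navigation left, content right.
blocks = [
    PageBlock("header", 0, 0, 800, 100),
    PageBlock("nav", 0, 100, 200, 500),
    PageBlock("content", 200, 100, 600, 500),
    PageBlock("item", 220, 120, 100, 50),
]
relations = layout_relations(blocks)
```

Two pages in different languages that yield the same set of relations over similarly sized blocks would be natural candidates for comparison when page block similarity is used as reference information.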

Next, in step 203, the monolingual sub-web site decomposition means 103 decomposes the web pages of the input multilingual web site by language into a set of monolingual sub-web sites. This is done by attaching to each web page of the web site a language label that distinguishes pages by language. Each monolingual sub-web site can then be stored in the monolingual sub-web site storage device 109. As shown in FIG. 1, the monolingual sub-web site storage device 109 holds, in addition to the items stored in the multilingual web site page storage device 108 (web page ID, web page content, web page links, and so on), a language ID label attached to each web page according to the language of that page. In the following step 204, the monolingual object hierarchy extraction means 104 extracts the monolingual object hierarchy of each sub-web site and records the correspondence between each object in the hierarchy and its corresponding web page (or page block). The processing results of the monolingual object hierarchy extraction means 104 can be stored in the monolingual object hierarchy storage device 110.

In step 205, the parallel relationship determination means 105 determines the parallel relationships between different sub-web sites. These may be parallel relationships between web pages or page blocks in different languages in different sub-web sites. Many methods are available for determining the parallel relationships between sub-web sites in different languages, including methods that use the Web directory structure, methods that use the DOM structure of the web pages in different languages, and methods that use the structural topology of the page block sets. The methods of determining parallel relationships are described in detail later.
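Steps 203 and 205 can be illustrated with a small sketch. It assumes that each page already carries a language ID (how the language is identified is not covered here) and that the site keeps each language under its own top-level directory, so that two pages are treated as parallel when their paths match once the language directory is removed. The function names, the page dictionaries, and the `example.com` URLs are hypothetical, and this is only one possible way of using the Web directory structure.

```python
from collections import defaultdict
from urllib.parse import urlparse


def decompose(pages):
    """Group pages into monolingual sub-web sites keyed by language ID."""
    subsites = defaultdict(list)
    for page in pages:
        subsites[page["lang"]].append(page["url"])
    return dict(subsites)


def path_key(url, lang):
    """URL path with a leading language directory (if any) removed."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[0] == lang:
        segments = segments[1:]
    return "/".join(segments)


def parallel_pairs(subsites, lang_a, lang_b):
    """Pair pages of two sub-web sites whose language-free paths are equal."""
    by_key = {path_key(u, lang_b): u for u in subsites.get(lang_b, [])}
    pairs = []
    for url in subsites.get(lang_a, []):
        key = path_key(url, lang_a)
        if key in by_key:
            pairs.append((url, by_key[key]))
    return pairs


# Hypothetical crawl result; "lang" is the language label of step 203.
pages = [
    {"url": "http://example.com/en/products/pc.html", "lang": "en"},
    {"url": "http://example.com/ja/products/pc.html", "lang": "ja"},
    {"url": "http://example.com/en/about.html", "lang": "en"},
    {"url": "http://example.com/ja/news/2009.html", "lang": "ja"},
]
subsites = decompose(pages)
pairs = parallel_pairs(subsites, "en", "ja")
```

Pages without a counterpart path (here `about.html` and `news/2009.html`) stay unpaired, which is where DOM structure or page block analysis could supply further evidence.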

After the parallel relationship determination means 105 has determined the parallel relationships between different sub-web sites, the parallel relationship complementing means 106 may, in optional step 206, further complement the determined parallel relationships by analyzing the hyperlink relationships between the web pages or between the extracted monolingual hierarchies. The method of complementing the parallel relationships is described in detail later. The parallel relationships between web pages or page blocks in different languages determined by the parallel relationship determination means 105 and the parallel relationship complementing means 106 can be stored in the multilingual sub-web site parallel relationship storage device 111.

In step 207, the multilingual object hierarchy generation means 107 generates the multilingual object hierarchy of the multilingual web site on the basis of the extracted monolingual object hierarchy of each sub-web site (stored in the monolingual object hierarchy storage device 110), the recorded correspondences between objects and their corresponding web pages (or page blocks), and the determined parallel relationships between web pages or page blocks in different languages (stored in the multilingual sub-web site parallel relationship storage device 111). Each object in the multilingual object hierarchy may have versions in different languages. The generated multilingual object hierarchy is stored in the multilingual object hierarchy storage device 112. Process 200 then ends.
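The way the three inputs of step 207 combine can be sketched as follows. This is a simplified illustration, not the generation procedure itself: it assumes two languages, represents each monolingual hierarchy as a child-to-parent dictionary, uses the first language's hierarchy as the skeleton, and ignores objects that exist only in the second language. The function `merge_hierarchies` and all sample data are hypothetical.

```python
def merge_hierarchies(hier_a, pages_a, pages_b, parallel, lang_a, lang_b):
    """Attach language-B names to the language-A hierarchy via parallel pages.

    hier_a:   {object name: parent name or None} for language A
    pages_a:  {object name: page ID} for language A
    pages_b:  {object name: page ID} for language B
    parallel: iterable of (page ID in A, page ID in B) pairs
    """
    page_to_obj_b = {page: obj for obj, page in pages_b.items()}
    partner = dict(parallel)

    merged = {}
    for obj_a, parent_a in hier_a.items():
        names = {lang_a: obj_a}
        # Object -> its page -> parallel page -> object in the other language.
        page_b = partner.get(pages_a.get(obj_a))
        if page_b in page_to_obj_b:
            names[lang_b] = page_to_obj_b[page_b]
        merged[obj_a] = {"names": names, "parent": parent_a}
    return merged


# Hypothetical monolingual results for an English and a Japanese sub-site.
hier_en = {"Products": None, "Computers": "Products", "Printers": "Products"}
pages_en = {"Products": "en/products",
            "Computers": "en/products/pc",
            "Printers": "en/products/printer"}
pages_ja = {"製品": "ja/products", "コンピュータ": "ja/products/pc"}
parallel = [("en/products", "ja/products"),
            ("en/products/pc", "ja/products/pc")]

merged = merge_hierarchies(hier_en, pages_en, pages_ja, parallel, "en", "ja")
```

In this sample, "Printers" keeps only its English name because its page has no parallel page, which matches the statement that an object may, but need not, have versions in several languages.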

An example of the parallel relationship determination and complementing processes is described below with reference to FIG. 3. The structure and method shown in FIG. 3 are illustrative only and should not be regarded as limiting the scope of the present invention.

First, consider the internal structure of the parallel relationship determination means 105. FIG. 3 shows, as one example, a configuration comprising a directory structure analysis unit 301, a DOM structure analysis unit 302, a page block set analysis unit 303, and a first matching unit 304. The directory structure analysis unit 301, the DOM structure analysis unit 302, and the page block set analysis unit 303 determine the parallel relationships between different-language web pages (or page blocks) by analyzing, respectively, the web directory structure of the web site, the DOM structure of the web pages, and the structure of the page block sets. Each of these three analyses can be used on its own to determine parallel relationships; they need not be used together as shown in FIG. 3. FIG. 3 combines the three analysis methods only to improve the accuracy of the result, and the combination should not be regarded as limiting the scope of the present invention. The first matching unit 304 reconciles the three parallel relationship results determined by the directory structure analysis unit 301, the DOM structure analysis unit 302, and the page block set analysis unit 303, and resolves conflicts among them. For example, the first matching unit 304 assigns a weight to each analysis unit and determines the final result according to the assigned weights. The weights assigned to the analysis units can be determined by any machine learning method known in the art that is based on training samples.
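As a non-limiting illustration, the weighted reconciliation performed by the first matching unit 304 might be sketched as follows. The function name `reconcile`, the analyzer names, the example weights, and the rule of keeping a pair when its weight share exceeds 0.5 are all assumptions made for this sketch; the description only states that weights are assigned and that the final result is determined according to them.

```python
from collections import defaultdict


def reconcile(results, weights, threshold=0.5):
    """Combine the page pairs proposed by several analyzers.

    results: {analyzer_name: set of (page_a, page_b) pairs judged parallel}
    weights: {analyzer_name: non-negative weight}
    A pair is kept when the weight share of the analyzers proposing it
    exceeds `threshold` (an assumed decision rule).
    """
    total = sum(weights[name] for name in results)
    score = defaultdict(float)
    for name, pairs in results.items():
        for pair in pairs:
            score[pair] += weights[name]
    return {pair for pair, s in score.items() if s / total > threshold}


# Hypothetical outputs of units 301, 302 and 303 for a two-language site.
results = {
    "directory": {("en/a", "ja/a"), ("en/b", "ja/b")},
    "dom": {("en/a", "ja/a"), ("en/b", "ja/c")},
    "blocks": {("en/a", "ja/a"), ("en/b", "ja/b")},
}
weights = {"directory": 0.5, "dom": 0.2, "blocks": 0.3}
final_pairs = reconcile(results, weights)
```

In this example the conflicting pair proposed only by the DOM analysis is discarded because its weight share (0.2) does not exceed the threshold. In practice the weights would be learned from training samples, as noted above.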

The directory structure analysis unit 301 determines parallel relationships by analyzing the web directory structure of the web site. In directory structure analysis, for example, the URLs of web pages reveal how the site's author has arranged the parallel multilingual pages, so URL patterns provide much information useful for determining multilingual parallel relationships. On the Symantec web site, for instance, http://www.symantec.com/norton, http://www.symantec.com/zh/cn/norton, and http://www.symantec.com/ja/jp/norton form one group of parallel web pages: they are the English, Chinese, and Japanese versions of the same content. On this site, the URL pattern that identifies parallel relationships is "http://www.symantec.com/(language)/(region)/(content)". Such a URL pattern can be detected by performing similarity analysis on the URLs of all web pages within each monolingual subordinate web site to obtain a URL template for that site, and then comparing the URL templates of the monolingual subordinate web sites.
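As a non-limiting illustration, once a URL template of the form (language)/(region)/(content) is known, parallel pages can be grouped by their content path. The sketch below assumes the template rather than detecting it by similarity analysis, assumes that pages without a language/region prefix are in the default language, and uses hypothetical helper names (`split_locale`, `group_parallel`).

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed template: /<language>/<region>/<content>; the prefix is omitted
# for default-language pages, as in the Symantec example.
LOCALE_PREFIX = re.compile(r"^/([a-z]{2})/([a-z]{2})(?=/)")


def split_locale(url, default="en"):
    """Return (language, content_path) for a URL under the assumed template."""
    path = urlsplit(url).path
    match = LOCALE_PREFIX.match(path)
    if match:
        return match.group(1), path[match.end():]
    return default, path


def group_parallel(urls):
    """Group URLs that share a content path but differ in language."""
    groups = defaultdict(dict)
    for url in urls:
        lang, content = split_locale(url)
        groups[content][lang] = url
    # Only content paths available in more than one language are parallel.
    return {c: g for c, g in groups.items() if len(g) > 1}


urls = [
    "http://www.symantec.com/norton",
    "http://www.symantec.com/zh/cn/norton",
    "http://www.symantec.com/ja/jp/norton",
]
groups = group_parallel(urls)
```

Here the three URLs fall into a single group keyed by the content path "/norton". A site whose content paths happen to begin with two two-letter segments would be misread by this simple pattern, which is one reason the description derives the template from the site's own URLs.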

In addition to the web directory structure, analysis of the internal structure of web pages can be used to determine parallel relationships between web pages or page blocks. For example, the DOM structure analysis unit 302 determines parallel relationships between web pages by analyzing the similarity of their DOM structures, and the page block set analysis unit 303 determines parallel relationships between page blocks by analyzing the similarity of the page block sets of the web pages. First, the DOM structure analysis unit 302 detects web pages having a parallel relationship by analyzing the similarity of their DOM structures. Measures of DOM structure similarity include the similarity of HTML node label sequences and the similarity of node patterns. As described above, the page block set generation means 102 also generates a page block set for each web page. The page block set analysis unit 303 then determines whether two web pages are parallel by analyzing the similarity of their page block sets. The measures of page block set similarity are the similarity of the topological structure of the page block sets (which considers only abstract spatial relationships) and the similarity of the spatial size and position of the page blocks. Using page block set similarity, the parallel relationships between page blocks can be determined at the same time as those between web pages.
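As a non-limiting illustration of the HTML node label sequence measure, the sketch below extracts the sequence of opening tags from each page and compares the two sequences. The use of Python's `difflib.SequenceMatcher` ratio as the similarity score is an assumption of this sketch; the description does not prescribe a particular sequence-comparison algorithm.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect the opening-tag names of a document in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def dom_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


# Hypothetical pages: the first two share a layout, the third does not.
page_en = "<html><body><h1>Products</h1><ul><li>A</li><li>B</li></ul></body></html>"
page_ja = "<html><body><h1>製品</h1><ul><li>A</li><li>B</li></ul></body></html>"
page_other = "<html><body><p>Contact</p></body></html>"

same_layout = dom_similarity(page_en, page_ja)
different_layout = dom_similarity(page_en, page_other)
```

Because the text content is ignored, the score depends only on page structure, which is what allows pages in different languages to be compared.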

Referring again to FIG. 3, the first, second, and third parallel relationship results determined by the directory structure analysis unit 301, the DOM structure analysis unit 302, and the page block set analysis unit 303, respectively, are sent to the first matching unit 304 to be reconciled. As described above, the first matching unit 304 can assign different weights to the three analysis results. The reconciled parallel relationship result can either be sent directly to the multilingual object hierarchy generation means 107 as the final result, from which the multilingual object hierarchy is generated, or be sent to the parallel relationship complementing means 106 as an intermediate result to be complemented. To avoid missing possible parallel relationships, the complementing means supplements the determined parallel relationships by, for example, hyperlink structure analysis or analysis of the monolingual hierarchies of the subordinate web sites.

The parallel relationship complementing means 106 shown as an example in FIG. 3 includes a hyperlink structure complementing unit 305, a monolingual object hierarchy complementing unit 306, and a second matching unit 307. This configuration is likewise only an example and should not be regarded as limiting the scope of the present invention. The monolingual object hierarchy complementing unit 306 and the second matching unit 307 can be used separately or in combination. The second matching unit 307 reconciles the different complementing results according to weights determined in advance for each complementing method and determines the final complemented parallel relationships. The weights applicable in the complementing process are, of course, independent of those applicable in the determination process.

The hyperlink structure complementing unit 305 analyzes the hyperlink relationships between web pages to obtain the semantic topological relationships among the web pages within the same monolingual subordinate web site. The semantic topological relationships of different monolingual subordinate web sites are then compared to determine whether two web pages are parallel. Navigation paths, for example, can be chosen to represent the semantic topology of web pages. Once the navigation paths of each monolingual subordinate web site have been generated, the similarity of the navigation paths of different monolingual subordinate web sites can be compared to determine whether parallel web pages exist. A rule for this comparison can be set, for example, as follows. For a web page p in web site 1, consider all navigation paths that involve p; let p_1, …, p_m be the web pages that lead to p along these paths, and let c_1, …, c_n be the web pages to which p leads along these paths. Likewise, for a web page p' in web site 2, consider all navigation paths that involve p'; let p'_1, …, p'_k be the web pages that lead to p', and let c'_1, …, c'_r be the web pages to which p' leads. If m = k, every (p_i, p'_i) (i = 1, …, m) is a parallel web page pair, and the total number q of parallel web page pairs between the two sets {c_i} (i = 1, …, n) and {c'_j} (j = 1, …, r) exceeds a preset threshold t (where t is related to the minimum of n and r), then (p, p') can also be determined to be a parallel web page pair.
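As a non-limiting illustration, the rule above might be sketched as follows. The sketch assumes that navigation paths are approximated by direct hyperlinks, that the already determined parallel pairs are given as a one-to-one mapping `partner` from pages of web site 1 to pages of web site 2, and that the threshold is t = ratio × min(n, r) with ratio = 0.5. These choices, the function names, and the page identifiers are assumptions of the sketch.

```python
from collections import defaultdict


def build_neighbours(links):
    """Index a list of (source, target) hyperlinks by page."""
    preds, succs = defaultdict(set), defaultdict(set)
    for src, dst in links:
        succs[src].add(dst)
        preds[dst].add(src)
    return preds, succs


def infer_parallel(p, q, preds, succs, partner, ratio=0.5):
    """Decide whether (p, q) is parallel from the pages around p and q."""
    p_in, q_in = preds.get(p, set()), preds.get(q, set())
    p_out, q_out = succs.get(p, set()), succs.get(q, set())
    # m = k, and every page leading to p is parallel to a page leading to q.
    if len(p_in) != len(q_in) or {partner.get(x) for x in p_in} != q_in:
        return False
    # Count the parallel pairs among the pages that p and q lead to.
    matched = sum(1 for c in p_out if partner.get(c) in q_out)
    threshold = ratio * min(len(p_out), len(q_out))
    return matched > threshold


links = [
    ("en/home", "en/products"), ("en/products", "en/a"),
    ("en/products", "en/b"), ("en/products", "en/c"),
    ("ja/home", "ja/products"), ("ja/products", "ja/a"),
    ("ja/products", "ja/b"), ("ja/products", "ja/x"),
]
preds, succs = build_neighbours(links)
partner = {"en/home": "ja/home", "en/a": "ja/a", "en/b": "ja/b"}

is_parallel = infer_parallel("en/products", "ja/products", preds, succs, partner)
```

In this example the two product pages are reached from parallel home pages and two of their three outgoing pages are already known to be parallel, so with a threshold of 1.5 the pair is accepted as a new parallel pair.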

In addition, the monolingual object hierarchy complementing unit 306 uses the monolingual hierarchy of each monolingual subordinate web site, extracted by the monolingual object hierarchy extraction means 104, to determine whether web pages or page blocks are parallel. For example, suppose that a web page or page block p written in language 1 corresponds to an object o in the hierarchy, that the web pages or page blocks corresponding to the higher-level objects of o are p_1, …, p_m, and that those corresponding to the lower-level objects of o are c_1, …, c_n. Suppose likewise that a web page or page block p' written in language 2 corresponds to an object o' in the hierarchy, that the web pages or page blocks corresponding to the higher-level objects of o' are p'_1, …, p'_k, and that those corresponding to the lower-level objects of o' are c'_1, …, c'_r. If m = k, every (p_i, p'_i) (i = 1, …, m) is a parallel pair, and the total number q of parallel pairs between the two sets {c_i} (i = 1, …, n) and {c'_j} (j = 1, …, r) exceeds a preset threshold t (where t is related to the minimum of n and r), then (p, p') can also be determined to be a parallel pair.

Once the parallel relationships between the monolingual subordinate web sites have been determined, the multilingual object hierarchy generation means 107 can obtain the parallel relationships between the hierarchies of the different languages by directly referring to the correspondence, stored in the monolingual object hierarchy storage device 110, between the objects in each monolingual hierarchy and the web pages or page blocks. The final multilingual object hierarchy is thereby obtained. FIG. 4 shows an example of this process. As shown in FIG. 4, a final object hierarchy with multilingual versions (Chinese and English in this example) is generated by referring to the parallel relationships between the Chinese and English subordinate web sites, the monolingual hierarchies corresponding to the Chinese and English subordinate web sites, and the correspondences A and B between the objects and their web pages (page blocks).
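As a non-limiting illustration of this generation step, the sketch below takes one monolingual hierarchy as the skeleton and attaches to each object the name of the object in the other language, found by following object → page → parallel page → object. The data layout, the function name `merge_hierarchies`, and the sample objects and page names are assumptions of the sketch; objects that exist only in the other language are not handled here.

```python
def merge_hierarchies(base_tree, base_obj_page, other_page_obj, parallel,
                      base_lang, other_lang):
    """Attach other-language labels to the objects of a base hierarchy.

    base_tree:      {object: [child objects]} for the base language
    base_obj_page:  {object: page} recorded for the base language
    other_page_obj: {page: object} recorded for the other language
    parallel:       {base-language page: other-language page}
    """
    merged = {}
    for obj, children in base_tree.items():
        labels = {base_lang: obj}
        other_page = parallel.get(base_obj_page.get(obj))
        if other_page in other_page_obj:
            labels[other_lang] = other_page_obj[other_page]
        merged[obj] = {"labels": labels, "children": list(children)}
    return merged


# Hypothetical Chinese and English subordinate web sites.
zh_tree = {"产品": ["笔记本", "服务器"], "笔记本": [], "服务器": []}
zh_obj_page = {
    "产品": "zh/products.html",
    "笔记本": "zh/laptops.html",
    "服务器": "zh/servers.html",
}
en_page_obj = {"en/products.html": "Products", "en/laptops.html": "Laptops"}
parallel = {
    "zh/products.html": "en/products.html",
    "zh/laptops.html": "en/laptops.html",
}

multilingual = merge_hierarchies(zh_tree, zh_obj_page, en_page_obj, parallel,
                                 "zh", "en")
```

Objects whose page has no known parallel page, such as the third object in this example, simply keep a single-language label, which matches the statement that each object may, but need not, have versions in different languages.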

The multilingual object hierarchy extraction method and system according to the present invention have been described above with reference to the accompanying drawings. With the method of the present invention, documents need not be labeled manually, and the operating parameters are independent of any particular domain or language. The present invention therefore greatly improves extraction efficiency and scalability compared with existing methods. Furthermore, because the method and system of the present invention exploit the multilingual parallel correspondences within a multilingual web site, highly accurate results are ensured.

Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, the invention is not limited to the specific configurations and processes shown in the drawings. Details of known methods and techniques have been omitted for brevity. The embodiments above illustrate several specific steps, but the method and process of the present invention are not limited to the specific steps described and illustrated; once the spirit of the invention is understood, those skilled in the art can make various modifications, changes, and additions, and can change the order of the steps.

The elements of the present invention can be implemented as hardware, software, firmware, or a combination thereof, and can be used within a system, subsystem, component, or subcomponent. When implemented in software, the elements of the invention are programs or code sections that perform the necessary tasks. These programs or code sections can be stored on a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried on a carrier wave. A "machine-readable medium" includes any medium that can store or transmit information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, EROM, floppy disks, CD-ROMs, optical disks, hard disks, optical fiber media, and RF links. The code sections can be downloaded over a computer network such as the Internet or an intranet.

The present invention can be implemented in various other forms without departing from its spirit and essential characteristics. For example, the algorithms described in the embodiments can be modified as long as the system architecture does not depart from the basic spirit of the invention. The above embodiments are therefore to be considered in all respects illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the foregoing description, and all variations falling within the scope of the claims, and their equivalents, are included in the scope of the invention.

101: Input means
102: Page block set generation means
103: Monolingual subordinate web site decomposition means
104: Monolingual object hierarchy extraction means
105: Parallel relationship determination means
106: Parallel relationship complementing means
107: Multilingual object hierarchy generation means
108: Multilingual web site page storage device
109: Monolingual subordinate web site storage device
110: Monolingual object hierarchy storage device
111: Multilingual subordinate web site parallel relationship storage device
112: Multilingual object hierarchy storage device
301: Directory structure analysis unit
302: DOM structure analysis unit
303: Page block set analysis unit
304: First matching unit
305: Hyperlink structure complementing unit
306: Monolingual object hierarchy complementing unit
307: Second matching unit

Claims

A method for extracting a multilingual object hierarchy from a multilingual web site, comprising:
inputting the web pages of the multilingual web site;
decomposing the web site into a set of subordinate web sites classified by language, such that the web pages of each subordinate web site are in the same language;
extracting a monolingual object hierarchy from each subordinate web site and recording the correspondence between each object in the hierarchy and its corresponding web page;
determining parallel relationships between different-language web pages in different subordinate web sites; and
generating the multilingual object hierarchy of the multilingual web site according to the extracted monolingual object hierarchy of each subordinate web site, the recorded correspondence between the objects and their corresponding web pages, and the determined parallel relationships between the different-language web pages.

The method of claim 1, further comprising generating a page block set for each web page.

The multilingual object hierarchy extraction method according to claim 2, further comprising:
recording the correspondence between each object and its corresponding page block; and
determining parallel relationships between page blocks of different-language web pages,
wherein the recorded correspondence between each object and its corresponding page block and the determined parallel relationships between the page blocks of different-language web pages are used as references in generating the multilingual object hierarchy.

The multilingual object hierarchy extraction method according to claim 1, wherein the parallel relationship between the different language web pages is determined by a web directory structure of a web site.

The multilingual object hierarchy extraction method according to claim 1, wherein the parallel relationship between the different language web pages is determined by comparing similarities of DOM structures of the different language web pages.

The multilingual object hierarchy extraction method according to claim 1, wherein the parallel relationship between the different language web pages is determined by comparing similarities of page block sets of different language web pages.

The multilingual object hierarchy extraction method according to claim 2, wherein determining the parallel relationships between the different-language web pages comprises:
determining a first parallel relationship result based on the web directory structure of the web site;
determining a second parallel relationship result based on the similarity of the DOM structures of different-language web pages;
determining a third parallel relationship result based on the similarity of the page block sets of different-language web pages; and
reconciling the first, second, and third parallel relationship results based on predetermined weights for the three analysis methods to determine the appropriate parallel relationships between the different-language web pages.

The multilingual object hierarchy extraction method according to any one of claims 4 to 7, further comprising complementing the determined parallel relationships between different-language web pages by analyzing the hyperlink structure between web pages.

5. The method of claim 4, further comprising the step of supplementing the determined parallel relationship between different language web pages by analyzing a monolingual block hierarchy of each extracted sub-web site. The multilingual object hierarchy extraction method according to any one of claims 7 to 9.

The multilingual object hierarchy extraction method according to claim 4, further comprising:
complementing the determined parallel relationships between different-language web pages by analyzing the hyperlink structure between web pages, to obtain a first complemented parallel relationship result;
complementing the determined parallel relationships between different-language web pages by analyzing the monolingual block hierarchy of each extracted subordinate web site, to obtain a second complemented parallel relationship result; and
reconciling the first and second complemented parallel relationship results based on predetermined weights for the two complementing methods to determine the final parallel relationships between different-language web pages.

A system for extracting a multilingual object hierarchy from a multilingual web site, comprising:
input means for inputting the web pages of the multilingual web site;
monolingual subordinate web site decomposition means for decomposing the web site into a set of subordinate web sites classified by language, such that the web pages of each subordinate web site are in the same language;
monolingual object hierarchy extraction means for extracting a monolingual object hierarchy from each subordinate web site and recording the correspondence between each object in the hierarchy and its corresponding web page;
parallel relationship determination means for determining parallel relationships between different-language web pages in different subordinate web sites; and
multilingual object hierarchy generation means for generating the multilingual object hierarchy of the multilingual web site according to the extracted monolingual object hierarchy of each subordinate web site, the recorded correspondence between the objects and their corresponding web pages, and the determined parallel relationships between the different-language web pages.

The multilingual object hierarchy extraction system according to claim 11, further comprising page block set generation means for generating a page block set for each web page.

The multilingual object hierarchy extraction system according to claim 12, wherein
the monolingual object hierarchy extraction means records the correspondence between each object and its corresponding page block,
the web page parallel relationship determination means determines parallel relationships between page blocks of different-language web pages, and
the recorded correspondence between each object and its corresponding page block and the determined parallel relationships between the page blocks of different-language web pages are provided to the multilingual object hierarchy generation means for use in generating the multilingual object hierarchy.

12. The web page parallel relation determining unit includes a directory structure analyzing unit that determines a parallel relation between different language web pages by analyzing a web directory structure of a web site. Multilingual object hierarchy extraction system.

The multilingual object hierarchy extraction system according to claim 11, wherein the web page parallel relationship determination means includes a DOM structure analysis unit that determines parallel relationships between different-language web pages by comparing the similarity of the DOM structures of the different-language web pages.

The multilingual object hierarchy extraction system according to claim 11, wherein the web page parallel relationship determination means includes a page block set analysis unit that determines parallel relationships between different-language web pages by comparing the similarity of the page block sets of the different-language web pages.

The multilingual object hierarchy extraction system according to claim 11, wherein the web page parallel relationship determination means includes:
a directory structure analysis unit that determines a first parallel relationship result by analyzing the web directory structure of the web site;
a DOM structure analysis unit that determines a second parallel relationship result by comparing the similarity of the DOM structures of different-language web pages;
a page block set analysis unit that determines a third parallel relationship result by comparing the similarity of the page block sets of different-language web pages; and
a first matching unit that reconciles the first, second, and third parallel relationship results based on predetermined weights for the directory structure analysis unit, the DOM structure analysis unit, and the page block set analysis unit to determine the appropriate parallel relationships between different-language web pages.

The multilingual object hierarchy extraction system according to any one of claims 14 to 17, further comprising parallel relation complementing means for complementing the judged parallel relation between different language web pages.

The multilingual object hierarchy extraction system according to claim 18, wherein the parallel relationship complementing means includes a hyperlink structure complementing unit that complements the determined parallel relationships between different-language web pages by analyzing the hyperlink structure between web pages.

The multilingual object hierarchy extraction system according to claim 18, wherein the parallel relationship complementing means includes a monolingual object hierarchy complementing unit that complements the determined parallel relationships between different-language web pages by analyzing the monolingual block hierarchy of each extracted subordinate web site.

The multilingual object hierarchy extraction system according to claim 18, wherein the parallel relationship complementing means includes:
a hyperlink structure complementing unit that complements the determined parallel relationships between different-language web pages by analyzing the hyperlink structure between web pages and generates a first complemented parallel relationship result;
a monolingual object hierarchy complementing unit that complements the determined parallel relationships between different-language web pages by analyzing the monolingual block hierarchy of each extracted subordinate web site and generates a second complemented parallel relationship result; and
a second matching unit that reconciles the first and second complemented parallel relationship results based on predetermined weights for the hyperlink structure complementing unit and the monolingual object hierarchy complementing unit to determine the final parallel relationships between different-language web pages.