JP2020098592A

JP2020098592A - Method, device and storage medium of extracting web page content

Info

Publication number: JP2020098592A
Application number: JP2019221285A
Authority: JP
Inventors: 迎炬夏; Yingju Xia; ジョン・ジョォングアン; Zhongguang Zheng; 遥孟; Yao Meng; チェヌ・イェヌ; Yan Chen
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-12-18
Filing date: 2019-12-06
Publication date: 2020-06-25
Anticipated expiration: 2039-12-06
Also published as: CN111339396A; CN111339396B; JP7347179B2

Abstract

To provide a method and a device for extracting web page content. SOLUTION: The method of extracting web page content includes the steps of: calculating the similarity between a web page feature and a representative set of at least one web page feature cluster, the representative set including samples of web page features that have a relatively high degree of similarity to one another in the corresponding web page feature cluster; determining the representative set that has the highest similarity to the web page feature; updating the web page feature cluster associated with the determined representative set using the web page feature; recalculating the representative set of the updated web page feature cluster; and extracting content from the web page on the basis of the extraction template associated with the updated web page feature cluster. SELECTED DRAWING: Figure 2

Description

The present invention relates to a method, a device, and a computer-readable storage medium for extracting web page content.

With the rapid development of the Internet, fixed access devices, and mobile access terminals, web pages have become the main medium through which people obtain and create information. However, as the number of web pages has surged, it has become difficult to obtain the necessary information quickly and accurately.

With the dramatic growth of digital resources and information on the Internet, a large amount of information is readily available for users to browse. It is therefore also necessary to extract automatically, according to need, the information that has to be extracted. Methods and systems for extracting web page content usually concern the extraction of data from semi-structured web documents. Their core is to extract the information points implicitly contained in semi-structured HTML pages distributed across the network and to express them in a more structured and semantically clearer form, which makes it convenient for users to search for data on the Web and for application programs to use Web data directly. Because extracting information from web page content is the first step in Internet information processing, the accuracy of the extraction can directly affect subsequent processing. The purpose of information extraction is to remove noise and obtain the valuable information in a web page, for example its title, time, text, and links.

Conventional web page information extraction methods include rule-based extraction methods and machine-learning-based extraction methods. Rule-based methods can achieve relatively high extraction accuracy, but the process of creating the rules requires expert involvement. Because human involvement is required, such manual annotation may be effective for small amounts of data but cannot handle massive amounts of data. Compared with rule-based extraction methods, machine-learning-based methods do not require human involvement. However, such methods often require a large annotated corpus. Moreover, because the annotated corpus has to be created manually, conventional machine-learning-based information extraction methods also have limitations.

An object of the present invention is to provide a method, a device, and a computer storage medium for extracting web page information. Compared with the prior art, the present invention can be used to process massive amounts of data, does not require extensive manual annotation, and has higher accuracy, so that the necessary information can be extracted appropriately according to need.

To achieve the above object, according to one aspect of the present invention, a method of extracting web page content is provided, which includes: calculating the similarity between a web page feature and a representative (typical) set of at least one web page feature cluster, the representative set including samples of web page features that have relatively high similarity to one another within the corresponding web page feature cluster; determining the representative set that has the highest similarity to the web page feature; updating, using the web page feature, the web page feature cluster associated with the determined representative set; recalculating the representative set of the updated web page feature cluster; and extracting content from the web page based on the extraction template associated with the updated web page feature cluster.

According to another aspect of the present invention, a device for extracting web page content is further provided, which includes at least one processor configured to execute the method of extracting web page content.

According to another aspect of the present invention, a computer-readable storage medium storing computer-readable program instructions is further provided; when executed by a computer, the program instructions implement the method of extracting web page content.

FIG. 1 is a diagram showing an example of content to be extracted from a web page.

FIG. 2 is a flowchart of a web page content extraction method according to an embodiment of the present invention.

FIG. 3A is a diagram showing part of an example web page.

FIG. 3B is a diagram showing an example of converting the part of the web page shown in FIG. 3A into a DOM tree.

FIG. 4 is a block diagram of a web page information extraction system according to an embodiment of the present invention.

FIG. 5 shows the configuration of a general-purpose device that can implement the web page content extraction device according to an embodiment of the present invention.

Hereinafter, preferred embodiments for carrying out the present disclosure are described in detail with reference to the accompanying drawings. These embodiments are merely examples and do not limit the present disclosure.

FIG. 1 is a diagram showing an example of content to be extracted from a web page.

Specifically, as an example, FIG. 1 shows a sample recruitment web page, in which the content in the relatively small rectangular area on the left and the content in the relatively large rectangular area on the right are samples of content to be extracted. As shown in FIG. 1, the content to be extracted in the small rectangle on the left is information about educational background, and the content to be extracted in the large rectangle on the right is information about job content. In this sample web page the content to be extracted concerns educational background and job content, but what is to be extracted of course depends on the user's needs, and the embodiments of the present invention described below can be used to extract whatever types of information are required. Accordingly, the web page content to be extracted may be several paragraphs of text or data in table form. The embodiments of the present invention place no limitation on the type of web page content to be extracted. The web page information extraction method and device according to the embodiments can be applied to web pages of various formats and to content of various formats.

The web page content extraction method according to an embodiment of the present invention is described below with reference to FIG. 2. FIG. 2 is a flowchart of the steps of the web page content extraction method according to the embodiment.

Refer to FIG. 2. In step 201, the similarity between a web page feature and the representative set of at least one web page feature cluster is calculated.

The web page feature here may be, for example, a web page feature in the form of a Document Object Model (DOM) label. According to the embodiments of the present invention, web pages can be obtained by scraping or downloading them from multiple websites, and the scraped or downloaded web pages can be converted into document object model trees. FIG. 3B shows an example of a converted document object model tree. After all scraped or downloaded web pages have been converted into document object model trees, the similarity between the web page features in each web page and the representative sets of the web page feature clusters is calculated page by page. Alternatively, the processing of step 201 may be performed immediately after a single web page is scraped. The DOM tree model, as one form of web page representation, is briefly introduced below.

FIG. 3A is a diagram showing part of an example web page.

FIG. 3B is a diagram showing an example of converting the part of the web page shown in FIG. 3A into a document object model tree.

Refer to FIG. 3B. Based on the definition of a leaf-node pattern, the pattern of the leaf node “川崎市” (Kawasaki City) in FIG. 3B is, for example, “text_strong_p_div_川崎市”, in which the path of the pattern is “text_strong_p_div_” and the content is “川崎市”. Similarly, the patterns of the other leaf nodes are “text_h2_div_川崎市”, “text_p_div_麻生区弁公室” (where “麻生区弁公室” means the Asao Ward office), and so on. “text_strong_p_div_川崎市” is a DOM label.
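The leaf-node patterns described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the class name `LeafPatternParser`, the helper `leaf_patterns`, and the HTML fragment are all hypothetical, and the fragment is only a guess at markup that would yield the three patterns quoted in the text. The sketch assumes that a path lists the ancestors from the leaf up to the root, joined with underscores and prefixed with `text`.

```python
from html.parser import HTMLParser


class LeafPatternParser(HTMLParser):
    """Collect one pattern per text leaf: 'text_' + ancestors (leaf to root) + '_' + content."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of currently open elements, root first
        self.patterns = []   # collected leaf-node patterns

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (handles simple, well-nested markup only).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "_".join(["text"] + self.stack[::-1]) + "_"
            self.patterns.append(path + text)


def leaf_patterns(html):
    """Return the leaf-node patterns of an HTML fragment in document order."""
    parser = LeafPatternParser()
    parser.feed(html)
    parser.close()
    return parser.patterns


# Hypothetical fragment, assumed for illustration only.
fragment = "<div><h2>川崎市</h2><p><strong>川崎市</strong>麻生区弁公室</p></div>"
print(leaf_patterns(fragment))
```

Void elements such as `<br>` are not treated specially here, so real pages would need a more tolerant parser.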

An example in which a DOM label is used as the web page feature is described below. The example is illustrative and does not limit the form of the web page feature. The embodiments of the present invention are not limited to web pages whose features are DOM labels and can be applied to any Internet content. The DOM is briefly explained here. The Document Object Model (DOM) allows the content and structure of a document to be accessed and modified in a platform-independent and language-independent manner. The DOM allows a page to be changed dynamically, for example by dynamically showing or hiding an element, changing its attributes, or adding an element, which greatly improves the interactivity of the page. The DOM is in fact a document model described in an object-oriented way: it defines the objects needed to display and modify a document, the behaviors and attributes of these objects, and the relationships among them. The DOM can be regarded as a tree-like representation of the data and structure of a page, although a page can also be represented in ways other than the DOM tree shown in FIG. 3B.

As an example, when the web page content to be extracted is represented by a DOM label, a sample of a web page feature in the form of a DOM label can be expressed as follows.

“<tag1><tag2>……<tagn>C1”
One specific example in which a DOM label is used as a web page feature is as follows.

“<html><body><div[1]><div[2]><div[1]><div[4]><div[1]><div><dl><dt[6]>text content”
When the web page content to be extracted is relational data that contains attributes and attribute values, one sample can be expressed as follows.

“<tag11><tag12>……<tag1n>R1<tag21><tag22>……<tag2n>C1”
Here, R1 and C1 are the attribute and the attribute value to be extracted, respectively.

One specific example of a web page feature in the form of relational data is as follows.

“<html><body><table><tbody><tr><td><table><tbody><tr[2]><td><table><tbody><tr[8]><td[1]>Required Education<html><body><table><tbody><tr><td><table><tbody><tr[2]><td><table><tbody><tr[8]><td[2]>4 Year Degree”
As can be seen from the above expression, “Required Education” is the attribute, and “4 Year Degree”, which indicates a university degree, is the attribute value of that attribute.

Thus, a sample of a web page feature is, for example, a DOM-label sample or a relational-data sample as described above. Of course, web page features are not limited to these two forms and may be data in any appropriate form.

Suppose there are k web page feature clusters to be extracted, {R1, R2, …, Rk}, each of which has several samples. Then there are k representative sets, {C1, C2, …, Ck}, each of which has several representative sample points; a representative set contains samples of web page features that have relatively high similarity to one another within the corresponding web page feature cluster.

Thus, after the web page containing the content to be extracted has been parsed to obtain samples of web page features, step 201 calculates the similarity between a web page feature sample and the representative sets {C1, C2, …, Ck} of the existing at least one web page feature cluster {R1, R2, …, Rk}. The existing web page feature clusters {R1, R2, …, Rk} and their representative sets {C1, C2, …, Ck} are the most recently updated ones. Because the initial seed web page feature clusters and their representative sets are small, they may be determined manually; starting from these manually determined clusters and representative sets, automatic iterative updating yields the most recently updated web page feature clusters {R1, R2, …, Rk} and their representative sets {C1, C2, …, Ck}. The web page feature sample here may be one or more samples, that is, a set of samples of the web page information to be extracted.

In step 201, the similarity between one web page feature sample and one representative set is calculated. Specifically, the similarity between the web page feature sample and each web page feature in the representative set is calculated, and the average of these similarities is taken as the similarity between the sample and the representative set. The similarity calculation according to the embodiments of the present invention is not limited to this method; any method that can calculate the similarity between a sample and a set containing other samples may be adopted. The method of calculating the similarity between samples is described below.
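The averaging in step 201 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and `sample_similarity` uses a plain string ratio from Python's `difflib` only as a stand-in for the learned similarity that the document describes next.

```python
from difflib import SequenceMatcher


def sample_similarity(a, b):
    """Stand-in pairwise similarity in [0, 1] (assumption: plain string ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def set_similarity(sample, representative_set, sim=sample_similarity):
    """Average of the similarities between `sample` and each member of the set."""
    if not representative_set:
        raise ValueError("representative set must not be empty")
    return sum(sim(sample, r) for r in representative_set) / len(representative_set)


# Example with two hypothetical representative samples.
reps = ["text_td_tr_table_Required Education", "text_td_tr_table_Education"]
print(set_similarity("text_td_tr_table_Education Level", reps))
```

Passing the pairwise function as a parameter keeps the averaging independent of how the similarity between two samples is obtained.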

<Calculation of similarity between samples>
The method of calculating the similarity between samples is described here. The similarity calculation is used to calculate the similarity between samples. A sample here includes the information to be extracted and the features related to that information. To improve the accuracy of the similarity calculation, in particular between web page formats that are expressed differently, the similarity calculation method needs to be learned.

The task of similarity learning is to learn the similarity between samples. Because web pages take many different forms, calculating the similarity between samples is very difficult. A common approach is to map the samples into another space in which samples of the same class are close together and samples of different classes are far apart. Accordingly, to determine the similarity between web page features using a trained neural network, the similarity calculation in each embodiment of the present invention is learned by two parallel networks with shared weights. Such a network can be applied to classification problems that have many classes or in which the full set of training samples is not available for training with earlier methods. Specifically, the similarity calculation in each embodiment maps each input pair of samples into one space and adjusts the parameters so that the distance between the pair in that space reflects whether the two samples belong to the same class.
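The two-branch, shared-weight arrangement can be illustrated with a small pure-Python sketch. The document does not specify the network architecture or the training objective, so everything here is an assumption for illustration: a single linear layer with `tanh` as the shared mapping, Euclidean distance in the mapped space, and a contrastive loss with a hypothetical `margin` parameter as the quantity that parameter adjustment would minimize.

```python
import math


def embed(x, weights):
    """Shared mapping: one linear layer followed by tanh. Both branches call this with the same weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def distance(x1, x2, weights):
    """Euclidean distance between the two inputs after the shared mapping."""
    e1, e2 = embed(x1, weights), embed(x2, weights)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))


def contrastive_loss(x1, x2, same_class, weights, margin=1.0):
    """Small for same-class pairs that are close and for different-class pairs at least `margin` apart."""
    d = distance(x1, x2, weights)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical 2-D feature vectors and identity weights, for illustration only.
weights = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss([0.5, -0.5], [0.4, -0.6], True, weights))
print(contrastive_loss([0.5, -0.5], [-0.9, 0.8], False, weights))
```

Training would adjust `weights` to reduce this loss over labeled sample pairs; the optimizer is omitted here, and a similarity score could then be derived from the learned distance.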

Next, refer again to FIG. 2. In step 202, the representative set that has the highest similarity to the web page feature sample is determined. Specifically, the similarities calculated in step 201 between the web page feature sample and each representative set are compared to determine the representative set with the highest similarity to the sample.

After the representative set with the highest similarity to the web page feature sample has been determined, in step 203 of FIG. 2 the web page feature is used to update the web page feature cluster associated with the determined representative set. Specifically, the web page feature is merged into the determined representative set that corresponds to it, which updates the web page feature cluster associated with that representative set.
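Steps 201 to 203 can be sketched together as below. This is an illustrative reading, not the patent's implementation: the function names and dictionary layout are hypothetical, the merge is assumed to be a simple append to the member list of the best-matching cluster, and `token_overlap` is a toy similarity used only to make the example runnable.

```python
def mean_similarity(sample, members, sim):
    """Average similarity between a sample and the members of a non-empty set."""
    return sum(sim(sample, m) for m in members) / len(members)


def assign_and_merge(sample, clusters, representatives, sim):
    """Score every representative set, pick the best one, and merge the sample.

    clusters:        dict cluster_id -> list of samples (R_1 .. R_k)
    representatives: dict cluster_id -> non-empty list of representative samples (C_1 .. C_k)
    Returns the id of the cluster that received the sample.
    """
    scores = {cid: mean_similarity(sample, reps, sim) for cid, reps in representatives.items()}
    best = max(scores, key=scores.get)   # step 202: highest similarity
    clusters[best].append(sample)        # step 203: assumed merge by appending
    return best


def token_overlap(a, b):
    """Toy similarity (assumption): Jaccard overlap of '_'-separated tokens."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)


# Hypothetical clusters keyed by the kind of content they describe.
clusters = {"education": ["text_td_tr_table"], "duties": ["text_li_ul_div"]}
representatives = {"education": ["text_td_tr_table"], "duties": ["text_li_ul_div"]}
print(assign_and_merge("text_td_tr_tbody_table", clusters, representatives, token_overlap))
```

The representative set itself is left unchanged here, since the document recalculates it in a separate step after the cluster has been updated.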

Next, refer again to FIG. 2. In step 204, the representative set of the updated web page feature cluster is recalculated.

Suppose the number of representative sample points in the determined representative set is M (M is an integer greater than 1). The number of representative sample points in a representative set may be a preset fixed value, may be controlled by a predetermined threshold, or may vary according to other constraints. For illustration, the embodiments of the present invention use the following steps to update the representative sample points.

Using one sample from the set of web page feature samples of step 201, the representative sample points of each cluster C_m are updated according to the following formulas.

By substituting formula (2) into formula (1) and performing the calculation, the top M samples that maximize the value of the expression to the right of argmax in formula (1) are selected (the argmax function itself is not explained here). Here, sim denotes the similarity, X_mi denotes the i-th sample in class C_m, X_mj denotes the j-th sample in class C_m, and P_k is another cluster different from the cluster C_m to which X_mi belongs. Alternatively, according to the following formula (3), the samples whose similarity ratio is larger than a predetermined threshold θ are selected.
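Formulas (1) to (3) are not reproduced in this text, so the following Python sketch is only one plausible reading of the surrounding description, not the patent's formula. It assumes that each sample X_mi is scored by the ratio of its mean similarity to the other samples of its own cluster C_m to its mean similarity to the samples of the other clusters P_k, and that either the top M samples or all samples whose ratio exceeds θ are kept. All function names, the `eps` guard, and the toy similarity are hypothetical.

```python
def ratio_scores(own_cluster, other_clusters, sim, eps=1e-9):
    """Assumed score per sample: mean intra-cluster similarity / mean inter-cluster similarity."""
    others = [y for cluster in other_clusters for y in cluster]
    scores = []
    for i, x in enumerate(own_cluster):
        peers = [y for j, y in enumerate(own_cluster) if j != i]
        intra = sum(sim(x, y) for y in peers) / len(peers) if peers else 0.0
        inter = sum(sim(x, y) for y in others) / len(others) if others else 0.0
        scores.append(intra / (inter + eps))   # eps avoids division by zero
    return scores


def select_representatives(own_cluster, other_clusters, sim, m=None, theta=None):
    """Return the top-m samples by score, or every sample whose score exceeds theta."""
    scores = ratio_scores(own_cluster, other_clusters, sim)
    if theta is not None:
        return [x for x, s in zip(own_cluster, scores) if s > theta]
    ranked = sorted(zip(scores, range(len(own_cluster))), reverse=True)
    return [own_cluster[i] for _, i in ranked[:m]]


def inverse_distance(a, b):
    """Toy similarity on numbers (assumption), used only for this example."""
    return 1.0 / (1.0 + abs(a - b))


# Two tight samples and one outlier in the cluster; one other cluster far away.
print(select_representatives([1.0, 1.1, 5.0], [[9.0, 10.0]], inverse_distance, m=2))
```

Under this reading, a sample is representative when it sits close to its own cluster and far from the others, which matches the description of a similarity ratio compared against a threshold.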

After the web page feature cluster has been updated as described above, another sample is selected from the set of web page feature samples and this step is repeated until convergence or until a predetermined maximum number of iterations is reached.

Finally, refer again to FIG. 2. In step 205, content is extracted from the web page based on the extraction template associated with the updated web page feature cluster. Specifically, once the updated clusters have been obtained, the extraction template associated with each updated cluster can be used to extract from unseen data. The content of the data that matches an extraction template is extracted. In one example according to the embodiments of the present invention, the representative sample points are used to represent the extraction template. In that case, when a new sample of information to be extracted is obtained, the similarity between the sample and the representative sample points of each updated cluster is calculated. The extraction template associated with the cluster with the highest similarity is selected as the template for that sample, and information is extracted from the sample with it. Because extracting information with an extraction template associated with a cluster is well known to those skilled in the art, only its application in the embodiments of the present invention is described here, and a description of the concrete implementation is omitted.
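The template selection in step 205 can be sketched as follows. The document leaves the form of an extraction template open, so this sketch makes several assumptions for illustration: a sample is a `(path, content)` pair, a template is simply the field name under which the content is stored, and `path_similarity` is a toy token overlap rather than the learned similarity. All names are hypothetical.

```python
def path_similarity(a, b):
    """Toy similarity (assumption): Jaccard overlap of '_'-separated path tokens."""
    ta, tb = set(a.split("_")), set(b.split("_"))
    return len(ta & tb) / len(ta | tb)


def extract(sample, representatives, templates, sim=path_similarity):
    """Pick the template of the cluster whose representative points best match the sample.

    sample:          (path, content) pair for a new piece of web page content
    representatives: dict cluster_id -> non-empty list of representative paths
    templates:       dict cluster_id -> field name (stand-in for an extraction template)
    """
    path, content = sample

    def score(cluster_id):
        reps = representatives[cluster_id]
        return sum(sim(path, r) for r in reps) / len(reps)

    best = max(representatives, key=score)
    return {templates[best]: content}


# Hypothetical clusters and templates.
representatives = {"c1": ["text_td_tr_table"], "c2": ["text_li_ul_div"]}
templates = {"c1": "required_education", "c2": "job_content"}
print(extract(("text_td_tr_tbody_table", "4 Year Degree"), representatives, templates))
```

A real template would describe how to locate and clean the content on the page; the field name here only shows how the choice of cluster drives the choice of template.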

FIG. 4 is a block diagram of a web page information extraction system according to an embodiment of the present invention.

The system 400 shown in FIG. 4 includes a seed sample storage unit 401, a similarity learning unit 402, a similarity calculation unit 403, a representative point calculation unit 404, a representative point storage unit 405, a classification unit 406, an input unit 407, and an information extraction unit 408. The seed sample storage unit 401 stores seed samples; the similarity learning unit 402 learns the similarity between samples; the similarity calculation unit 403 calculates the similarity between samples; the representative point calculation unit 404 determines representative points by calculation; the representative point storage unit 405 stores the representative points; the classification unit 406 assigns samples to the corresponding clusters based on similarity; the input unit 407 inputs web pages; and the information extraction unit 408 extracts web page content.

According to the embodiments of the present invention, the samples of web page features in a representative set that have relatively high similarity to one another are obtained from web page content of the same class (type). For example, the “required education” and “job content” mentioned above belong to different classes. A class may also be one that represents a type of web page feature. According to the embodiments, the definition of the classes can be adjusted adaptively to suit the needs of the application; this is done by manually adjusting the seed samples stored in the seed sample storage unit 401 shown in FIG. 4. Furthermore, according to the embodiments, the number of representative sets is equal to the number of types of web page content to be extracted.

FIG. 5 is a configuration diagram of a general-purpose device 700 that can implement the web page information extraction device and the web page information extraction method according to an embodiment of the present invention. The general-purpose device 700 may be, for example, a computer system. The general-purpose device 700 is merely an example and does not limit the scope of use or the functions of the method and device according to the present invention. Nor does the general-purpose device 700 depend on any module or assembly in the method and device described above, or on any combination of them.

In FIG. 5, a central processing unit (CPU) 701 performs various kinds of processing based on a program stored in a ROM 702 or a program loaded from a storage unit 708 into a RAM 703. The RAM 703 also stores, as needed, the data required when the CPU 701 performs the various kinds of processing. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

The following components are also connected to the input/output interface 705: an input unit 706 including a keyboard and the like; an output unit 707 including a display such as a liquid crystal display (LCD), a speaker and the like; a storage unit 708 including a hard disk and the like; and a communication unit 709 including a network interface card such as a LAN card, a modem and the like. The communication unit 709 performs communication processing via a network such as the Internet or a LAN.

A drive 710 may also be connected to the input/output interface 705 as needed. A removable medium 711, such as a semiconductor memory, is mounted in the drive 710 as needed, so that a computer program read from it can be installed in the storage unit 708.

The present invention further provides a program product containing machine-readable instruction code. When read and executed by a machine, the instruction code can carry out the method of the embodiments of the present disclosure described above. Accordingly, the various storage media that carry such a program product, for example, magnetic disks (including floppy disks (registered trademark)), optical disks (including CD-ROM and DVD), magneto-optical disks (including MD (registered trademark)) and semiconductor memories, are also included in the present disclosure.

The above-mentioned storage media may include, for example, but are not limited to, magnetic disks, optical disks, magneto-optical disks and semiconductor memories.

Each operation (process) in the above-described method can also be implemented as a computer-executable program stored in any of various machine-readable storage media.

In relation to the above embodiments, the following appendices are further disclosed.

(Appendix 1)
A method for extracting web page content, comprising:
calculating the similarity between a web page feature and the representative set of at least one web page feature cluster, the representative set including samples of web page features that have a relatively high similarity to one another within the corresponding web page feature cluster;
determining the representative set having the highest similarity to the web page feature;
updating, using the web page feature, the web page feature cluster associated with the determined representative set;
recalculating the representative set of the updated web page feature cluster; and
extracting content from a web page based on an extraction template associated with the updated web page feature cluster.
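
The five steps of Appendix 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the source: features are plain numeric vectors, cosine similarity stands in for the learned similarity, the similarity to a representative set is taken as the mean over its samples, the representative set is simply the `k` members most similar to the rest of the cluster, and the names `Cluster`, `assign_and_update` and the template strings are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Stand-in similarity (assumption); the source uses a learned similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Cluster:
    template: str                                  # hypothetical template name
    members: list = field(default_factory=list)
    representatives: list = field(default_factory=list)

    def recompute_representatives(self, k=2):
        # Simplified rule: keep the k members most similar to the other members.
        def score(i):
            return sum(cosine(self.members[i], m)
                       for j, m in enumerate(self.members) if j != i)
        ranked = sorted(range(len(self.members)), key=score, reverse=True)
        self.representatives = [self.members[i] for i in ranked[:k]]


def assign_and_update(feature, clusters, k=2):
    """Pick the most similar representative set, then update that cluster."""
    def set_similarity(cluster):
        reps = cluster.representatives
        return sum(cosine(feature, r) for r in reps) / len(reps)

    best = max(clusters, key=set_similarity)   # highest-similarity set
    best.members.append(feature)               # update the cluster
    best.recompute_representatives(k)          # recalculate its representative set
    return best                                # its template drives extraction


clusters = [
    Cluster("education_template", members=[[1.0, 0.0], [0.9, 0.1]]),
    Cluster("duties_template", members=[[0.0, 1.0], [0.1, 0.9]]),
]
for c in clusters:
    c.recompute_representatives()

chosen = assign_and_update([0.8, 0.2], clusters)
print(chosen.template)  # education_template
```

Comparing a new feature only with the small representative set, rather than with every cluster member, keeps the assignment cost independent of cluster size; that is one plausible reason for the design, not a statement from the source.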

(Appendix 2)
The method according to Appendix 1, wherein
the samples of web page features in the representative set that have a relatively high similarity to one another are obtained from web page content of the same category.

(Appendix 3)
The method according to Appendix 2, wherein
the number of representative sets is equal to the number of categories of web page content to be extracted.

(Appendix 4)
The method according to Appendix 2 or 3, wherein
the categories include a category representing the type of web page feature.

(Appendix 5)
The method according to Appendix 2 or 3, wherein
the definition of the categories is adjusted by adjusting seed samples.

(Appendix 6)
The method according to Appendix 5, wherein
the seed samples are determined manually.

(Appendix 7)
The method according to Appendix 1, wherein
the samples that make up the representative set of the updated web page feature cluster are selected based on a ratio obtained by dividing the sum of the similarities between a web page feature in the updated web page feature cluster and the web page features of the other samples by the sum of its similarities to each representative set.

(Appendix 8)
The method according to Appendix 7, wherein
the representative set of the updated web page feature cluster is formed from a predetermined number of samples having relatively large ratios.
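
The ratio-based selection in Appendices 7 and 8 admits more than one reading. The sketch below follows one of them: for each sample, the numerator sums its similarity to the other samples in the same cluster, and the denominator sums its similarity to every representative set. Cosine similarity, the mean used as set-level similarity, the 2-D toy vectors and all function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Stand-in similarity (assumption); the source uses a learned similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def set_similarity(sample, rep_set):
    """Mean similarity to a representative set (an assumed definition)."""
    return sum(cosine(sample, r) for r in rep_set) / len(rep_set)


def select_representatives(cluster, rep_sets, k):
    """Return the indices of the k samples with the largest ratio, plus all ratios."""
    ratios = []
    for i, sample in enumerate(cluster):
        # Numerator: similarity to the other samples of the same cluster.
        intra = sum(cosine(sample, other)
                    for j, other in enumerate(cluster) if j != i)
        # Denominator: similarity to every representative set.
        total = sum(set_similarity(sample, reps) for reps in rep_sets)
        ratios.append(intra / total if total else 0.0)
    ranked = sorted(range(len(cluster)), key=lambda i: ratios[i], reverse=True)
    return ranked[:k], ratios


def unit(deg):
    """Unit vector at the given angle, used only to build toy data."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


cluster = [unit(0), unit(10), unit(40)]            # updated cluster
rep_sets = [[unit(0), unit(10)], [unit(90)]]       # own set, then another cluster's set
selected, ratios = select_representatives(cluster, rep_sets, k=2)
print(selected)  # [0, 1]
```

Under this reading, the sample at 40 degrees scores lowest because it lies closer to the other cluster's representative set, which inflates its denominator.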

(Appendix 9)
The method according to Appendix 1, wherein
the similarity between web page features is calculated using a neural network.

(Appendix 10)
The method according to Appendix 9, wherein
the neural network includes a Siamese network.

(Appendix 11)
The method according to Appendix 9, wherein
the neural network is a trained neural network.
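
As a rough illustration of the Siamese idea named in Appendix 10, the sketch below passes both inputs through the same encoder with shared weights and compares the resulting embeddings. The source does not specify an architecture here, so everything in the block is an assumption: the single linear layer with `tanh`, the cosine comparison, and in particular the `WEIGHTS` values, which are arbitrary and untrained, whereas Appendix 11 calls for a trained network.

```python
import math

# Arbitrary, untrained weights shared by both branches (illustration only).
WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
]


def encode(x):
    """Shared encoder: one linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in WEIGHTS]


def siamese_similarity(x1, x2):
    """Encode both inputs with the same weights, then compare the embeddings."""
    e1, e2 = encode(x1), encode(x2)
    dot = sum(a * b for a, b in zip(e1, e2))
    norm = math.hypot(*e1) * math.hypot(*e2)
    return dot / norm if norm else 0.0


p = [1.0, 0.0, 2.0]
q = [0.0, 1.0, 0.5]
same = siamese_similarity(p, p)
cross = siamese_similarity(p, q)
```

Because both branches share one set of weights, the score is symmetric in its two arguments and an input compared with itself scores 1 (up to rounding), which is what makes the output usable as a similarity measure.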

(Appendix 12)
The method according to Appendix 1, wherein
the web page content information to be extracted is a DOM label or relational data.
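
To make the notion of extracting content identified by a DOM label concrete, here is a small sketch using Python's standard `html.parser` module. It reads "DOM label" as an HTML element tag and models an extraction template as a `(tag, class)` pair; both are assumptions, as the source does not define the template format in this passage. The class name `TemplateExtractor` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collects the text inside elements matching a (tag, class) template."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1               # nested element with the same tag
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


page = (
    '<div class="edu">Bachelor degree or above</div>'
    '<div class="duty">Develop web crawlers</div>'
)
extractor = TemplateExtractor("div", "edu")
extractor.feed(page)
print(extractor.chunks)  # ['Bachelor degree or above']
```

In the method described above, the template applied to a page would be the one associated with the cluster its features were assigned to.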

(Appendix 13)
A device for extracting web page content, comprising
at least one processor configured to perform the method according to any one of Appendices 1 to 12.

(Appendix 14)
A computer-readable storage medium storing computer-readable program instructions which,
when executed by a computer, implement the method according to any one of Appendices 1 to 12.

(Appendix 15)
A web page content extraction system, comprising:
a seed sample storage unit for storing seed samples;
a similarity learning unit for learning the similarity between samples;
a similarity calculation unit for calculating the similarity between samples;
a representative point calculation unit for determining representative points by calculation;
a representative point storage unit for storing the representative points;
a classification unit for classifying samples into the corresponding clusters based on the similarity;
an input unit for inputting a web page; and
an information extraction unit for extracting the content of the web page.

The preferred embodiment of the present disclosure has been described above, but the present disclosure is not limited to this embodiment, and any modification that does not depart from the spirit of the present disclosure falls within its technical scope.

Claims

A method of extracting web page content, comprising:
calculating the similarity between a web page feature and the representative set of at least one web page feature cluster, the representative set including samples of web page features that have a relatively high similarity to one another within the corresponding web page feature cluster;
determining the representative set having the highest similarity to the web page feature;
updating, using the web page feature, the web page feature cluster associated with the determined representative set;
recalculating the representative set of the updated web page feature cluster; and
extracting content from a web page based on an extraction template associated with the updated web page feature cluster.

The method of claim 1, wherein
the samples of web page features in the representative set that have a relatively high similarity to one another are obtained from web page content of the same category.

The method of claim 2, wherein
the number of representative sets is equal to the number of categories of web page content to be extracted.

The method of claim 1, wherein
the samples that make up the representative set of the updated web page feature cluster are selected based on a ratio obtained by dividing the sum of the similarities between a web page feature in the updated web page feature cluster and the web page features of the other samples by the sum of its similarities to each representative set.

The method of claim 4, wherein
the representative set of the updated web page feature cluster is formed from a predetermined number of samples having relatively large ratios.

The method of claim 1, wherein
the similarity between web page features is calculated using a neural network.

The method of claim 6, wherein
the neural network includes a Siamese network.

The method of claim 6, wherein
the neural network is a trained neural network.

A device for extracting web page content, comprising
at least one processor,
wherein the processor is configured to perform the method of any one of claims 1 to 8.

A computer-readable storage medium that stores a program for causing a computer to execute the method according to any one of claims 1 to 8.