JP2012018667A

JP2012018667A - Method, system and computer readable record medium for refining web document using text pattern extraction

Info

Publication number: JP2012018667A
Application number: JP2011115092A
Authority: JP
Inventors: Wu-Ju Yi; 祐周李; Yu-Sik Chang; 有植張
Original assignee: NHN Corp
Current assignee: NHN Corp
Priority date: 2010-07-07
Filing date: 2011-05-23
Publication date: 2012-01-26
Anticipated expiration: 2031-05-23
Also published as: KR20120004610A; KR101140263B1; JP5746912B2

Abstract

PROBLEM TO BE SOLVED: To provide a method, a system, and a computer-readable recording medium for refining a web document using text pattern extraction. SOLUTION: A web document refinement method includes: a text pattern extraction stage in which a plurality of extraction target materials are analyzed according to a predetermined criterion to extract a text pattern, and the plurality of extraction target materials are ordered and aligned based on the extracted text pattern; a regular expression extraction stage in which a regular expression is extracted from the ordered and aligned extraction target materials; and a web document refinement stage in which a web document is refined using the extracted regular expression to generate secondary material.

Description

The present invention relates to a method, a system, and a computer-readable recording medium for refining a Web document using text pattern extraction. More particularly, it relates to a method, a system, and a computer-readable recording medium that extract text patterns from Web documents, use the extracted text patterns to derive a regular expression describing the overall pattern, and then apply the derived regular expression to the Web documents to refine them.

With the development and spread of the Internet, a wide variety of Internet-based services are now offered, of which search is a representative example. In a search service, when a user enters a word or combination of words as a query, a search engine returns the search results corresponding to that query (for example, websites or articles containing the query, or images whose file names contain the query).

To provide the content users are looking for, Internet search service providers generally collect websites, articles, and the like in advance using a web crawler or a separately provided input means, extract keywords from the collected material through morphological analysis or similar processing, build an index from those keywords, and store it separately, so that search results can be returned quickly when a user enters a query.

However, if the collected websites and articles receive no further processing and only conventional keyword extraction by morphological analysis and indexing are used, any document containing a keyword that matches the user's query is judged to be a match and is returned unconditionally. Such search results may include content that does not match the user's search intent.

For example, suppose a user who wants articles about the actor “Takuya Kimura” enters “Takuya Kimura” as a query, and the collected articles include one written by a reporter who is also named “Takuya Kimura”, with the reporter's name appearing in the body of the article. Because the article contains a keyword that matches the query, it is included in the search results and shown to the user even though its content is unrelated to what the user wants. Referring to FIG. 5a, when the user enters the query “Takuya Kimura”, an article whose reporter is named “Takuya Kimura” and which has nothing to do with the actor is retrieved and displayed, as shown in portion A marked by the red rectangle.

Furthermore, unlike websites or articles created or edited directly by the search service provider, websites and the like that are created by third parties and then collected by a web crawler do not follow any particular format. To reclassify such search results by a separate criterion, for example by author or by region of origin, someone has to inspect and classify them by hand, which has made it difficult for search service providers to present results sorted by such criteria.

Therefore, to prevent such search errors and provide search results more efficiently, a technique is needed that refines the collected websites and articles appropriately, either deleting portions unrelated to the content or reclassifying the material by a separate criterion and indexing it for sorting. Until now, no way of solving this problem other than manual work has existed.

Korean Patent No. 0435442
Japanese Patent Laid-Open No. 2005-190074
Korean Laid-Open Patent No. 2002-0089677

An object of the present invention is to solve the problems of the prior art described above.

Another object of the present invention is to prevent errors that could otherwise appear in search results, and thereby provide more accurate search results, by appropriately refining collected websites, articles, and the like and deleting portions unrelated to what the user wants to find.

A further object of the present invention is to provide more diverse and accurate search results by refining collected websites, articles, and the like so as to extract content by which they can be reclassified according to a separate criterion such as date, region of coverage, or reporter name, and by indexing that content for searching or sorting.

The characteristic configuration of the present invention for achieving the objects described above and producing the effects described below is as follows.

A Web document refinement method according to an embodiment of the present invention includes: a text pattern extraction stage of analyzing a plurality of extraction target materials according to a predetermined criterion, extracting a text pattern from the plurality of extraction target materials, and ordering the plurality of extraction target materials based on the extracted text pattern; a regular expression extraction stage of extracting a regular expression from the ordered extraction target materials; and a Web document refinement stage of refining a Web document using the extracted regular expression to generate secondary material.

According to another embodiment of the present invention, a Web document refinement system includes: text pattern extraction means for analyzing a plurality of extraction target materials according to a predetermined criterion, extracting a text pattern from the plurality of extraction target materials, and ordering the plurality of extraction target materials based on the extracted text pattern; regular expression extraction means for extracting a regular expression from the ordered extraction target materials; and regular expression application means for refining a Web document using the extracted regular expression to generate secondary material.

According to an embodiment of the present invention, collected websites, articles, and the like can be refined appropriately and portions unrelated to what the user wants to find can be deleted, which prevents errors that could otherwise appear in search results and makes more accurate search results possible.

In addition, according to the present invention, collected websites, articles, and the like can be refined appropriately to extract content by which they can be reclassified according to a separate criterion such as date, region of coverage, or reporter name, and that content can be indexed for searching or sorting, which makes more diverse and accurate search results possible.

A diagram schematically showing the overall configuration of a search result providing system that uses a search database built by refining collected Web documents with a regular expression obtained through text pattern extraction, according to an embodiment of the present invention.
A detailed block diagram of the search result providing system according to an embodiment of the present invention.
A detailed block diagram of the secondary processing unit in the search result providing system according to an embodiment of the present invention.
A graph showing PMI values in the frequency analysis means used in the Web document refinement system according to an embodiment of the present invention.
An exemplary screen showing search results retrieved for a specific query in the prior art.
An exemplary screen showing search results retrieved for a specific query using the Web document refinement system according to an embodiment of the present invention.
An exemplary screen showing results reclassified by region of coverage and provided using the Web document refinement system according to an embodiment of the present invention.
An exemplary screen showing prior-art search results in which the query matched a reporter name.
An exemplary screen showing search results reclassified by reporter name and provided using the Web document refinement system according to an embodiment of the present invention.
A flowchart of the operation of the secondary processing unit of the Web document refinement system according to an embodiment of the present invention.
A screen showing an example of the results of gene analysis using multiple sequence alignment (MSA).

The following detailed description of the present invention refers to the accompanying drawings, which show by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. The various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described here in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. Likewise, the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, properly construed, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals denote the same or similar functions across the several views.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can easily practice it.

In the preferred embodiments of the present invention, the term “Web document” is used in a broad sense that covers every manual or active text format that can be viewed over the World Wide Web using, directly or indirectly, a web browser program such as Internet Explorer. HTML (HyperText Markup Language) is the file format mainly used for Web documents, but Web documents are not limited to it: XML (eXtensible Markup Language), SGML (Standard Generalized Markup Language), and any other text format that can be viewed directly or indirectly with a web browser program (including through a plug-in or another separate program) all qualify as Web documents. To view a Web document with a web browser program, the address at which the document is located is generally entered as a URL, and HTTP (HyperText Transfer Protocol) is commonly used as the address scheme, but this is not a requirement. The content of a Web document is not restricted to a particular form or to ordinary text, and may include images, music, video, or combinations of these. Classified by source, Web documents may include, without limitation, general Web documents, advertisements, dictionaries, blogs, websites, news, clubs, images, specialized information, books, maps, and video. The “primary processed material” and “secondary processed material” obtained by processing Web documents of these various sources and formats likewise have various sources and formats.

In the embodiments of the present invention, the terms “refine” and “refinement” are used in a broad sense that covers any operation that applies predetermined processing to a Web document to derive a modified Web document. Refinement may mean, for example, deleting a specific word, term, or portion from an existing Web document; extracting specific terms or keywords from within an existing Web document and restructuring the database so that documents can be indexed and sorted on them; or selecting among Web documents. It is not necessarily limited to these.
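The two kinds of refinement named above, deleting a portion and extracting a term for indexing, can be illustrated with a minimal Python sketch. The byline format `(region=name reporter)`, the `BYLINE` pattern, and the `refine` function are assumptions made for illustration only; the patent text does not specify a concrete byline format or regular expression.

```python
import re

# Hypothetical byline format "(<region>=<name> reporter)" at the end of an article.
BYLINE = re.compile(r"\((?P<region>[^=()]+)=(?P<reporter>[^()]+?) reporter\)\s*$")

def refine(article: str) -> dict:
    """Return the article body without its byline, plus the extracted byline fields."""
    match = BYLINE.search(article)
    if match is None:
        # No byline found: keep the text unchanged and leave the fields empty.
        return {"body": article, "region": None, "reporter": None}
    return {
        "body": article[:match.start()].rstrip(),   # deletion of the unrelated portion
        "region": match.group("region").strip(),    # field usable as an index key
        "reporter": match.group("reporter").strip(),
    }
```

With this split, a keyword search over `body` no longer matches a reporter's name, while `region` and `reporter` remain available as separate classification keys.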

For the purpose of describing the preferred embodiments, the following description assumes that the content of a Web document is a newspaper article. This does not mean that Web documents are limited to newspaper articles; the present invention is clearly applicable to Web documents with various other kinds of content.

Overall System Configuration
FIG. 1 schematically shows the overall configuration of a search result providing system that uses a search database built by refining collected Web documents with a regular expression obtained through text pattern extraction, according to an embodiment of the present invention.

As shown in FIG. 1, in the overall system according to an embodiment of the present invention, a search result providing system 100 that includes a search database is connected through a network 200 to a plurality of user terminal devices 300 and a plurality of Web document servers 400.

First, according to an embodiment of the present invention, the search result providing system 100 receives a search word, that is, a query, from a user terminal device 300, performs a search based on it with reference to a search database (not shown), and transmits the resulting search results to the user terminal device 300. The search result providing system 100 also generates secondary processed material through a refinement process: it extracts text patterns from the Web documents collected from the plurality of Web document servers 400 by analysis according to a predetermined criterion, locates the portions of the primary processed material that should be deleted or indexed as a separate classification criterion, generates a regular expression for those portions using text pattern extraction, and applies the generated regular expression to the primary processed material.

According to an embodiment of the present invention, the network 200 may be configured regardless of communication mode, wired or wireless, and may be any of various networks such as a personal area network (PAN), a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).

Meanwhile, the user terminal device 300 according to an embodiment of the present invention is an input/output device with functions that let a user connect to the search result providing system 100 through the network 200 in order to receive search results for a given query. Any digital device equipped with memory means and a microprocessor, and thus having computing capability, may serve as the user terminal device 300, including not only desktop computers but also notebook computers, workstations, palmtop computers, personal digital assistants (PDAs), web pads, and mobile communication terminals including smartphones. Preferably, the user terminal device 300 can run a web browser for connecting to the search result providing system 100, entering a query, and receiving search results, but this is not a requirement.

The Web document server 400 according to an embodiment of the present invention may be any web server that holds Web documents collected by the search result providing system 100 through a predetermined method; it is not limited to a physically specific server or to a server handling Web documents of specific content or format. Any web server that the search result providing system 100 can access through the network 200 to collect Web documents may therefore be a Web document server 400. Preferably, the Web documents handled by the Web document server 400 include Web documents containing news articles.

Search Result Providing System
FIG. 2 is a detailed block diagram of the search result providing system 100 according to an embodiment of the present invention.

Referring to FIG. 2, the search result providing system 100 according to an embodiment of the present invention may include a transmission/reception unit 110, a search unit 120, a primary processing unit 130, a secondary processing unit 140, and a search database 150.

The transmission/reception unit 110 receives a query from the user terminal device 300 and passes it to the search unit 120, and transmits the search results obtained by the search unit 120 to the user terminal device 300.

The search unit 120 searches the search database 150, in which the secondary processed material is stored, for information matching the query received from the transmission/reception unit 110. The search results obtained are passed to the transmission/reception unit 110 for transmission to the user terminal device 300. Alternatively, after searching the secondary processed material stored in the search database 150 for information matching the query, the search unit 120 may extract the information for the resulting hits from the primary processed material stored in the search database 150 and pass that to the transmission/reception unit 110.
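The second behaviour described above, matching against the secondary processed material but returning the corresponding primary processed material, might look like the following minimal sketch. The dictionaries `PRIMARY` and `SECONDARY`, the document IDs, and the substring match are all hypothetical stand-ins for the search database 150 and its index.

```python
# Hypothetical stand-ins for the two stores kept in the search database 150.
# Document 1 has had its byline removed in the secondary (refined) copy.
PRIMARY = {
    1: "Stocks rose on Monday. (Tokyo=Takuya Kimura reporter)",
    2: "Actor Takuya Kimura stars in a new drama.",
}
SECONDARY = {
    1: "Stocks rose on Monday.",
    2: "Actor Takuya Kimura stars in a new drama.",
}

def search(query, secondary=SECONDARY, primary=PRIMARY):
    """Match the query against refined text, then return the original documents."""
    hit_ids = [doc_id for doc_id, text in sorted(secondary.items()) if query in text]
    return [primary[doc_id] for doc_id in hit_ids]
```

Here `search("Takuya Kimura")` returns only the article about the actor, because the reporter byline is absent from the refined copy of document 1, while the user still receives unmodified article text for every hit.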

The primary processing unit 130 takes Web documents already collected from the Web document servers 400, extracts keywords from them by morphological analysis or similar processing, performs indexing based on those keywords, and stores the resulting primary processed material in the search database 150. The Web documents may be collected by a known web crawler, and the keyword extraction by morphological analysis and the indexing may likewise be performed by known methods.
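As a rough illustration of keyword extraction followed by indexing, the sketch below builds a simple inverted index. It substitutes a plain word-splitting tokenizer for morphological analysis, which is an assumption made to keep the example self-contained; `tokenize` and `build_index` are hypothetical names, not part of the patent.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Stand-in for morphological analysis: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index
```

An index built this way treats every occurrence of a token equally, which is exactly why a reporter's name in a byline produces a match for a query about an actor of the same name.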

The secondary processing unit 140 takes the primary processed material, for which keyword extraction and indexing by the primary processing unit 130 have been completed, and extracts the portions in which a specific token appears with a frequency at or above a specific value. It then extracts text patterns from those portions by analysis according to a predetermined criterion, orders and aligns the portions according to the extracted patterns, and extracts from the aligned content a regular expression that applies to the pattern as a whole. The secondary processing unit 140 can also refine the primary processed material using the extracted regular expression and store the result in the search database 150 as secondary processed material. The detailed function of each component of the secondary processing unit 140 is described below.
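One simple way to turn aligned samples into a regular expression is sketched below, under the assumption that the alignment step has already produced token rows of equal length. The column rule used here (identical tokens become a literal, differing tokens become a capture group) is an illustrative simplification and not necessarily the procedure the patent uses; `derive_regex` is a hypothetical name.

```python
import re

def derive_regex(aligned_rows):
    """Generalise column-aligned token rows of equal length into one regular expression."""
    parts = []
    for column in zip(*aligned_rows):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # constant column: keep as a literal
        else:
            parts.append(r"(\S+)")              # varying column: capture any non-space run
    # Allow optional whitespace between adjacent columns.
    return re.compile(r"\s*".join(parts))

rows = [
    ["(", "Tokyo", "=", "Kimura", "reporter", ")"],
    ["(", "Osaka", "=", "Sato", "reporter", ")"],
]
byline_pattern = derive_regex(rows)
```

The resulting pattern also matches bylines that were not in the samples, such as `(Nagoya=Suzuki reporter)`. Because varying columns are captured with `\S+`, this sketch handles only single-token regions and names.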

The search database 150 is a collective name for the storage that holds the Web documents already collected from the Web document servers 400, the primary processed material produced by the primary processing unit, the secondary processed material produced by the secondary processing unit, the search results retrieved by the search unit 120, and so on. For simplicity, FIG. 2 shows only a single search database 150, but it will be clear to those of ordinary skill in the art that in other embodiments these data may be stored in one or more physically separate databases. The primary and secondary processed material may exist separately while remaining linked to each other where their content is the same, so that the search unit 120 may search the secondary processed material stored in the search database 150 and then provide the search results from the corresponding primary processed material.

The transmission/reception unit 110, search unit 120, primary processing unit 130, and secondary processing unit 140 of FIG. 2 may be physically implemented in a single piece of hardware, some or each of them may be implemented in physically separate hardware, and multiple pieces of hardware performing the same function may exist in parallel. It will be clear to those of ordinary skill in the art that the invention is not limited by the physical number or location of the hardware or databases in which its components are provided, and that the design may be varied in many ways.

Secondary Processing Unit
The secondary processing unit 140 in the search result providing system 100 according to an embodiment of the present invention will now be described in more detail with reference to FIG. 3. The secondary processing unit 140 may include frequency analysis means 141, text pattern extraction means 142, regular expression extraction means 143, and regular expression application means 144.

Here, the frequency analysis means 141 according to an embodiment of the present invention takes the primary processed material, for which keyword extraction and indexing by the primary processing unit 130 have been completed and which is stored in the search database 150 or a separate database, and analyzes in which portions of that material a specific token (the unit of lexical analysis) appears with a frequency at or above a specific value. As a preferred embodiment of this criterion, Equation 1 below can be used to obtain the course of the PMI (Pointwise Mutual Information) value of a given token in a specific class, after which the portions where the PMI value is at or above the specific value can be analyzed.

Here, P(W) denotes the overall frequency of a specific token, and P(W|C) denotes the frequency of that token within the class. For example, if the class is news articles, a token that appears more often in news articles than in Web documents as a whole (for example, “newspaper” or “reporter”) is likely to have a relatively higher PMI value than other tokens.
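Equation 1 itself is not reproduced in this text. Given the two quantities defined above, a plausible reading is the standard pointwise mutual information form, PMI(W, C) = log(P(W|C) / P(W)); this form, the base-2 logarithm, and the function name `pmi` in the Python sketch below are assumptions rather than a quotation of the patent's equation.

```python
import math

def pmi(token, class_tokens, all_tokens):
    """Assumed form: log2 of the in-class relative frequency over the overall relative frequency."""
    p_w = all_tokens.count(token) / len(all_tokens)              # P(W)
    p_w_given_c = class_tokens.count(token) / len(class_tokens)  # P(W|C)
    if p_w == 0 or p_w_given_c == 0:
        return float("-inf")  # token never seen: treat as maximally unassociated
    return math.log2(p_w_given_c / p_w)

# Toy corpus: the first four tokens form the "news article" class.
news_tokens = ["reporter", "said", "reporter", "news"]
all_tokens = news_tokens + ["cat", "dog", "said", "blog"]
```

In this toy corpus “reporter” is twice as frequent inside the class as overall, so its PMI is positive, whereas “said” is equally frequent in both and its PMI is zero.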

As a more specific example, when the class is news articles and a pattern corresponding to the reporter name is to be extracted, the PMI values for tokens that can appear near the reporter name, namely the term "reporter" and the symbol ")" that can be inserted after the reporter name, can be obtained as shown in FIG. 4. In the graph of FIG. 4, the horizontal axis represents the position within the news article, the vertical axis represents the PMI value for the term "reporter" and the symbol ")", and the parts where the PMI value is equal to or greater than a specific value are hatched. That is, in a news article, which is the primary processed material, the parts where the PMI value for the term "reporter" and the symbol ")" is equal to or greater than the specific value are the hatched parts: from the beginning of the article to the position marked A, and from the position marked B to the end of the article. Accordingly, the text pattern extraction means 142 can extract a text pattern from the hatched parts of FIG. 4 (hereinafter, "extraction target portions"). The role of the frequency analysis means 141 is to select, through this analysis, the extraction target portions of the primary processed material that are needed for extracting the text pattern. Because the text pattern extraction means 142 can then work on specific parts only rather than on the entire primary processed material, the load on the text pattern extraction means 142 can be greatly reduced.
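The selection performed by the frequency analysis means 141 can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how position within an article is discretized, so the fixed number of relative-position bins, the function names `positional_pmi` and `select_bins`, and the use of the natural logarithm are all assumptions.

```python
import math
from collections import Counter


def positional_pmi(class_docs, background_docs, token, bins=10):
    """Return the PMI of `token` for each relative-position bin of the class.

    Documents are lists of tokens. Binning by relative position is an
    assumption made for this sketch.
    """
    # P(W): overall frequency of the token across the background corpus.
    total = sum(len(doc) for doc in background_docs)
    p_w = sum(doc.count(token) for doc in background_docs) / total

    hits, sizes = Counter(), Counter()
    for doc in class_docs:
        for i, tok in enumerate(doc):
            b = i * bins // len(doc)  # relative position -> bin index
            sizes[b] += 1
            if tok == token:
                hits[b] += 1

    pmi = []
    for b in range(bins):
        # P(W|C) restricted to this position bin of the class documents.
        p_w_c = hits[b] / sizes[b] if sizes[b] else 0.0
        # PMI is undefined for zero probabilities; report -inf there.
        if p_w_c > 0 and p_w > 0:
            pmi.append(math.log(p_w_c / p_w))
        else:
            pmi.append(float("-inf"))
    return pmi


def select_bins(pmi, threshold):
    """Indices of the bins whose PMI is at or above the threshold."""
    return [b for b, value in enumerate(pmi) if value >= threshold]
```

The bins returned by `select_bins` play the role of the hatched regions in FIG. 4: only the text falling in those bins would be handed to the text pattern extraction means 142.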

Next, the text pattern extraction means 142 according to an embodiment of the present invention analyzes, according to a predetermined criterion, the extraction target portions of the primary processed material selected by the frequency analysis means 141 as the parts where the frequency of the specific token is equal to or greater than the specific value, extracts a text pattern, and on that basis arranges and aligns the extraction target portions one-dimensionally. The function of the text pattern extraction means 142 can borrow and adapt, as its motif, Multiple Sequence Alignment (MSA), the core technique for extracting the common sequences needed to search a genetic map using DNA. A genetic map shows which genes lie at which positions in a chromosome, in which the four letters A, T, G, and C are arranged in a one-dimensional sequence. With MSA, a plurality of DNA samples sharing a common trait (for example, people with blue eyes) are lined up, a common sequence is extracted from the aligned samples, and the gene for the common trait (blue eyes) is extracted from it. As illustrated in FIG. 8, which shows one example of a gene analysis result obtained with MSA, a common feature can be found by aligning a plurality of DNA sequences. As a specific example of applying the MSA technique used for genetic map searching, assume that the extraction target portions selected by the frequency analysis means 141 as the parts where the PMI value for the term "reporter" and the symbol ")" is equal to or greater than the specific value are as shown in Table 1 below.

The text pattern extraction means 142 analyzes the plurality of extraction target portions in Table 1 according to a predetermined criterion, extracts a text pattern, and on that basis arranges and aligns the plurality of extraction target portions one-dimensionally. More specifically, the predetermined criterion used by the text pattern extraction means 142 may be morphological analysis; alternatively, it may be the Chart type, Word type, or Byte type, a Word Type such as Korean, English, or date, an abstracted concept such as city, time, or animal, a word unit, or a chunk or page. The criterion is not limited to these, and several of the listed criteria may be combined. The analysis may also refer to a separate database in which specific words or terms are stored. For example, if "Tokyo", "Osaka", "Yokohama", and "Vancouver" in Table 1 are all words corresponding to "region" and are stored in advance in a separate database, then even though the word "Vancouver" differs from the other words in its number of characters, it can be determined by referring to the database that it is likewise a word corresponding to "region", and it can be understood that the extraction target portions in Table 1 all share a text pattern in which a word corresponding to "region" follows the symbol "(". As another example, even when a word is a reporter name and is not stored in the separate database, it can be analyzed as a proper noun denoting someone's name.

In addition, because it is not practically possible to store every word in the separate database, the contents of the database may be added to, changed, or deleted using a learning method (heuristic). For example, suppose that only "Tokyo", "Osaka", and "Yokohama" in Table 1 are stored in advance in the separate database as words corresponding to "region". In all three of those sentences among the extraction target portions of Table 1, characters corresponding to a region follow the symbol "(" and are in turn followed by the symbol "=". From this common form it is inferred that the characters enclosed between the symbols "(" and "=" are a region name. "Vancouver", which occupies that position in the fourth sentence, is therefore recognized as a region name even though it is not stored in the database, and the term "Vancouver" can be added to the database. "Vancouver" can then be used in analyzing new extraction target portions in the future.
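The heuristic database update described here can be sketched as follows. Table 1 is not reproduced in this text, so the sample strings used with this function are hypothetical, as are the function name `learn_regions` and the `min_support` parameter (the number of already-known regions that must occupy the slot before a new word is accepted).

```python
import re


def learn_regions(portions, known_regions, min_support=3):
    """Learn new region names from the slot between "(" and "=".

    If at least `min_support` portions have a known region in that slot,
    every unknown word found in the same slot is added to `known_regions`
    (which is updated in place). Returns the set of newly learned words.
    """
    slot = re.compile(r"\(([^=()]+)=")
    candidates = [m.group(1) for p in portions for m in [slot.search(p)] if m]

    support = sum(1 for c in candidates if c in known_regions)
    learned = set()
    if support >= min_support:
        learned = {c for c in candidates if c not in known_regions}
        known_regions |= learned
    return learned
```

With three portions whose slot holds "Tokyo", "Osaka", and "Yokohama" and a fourth whose slot holds "Vancouver", the function would return `{"Vancouver"}` and add it to the database. A production system would likely want a stricter acceptance rule, since a single slot pattern can also capture noise.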

In the case of newspaper articles, the format (for example, whether the reporter name is written before the article body, after the title, or at the end of the article) may differ from period to period. Therefore, when analyzing according to the predetermined criterion, extracting the text pattern, and arranging and aligning one-dimensionally, various restrictions may be applied to the extraction target portions, such as limiting their period by date or by a number of weeks; the restriction used is not limited to period alone.

In this way, the result of the text pattern extraction means 142 analyzing the extraction target portions of Table 1 according to the predetermined criterion, extracting the pattern, and arranging and aligning the extraction target portions one-dimensionally based on the extracted pattern can be shown as in Table 2 below.

The one-dimensionally aligned result in Table 2 shows that the positions of symbols such as parentheses and equal signs near the reporter name, the reporting region, and the part corresponding to the reporter name have all been aligned according to the text pattern. The word "TBC" appears in only some articles, so it has no counterpart in the other articles and is aligned in a column of its own. The result shown in Table 2 can be seen to resemble the MSA result used for the genetic map illustrated in FIG. 8.
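The alignment behaviour described for Table 2, where an optional token such as "TBC" ends up in a column of its own, can be illustrated with a pairwise global alignment over typed tokens. This is a sketch under stated assumptions: the text names MSA only as a motif and does not specify the algorithm, so the Needleman-Wunsch scoring used here, the scores themselves, and the token type labels `REGION` and `NAME` are all hypothetical. A full MSA would combine many such pairwise alignments.

```python
GAP = "-"


def align_pair(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token sequences (Needleman-Wunsch).

    Returns the two sequences padded with GAP so that they have equal length.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score table.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]
```

Aligning a byline that contains "TBC" against one that does not leaves a gap opposite "TBC" in the shorter sequence, which is the separate column described for Table 2.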

Next, the regular expression extraction means 143 extracts a regular expression, that is, a generalized expression of the content arranged and aligned one-dimensionally by the text pattern extraction means 142 on the basis of the text pattern. Continuing the example, a regular expression can be extracted from the content of Table 2 as shown in Table 3 below.

The regular expression disclosed in Table 3 reads as follows. First comes the symbol "(", followed by a word of two to four kanji or Japanese characters (written <kanji, Japanese>{2,4}), then the symbol "=", followed by a word of four kanji (written <kanji>{4}). A word such as "TBC" may come next; because it appears optionally, that is, in only some articles, it may be written "(TBC)?" to indicate that the word may be present at that position. Then come the word "reporter" and the symbol ")". What follows contains nothing that can be further normalized and is unrelated to the reporter name pattern to be extracted, so it can be written ".*". In addition, by referring to the separate database of the text pattern extraction means 142, it may be determined that the word of two to four kanji or Japanese characters after the symbol "(" denotes the reporting region and that the word of four kanji after the symbol "=" denotes the reporter name, and this information may be included in the regular expression. Of course, the regular expression of Table 3 is extracted only from the one-dimensionally aligned example in Table 2; the regular expression can vary freely with the content aligned according to the text pattern, and it should be understood that the notation of the regular expression is not limited to that of Table 3.
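The pattern described for Table 3 can be approximated in Python's `re` syntax as shown here. Table 3 is not reproduced in this text, so this is a reading of the prose rather than the patent's own expression. The Unicode ranges chosen for "kanji" and "Japanese" (kana), the acceptance of both ASCII and full-width punctuation, the use of 記者 for the word "reporter", and the sample byline in the usage note are all assumptions.

```python
import re

# Assumed character classes: CJK Unified Ideographs for kanji,
# the hiragana and katakana blocks for "Japanese".
KANJI = r"\u4e00-\u9fff"
KANA = r"\u3040-\u30ff"

BYLINE = re.compile(
    rf"[(（](?P<region>[{KANJI}{KANA}]{{2,4}})"  # "(" then 2-4 kanji/kana
    rf"[=＝](?P<reporter>[{KANJI}]{{4}})"        # "=" then exactly 4 kanji
    rf"(?:TBC)?"                                 # optional "TBC"
    rf"記者[)）]"                                # "reporter" then ")"
    rf".*"                                       # remainder, not normalized
)


def parse_byline(text):
    """Return (region, reporter) for the first byline found, else None."""
    m = BYLINE.search(text)
    return (m.group("region"), m.group("reporter")) if m else None
```

For a hypothetical byline such as `（鹿児島＝田中俊之記者）`, `parse_byline` would return the region and the reporter name as separate fields. The named groups carry the region and reporter interpretation that the text says may be included in the regular expression.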

Finally, the regular expression application means 144 according to an embodiment of the present invention generates secondary processed material from the result of refining the primary processed material by applying the regular expression extracted by the regular expression extraction means 143. As one example of refinement, the regular expression application means 144 may delete the reporter name from the primary processed material. When the regular expression of Table 3 is used, the regular expression application means 144 has determined that the four-kanji word immediately after the first symbol "=" in the primary processed material is the reporter name; it therefore recognizes the four kanji at that position as the reporter name, deletes them, and may store the result as secondary processed material in the search database 150 or a separate database. Note that the target to which the regular expression application means 144 applies the regular expression is the primary processed material itself, for which keyword extraction and indexing by the primary processing unit 130 have been completed, and it therefore differs from the extraction target portions selected by the frequency analysis means 141. In the embodiment in which the reporter name has been deleted in this way, when a user enters the query "Takuya Kimura", the search unit 120 searches the secondary processed material, from which reporter names have been deleted, and derives the search result. As shown in FIG. 5b, only the relevant articles containing the query in the title or article body are provided, unlike the prior-art search result shown in FIG. 5a.
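Deleting the reporter name as described can be sketched with a substitution that keeps the text around the name. The pattern is an assumed Python rendering of the byline form described for Table 3 (kanji and kana ranges, ASCII or full-width punctuation, 記者 for "reporter"); the function name and the sample article are hypothetical.

```python
import re

# Group 1: "(" + region + "="; group 2: four-kanji reporter name;
# group 3: optional "TBC" + "reporter" + ")".
BYLINE_PARTS = re.compile(
    r"([(（][\u4e00-\u9fff\u3040-\u30ff]{2,4}[=＝])"
    r"([\u4e00-\u9fff]{4})"
    r"((?:TBC)?記者[)）])"
)


def remove_reporter_name(text):
    """Drop the reporter name from the first byline, keeping the rest."""
    return BYLINE_PARTS.sub(r"\1\3", text, count=1)
```

Because the name no longer appears in the stored text, a query for a person who shares a reporter's name would match only articles that mention that person in the title or body, which is the effect described for FIG. 5b.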

As another example of applying the regular expression, the regular expression application means 144 may reclassify the primary processed material by reporter name or reporting region and generate secondary processed material in which that classification is set as an index. When the regular expression of Table 3 is used, the regular expression application means 144 locates the first symbol "(" in the primary processed material and confirms that it is followed by a reporting region of two to four kanji or Japanese characters and then by the symbol "=". It may then reclassify the material by the kanji or Japanese characters corresponding to the reporting region and generate secondary processed material with the region set as an index, or it may recognize the four kanji after the symbol "=" as the reporter name, reclassify by it, and generate secondary processed material with the reporter name set as an index. When secondary processed material indexed by reporting region has been generated, the search unit 120 can, in response to a user selection or query, search the secondary processed material for articles written from a specific reporting region, for example "Kagoshima", and derive the search result. As shown in FIG. 6a, only articles whose reporting region is "Kagoshima" are provided, regardless of the article title or body. Likewise, when secondary processed material indexed by reporter name has been generated, the search unit 120 searches the secondary processed material for articles written by a specific reporter, for example "Toshiyuki Tanaka", and derives the search result. As shown in FIG. 6c, only articles whose reporter is "Toshiyuki Tanaka" are provided, regardless of the article title or body. This differs from the prior-art search result shown in FIG. 6b, in which an article whose body or title contains the name of a different person with the same name as a keyword is included in the search result even though its reporter is someone else.
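Reclassifying by reporting region and reporter name can be sketched as building two lookup tables keyed on the fields captured by the byline pattern. The pattern is an assumed Python rendering of the form described for Table 3, and the document IDs, sample articles, and the placeholder name 山田太郎 in the usage note are hypothetical.

```python
import re
from collections import defaultdict

BYLINE = re.compile(
    r"[(（](?P<region>[\u4e00-\u9fff\u3040-\u30ff]{2,4})"
    r"[=＝](?P<reporter>[\u4e00-\u9fff]{4})"
    r"(?:TBC)?記者[)）]"
)


def build_indexes(articles):
    """Map region -> doc IDs and reporter -> doc IDs from article bylines.

    `articles` maps a document ID to its text; articles without a
    recognizable byline are left out of both indexes.
    """
    by_region, by_reporter = defaultdict(list), defaultdict(list)
    for doc_id, text in articles.items():
        m = BYLINE.search(text)
        if m:
            by_region[m.group("region")].append(doc_id)
            by_reporter[m.group("reporter")].append(doc_id)
    return dict(by_region), dict(by_reporter)
```

An article that merely mentions 田中俊之 (Toshiyuki Tanaka) in its body but carries the byline of 山田太郎 would be indexed under 山田太郎 only, so a lookup by reporter name would not return it. This is the distinction drawn between FIG. 6c and the prior-art result of FIG. 6b.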

The two examples of regular expression application above merely illustrate specific cases in which the regular expression application means 144 applies a regular expression. Based on standard techniques in the technical field to which the present invention belongs and the common technical knowledge of those skilled in the art, these examples may be used alone or in combination, or modified to perform the same or similar functions, in various ways.

FIG. 7 is a flowchart of the operation of the secondary processing unit according to an embodiment of the present invention.

According to the embodiment illustrated in FIG. 7, the frequency analysis means 141 of the secondary processing unit obtains the frequency at which specific tokens appear in the plurality of Web documents to be refined, for example news articles; the specific tokens include, for example, terms or symbols that can appear near the reporter name, such as "reporter" and ")" (S100). Here, the frequency may be obtained using the PMI value described above.

Next, the frequency analysis means 141 selects the parts where the obtained frequency is equal to or greater than a predetermined value as the extraction target material (S110).

Thereafter, the text pattern extraction means 142 of the secondary processing unit analyzes the selected extraction target material according to a predetermined criterion, such as morphological analysis, extracts its text pattern, and on that basis arranges and aligns the plurality of extraction target portions (S120). When the extraction target material is analyzed according to the predetermined criterion, a separate database in which specific words or terms are stored may be referred to, and this separate database may be changed and updated by a learning method (heuristic) that reflects the analysis result of the extraction target material.

Thereafter, the regular expression extraction means 143 of the secondary processing unit extracts a regular expression from the plurality of arranged and aligned extraction target materials (S130).

Thereafter, the regular expression application means 144 of the secondary processing unit uses the extracted regular expression to refine the plurality of Web documents, for example by deleting from them a specific term or keyword expressed in the regular expression, or by reclassifying the Web documents on that basis and setting the result as an index, and thereby generates secondary material (S140). When a query is received from a user, the secondary material generated in this way may be searched in place of the primary material.
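The flow of steps S100 to S140 can be summarized as a small orchestration function. This is only a structural sketch: each stage is passed in as a callable, and the parameter names are hypothetical stand-ins for the four means 141 to 144 rather than anything defined in the text.

```python
def secondary_processing(web_docs, select_portions, align, extract_regex, apply_regex):
    """Run the S100-S140 flow over a list of Web documents.

    Each stage is supplied by the caller:
      select_portions(docs)   -> extraction target portions   (S100, S110)
      align(portions)         -> aligned portions             (S120)
      extract_regex(aligned)  -> a regular expression         (S130)
      apply_regex(regex, doc) -> one refined document         (S140)
    """
    portions = select_portions(web_docs)
    aligned = align(portions)
    pattern = extract_regex(aligned)
    # The expression is applied to the full documents, not to the portions.
    return [apply_regex(pattern, doc) for doc in web_docs]
```

Keeping the stages as separate callables mirrors the division into means 141 to 144 and makes it explicit that the final step operates on the whole documents rather than on the extraction target portions.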

Embodiments according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

As described above, the present invention has been explained by way of specific matters such as particular components, together with limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention; the present invention is not limited to the embodiments above, and a person having ordinary knowledge in the technical field to which the present invention belongs can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above; not only the claims but also everything that is equal or equivalent to the claims falls within the scope of the spirit of the present invention.

Claims

1. A Web document refinement method comprising:
a text pattern extraction step of analyzing a plurality of extraction target materials according to a predetermined criterion, extracting a text pattern thereof, and arranging and aligning the plurality of extraction target materials based on the extracted text pattern;
a regular expression extraction step of extracting a regular expression from the plurality of arranged and aligned extraction target materials; and
a Web document refinement step of refining a Web document using the extracted regular expression to generate secondary material.

2. The Web document refinement method according to claim 1, further comprising, before the text pattern extraction step, a frequency analysis step of obtaining a frequency at which a specific token appears in the Web document and selecting a part in which the frequency is equal to or higher than a predetermined value as the extraction target material.

3. The Web document refinement method according to claim 2, wherein a PMI value is used as the frequency.

4. The Web document refinement method according to claim 2, wherein the specific token includes a term or a symbol located in the vicinity of a reporter name in the Web document.

5. The Web document refinement method according to claim 1, wherein the predetermined criterion includes morphological analysis.

6. The Web document refinement method according to any one of claims 1 to 5, wherein, in the text pattern extraction step, the extraction target material is analyzed with reference to a database in which specific words or terms are stored.

7. The Web document refinement method according to claim 6, wherein contents of the database are changed to reflect an analysis result of the extraction target material.

8. The Web document refinement method according to claim 1, wherein the refinement of the Web document includes deleting from the Web document a specific term or keyword expressed in the regular expression, or reclassifying the Web document based on the specific term or keyword and setting it as an index.

9. The Web document refinement method according to any one of claims 1 to 8, wherein, after the Web document refinement step, a query is received from a user terminal device and a search based on the query is performed on the secondary material.

10. A Web document refinement system comprising:
a text pattern extraction means for analyzing a plurality of extraction target materials according to a predetermined criterion, extracting a text pattern thereof, and arranging and aligning the plurality of extraction target materials based on the extracted text pattern;
a regular expression extraction means for extracting a regular expression from the plurality of arranged and aligned extraction target materials; and
a regular expression application means for refining a Web document using the extracted regular expression to generate secondary material.

11. The Web document refinement system according to claim 10, further comprising a frequency analysis means that obtains a frequency at which a specific token appears in the Web document and selects a part in which the frequency is equal to or higher than a specific value as the extraction target material.

12. The Web document refinement system according to claim 10 or 11, wherein the text pattern extraction means analyzes the extraction target material with reference to a database in which specific words or terms are stored.

13. The Web document refinement system according to claim 12, wherein contents of the database are changed to reflect an analysis result of the extraction target material.

14. The Web document refinement system according to any one of claims 10 to 13, wherein the refinement of the Web document includes deleting from the Web document a specific term or keyword expressed in the regular expression, or reclassifying the Web document based on the specific term or keyword and setting it as an index.

15. The Web document refinement system according to any one of claims 10 to 14, further comprising a search unit that searches the secondary material using a query received from a user terminal device.

16. A computer-readable recording medium having recorded thereon a program for causing a computer to perform each step of the Web document refinement method according to any one of claims 1 to 9.