JP2018205978A

JP2018205978A - Information extracting device and information extracting method

Info

Publication number: JP2018205978A
Application number: JP2017109404A
Authority: JP
Inventors: 惇允萩原; Atsuhiro Hagiwara; 幸輝島田; Koki Shimada
Original assignee: Object Of Null Inc
Current assignee: Object Of Null Inc
Priority date: 2017-06-01
Filing date: 2017-06-01
Publication date: 2018-12-27
Anticipated expiration: 2037-06-01
Also published as: JP7040745B2

Abstract

To improve precision when information is extracted from a web page. SOLUTION: An information extracting device 1 includes: a content obtaining unit 131 that obtains content of multiple websites; an image creating unit 132 that creates a screenshot image of the content obtained by the content obtaining unit 131 as displayed on a screen; and a specifying unit 133 that specifies information to be extracted contained in the screenshot image by using the screenshot image as input data to a deep learning model created by deep learning on the basis of multiple learning image contents that contain learning information. SELECTED DRAWING: Figure 3

Description

The present invention relates to an information extraction device and an information extraction method for extracting information from a web page.

Methods of extracting information by crawling websites are conventionally known. Patent Document 1 discloses a search engine that extracts useful data contained in web pages. A conventional search engine extracts information based on the text in a web page. For example, it extracts date information on the condition that text indicating a date, such as “年” (year), “月” (month), or “日” (day), is present.

Patent Document 1: Published Japanese Translation of PCT Application No. 2014-522030

However, information extracted on the basis of text is often not the desired information. For example, when extracting information about sports or leisure events from a web page, a search engine may extract the date “April 1, 2017” as the date of the event, even though that date is actually the deadline for applying to participate rather than the date on which the event is held. Extraction based on text alone can therefore yield incorrect information.
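The ambiguity described above can be illustrated with a minimal sketch of text-only date extraction. The pattern, the function name `extract_dates`, and the sample page text are assumptions made for illustration; they are not taken from Patent Document 1.

```python
import re

# Matches Japanese-style dates such as "2017年4月1日" (year, month, and day markers).
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_dates(text: str) -> list:
    """Return every (year, month, day) found in the text, in order of appearance."""
    return [tuple(int(part) for part in m.groups()) for m in DATE_PATTERN.finditer(text)]


# Hypothetical page text: the first date is the application deadline
# ("参加申し込み締め切り"), the second is the event date ("開催日").
page_text = "参加申し込み締め切り: 2017年4月1日 開催日: 2017年4月15日"
dates = extract_dates(page_text)
print(dates)
```

Both dates match the pattern equally well, so a rule that relies only on the presence of date markers has no basis for deciding which one is the event date.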

The present invention has been made in view of these points, and its object is to provide an information extraction device and an information extraction method that can improve accuracy when extracting information from a web page.

An information extraction device according to a first aspect of the present invention includes: a content acquisition unit that acquires content of a plurality of websites; an image creation unit that creates a screenshot image of the content acquired by the content acquisition unit as displayed on a screen; and a specifying unit that specifies extraction target information contained in the screenshot image by using the screenshot image as input data to a deep learning model created by deep learning based on a plurality of learning image contents that include learning information.

The deep learning model may be created by deep learning that uses first position information indicating the position at which the learning information is contained in the learning image content. In that case, the specifying unit may specify the extraction target information by using, as input data to the deep learning model associated with the first position information, second position information indicating the position of an image region that contains a character string used in the extraction target information.

The specifying unit may create a character image based on a predetermined character string included in the content acquired by the content acquisition unit, and may specify the position of the image region by identifying a region of the screenshot image whose degree of correlation with the character image is equal to or greater than a threshold. Based on the position of the image region, the specifying unit may specify, as the extraction target information, event information that includes at least one of the date and time, the place, and the content of an event.

The specifying unit may also specify the extraction target information by using two or more of the text included in the content acquired by the content acquisition unit, the screenshot image, and the second position information as input data to the deep learning model. In this case, when the accuracy of specifying the extraction target information using the text and the screenshot image as input data to the deep learning model is less than a threshold, the specifying unit may additionally use the second position information as input data to the deep learning model.

Among a plurality of input data items to the deep learning model, when the accuracy of specifying the extraction target information using the text as a first number of input data items is less than the threshold, the specifying unit may specify the extraction target information using the screenshot image as a second number of input data items.

The specifying unit may also receive a designation of the type of extraction target information to be specified, and specify the extraction target information using the deep learning model that corresponds to the designated type.
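Selecting a model according to a designated type can be sketched as a simple lookup. This is only an illustration under assumptions: the class `ModelRegistry`, the type name "soccer", and the stand-in model function are hypothetical, and a real deep learning model would replace the lambda.

```python
from typing import Callable, Dict

# A model is represented here as any callable that maps page input to event information.
Model = Callable[[str], Dict[str, str]]


class ModelRegistry:
    """Holds one model per type of extraction target information."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, info_type: str, model: Model) -> None:
        self._models[info_type] = model

    def specify(self, info_type: str, page_input: str) -> Dict[str, str]:
        # Use the model that corresponds to the designated type.
        if info_type not in self._models:
            raise ValueError(f"no model registered for type {info_type!r}")
        return self._models[info_type](page_input)


registry = ModelRegistry()
# Hypothetical stand-in for a model trained on soccer match pages.
registry.register("soccer", lambda page: {"content": "soccer match", "source": page})
print(registry.specify("soccer", "page-123"))
```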

The information extraction device may further include an advertisement providing unit that provides an advertisement associated with the extraction target information specified by the specifying unit.

An information extraction method according to a second aspect of the present invention is executed by a computer and includes the steps of: acquiring content of a plurality of websites; creating a screenshot image of the acquired content as displayed on a screen; and specifying extraction target information contained in the screenshot image by using the screenshot image as input data to a deep learning model created by deep learning based on a plurality of learning image contents that include learning information.

According to the present invention, accuracy can be improved when information is extracted from a web page.

A diagram for explaining an outline of the information extraction device of the first embodiment.
A diagram for explaining an outline of the method by which the information extraction device extracts event information.
A diagram showing the configuration of the information extraction device.
A diagram showing an example of the content acquired by the content acquisition unit.
A flowchart showing the procedure by which the specifying unit specifies event information based on position information.
A diagram showing an example of the database in which event information is registered.
An operation flowchart of the information extraction device.
A diagram showing the configuration of the information extraction device of the second embodiment.
A diagram showing advertisement information displayed together with content.
An operation flowchart of the information extraction device of the second embodiment.

<First Embodiment>
[Outline of Information Extraction Device 1]
FIG. 1 is a diagram for explaining an outline of the information extraction device 1 of the first embodiment. The information extraction device 1 is a computer that extracts desired information from content included in web pages provided by a plurality of servers 2 accessible via the Internet N. The information extraction device 1 extracts various kinds of preset extraction target information from the content of a web page and registers the extracted information in the database 3. This embodiment illustrates the case where the information extraction device 1 extracts event information about various events as the extraction target information and registers it in the database 3, but the extraction target information is not limited to event information.

An event is an occasion held on a specific day or during a specific period, such as a sports match, a festival, an exhibition, or a bargain sale. Event information is information that includes at least one of the date or period of the event, the location of the event, and the content of the event.

The event information registered in the database 3 can be used by various applications. For example, a car navigation system mounted on a vehicle acquires the event information registered in the database 3 and, based on it, extracts events being held within a predetermined range of the vehicle's current position or of the route to its destination. When the car navigation system displays information about the extracted events, the occupants of the vehicle can recognize that an event is being held nearby.

[Outline of Event Information Extraction Method]
FIG. 2 is a diagram for explaining an outline of the method by which the information extraction device 1 extracts event information. The information extraction device 1 can extract event information with high accuracy by using various data obtained from the web content of a web page as input data to a deep learning model created in advance. The deep learning model is a model composed of a neural network whose coefficients are determined by learning the relationship between input variables and output variables from a large amount of teacher data used as learning information.

As input data to the deep learning model, the information extraction device 1 can use a combination of text, images, and position information indicating the coordinates of a predetermined image within the web page. When using the text of a web page as input data, the information extraction device 1 extracts a plurality of pre-registered texts from the source code of the web page and inputs them to the deep learning model. Because the model operates on the combination of the input texts, the information extraction device 1 can correctly extract event information with high probability.

Instead of the text, or together with it, the information extraction device 1 can also use as input data a screenshot image obtained by rendering the source code of the web page. A screenshot image is an image of the web page as displayed on a computer screen. The image deep learning model that accepts screenshot images as input data is created using a large number of screenshot images for learning. By using the screenshot image as input data to the image deep learning model, the information extraction device 1 can extract the information that a user viewing the web page would recognize as event information, which further increases the probability of extracting event information correctly.

For example, when the accuracy of extracting event information from text alone is considered to be low, the information extraction device 1 may use a screenshot image together with the text as input data to the deep learning model. By combining the two, with the text used as one part of the input data and the screenshot image used as another part, the information extraction device 1 can further increase the probability of extracting event information correctly.

The information extraction device 1 can further increase the probability of extracting event information correctly by specifying the position of a predetermined text in the screenshot image and using the coordinates of that position as input data to the deep learning model. The method by which the information extraction device 1 specifies the position of a predetermined text in the screenshot image is described in detail later.

[Method of Creating the Deep Learning Model]
The deep learning model can be created using various known methods. To create a deep learning model that accepts text as input data, the text contained in a large number of web pages (for example, one million) is used as teacher data. The creator of the deep learning model views each web page used for learning and identifies the event information that can be understood from it. By associating the identified learning event information with the text extracted from the source code of the web page used as teacher data, a text deep learning model that takes text as input data can be created. The event information identified by the creator may be identical to text contained in the web page or may differ from it.

Similarly, to create a deep learning model that accepts screenshot images as input data, screenshot images of a large number of web pages are used as teacher data. The learning event information that the creator of the deep learning model identifies by viewing each web page is associated with the screenshot image obtained by rendering the source code of that web page, which yields an image deep learning model that takes screenshot images as input data.

To create a deep learning model that accepts position information as input data, the positions of texts contained in a large number of web pages are used as teacher data. The learning event information that the creator of the deep learning model identifies by viewing each web page is associated with the position information of the texts contained in that web page, which yields a position deep learning model that takes position information as input data.

By periodically retraining with new web pages for learning, the creator of the deep learning model can keep the model in line with recent trends in web page layout.

The creator of the deep learning model can also create a separate deep learning model for each type of information to be extracted from web pages. For example, a model created from teacher data that contains event information about soccer matches is more likely to extract event information about soccer matches correctly. By using a deep learning model selected according to the type of information to be extracted, the information extraction device 1 can increase the probability of extracting the desired information correctly.
The configuration and operation of the information extraction device 1 are described in detail below.

[Configuration of Information Extraction Device 1]
FIG. 3 is a diagram showing the configuration of the information extraction device 1. The information extraction device 1 includes a communication unit 11, a storage unit 12, and a control unit 13.

The communication unit 11 is a communication interface, including a communication controller, through which the information extraction device 1 exchanges data with the servers 2 and the database 3 via the Internet N. The communication unit 11 inputs the web page content received via the Internet N to the control unit 13, and transmits the event information output by the control unit 13 to the database 3.

The storage unit 12 includes storage media such as a ROM (Read Only Memory), a RAM (Random Access Memory), and a hard disk. The storage unit 12 stores the program executed by the control unit 13.

The control unit 13 is, for example, a CPU (Central Processing Unit). By executing the program stored in the storage unit 12, it functions as a content acquisition unit 131, an image creation unit 132, a specifying unit 133, and a registration unit 134.

The content acquisition unit 131 acquires content of a plurality of websites via the communication unit 11 and stores the acquired content in the storage unit 12.

FIG. 4 is a diagram showing an example of content acquired by the content acquisition unit 131. The content shown in FIG. 4 contains information about a cherry-blossom viewing event in U Park. The following description explains the process of specifying event information based on the content shown in FIG. 4.

When event information is to be extracted based on text, the content acquisition unit 131 inputs the acquired content to the specifying unit 133. When event information is to be extracted based on a screenshot image, the content acquisition unit 131 also inputs the acquired content to the image creation unit 132.

The image creation unit 132 creates a screenshot image of the content acquired by the content acquisition unit 131 as displayed on a screen, and inputs the created screenshot image to the specifying unit 133.

The specifying unit 133 uses the text included in the content acquired by the content acquisition unit 131, the screenshot image, and the position information to specify event information as the extraction target information contained in that content.
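The flow of data among the units 131 to 134 can be sketched as follows. This is a minimal illustration, not the implementation of the device: the class name, the callables standing in for each unit, and the list standing in for the database 3 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InformationExtractionDevice:
    # Each callable is a hypothetical stand-in for one functional unit.
    acquire_content: Callable[[str], str]             # content acquisition unit 131: URL -> source
    create_screenshot: Callable[[str], bytes]         # image creation unit 132: source -> image
    specify: Callable[[str, bytes], Dict[str, str]]   # specifying unit 133: inputs -> event info
    database: List[Dict[str, str]] = field(default_factory=list)  # stands in for database 3

    def process(self, url: str) -> Dict[str, str]:
        content = self.acquire_content(url)
        image = self.create_screenshot(content)
        event = self.specify(content, image)
        self.database.append(event)  # role of the registration unit 134
        return event


# Stub units so the sketch runs without network access or a real model.
device = InformationExtractionDevice(
    acquire_content=lambda url: "<html>Sakura Festival at U Park</html>",
    create_screenshot=lambda content: b"placeholder-image-bytes",
    specify=lambda content, image: {"place": "U Park", "content": "Sakura Festival"},
)
print(device.process("https://example.com/"))
```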

When the text included in the content shown in FIG. 4 is used as input data to the deep learning model, the specifying unit 133 extracts, for example, “Sakura Festival”, “Opening hours”, “Venue”, “Access”, “Late March to early April”, “March 25 to April 5”, “8:00 to 21:00”, “U Park”, and “March 20, 2017” as the input texts. As a result, the specifying unit 133 can obtain from the deep learning model an output indicating that the event takes place from “8:00 to 21:00” on “March 25 to April 5”, that the event is held at “U Park”, and that the content of the event is “Sakura Festival”.

However, the content shown in FIG. 4 also contains the date text “March 20, 2017”, so the deep learning model may produce the incorrect output that the event date is “March 20, 2017”. The specifying unit 133 can therefore increase the probability of specifying the correct extraction target information by using two or more of the text, the screenshot image, and the position information as inputs to the deep learning model. For example, among a plurality of input data items to the deep learning model, when the accuracy of specifying the extraction target information using the text as a first number of input data items is less than a threshold, the specifying unit 133 may specify the extraction target information using the screenshot image as a second number of input data items.

In the example shown in FIG. 4, the text “March 25 to April 5”, which indicates the dates of the Sakura Festival, is surrounded by a mesh pattern. When the specifying unit 133 uses the screenshot image as input data, the deep learning model outputs, from among the date texts “March 25 to April 5” and “March 20, 2017”, the text “March 25 to April 5” that is surrounded by the mesh pattern. In this way, the specifying unit 133 can specify the event information correctly by using the screenshot image as input data to the deep learning model.

When the accuracy of specifying event information using the text and the screenshot image as inputs to the deep learning model is less than a threshold, the specifying unit 133 additionally uses, as input to the deep learning model, position information indicating the position of an image region that contains a character string used in the event information. Specifically, the position deep learning model is created by deep learning that uses position information indicating where predetermined text, serving as learning information, is contained in the learning image content. The specifying unit 133 inputs to this model a predetermined character string included in the content acquired by the content acquisition unit 131 together with coordinates indicating the position of that character string. The deep learning model then outputs the event information that corresponds to the position of the input character string.
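The stepwise addition of inputs described above can be sketched as a loop that adds one input type at a time until a threshold is met. The source does not state how the accuracy is measured, so this sketch assumes the model returns a confidence score; the function name, the input keys, and the toy model are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# A model takes the inputs supplied so far and returns (event information, confidence).
Model = Callable[[Dict[str, object]], Tuple[Dict[str, str], float]]

# Order in which inputs are added: text first, then screenshot, then position information.
INPUT_ORDER = ("text", "screenshot", "positions")


def specify_with_escalation(inputs: Dict[str, object], model: Model, threshold: float):
    """Add input types one by one until the model's confidence reaches the threshold."""
    used: Dict[str, object] = {}
    event: Dict[str, str] = {}
    confidence = 0.0
    for name in INPUT_ORDER:
        used[name] = inputs[name]
        event, confidence = model(used)
        if confidence >= threshold:
            break
    return event, confidence, tuple(used)


def toy_model(used: Dict[str, object]) -> Tuple[Dict[str, str], float]:
    # Assumption for illustration only: confidence grows with the number of input types.
    confidence = {1: 0.4, 2: 0.7, 3: 0.9}[len(used)]
    return {"date": "March 25 to April 5"}, confidence


inputs = {
    "text": ["Sakura Festival", "March 25 to April 5", "March 20, 2017"],
    "screenshot": b"placeholder-image-bytes",
    "positions": {"March 25 to April 5": (200, 100)},
}
print(specify_with_escalation(inputs, toy_model, threshold=0.6))
```

With the toy confidences above, a threshold of 0.6 is met once the screenshot is added, so the position information is never consulted; a threshold of 0.8 causes all three input types to be used.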

To specify the position of a character string, the specifying unit 133 creates a character image based on a predetermined character string included in the content acquired by the content acquisition unit 131, and identifies the region of the screenshot image whose degree of correlation with the character image is equal to or greater than a threshold. By specifying the position of the image region that contains text used in the event information, the specifying unit 133 obtains the position of the character string to be used as input data to the deep learning model. Based on that position, the specifying unit 133 can specify, as the extraction target information, event information that includes at least one of the date and time, the place, and the content of the event.

FIG. 5 is a flowchart showing the procedure by which the specifying unit 133 specifies event information based on position information. This operation is described below with reference to FIGS. 4 and 5.

First, the specifying unit 133 renders the content acquired by the content acquisition unit 131 to create a screenshot image (S1). It then converts predetermined texts included in that content into images (S2). In the example shown in FIG. 4, the specifying unit 133 converts the texts “Sakura Festival”, “Opening hours”, “Venue”, “Access”, “Late March to early April”, “March 25 to April 5”, “8:00 to 21:00”, “U Park”, and “March 20, 2017” into images.

Next, the specifying unit 133 searches for the position of each converted text image within the screenshot image created in step S1 (S3). It specifies the position of each text on the web page by identifying the image region of the screenshot image that has the highest degree of correlation with the converted text image (S4). The specifying unit 133 then associates each text with the coordinates of its corresponding image region and stores them in the storage unit 12 (S5).
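Steps S3 and S4 amount to template matching. The source does not specify the correlation measure, so the sketch below assumes zero-mean normalized cross-correlation over grayscale pixels, written in plain Python for small images; the function names and the tiny sample images are hypothetical. It returns the best-matching position only when its score meets the threshold.

```python
from math import sqrt
from typing import List, Optional, Tuple

Image = List[List[float]]  # grayscale pixels, one inner list per row


def _ncc(patch: List[float], template: List[float]) -> float:
    """Zero-mean normalized cross-correlation of two equal-length pixel lists."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = sqrt(sum((p - mean_p) ** 2 for p in patch) * sum((t - mean_t) ** 2 for t in template))
    return num / den if den else 0.0  # flat regions carry no correlation information


def locate(screenshot: Image, template: Image, threshold: float = 0.9) -> Optional[Tuple[int, int]]:
    """Return (x, y) of the best-matching region, or None if its score is below the threshold."""
    th, tw = len(template), len(template[0])
    flat_t = [v for row in template for v in row]
    best_score, best_pos = -1.0, None
    for y in range(len(screenshot) - th + 1):
        for x in range(len(screenshot[0]) - tw + 1):
            patch = [screenshot[y + dy][x + dx] for dy in range(th) for dx in range(tw)]
            score = _ncc(patch, flat_t)
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos if best_score >= threshold else None


# Tiny stand-ins for a rendered text image and a screenshot that contains it at x=2, y=1.
template = [[1, 0],
            [0, 1]]
screenshot = [[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0]]
print(locate(screenshot, template))
```

For full-size screenshots an exhaustive pure-Python search like this would be slow; an image-processing library with an optimized template matching routine would normally be used instead.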

Next, the specifying unit 133 inputs the texts and coordinates stored in the storage unit 12 in step S5 to the deep learning model (S6). The specifying unit 133 specifies the event information that the deep learning model outputs based on the positional relationship of the texts (S7), and notifies the registration unit 134 of it.

In the example shown in FIG. 4, the text indicating the date of the event is placed immediately to the right of the text “Sakura Festival”, which indicates the content of the event. Likewise, the text indicating the location of the event is placed immediately to the right of the text “Venue”. In contrast, the date placed at the lower right of the web page is unlikely to be the date and time of the event. Thus there is thought to be a consistent relationship between the position of text that indicates event information and the position of certain predetermined texts. Accordingly, the specifying unit 133 can improve the accuracy of specifying event information by inputting each text and the coordinates of its corresponding image region to the position deep learning model, which is created from teacher data that includes the position information of texts in a large number of web pages.
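The positional relationship described above can be illustrated with a hand-written rule. This is not the position deep learning model of the embodiment, which learns such relationships from teacher data; it is a simplified stand-in showing why coordinates help. The function name, the tolerance parameter, and all coordinates are hypothetical, since the actual layout of FIG. 4 is not given numerically.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[int, int]  # (x, y) of the top-left corner of a text's image region


def text_right_of(label: str, positions: Dict[str, Point], row_tolerance: int = 10) -> Optional[str]:
    """Return the nearest text to the right of `label` on roughly the same row, if any."""
    lx, ly = positions[label]
    candidates = [
        (x - lx, text)
        for text, (x, y) in positions.items()
        if text != label and x > lx and abs(y - ly) <= row_tolerance
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical coordinates loosely following the layout described for FIG. 4.
positions = {
    "Sakura Festival": (40, 100),
    "March 25 to April 5": (200, 102),
    "Venue": (40, 160),
    "U Park": (200, 160),
    "March 20, 2017": (520, 400),  # lower right of the page
}
print(text_right_of("Sakura Festival", positions))
print(text_right_of("Venue", positions))
```

Under these assumed coordinates, the date next to the event title is selected, while the date at the lower right is never chosen because it does not sit beside any label.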

The specifying unit 133 specifies a large number of pieces of event information from the content of many web pages using at least one of the text, the screenshot image, and the position information, and the registration unit 134 sequentially registers the pieces of event information in the database 3.

FIG. 6 is a diagram illustrating an example of the database 3 in which event information is registered. In the event information database shown in FIG. 6, an event number, the date of the event, the time of the event, the venue of the event, and the content of the event are associated with one another. The event information specified from the web page shown in FIG. 4 is the event information with event number 0002.

When the event information specified from different web pages differs, the registration unit 134 may register in the database 3 only the event information that matches across at least a predetermined proportion of the web pages. For example, when only one piece of the event information specified from a plurality of web pages has a different event date, the registration unit 134 may refrain from registering the piece of event information whose date differs.
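The agreement rule can be sketched as follows, assuming that each web page yields one `(date, venue, content)` tuple and that "a predetermined proportion" is a ratio of the pages examined. The function name `agreed_event_info`, the default ratio, and the sample values are hypothetical.

```python
from collections import Counter

def agreed_event_info(per_page_info, min_ratio=0.5):
    """Keep only event information found on at least `min_ratio` of the pages."""
    counts = Counter(per_page_info)
    total = len(per_page_info)
    return [info for info, n in counts.items() if n / total >= min_ratio]

# Hypothetical results from three pages; one page reports a different date.
per_page_info = [
    ("2017-04-01", "U Park", "Sakura Festival"),
    ("2017-04-01", "U Park", "Sakura Festival"),
    ("2017-04-02", "U Park", "Sakura Festival"),
]

to_register = agreed_event_info(per_page_info)
```

Here the outlying date is dropped because it appears on only one of the three pages.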

The registration unit 134 may register the specified event information in the database 3 on the condition that the event information specified by the specifying unit 133 consists of a character string different from the event information already registered in the database 3. This prevents many pieces of information about the same event from being registered in the database 3.

The registration unit 134 may also register in the database 3, in association with the event information, a numerical value corresponding to the number of web pages on which the specification of the event information was based. An application that refers to the event information registered in the database 3 can use the registered value to select highly reliable event information.
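A minimal sketch of these two registration options, assuming the database 3 is modelled as an in-memory dictionary keyed by the event information string. The names `register` and `reliable_events` and the sample values are hypothetical.

```python
def register(database, event_info, source_page_count):
    """Register event_info unless an identical string is already present.

    The stored value is the number of web pages that supported the event
    information, which an application can later use as a reliability hint.
    """
    if event_info in database:
        return False  # identical string already registered; skip the duplicate
    database[event_info] = source_page_count
    return True

def reliable_events(database, min_pages):
    """Select event information backed by at least `min_pages` web pages."""
    return [info for info, n in database.items() if n >= min_pages]

database = {}
first = register(database, "2017-04-01 / U Park / Sakura Festival", 12)
second = register(database, "2017-04-01 / U Park / Sakura Festival", 3)
third = register(database, "2017-05-03 / City Hall / Art Fair", 1)

trusted = reliable_events(database, min_pages=5)
```

Exact string comparison only catches verbatim duplicates; two descriptions of the same event with different wording would both be registered under this rule.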

Alternatively, the registration unit 134 may register in the database 3 all the event information specified by the specifying unit 133. In this case, the application that refers to the database 3 selects which event information to use according to the accuracy the application requires, so that a user of the application can obtain appropriate event information.

[Operation flowchart of the information extraction device 1]
FIG. 7 is an operation flowchart of the information extraction device 1. When the information extraction device 1 starts the process of specifying event information, the content acquisition unit 131 first acquires the content of many web pages (S11). Once the content acquisition unit 131 has acquired the content, the specifying unit 133 specifies event information by inputting the text included in the content into the text deep learning model (S12).

When the specifying unit 133 determines that the accuracy of the event information specified from the text is equal to or greater than a threshold (Yes in S13), the process proceeds to step S17, where the registration unit 134 registers the event information in the database 3 (S17). When the specifying unit 133 determines that the accuracy of the event information specified from the text is less than the threshold (No in S13), the process proceeds to step S14, where the specifying unit 133 specifies event information by inputting the screenshot image into the image deep learning model (S14). In step S14, the specifying unit 133 may specify the event information using both the text and the screenshot image.

When the specifying unit 133 determines that the accuracy of the event information specified from the screenshot image is equal to or greater than the threshold (Yes in S15), the process proceeds to step S17, where the registration unit 134 registers the event information in the database 3 (S17). When the specifying unit 133 determines that the accuracy of the event information specified from the screenshot image is less than the threshold (No in S15), the process proceeds to step S16, where the specifying unit 133 specifies event information by inputting the position information into the position deep learning model (S16). The processing in step S16 is the processing of steps S1 to S7 shown in FIG. 5. In step S16, the specifying unit 133 may specify the event information by combining all of the text, the screenshot image, and the position information.
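The fallback order of steps S12 to S16 can be sketched as a simple cascade. The sketch assumes each model is a callable returning an `(event_info, accuracy)` pair and that the result of the position stage is accepted without a further threshold check, which the description does not state explicitly. The name `specify_event_info`, the threshold value, and the stub models are hypothetical.

```python
def specify_event_info(page, text_model, image_model, position_model, threshold=0.8):
    """Try text, then screenshot image, then position information.

    Returns (event_info, accuracy, stage), where stage names the model
    whose output was accepted.
    """
    stages = (("text", text_model), ("image", image_model), ("position", position_model))
    for stage, model in stages:
        event_info, accuracy = model(page)
        # Assumption: the last stage is accepted even below the threshold.
        if accuracy >= threshold or stage == "position":
            return event_info, accuracy, stage

# Stub models standing in for the three deep learning models.
def text_stub(page):
    return {"content": "Sakura Festival"}, 0.5

def image_stub(page):
    return {"content": "Sakura Festival", "venue": "U Park"}, 0.9

def position_stub(page):
    return {"content": "Sakura Festival", "venue": "U Park"}, 0.95

info, accuracy, stage = specify_event_info("<html>...</html>", text_stub, image_stub, position_stub)
```

Ordering the stages from cheapest to most expensive means the screenshot and template-matching work is only done for pages where the text alone is not enough.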

As described above, the information extraction device 1 can specify event information with high accuracy by combining, as input data to the deep learning model, the text included in a web page, a screenshot image created from the source code, and position information indicating the position of text related to the event.

[Modification 1]
In the above description, the specifying unit 133 uses a predetermined deep learning model. However, the way extraction target information tends to be presented on a web page is considered to vary with the type of event, the nationality of the person who created the web page, the language used in the web page, and so on. Therefore, to increase the probability of correctly specifying the extraction target information, the specifying unit 133 may use different deep learning models depending on the type of the extraction target information. Specifically, the specifying unit 133 can receive, from an external computer via the communication unit 11, a designation of the type of extraction target information to be specified, and specify the extraction target information using the deep learning model corresponding to the designated type.

The type of extraction target information is, for example, the language of the target web page, the country in which the web page was created, and the type of event. Specifically, the specifying unit 133 can use “soccer-related events posted on Japanese-language sites”, “soccer-related events posted on English-language sites”, “art-related events posted on English-language sites”, “music-related events posted on Chinese-language sites”, and the like as types of extraction target information. By using a deep learning model suited to specifying extraction target information of the designated type, the specifying unit 133 can increase the probability of correctly specifying the extraction target information.
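Selecting a model by designated type amounts to a lookup. The sketch below assumes the type is expressed as a `(language, event category)` pair and that a generic model is used when no specialized one exists; the fallback, the registry contents, and the name `select_model` are hypothetical and not taken from the description.

```python
# Hypothetical registry: model identifiers keyed by (language, event category).
MODELS = {
    ("ja", "soccer"): "model_ja_soccer",
    ("en", "soccer"): "model_en_soccer",
    ("en", "art"): "model_en_art",
    ("zh", "music"): "model_zh_music",
}

def select_model(language, event_category, default="model_generic"):
    """Return the model registered for the designated type, or a generic fallback."""
    return MODELS.get((language, event_category), default)

chosen = select_model("en", "art")
fallback = select_model("fr", "theatre")
```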

[Effects of the information extraction device 1 of the first embodiment]
As described above, the information extraction device 1 of this embodiment includes the image creation unit 132, which creates a screenshot image of the content acquired by the content acquisition unit 131 as displayed on a screen, and the specifying unit 133, which specifies event information as the extraction target information included in the screenshot image by using the screenshot image as input data to a deep learning model. Because the specifying unit 133 specifies event information using the screenshot image, the event information included in a web page can be specified based on tendencies in the screen that a person viewing the web page actually sees, which increases the probability of correctly specifying the event information.

In particular, the specifying unit 133 specifies event information by using second position information, which indicates the position of an image region containing a character string used in the event information, as input data to a deep learning model associated with first position information of predetermined text related to events. This allows the event information included in a web page to be specified based on tendencies in the positional relationships of the text included in web pages. Therefore, even when a web page contains multiple texts that resemble event information, the specifying unit 133 can correctly specify the event information with high probability.

The present invention is also effective when a search engine ranks web pages in descending order of relevance to a search keyword. When the relevance between a search keyword and a web page is determined from text alone, as in conventional search engines, a web page can be pushed up the ranking by search engine optimization (SEO) measures that embed the search keyword in tags the user never sees. In contrast, the present invention uses the pixel data of the screenshot image, so web pages can be ranked based on what a user viewing the web page can actually see. Search accuracy can therefore be improved even when code for SEO purposes is embedded in a web page.

<Second Embodiment>
FIG. 8 is a diagram illustrating the configuration of the information extraction device 4 of the second embodiment. The information extraction device 4 differs from the information extraction device 1 of the first embodiment in that it has an advertisement providing unit 135 instead of the registration unit 134.

The method by which the specifying unit 133 of the information extraction device 4 specifies information is the same as in the first embodiment, but the specifying unit 133 of the information extraction device 4 also specifies information other than event information as extraction target information, that is, the information to be extracted from the screenshot image. The extraction target information is, for example, text and images included in a dictionary created in advance.

The storage unit 12 stores advertisement information in association with the various kinds of information that the specifying unit 133 can specify. On acquiring the information specified by the specifying unit 133, the advertisement providing unit 135 provides the server 2 with the advertisement information stored in the storage unit 12 in association with the acquired information. The server 2 transmits the advertisement information provided by the advertisement providing unit 135, together with the content of the web page, to a terminal that accesses the web page.

FIG. 9 is a diagram showing advertisement information A1 and advertisement information A2 displayed together with content. The specifying unit 133 specifies various kinds of information in the web page shown in FIG. 9, for example “U Park”, “cherry blossom viewing”, “cherry blossoms”, and “festival”. The specifying unit 133 also specifies the position at which the specified information is displayed and determines the importance of the information based on the specified position.

The specifying unit 133 notifies the advertisement providing unit 135 of information whose importance is equal to or greater than a threshold. For example, the specifying unit 133 notifies the advertisement providing unit 135 of the information “U Park”. In this case, the advertisement providing unit 135 transmits the advertisement information A1 and the advertisement information A2, which are stored in the storage unit 12 in association with “U Park”, to the server 2, so that the advertisement information A1 and the advertisement information A2 are displayed on the web page.
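The flow from position to importance to advertisement selection can be sketched as below. The description does not say how importance is derived from position, so this sketch assumes, purely for illustration, that text nearer the top of the page is more important. The names `importance` and `select_ads`, the threshold, the coordinates, and the advertisement "A3" are hypothetical.

```python
def importance(y, page_height):
    """Assumed scoring: 1.0 at the top of the page, falling to 0.0 at the bottom."""
    return 1.0 - y / page_height

def select_ads(found, page_height, ads_by_keyword, threshold=0.7):
    """Collect the advertisements tied to every keyword whose importance passes the threshold."""
    ads = []
    for keyword, y in found.items():
        if importance(y, page_height) >= threshold:
            ads.extend(ads_by_keyword.get(keyword, []))
    return ads

# Hypothetical vertical positions (pixels from the top) of specified information.
found = {"U Park": 50, "festival": 600}
# Advertisement information stored in association with each piece of information.
ads_by_keyword = {"U Park": ["A1", "A2"], "festival": ["A3"]}

ads = select_ads(found, page_height=1000, ads_by_keyword=ads_by_keyword)
```

With these sample values only “U Park” passes the threshold, so A1 and A2 are the advertisements provided to the server.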

FIG. 10 is an operation flowchart of the information extraction device 4 of the second embodiment. S21 to S26 correspond to S11 to S16 in the operation flowchart shown in FIG. 7, except that the information specified in S22, S24, and S26 is extraction target information, which is not limited to event information. When the specifying unit 133 has specified the extraction target information in steps S22 to S26, the advertisement providing unit 135 selects advertisement information based on the specified extraction target information. Then, in S28, the advertisement providing unit 135 provides the selected advertisement information to the server 2.

[Effects of the information extraction device 4 of the second embodiment]
As described above, in the information extraction device 4 of this embodiment, the specifying unit 133 specifies extraction target information in a web page, and the advertisement providing unit 135 provides the server 2 with advertisement information associated with the specified extraction target information. Using the information extraction device 4 thus allows advertisements for products and services related to the content of a web page to be displayed in that web page, which increases the probability that a user viewing the web page will be interested in them.

The present invention has been described above using embodiments, but the technical scope of the present invention is not limited to the scope described in the above embodiments, and various modifications and changes are possible within the scope of its gist. For example, the specific form in which devices are distributed or integrated is not limited to the above embodiments, and all or part of them can be functionally or physically distributed or integrated in arbitrary units. New embodiments arising from any combination of a plurality of embodiments are also included in the embodiments of the present invention. The effects of a new embodiment arising from such a combination include the effects of the original embodiments.

Description of symbols
1 Information extraction device
2 Server
3 Database
4 Information extraction device
11 Communication unit
12 Storage unit
13 Control unit
131 Content acquisition unit
132 Image creation unit
133 Specifying unit
134 Registration unit
135 Advertisement providing unit

Claims

An information extraction device comprising:
a content acquisition unit that acquires content of a plurality of websites;
an image creation unit that creates a screenshot image of the content acquired by the content acquisition unit as displayed on a screen; and
a specifying unit that specifies extraction target information included in the screenshot image by using the screenshot image as input data to a deep learning model created by deep learning based on a plurality of pieces of learning image content that include learning information.

The information extraction device according to claim 1, wherein
the deep learning model is created by performing deep learning using first position information indicating the position at which the learning information is included in the learning image content, and
the specifying unit specifies the extraction target information by using, as input data to the deep learning model associated with the first position information, second position information indicating the position of an image region that includes a character string used in the extraction target information.

The information extraction device according to claim 2, wherein the specifying unit creates a character image based on a predetermined character string included in the content acquired by the content acquisition unit, and specifies the position of the image region by identifying a region in the screenshot image whose degree of correlation with the character image is equal to or greater than a threshold.

The information extraction device according to claim 2 or 3, wherein the specifying unit specifies, as the extraction target information and based on the position of the image region, event information including at least one of the date and time, the venue, and the content of an event.

The information extraction device according to any one of claims 2 to 4, wherein the specifying unit specifies the extraction target information by using, as input data to the deep learning model, two or more of the text included in the content acquired by the content acquisition unit, the screenshot image, and the second position information.

The information extraction device according to claim 5, wherein, when the accuracy with which the extraction target information is specified using the text and the screenshot image as input data to the deep learning model is less than a threshold, the specifying unit further uses the second position information as input data to the deep learning model.

The information extraction device according to claim 5 or 6, wherein, when the accuracy with which the extraction target information is specified using the text as first input data among a plurality of input data to the deep learning model is less than a threshold, the specifying unit specifies the extraction target information using the screenshot image as second input data.

The information extraction device according to any one of claims 1 to 7, wherein the specifying unit receives a designation of the type of the extraction target information to be specified, and specifies the extraction target information using the deep learning model corresponding to the designated type.

The information extraction device according to any one of claims 1 to 8, further comprising an advertisement providing unit that provides an advertisement associated with the extraction target information specified by the specifying unit.

An information extraction method executed by a computer, the method comprising the steps of:
acquiring content of a plurality of websites;
creating a screenshot image of the acquired content as displayed on a screen; and
specifying extraction target information included in the screenshot image by using the screenshot image as input data to a deep learning model created by deep learning based on a plurality of pieces of learning image content that include learning information.