JP6763732B2

JP6763732B2 - Extractor

Info

Publication number: JP6763732B2
Application number: JP2016189749A
Authority: JP
Inventors: 悠菊地; 桂一落合; 健榎園; 慎石黒; 佑介深澤
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2016-09-28
Filing date: 2016-09-28
Publication date: 2020-09-30
Anticipated expiration: 2036-09-28
Also published as: JP2018055346A

Description

The present invention relates to an extraction device that extracts words from a document.

Techniques for predicting (estimating) congestion at a given place (for example, an event venue) have been proposed. For example, Patent Document 1 proposes a technique that predicts congestion using the position information of mobile bodies.

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-213098

However, when the position information of mobile bodies is used as described above, congestion can be estimated for the places where those mobile bodies were located, but it is difficult to estimate congestion at other places. It is therefore conceivable to estimate congestion based on the content of a document that includes information indicating a place (for example, the name of the place or an event name).

For example, one conceivable method divides a document containing the event name or announcement text of the event to be estimated into words by morphological analysis, uses the resulting set of words (Bag of Words) as the feature of that event, and calculates the similarity between this feature and the features of events that were congested in the past, thereby determining whether the place to be estimated is likely to be congested. However, because such documents contain many words (noise) unrelated to congestion estimation, it may not be possible to estimate accurately whether the target event will be congested. Even if inverse document frequency (IDF) is simply applied to the set of documents containing event names, words that are specific to an individual event and unrelated to congestion estimation are also extracted, so appropriate words cannot be extracted.

The present invention has been made in view of the above problem, and its object is to provide an extraction device that appropriately extracts words for estimating the congestion status of a congestion status estimation target place based on the content of a document corresponding to that place.

To solve the above problem, the extraction device of the present invention extracts congestion-related characteristic words, which are words used when estimating the congestion status of a congestion status estimation target place based on the content of a document corresponding to that place. The extraction device comprises: an information acquisition unit that acquires a document serving as an extraction source and congestion information indicating the congestion status of the place corresponding to the content of that document; and an extraction unit that extracts congestion-related characteristic words from the extraction-source document based on the degree of appearance of words in documents corresponding to places whose congestion status, as indicated by the congestion information acquired by the information acquisition unit, satisfies a preset condition.

According to this invention, congestion-related characteristic words peculiar to documents corresponding to places whose congestion status satisfies a predetermined condition can be extracted appropriately. As a result, the congestion status of the place corresponding to a document that includes information indicating the congestion status estimation target (for example, a document about the venue of an event scheduled for the future) can be estimated based on whether that document contains the extracted words. In this way, congestion-related characteristic words for estimating the congestion status of the estimation target place from the content of a document including information indicating the estimation target can be extracted appropriately.

According to the present invention, words for estimating the congestion status of a congestion status estimation target place can be appropriately extracted based on the content of a document corresponding to that place.

A block diagram showing the functional configuration of the estimation device according to one embodiment.
A diagram showing a data example of the POI database.
A diagram showing a data example of the dynamic number database.
A diagram showing a data example of the event database.
A diagram showing the hardware configuration of the estimation device.
A flowchart showing the processing executed by the estimation device.
A diagram explaining the processing for extracting multiple words.

Hereinafter, an embodiment of the extraction device according to the present invention is described in detail with reference to the drawings. In the description of the drawings, identical elements are given identical reference numerals, and duplicate description is omitted.

FIG. 1 shows the estimation device 10 according to the present embodiment. The estimation device 10 (extraction device) estimates the congestion status of a congestion status estimation target place based on the content of a document corresponding to that place. The estimation device 10 uses congestion-related characteristic words when making this estimate. Before the estimate, the estimation device 10 extracts the congestion-related characteristic words from documents other than the document corresponding to the estimation target place. The congestion status estimation target place is, for example, the venue of an event. An "event" is any occasion uniquely identified by, for example, an event name, a venue, and a holding period. Specific examples range from large-scale events such as sports events and live band concerts (events expected to have many participants) to small-scale events such as CD release events and mini live performances (events expected to have few participants).

The above documents are, for example, text data posted to a microblog by various users on the Internet. In this case, the text data of one post is treated as one document. Such a document may contain information about an event. The estimation device 10 extracts event information from the documents. Here, event information is information indicating an event and includes information indicating the event name, the venue, and the holding period.

So that it can acquire (receive) the above documents, the estimation device 10 is connected via a network such as the Internet to a text output device that outputs documents (for example, a server providing a microblog service), and can exchange information with it.

The estimation device 10 extracts event information from the above documents, determines the congestion status of the event indicated by the event information, and extracts congestion-related characteristic words from the documents based on the result of that determination. The estimation device 10 also refers to a document about an event (for example, an event to be held in the future), that is, a document corresponding to the congestion status estimation target place, and estimates the congestion status of that event using the extracted congestion-related characteristic words.

The estimation device 10 includes an information acquisition unit 11, an extracted document database 12, a POI database 13, a dynamic number database 14, an event database 15, an extraction unit 16, an estimation target document database 17, and an estimation unit 18.

The information acquisition unit 11 acquires a document serving as an extraction source and congestion information indicating the congestion status of the place corresponding to the content of that document. The information acquisition unit 11 acquires the congestion information by identifying the place corresponding to the content of the acquired document, acquiring an index value indicating the degree of congestion of that place, and generating the congestion information based on the index value.

The information acquisition unit 11 acquires documents from the text output device by sending it, at a predetermined timing (for example, once a month), a request to transmit the documents that serve as extraction sources.

The information acquisition unit 11 may acquire all of each user's documents from the text output device, or may acquire only the documents posted by users within a certain period, for example the past month. The information acquisition unit 11 stores the acquired documents in the extracted document database 12. The extracted document database 12 is a means for storing documents.

At a predetermined timing (for example, once a month), the information acquisition unit 11 acquires a document from the extracted document database 12 and extracts event information from it. The information acquisition unit 11 extracts the event information by, for example, a technique using an SVM (Support Vector Machine), CRF (Conditional Random Fields), and the like, as described in Reference 1 (Wataru Yamada, Yu Kikuchi, Keiichi Ochiai, Daisuke Torii, Hiroshi Inamura, and Ken Ota, "Event Information Extraction Technology Using Microblog," Information Processing Society of Japan Journal, Vol. 57, No. 1, pp. 123-132 (Jan. 2016)). With this technique, character strings indicating the event name, the holding period, and the venue can each be extracted from documents such as those described above. In this way, the information acquisition unit 11 identifies the venue corresponding to the content of the acquired document based on that content.

For example, suppose there is a document reading "5/2 (Sat) Team A vs Team B left outfield reserved seat pair tickets. 2015 professional baseball, Team A home game, Team A vs Team B, two consecutively numbered left outfield reserved seat tickets. Date: May 2, 2015 (Sat), game starts at 14:00. C Baseball Field. Seat: left outfield 8 http://t.co/xxyyzz". In that case, "Team A vs Team B" is extracted as the event name, "May 2, 2015" as the holding period, and "C Baseball Field" as the facility name (POI (Point of Interest) name) indicating the venue, and the extracted event information is associated with the document from which it was extracted.

When the information acquisition unit 11 refers to a document stored in the extracted document database 12 and extracts event information from it by the above technique, it associates the event information with that document. The information acquisition unit 11 then determines the congestion status of the POI indicated by the POI name in the extracted event information and generates congestion information based on the result. There are two ways to determine the congestion status of that POI: a method based on the popularity of the POI (method 1), and a method based on the size of the increase, as a difference or a ratio, in the dynamic number of people caused by the event (method 2).

The determination of congestion status by method 1 is described first. The information acquisition unit 11 refers to the POI database 13 and acquires the popularity (index value) of the POI corresponding to the POI name in the extracted event information. The popularity of a POI indicates the degree to which people gather at that POI. High popularity means that many people tend to gather there, so congestion is likely. The popularity of a POI is, for example, an evaluation value of the POI entered by users or a tally of the number of searches made with a keyword indicating the POI, and is a value calculated by a known method for calculating POI popularity. A data example of the POI database 13 is shown in FIG. 2.

The POI database 13 is a database that associates each POI with information indicating its popularity. The database shown in FIG. 2 stores "POI name", "latitude/longitude", "mesh code", and "popularity" in association with one another.

"POI name" is the name of the POI. The POI name is information (an ID) that is unique among the records of the POI database 13. "Latitude/longitude" is information indicating the position of the POI, given as latitude and longitude values. "Mesh code" is the identifier of the mesh corresponding to the latitude and longitude of the POI. A mesh is one of the rectangular cells obtained by dividing a region into a grid based on latitude and longitude. "Popularity" is the popularity of the POI.

The information acquisition unit 11 refers to the POI database 13 and acquires the popularity from the record containing the POI name of each piece of extracted event information. That is, using the POI name of each piece of extracted event information as a search key, the information acquisition unit 11 looks up the POI database 13 and acquires the popularity corresponding to that POI name. Once it has acquired the popularity for the POI names of all event information, it compares the popularity values among the POIs indicated by those POI names. As a result of the comparison, for a POI whose popularity falls within a predetermined top fraction (for example, the top 10%), the information acquisition unit 11 determines that the congestion status of the event at that POI is congested. For a POI whose popularity falls within a predetermined bottom fraction (for example, the bottom 10%), it determines that the congestion status of the event at that POI is non-congested. Here, non-congested means uncrowded (few participants). In this way, method 1 determines the congestion status of the event at the POI indicated by a POI name from the popularity corresponding to that POI name.
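The percentile rule of method 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `classify_by_popularity`, the label strings, and the handling of ties by sort order are hypothetical, while the 10% fractions come from the examples in the text. The text does not say how POIs in the middle of the ranking are labeled under method 1, so the sketch marks them "neither".

```python
from typing import Dict


def classify_by_popularity(
    popularity: Dict[str, float],
    top_ratio: float = 0.10,      # example value from the text (top 10%)
    bottom_ratio: float = 0.10,   # example value from the text (bottom 10%)
) -> Dict[str, str]:
    """Label each POI by where its popularity ranks among all POIs."""
    # Rank POI names from most to least popular (ties keep sort order).
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    n = len(ranked)
    n_top = int(n * top_ratio)
    n_bottom = int(n * bottom_ratio)

    # Assumption: POIs outside both fractions are labeled "neither".
    labels = {name: "neither" for name in ranked}
    for name in ranked[:n_top]:
        labels[name] = "congested"
    for name in ranked[n - n_bottom:]:
        labels[name] = "non-congested"
    return labels
```

With ten POIs and the default fractions, only the single most popular POI is labeled congested and only the least popular one non-congested.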

Next, the determination of congestion status by method 2 is described. The information acquisition unit 11 first refers to the holding period in the extracted event information and determines whether it is earlier than the current time. That is, it determines whether the event indicated by the event information is a past event (whether the event has ended). If the event is not a past event, the information acquisition unit 11 discards the extracted event information. Next, for all extracted event information other than the discarded event information, the information acquisition unit 11 looks up the POI database 13 using each POI name as a search key and acquires the mesh code corresponding to that POI name. The information acquisition unit 11 then refers to the dynamic number database 14 and acquires the estimated number of people at each time for that mesh code. A data example of the dynamic number database 14 is shown in FIG. 3.

The dynamic number database 14 is a database that stores, for each mesh code, the estimated number of people at each time. The database shown in FIG. 3 stores "mesh code", "time", and "estimated number of people" in association with one another.

"Mesh code" is the mesh code of the mesh for which the estimate was made. "Time" is the time at which the estimated number of people was measured. Information indicating the period over which the estimate was measured (for example, a measurement start time and a measurement end time) may be entered as the "time" instead. "Estimated number of people" is the estimated number of people located in the mesh indicated by the mesh code at that "time". This estimate is, for example, a value calculated by identifying the number of terminal devices located in each mesh from the position information of the terminal devices and computing the estimate from that number.

The information acquisition unit 11 refers to the dynamic number database 14 and acquires the estimated number of people at each time for the mesh code. Specifically, using the mesh code as a search key, it acquires the estimated numbers of people for the holding period of each piece of event information and for a certain period (for example, two weeks) before that holding period. If the estimated number of people at each time cannot be acquired for the mesh code of the POI name in a piece of event information, it cannot be determined whether the event is congested, so that event information is discarded.

After acquiring the estimated number of people at each time for the mesh code, the information acquisition unit 11 calculates the average of the estimated numbers at the times within the holding period (the average number of people during the event). It then calculates the average number of people over the certain period preceding the holding period and takes this as the normal number of people. The information acquisition unit 11 calculates either a subtraction value, obtained by subtracting the normal number of people from the average number during the event, or a division value, obtained by dividing the average number during the event by the normal number of people. If the subtraction value or division value (index value) is equal to or greater than a predetermined congestion threshold, the information acquisition unit 11 determines that the event indicated by the event information is congested. If the subtraction value or division value is less than a predetermined non-congestion threshold, it determines that the event is non-congested. The congestion threshold is equal to or greater than the non-congestion threshold. If the subtraction value or division value is less than the congestion threshold and equal to or greater than the non-congestion threshold, the information acquisition unit 11 determines that the event is neither congested nor non-congested. In this way, method 2 determines the congestion status of the event indicated by the event information by comparing the estimated number of people during the holding period with the estimated number of people at normal times.
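The threshold rule of method 2 can be sketched as follows. This is a hedged illustration only: the function name `judge_congestion`, its parameters, the label strings, and the choice to return `None` when data is missing (mirroring the discarding of event information described above) are assumptions, and the threshold values used in any call are made up. Treating a zero normal-time average as undecidable in ratio mode is also an assumption that the text does not address.

```python
from statistics import mean
from typing import Optional, Sequence


def judge_congestion(
    event_counts: Sequence[float],     # estimated people at times in the holding period
    baseline_counts: Sequence[float],  # estimated people in the preceding period
    congestion_threshold: float,
    non_congestion_threshold: float,   # must not exceed congestion_threshold
    use_ratio: bool = False,           # False: subtraction value, True: division value
) -> Optional[str]:
    """Return 'congested', 'non-congested', 'neither', or None if undecidable."""
    if not event_counts or not baseline_counts:
        return None  # no estimates available: the event information is discarded

    during = mean(event_counts)     # average number of people during the event
    normal = mean(baseline_counts)  # normal number of people

    if use_ratio:
        if normal == 0:
            return None  # assumption: ratio undefined, treat as undecidable
        index = during / normal
    else:
        index = during - normal

    if index >= congestion_threshold:
        return "congested"
    if index < non_congestion_threshold:
        return "non-congested"
    return "neither"
```

For instance, with a congestion threshold of 3000 and a non-congestion threshold of 500 on the subtraction value, an event averaging 6000 people against a normal 1000 is judged congested.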

Having determined the congestion status of the event of each piece of event information by method 1 or method 2, the information acquisition unit 11 generates, as congestion information based on that determination, flag information indicating whether the event is congested (congestion flag) and flag information indicating whether it is non-congested (non-congestion flag). If it determines that the event is congested, the information acquisition unit 11 sets the congestion flag to "True" and the non-congestion flag to "False". If it determines that the event is non-congested, it sets the congestion flag to "False" and the non-congestion flag to "True". If it determines that the event is neither congested nor non-congested, it sets both the congestion flag and the non-congestion flag to "False". That is, there can be events for which both flags are "False". In this way, the information acquisition unit 11 acquires the congestion information by generating the congestion flag and the non-congestion flag, which indicate the congestion status of the POI named in the event information. The information acquisition unit 11 may generate only one of the two flags. After generating the congestion flag and the non-congestion flag for the event of each piece of event information, the information acquisition unit 11 stores, in the event database 15, event data that associates the event information, the document, the congestion flag, and the non-congestion flag. A data example of the event database 15 is shown in FIG. 4.

The event database 15 stores "event name", "POI name", "holding period", "posted text", "congestion flag", and "non-congestion flag" in association with one another.

"Event name" is the event name in the event information. "POI name" is the POI name in the event information. "Holding period" is the holding period in the event information. "Posted text" is the document corresponding to the event information. "Congestion flag" is the congestion flag generated by the information acquisition unit 11. "Non-congestion flag" is the non-congestion flag generated by the information acquisition unit 11.
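One way to picture an event data record and the flag assignment is the sketch below. The class name `EventData`, its field names, and the helper `make_flags` are hypothetical names chosen for illustration; only the six columns and the three flag combinations come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EventData:
    """One record of the event database (field names are illustrative)."""
    event_name: str
    poi_name: str
    holding_period: str
    posted_text: str
    congestion_flag: bool
    non_congestion_flag: bool


def make_flags(status: str) -> Tuple[bool, bool]:
    """Map a judged status to (congestion flag, non-congestion flag).

    'congested'     -> (True, False)
    'non-congested' -> (False, True)
    anything else   -> (False, False), i.e. neither congested nor non-congested
    """
    return (status == "congested", status == "non-congested")


# Usage: build a record for the baseball example given earlier in the text.
congestion, non_congestion = make_flags("congested")
record = EventData(
    event_name="Team A vs Team B",
    poi_name="C Baseball Field",
    holding_period="2015-05-02",
    posted_text="5/2 (Sat) Team A vs Team B left outfield reserved seat ...",
    congestion_flag=congestion,
    non_congestion_flag=non_congestion,
)
```

The "congested" status passed in the usage lines is an arbitrary input for the demonstration, not a result stated in the source.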

After extracting event information from the documents and storing the event data for all extracted event information in the event database 15, the information acquisition unit 11 notifies the extraction unit 16 that the event data has been stored. In response, the extraction unit 16 extracts congestion-related characteristic words.

The extraction unit 16 extracts congestion-related characteristic words from the extraction-source documents based on the degree of appearance of words in documents corresponding to places whose congestion status, as indicated by the congestion flag or the non-congestion flag, satisfies a preset condition. More specifically, the extraction unit 16 extracts congestion-related characteristic words based on both the degree of appearance of words in documents corresponding to places whose congestion status satisfies a preset first condition on the congestion flag or non-congestion flag, and the degree of appearance of words in documents corresponding to places whose congestion status satisfies a second condition different from the first condition.

抽出部１６は、イベントデータがイベントデータベース１５に記憶された旨の通知を情報取得部１１から受信すると、混雑関連特徴語の抽出処理を開始する。具体的に、抽出部１６は、混雑時の混雑関連特徴語を抽出する場合、混雑フラグが「Ｔｒｕｅ」である（第１の条件）イベントの文書を公知技術により形態素解析して、混雑関連特徴語の候補として、形態素を抽出する。抽出部１６は、抽出した形態素毎に、当該形態素が混雑関連特徴語となるか否かを判断するためのスコアを算出する。抽出部１６は、混雑フラグが「Ｔｒｕｅ」であるイベントデータの文書の集合（混雑時イベント集合）における、抽出した形態素の出現頻度を算出する。また、抽出部１６は、全てのイベントデータの文書の集合（混雑フラグが「Ｔｒｕｅ」又は混雑フラグが「Ｆａｌｓｅ」である（第２の条件）イベントデータの文書の集合）における、抽出した形態素の出現頻度を算出する。抽出部１６は、上記混雑時イベント集合の文書数を算出し、算出した文書数と、混雑時イベント集合における、抽出した形態素の出現頻度と、全てのイベントデータの文書の集合における抽出した形態素の出現頻度とに基づいた、スコアを算出する。例えば、以下の式（１）により、混雑時における混雑関連特徴語の候補となる形態素のスコアを算出する（ダイス係数を用いた場合）。Ｓｃｏｒｅ_ｗｏｒｄは、混雑時の混雑関連特徴語の候補となる形態素のスコアである。ｄｆ（ｗｏｒｄｉｎｓｅｔ_{ｃｏｎｇｅｓｔｉｏｎ}）は、混雑時イベント集合における形態素の出現頻度である。ｎｕｍ（ｓｅｔ_{ｃｏｎｇｅｓｔｉｏｎ}）は、上記の混雑時イベント集合の文書数である。ｄｆ（ｗｏｒｄｉｎｓｅｔ_{ｅｖｅｎｔ}）は、全てのイベントデータの文書の集合における形態素の出現頻度である。
When the extraction unit 16 receives the notification from the information acquisition unit 11 that event data has been stored in the event database 15, it starts the process of extracting congestion-related feature words. Specifically, when extracting the congestion-related feature words for the congested case, the extraction unit 16 applies morphological analysis, using a known technique, to the documents of events whose congestion flag is "True" (first condition), and extracts the resulting morphemes as candidates for congestion-related feature words. For each extracted morpheme, the extraction unit 16 calculates a score for determining whether the morpheme is a congestion-related feature word. To do so, the extraction unit 16 calculates the appearance frequency of the extracted morpheme in the set of documents of event data whose congestion flag is "True" (the congested event set). It also calculates the appearance frequency of the extracted morpheme in the set of documents of all event data (the set of documents of event data whose congestion flag is "True" or "False" (second condition)). The extraction unit 16 then counts the number of documents in the congested event set and calculates the score from that document count, the appearance frequency of the morpheme in the congested event set, and the appearance frequency of the morpheme in the set of documents of all event data. For example, the score of a morpheme that is a candidate for a congestion-related feature word for the congested case is calculated by the following equation (1) (when the Dice coefficient is used). Here, $\mathrm{Score}_{\mathrm{word}}$ is the score of the candidate morpheme, $df(\mathrm{word\ in\ set}_{\mathrm{congestion}})$ is the appearance frequency of the morpheme in the congested event set, $num(\mathrm{set}_{\mathrm{congestion}})$ is the number of documents in the congested event set, and $df(\mathrm{word\ in\ set}_{\mathrm{event}})$ is the appearance frequency of the morpheme in the set of documents of all event data.

$$\mathrm{Score}_{\mathrm{word}} = \frac{2 \times df(\mathrm{word\ in\ set}_{\mathrm{congestion}})}{num(\mathrm{set}_{\mathrm{congestion}}) + df(\mathrm{word\ in\ set}_{\mathrm{event}})} \quad \cdots (1)$$
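The score of equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the extraction unit 16: the function name `dice_score` and its parameter names are hypothetical, the appearance frequency df is assumed to be counted as the number of documents containing the morpheme, and returning 0 for an empty corpus is an added assumption.

```python
def dice_score(df_in_target: int, num_target_docs: int, df_in_all: int) -> float:
    """Score of equation (1): 2 * df(word in target set) / (num(target set) + df(word in all events)).

    df_in_target    -- documents in the congested event set that contain the morpheme
    num_target_docs -- number of documents in the congested event set
    df_in_all       -- documents in the set of all event data that contain the morpheme
    """
    denominator = num_target_docs + df_in_all
    if denominator == 0:
        return 0.0  # assumption: an empty corpus yields a score of 0
    return 2 * df_in_target / denominator


# Hypothetical counts: the morpheme occurs in 3 of 4 congested documents
# and in no other document, so the score is 2 * 3 / (4 + 3).
score = dice_score(df_in_target=3, num_target_docs=4, df_in_all=3)
```

The same function would apply unchanged to a non-congested event set, since only the target set differs.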

The above equation is based not only on the set of documents whose congestion flag is "True", but also on the appearance frequency of the morpheme in the set of documents of all event data (that is, a set that also includes documents of event data whose congestion flag is not "True"). In other words, the score is calculated from the degree of appearance of the morpheme in the documents whose congestion flag is "True" and the degree of appearance of the morpheme in the documents of all event data, and the congestion-related feature words are extracted based on that score. If a morpheme appears frequently in the set of documents of all event data, its score does not become high even if it also appears frequently in the congested event set. Conversely, if the morpheme is a word peculiar to the congested event set, its score becomes high. Instead of the set of documents of all event data, the union of the set of documents of event data whose congestion flag is "True" and the set of documents of event data whose non-congestion flag is "True" may be used.
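As a worked illustration of this behavior, consider hypothetical counts (not taken from the embodiment): 100 documents in the congested event set, a morpheme A that appears in 60 congested documents and in no other document, and a morpheme B that appears in the same 60 congested documents but in 600 documents overall.

```latex
\mathrm{Score}_{A} = \frac{2 \times 60}{100 + 60} = 0.75,
\qquad
\mathrm{Score}_{B} = \frac{2 \times 60}{100 + 600} \approx 0.17
```

Under these assumed counts, both morphemes are equally frequent in the congested event set, yet only A, which is peculiar to that set, receives a high score.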

When extracting the congestion-related feature words for the non-congested case, the extraction unit 16 applies morphological analysis to the documents of events whose non-congestion flag is "True", extracts the resulting morphemes as candidates for congestion-related feature words for the non-congested case, and calculates the score shown in the following equation (2) for each morpheme. Here, $\mathrm{Score}_{\mathrm{word}}$ is the score of a morpheme that is a candidate for a congestion-related feature word for the non-congested case, $df(\mathrm{word\ in\ set}_{\mathrm{vacant}})$ is the appearance frequency of the morpheme in the non-congested event set (the set of documents of event data whose non-congestion flag is "True"), $num(\mathrm{set}_{\mathrm{vacant}})$ is the number of documents in the non-congested event set, and $df(\mathrm{word\ in\ set}_{\mathrm{event}})$ is the appearance frequency of the morpheme in the set of documents of all event data.

$$\mathrm{Score}_{\mathrm{word}} = \frac{2 \times df(\mathrm{word\ in\ set}_{\mathrm{vacant}})}{num(\mathrm{set}_{\mathrm{vacant}}) + df(\mathrm{word\ in\ set}_{\mathrm{event}})} \quad \cdots (2)$$

After calculating the score, if the score is equal to or higher than a predetermined score threshold (for example, 0.3), the extraction unit 16 stores the candidate morpheme having that score as a congestion-related feature word. In this way, the extraction unit 16 extracts the congestion-related feature words.
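The scoring and thresholding steps can be combined into one small routine. The following is a sketch under stated assumptions rather than the actual extraction unit 16: documents are assumed to be already split into morphemes, the appearance frequency is assumed to be a document count, and the name `extract_feature_words` and the sample morphemes are hypothetical.

```python
from collections import Counter


def extract_feature_words(target_docs, all_docs, threshold=0.3):
    """Return {morpheme: score} for morphemes whose score is >= threshold.

    target_docs -- documents of the target set (e.g. congestion flag "True"),
                   each given as a list of morphemes
    all_docs    -- documents of all event data, in the same form
    """
    # Count each morpheme at most once per document (document frequency).
    df_target = Counter(m for doc in target_docs for m in set(doc))
    df_all = Counter(m for doc in all_docs for m in set(doc))
    n_target = len(target_docs)

    selected = {}
    for morpheme, df_t in df_target.items():
        score = 2 * df_t / (n_target + df_all[morpheme])
        if score >= threshold:
            selected[morpheme] = score
    return selected


# Hypothetical, pre-tokenized documents.
congested = [["ticket", "stadium", "live"], ["ticket", "broadcast", "live"]]
vacant = [["live", "cd"] for _ in range(12)]
feature_words = extract_feature_words(congested, congested + vacant)
```

In this toy corpus "live" occurs in every document, so its score is 2 × 2 / (2 + 14) = 0.25 and it falls below the example threshold of 0.3, while "ticket" scores 1.0 and is kept.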

For example, the extraction unit 16 extracts "outfield reserved seat" and "ticket", which are included in the document of record R1 of the event database 15 shown in FIG. 4, as congestion-related feature words for the congested case (words suggesting that the scale of the event is large). The extraction unit 16 also extracts "mini live", "autograph session", and "CD", which are included in the document of record R2 of the event database 15, as congestion-related feature words for the non-congested case (words suggesting that the scale of the event is small). Likewise, the extraction unit 16 extracts "chike" (a colloquial abbreviation of "ticket"), which is included in the document of record R3, as a congestion-related feature word for the congested case, and extracts "live broadcast", which is included in the document of record R4, as a congestion-related feature word for the congested case. The extraction unit 16 further extracts "release" and "Tower Records", which are included in the document of record R5, as congestion-related feature words for the non-congested case. When the extraction of the congestion-related feature words is complete, the extraction unit 16 notifies the estimation unit 18 to that effect.

The estimation unit 18 acquires a document corresponding to the place whose congestion status is to be estimated, and estimates the congestion status of that place based on whether the congestion-related feature words extracted by the extraction unit 16 are included in the document.

First, the estimation unit 18 creates a discriminator (model) by machine learning, using data on past events to which a flag indicating the presence or absence of congestion has been attached (event data). Specifically, when the estimation unit 18 receives the notification from the extraction unit 16 that the extraction of the congestion-related feature words is complete, it refers to the event database 15 and acquires each item of event data. The estimation unit 18 also acquires, from the extraction unit 16, the congestion-related feature words for the congested case and those for the non-congested case.

Next, the estimation unit 18 generates a feature vector for congestion determination, in which each element corresponds to one of the congestion-related feature words for the congested case extracted by the extraction unit 16 and indicates whether that word is included in the document of the event data (for example, a vector that flags the presence or absence of each such word). The number of dimensions of this feature vector is the number of congestion-related feature words. Similarly, the estimation unit 18 generates a feature vector for non-congestion determination, in which each element corresponds to one of the congestion-related feature words for the non-congested case and indicates whether that word is included in the document of the event data.

The estimation unit 18 then generates learning data for congestion determination, with the feature vector for congestion determination as the explanatory variable and the congestion flag of the event data as the objective variable. It also generates learning data for non-congestion determination, with the feature vector for non-congestion determination as the explanatory variable and the non-congestion flag of the event data as the objective variable. The estimation unit 18 executes machine learning on the learning data for congestion determination and generates a model that takes as input a feature vector indicating whether a document includes each of the congestion-related feature words for the congested case, and outputs a congestion flag. Likewise, it executes machine learning on the learning data for non-congestion determination and generates a model that takes as input a feature vector indicating whether a document includes each of the congestion-related feature words for the non-congested case, and outputs a non-congestion flag.
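The feature vector described above is a binary presence vector with one element per congestion-related feature word. A minimal sketch follows, assuming the document has already been split into morphemes. The function name `to_feature_vector` and the sample tokens are hypothetical; the feature words are borrowed from the examples given for records R1 and R4.

```python
def to_feature_vector(document_morphemes, feature_words):
    """Return a 0/1 list with one element per feature word.

    An element is 1 when the corresponding feature word occurs in the
    document, so the dimension equals the number of feature words.
    """
    present = set(document_morphemes)
    return [1 if word in present else 0 for word in feature_words]


feature_words = ["outfield reserved seat", "ticket", "live broadcast"]
document = ["ticket", "on", "sale", "live broadcast"]
vector = to_feature_vector(document, feature_words)
```

Pairing each such vector with the congestion flag (or non-congestion flag) of the same event gives one row of the learning data.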

As the machine learning method, supervised machine learning such as a support vector machine (SVM), a neural network, Naive Bayes, or boosting can be used, for example.
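To make the learning step concrete, the sketch below trains a small Bernoulli Naive Bayes classifier on binary feature vectors. Naive Bayes is chosen here only because it is one of the methods named above and fits in a few lines; the embodiment does not state which method the estimation unit 18 uses. The class name, the Laplace smoothing, and the training data are all assumptions made for illustration.

```python
import math


class BernoulliNaiveBayes:
    """Minimal Bernoulli Naive Bayes over 0/1 feature vectors (illustrative only)."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_features = len(vectors[0])
        self.log_prior = {}
        self.p_feature = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            self.p_feature[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, vector):
        def log_posterior(c):
            total = self.log_prior[c]
            for x, p in zip(vector, self.p_feature[c]):
                total += math.log(p if x else 1 - p)
            return total

        return max(self.classes, key=log_posterior)


# Hypothetical learning data: feature vectors and congestion flags.
X = [[1, 1], [1, 0], [0, 0], [0, 0]]
y = [True, True, False, False]
model = BernoulliNaiveBayes().fit(X, y)
flag_for_new_document = model.predict([1, 0])
```

A second model trained on the non-congestion feature vectors and non-congestion flags would be built in the same way.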

The estimation unit 18 uses the model to generate information indicating the congestion state of the event described in the document that is the target of congestion estimation. When the estimation unit 18 acquires a document from the text output device as an estimation target document after the congestion-related feature words have been extracted (that is, after it has received from the extraction unit 16 the information that the congestion-related feature words have been extracted), it extracts event information from the document by the same method as described above, associates the event information with the document, and stores them in the estimation target document database 17.

The estimation unit 18 generates a feature vector indicating whether the document includes each of the congestion-related feature words for the congested case or for the non-congested case, and, using the feature vector as the input to the above model, outputs a congestion flag or a non-congestion flag. The estimation unit 18 stores the output congestion flag or non-congestion flag in association with the document stored in the estimation target document database 17. When another device requests confirmation of the congestion status of an event, the estimation unit 18 refers to the estimation target document database 17 and outputs the congestion status based on the congestion flag or non-congestion flag corresponding to that event.

なお、推定部１８は、上記のように、テキスト出力装置から取得した文書に示されるイベントの混雑状況を示す混雑フラグ又は非混雑フラグを記憶するだけでなく、当該混雑フラグ又は非混雑フラグに基づいて、イベントの混雑状況を示す文章の情報を生成し、生成した文章の情報をユーザへ提供するようにしてもよい。例えば、推定部１８は、当該文章の情報をテキスト出力装置へ送信し、テキスト出力装置が、ユーザへ当該文章の情報を送信するようにしてもよい。 As described above, the estimation unit 18 not only stores the congestion flag or the non-congestion flag indicating the congestion status of the event shown in the document acquired from the text output device, but also is based on the congestion flag or the non-congestion flag. Therefore, information on sentences indicating the congestion status of the event may be generated, and the information on the generated sentences may be provided to the user. For example, the estimation unit 18 may transmit the information of the sentence to the text output device, and the text output device may transmit the information of the sentence to the user.

このように、推定部１８は、テキスト出力装置から取得した文書に示されるイベントについて、混雑フラグ又は非混雑フラグを出力することにより、当該イベントの混雑状況を推定した結果を出力する。 In this way, the estimation unit 18 outputs the result of estimating the congestion status of the event by outputting the congestion flag or the non-congestion flag for the event shown in the document acquired from the text output device.

Next, FIG. 5 shows the hardware configuration of the estimation device 10 according to the present embodiment. The functional blocks (constituent units) of the estimation device 10 are realized by any combination of hardware and/or software. The means for realizing each functional block is not particularly limited. That is, each functional block may be realized by one physically and/or logically coupled device, or by two or more physically and/or logically separate devices that are connected directly and/or indirectly (for example, by wire and/or wirelessly).

例えば、本発明の一実施の形態における推定装置１０などは、混雑関連特徴語を抽出するコンピュータとして機能してもよい。上述の推定装置１０は、物理的には、プロセッサ１０１、メモリ１０２、ストレージ１０３、及び通信モジュール１０４などを含むコンピュータ装置として構成されてもよい。 For example, the estimation device 10 and the like according to the embodiment of the present invention may function as a computer for extracting congestion-related feature words. The estimation device 10 described above may be physically configured as a computer device including a processor 101, a memory 102, a storage 103, a communication module 104, and the like.

なお、以下の説明では、「装置」という文言は、回路、デバイス、ユニットなどに読み替えることができる。推定装置１０のハードウェア構成は、図に示した各装置を１つ又は複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following description, the word "device" can be read as a circuit, a device, a unit, or the like. The hardware configuration of the estimation device 10 may be configured to include one or more of the devices shown in the figure, or may be configured not to include some of the devices.

Each function of the estimation device 10 is realized by loading predetermined software (a program) onto hardware such as the processor 101 and the memory 102, so that the processor 101 performs computation and controls communication by the communication module 104 and the reading and/or writing of data in the memory 102 and the storage 103.

プロセッサ１０１は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ１０１は、周辺装置とのインターフェース、制御装置、演算装置、レジスタなどを含む中央処理装置（ＣＰＵ：Central Processing Unit）で構成されてもよい。例えば、上述の抽出部１６及び推定部１８などは、プロセッサ１０１で実現されてもよい。 The processor 101 operates, for example, an operating system to control the entire computer. The processor 101 may be composed of a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic unit, registers, and the like. For example, the extraction unit 16 and the estimation unit 18 described above may be realized by the processor 101.

また、プロセッサ１０１は、プログラム（プログラムコード）、ソフトウェアモジュールやデータを、ストレージ１０３及び／又は通信モジュール１０４からメモリ１０２に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態で説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。例えば、推定装置１０は、メモリ１０２に格納され、プロセッサ１０１で動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、１つのプロセッサ１０１で実行される旨を説明してきたが、２以上のプロセッサ１０１により同時又は逐次に実行されてもよい。プロセッサ１０１は、１以上のチップで実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されてもよい。 Further, the processor 101 reads a program (program code), a software module, and data from the storage 103 and / or the communication module 104 into the memory 102, and executes various processes according to these. As the program, a program that causes a computer to execute at least a part of the operations described in the above-described embodiment is used. For example, the estimation device 10 may be realized by a control program stored in the memory 102 and operated by the processor 101, and may be realized for other functional blocks as well. Although it has been explained that the various processes described above are executed by one processor 101, they may be executed simultaneously or sequentially by two or more processors 101. The processor 101 may be mounted on one or more chips. The program may be transmitted from the network via a telecommunication line.

The memory 102 is a computer-readable recording medium and may be composed of at least one of, for example, a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), and a RAM (Random Access Memory). The memory 102 may also be called a register, a cache, a main memory (main storage device), or the like. The memory 102 can store a program (program code), software modules, and the like that are executable to carry out the method according to an embodiment of the present invention.

The storage 103 is a computer-readable recording medium and may be composed of at least one of, for example, an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (for example, a compact disc, a digital versatile disc, or a Blu-ray (registered trademark) disc), a smart card, a flash memory (for example, a card, a stick, or a key drive), a floppy (registered trademark) disk, and a magnetic strip. The storage 103 may also be called an auxiliary storage device. The storage medium described above may be, for example, a database, a server, or another suitable medium that includes the memory 102 and/or the storage 103.

通信モジュール１０４は、有線及び／又は無線ネットワークを介してコンピュータ間の通信を行うためのハードウェア（送受信デバイス）であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカードなどともいう。 The communication module 104 is hardware (transmission / reception device) for communicating between computers via a wired and / or wireless network, and is also referred to as, for example, a network device, a network controller, or a network card.

また、プロセッサ１０１やメモリ１０２などの各装置は、情報を通信するためのバス１０５で接続される。バス１０５は、単一のバスで構成されてもよいし、装置間で異なるバスで構成されてもよい。 Further, each device such as the processor 101 and the memory 102 is connected by a bus 105 for communicating information. The bus 105 may be composed of a single bus, or may be composed of different buses between the devices.

また、推定装置１０は、マイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ：Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）、ＦＰＧＡ（Field Programmable Gate Array）などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部又は全てが実現されてもよい。例えば、プロセッサ１０１は、これらのハードウェアの少なくとも１つで実装されてもよい。以上が、本実施形態に係る推定装置１０の構成である。 Further, the estimation device 10 includes hardware such as a microprocessor, a digital signal processor (DSP: Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), and an FPGA (Field Programmable Gate Array). It may be configured, and the hardware may realize a part or all of each functional block. For example, the processor 101 may be implemented on at least one of these hardware. The above is the configuration of the estimation device 10 according to the present embodiment.

Next, the extraction method, which is the operation method of the estimation device 10 according to the present embodiment, is described with reference to the flowchart of FIG. 6. In this process, the information acquisition unit 11 first acquires a document to serve as an extraction source and registers it in the extracted document database 12 (step S1). At a predetermined timing, the information acquisition unit 11 extracts the event name, the POI name, and the holding period from the documents stored in the extracted document database 12. Based on the popularity registered in the POI database 13 for the extracted POI name, the information acquisition unit 11 generates event data by associating a congestion flag with the event name, the POI name, and the holding period, and registers the generated event data in the event database 15 (step S2).

Among the event data stored in the event database 15, the extraction unit 16 applies morphological analysis to the documents of event data whose congestion flag is "True", calculates a score for each morpheme, and extracts the morpheme as a congestion-related feature word if its score is equal to or higher than a predetermined threshold. Similarly, the extraction unit 16 applies morphological analysis to the documents of event data whose non-congestion flag is "True", calculates a score for each morpheme, and extracts the morpheme as a congestion-related feature word if its score is equal to or higher than a predetermined threshold. In this way, the extraction unit 16 extracts the congestion-related feature words (step S3). After the congestion-related feature words have been extracted, the estimation unit 18 acquires a document to serve as an estimation target document and stores it in the estimation target document database 17 (step S4). The estimation unit 18 generates a model based on the information stored in the event database 15 and the congestion-related feature words extracted by the extraction unit 16. The estimation unit 18 then extracts event information from the document stored in the estimation target document database 17 and, using the model, estimates the congestion status of the event extracted from the document (step S5).

The above embodiment describes the case where a single morpheme is used as a congestion-related feature word, but a phrase consisting of a plurality of words may also be used as a congestion-related feature word. For example, as shown in FIG. 7, when there is a document containing the sentence 「××に来たら人が多すぎて大変。」 ("When I came to XX, there were too many people and it was terrible."), the extraction unit 16 applies morphological analysis and splits the sentence into the morphemes 「××」, 「に」, 「来」, 「たら」, 「人」, 「が」, 「多」, 「すぎ」, 「て」, and 「大変。」. Next, the extraction unit 16 removes particles and punctuation marks by a known method, leaving 「××」, 「来」, 「たら」, 「人」, 「多」, 「すぎ」, and 「大変」 as the target morphemes. The extraction unit 16 then joins the suffixes (「たら」, 「すぎ」) to their stems (「来」, 「多」) by a known method, obtaining 「××」, 「来たら」 ("when I came"), 「人」 ("people"), 「多すぎ」 ("too many"), and 「大変」 ("terrible").

Next, the extraction unit 16 generates pairs of two morphemes from the above morphemes as candidates for congestion-related feature words (that is, it extracts 2-grams, treating each morpheme as a single unit). For example, the extraction unit 16 generates the candidate pairs (××, 来たら), (来たら, 人), (人, 多すぎ), and (多すぎ, 大変). Whether a generated pair is to be extracted as a congestion-related feature word is determined in the same way as described above, by calculating the score and judging based on it. As a result, if the score of the pair (人, 多すぎ) is equal to or higher than the threshold, the extraction unit 16 extracts that pair as a congestion-related feature word. In this way, the extraction unit 16 may extract a combination of a plurality of words as a congestion-related feature word. This makes it possible to extract combinations of words expressing, for example, 「人が多い」 ("many people"), 「すごい人」 ("a huge crowd"), 「黒山の人」 ("a dense crowd"), 「身動きできない」 ("cannot move"), 「長い行列」 ("a long queue"), and 「混雑がやばい」 ("the congestion is terrible").

That is, the group of congestion-related feature words may be defined or extracted not by single words but by "combinations of a specific word and a word appearing near it". For example, combining 「人」 with 「すごい」, 「多い」, and 「黒山」 yields 「すごい人」, 「人が多い」, and 「黒山の人」. Part of the set of morphemes forming an N-gram (a 2-gram in the above example) may be stored in advance, and pairs may be generated from the stored word and the morphemes obtained by morphological analysis of the document. Alternatively, all or part of the morpheme sets forming N-grams (2-grams in the above example) may be stored in advance, the score for each such set may be calculated by the same method as described above, and the congestion-related feature words may be extracted based on the calculated score.
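The 2-gram generation step above can be sketched as follows, starting from the morphemes that remain after particle removal and suffix joining in the FIG. 7 example. The function name `morpheme_bigrams` is hypothetical, and the morphological analysis itself is assumed to have been done by an external tool.

```python
def morpheme_bigrams(morphemes):
    """Return adjacent morpheme pairs, treating each morpheme as one unit."""
    return list(zip(morphemes, morphemes[1:]))


# Morphemes after particle removal and suffix joining in the FIG. 7 example.
morphemes = ["××", "来たら", "人", "多すぎ", "大変"]
candidates = morpheme_bigrams(morphemes)
```

Each resulting pair can be scored exactly like a single morpheme, by counting the documents in which the pair occurs.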

上述の実施形態では、イベントの場所の混雑を判断する場合について述べたが、文書に含まれる場所（例えば、観光スポット）を示す情報（場所を示す名称、位置）を抽出して、抽出した場所の混雑の有無を判断するようにしてもよい。この場合に、方式２により混雑判断する場合、文書が投稿された時刻が、上述のイベント開催期間を示す時刻に対応する。 In the above-described embodiment, the case of determining the congestion of the event location has been described, but the information (name, location indicating the location) indicating the location (for example, a tourist spot) included in the document is extracted and the extracted location. You may decide whether or not there is congestion. In this case, when the congestion is determined by the method 2, the time when the document is posted corresponds to the time indicating the above-mentioned event holding period.

上述の実施形態では、場所を示す情報（イベントの開催場所の情報）を含む文書を取得する場合について述べたが、場所に関連することが予め定められている文書を取得するようにしてもよい。 In the above-described embodiment, the case of acquiring the document including the information indicating the place (information of the venue of the event) has been described, but the document which is predetermined to be related to the place may be acquired. ..

上述の実施形態では、方式２として、イベント開催による動態人数の増分の大小により混雑していたか否かを判断する場合について述べたが、各イベント会場のチェックイン数の大小により混雑していたか否かを判断するようにしてもよい。 In the above-described embodiment, as method 2, a case where it is determined whether or not the event is crowded depending on the magnitude of the increase in the number of dynamic people due to the holding of the event has been described, but whether or not the event venue is crowded depending on the number of check-ins. You may try to judge.

上述したように、推定装置１０の情報取得部１１は、抽出元となる文書と、文書の内容に対応するイベント場所の混雑フラグ、非混雑フラグとを取得する。抽出部１６は、混雑フラグが予め設定された条件を満たす文書における語の出現度合いに基づいて、文書から混雑関連特徴語として抽出する。 As described above, the information acquisition unit 11 of the estimation device 10 acquires the document to be extracted and the congestion flag and non-congestion flag of the event location corresponding to the contents of the document. The extraction unit 16 extracts the congestion-related characteristic words from the document based on the degree of appearance of the words in the document in which the congestion flag satisfies the condition set in advance.

In this case, the estimation device 10 can appropriately extract congestion-related feature words peculiar to documents corresponding to places whose congestion status satisfies a predetermined condition. That is, among the feature words of events, the estimation device 10 can extract words suggesting that an event will be congested or words suggesting that it will not be. As a result, the estimation device 10 can estimate the congestion status of the place corresponding to a document containing information indicating the congestion estimation target (for example, an event description such as a document about the location of an event scheduled to be held in the future), based on whether that document includes the extracted words. Specifically, determining whether a document about the location of an event scheduled for the future includes the extracted words makes it possible to estimate a future congestion state. Likewise, determining whether a document about the location of an event currently being held or held in the past includes the extracted words makes it possible to estimate the current or past congestion state. In this way, the estimation device 10 can appropriately extract congestion-related feature words for estimating the congestion status of the target place based on the contents of a document containing information indicating the congestion estimation target.

Further, the extraction unit 16 extracts congestion-related characteristic words based on the degree of appearance of words in documents whose congestion flag satisfies a preset condition (congestion flag is True) and the degree of appearance of words in documents whose congestion flag satisfies another preset condition (congestion flag is False). In this case, because the estimation device 10 extracts congestion-related characteristic words based on the degree of appearance of words in sets of documents corresponding to places with mutually different congestion statuses, it can identify words that appear frequently only in documents corresponding to places whose congestion status satisfies one of the conditions. As a result, congestion-related characteristic words specific to the congestion status satisfying that condition can be appropriately extracted. Note that the extraction unit 16 may instead extract congestion-related characteristic words based only on the degree of appearance of words in documents whose congestion flag satisfies the preset condition (congestion flag is True). When the degree of appearance is equal to or higher than a predetermined threshold, the extraction unit 16 extracts the candidate as a congestion-related characteristic word.
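The comparison of appearance degrees described above can be sketched as follows. This is a minimal illustration, not the implementation of the specification: it assumes the documents are already split into words (for example, by morphological analysis), it uses document frequency as the "degree of appearance", and the function names and the thresholds `min_true` and `max_false` are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Return, for each word, the fraction of documents that contain it."""
    counts = Counter()
    for words in docs:
        counts.update(set(words))  # count each word once per document
    n = len(docs)
    return {w: c / n for w, c in counts.items()} if n else {}


def extract_characteristic_words(events, min_true=0.6, max_false=0.2):
    """Keep words frequent in congested documents and rare in the others.

    `min_true` and `max_false` are illustrative thresholds (assumptions).
    """
    true_docs = [e["words"] for e in events if e["congested"]]
    false_docs = [e["words"] for e in events if not e["congested"]]
    df_true = document_frequency(true_docs)
    df_false = document_frequency(false_docs)
    return sorted(
        w for w, f in df_true.items()
        if f >= min_true and df_false.get(w, 0.0) <= max_false
    )


# Hypothetical, already tokenized event descriptions with congestion flags.
events = [
    {"words": ["fireworks", "festival", "free"], "congested": True},
    {"words": ["fireworks", "river", "night"], "congested": True},
    {"words": ["lecture", "free", "library"], "congested": False},
    {"words": ["workshop", "library", "free"], "congested": False},
]
result = extract_characteristic_words(events)
print(result)
```

In this toy data, "free" is dropped because it also appears in every non-congested document, which is the kind of noise a plain frequency count would keep. Setting `max_false=1.0` reduces the sketch to the variant that looks only at documents whose congestion flag is True.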

Based on the content of an acquired document (a document stored in the extracted document database 12), the information acquisition unit 11 identifies the place corresponding to that content (the POI name in the event information), acquires an index value indicating the degree of congestion at that place (the popularity of the POI or the ratio of the dynamic number of people), and acquires the congestion flag by generating it based on the index value.

In this case, the estimation device 10 does not need to acquire the congestion flag from another device and can, with a simple configuration, extract words for estimating the congestion status of the place corresponding to the content of a document. Note that the information acquisition unit 11 may instead acquire, from another device, event data that includes a congestion flag or a non-congestion flag based on that device's determination of the congestion status, without itself determining the congestion status of the place corresponding to the content of the document.
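As a rough sketch of how a congestion flag might be generated from an index value, the following assumes that the index is a per-POI ratio of the number of people during the event to the usual number, and that a fixed threshold separates congested from non-congested places. The `Event` class, the lookup table, and the threshold of 1.5 are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Event:
    poi_name: str   # place identified from the document content
    document: str   # event description text


def congestion_flag(event: Event,
                    ratio_by_poi: Dict[str, float],
                    threshold: float = 1.5) -> Optional[bool]:
    """Return True/False from the index value, or None if no value exists.

    The threshold is an illustrative assumption.
    """
    ratio = ratio_by_poi.get(event.poi_name)
    if ratio is None:
        return None
    return ratio >= threshold


# Hypothetical index values keyed by POI name.
ratio_by_poi = {"Riverside Park": 2.3, "City Library": 0.9}
events = [
    Event("Riverside Park", "Summer fireworks festival"),
    Event("City Library", "Evening lecture"),
    Event("Unknown Hall", "Private meeting"),
]
flags = [congestion_flag(e, ratio_by_poi) for e in events]
print(flags)
```

Returning `None` for a place with no index value keeps such documents out of both the congested and the non-congested sets instead of silently treating them as non-congested.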

The extraction unit 16 extracts a combination of a plurality of words as a congestion-related characteristic word. In this case, it can extract words from which the congestion status cannot be estimated individually but can be estimated when they are combined.
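Extraction of word combinations can be sketched in the same way as single-word extraction, by counting co-occurring word pairs instead of single words. The sketch below assumes pre-tokenized documents, restricts combinations to pairs, and uses hypothetical thresholds. None of these choices is prescribed by the specification.

```python
from collections import Counter
from itertools import combinations


def pair_frequency(docs):
    """Return, for each unordered word pair, the fraction of documents containing both."""
    counts = Counter()
    for words in docs:
        counts.update(combinations(sorted(set(words)), 2))
    n = len(docs)
    return {p: c / n for p, c in counts.items()} if n else {}


def extract_word_pairs(congested_docs, other_docs, min_true=0.6, max_false=0.2):
    """Keep pairs frequent in congested documents and rare in the others."""
    pf_true = pair_frequency(congested_docs)
    pf_false = pair_frequency(other_docs)
    return sorted(
        p for p, f in pf_true.items()
        if f >= min_true and pf_false.get(p, 0.0) <= max_false
    )


# "free" and "concert" each appear in both sets, but only co-occur in congested ones.
congested_docs = [["free", "concert", "park"], ["free", "concert", "stage"]]
other_docs = [["free", "lecture"], ["concert", "ticket"]]
pairs = extract_word_pairs(congested_docs, other_docs)
print(pairs)
```

Here neither "free" nor "concert" would pass a single-word filter, because each also appears in half of the non-congested documents, yet the pair is specific to the congested set.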

The estimation unit 18 acquires a document indicating a congestion status estimation target and estimates the congestion status of the place corresponding to that document based on whether the congestion-related characteristic words extracted by the extraction unit 16 are contained in the document.

In this case, the congestion status of the place corresponding to the document can be appropriately estimated based on the result of determining whether the document contains congestion-related characteristic words. Moreover, because the estimation device 10 estimates using congestion-related characteristic words, it can present an estimated congestion status even when the statistical processing of the dynamic number of people is not performed in real time. In addition, the estimation device 10 can present easy-to-understand information if it displays a sentence describing the congestion status of the event based on the congestion flag, rather than a numerical value based on the dynamic number of people. Although the estimation device 10 itself uses the congestion-related characteristic words extracted by the extraction unit 16 to estimate the congestion status of the place corresponding to the document indicating the congestion status estimation target, the estimation process may instead be performed by another device.
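The estimation step can be illustrated with a simple membership test. This sketch assumes the target document is already split into words and that two word sets have been extracted beforehand, one suggesting congestion and one suggesting its absence. The function name, the returned labels, and the rule that congestion words take priority are hypothetical.

```python
def estimate_congestion(doc_words, congested_words, uncongested_words=frozenset()):
    """Estimate the congestion status from the presence of extracted words.

    Returns "congested", "not congested", or "unknown" (illustrative labels).
    """
    words = set(doc_words)
    if words & set(congested_words):
        return "congested"
    if words & set(uncongested_words):
        return "not congested"
    return "unknown"


# Hypothetical extracted words and target documents.
congested_words = {"fireworks", "parade"}
uncongested_words = {"lecture"}

status_future = estimate_congestion(["autumn", "fireworks", "riverside"],
                                    congested_words, uncongested_words)
status_quiet = estimate_congestion(["evening", "lecture"],
                                   congested_words, uncongested_words)
status_none = estimate_congestion(["flea", "market"],
                                  congested_words, uncongested_words)
print(status_future, status_quiet, status_none)
```

Because the check needs only the text of the target document, it applies equally to descriptions of future, ongoing, or past events.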

With a method that determines the presence or absence of congestion based on the popularity or scale of the POI where an event is held, the congestion status cannot be estimated appropriately, because events with different numbers of participants may be held at the same POI at the same time. In contrast, when the congestion status is estimated based on whether congestion-related characteristic words are contained in a document, as described above, the estimation relies on the presence or absence of characteristic words specific to the congestion status rather than on the popularity of the target place, so the congestion status can be estimated appropriately.

Likewise, when congestion is estimated using information on the dynamic number of people in the area that includes the POI where an event is held, the congestion status at the venue of that event can be estimated, but the congestion status of other events cannot. In contrast, when the congestion status is estimated based on whether congestion-related characteristic words are contained in a document, as described above, the estimation relies on the presence or absence of characteristic words specific to the congestion status rather than on the dynamic number of people at the target place, so the congestion status can be estimated appropriately.

Software, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name, should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like.

Software, instructions, and the like may also be transmitted and received via a transmission medium. For example, when software is transmitted from a website, a server, or another remote source using wired technologies such as coaxial cable, optical fiber cable, twisted pair, and digital subscriber line (DSL) and/or wireless technologies such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

The information, signals, and the like described in this specification may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

Note that terms described in this specification and/or terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

The terms "system" and "network" as used in this specification are used interchangeably.

The information, parameters, and the like described in this specification may be represented by absolute values, by values relative to a predetermined value, or by other corresponding information.

The term "determining" as used in this specification encompasses a wide variety of operations. "Determining" may include, for example, regarding judging, calculating, computing, processing, deriving, investigating, looking up (for example, looking up in a table, a database, or another data structure), or ascertaining as having "determined". "Determining" may also include regarding receiving (for example, receiving information), transmitting (for example, transmitting information), input, output, or accessing (for example, accessing data in a memory) as having "determined". "Determining" may further include regarding resolving, selecting, choosing, establishing, comparing, and the like as having "determined". In other words, "determining" may include regarding some operation as having "determined".

The phrase "based on" as used in this specification does not mean "based only on" unless expressly stated otherwise. In other words, the phrase "based on" means both "based only on" and "based at least on".

The term "means" in the configuration of each of the above devices may be replaced with "unit", "circuit", "device", or the like.

Where designations such as "first" and "second" are used in this specification, any reference to the elements so designated does not generally limit the quantity or order of those elements. These designations may be used in this specification as a convenient way of distinguishing between two or more elements. Accordingly, a reference to first and second elements does not mean that only two elements may be employed or that the first element must precede the second element in some way.

To the extent that "including", "comprising", and variations thereof are used in this specification or the claims, these terms are intended to be inclusive in the same manner as the term "provided with". Furthermore, the term "or" as used in this specification or the claims is intended not to be an exclusive OR.

In this specification, a reference to a device also covers a plurality of such devices, except where the context or the technology makes it clear that only one such device exists.

Each aspect/embodiment described in this specification may be applied to systems that use LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), or other suitable systems, and/or to next-generation systems extended on the basis of these.

The order of the processing procedures, sequences, flowcharts, and the like of each aspect/embodiment described in this specification may be changed as long as no contradiction arises. For example, the methods described in this specification present the elements of the various steps in an exemplary order and are not limited to the specific order presented.

Input and output information and the like may be stored in a specific location (for example, a memory) or managed in a management table. Input and output information and the like may be overwritten, updated, or appended. Output information and the like may be deleted. Input information and the like may be deleted. Input information and the like may be transmitted to another device.

Each aspect/embodiment described in this specification may be used alone, in combination, or switched during execution. Notification of predetermined information (for example, notification of "being X") is not limited to explicit notification and may be performed implicitly (for example, by not performing notification of the predetermined information).

Throughout this disclosure, both the singular and the plural are included unless the context clearly indicates the singular.

Although the present invention has been described in detail above, it is clear to those skilled in the art that the present invention is not limited to the embodiments described in this specification. The present invention can be implemented in modified and altered forms without departing from the gist and scope of the present invention as defined by the claims. Accordingly, the description in this specification is for illustrative purposes and has no restrictive meaning with respect to the present invention.

10 ... estimation device, 11 ... information acquisition unit, 12 ... extracted document database, 13 ... POI database, 14 ... dynamic number of people database, 15 ... event database, 16 ... extraction unit, 17 ... estimation target document database, 18 ... estimation unit, 101 ... processor, 102 ... memory, 103 ... storage, 104 ... communication module, 105 ... bus.

Claims

An extraction device that extracts congestion-related characteristic words, which are words used when estimating the congestion status of a congestion status estimation target place based on the content of a document corresponding to that place, the extraction device comprising:
an information acquisition unit that acquires a document serving as an extraction source and congestion information indicating the congestion status of the place corresponding to the content of the document; and
an extraction unit that extracts congestion-related characteristic words from the extraction-source document based on the degree of appearance of words in documents corresponding to places whose congestion status is such that the congestion information acquired by the information acquisition unit satisfies a preset condition,
wherein the information acquisition unit identifies, based on the content of the acquired document, the place corresponding to the content of the document, acquires an index value indicating either a popularity that represents the degree to which people gather at the place or the number of people at the place, and acquires the congestion information by generating, based on the index value, congestion information indicating the congestion status of the place of the document.

The extraction device according to claim 1, wherein the extraction unit extracts congestion-related characteristic words based on the degree of appearance of words in documents corresponding to places whose congestion status is such that the congestion information acquired by the information acquisition unit satisfies a preset first condition, and the degree of appearance of words in documents corresponding to places whose congestion status satisfies a second condition different from the first condition.

The extraction device according to claim 1 or 2, wherein the extraction unit extracts a combination of a plurality of words as a congestion-related characteristic word.

The extraction device according to any one of claims 1 to 3, further comprising an estimation unit that acquires a document corresponding to the congestion status estimation target place and estimates the congestion status of the place corresponding to that document based on whether the congestion-related characteristic words extracted by the extraction unit are contained in the document.