JP7078244B2

JP7078244B2 - Data processing device, data processing method, data processing system, and program

Info

Publication number: JP7078244B2
Application number: JP2017044433A
Authority: JP
Inventors: 清彦岩井
Original assignee: Spectee Inc
Current assignee: Spectee Inc
Priority date: 2017-03-08
Filing date: 2017-03-08
Publication date: 2022-05-31
Anticipated expiration: 2037-03-08
Also published as: JP2018147411A

Description

The present invention relates to a data processing device, a data processing method, a data processing system, and a program for processing post data collected from the Internet. More specifically, it relates to a technique for assigning a title, headline, or the like to a group of posts classified by content.

A large number of texts are posted on the Internet every day. Social networking services (SNS) in particular make it easy to post comments and photos, so information about incidents, accidents, disasters, and the like is sometimes posted in real time, and posted photos and videos have begun to be used in news reporting.

On the other hand, extracting information that can serve as a news source from the wide variety of texts posted on the Internet requires reading the posts one by one and checking their contents, which takes time and effort. In addition, news about incidents, accidents, and disasters must be timely, but learning of such events from Internet posts requires constant monitoring of the posts, which is inefficient.

For this reason, services that collect specific posts and articles from the Internet and provide them to users have recently attracted attention. Even with such a service, however, each text must still be checked to confirm whether it relates to the event of interest and whether it is reliable. One way to judge the content of texts written by others quickly and efficiently is to use the titles and headlines attached to them, and techniques for automatically assigning titles to posted articles and newspaper articles have been proposed (see Patent Documents 1 and 2).

For example, the method described in Patent Document 1 uses past articles in the same category as the input article to extract named entities as feature words for a title, and creates a title for the input article. The method described in Patent Document 2 extracts candidate sentences for a title from a newspaper article, extracts from those candidates the subject and predicate needed for a database title sentence, and creates the title sentence.

Japanese Unexamined Patent Publication No. 2010-191851
Japanese Unexamined Patent Publication No. H09-44497

However, each of the conventional title creation methods described above assigns a title representing the content of a single article. None of them assigns a title or headline to a post group made up of multiple posts by different contributors, or creates a short text such as a summary or article indicating its content. Consequently, even if the techniques described in Patent Documents 1 and 2 are applied, a short text indicating the content of a post group about an arbitrary event collected via the Internet cannot be created.

Accordingly, an object of the present invention is to provide a data processing device, a data processing method, a data processing system, and a program capable of creating a short text indicating the content of a post group about an incident, accident, disaster, or the like collected via the Internet.

A data processing device according to the present invention includes: a text analysis unit that analyzes each post in a post group composed of two or more posts about the same event collected via the Internet, and ranks the words contained in the post group by frequency of appearance; a short-text creation unit that selects, from the words contained in the post group, words that can identify the event common to the posts, and automatically creates a short text representing the content of the event using the selected words; and an event word database that stores nouns capable of identifying an event and is used when the short-text creation unit selects words that can identify the event. The event is an incident, an accident, or a disaster, and when selecting words that can identify the event, the short-text creation unit gives priority to words ranked higher by frequency of appearance.
The text analysis unit may include a part-of-speech decomposition unit that decomposes each post into parts of speech, a noun extraction unit that extracts nouns from the decomposed words, and a noun counting unit that ranks the extracted nouns by frequency of appearance. In that case, the short-text creation unit can create the short text using, for example, highly ranked nouns.
The short-text creation unit may include, for example, a word selection unit that selects, from the nouns contained in the post group, nouns that can identify the event common to the posts in the group, and a short-text generation unit that creates a short text using the nouns selected by the word selection unit. The word selection unit may be configured to compare the noun data stored in the event word database with the noun ranking data created by the text analysis unit and to give priority to highly ranked nouns.
In that case, the nouns that can identify the event may be nouns representing a region and/or nouns representing an event.
The data processing device of the present invention may further include a ranking data storage unit that stores the ranking data of word appearance frequency.
The short text is, for example, a title or a headline.

A data processing method according to the present invention is a method of automatically creating a short text using one or more data processing devices. The data processing device performs: a text analysis step of analyzing each post in a post group composed of two or more posts about the same event collected via the Internet, and ranking the words contained in the post group by frequency of appearance; and a short-text creation step of selecting, from the words contained in the post group, words that can identify the event common to the posts, and creating a short text representing the content of the event using the selected words. The event is an incident, an accident, or a disaster, and the selection of words that can identify the event in the short-text creation step uses an event word database storing nouns capable of identifying an event and gives priority to words ranked higher by frequency of appearance.
In the data processing method of the present invention, the text analysis step may include a step of decomposing each post into parts of speech, a step of extracting nouns from the decomposed words, and a step of ranking the extracted nouns by frequency of appearance, and the short-text creation step may create the short text using highly ranked nouns.
The short-text creation step may include a step of selecting, from the nouns contained in the post group, nouns that can identify the event common to the posts in the group, and a step of creating a short text using the selected nouns. In that case, the noun selection step may, for example, compare the noun data stored in the event word database with the noun ranking data created in the text analysis step and give priority to highly ranked nouns.

A data processing system according to the present invention includes: a text analysis device that analyzes each post in a post group composed of two or more posts about the same event collected via the Internet, and ranks the words contained in the post group by frequency of appearance; a short-text creation device that selects, from the words contained in the post group, words that can identify the event common to the posts, and automatically creates a short text representing the content of the event using the selected words; and an event word database that stores nouns capable of identifying an event and is used when the short-text creation device selects words that can identify the event. The event is an incident, an accident, or a disaster, and when selecting words that can identify the event, the short-text creation device gives priority to words ranked higher by frequency of appearance.
The data processing system of the present invention may further include a post classification device that classifies multiple posts collected via the Internet by event and creates a post group, which is a collection of text data composed of two or more posts about the same event. In that case, the text analysis device analyzes the post group created by the post classification device.
The data processing system of the present invention may also include a distribution device that adds the short-text data created by the short-text creation device to the data of the post group and distributes it externally.

A program according to the present invention causes a computer to execute: a text analysis function of analyzing each post in a post group composed of two or more posts about the same event collected via the Internet, and ranking the words contained in the post group by frequency of appearance; and a short-text creation function of selecting, from the words contained in the post group, words that can identify the event common to the posts, and automatically creating a short text representing the content of the event using the selected words. The event is an incident, an accident, or a disaster, and the selection of words that can identify the event in the short-text creation function uses an event word database storing nouns capable of identifying an event and gives priority to words ranked higher by frequency of appearance.

In the present invention, a "post group" refers to a collection of text data composed of two or more posts about an arbitrary event collected via the Internet.
An "event" in the present invention includes the various occurrences covered in news reports, such as incidents, accidents, and disasters, and the place of occurrence is not limited to Japan but may be overseas.

According to the present invention, the appearance frequency of specific words in a post group collected via the Internet is analyzed, and highly ranked words are extracted to generate a short text. This makes it possible to automatically create a short text indicating the content of the event to which the post group relates.

A conceptual diagram showing a configuration example of the data processing device of the first embodiment of the present invention.
A flowchart showing a method of performing data processing using the data processing device 1 shown in FIG. 1.
A diagram showing an example of ranking data.
A conceptual diagram showing a configuration example of the data processing system of the second embodiment of the present invention.

Embodiments for carrying out the present invention are described in detail below with reference to the accompanying drawings. The present invention is not limited to the embodiments described below.

(First Embodiment)
First, the data processing device according to the first embodiment of the present invention is described. FIG. 1 is a conceptual diagram showing a configuration example of the data processing device of this embodiment. As shown in FIG. 1, the data processing device 1 of this embodiment has a text analysis unit 2 and a short-text creation unit 3, and automatically creates short texts about a post group, such as titles and headlines. The data processing device 1 may also be provided with a ranking data storage unit 4, an event word database 5, and the like as needed.

[Text analysis unit 2]
The text analysis unit 2 analyzes each post in the input post group and ranks the words contained in it by frequency of appearance. The "post group" analyzed by the text analysis unit 2 is a collection of text data composed of two or more posts about an arbitrary event collected via the Internet. A "post" may be anything that can be collected via the Internet, for example information posted to an SNS, a blog, or an electronic bulletin board (Bulletin Board System: BBS). The length of the posts to be analyzed is not particularly limited; they may be long or short.

The configuration of the text analysis unit 2 is not particularly limited. For example, it may include a part-of-speech decomposition unit 21 that decomposes each post into parts of speech, a noun extraction unit 22 that extracts nouns from the decomposed words, and a noun counting unit 23 that ranks the extracted nouns by frequency of appearance. The text analysis unit 2 may instead be configured to extract and rank parts of speech other than nouns as well, or to rank all decomposed words and then extract only the nouns from the ranked words.

The method by which the text analysis unit 2 analyzes each post, that is, the processing performed in the part-of-speech decomposition unit 21, the noun extraction unit 22, the noun counting unit 23, and so on, is not particularly limited. Techniques such as text mining, word counting, and full-text search can be applied, for example.

[Short-text creation unit 3]
The short-text creation unit 3 creates a short text about the post group based on the ranking data, created by the text analysis unit 2, of the words (or nouns) contained in the post group. Short texts created by the short-text creation unit 3 include those that identify an event, such as titles and headlines, as well as summaries and articles that briefly describe the specific content of the event.

The method by which the short-text creation unit 3 automatically creates a short text is not particularly limited. For example, nouns representing specific content, such as a region or an event, can be selected from the highly ranked words (or nouns), and a sentence can be created using the selected nouns. In that case, the short-text creation unit 3 may be provided with a word selection unit 31 that, based on the ranking data of words (or nouns), selects nouns representing specific content from the nouns contained in the posts of the post group, and a short-text generation unit 32 that creates a short text using the nouns selected by the word selection unit 31.

[Ranking data storage unit 4]
The ranking data storage unit 4 stores the ranking data of words (or nouns) generated by the text analysis unit 2, and is provided as needed. In the data processing device 1 of this embodiment, the short-text creation unit 3 may create a short text immediately after the analysis by the text analysis unit 2. Alternatively, the analysis results of the text analysis unit 2 may be stored in the ranking data storage unit 4, and the short-text creation unit 3 may create a short text using the stored data.

[Event word database 5]
The event word database 5 stores nouns capable of identifying an event, and is provided as needed. Nouns capable of identifying an event include, for example, nouns frequently used to describe events such as incidents, accidents, and disasters, nouns specific to a given event, and nouns representing regions. The nouns stored in the event word database 5 are used when the word selection unit 31 of the short-text creation unit 3, for example, selects nouns representing specific content such as a region or an event.

[Operation]
Next, the operation of the data processing device 1 of this embodiment, that is, a method of automatically creating a short text about a post group using the data processing device 1, is described. In the data processing method of this embodiment, one or more data processing devices 1 perform a text analysis step and a short-text creation step to automatically create a short text about a post group. FIG. 2 is a flowchart showing the data processing method of this embodiment, and FIG. 3 is an example of ranking data.

[Text analysis step]
The text analysis step is performed mainly by the text analysis unit 2 of the data processing device 1. It analyzes each post in a post group about an arbitrary event collected via the Internet and ranks the words contained in the posts by frequency of appearance. The method of extracting and ranking words is not particularly limited. For example, as shown in FIG. 2, each post is decomposed into parts of speech (step S1), nouns are extracted from the decomposed words (step S2), and the extracted nouns are ranked by frequency of appearance (step S3).

For the part-of-speech decomposition in step S1, the text data of each post in the post group can be split into words using, for example, an existing Japanese morphological analysis system. For the noun extraction in step S2, machine learning or a Japanese morphological analysis system can be used to extract only the nouns from the decomposed words. For the ranking in step S3, as shown in FIG. 3, the number of times each extracted noun or other word is used (its appearance frequency) is counted, and ranks are assigned in descending order of that count.
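Steps S2 and S3 can be sketched as follows. This is a minimal illustration only, assuming that step S1 has already been performed by some morphological analyzer and that its output is available as (surface, part-of-speech) pairs. The sample posts, the tag names, and the function names `extract_nouns` and `rank_nouns` are hypothetical and do not come from the patent.

```python
from collections import Counter

# Hypothetical output of step S1: each post is a list of
# (surface form, part-of-speech tag) pairs. Illustrative data only.
TAGGED_POSTS = [
    [("Fukuoka", "noun"), ("in", "particle"), ("accident", "noun")],
    [("Fukuoka", "noun"), ("station", "noun"), ("accident", "noun")],
    [("Fukuoka", "noun"), ("shaking", "noun"), ("earthquake", "noun")],
]


def extract_nouns(tagged_post):
    """Step S2: keep only the tokens tagged as nouns."""
    return [surface for surface, pos in tagged_post if pos == "noun"]


def rank_nouns(tagged_posts):
    """Step S3: count nouns over the whole post group, highest count first."""
    counts = Counter()
    for post in tagged_posts:
        counts.update(extract_nouns(post))
    # most_common() returns (noun, count) pairs sorted by descending count.
    return counts.most_common()


ranking = rank_nouns(TAGGED_POSTS)
```

Counting over the whole group rather than per post is what lets words shared by many posts rise to the top of the ranking.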

In the text analysis step, parts of speech other than nouns, such as adverbs, may also be extracted and ranked, either together with the noun extraction or after it. Alternatively, all decomposed words may be ranked, and only the nouns may then be extracted from the ranked words.

[Short-text creation step]
The short-text creation step is performed mainly by the short-text creation unit 3 of the data processing device 1, and creates a short text about the post group based on the noun ranking data. The method is not particularly limited. For example, as shown in FIG. 2, nouns representing specific content are selected from the nouns contained in the posts of the post group (step S4), other parts of speech such as adverbs that specify a place or event are also selected as needed, and a sentence is created using the selected nouns and other words (step S5).

Specifically, in the noun selection step of step S4, nouns that can identify the event common to the posts in the post group, such as a "region" or an "event", are selected with priority given to highly ranked ones. The nouns can be selected, for example, by comparing the data on "nouns that can identify an event" stored in advance in the event word database 5 with the noun ranking data.

As a result, in the case of the ranking data shown in FIG. 3, for example, "Fukuoka" is selected as the "region" and "accident" is selected as the "event". Here, the highest-ranked noun, "accident", is selected as the first candidate for the noun representing the "event", but the next highest-ranked noun, "earthquake", can also be selected as a second candidate.
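The comparison between the event word database and the ranking data in step S4 can be sketched as follows. This is a hedged illustration, not the patented implementation: the dictionary layout of `EVENT_WORD_DB`, the category labels, the function `select_nouns`, and the counts in `RANKING` are all assumptions, and the counts are not the values of FIG. 3.

```python
# Hypothetical event word database: noun -> category it identifies.
EVENT_WORD_DB = {
    "Fukuoka": "region",
    "Tokyo": "region",
    "accident": "event",
    "earthquake": "event",
    "fire": "event",
}

# Illustrative ranking data, (noun, count) in descending order of count.
RANKING = [
    ("Fukuoka", 12),
    ("accident", 9),
    ("station", 7),
    ("earthquake", 4),
    ("Tokyo", 1),
]


def select_nouns(ranking, event_db, category, max_candidates=2):
    """Walk the ranking from the top and keep nouns registered under `category`."""
    selected = []
    for noun, _count in ranking:
        if event_db.get(noun) == category:
            selected.append(noun)
            if len(selected) == max_candidates:
                break
    return selected


region = select_nouns(RANKING, EVENT_WORD_DB, "region", max_candidates=1)
events = select_nouns(RANKING, EVENT_WORD_DB, "event")
```

Because the ranking is traversed from the top, the first match in each category is the highest-ranked noun, and later matches serve as second candidates.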

In addition, other parts of speech that can identify the event may be selected together with, or after, the noun selection described above. Specifically, adverbs that are important for specifying a place or an event (for example, "で" (de)) are selected in the same manner as nouns. This makes it possible to create more natural short texts.

Then, in the short-text generation step of step S5, a short text about the post group is created using the nouns and other parts of speech selected in the noun selection step. In the case of the ranking data shown in FIG. 3, for example, the short text "Possible accident in Fukuoka Prefecture", "Possible earthquake in Fukuoka Prefecture", or "Possible accident or earthquake in Fukuoka Prefecture" is created. In this way, the data processing device 1 of this embodiment can select not only the highest-ranked noun but also multiple highly ranked nouns, and can either combine the selected nouns into a single short text or create a separate short text for each noun.
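One simple way to realize step S5 is template filling, sketched below. The patent does not specify how the sentence is assembled, so the fixed English template and the function name `build_short_texts` are assumptions made purely for illustration.

```python
def build_short_texts(region, events):
    """Create one short text per event noun, plus a combined one if there are several.

    `region` is a single region name and `events` is a list of event nouns,
    ordered from the highest-ranked candidate down.
    """
    # One short text per selected event noun.
    short_texts = [f"Possible {event} in {region}" for event in events]
    # A single short text combining several highly ranked event nouns.
    if len(events) > 1:
        short_texts.append(f"Possible {' or '.join(events)} in {region}")
    return short_texts


texts = build_short_texts("Fukuoka Prefecture", ["accident", "earthquake"])
```

With the selections from the FIG. 3 example, this yields the three short texts quoted above; a production system would likely need language-specific templates or a generation model instead.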

The text analysis step and the short-text creation step described above may be performed consecutively, but they can also be performed independently. For example, the text analysis step may be performed on all posts or post groups, while the short-text creation step is performed selectively only where needed. Also, although the short-text creation step described above selectively extracts specific nouns from the decomposed words, the present invention is not limited to this. Any word that represents the post group may be used, and words other than nouns may be selected to create the short text.

The short text (text data) about the post group created through the steps described above is, for example, attached to the corresponding post group data (step S6) and distributed to users. Alternatively, the created short text alone, separately from the post group data, can be posted on a website or SNS and provided as information for specific users.
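Step S6 can be sketched as attaching the short text to a post group record before serializing it for distribution. The record layout (`event_id`, `posts`, `short_text`), the JSON format, and the function name `attach_short_text` are assumptions for illustration; the patent does not define a data format.

```python
import json


def attach_short_text(post_group, short_text):
    """Step S6: return a copy of the post group data with the short text added."""
    enriched = dict(post_group)  # shallow copy, leaves the input record unchanged
    enriched["short_text"] = short_text
    return enriched


# Hypothetical post group record.
group = {"event_id": "example-001", "posts": ["post A", "post B"]}

# Serialize the enriched record, e.g. as the body of a distribution message.
payload = json.dumps(
    attach_short_text(group, "Possible accident in Fukuoka Prefecture"),
    ensure_ascii=False,
)
```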

[Program]
Each of the steps described above can be carried out by creating a computer program that realizes the functions of the data processing device and installing it on one or more computers. That is, the data processing method of this embodiment can be executed by a program that causes a computer to execute a text analysis function, which analyzes each post in a post group about an arbitrary event collected via the Internet and ranks the words contained in the posts by frequency of appearance, and a short-text creation function, which automatically creates a short text about the post group based on the word ranking data.

In this computer program, the sentence analysis function may be realized by a function that decomposes each posted sentence into parts of speech, a function that extracts nouns from the decomposed words, and a function that ranks the extracted nouns by frequency of appearance. The short sentence creation function may be realized by a function that, based on the ranking data of the words (or nouns), selects nouns representing specific content from among the nouns contained in the posted sentences of the posted sentence group, and a sentence generation function that creates a short sentence using the selected nouns. The noun selection function may preferentially select higher-ranked nouns.
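The division of functions described above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each posted sentence has already been split into (word, part of speech) pairs by some morphological analyzer, and it uses a small in-memory dictionary as a stand-in for the event word database with the two categories named in the claims (region and event). The function names, the "noun" tag, and the output template are all hypothetical.

```python
from collections import Counter

def rank_nouns(posts):
    """Count nouns across every post in the group; most frequent first.

    `posts` is a list of posts, each a list of (word, part_of_speech)
    pairs. Producing these pairs is assumed to happen elsewhere.
    """
    counts = Counter(
        word
        for post in posts
        for word, pos in post
        if pos == "noun"
    )
    return counts.most_common()

def select_nouns(ranking, event_words):
    """Pick, for each category, the highest-ranked noun found in the
    (hypothetical) event word dictionary: noun -> "region" or "event"."""
    selected = {}
    for noun, _count in ranking:  # ranking is already sorted by frequency
        category = event_words.get(noun)
        if category is not None and category not in selected:
            selected[category] = noun
    return selected

def make_short_sentence(selected):
    """Fill a fixed template; a real generator could be far richer."""
    event = selected.get("event")
    region = selected.get("region")
    if event and region:
        return f"{event} in {region}"
    return event or region or ""
```

For example, given three posts in which "fire" appears three times and "Shibuya" twice, and a dictionary that lists "fire" as an event noun and "Shibuya" as a region noun, `make_short_sentence(select_nouns(rank_nouns(posts), event_words))` yields "fire in Shibuya". Words that appear in only one post, such as "lunch", rank low and are not in the dictionary, so they do not reach the short sentence.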

The functions described above need not all be contained in a single program; a separate program may be created for each function and the programs run in cooperation. In that case, the programs may be distributed across two or more computers or devices and operated there.

The posted sentences in a posted sentence group contain various information besides information about the specific event. Consequently, if the text data is analyzed post by post to create a short sentence that identifies the event, such as a title or a heading, information unrelated to the event lowers the accuracy of identification. In contrast, the data processing device of the present embodiment ranks the words contained in the posted sentences that make up the posted sentence group by frequency of appearance and creates the sentence from high-ranked words, so it can accurately identify the content of the event that the posted sentences discuss, the area where it occurred, and the like.

As a result, by using the data processing device and the data processing method of the present embodiment, short sentences such as headings, titles, summaries, and introductory texts that accurately express the content of posted sentence groups about incidents, accidents, disasters, and the like collected via the Internet can be created automatically. Moreover, when the data processing device of the present embodiment analyzes the latest posted sentence groups on an SNS, for which little time has passed since posting, events occurring in various places can be discovered early and their content provided to those who need it quickly and in an easily understood form.

The data processing device of the present embodiment is particularly effective for processing a "posted sentence group" composed of two or more posted sentences related to an arbitrary event collected via the Internet, but it can of course also be applied to a single posted sentence. In that case, only that one sentence is analyzed, its words are ranked, and a short sentence about the posted sentence is created based on the ranking data.

(Second embodiment)
Next, a data processing system according to a second embodiment of the present invention will be described. FIG. 4 is a conceptual diagram showing the configuration of the data processing system of the present embodiment. In the first embodiment described above, the analysis of each posted sentence in the posted sentence group and the creation of the short sentence are performed by the same device, but the present invention is not limited to this, and these steps may be performed by separate devices. That is, as shown in FIG. 4, the data processing system 10 of the present embodiment includes a sentence analysis device 20 and a short sentence creation device 30.

[Sentence analysis device 20]
The sentence analysis device 20 analyzes each of the posted sentences 6a and 6b of a posted sentence group related to an arbitrary event collected via the Internet 5, and ranks the words contained in the posted sentences 6a and 6b by frequency of appearance. In other respects, the configuration and operation of the sentence analysis device 20 are the same as those of the sentence analysis unit 2 of the first embodiment described above.

[Short sentence creation device 30]
The short sentence creation device 30 creates a short sentence about the posted sentence group based on the word ranking data created by the sentence analysis device 20. In other respects, the configuration and operation of the short sentence creation device 30 are the same as those of the short sentence creation unit 3 of the first embodiment described above.

[Posted sentence classification device 7]
The data processing system 10 of the present embodiment may be provided with a posted sentence classification device 7 that classifies a plurality of posted sentences collected via the Internet and creates a posted sentence group for each event. In that case, the posted sentence groups created by the posted sentence classification device 7 are input to the sentence analysis device 20 and analyzed there.

In the data processing system 10 of the present embodiment, only the necessary posted sentence groups among those classified by the posted sentence classification device 7 may be output to the sentence analysis device 20 for analysis. For example, the posted sentence classification device 7 may monitor for the occurrence of events and, when a change appears, collect the data from before and after that point, classify it, and send it to the sentence analysis device 20. This allows the data to be analyzed efficiently.
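The text does not specify how a "change" is detected. As one hedged illustration only, the Python sketch below treats a change as a time window whose post count is at least a chosen multiple of the average count in all earlier windows, and then gathers the posts from the windows around it. The window length, the threshold factor, the margin, and both function names are assumptions made for this example, not details taken from the embodiment.

```python
from collections import Counter

def find_change_windows(timestamps, window_sec=60, factor=3.0):
    """Return indices of windows whose post count is at least `factor`
    times the mean count of all earlier windows (a simple spike test)."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // window_sec) for t in timestamps)
    # Dense per-window series, including windows with no posts.
    series = [counts.get(i, 0) for i in range(max(counts) + 1)]
    changes = []
    for i in range(1, len(series)):
        baseline = sum(series[:i]) / i
        if baseline > 0 and series[i] >= factor * baseline:
            changes.append(i)
    return changes

def posts_around(posts, window_index, window_sec=60, margin=1):
    """Collect (timestamp, text) posts from `margin` windows before the
    given window through `margin` windows after it."""
    if not posts:
        return []
    start = min(ts for ts, _ in posts)
    lo = start + (window_index - margin) * window_sec
    hi = start + (window_index + margin + 1) * window_sec
    return [(ts, text) for ts, text in posts if lo <= ts < hi]
```

Only the posts returned by `posts_around` would then be classified and passed on for analysis, which is one way to avoid analyzing every post continuously.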

In addition, the data processing system 10 of the present embodiment may be provided with a distribution device that adds the data of the sentence created by the short sentence creation device 30 to the data of the posted sentence group and distributes it externally. In the data processing system 10 shown in FIG. 4, the posted sentence classification device 7 also serves as the distribution device: the sentence created by the short sentence creation device 30 is added to the posted sentence group data by the posted sentence classification device 7 and distributed externally via the Internet 5.
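How the short sentence is "added" to the posted sentence group data is not specified. The following Python sketch shows one possible shape for that data, assuming a simple record with an event identifier, the posts, and an optional short sentence, serialized as JSON for delivery. The record layout, field names, and JSON format are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class PostedSentenceGroup:
    """Hypothetical record for one group of posts about the same event."""
    event_id: str
    posts: List[str]
    short_sentence: Optional[str] = None  # filled in after creation

def attach_short_sentence(group, short_sentence):
    """Return a copy of the group with the created short sentence added."""
    return PostedSentenceGroup(group.event_id, list(group.posts), short_sentence)

def to_payload(group):
    """Serialize the group for external delivery; ensure_ascii=False keeps
    Japanese text readable instead of escaping it."""
    return json.dumps(asdict(group), ensure_ascii=False)
```

Returning a copy rather than mutating the input keeps the classified group unchanged, which is convenient when the classification and distribution roles are handled by the same device.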

The data processing system of the present embodiment collects posted sentences related to the same event, ranks the words contained in their text data by frequency of appearance, and creates a sentence from the high-ranked words, so it can automatically create a short sentence that accurately represents the event. In other respects, the configuration and effects of the present embodiment are the same as those of the first embodiment described above.

1 Data processing device
2 Sentence analysis unit
3 Short sentence creation unit
4 Ranking data storage unit
5 Event word database
6a, 6b Posted sentences
7 Posted sentence classification device
8 Internet
10 Data processing system
20 Sentence analysis device
21 Part-of-speech decomposition unit
22 Noun extraction unit
23 Noun count unit
30 Short sentence creation device
31 Word selection unit
32 Short sentence generation unit

Claims

A data processing device comprising:
a sentence analysis unit that analyzes each posted sentence of a posted sentence group consisting of two or more posted sentences related to the same event collected via the Internet, and ranks the words included in the posted sentence group by frequency of appearance;
a short sentence creation unit that selects, from the words included in the posted sentence group, a word that can identify the event common to the posted sentences, and automatically creates a short sentence expressing the content of the event using the selected word; and
an event word database in which nouns that can identify an event are stored and which is used when the short sentence creation unit selects the word that can identify the event,
wherein the event is an incident, an accident, or a disaster, and
wherein, in selecting the word that can identify the event, the short sentence creation unit preferentially selects a word having a high frequency of appearance.

The data processing device according to claim 1, wherein the sentence analysis unit comprises:
a part-of-speech decomposition unit that decomposes each posted sentence into parts of speech;
a noun extraction unit that extracts nouns from the words obtained by the part-of-speech decomposition; and
a noun count unit that ranks the extracted nouns by frequency of appearance,
and wherein the short sentence creation unit creates the short sentence using a noun having a high frequency of appearance.

The data processing device according to claim 1 or 2, wherein the short sentence creation unit comprises:
a word selection unit that selects, from among the nouns included in the posted sentence group, nouns that can identify the event common to the posted sentences of the posted sentence group; and
a short sentence generation unit that creates the short sentence using the nouns selected by the word selection unit,
and wherein the word selection unit compares the noun data stored in the event word database with the noun ranking data created by the sentence analysis unit, and preferentially selects the noun with the highest frequency of appearance.

The data processing device according to claim 3, wherein the noun that can identify the event is a noun representing a region and a noun representing an event.

The data processing device according to any one of claims 1 to 4, further comprising a ranking data storage unit that stores ranking data of the frequency of appearance of the words.

The data processing device according to any one of claims 1 to 5, wherein the short sentence is a title or a heading.

A data processing method for automatically creating a short sentence using one or more data processing devices, wherein the data processing device performs:
a sentence analysis step of analyzing each posted sentence of a posted sentence group consisting of two or more posted sentences related to the same event collected via the Internet, and ranking the words included in the posted sentence group by frequency of appearance; and
a short sentence creation step of selecting, from the words included in the posted sentence group, a word that can identify the event common to the posted sentences, and creating a short sentence expressing the content of the event using the selected word,
wherein the event is an incident, an accident, or a disaster, and
wherein, in selecting the word that can identify the event in the short sentence creation step, a word having a high frequency of appearance is preferentially selected using an event word database in which nouns that can identify the event are stored.

The data processing method according to claim 7, wherein the sentence analysis step includes:
a step of decomposing each posted sentence into parts of speech;
a step of extracting nouns from the words obtained by the part-of-speech decomposition; and
a step of ranking the extracted nouns by frequency of appearance,
and wherein the short sentence creation step creates the short sentence using a noun having a high frequency of appearance.

The data processing method according to claim 7 or 8, wherein the short sentence creation step includes:
a step of selecting, from the nouns included in the posted sentence group, a noun that can identify the event common to the posted sentences of the posted sentence group; and
a step of creating the short sentence using the selected noun,
and wherein, in the step of selecting the noun, the noun data stored in the event word database is compared with the noun ranking data created in the sentence analysis step, and the noun with the highest frequency of appearance is preferentially selected.

A data processing system comprising:
a sentence analysis device that analyzes each posted sentence of a posted sentence group consisting of two or more posted sentences related to the same event collected via the Internet, and ranks the words included in the posted sentence group by frequency of appearance;
a short sentence creation device that selects, from the words included in the posted sentence group, a word that can identify the event common to the posted sentences, and automatically creates a short sentence expressing the content of the event using the selected word; and
an event word database in which nouns that can identify an event are stored and which is used when the short sentence creation device selects the word that can identify the event,
wherein the event is an incident, an accident, or a disaster, and
wherein, in selecting the word that can identify the event, the short sentence creation device preferentially selects a word having a high frequency of appearance.

The data processing system according to claim 10, further comprising a posted sentence classification device that classifies a plurality of posted sentences collected via the Internet by event and creates a posted sentence group, which is a collection of text data composed of two or more posted sentences related to the same event,
wherein the sentence analysis device analyzes the posted sentence group created by the posted sentence classification device.

The data processing system according to claim 10 or 11, further comprising a distribution device for externally distributing short sentence data created by the short sentence creation device by adding it to the data of the posted sentence group.

A program that causes a computer to execute:
a sentence analysis function that analyzes each posted sentence of a posted sentence group consisting of two or more posted sentences related to the same event collected via the Internet, and ranks the words included in the posted sentence group by frequency of appearance; and
a short sentence creation function that selects, from the words included in the posted sentence group, a word that can identify the event common to the posted sentences, and automatically creates a short sentence expressing the content of the event using the selected word,
wherein the event is an incident, an accident, or a disaster, and
wherein, in selecting the word that can identify the event in the short sentence creation function, a word having a high frequency of appearance is preferentially selected using an event word database in which nouns that can identify the event are stored.