JP7553314B2

JP7553314B2 - Estimation device, estimation method, and program

Info

Publication number: JP7553314B2
Application number: JP2020172682A
Authority: JP
Inventors: 繁塩澤
Original assignee: Recruit Co Ltd
Current assignee: Recruit Co Ltd
Priority date: 2020-10-13
Filing date: 2020-10-13
Publication date: 2024-09-18
Anticipated expiration: 2040-10-13
Also published as: JP2022064137A

Description

本発明は、推定装置、推定方法及びプログラムに関する。 The present invention relates to an estimation device, an estimation method, and a program.

現在、研究者が自由に論文を投稿することが可能なサービスが提供されている。研究者は、投稿された論文を自由に閲覧することができ、自身の研究に利用することができる。特許文献１には、ユーザが収集した論文に基づいて、研究に関するユーザの興味を特定することが可能な技術が開示されている。 Currently, there are services that allow researchers to freely post papers. Researchers can freely view the posted papers and use them in their own research. Patent Document 1 discloses technology that can identify a user's research interests based on the papers collected by the user.

Patent Document 1: JP 2005-346225 A

投稿される論文の数は膨大であることから、注目されている最新技術をキャッチアップするために、ユーザが全ての論文を確認することは現実的ではない。そこで、最新技術に用いられる技術ワードを用いて論文を検索することで、確認する論文数を絞ることが考えられる。しかしながら、最新技術に用いられる技術ワードは辞書に掲載されておらず、かつユーザ自身も知らないことが多いため、技術ワードで論文を絞ること自体が困難である。なお、このような課題は、論文に限られず、書籍やオンライン文書等のあらゆる文章にも生じ得る。 Because the number of submitted papers is enormous, it is not realistic for users to check all of the papers in order to catch up on the latest technologies that are attracting attention. Therefore, it is possible to narrow down the number of papers to be checked by searching for papers using technical words used in the latest technologies. However, since the technical words used in the latest technologies are not listed in dictionaries and users themselves are often not familiar with them, narrowing down the papers using technical words is difficult in itself. Note that this issue is not limited to papers, but can occur with any type of text, such as books or online documents.

そこで、本発明は、複数の文章を分析することで、複数の文章で用いられる特定の用語を、辞書を利用することなく抽出することを可能とする技術を提供することを目的とする。 The present invention aims to provide a technology that can extract specific terms used in multiple sentences by analyzing multiple sentences without using a dictionary.

本発明の一態様に係る推定装置は、複数の文章の入力を受け付ける受付部と、複数の文章に含まれる、連続する複数の単語を含む単語群からＮグラムを生成する生成部と、生成されたＮグラムのうち隣接する２つのＮグラム間の類似度を評価することで、複数の文章で用いられる特定の用語を推定する推定部と、を有する。 The estimation device according to one aspect of the present invention includes a receiving unit that receives input of a plurality of sentences, a generating unit that generates N-grams from a word group including a plurality of consecutive words contained in the plurality of sentences, and an estimating unit that estimates a specific term used in the plurality of sentences by evaluating the similarity between two adjacent N-grams among the generated N-grams.

本発明によれば、複数の文章を分析することで、複数の文章で用いられる特定の用語を、辞書を利用することなく抽出することを可能とする技術を提供することができる。 The present invention provides a technology that can analyze multiple sentences and extract specific terms used in multiple sentences without using a dictionary.

FIG. 1 is a diagram showing an example of a document analysis system.
FIG. 2 is a diagram showing an example of the hardware configuration of the analysis device and the terminal.
FIG. 3 is a diagram showing an example of the functional block configuration of the analysis device.
FIG. 4 is a flowchart showing an example of a processing procedure performed by the analysis device.
FIG. 5 is a diagram for explaining an example of the process of estimating technical words.
FIG. 6 is a diagram showing an example of a graph, displayed on the terminal, of the number of papers and the change points.

添付図面を参照して、本発明の実施形態について説明する。なお、各図において、同一の符号を付したものは、同一又は同様の構成を有する。 The following describes an embodiment of the present invention with reference to the attached drawings. In each drawing, the same reference numerals denote the same or similar configurations.

<System Configuration>
FIG. 1 is a diagram showing an example of a document analysis system 1. The document analysis system 1 includes an analysis device 10 and a terminal 20. There is no limit to the number of terminals 20 included in the document analysis system 1. The analysis device 10 and the terminal 20 are connected via a wireless or wired communication network N and can communicate with each other.

分析装置１０は、インターネット等に公開されている多数の論文を分析することで、辞書を用いることなく、論文の中で用いられている技術ワード（特定の用語）を推定する。また、分析装置１０は、推定した技術ワードが論文の中で使用される頻度の推移に基づいて、最新技術の流行の兆しを検出する。 The analysis device 10 estimates technical words (specific terms) used in the papers without using a dictionary by analyzing a large number of papers published on the Internet, etc. Furthermore, the analysis device 10 detects signs of the latest technology trends based on changes in the frequency with which the estimated technical words are used in the papers.

分析装置１０は、１又は複数の物理的なサーバ等から構成されていてもよいし、ハイパーバイザー（hypervisor）上で動作する仮想的なサーバを用いて構成されていてもよいし、クラウドサーバを用いて構成されていてもよい。 The analysis device 10 may be configured with one or more physical servers, or may be configured with a virtual server that runs on a hypervisor, or may be configured with a cloud server.

端末２０は、分析装置１０による推定結果を表示する装置である。端末２０は、例えば、分析装置１０により推定された最新の技術ワードが論文の中で使用されている頻度の推移を時系列で示したグラフ等を表示する。端末２０は、パーソナルコンピュータ（ＰＣ）、ノートＰＣ、スマートフォン、タブレット端末、携帯電話機、携帯情報端末（ＰＤＡ）等である。 The terminal 20 is a device that displays the estimation results by the analysis device 10. The terminal 20 displays, for example, a graph showing the time series of changes in the frequency with which the latest technical words estimated by the analysis device 10 are used in papers. The terminal 20 is a personal computer (PC), a notebook PC, a smartphone, a tablet terminal, a mobile phone, a personal digital assistant (PDA), etc.

<Hardware Configuration>
FIG. 2 is a diagram showing an example of the hardware configuration of the analysis device 10 and the terminal 20. The analysis device 10 and the terminal 20 each include a processor 11 such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit); a storage device 12 such as a memory, an HDD (Hard Disk Drive), and/or an SSD (Solid State Drive); a communication IF (Interface) 13 that performs wired or wireless communication; an input device 14 that accepts input operations; and an output device 15 that outputs information. The input device 14 is, for example, a keyboard, a touch panel, a mouse, and/or a microphone. The output device 15 is, for example, a display, a touch panel, and/or a speaker.

<Functional Block Configuration>
FIG. 3 is a diagram showing an example of the functional block configuration of the analysis device 10. The analysis device 10 includes a storage unit 100, a reception unit 101, a generation unit 102, an estimation unit 103, a counting unit 104, a detection unit 105, and a learning unit 106. The storage unit 100 can be realized using the storage device 12 provided in the analysis device 10. The reception unit 101, the generation unit 102, the estimation unit 103, the counting unit 104, the detection unit 105, and the learning unit 106 can be realized by the processor 11 of the analysis device 10 executing a program stored in the storage device 12. The program can be stored in a storage medium. The storage medium storing the program may be a non-transitory computer readable medium. The non-transitory storage medium is not particularly limited, and may be, for example, a USB memory or a CD-ROM.

記憶部１００は、論文ＤＢ１１０と、技術ワードの推定に用いられる技術ワード推定用モデルＭ１１０と、類似する技術ワードをグルーピングする際に用いられるグループ推定用モデルＭ１２０とを記憶する。論文ＤＢ１１０は、例えば、インターネット等から取得した論文データを格納するデータベースである。なお、論文ＤＢ１１０は、分析装置１０が備える記憶装置１２に格納されていてもよいし、分析装置１０と通信可能な外部装置に格納されていてもよい。 The storage unit 100 stores a paper DB 110, a technical word estimation model M110 used to estimate technical words, and a group estimation model M120 used when grouping similar technical words. The paper DB 110 is a database that stores paper data acquired from the Internet, for example. The paper DB 110 may be stored in a storage device 12 provided in the analysis device 10, or in an external device that can communicate with the analysis device 10.

受付部１０１は、分析対象となる複数の論文（複数の文章）の入力を受け付ける。また、受付部１０１は、受け付けた複数の論文を、論文ＤＢ１１０に格納する。受付部１０１が受け付ける複数の論文の各々には、日付を示す情報が含まれている。当該日付を示す情報は、例えば、論文が投稿された年月日や、論文が作成された年月日であってもよい。 The reception unit 101 receives input of multiple papers (multiple sentences) to be analyzed. The reception unit 101 also stores the multiple papers it receives in the paper DB 110. Each of the multiple papers received by the reception unit 101 includes information indicating a date. The information indicating the date may be, for example, the date the paper was submitted or the date the paper was created.

生成部１０２は、分析対象となる複数の論文から得られる、連続する複数の単語を含む単語群からＮグラム（N-gram）を生成する。 The generation unit 102 generates an N-gram from a group of words containing multiple consecutive words obtained from multiple papers to be analyzed.

In this embodiment, an N-gram is a character string generated by dividing a sentence containing multiple words into units of N consecutive words. An N-gram is called a unigram (Uni-gram) when N is 1, a bigram (Bi-gram) when N is 2, a trigram (Tri-gram) when N is 3, a four-gram (Four-gram) when N is 4, a five-gram (Five-gram) when N is 5, a six-gram (Six-gram) when N is 6, and so on. Where n is the number of words contained in a word group, the generation unit 102 generates multiple N-grams, one set for each N from 1 to n.

推定部１０３は、生成されたＮグラムのうち隣接する２つのＮグラム間の類似度を評価することで、分析対象となる複数の論文で用いられる技術ワード（特定の用語）を推定する。また、推定部１０３は、推定した技術ワードが複数存在する場合、複数の技術ワードの各々の分散表現（ベクトル）に基づいて類似度を評価することで、類似する技術ワードをまとめたグループ（技術ワードのグループ）を推定する。 The estimation unit 103 estimates technical words (specific terms) used in the multiple papers to be analyzed by evaluating the similarity between two adjacent N-grams among the generated N-grams. Furthermore, when there are multiple estimated technical words, the estimation unit 103 estimates a group of similar technical words (a group of technical words) by evaluating the similarity based on the distributed representation (vector) of each of the multiple technical words.

集計部１０４は、分析対象となる複数の論文の中から、技術ワードを含む論文の数を所定期間ごとに集計する。また、集計部１０４は、分析対象となる複数の論文の中から、技術ワードのグループのうち、少なくともいずれか１つの技術ワードを含む論文の数を所定期間ごとに集計するようにしてもよい。 The counting unit 104 counts the number of papers that contain a technical word from among the multiple papers to be analyzed for each specified period. The counting unit 104 may also be configured to count the number of papers that contain at least one technical word from a group of technical words from among the multiple papers to be analyzed for each specified period.
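The counting performed by the counting unit 104 could look roughly like the following minimal sketch. It assumes that the "specified period" is a calendar month, that each paper is a `(date, text)` pair, and that a paper "contains" a technical word when the word appears as a case-insensitive substring. None of these choices is stated in the description, and the function name is hypothetical.

```python
from collections import Counter
from datetime import date


def count_papers_by_month(papers, technical_words):
    """Count, per (year, month), the papers containing at least one of the words.

    papers: iterable of (date, text) pairs (assumed representation).
    technical_words: a single-word list or a group of technical words.
    """
    lowered = [word.lower() for word in technical_words]
    counts = Counter()
    for paper_date, text in papers:
        body = text.lower()
        # A paper is counted once even if several words of the group appear in it.
        if any(word in body for word in lowered):
            counts[(paper_date.year, paper_date.month)] += 1
    return dict(sorted(counts.items()))


# Illustrative data only; not taken from the description.
sample_papers = [
    (date(2020, 1, 5), "We train a generative adversarial network."),
    (date(2020, 1, 20), "A survey of graph algorithms."),
    (date(2020, 2, 3), "Generative adversarial network for image synthesis."),
]
monthly_counts = count_papers_by_month(sample_papers, ["generative adversarial network"])
```

Passing a list with several words covers the group case, where a paper is counted if it contains at least one technical word of the group.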

The detection unit 105 detects a change point in the time series at which the number of papers containing a technical word increases, based on the number of papers for each specified period counted by the counting unit 104. The detection unit 105 may also detect a change point in the time series at which the number of papers containing at least one technical word from a group of technical words increases, based on the number of papers for each specified period counted by the counting unit 104.

学習部１０６は、論文ＤＢ１１０に格納された論文データを用いて、技術ワード推定用モデルＭ１１０及びグループ推定用モデルＭ１２０を学習させる。 The learning unit 106 uses the paper data stored in the paper DB 110 to train the technical word estimation model M110 and the group estimation model M120.

<Processing Procedure>
FIG. 4 is a flowchart showing an example of a processing procedure performed by the analysis device 10. With reference to FIG. 4, the following describes the series of processing steps in which the analysis device 10 reads a large amount of paper data, analyzes it, and displays the technical words that are attracting attention on the screen of the terminal 20.

ステップＳ１０で、受付部１０１は、分析対象となる複数の論文の入力を受け付け、論文ＤＢ１１０に格納する。例えば、受付部１０１は、インターネット上で提供されている、研究者が論文を自由に投稿可能なサービスにアクセスし、過去（過去全てでもよいし、過去５年間など一部の期間であってもよい）に投稿された論文をダウンロードして論文ＤＢ１１０に格納するようにしてもよい。また、論文の全てをダウンロードして論文ＤＢ１１０に格納するのではなく、論文の要約（Abstract）部分のみ又は本文のみをダウンロードして論文ＤＢ１１０に格納するようにしてもよい。 In step S10, the reception unit 101 receives input of multiple papers to be analyzed and stores them in the paper DB 110. For example, the reception unit 101 may access a service provided on the Internet that allows researchers to freely post papers, download papers posted in the past (which may be the entire past, or a partial period such as the past five years), and store them in the paper DB 110. Also, rather than downloading all of the papers and storing them in the paper DB 110, only the abstract portion or the main text of the papers may be downloaded and stored in the paper DB 110.

ステップＳ１１で、生成部１０２は、論文ＤＢ１１０に格納されている論文データのうち、分析対象となる論文に含まれる各文章を、単語に分解する。例えば、生成部１０２は、日本語については形態素解析を行うことで単語に分解し、英語についてはスペースを単語の区切りとして認識する。 In step S11, the generation unit 102 breaks down each sentence included in the paper to be analyzed from among the paper data stored in the paper DB 110 into words. For example, the generation unit 102 breaks down Japanese into words by performing morphological analysis, and for English, recognizes spaces as word separators.

In step S12, the generation unit 102 performs cleansing to delete unnecessary characters and words (for example, articles, subjects, conjunctions, and be-verbs), leaving the words that could be technical words. At this time, the generation unit 102 splits the sentence at each position where a deleted character or word was located, so that each word group, consisting of the one or more words contained in a split-off portion, can be recognized. In addition, stemming is applied to each word, so that for verbs, adjectives, and other words whose endings inflect, only the stem is kept and everything other than the stem is removed.

ここで、「In this paper, we propose the simple Generative Adversarial Network model which allows long-range dependency modeling for image generation tasks.」という文章を例に、ステップＳ１１及びステップＳ１２の処理手順について具体例を説明する。 Here, we will explain a specific example of the processing procedures of steps S11 and S12 using the sentence "In this paper, we propose the simple Generative Adversarial Network model which allows long-range dependency modeling for image generation tasks." as an example.

First, the generation unit 102 breaks the sentence down into words by recognizing spaces and punctuation marks (commas, periods, etc.) as word separators. Next, the generation unit 102 performs cleansing and inserts a delimiter (":" for the purposes of this explanation) so that the positions where deleted characters or words were located can be recognized. As a result, the generation unit 102 outputs the character string "paper : propose : simple Generative Adversarial Network model : allow long-range dependency modeling : image generation tasks".

Next, the generation unit 102 performs stemming to keep only the stems. As a result, the generation unit 102 outputs the character string "paper : propose : simple Gener Adversar Network model : allow long-range depend model : image gener task".
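The tokenizing and cleansing part of steps S11 and S12 for English text might be sketched as follows. The stop-word list is hypothetical (the description names only categories such as articles and conjunctions), and treating punctuation as a group boundary is an assumption. Stemming is left out because it would need a stemmer from a third-party library, so the output keeps inflected forms such as "allows" and "tasks".

```python
import re

# Hypothetical stop-word list; the description does not enumerate the deleted words.
STOP_WORDS = {"in", "this", "we", "the", "which", "for", "a", "an", "is", "are", "and", "or"}


def split_into_word_groups(sentence, stop_words=STOP_WORDS):
    """Tokenize a sentence and split it into word groups at deleted tokens."""
    # Words may contain internal hyphens ("long-range"); punctuation becomes its own token.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[,.;:!?]", sentence)
    groups, current = [], []
    for token in tokens:
        is_deleted = token.lower() in stop_words or not token[0].isalnum()
        if is_deleted:
            # A deleted token closes the current word group, like the ":" delimiter.
            if current:
                groups.append(current)
                current = []
        else:
            current.append(token)
    if current:
        groups.append(current)
    return groups


example = ("In this paper, we propose the simple Generative Adversarial Network model "
           "which allows long-range dependency modeling for image generation tasks.")
word_groups = split_into_word_groups(example)
delimited = " : ".join(" ".join(group) for group in word_groups)
```

Each inner list corresponds to one delimiter-separated portion of the string in the example, so single-word groups and multi-word groups can be told apart by their length.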

In step S13, the estimation unit 103 estimates technical words. For any portion of the text generated in step S12 that is separated by delimiters and contains only one word, the estimation unit 103 estimates that single word to be a technical word. For example, for the character string "paper : propose : simple Generative Adversarial Network model : allow long-range dependency modeling : image generation tasks", the estimation unit 103 estimates that "paper" and "propose" are technical words.

続いて、推定部１０３は、区切り文字で区切られた部分に複数の単語が含まれる単語群について、各単語が技術ワードなのか、若しくは複数の連続する単語からなる熟語が技術ワードなのかを推定する。まず、生成部１０２は単語群ごとにＮグラムを生成する。 Next, for a word group in which multiple words are contained in a portion separated by a delimiter, the estimation unit 103 estimates whether each word is a technical word, or whether a phrase consisting of multiple consecutive words is a technical word. First, the generation unit 102 generates an N-gram for each word group.

以下、単語群「simple Generative Adversarial Network model」のＮグラムを生成する例を説明する。当該単語群には５つの単語が含まれるため、生成部１０２は、１～５までの複数のＮグラムを生成する。具体的には、ユニグラムについては、“simple”、“Generative”、“Adversarial”、“Network”、“model”という５つの文字列を生成する。バイグラムについては、“simple Generative”、“Generative Adversarial”、“Adversarial Network”、“Network model”という４つの文字列を生成する。トリグラムについては、“simple Generative Adversarial”、“Generative Adversarial Network”、“Adversarial Network model”という３つの文字列を生成する。フォーグラムについては、“simple Generative Adversarial Network”、“Generative Adversarial Network model”という２つの文字列を生成する。ファイブグラムについては、“simple Generative Adversarial Network model”という１つの文字列を生成する。 Below, an example of generating an N-gram for the word group "simple Generative Adversarial Network model" will be described. Since the word group includes five words, the generation unit 102 generates multiple N-grams, 1 to 5. Specifically, for unigrams, five character strings are generated: "simple", "Generative", "Adversarial", "Network", and "model". For bigrams, four character strings are generated: "simple Generative", "Generative Adversarial", "Adversarial Network", and "Network model". For trigrams, three character strings are generated: "simple Generative Adversarial", "Generative Adversarial Network", and "Adversarial Network model". For fourgrams, two character strings are generated: "simple Generative Adversarial Network" and "Generative Adversarial Network model". For fivegrams, one character string is generated: "simple Generative Adversarial Network model".
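The N-gram generation just described can be sketched in a few lines. The function name and the dictionary layout (N mapped to a list of strings) are illustrative choices, not part of the description.

```python
def generate_ngrams(words):
    """Return {N: [N-gram strings]} for every N from 1 to len(words)."""
    n = len(words)
    return {
        size: [" ".join(words[start:start + size]) for start in range(n - size + 1)]
        for size in range(1, n + 1)
    }


word_group = ["simple", "Generative", "Adversarial", "Network", "model"]
ngrams = generate_ngrams(word_group)
```

For a word group of n words this yields n - N + 1 strings for each N, which matches the counts of five, four, three, two, and one in the example above.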

Next, the estimation unit 103 estimates technical words using the generated N-grams. Starting with the N-grams whose N is one less than the maximum value of N, the estimation unit 103 determines whether the similarity between two adjacent N-grams (which may also be called a pair of adjacent N-grams) is equal to or greater than a predetermined threshold (first threshold). If no pair of adjacent N-grams has a similarity equal to or greater than the first threshold, the estimation unit 103 decrements N by one more and repeats the same determination, continuing until N reaches 1. If a pair of adjacent N-grams with a similarity equal to or greater than the first threshold exists, the estimation unit 103 estimates as a technical word the phrase formed by combining, in order, the words contained in the N-gram with an N value one larger that corresponds to the two adjacent N-grams.

In the case of unigrams, the "two adjacent N-grams" are a pair of two consecutive words in the word group, and the "phrase formed by combining, in order, the words contained in the N-gram with an N value one larger that corresponds to the two adjacent N-grams" is the phrase consisting of those two consecutive words. In the case of bigrams or higher, with 1 ≦ p < n (where p is a natural number satisfying p + N ≦ n, and n is the number of words in the word group), the "two adjacent N-grams" are the pair consisting of the N-gram made up of the p-th through (p+N-1)-th consecutive words of the word group and the N-gram made up of the (p+1)-th through (p+N)-th consecutive words. The "phrase formed by combining, in order, the words contained in the N-gram with an N value one larger that corresponds to the two adjacent N-grams" is then the phrase made up of the p-th through (p+N)-th consecutive words, that is, N+1 words.

例えば、単語群“ＡＢＣＤ”（Ａ～Ｄは単語を意味する）について、バイグラムの場合の「隣接する２つのＮグラム」を求めるとする。この場合、ｎ＝４かつＮ＝２であることから、ｐ＋２≦４を満たすｐの値は１又は２である。なお、ｐ＋Ｎ≦ｎを満たすｐの値が複数存在する場合、推定部１０３は、複数の値の各々について、隣接する２つのＮグラム間の類似度が所定閾値（第１閾値）以上であるか否かを判定する。 For example, suppose that the "two adjacent N-grams" in the case of bigrams are to be found for the word group "A B C D" (A to D represent words). In this case, since n = 4 and N = 2, the value of p that satisfies p + 2 ≦ 4 is 1 or 2. Note that if there are multiple values of p that satisfy p + N ≦ n, the estimation unit 103 determines for each of the multiple values whether the similarity between the two adjacent N-grams is equal to or greater than a predetermined threshold (first threshold).

ｐ＝１とする場合、「隣接する２つのＮグラム」は、１番目から（１＋２－１）番目までの連続する単語と、（１＋１）番目から（１＋２）番目までの連続する単語、つまり、“ＡＢ”と“ＢＣ”になる。また、「隣接する２つのＮグラムに対応する、Ｎの値が１つ大きいＮグラムに含まれる単語を順に組み合わせた熟語」とは、１番目から（１＋２）番目までの連続する単語、つまり、“ＡＢＣ”になる。 When p=1, the "two adjacent N-grams" are the consecutive words from the 1st to the (1+2-1)th and the consecutive words from the (1+1)th to the (1+2)th, i.e., "A B" and "B C." Also, the "phrase formed by combining, in order, the words contained in the N-gram with an N value one larger that corresponds to the two adjacent N-grams" is the consecutive words from the 1st to the (1+2)th, i.e., "A B C."

ｐ＝２とする場合、「隣接する２つのＮグラム」は、２番目から（２＋２－１）番目までの連続する単語と、（２＋１）番目から（２＋２）番目までの連続する単語、つまり、“ＢＣ”と“ＣＤ”になる。また、「隣接する２つのＮグラムに対応する、Ｎの値が１つ大きいＮグラムに含まれる単語を順に組み合わせた熟語」とは、２番目から（２＋２）番目までの連続する単語、つまり、“ＢＣＤ”になる。 When p=2, the "two adjacent N-grams" are the consecutive words from the second to the (2+2-1)th word and the consecutive words from the (2+1)th to the (2+2)th word, i.e., "B C" and "C D." Also, the "phrase formed by combining, in order, the words contained in the N-gram with an N value one larger that corresponds to the two adjacent N-grams" is the consecutive words from the second to the (2+2)th word, i.e., "B C D."
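The index arithmetic in the "A B C D" example can be checked with a short sketch. It enumerates every p satisfying p + N ≦ n and returns, for each p, the two adjacent N-grams together with the corresponding phrase of N+1 words. The function name is hypothetical.

```python
def adjacent_pairs(words, N):
    """List (left N-gram, right N-gram, merged phrase) for each valid 1-based p."""
    n = len(words)
    result = []
    for p in range(1, n - N + 1):  # all p with p + N <= n
        left = words[p - 1:p + N - 1]    # p-th through (p+N-1)-th words
        right = words[p:p + N]           # (p+1)-th through (p+N)-th words
        merged = words[p - 1:p + N]      # p-th through (p+N)-th words
        result.append((" ".join(left), " ".join(right), " ".join(merged)))
    return result


pairs = adjacent_pairs(["A", "B", "C", "D"], 2)
```

With N = 1 the same function returns pairs of consecutive single words, so the unigram case does not need separate handling.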

ここで、推定部１０３は、隣接する２つのＮグラム間の類似度を、入力された２つの単語間の類似度を出力する技術ワード推定用モデルＭ１１０を用いて判定する。技術ワード推定用モデルＭ１１０は、論文ＤＢ１１０に格納されている論文データを用いて予め生成された、単語間の類似度を出力する学習済みモデルである。技術ワード推定用モデルＭ１１０は、例えば、word2vecと呼ばれる技術を用いて生成することが可能である。 Here, the estimation unit 103 determines the similarity between two adjacent N-grams using a technical word estimation model M110 that outputs the similarity between two input words. The technical word estimation model M110 is a trained model that outputs the similarity between words, and is generated in advance using paper data stored in the paper DB 110. The technical word estimation model M110 can be generated, for example, using a technology called word2vec.

As shown in FIG. 3, in this embodiment the technical word estimation model M110 includes a unigram model M111 that outputs similarities between unigrams, a bigram model M112 that outputs similarities between bigrams, a trigram model M113 that outputs similarities between trigrams, and a four-gram model M114 that outputs similarities between four-grams. Note that the models included in the technical word estimation model M110 are merely examples, and models with even larger values of N, such as a five-gram model or a six-gram model, may also be included.

Here, a method for generating the bigram model M112 will be described. For each sentence of the paper data stored in the paper DB 110, the processing described for steps S11 and S12 is performed, a new sentence is created by joining the words of the sentence two at a time, and word2vec is trained on the created sentences. For example, when the sentence "paper propose simple Gener Adversar Network model allow long-range depend model image gener task" output by the processing of steps S11 and S12 exists, the learning unit 106 generates the sentence "paper_propose propose_simple simple_Gener Gener_Adversari Adversari_Network Network_model model_allow allow_long-range long-range_dependency dependency_modeling modeling_image image_generation generation_task" and trains word2vec on it. The learning unit 106 repeats this process for every sentence included in the paper data stored in the paper DB 110. As a result, a distributed representation (vector) is determined for each bigram formed by joining two words in the paper data stored in the paper DB 110, so the estimation unit 103 can use the trained word2vec to evaluate the similarity between two bigrams based on those distributed representations. Note that word2vec is merely an example, and any technology that evaluates the similarity between words based on distributed representations can be used.
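Preparing the training sentences for an N-gram model might look like the sketch below, which joins each run of N consecutive words with an underscore so that every N-gram becomes a single token. The function name is hypothetical, and the training step itself is omitted: the resulting token lists would be handed to whatever word-embedding trainer is used.

```python
def to_joined_ngram_tokens(words, size):
    """Turn a word list into tokens in which `size` consecutive words are joined by '_'."""
    return ["_".join(words[start:start + size]) for start in range(len(words) - size + 1)]


stemmed = "paper propose simple Gener Adversar Network model".split()
bigram_tokens = to_joined_ngram_tokens(stemmed, 2)
training_sentence = " ".join(bigram_tokens)
```

Calling the same function with `size` set to 1, 3, or 4 gives the input for the unigram, trigram, and four-gram models respectively.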

FIG. 5 is a diagram for explaining an example of the process of estimating technical words. With reference to FIG. 5, an example will be described in which the estimation unit 103 estimates the technical words present in the word group "simple gener adversari network model". First, with n being the number of words in the word group, the estimation unit 103 determines, for the N-grams whose N is one less than n, whether the similarity between two adjacent N-grams is equal to or greater than the predetermined threshold. In the example of FIG. 5, the word group contains five words, so n is 5.

まず、推定部１０３は、５から１を引いたフォーグラムについて、フォーグラムモデルＭ１１４を用いて、隣接するフォーグラムのペア、つまり“simple gener adversari network”及び“gener adversari network model”の間の類似度を推定する。類似度が所定閾値（第１閾値）以上である場合、推定部１０３は、これらのフォーグラムのペアの一つ上のファイブグラムの熟語（つまり、simple gener adversari network model）を、技術ワードとして推定する。 First, the estimation unit 103 estimates the similarity between the adjacent four-gram pair, i.e., "simple gener adversari network" and "gener adversari network model", for the four-gram obtained by subtracting 1 from 5, using the four-gram model M114. If the similarity is equal to or greater than a predetermined threshold (first threshold), the estimation unit 103 estimates the five-gram phrase one level above these four-gram pairs (i.e., simple gener adversari network model) as a technical word.

一方、類似度が所定閾値（第１閾値）未満である場合、推定部１０３は、４から１を引いたトリグラムについて、トリグラムモデルＭ１１３を用いて、隣接するトリグラムのペア、つまり、“simple gener adversari”及び“gener adversari network”の間の類似度、並びに、“gener adversari network”及び“adversari network model”の間の類似度を推定する。 On the other hand, if the similarity is less than the predetermined threshold (first threshold), the estimation unit 103 uses the trigram model M113 to estimate the similarity between adjacent trigram pairs, i.e., between "simple gener adversari" and "gener adversari network", and between "gener adversari network" and "adversari network model", for the trigram obtained by subtracting 1 from 4.

もし、“simple gener adversari”及び“gener adversari network”の間の類似度が所定閾値（第１閾値）以上である場合、推定部１０３は、これらのトリグラムのペアの一つ上のフォーグラムの熟語（つまり、simple gener adversari network）を、技術ワードとして推定する。また、“gener adversari network”及び“adversari network model”の間の類似度が所定閾値（第１閾値）以上である場合、推定部１０３は、これらのトリグラムのペアの一つ上のフォーグラムの熟語（つまり、gener adversari network model）を、技術ワードとして推定する。なお、“simple gener adversari”及び“gener adversari network”の間の類似度、並びに、“gener adversari network”及び“adversari network model”の間の類似度の両方が所定閾値（第１閾値）以上である場合、推定部１０３は、類似度が高い方について、これらのトリグラムのペアの一つ上のフォーグラムの熟語を、技術ワードとして推定する。 If the similarity between "simple gener adversari" and "gener adversari network" is equal to or greater than a predetermined threshold (first threshold), the estimation unit 103 estimates the four-gram phrase one level above these trigram pairs (i.e., simple gener adversari network) as a technical word. If the similarity between "gener adversari network" and "adversari network model" is equal to or greater than a predetermined threshold (first threshold), the estimation unit 103 estimates the four-gram phrase one level above these trigram pairs (i.e., gener adversari network model) as a technical word. If both the similarity between "simple gener adversari" and "gener adversari network" and the similarity between "gener adversari network" and "adversari network model" are equal to or greater than a predetermined threshold (first threshold), the estimation unit 103 estimates the four-gram phrase one level above these trigram pairs with the higher similarity as a technical word.

一方、いずれの類似度も所定閾値（第１閾値）未満である場合、推定部１０３は、３から１を引いたバイグラムについて、バイグラムモデルＭ１１２を用いて、隣接するバイグラムのペア、つまり、“simple gener”及び“gener adversari”の間の類似度、“gener adversari”及び“adversari network”の間の類似度、並びに、“adversari network”及び“network model”の間の類似度を推定する。これらの中に、類似度が所定閾値（第１閾値）以上である、隣接するバイグラムのペアが存在する場合、推定部１０３は、当該２つのバイグラムの一つ上のトリグラムの熟語を技術ワードとして推定する。もし、類似度が所定閾値（第１閾値）以上である、隣接するバイグラムのペアが複数存在する場合、推定部１０３は、最も類似度が大きいペアの一つ上のトリグラムの熟語を技術ワードとして推定する。 On the other hand, if all similarities are less than the predetermined threshold (first threshold), the estimation unit 103 estimates the similarity between adjacent bigram pairs, i.e., the similarity between "simple gener" and "gener adversari", the similarity between "gener adversari" and "adversari network", and the similarity between "adversari network" and "network model", for the bigrams obtained by subtracting 1 from 3, using the bigram model M112. If there is a pair of adjacent bigrams whose similarity is equal to or greater than the predetermined threshold (first threshold), the estimation unit 103 estimates the trigram phrase one level above the two bigrams as the technical word. If there are multiple pairs of adjacent bigrams whose similarity is equal to or greater than the predetermined threshold (first threshold), the estimation unit 103 estimates the trigram phrase one level above the pair with the greatest similarity as the technical word.

一方、いずれの類似度も所定閾値（第１閾値）未満である場合、推定部１０３は、２から１を引いたユニグラムについて、ユニグラムモデルＭ１１１を用いて、隣接するユニグラムのペア、つまり、“simple”及び“gener”の間の類似度、“gener”及び“adversari”の間の類似度、“adversari”及び“network”の間の類似度、並びに、“network”及び“model”の間の類似度を推定する。これらの中に、類似度が所定閾値（第１閾値）以上である隣接するユニグラムのペアが存在する場合、推定部１０３は、当該ペアの一つ上のバイグラムの熟語を技術ワードとして推定する。もし、類似度が所定閾値（第１閾値）以上であるユニグラムのペアが複数存在する場合、推定部１０３は、最も類似度が大きいペアの一つ上のバイグラムの熟語を技術ワードとして推定する。 On the other hand, if all similarities are less than the predetermined threshold (first threshold), the estimation unit 103 uses the unigram model M111 to estimate the similarity between adjacent unigram pairs, i.e., the similarity between "simple" and "gener", the similarity between "gener" and "adversari", the similarity between "adversari" and "network", and the similarity between "network" and "model", for the unigram obtained by subtracting 1 from 2. If there is an adjacent unigram pair among these whose similarity is equal to or greater than the predetermined threshold (first threshold), the estimation unit 103 estimates the bigram phrase one level above the pair as the technical word. If there are multiple unigram pairs whose similarity is equal to or greater than the predetermined threshold (first threshold), the estimation unit 103 estimates the bigram phrase one level above the pair with the greatest similarity as the technical word.

一方、いずれの類似度も所定閾値（第１閾値）未満である場合、推定部１０３は、全てのユニグラム（つまり、simple, gener, adversari, network, model）を、技術ワードとして推定する。 On the other hand, if all similarities are less than the predetermined threshold (first threshold), the estimation unit 103 estimates all unigrams (i.e., simple, gener, adversari, network, model) as technical words.
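The back-off procedure described above (try the longest N-grams first, then shorten N by one until a sufficiently similar adjacent pair is found) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `ngrams` and `estimate_terms` are hypothetical, the `similarity` callable stands in for the per-N learned models (such as M111 and M112), and keeping the first pair on a tie is an assumption the text does not address.

```python
from typing import Callable, List, Sequence, Tuple

Ngram = Tuple[str, ...]


def ngrams(words: Sequence[str], n: int) -> List[Ngram]:
    """Return all contiguous N-grams of length n, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def estimate_terms(
    words: Sequence[str],
    similarity: Callable[[Ngram, Ngram], float],  # stand-in for the learned models
    first_threshold: float,
) -> List[str]:
    """Back off from N = len(words) - 1 down to N = 1.

    At each N, adjacent N-gram pairs are compared. If any pair reaches the
    threshold, the (N+1)-gram spanning the most similar pair is returned.
    If no pair qualifies at any N, every single word is returned.
    """
    for n in range(len(words) - 1, 0, -1):
        grams = ngrams(words, n)
        best_score = None
        best_index = None
        for i in range(len(grams) - 1):
            score = similarity(grams[i], grams[i + 1])
            # Assumption: on a tie, the earlier pair is kept.
            if score >= first_threshold and (best_score is None or score > best_score):
                best_score, best_index = score, i
        if best_index is not None:
            # grams[i] and grams[i + 1] together cover words[i : i + n + 1].
            return [" ".join(words[best_index:best_index + n + 1])]
    return list(words)
```

Passing the similarity function as a parameter keeps the control flow separate from how the similarity is obtained, so the same loop works with any model that scores a pair of N-grams.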

ステップＳ１４で、推定部１０３は、ステップＳ１３の処理手順で推定した技術ワードが複数存在する場合、複数の技術ワードの各々の分散表現（ベクトル）に基づいて類似度を評価することで、類似する技術ワードをまとめたグループ（技術ワードのグループ）を推定する。推定部１０３は、グループ推定用モデルＭ１２０を用いることで、グループの推定を行う。 In step S14, if there are multiple technical words estimated in the processing procedure of step S13, the estimation unit 103 estimates a group (a group of technical words) that includes similar technical words by evaluating the similarity based on the distributed representation (vector) of each of the multiple technical words. The estimation unit 103 estimates the group by using the group estimation model M120.

グループ推定用モデルＭ１２０は、例えば、論文ＤＢ１１０に格納されている論文データの各文章について、ステップＳ１２及びステップＳ１３の処理手順を行うとともに、ステップＳ１４の処理手順で推定された熟語の技術ワードについてはアンダーバー等で結合することで一つの文字列になるように変換した文章を用意し、用意した文章をword2vecに学習させることで生成することができる。 The group estimation model M120 can be generated, for example, by carrying out the processing procedures of steps S12 and S13 for each sentence of the paper data stored in the paper DB110, preparing sentences in which the technical words of phrases estimated in the processing procedure of step S14 are converted into a single character string by combining them with underscores or the like, and having word2vec learn the prepared sentences.

ここで、グループ推定用モデルＭ１２０の生成例を説明する。なお、ステップＳ１４の処理手順で、“simple gener adversari”が熟語の技術ワードとして推定されたと仮定する。まず、生成部１０２は、ステップＳ１１及びステップＳ１２の処理手順を行うことで、文章「In this paper, we propose the simple Generative Adversarial Network model which allows long-range dependency modeling for image generation tasks.」を、文章「paper propose simple Gener Adversari Network model allow long-range depend model image gener task」に変換する。続いて、生成部１０２は、熟語の技術ワードに含まれる複数の単語をアンダーバーで結合することで、当該熟語が、word2vecにおいて一つの単語として認識されるようにする。具体的には、生成部１０２は、文章「paper propose simple_Gener_Adversari Network model allow long-range depend model image gener task」に変換する。続いて、学習部１０６は、変換された文章を、word2vecに学習させる。これにより、各技術ワードの分散表現（ベクトル）を求めることが可能となるため、２つの技術ワードを入力することで、２つの技術ワード間の類似度を出力することが可能な学習モデルを生成することができる。 Here, an example of generating the group estimation model M120 will be described. It is assumed that "simple gener adversari" has been estimated as a phrase-type technical word in the processing procedure of step S14. First, the generation unit 102 performs the processing procedures of steps S11 and S12 to convert the sentence "In this paper, we propose the simple Generative Adversarial Network model which allows long-range dependency modeling for image generation tasks." into the sentence "paper propose simple Gener Adversari Network model allow long-range depend model image gener task". Next, the generation unit 102 joins the words that make up the phrase-type technical word with underscores so that word2vec recognizes the phrase as a single word. Specifically, the generation unit 102 converts the sentence into "paper propose simple_Gener_Adversari Network model allow long-range depend model image gener task". Next, the learning unit 106 has word2vec learn the converted sentence. This makes it possible to obtain the distributed representation (vector) of each technical word, and therefore to generate a learning model that takes two technical words as input and outputs the similarity between them.
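The underscore-joining step in this example can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `join_phrases` is hypothetical, matching is done case-insensitively because the example sentence mixes capitalized and lower-case stems, and longer phrases are given priority when several could match at the same position. The resulting token lists would then be handed to a word2vec implementation for training, which is not shown here.

```python
from typing import List, Sequence


def join_phrases(tokens: Sequence[str], phrases: Sequence[Sequence[str]]) -> List[str]:
    """Replace each occurrence of a multi-word phrase with one underscore-joined token."""
    # Assumption: compare in lower case, and try longer phrases first.
    ordered = sorted(
        (tuple(w.lower() for w in p) for p in phrases if len(p) > 1),
        key=len,
        reverse=True,
    )
    lowered = [t.lower() for t in tokens]
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            if tuple(lowered[i:i + len(phrase)]) == phrase:
                # Keep the original spelling of the tokens, only join them.
                out.append("_".join(tokens[i:i + len(phrase)]))
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Joining the tokens before training is what lets the embedding model assign a single vector to the whole phrase rather than to its individual words.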

推定部１０３は、ステップＳ１３の処理手順で推定された全ての技術ワードについて、２つの技術ワード間の類似度を総当たりで推定し、類似度が近い技術ワードの組み合わせを、技術ワードのグループとする。例えば、推定部１０３は、全ての組み合わせにおいて類似度が所定閾値（第２閾値）以上となる技術ワードの組み合わせを、技術ワードのグループとみなすようにしてもよい。例えば、“blockchain”、“smart contract”、“bitcoin”及び“ethereum”の４つの単語について、全ての組み合わせにおいて類似度が所定閾値（第２閾値）以上であった場合、推定部１０３は、“blockchain”、“smart contract”、“bitcoin”及び“ethereum”の４つの単語を、技術ワードのグループとみなすようにしてもよい。なお、技術ワードのグループを推定する方法はこれに限定されず、他のクラスタリング手法が用いられてもよい。 The estimation unit 103 estimates the similarity between two technical words for all technical words estimated in the processing procedure of step S13 in a brute force manner, and sets combinations of technical words with similarities close to each other as a group of technical words. For example, the estimation unit 103 may regard combinations of technical words whose similarity is equal to or greater than a predetermined threshold (second threshold) in all combinations as a group of technical words. For example, if the similarity is equal to or greater than the predetermined threshold (second threshold) in all combinations of the four words "blockchain", "smart contract", "bitcoin", and "ethereum", the estimation unit 103 may regard the four words "blockchain", "smart contract", "bitcoin", and "ethereum" as a group of technical words. Note that the method of estimating groups of technical words is not limited to this, and other clustering methods may be used.
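The all-pairs grouping rule described above can be sketched as follows. This is one possible reading, not the method of the embodiment: the names `cosine` and `group_terms` are hypothetical, cosine similarity over the distributed representations is assumed as the similarity measure, and the greedy assignment (each term joins the first existing group in which it is similar to every member) is an assumption, since the text leaves the clustering method open.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_terms(vectors: Dict[str, Sequence[float]], second_threshold: float) -> List[List[str]]:
    """Greedily build groups in which every pair of members meets the threshold."""
    groups: List[List[str]] = []
    for term, vec in vectors.items():
        for group in groups:
            if all(cosine(vec, vectors[member]) >= second_threshold for member in group):
                group.append(term)
                break
        else:
            # No existing group accepts the term, so it starts a new one.
            groups.append([term])
    return groups
```

Because the assignment is greedy, the result can depend on the order in which terms are visited; a different clustering method would avoid that at the cost of more computation.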

ステップＳ１５で、集計部１０４は、論文ＤＢ１１０に格納されている全論文について、技術ワードのグループのうち少なくともいずれか１つの技術ワードを含む論文の数を、各論文に含まれる日付を示す情報（ここでは投稿日とする）に基づいて、技術ワードのグループごとかつ所定期間ごとに集計する。例えば、“blockchain”、“smart contract”、“bitcoin”及び“ethereum”の４つの単語からなる技術ワードのグループが存在する場合、これらの単語の少なくともいずれか１つを含む論文を検索し、検索された論文に含まれる投稿日を用いて、所定期間（例えば１ヵ月間隔）に論文数を集計する。これにより、例えば、上記技術ワードのグループを含む論文の投稿数は、２０１５年１月は５件、２０１５年２月は７件、２０１５年３月は１０件、２０１５年４月は２０件といったデータを得ることができる。 In step S15, the counting unit 104 counts the number of papers that contain at least one of the technical words in the technical word group for all papers stored in the paper DB 110 for each group of technical words and for each predetermined period based on the information indicating the date included in each paper (here, the posting date). For example, if there is a group of technical words consisting of four words, "blockchain", "smart contract", "bitcoin", and "ethereum", papers that contain at least one of these words are searched for, and the number of papers is counted for a predetermined period (e.g., one-month intervals) using the posting dates included in the searched papers. This makes it possible to obtain data such as the number of posted papers containing the above technical word group being 5 in January 2015, 7 in February 2015, 10 in March 2015, and 20 in April 2015.
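The monthly tally in step S15 can be sketched as follows. This is a minimal illustration: the function name `count_by_month` and the `(posting date, text)` tuple layout are hypothetical, and plain case-insensitive substring matching is an assumption that stands in for whatever matching the analysis device actually performs on its preprocessed text.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Sequence, Tuple


def count_by_month(papers: Iterable[Tuple[date, str]], group: Sequence[str]) -> Dict[str, int]:
    """Count, per posting month, the papers that contain at least one term of the group."""
    terms = [term.lower() for term in group]
    counts: Counter = Counter()
    for posted, text in papers:
        body = text.lower()
        # A paper is counted once even if it contains several terms of the group.
        if any(term in body for term in terms):
            counts[posted.strftime("%Y-%m")] += 1
    # "YYYY-MM" keys sort chronologically when sorted as strings.
    return dict(sorted(counts.items()))
```

Months with no matching paper are simply absent from the result; a caller that needs a gap-free series would have to fill those months with zero.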

続いて、検出部１０５は、集計部１０４により集計された所定期間ごとの論文の数に基づいて、技術ワードのグループのうち少なくともいずれか１つの技術ワードを含む論文の数が増加する、時系列上の変化点を検出する。検出部１０５は、例えば、Change finder等の既知の変化点検出アルゴリズムを用いることで、時系列上の変化点を検出するようにしてもよいし、月別の論文数の比（例えば前月比２０％上昇）に基づいて時系列上の変化点を検出するようにしてもよい。検出部１０５は、これらに限定されず、どのような方法で変化点を検出するようにしてもよい。 Then, the detection unit 105 detects a change point in the time series where the number of papers including at least one technical word from the group of technical words increases, based on the number of papers for each specified period tallied by the aggregation unit 104. The detection unit 105 may detect a change point in the time series by using a known change point detection algorithm such as Change finder, or may detect a change point in the time series based on the ratio of the number of papers by month (e.g., a 20% increase from the previous month). The detection unit 105 is not limited to these, and may detect a change point in any manner.
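The ratio-based variant mentioned above (for example, a 20% rise over the previous month) can be sketched as follows. This is a minimal illustration, not the ChangeFinder algorithm: the function name `rising_months` is hypothetical, the input is assumed to be a mapping from "YYYY-MM" keys to paper counts, and consecutive keys are assumed to be consecutive months.

```python
from typing import Dict, List


def rising_months(counts: Dict[str, int], min_ratio: float = 1.2) -> List[str]:
    """Return the months whose count is at least min_ratio times the previous month's."""
    months = sorted(counts)  # "YYYY-MM" keys sort chronologically
    flagged: List[str] = []
    for previous, current in zip(months, months[1:]):
        # Skip months that follow a zero count to avoid dividing by zero.
        if counts[previous] > 0 and counts[current] / counts[previous] >= min_ratio:
            flagged.append(current)
    return flagged
```

With the example figures from step S15 (5, 7, 10, and 20 papers from January to April 2015), every month after January exceeds a ratio of 1.2, whereas a stricter ratio of 1.5 flags only April 2015.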

図６は、端末２０に表示される、論文数及び変化点を示すグラフの一例を示す図である。図６（ａ）は、ある特定の技術ワードのグループのうち少なくともいずれか１つの技術ワードを含む論文の数の変化を示すグラフである。縦軸は論文投稿数であり、横軸は年月である。図６（ｂ）は、図６（ａ）に示す論文投稿数に基づいて算出された変化点のスコアを示す。図６（ｂ）において、急激にスコアが大きくなる箇所が変化点である。図６（ｂ）によれば、２０１８年６月頃と、２０１８年９月頃に、論文投稿数が大きく変化していることが示されている。 Figure 6 shows an example of a graph showing the number of papers and change points displayed on the terminal 20. Figure 6(a) is a graph showing the change in the number of papers that contain at least one technical word from a group of specific technical words. The vertical axis represents the number of papers submitted, and the horizontal axis represents the year and month. Figure 6(b) shows the score of the change point calculated based on the number of papers submitted shown in Figure 6(a). In Figure 6(b), the points where the score increases suddenly are the change points. Figure 6(b) shows that the number of papers submitted changed significantly around June 2018 and around September 2018.

以上説明した処理手順において、ステップＳ１４の処理手順は省略されてもよい。技術ワードによっては、必ずしもグループ化する必要が無い場合も想定されるためである。ステップＳ１４の処理手順が省略される場合、ステップＳ１５の処理手順で、集計部１０４は、論文ＤＢ１１０に格納されている全論文について、技術ワードを含む論文の数を、各論文に含まれる日付を示す情報に基づいて、技術ワードごとかつ所定期間ごとに集計するようにしてもよい。また、検出部１０５は、集計部１０４により集計された所定期間ごとの文章の数に基づいて、技術ワードを含む文章の数が増加する、時系列上の変化点を検出するようにしてもよい。 In the processing procedure described above, the processing procedure of step S14 may be omitted. This is because it is expected that grouping may not be necessary depending on the technical word. If the processing procedure of step S14 is omitted, in the processing procedure of step S15, the aggregation unit 104 may aggregate the number of papers containing the technical word for each technical word and for each predetermined period for all papers stored in the paper DB 110, based on information indicating the date included in each paper. Furthermore, the detection unit 105 may detect a change point in the time series where the number of sentences containing the technical word increases, based on the number of sentences for each predetermined period aggregated by the aggregation unit 104.

＜まとめ＞ <Summary>
以上説明した実施形態によれば、分析装置１０は、分析対象の論文からＮグラムを生成し、隣接するＮグラム間の類似度を評価するようにした。これにより、複数の論文で用いられる最新の技術ワードを、辞書を利用することなく抽出することが可能になる。また、分析装置１０は、類似する技術ワードのグループを推定し、推定した技術ワードのグループのうち少なくともいずれか１つの技術ワードが論文の中で使用される頻度の推移に基づいて、最新技術の流行の兆しを検出するようにした。最新の技術ワードは、名前が一意に定まっていないケースが多々存在するが、類似する技術ワードを考慮して論文数をカウントすることで、最新技術の流行の兆しをより適切に検出することが可能になる。 According to the embodiment described above, the analysis device 10 generates N-grams from the paper to be analyzed and evaluates the similarity between adjacent N-grams. This makes it possible to extract the latest technical words used in multiple papers without using a dictionary. The analysis device 10 also estimates a group of similar technical words and detects signs of the latest technology trend based on the change in the frequency of at least one technical word in the estimated group of technical words used in the paper. There are many cases where the names of the latest technical words are not uniquely determined, but by counting the number of papers taking similar technical words into consideration, it becomes possible to more appropriately detect signs of the latest technology trend.

以上説明した実施形態では、分析装置１０が投稿された論文を分析する前提で説明したが、本実施形態はこれに限定されない。本実施形態は、論文に限定されず、様々な文章の分析に適用することが可能である。 In the embodiment described above, the analysis device 10 is assumed to analyze a submitted paper, but the present embodiment is not limited to this. The present embodiment is not limited to papers, and can be applied to the analysis of various texts.

以上説明した実施形態は、本発明の理解を容易にするためのものであり、本発明を限定して解釈するためのものではない。実施形態で説明したフローチャート、シーケンス、実施形態が備える各要素並びにその配置、材料、条件、形状及びサイズ等は、例示したものに限定されるわけではなく適宜変更することができる。また、異なる実施形態で示した構成同士を部分的に置換し又は組み合わせることが可能である。 The above-described embodiments are intended to facilitate understanding of the present invention, and are not intended to limit the present invention. The flow charts, sequences, elements included in the embodiments, and their arrangements, materials, conditions, shapes, sizes, etc., described in the embodiments are not limited to those exemplified, and may be modified as appropriate. In addition, configurations shown in different embodiments may be partially substituted or combined.

１…文書分析システム、１０…分析装置、１１…プロセッサ、１２…記憶装置、１３…通信ＩＦ、１４…入力デバイス、１５…出力デバイス、２０…端末、１００…記憶部、１０１…受付部、１０２…生成部、１０３…推定部、１０４…集計部、１０５…検出部、１０６…学習部 1... document analysis system, 10... analysis device, 11... processor, 12... storage device, 13... communication IF, 14... input device, 15... output device, 20... terminal, 100... storage unit, 101... reception unit, 102... generation unit, 103... estimation unit, 104... aggregation unit, 105... detection unit, 106... learning unit

Claims

An estimation device comprising:
a reception unit that receives input of a plurality of sentences;
a generation unit that generates N-grams from a word group including a plurality of consecutive words included in the plurality of sentences;
an estimation unit that estimates a specific term used in the plurality of sentences by evaluating a similarity between two adjacent N-grams among the generated N-grams; and
an output unit that outputs information related to the specific term,
wherein the maximum value of N in the N-grams is the number of words included in the word group, and
the estimation unit:
determines, for the N-grams whose value of N is obtained by subtracting 1 from the maximum value of N, whether or not the similarity between two adjacent N-grams is equal to or greater than a predetermined first threshold, and, if there are no adjacent N-grams whose similarity is equal to or greater than the first threshold, makes the same determination for the N-grams whose value of N is obtained by further subtracting 1, repeating this process until the value of N becomes 1; and
when there are two adjacent N-grams whose similarity is equal to or greater than the first threshold, estimates, as the specific term, a phrase formed by combining, in order, the words included in the N-gram that corresponds to the two adjacent N-grams and whose value of N is larger by one.

The estimation device according to claim 1, wherein each of the plurality of sentences includes information indicating a date,
the estimation device further comprising:
a counting unit that counts, for each predetermined period, the number of sentences including the specific term from among the plurality of sentences based on the information indicating the date; and
a detection unit that detects a change point in the time series at which the number of sentences including the specific term increases, based on the number of sentences for each predetermined period tallied by the counting unit,
wherein the output unit outputs information regarding the change point in the time series.

The estimation device according to claim 2, wherein:
the estimation unit, when there are a plurality of estimated specific terms, estimates a group of the specific terms by evaluating a similarity based on a distributed representation of each of the plurality of specific terms;
the counting unit counts, for each predetermined period, the number of sentences including at least one specific term in the group of specific terms from among the plurality of sentences based on the information indicating the date; and
the detection unit detects a change point in the time series at which the number of sentences including at least one specific term in the group of specific terms increases, based on the number of sentences for each predetermined period tallied by the counting unit.

An estimation method performed by an estimation device, the method comprising:
a step in which the estimation device receives input of a plurality of sentences;
a step in which the estimation device generates N-grams from a word group including a plurality of consecutive words included in the plurality of sentences;
a step in which the estimation device estimates a specific term used in the plurality of sentences by evaluating a similarity between two adjacent N-grams among the generated N-grams; and
a step in which the estimation device outputs information about the specific term,
wherein the maximum value of N in the N-grams is the number of words included in the word group, and
the estimating step includes:
determining, for the N-grams whose value of N is obtained by subtracting 1 from the maximum value of N, whether or not the similarity between two adjacent N-grams is equal to or greater than a predetermined first threshold, and, if there are no adjacent N-grams whose similarity is equal to or greater than the first threshold, making the same determination for the N-grams whose value of N is obtained by further subtracting 1, repeating this process until the value of N becomes 1; and
when there are two adjacent N-grams whose similarity is equal to or greater than the first threshold, estimating, as the specific term, a phrase formed by combining, in order, the words included in the N-gram that corresponds to the two adjacent N-grams and whose value of N is larger by one.

A program that causes a computer to execute:
a step of accepting input of a plurality of sentences;
a step of generating N-grams from a word group including a plurality of consecutive words included in the plurality of sentences;
a step of estimating a specific term used in the plurality of sentences by evaluating a similarity between two adjacent N-grams among the generated N-grams; and
a step of outputting information about the specific term,
wherein the maximum value of N in the N-grams is the number of words included in the word group, and
the estimating step includes:
determining, for the N-grams whose value of N is obtained by subtracting 1 from the maximum value of N, whether or not the similarity between two adjacent N-grams is equal to or greater than a predetermined first threshold, and, if there are no adjacent N-grams whose similarity is equal to or greater than the first threshold, making the same determination for the N-grams whose value of N is obtained by further subtracting 1, repeating this process until the value of N becomes 1; and
when there are two adjacent N-grams whose similarity is equal to or greater than the first threshold, estimating, as the specific term, a phrase formed by combining, in order, the words included in the N-gram that corresponds to the two adjacent N-grams and whose value of N is larger by one.