JP2017174357A

JP2017174357A - Exploratory article prediction system

Info

Publication number: JP2017174357A
Application number: JP2016062744A
Authority: JP
Inventors: Ichiro Sakata; Junichiro Mori; Hajime Sasaki; Tadayoshi Hara
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2016-03-25
Filing date: 2016-03-25
Publication date: 2017-09-28
Anticipated expiration: 2036-03-25
Also published as: JP6734088B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for obtaining a prediction result for emerging fields using only short-term data collected after a prediction target paper is published, and to provide a system using the method.
SOLUTION: A model is constructed that uses network information as feature quantities. Various feature quantities, such as the number of nodes in the cluster to which a paper belongs, are used, while field-dependent feature quantities such as author names, author feature quantities, journal names, and journal feature quantities are not used.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a method for predicting emerging fields using machine learning, and to a system using the method.

Several methods have been proposed for discussing fields that are likely to become promising and for supporting decision-making in planning business strategies, formulating policies, evaluating promoted projects, and the like. However, as the amount of available information grows, the pace of technological change accelerates, and expert knowledge becomes increasingly fragmented, it has become difficult to extract the necessary information with conventional methods that depend on expert knowledge (for example, the T-Plan method).

Consequently, the need for prediction methods that do not depend on expert knowledge has grown, and various methods have been proposed. In recent years in particular, driven by the big data trend, large-scale paper data has been studied in computer science fields such as data mining and information retrieval.

For example, one study formulated the prediction of paper citation counts as an optimization problem and, for 500,000 computer science papers, predicted the number of citations 10 years after publication from information covering the first 3 years after publication (Non-Patent Document 1). Another study defined the impact of a paper in terms of six factors (author, content, publication venue, citations, co-authorship, and time series) and used data on 2 million computer science papers to predict the author's h-index 5 years after publication (Non-Patent Document 2).
However, the methods of these studies need to observe a paper for several years after publication in order to estimate the upward trend in its citation count, and are therefore insufficient as methods for predicting emerging fields early. In particular, once several years have passed since publication, the field is likely to have already attracted the attention of many researchers, and entering it at that point means already being behind. Prediction of emerging fields immediately after a paper is published was therefore needed.

After intensive study, the present inventors therefore found an informatics method for identifying, at an early stage, groups of academic papers whose citation counts increase remarkably within a certain period after publication. For example, by treating the citation relationships in academic paper information as a network, and thus treating the information as a structure, they found a method that can be applied in the same way to multiple fields. In that work, a statistical machine learning method was applied, the distinction between emerging papers and non-emerging papers was defined as a binary classification problem, and a prediction model was constructed in advance from training data, which made it possible to perform prediction immediately after a prediction target paper is published (Non-Patent Document 3, Non-Patent Document 4, etc.).

Non-Patent Document 1: L., & Tong, H. (2015). The Child is Father of the Man: Foresee the Success at the Early Stage. arXiv preprint arXiv:1504.00948.
Non-Patent Document 2: Dong, Y., Johnson, R. A., & Chawla, N. V. (2015, February). Will This Paper Increase Your h-index?: Scientific Impact Prediction. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 149-158). ACM.
Non-Patent Document 3: Junichiro Mori, Takeshi Sakaki, Yuya Kajikawa, Ichiro Sakata, "Research on citation prediction using large-scale paper information for identifying emerging research areas," Annual Conference of the Japanese Society for Artificial Intelligence, 2014.
Non-Patent Document 4: Junichiro Mori, "Analyzing citation relationships among vast numbers of papers and patents to discover early the technologies that may become future growth areas," Nikkei Big Data No. 17, 2015.

The present invention provides a method for obtaining a prediction result for emerging fields using only data from a short period after a prediction target paper is published, and a system using the method.

The method for predicting emerging fields according to the present invention
includes a step of constructing a prediction model, a step of evaluating the constructed prediction model, and a step of performing prediction using the constructed prediction model, wherein
the step of constructing the prediction model includes a step of acquiring paper data from a database, a step of extracting bibliographic information of papers and a citation network from the acquired paper data, a step of calculating the feature quantities of each paper from the extracted bibliographic information and citation network, and a step of performing supervised classification with the calculated feature quantities as explanatory variables and whether the paper is an emerging paper as the response variable,
the feature quantities calculated in the step of constructing the prediction model do not include author names, author feature quantities, journal names, or journal feature quantities,
the step of evaluating the constructed prediction model includes a step of acquiring, from the database, the paper data of predicted emerging papers and the paper data of papers published in the fixed period containing the date on which the predicted emerging papers were published, and a step of evaluating the constructed prediction model with an evaluation index, and
the step of performing prediction using the constructed model includes a step of acquiring, from the database, the paper data of a paper to be predicted, and a step of predicting whether or not the paper to be predicted is an emerging paper.

The system for predicting emerging fields according to the present invention
includes a prediction model construction unit that constructs a prediction model, a prediction model evaluation unit that evaluates the constructed prediction model, and a prediction unit that performs prediction using the constructed prediction model, wherein
the prediction model construction unit includes a training data acquisition unit that acquires paper data from a database, an extraction unit that extracts bibliographic information of papers and a citation network from the acquired paper data, a feature quantity calculation unit that calculates the feature quantities of the paper data from the extracted bibliographic information and citation network, and an analysis unit that performs supervised classification with the calculated feature quantities as explanatory variables and whether the paper is an emerging paper as the response variable,
the feature quantities calculated in the prediction model construction unit do not include author names, author feature quantities, journal names, or journal feature quantities,
the prediction model evaluation unit includes an evaluation data acquisition unit that acquires, from the database, the paper data of predicted emerging papers and the paper data of papers published in the fixed period containing the date on which the predicted emerging papers were published, and an evaluation unit that evaluates the constructed prediction model with an evaluation index, and
the prediction unit that performs prediction using the constructed prediction model includes a target data acquisition unit that acquires, from the database, the paper data of a paper to be predicted, a prediction unit that predicts whether or not the paper to be predicted is an emerging paper, and an output unit that outputs the prediction result of the prediction unit.

That is, the present inventors reasoned as follows. A published paper is already connected to the existing citation network through its reference list, so sufficient information is available at that point. A researcher who writes a paper that will be highly cited in the future should have surveyed the field appropriately and assembled a suitable reference list, so highly accurate prediction should be possible from the network information of the paper alone. On this basis, the inventors constructed a model that uses network information as feature quantities, without using author names, author feature quantities, journal names, or journal feature quantities, which are usually regarded as highly contributing feature quantities, and thereby completed the present invention.

The present inventors also constructed a model that uses clustering results as feature quantities, and thereby completed the present invention. That is, in predicting emerging fields, they focused on information about the cluster to which a paper belongs and used it for prediction. Specifically, prediction accuracy was improved by using various feature quantities such as the number of nodes in the cluster to which a paper belongs.

Furthermore, the present inventors improved the accuracy of predicting emerging fields by using various other feature quantities.

In addition, the present inventors considered that in fields with high science linkage, an indicator of the distance between technology and science, such as the photovoltaic power generation field, trends in academic papers are a major guide to grasping future technological trends. They therefore adopted the photovoltaic power generation field as a concrete target for a model that detects emerging fields early, and completed an invention relating to a method for predicting emerging fields in the photovoltaic power generation field.

According to the present invention, a prediction result for emerging fields can be obtained a short time after a prediction target paper is published. In particular, emerging fields can be predicted from paper data within one year of publication, a period in which a paper has hardly been cited. Likewise, emerging fields in the photovoltaic power generation field can be predicted from paper data within one year of publication.

In the present invention, the feature quantities used do not include author names, author feature quantities, journal names, or journal feature quantities. Compared with the case where these are included (for example, Non-Patent Document 3), the number of explanatory variables is greatly reduced, and as a result the calculation time in these steps can be greatly shortened.
Furthermore, according to the present invention, a more versatile prediction model can be constructed that does not depend on the field (academic field, technical field, etc.) to which the training data used in constructing the prediction model belongs. That is, a prediction model constructed from data in one field can also be used as a prediction model in other fields whose knowledge structure is similar. For example, a prediction model constructed with data from the photovoltaic power generation field as training data can also be used in the nanocarbon field or the stem cell field; a prediction model constructed with data from the nanocarbon field can be used in the photovoltaic power generation field or the stem cell field; and a prediction model constructed with data from the stem cell field can be used in the photovoltaic power generation field or the nanocarbon field (see FIG. 4). One reason is that the feature quantities used do not include field-dependent feature quantities such as author names, author feature quantities, journal names, and journal feature quantities.

FIG. 1 is a flowchart showing an example of the method for predicting emerging fields according to the present invention.
FIG. 2 is a functional block diagram showing an example of the system for predicting emerging fields according to the present invention.
FIG. 3 is an example of the logistic regression analysis performed in the step or device for constructing the prediction model in the present invention.
FIG. 4 is an example of evaluation results showing the versatility of the prediction model constructed according to the present invention. The values in the figure are the F-values of the prediction models expressed as percentages.

Embodiments of the present invention are described below. The following embodiments are intended to aid understanding of the invention and do not limit the present invention.

<Method for predicting emerging fields>
The method for predicting emerging fields according to the present invention includes a step of constructing a prediction model, a step of evaluating the constructed prediction model, and a step of performing prediction using the constructed prediction model (see FIG. 1).

Step of constructing the prediction model
The step of constructing the prediction model in the present invention includes a step of acquiring paper data from a database, a step of extracting bibliographic information of papers and a citation network from the acquired paper data, a step of calculating the feature quantities of each paper from the extracted bibliographic information and citation network, and a step of performing supervised classification with the calculated feature quantities as explanatory variables and whether the paper is an emerging paper as the response variable (see FIG. 1).
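The supervised classification at the end of this step can be sketched as follows. This is a minimal illustration only, assuming logistic regression (which the description names as one possible classifier) fitted by plain batch gradient descent. The function name `train_logistic_regression`, the learning rate, the epoch count, and the toy data are all assumptions made for the example and are not taken from the patent.

```python
import math


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss.

    X: list of feature vectors (explanatory variables).
    y: list of 0/1 labels (1 = emerging paper, 0 = non-emerging paper).
    """
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target  # predicted - actual
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n_samples
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
    return weights, bias


# Toy data (hypothetical): one feature per paper, larger values are emerging.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y)
predicted = [int(bias + weights[0] * row[0] > 0) for row in X]
```

In practice each row of `X` would hold the network, cluster, centrality, and citation-relationship feature quantities of one paper, and a library implementation would normally replace the hand-written loop.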

However, the feature quantities calculated in the step of constructing the prediction model in the present invention do not include author names, author feature quantities, journal names, or journal feature quantities.

An author feature quantity is a feature quantity obtained by aggregating each author feature over the group of papers cited by the target paper. Specifically, it is the maximum, minimum, mean, and total of that author feature over the cited papers.
Similarly, a journal feature quantity is a feature quantity obtained by aggregating each journal feature over the group of papers cited by the target paper. Specifically, it is the maximum, minimum, mean, and total of that journal feature over the cited papers.

Examples of databases that can be used in the present invention include academic literature databases.

The step of acquiring paper data from the database in the present invention includes a step of acquiring the data of the target field by searching paper titles, abstracts, or other fields for specific terms.

The step of extracting the citation network from the acquired paper data in the present invention includes a step of checking, for each pair of papers in the acquired paper data, whether a citing/cited relationship exists, by determining whether the other paper of the pair appears among the papers that each paper cites or is cited by. This check is based on data contained in the bibliographic information of the papers, or on data obtained by querying the database with that bibliographic information.
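The pairwise check described above can be sketched as follows. This is a minimal illustration, assuming each paper's reference list is already available as a list of paper IDs; the function name `build_citation_network` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def build_citation_network(references):
    """Build directed citation links restricted to the retrieved paper set.

    references: dict mapping each paper ID to the IDs in its reference list.
    Returns (cites, cited_by), two dicts of sets.
    """
    paper_ids = set(references)
    cites = defaultdict(set)      # citing paper -> papers it cites
    cited_by = defaultdict(set)   # cited paper -> papers citing it
    for citing, refs in references.items():
        for cited in refs:
            # Keep a link only when both papers of the pair are in the data set.
            if cited in paper_ids and cited != citing:
                cites[citing].add(cited)
                cited_by[cited].add(citing)
    return cites, cited_by


references = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B", "X"],  # "X" is outside the retrieved set and is dropped
}
cites, cited_by = build_citation_network(references)
```

Storing both directions keeps later lookups cheap: `cites` gives the cited papers used for the aggregated features, and `cited_by` gives citation counts.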

The feature quantities in the present invention are used to represent the training data for predicting emerging papers, and include network feature quantities, cluster feature quantities, centrality feature quantities, and citation-relationship feature quantities. Other feature quantities may be included in addition to these.

The network feature quantities in the present invention represent the characteristics of the entire citation network that contains the target paper. Examples include the number of papers in the network and the number of citation links in the network.
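The two example network feature quantities can be computed directly from the reference lists. The sketch below is illustrative only; the function name `network_features` and the dictionary keys are assumptions.

```python
def network_features(references):
    """Count the papers (nodes) and in-set citation links (edges) of a network.

    references: dict mapping each paper ID to the IDs in its reference list.
    """
    nodes = set(references)
    edges = {
        (citing, cited)
        for citing, refs in references.items()
        for cited in refs
        if cited in nodes and cited != citing
    }
    return {"num_papers": len(nodes), "num_citation_links": len(edges)}


features = network_features({"A": [], "B": ["A"], "C": ["A", "B"]})
```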

The cluster feature quantities in the present invention represent the characteristics of the cluster to which the target paper belongs. Examples include the maximum modularity value of the cluster to which the target paper belongs, the number of nodes in that cluster, the rank of that cluster, the number of nodes in a cluster, and the modularity of a cluster. A cluster here means a group of papers within the citation network that cite one another particularly densely, and is extracted by modularity maximization.
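For reference, the quantity that modularity maximization optimizes can be sketched as follows. This computes Newman's modularity Q for a given partition of an undirected, unweighted graph; it does not perform the maximization itself, and treating the citation links as undirected is an assumption of the example. The names `modularity` and `community_of` are hypothetical.

```python
from collections import Counter


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  e_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = Counter()    # edges with both ends in the same community
    degree_sum = Counter()  # total degree of the nodes in each community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_split = modularity(edges, community_of)
q_single = modularity(edges, {node: 0 for node in community_of})

# Example cluster feature: number of nodes in the cluster containing paper "a".
cluster_sizes = Counter(community_of.values())
size_of_a_cluster = cluster_sizes[community_of["a"]]
```

Splitting the graph into the two triangles scores higher than keeping everything in one cluster, which is the behaviour a modularity-maximizing clustering exploits.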

The centrality feature quantities in the present invention represent how central a paper is in the structure of the citation network. Centrality is quantified from several viewpoints. Examples include the degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, network constraint, clustering coefficient, PageRank, hub score, and authority score of the target paper.
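Two of the listed measures, degree centrality and PageRank, can be sketched in a few lines. This is an illustration under stated assumptions: degree centrality counts both citing and cited links and is normalized by n - 1, and PageRank uses power iteration with a damping factor of 0.85 and spreads the rank of papers without outgoing links uniformly. The function names and parameter values are choices made for the example, not values from the patent.

```python
def degree_centrality(cites, nodes):
    """Number of citation links touching each paper, divided by (n - 1)."""
    degree = {v: 0 for v in nodes}
    for citing, targets in cites.items():
        for cited in targets:
            degree[citing] += 1
            degree[cited] += 1
    return {v: degree[v] / (len(nodes) - 1) for v in nodes}


def pagerank(cites, nodes, damping=0.85, iterations=100):
    """PageRank by power iteration over directed citation links."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is shared by all papers.
        dangling = sum(rank[v] for v in nodes if not cites.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n
                    for v in nodes}
        for v in nodes:
            targets = cites.get(v, ())
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


nodes = ["A", "B", "C"]
cites = {"B": {"A"}, "C": {"A", "B"}}  # B cites A; C cites A and B
degree = degree_centrality(cites, nodes)
rank = pagerank(cites, nodes)
```

In this toy network every paper touches two links, so degree centrality cannot separate them, while PageRank ranks the most-cited paper "A" highest.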

The citation-relationship feature quantities in the present invention are summary statistics over the group of papers cited by the target paper. Examples include the maximum, minimum, mean, and total of a given network feature quantity over the cited papers.
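The aggregation over cited papers can be sketched as below. The function name `citation_relation_features`, the sample feature values, and the choice to return zeros for a paper with no in-set references are assumptions made for the example.

```python
def citation_relation_features(paper_id, cites, feature):
    """Summarize one per-paper feature over the papers that paper_id cites.

    cites: dict mapping a paper ID to the set of paper IDs it cites.
    feature: dict mapping a paper ID to a numeric feature value.
    """
    values = [feature[c] for c in cites.get(paper_id, ()) if c in feature]
    if not values:
        # Assumption: a paper with no usable references gets all zeros.
        return {"max": 0.0, "min": 0.0, "mean": 0.0, "sum": 0.0}
    return {
        "max": max(values),
        "min": min(values),
        "mean": sum(values) / len(values),
        "sum": sum(values),
    }


cites = {"C": {"A", "B"}}
feature = {"A": 0.5, "B": 0.25}  # hypothetical per-paper feature values
summary = citation_relation_features("C", cites, feature)
empty = citation_relation_features("A", cites, feature)
```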

Other feature quantities in the present invention may include, for example: feature quantities of paper data contained in the bibliographic information (for example, affiliation and publication year); feature quantities of paper data contained in the bibliographic information of the referenced papers; feature quantities obtained from other external knowledge resources (web services, online dictionaries, journal impact factors, etc.); feature quantities obtained from external knowledge resources for the referenced papers; feature quantities extracted from text in the paper, in the bibliographic information, or in external knowledge resources; feature quantities extracted from such text for the referenced papers; feature quantities calculated by combining several of the feature quantities described above; and feature quantities calculated from one or more of the feature quantities described above using statistical or machine learning methods.

An emerging paper in the present invention is a positive example in the ground-truth data for supervised classification. For example, when logistic regression analysis is used for the supervised classification, an emerging paper is a positive example in the ground-truth data for the logistic regression analysis. It can be defined as a paper whose increase in citation count over a fixed period after its publication date falls within a fixed top percentage of the whole data set of papers published in the fixed period containing that publication date. Papers whose citation increase falls within the top percentage are treated as positive examples (flag 1), papers whose citation increase falls within a fixed bottom percentage are treated as negative examples (flag 0), and these serve as the ground-truth data for the logistic regression analysis.
For example, let t_1 be the year containing the date on which the target paper was published (the target year), with t_1 = t_0 + 4. For papers published in year t_0, the year containing the date four years before the target year, the citation counts at t_0 and at t_0 + 3 are calculated, and the increase in citations from t_0 to t_0 + 3 (citation count at t_0 + 3 minus citation count at t_0) is obtained. If emerging papers are defined as papers whose citation increase is in the top 5% of the whole data set, papers in the top 5% are treated as positive examples (flag 1), papers in the bottom 50% are treated as negative examples (flag 0), and these can be used as the ground-truth data for the logistic regression analysis (see FIG. 3).
The form of publication in the present invention is not particularly limited; examples include publication in print and presentation.
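The labeling rule in the example (top 5% positive, bottom 50% negative) can be sketched as follows. The function names, the handling of ties (arbitrary, by sort order), and leaving the papers between the two bands unlabeled are assumptions of this sketch; the description does not say how the middle band is treated.

```python
def citation_increase(count_at_t0, count_at_t0_plus_3):
    """Citation count at t_0 + 3 minus citation count at t_0, per paper."""
    return {p: count_at_t0_plus_3[p] - count_at_t0[p] for p in count_at_t0}


def label_emerging(increase, top_fraction=0.05, bottom_fraction=0.50):
    """Flag the top fraction as 1 and the bottom fraction as 0.

    Papers between the two bands receive no label (assumption).
    """
    ranked = sorted(increase, key=increase.get, reverse=True)
    n = len(ranked)
    n_top = max(1, int(n * top_fraction))
    n_bottom = int(n * bottom_fraction)
    labels = {p: 1 for p in ranked[:n_top]}
    labels.update({p: 0 for p in ranked[n - n_bottom:]})
    return labels


# Hypothetical data: 20 papers published in year t_0.
count_t0 = {f"p{i}": 0 for i in range(20)}
count_t3 = {f"p{i}": i for i in range(20)}
labels = label_emerging(citation_increase(count_t0, count_t3))
```

With 20 papers, one paper (the top 5%) is flagged 1, the ten least-cited papers are flagged 0, and the remaining nine are left out of the ground-truth data.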

Step of evaluating the prediction model
The step of evaluating the constructed prediction model in the present invention includes a step of acquiring, from the database, the paper data of predicted emerging papers and the paper data of papers published in the fixed period containing the date on which the predicted emerging papers were published, and a step of evaluating the constructed prediction model with an evaluation index.
The fixed period containing the date on which a predicted emerging paper was published can be, for example, the year containing the publication date.

A predicted emerging paper in the present invention means a paper that the constructed prediction model has predicted to be an emerging paper.

The F-value or the AUC can be used as the evaluation index in the present invention. The F-value here is the harmonic mean of precision and recall, and is one of the indices commonly used to evaluate prediction models. The AUC is an accuracy index for binary classification that represents the area under the receiver operating characteristic curve, which is plotted from combinations of the false positive rate and the true positive rate. The closer the AUC is to 1, the closer the classifier is to perfect classification; the closer it is to 0.5, the closer it is to random classification.
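The AUC can be computed without drawing the curve, using the equivalent pairwise formulation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that formulation; the function name `auc` and the sample scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 0/1 ground-truth flags (1 = emerging paper).
    scores: model scores, higher meaning more likely to be emerging.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


value = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

The pairwise loop is quadratic in the number of papers, so a rank-based implementation would be preferable for large data sets.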

Precision in the present invention is, among the papers published in the fixed period containing the date on which the predicted emerging papers were published, the proportion of predicted emerging papers that were actually emerging papers (that is, the number of predicted emerging papers published in that period that were actually emerging papers, divided by the number of predicted emerging papers published in that period).

Recall in the present invention is, among the papers published in the fixed period containing the date on which the predicted emerging papers were published, the proportion of papers that were actually emerging papers that were also predicted emerging papers (that is, the number of actual emerging papers published in that period that were predicted emerging papers, divided by the number of actual emerging papers published in that period).
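Precision, recall, and the F-value defined above can be sketched together as follows. The function name `precision_recall_f` and the convention of returning 0.0 when a denominator is zero are assumptions of the example.

```python
def precision_recall_f(actual, predicted):
    """Precision, recall, and their harmonic mean (F-value).

    actual: 0/1 flags, 1 = the paper really was an emerging paper.
    predicted: 0/1 flags, 1 = the model predicted an emerging paper.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
precision, recall, f_value = precision_recall_f(actual, predicted)
```

Here two of the three predicted emerging papers are correct and two of the three actual emerging papers are found, so all three values come out to 2/3.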

Step of performing prediction using the constructed model
The step of performing prediction using the constructed model in the present invention includes a step of acquiring the paper data of a paper to be predicted, and a step of predicting whether or not the paper to be predicted is an emerging paper.

The step of predicting whether or not the paper to be predicted is a sprouting paper includes a step of applying the constructed prediction model to the acquired paper data (prediction target data) and predicting whether the increase in the paper's citation count after a fixed period from its publication date falls within a fixed top percentage, that is, whether or not the paper is a sprouting paper. One example is a step of predicting whether the increase in citation count three years after the publication date of the target paper falls within the top 5%.

<Sprouting area prediction system>
The sprouting area prediction system of the present invention includes a prediction model construction unit that constructs a prediction model, a prediction model evaluation unit that evaluates the constructed prediction model, and a prediction unit that performs prediction using the constructed prediction model (see FIG. 2).

Prediction model construction unit
The prediction model construction unit in the present invention includes a learning data acquisition unit that acquires paper data from a database, an extraction unit that extracts bibliographic information and a citation network of the papers from the acquired paper data, a feature quantity calculation unit that calculates each feature quantity of the paper data from the extracted bibliographic information and citation network, and an analysis unit that performs supervised classification with each calculated feature quantity as an explanatory variable and the sprouting paper as the explained variable.

Each data acquisition unit in the present invention (the learning data acquisition unit, the evaluation data acquisition unit, or the target data acquisition unit) acquires the data of the target field by searching the title, abstract, or other items of papers, as designated via an input device, for specific terms.

The feature quantities calculated by the prediction model construction unit in the present invention do not include author names, author feature quantities, journal names, or journal feature quantities.

Prediction model evaluation unit
The prediction model evaluation unit, which evaluates the constructed prediction model in the present invention, includes an evaluation data acquisition unit that acquires from the database the paper data of the predicted sprouting papers and the paper data of papers published in the fixed period containing the date on which the predicted sprouting papers were published, and an evaluation unit that evaluates the constructed prediction model with an evaluation index. The F value or AUC can be used as the evaluation index.

Prediction unit
The prediction unit, which performs prediction using the constructed prediction model in the present invention, includes a target data acquisition unit that acquires the paper data of the paper to be predicted, a prediction unit that predicts whether or not the paper to be predicted is a sprouting paper, and an output unit that outputs the prediction result of the prediction unit.

The prediction unit that predicts whether or not the paper to be predicted is a sprouting paper applies the constructed prediction model to the acquired paper data (prediction target data) and predicts whether the increase in the paper's citation count after a fixed period from its publication date falls within a fixed top percentage, that is, whether or not the paper is a sprouting paper. For example, it predicts whether the increase in citation count three years after the publication date of the target paper falls within the top 5%.

The units constituting the sprouting area prediction system of the present invention may be implemented using separate hardware resources or using common hardware resources. For example, the learning data acquisition unit, the evaluation data acquisition unit, and the target data acquisition unit may be implemented using separate hardware resources or using common hardware resources.

Examples of the present invention are described below. These examples are provided to aid understanding of the invention and do not limit the present invention.

(Example 1)
1) Construction of the prediction model
Thomson Web of Science was used as the source of bibliographic information and the citation network, and paper data containing ("solar cell" OR "photovolt") in the title, abstract, or keywords were extracted.

Bibliographic information and the citation network were extracted from the extracted paper data. When a network was built from the direct citation relationships among these papers, the largest connected component contained 104,025 papers as of July 27, 2015.
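
Extracting the largest connected component from a direct citation network can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the example: it assumes the citation network is given as (citing, cited) pairs and that citation direction is ignored when deciding connectivity, which the specification does not state. The function name and paper IDs are hypothetical.

```python
from collections import defaultdict, deque


def largest_connected_component(citations):
    """Return the set of paper IDs in the largest connected component.

    citations: iterable of (citing_id, cited_id) pairs.
    Direction is ignored (an assumption): two papers are connected
    if either one cites the other.
    """
    neighbours = defaultdict(set)
    for citing, cited in citations:
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)

    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects one component at a time.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical example: p1-p2-p3 form one component, p4-p5 another.
core = largest_connected_component([("p1", "p2"), ("p3", "p2"), ("p4", "p5")])
```

Here `core` contains p1, p2, and p3; feature quantities would then be calculated only for papers in that set.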

Next, feature quantities were calculated for every paper belonging to the largest connected component, and the citation count within the network was calculated for every paper. The calculated feature quantities are network feature quantities, cluster feature quantities, centrality feature quantities, and citation-relationship feature quantities. The network feature quantities calculated in the prediction model construction step comprised three types of network feature quantities, three types of network cluster feature quantities, nine types of centrality feature quantities, and 48 types obtained as the maximum, minimum, mean, and sum of the 12 cluster and centrality feature quantities over the reference paper group, for a total of 63 types (3 + 3 + 9 + 48). Table 1 summarizes the feature quantities.
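
The 48 citation-relationship feature quantities (12 base quantities × 4 aggregates) can be sketched as below. The input layout, the function name, and the generated feature names are assumptions for illustration; only the name CITING_SUM-CL_RANK appears in the specification, and the other names are modelled on it.

```python
from statistics import mean

# Maximum, minimum, mean, and sum, as listed in the text.
# The labels MAX, MIN, and AVG are hypothetical; only SUM is attested.
AGGREGATES = {"MAX": max, "MIN": min, "AVG": mean, "SUM": sum}


def reference_aggregates(reference_features):
    """Aggregate each feature quantity over a paper's reference list.

    reference_features: list of dicts, one per cited paper, mapping a
    feature name to its value (hypothetical layout).
    """
    result = {}
    if not reference_features:
        return result
    for name in reference_features[0]:
        values = [paper[name] for paper in reference_features]
        for label, func in AGGREGATES.items():
            result[f"CITING_{label}-{name}"] = func(values)
    return result


# Two cited papers with 12 placeholder feature quantities each.
refs = [{f"F{i}": i for i in range(12)}, {f"F{i}": 2 * i for i in range(12)}]
aggregated = reference_aggregates(refs)
total_features = 3 + 3 + 9 + len(aggregated)
```

With 12 base quantities per cited paper, `aggregated` holds 48 entries and `total_features` is 63, matching the count given in the text.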

Logistic regression analysis was performed with the calculated feature quantities as explanatory variables and the sprouting paper as the explained variable. In the present invention, a sprouting paper is defined as a paper whose increase in citation count three years after its year of publication falls within the top 5% of the entire data set. Papers actually within the top 5% by increase in citation count were treated as positive examples (flagged 1), papers within the bottom 50% by increase in citation count were treated as negative examples (flagged 0), and these labels served as the ground-truth data for supervised machine learning. Logistic regression, a linear classifier, was adopted for the analysis, and LIBLINEAR was used as the analysis package. From the negative examples, a sample equal in size to the positive examples was drawn at random eight times, so that eight data sets were built for each year. For each model, 5-fold cross-validation was performed to avoid overfitting.
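
The labelling and balanced sampling described above can be sketched as follows. This illustrates the procedure only and does not reproduce the LIBLINEAR-based analysis; the function names, the fixed random seed, and the way ties and rounding at the 5% and 50% boundaries are handled are all assumptions.

```python
import random


def label_papers(citation_increase, top=0.05, bottom=0.50):
    """Split papers into positive (top 5%) and negative (bottom 50%) examples.

    citation_increase: dict mapping paper_id to the increase in citation
    count three years after the year of publication.
    """
    ranked = sorted(citation_increase, key=citation_increase.get, reverse=True)
    n_pos = int(len(ranked) * top)
    n_neg = int(len(ranked) * bottom)
    positives = ranked[:n_pos]
    negatives = ranked[len(ranked) - n_neg:]
    return positives, negatives


def balanced_datasets(positives, negatives, repeats=8, seed=0):
    """Build `repeats` data sets, each pairing all positives (flag 1)
    with an equal-sized random draw of negatives (flag 0)."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(repeats):
        sampled = rng.sample(negatives, len(positives))
        datasets.append([(pid, 1) for pid in positives] +
                        [(pid, 0) for pid in sampled])
    return datasets


# Hypothetical data: 100 papers whose citation increase equals their index.
increase = {f"p{i}": i for i in range(100)}
positives, negatives = label_papers(increase)
datasets = balanced_datasets(positives, negatives)
```

Each of the eight data sets produced this way could then be passed to a logistic regression trainer and checked with 5-fold cross-validation.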

2) Evaluation of the prediction model
The paper data of the predicted sprouting papers and the paper data of papers published in the fixed period containing the date on which the predicted sprouting papers were published were acquired from the database, and the constructed prediction model was evaluated.

3) Prediction using the constructed model
Data on papers published from January 1, 2015 to July 27, 2015 were acquired and given to the constructed model, which predicted whether each paper's increase in citation count would fall within the top 5% in 2018, three years later.

(Comparative Example 1)
A prediction model was constructed, evaluated, and used for prediction by the same procedure as in Example 1. The feature quantities calculated in constructing the prediction model were author names and journal names in addition to the network feature quantities. The network feature quantities comprised three types of network feature quantities, three types of network cluster feature quantities, nine types of centrality feature quantities, and 48 types obtained as the maximum, minimum, mean, and sum of the 12 cluster and centrality feature quantities over the reference paper group, for a total of 63 types. For author names, each author name appearing in the bibliographic information of all the paper data extracted in Example 1 yielded two binary (1 or 0) feature quantities: whether the name is included in the bibliographic information of the paper (98,702 types as of July 2, 2015), and whether it is included in any paper of the reference paper group (98,702 types as of July 2, 2015), for a total of 197,404 author-name feature quantities. Likewise, for journal names, each journal name appearing in the bibliographic information of all the paper data extracted in Example 1 yielded two binary (1 or 0) feature quantities: whether the name is included in the bibliographic information of the paper (2,107 types as of July 2, 2015), and whether it is included in any paper of the reference paper group (2,107 types as of July 2, 2015), for a total of 4,214 journal-name feature quantities. Table 2 summarizes the feature quantities.

Table 3 shows the evaluation results of each year's model in Example 1, and Table 4 shows those in Comparative Example 1. The F value in Example 1 remained stable at around 80% for every target year. A model balanced in both precision and recall could thus be constructed, and the results were comparable to those of Comparative Example 1.

Table 5 lists the ten feature quantities with the highest contribution in the 2011 model of Example 1. For the model that predicts which of the papers published in 2011 will be sprouting papers as of 2014, three years later, the ten feature quantities contributing most to the prediction are shown in descending order of contribution (Weights).

In Example 1, the feature quantity (Feature) with the highest contribution (Weights) was PageRank (CNT_PAGER). PageRank is an algorithm for determining the importance of web pages, but the idea originates in the evaluation of academic papers based on citation relationships. Under this index, a paper cited by highly cited papers is rated as important; applying it relatively lowers the importance of papers that, for example, accumulate citations by citing one another within a group of peers.
The feature quantity with the next highest contribution was degree centrality (CNT_DEGRE). This index becomes higher when a paper lists many papers in its reference list (Reference List). The authority score (CNT_AUTHOR) can, by its definition, be considered to be high for papers that bridge clusters. This suggests that sprouting areas are formed when new fields arise from cross-disciplinary papers.
The sixth feature quantity (CITING_SUM-CL_RANK) means that the more the cited papers belong to top-ranked clusters of the network, the more likely the citing paper is to become a sprouting paper. The sixth through ninth feature quantities are all derived from the feature quantities of the papers listed in the reference list (Reference List).
Of the ten feature quantities with the highest contribution, seven (those other than degree centrality (CNT_DEGRE), closeness centrality (CNT_CLOSE), and eigenvector centrality (CNT_EIGEN)) are not used in Non-Patent Document 3.
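
How PageRank rewards papers cited by important papers can be sketched with a plain power iteration over a citation network. This is a generic textbook formulation, not the implementation used in the example: the damping factor 0.85, the iteration count, and the even redistribution of score from papers that cite nothing are assumptions, and the function name and paper IDs are hypothetical.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Return a dict of PageRank scores for a citation network.

    citations: iterable of (citing_id, cited_id) pairs; score flows from
    the citing paper to the cited paper.
    """
    out_links = {}
    for citing, cited in citations:
        out_links.setdefault(citing, set()).add(cited)
        out_links.setdefault(cited, set())
    n = len(out_links)
    rank = {node: 1.0 / n for node in out_links}
    for _ in range(iterations):
        # Papers that cite nothing spread their score evenly over all papers.
        dangling = sum(rank[node] for node, outs in out_links.items() if not outs)
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in out_links}
        for node, outs in out_links.items():
            for target in outs:
                new_rank[target] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank


# Hypothetical network: a and b cite c, and c cites d.
scores = pagerank([("a", "c"), ("b", "c"), ("c", "d")])
```

In this small network, c scores higher than a or b because it is cited by both, and the scores sum to 1.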

Table 6 shows, for Example 1, how much the citation counts of the top 10 papers predicted as most likely to become sprouting papers, among the papers published in 2011, had increased as of 2014, three years later.

In Example 1, of the top 10 papers judged most likely to become sprouting papers among those published in 2011, the ones that had actually become sprouting papers as of 2014 were papers No. 1, No. 3, No. 4, No. 6, No. 7, No. 8, and No. 10 in Table 6. In other words, of the top 10 papers in Example 1, 70% (7 of 10) were papers whose increase in citation count fell within the top 5%.

As the volume of information grows and the structure of knowledge becomes more complex, decisions on corporate research and development investment and on national budget allocation under science and technology policy are expected to become extremely difficult. In such circumstances, the sprouting area prediction method of the present invention can be used for corporate research and development and for decision-making on national science and technology policy.

Claims

A method for predicting sprouting papers,
comprising a step of constructing a prediction model, a step of evaluating the constructed prediction model, and a step of performing prediction using the constructed prediction model, wherein
the step of constructing the prediction model includes a step of acquiring paper data from a database, a step of extracting bibliographic information and a citation network of the papers from the acquired paper data, a step of calculating each feature quantity of the paper data from the extracted bibliographic information and citation network, and a step of performing supervised classification with each calculated feature quantity as an explanatory variable and the sprouting paper as the explained variable,
the feature quantities calculated in the step of constructing the prediction model do not include author names, author feature quantities, journal names, or journal feature quantities,
the step of evaluating the constructed prediction model includes a step of acquiring, from the database, the paper data of the predicted sprouting papers and the paper data of papers published in the fixed period containing the date on which the predicted sprouting papers were published, and a step of evaluating the constructed prediction model with an evaluation index, and
the step of performing prediction using the constructed model includes a step of acquiring, from the database, the paper data of the paper to be predicted, and a step of predicting whether or not the paper to be predicted is a sprouting paper,
the method being a method for predicting a sprouting area.

The method for predicting a sprouting area according to claim 1, wherein a sprouting paper can be predicted from paper data within one year after publication.

The method for predicting a sprouting region according to claim 1 or 2, wherein the method does not depend on a field to which the article data acquired in the step of constructing the prediction model belongs.

A system for predicting sprouting papers,
comprising a prediction model construction unit that constructs a prediction model, a prediction model evaluation unit that evaluates the constructed prediction model, and a prediction unit that performs prediction using the constructed prediction model, wherein
the prediction model construction unit includes a learning data acquisition unit that acquires paper data from a database, an extraction unit that extracts bibliographic information and a citation network of the papers from the acquired paper data, a feature quantity calculation unit that calculates each feature quantity of the paper data from the extracted bibliographic information and citation network, and an analysis unit that performs supervised classification with each calculated feature quantity as an explanatory variable and the sprouting paper as the explained variable,
the feature quantities calculated in the prediction model construction unit do not include author names, author feature quantities, journal names, or journal feature quantities,
the prediction model evaluation unit that evaluates the constructed prediction model includes an evaluation data acquisition unit that acquires, from the database, the paper data of the predicted sprouting papers and the paper data of papers published in the fixed period containing the date on which the predicted sprouting papers were published, and an evaluation unit that evaluates the constructed prediction model with an evaluation index, and
the prediction unit that performs prediction using the constructed prediction model includes a target data acquisition unit that acquires, from the database, the paper data of the paper to be predicted, a prediction unit that predicts whether or not the paper to be predicted is a sprouting paper, and an output unit that outputs the prediction result of the prediction unit,
the system being a sprouting area prediction system.

The sprouting area prediction system according to claim 4, wherein a sprouting paper can be predicted from paper data within one year after publication.

The sprouting area prediction system according to claim 5 or 6, wherein the system does not depend on a field to which the article data acquired in the prediction model construction unit belongs.