CN104462215B

CN104462215B - A time-series-based method for predicting the citation count of scientific and technical literature

Info

Publication number: CN104462215B
Application number: CN201410618173.XA
Authority: CN
Inventors: 姚念民; 李梦阳; 谭国真; 战福瑞
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2014-11-05
Filing date: 2014-11-05
Publication date: 2017-07-11
Anticipated expiration: 2034-11-05
Also published as: CN104462215A

Abstract

The invention provides a time-series-based method for predicting the citation count of scientific and technical literature. The method first counts the citations received by each document and computes the average citation count of documents in each month. Each month's citation count is then normalized against that month's average to obtain a citation time series. The documents are clustered by time series; within each class, training and validation sets are divided, regression models are built, and error analysis selects the model with the best citation-count prediction performance. Finally, the similarity between the document under test and the time series of each class is analyzed to find the most similar class, and that class's optimal prediction model yields the document's citation count for the following month. The invention not only analyzes automatically how each document is cited after publication and obtains the average citation count for each month, but also discovers the different citation patterns of documents through clustering, and then predicts future citation counts from the existing time series of the document under test.

Description

A time-series-based method for predicting the citation count of scientific and technical literature

Technical field

The invention belongs to the field of computer technology and relates to a time-series-based method for predicting the citation count of scientific and technical literature.

Background technology

The citation count is the number of times a scientific document is cited by other documents within a specified period, and it is an important measure of the influence and quality of scientific literature. However, citation statistics are limited by the current point in time: citations in future periods are hard to obtain, which affects the assessment of a document's scientific contribution. A time-series-based method for predicting citation counts is therefore urgently needed, so that promising documents can be identified sooner and scientific research and the spread of new knowledge can be promoted.

Summary of the invention

The object of the invention is to provide a time-series-based method for predicting the citation count of scientific and technical literature. By obtaining and analyzing the citation time series of scientific documents, the method predicts citation counts over a coming period, helps assess the scientific potential of documents, and provides fast, efficient reading suggestions.

The technical solution that achieves the object of the invention:

Step 1: Collect the publication date and citation list of each document, and count the number of times each document is cited in each month after publication.

Step 2: For each month, compute the total number of citations received by all documents under analysis and the total number of documents cited, and divide the former by the latter to obtain the average citation count for that month, avecitecount(month).

Step 3: For each document, starting from its month of publication, compute the difference between its citation count in each subsequent month and avecitecount(month) to obtain the document's citation time series.

Step 4: Cluster the document collection by the similarity of the citation time series, build several regression models on the time series in each class, and select the best-performing model by error analysis.

Step 5: Compute the similarity between the document under test and the time series of each class using vector similarity, and use the regression model of the most similar class to compute the document's citation count for the following month.

In step 1, the citation list of each document is retrieved from a database. Using each document's identifier and publication date in the database, the specific times and numbers of citations are counted, giving the citation count for each month after each document's publication.

In step 4, the documents that take part in clustering are first screened by their citation time series; the screening criterion is the length of the time series. A time series longer than N is truncated to length N, and a time series shorter than N is discarded. The value of N is set by the user.

In step 4, cluster analysis first computes the distance between each pair of citation time series using the Euclidean distance, then generates a clustering tree with the unweighted average distance method (average linkage).

Citation time series X_i = (X_i1, X_i2, …, X_i8): the citation time series vector of document i.

Citation time series X_j = (X_j1, X_j2, …, X_j8): the citation time series vector of document j.

Distance d(X_i, X_j): the Euclidean distance between the citation time series of documents i and j.

The distance formula is:

$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

Computing the distances between the citation time series yields a distance matrix. Following the hierarchical clustering method, the clustering tree is generated with the unweighted average distance method.

Between-class distance D_pq: the distance between classes G_p and G_q, where G_p contains n_p elements and G_q contains n_q elements.

Element distance d_ij: the distance between time series i and j.

The between-class distance formula is:

$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$
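As a small worked illustration of the between-class distance defined above, consider two hypothetical classes G_p = {1, 2} and G_q = {3, 4} with assumed element distances d_13 = 2, d_14 = 4, d_23 = 3 and d_24 = 5. These numbers are invented for the example and do not come from the patent.

```latex
D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}
       = \frac{1}{2 \times 2} \left( d_{13} + d_{14} + d_{23} + d_{24} \right)
       = \frac{2 + 4 + 3 + 5}{4}
       = 3.5
```

Every cross-class pair contributes equally, which is what distinguishes the unweighted average distance method from nearest-neighbor or farthest-neighbor linkage.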

Cluster analysis assigns each document in the collection to a class.

In step 4, to build regression models on the time series within a class, the data are first divided into a training set and a validation set: a time point in the time series is chosen, the data before it form the training set, and the data after it form the validation set. Models are built on the training set and their accuracy is assessed on the validation set. Finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data set.

In step 5, for two documents p and p_j, with (X_i1, X_i2, …, X_i8) and (X_j1, X_j2, …, X_j8) denoting the corresponding time series vectors, the time series similarity between the documents, Similarity(p, p_j), is computed as:

$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_{k} X_{ik} \times X_{jk}}{\sqrt{\left(\sum_{k} X_{ik}^2\right)\left(\sum_{k} X_{jk}^2\right)}}$$

The similarity between the document under test and the time series of each class of documents can then be computed from the time series similarities between documents.

The similarity between the document under test and the time series of each class of documents is computed as:

$$\mathrm{Similarity}(p, C_i) = \frac{1}{n} \times \left[ \sum_{j=1}^{n} \mathrm{Similarity}(p, p_j) \right]$$

Similarity(p, C_i) denotes the similarity between the document under test p and the time series of the documents in class C_i.

Similarity(p, p_j) denotes the time series similarity between the document under test p and document p_j, obtained from the cosine of the angle between the two vectors. Document p_j ∈ C_i, j = 1, 2, …, n, where n is the total number of documents in class C_i.

The invention has the following beneficial effects:

The invention uses a database to obtain the publication time of each scientific document and its citation count in each month after publication. In the data preprocessing stage, the total number of citations received by all documents in each month and the total number of documents cited are computed, and dividing the former by the latter gives the average citation count for the month. For each document, starting from its month of publication, each month's citation count is normalized against that month's average citation count to obtain the document's citation time series. The document collection is clustered by the correlation of the citation time series; within each class, training and validation sets are divided, regression models are built, and error analysis selects the model with the best citation-count prediction performance. Finally, the similarity between the document under test and the time series of each class is analyzed to find the most similar class, and that class's optimal prediction model gives the document's citation count for the following month. The invention not only analyzes automatically how each document is cited after publication and obtains the average citation count for each month, but also discovers the different citation patterns of documents through clustering, and then predicts future citation counts from the existing time series of the document under test.

In the data preprocessing stage, namely step 2, the invention computes the average citation count for each month. When the citation time series of each document is built, the difference between the citation count in a month and that month's average citation count is used as the actual value for the month, which effectively reduces the prediction error caused by seasonal differences in academic activity and improves prediction accuracy. The clustering of citation time series and the building of regression models in step 4 fully uncover the different citation patterns of documents. After error analysis identifies the optimal model, the training and validation sets are merged and the optimal model is run again, so that the most recent data are fully used in prediction and the accuracy of the prediction model is effectively improved.

Specific embodiment：

The citation list of each document is retrieved from a database. Using each document's identifier and publication date in the database, the specific times and numbers of citations are counted, giving the citation count for each month after each document's publication.

Each document in the collection is traversed, and its publication time (time) and the citation identifiers in its citation list (refid_1, refid_2, …, refid_n) are read. For each citation identifier refid_i, the number of documents that cite refid_i in each month after its publication is counted; this is its citation count for that month.
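The counting step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the data layout (`pub_dates`, `citations`), the function name `monthly_citation_counts`, and the choice to date a citation by the citing document's publication date are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

def monthly_citation_counts(pub_dates, citations):
    """Count citations per calendar month for every document.

    pub_dates: {doc_id: publication date}
    citations: iterable of (citing_doc_id, cited_doc_id) pairs, i.e. one
               pair per refid entry in a citing document's citation list.
    Assumption: a citation happens on the citing document's publication date.
    Returns {cited_doc_id: Counter({(year, month): count})}.
    """
    counts = {doc_id: Counter() for doc_id in pub_dates}
    for citing, cited in citations:
        if citing not in pub_dates or cited not in pub_dates:
            continue  # skip documents missing from the database
        cited_on = pub_dates[citing]
        if cited_on >= pub_dates[cited]:  # only count citations after publication
            counts[cited][(cited_on.year, cited_on.month)] += 1
    return counts

# Hypothetical sample data.
pub_dates = {
    "A": date(2012, 1, 15),
    "B": date(2012, 2, 3),
    "C": date(2012, 2, 20),
    "D": date(2012, 4, 1),
}
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B")]
counts = monthly_citation_counts(pub_dates, citations)
```

Keying the counts by calendar month keeps them directly comparable with the per-month averages computed in step 2.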

Step 2: For each month, compute the total number of citations received by all documents under analysis and the total number of documents cited, and divide the former by the latter to obtain the average citation count for that month, avecitecount.

Average citation count Avecitecount(month): the average citation count in the given month.

Monthly citation count Citecount(P_i, month), P_i ∈ N (N is the set of documents cited in the given month): the citation count of document P_i in that month.

The monthly average citation count is computed as follows, where |N| is the number of documents in N:

$$\mathrm{Avecitecount}(month) = \frac{\sum_{P_i \in N} \mathrm{Citecount}(P_i, month)}{|N|}$$

This formula gives the average citation count for each month. When the time series of each document is built, the difference between the citation count in a month and that month's average citation count is used as the actual value for the month, which effectively reduces the prediction error caused by seasonal differences in academic activity.
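The averaging and differencing described above might look like the sketch below. It assumes monthly counts are stored as `{doc_id: {(year, month): count}}`; that layout and the helper names `average_citations`, `next_month` and `citation_series` are hypothetical and chosen only for illustration.

```python
def average_citations(counts, month):
    """Avecitecount(month): total citations in `month` divided by the
    number of documents that were cited at least once in that month."""
    cited = [c[month] for c in counts.values() if c.get(month, 0) > 0]
    return sum(cited) / len(cited) if cited else 0.0

def next_month(ym):
    """Advance a (year, month) pair by one calendar month."""
    year, month = ym
    return (year + 1, 1) if month == 12 else (year, month + 1)

def citation_series(counts, doc_id, start, length):
    """Citation time series of one document: for `length` months beginning
    at `start` (its publication month), the monthly citation count minus
    that month's average citation count."""
    series = []
    month = start
    for _ in range(length):
        own = counts[doc_id].get(month, 0)
        series.append(own - average_citations(counts, month))
        month = next_month(month)
    return series

# Hypothetical monthly counts for three documents.
counts = {
    "A": {(2012, 1): 4, (2012, 2): 1},
    "B": {(2012, 1): 2, (2012, 2): 3},
    "C": {(2012, 2): 2},
}
series_a = citation_series(counts, "A", start=(2012, 1), length=2)
series_c = citation_series(counts, "C", start=(2012, 1), length=2)
```

Because the average is subtracted, a series value can be negative: it says how far the document sits below the typical cited document in that month.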

First, the documents that take part in clustering are screened by their citation time series; the screening criterion is the length of the time series. A time series longer than N is truncated to length N, and a time series shorter than N is discarded. The value of N is set by the user; N = 8 in this experiment.
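The screening rule above is short enough to express in a few lines. The function name `screen_series` and the sample data are hypothetical; the default `n=8` mirrors the value used in the experiment.

```python
def screen_series(series_by_doc, n=8):
    """Keep only series with at least n points, truncated to exactly n.

    Series shorter than n are discarded, series longer than n lose the
    part beyond the first n months, and series of length n pass unchanged.
    """
    return {doc: s[:n] for doc, s in series_by_doc.items() if len(s) >= n}

# Hypothetical input, screened with n=3 to keep the example short.
raw = {
    "A": [0.5, 1.0, -0.2, 0.3],   # too long: truncated
    "B": [1.0, 2.0],              # too short: discarded
    "C": [0.0, 0.1, 0.2],         # exact length: kept
}
screened = screen_series(raw, n=3)
```

After screening, every remaining series has the same length, which the Euclidean distance and cosine similarity used later both require.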

Cluster analysis first computes the distance between each pair of citation time series using the Euclidean distance, then generates a clustering tree with the unweighted average distance method (average linkage).

The distance formula is:

$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

The between-class distance formula is:

$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$
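The two formulas above can be combined into a small agglomerative clustering sketch. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, and stopping at a fixed number of classes `k` is an assumed way of cutting the clustering tree, since the text does not say how the tree is cut.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """d(X_i, X_j): Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(series, k):
    """Agglomerative clustering with the unweighted average distance.

    Starts with one cluster per series and repeatedly merges the two
    clusters with the smallest D_pq until k clusters remain (assumption).
    Returns a list of clusters, each a list of indices into `series`.
    """
    n = len(series)
    # Distance matrix between all pairs of time series.
    dist = [[euclidean(series[i], series[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def between(p, q):
        # D_pq: mean of all element distances d_ij with i in p and j in q.
        return sum(dist[i][j] for i in p for j in q) / (len(p) * len(q))

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: between(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters

# Hypothetical two-point series; the functions work for any common length.
series = [[0, 0], [0, 1], [10, 10], [10, 11]]
clusters = average_linkage(series, k=2)
```

Recomputing D_pq from the distance matrix at every merge is quadratic per step and fine for a sketch; a production system would more likely rely on an established hierarchical clustering library.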

Cluster analysis assigns each document in the collection to a class. Several regression models are then built on the time series in each class. This experiment constructs a linear trend model, an exponential trend model, and a polynomial trend model.

Let the citation count in a given month be the output variable Y_t and the month t (t = 1, 2, 3, …) the predictor variable. The linear trend model is:

Y_t=β₀+β₁×t+ε

where Y_t is the citation count in month t, and β₀, β₁, and ε correspond to the level, trend, and noise of the time series respectively.

The exponential trend model is:

log Y_t=β₀+β₁×t+ε

The quadratic polynomial trend model is:

Y_t=β₀+β₁×t+β₂×t²+ε
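One way to fit the three trend models is ordinary least squares on powers of t, sketched below with the standard library only. The patent does not name an estimation method, so least squares is an assumption, as are all function names. Note that the exponential model takes log Y_t and therefore only applies to strictly positive values.

```python
import math

def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_poly(ts, ys, degree):
    """Least-squares coefficients [β0, β1, ...] via the normal equations."""
    size = degree + 1
    a = [[sum(t ** (i + j) for t in ts) for j in range(size)] for i in range(size)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(size)]
    return solve(a, b)

def poly_predict(coef, t):
    return sum(c * t ** i for i, c in enumerate(coef))

def fit_linear(ts, ys):
    coef = fit_poly(ts, ys, 1)                 # Y_t = β0 + β1·t
    return lambda t: poly_predict(coef, t)

def fit_exponential(ts, ys):
    coef = fit_poly(ts, [math.log(y) for y in ys], 1)  # log Y_t = β0 + β1·t, needs y > 0
    return lambda t: math.exp(poly_predict(coef, t))

def fit_quadratic(ts, ys):
    coef = fit_poly(ts, ys, 2)                 # Y_t = β0 + β1·t + β2·t²
    return lambda t: poly_predict(coef, t)

# Hypothetical noise-free data to check each model recovers its own trend.
ts = [1, 2, 3, 4, 5]
linear = fit_linear(ts, [2 + 3 * t for t in ts])
exponential = fit_exponential(ts, [2 * math.exp(0.5 * t) for t in ts])
quadratic = fit_quadratic(ts, [1 + t * t for t in ts])
```

Each fitter returns a callable, so the three models can be evaluated and compared through the same interface.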

To build regression models on the time series within a class, the data are first divided into a training set and a validation set: a time point t in the time series is chosen, the data before it form the training set, and the data after it form the validation set. Models are built on the training set and their accuracy is assessed on the validation set. The root-mean-square error (RMSE) is used as the evaluation metric.

The root-mean-square error is computed as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{v} \sum_{t=1}^{v} e_t^2}$$

where e_t is the difference between the actual value and the predicted value at time t, and v is the number of time periods in the validation set.

Finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data set.
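The split, evaluate and refit procedure above can be sketched as follows. For brevity the example compares only two candidates (linear and exponential trend) fitted by closed-form least squares; the function names, the sample series and the split point are assumptions made for illustration.

```python
import math

def fit_line(ts, ys):
    """Closed-form least-squares intercept and slope."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)
    return my - slope * mt, slope

def linear_model(ts, ys):
    b0, b1 = fit_line(ts, ys)
    return lambda t: b0 + b1 * t

def exponential_model(ts, ys):
    b0, b1 = fit_line(ts, [math.log(y) for y in ys])  # requires y > 0
    return lambda t: math.exp(b0 + b1 * t)

def rmse(model, ts, ys):
    """Root-mean-square of e_t = actual - predicted over the given periods."""
    return math.sqrt(sum((y - model(t)) ** 2 for t, y in zip(ts, ys)) / len(ts))

def select_and_refit(ys, split, candidates):
    """Fit each candidate on the first `split` points, score it on the rest,
    then refit the best candidate on the merged (full) series."""
    ts = list(range(1, len(ys) + 1))
    train_t, train_y = ts[:split], ys[:split]
    valid_t, valid_y = ts[split:], ys[split:]
    scores = {
        name: rmse(fit(train_t, train_y), valid_t, valid_y)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores, candidates[best](ts, ys)

# Hypothetical, roughly linear series of 8 monthly values.
candidates = {"linear": linear_model, "exponential": exponential_model}
ys = [3.0, 5.1, 6.9, 9.0, 11.1, 12.9, 15.0, 17.0]
best, scores, model = select_and_refit(ys, split=6, candidates=candidates)
next_value = model(len(ys) + 1)  # forecast for the following month
```

Refitting on the merged data means the final coefficients also reflect the most recent months, which were held out during model selection.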

For two documents p and p_j, with (X_i1, X_i2, …, X_i8) and (X_j1, X_j2, …, X_j8) denoting the corresponding time series vectors, the time series similarity between the documents, Similarity(p, p_j), is computed as:

$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_{k} X_{ik} \times X_{jk}}{\sqrt{\left(\sum_{k} X_{ik}^2\right)\left(\sum_{k} X_{jk}^2\right)}}$$

Similarity(p, p_j) denotes the time series similarity between the document under test p and document p_j. Document p_j ∈ C_i, j = 1, 2, …, n, where n is the total number of documents in class C_i.

After the most similar class has been selected, the existing time series of the document under test is used as the input variable, and the regression model of that class predicts the document's future citation count.
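Selecting the most similar class can be sketched as below: cosine similarity between the document under test and every member of a class, averaged per class, with the highest average winning. The function names and the sample classes are hypothetical, and returning 0.0 for an all-zero vector is an assumption added to avoid division by zero.

```python
import math

def cosine_similarity(x, y):
    """Similarity(p, p_j): cosine of the angle between two series vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0  # assumption: zero vector -> 0.0

def class_similarity(p, members):
    """Similarity(p, C_i): mean similarity between p and the n class members."""
    return sum(cosine_similarity(p, m) for m in members) / len(members)

def most_similar_class(p, classes):
    """Name of the class whose members are, on average, most similar to p."""
    return max(classes, key=lambda name: class_similarity(p, classes[name]))

# Hypothetical classes of 8-month citation series.
classes = {
    "rising": [[1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 2, 3, 4, 5, 6, 7]],
    "falling": [[8, 7, 6, 5, 4, 3, 2, 1], [7, 6, 5, 4, 3, 2, 1, 0]],
}
p = [2, 3, 4, 5, 6, 7, 8, 9]
best_class = most_similar_class(p, classes)
```

The regression model selected for `best_class` would then be applied to the existing series of the document under test to forecast the following month.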

Claims

1. A time-series-based method for predicting the citation count of scientific and technical literature, characterized in that:

Step 1: the publication date and citation list of each document are collected, and the number of times each document is cited in each month after publication is counted;

Step 2: for each month, the total number of citations received by all documents under analysis and the total number of documents cited are computed, and the former is divided by the latter to obtain the average citation count for that month, avecitecount(month);

Step 4: the documents that take part in clustering are screened by their citation time series, the screening criterion being the length of the time series; a time series longer than N is truncated to length N; a time series shorter than N is discarded; the value of N is set by the user;

when clustering, the distance between each pair of citation time series is first computed using the Euclidean distance, and a clustering tree is then generated with the unweighted average distance method;

the distance formula is:

$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

computing the distances between the citation time series yields a distance matrix; following the hierarchical clustering method, the clustering tree is generated with the unweighted average distance method;

between-class distance D_pq: the distance between classes G_p and G_q, where G_p contains n_p elements and G_q contains n_q elements;

element distance d_ij: the distance between time series i and j;

the between-class distance formula is:

$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$

cluster analysis assigns each document in the collection to a class; to build regression models on the time series within a class, the data are first divided into a training set and a validation set: a time point in the time series is chosen, the data before it form the training set, and the data after it form the validation set; models are built on the training set and their accuracy is assessed on the validation set; finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data set;

Step 5: the similarity between the document under test and the time series of each class is computed using vector similarity, and the regression model of the most similar class is used to compute the document's citation count for the following month;

for two documents p and p_j, with (X_i1, X_i2, …, X_i8) and (X_j1, X_j2, …, X_j8) denoting the corresponding time series vectors, the time series similarity between the documents, Similarity(p, p_j), is computed as:

$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_{k} X_{ik} \times X_{jk}}{\sqrt{\left(\sum_{k} X_{ik}^2\right)\left(\sum_{k} X_{jk}^2\right)}}$$

the similarity between the document under test and the time series of each class of documents can then be computed from the time series similarities between documents;

$$\mathrm{Similarity}(p, C_i) = \frac{1}{n} \times \left[ \sum_{j=1}^{n} \mathrm{Similarity}(p, p_j) \right]$$

Similarity(p, p_j) denotes the time series similarity between the document under test p and document p_j; document p_j ∈ C_i, j = 1, 2, …, n, where n is the total number of documents in class C_i.

2. The time-series-based method for predicting the citation count of scientific and technical literature according to claim 1, characterized in that: in step 1, the citation list of each document is retrieved from a database; using each document's identifier and publication date in the database, the specific times and numbers of citations are counted, giving the total number of citations in each month after each document's publication.