CN104462215B - Time-series-based method for predicting the citation count of scientific and technical literature - Google Patents

Time-series-based method for predicting the citation count of scientific and technical literature

Info

Publication number
CN104462215B
Authority
CN
China
Prior art keywords
document
cited
time series
similarity
month
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410618173.XA
Other languages
Chinese (zh)
Other versions
CN104462215A (en)
Inventor
姚念民
李梦阳
谭国真
战福瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201410618173.XA
Publication of CN104462215A
Application granted
Publication of CN104462215B
Status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Abstract

The invention provides a time-series-based method for predicting the citation count of scientific and technical literature. The method first counts the citations received by each document and computes the average citation count of documents in each month. Each document's monthly citation count is then normalized against that month's average, which yields the document's citation time series. The documents are clustered by their time series; within each cluster, the data are divided into a training set and a validation set, regression models are built, and error analysis selects the model with the best predictive performance. Finally, the similarity between the time series of a document under test and those of each cluster is computed, the most similar cluster is identified, and that cluster's best model predicts the citation count of the document under test for the following month. The invention thus automatically analyzes how each document is cited after publication, obtains the average citation count for each month, discovers the distinct citation patterns of documents through clustering, and predicts future citation counts from the existing time series of the document under test.

Description

Time-series-based method for predicting the citation count of scientific and technical literature
Technical field
The invention belongs to the field of computer technology and relates to a time-series-based method for predicting the citation count of scientific and technical literature.
Background art
The citation count is the number of times a scientific or technical document is cited by other documents within a specified period, and it is an important measure of the document's influence and quality. Citation statistics, however, are limited to the current point in time: they do not reveal how a document will be cited in the future, which hampers the assessment of its scientific and technological contribution. A time-series-based method for predicting citation counts is therefore needed, so that promising documents can be identified sooner and scientific research and the spread of new knowledge are promoted.
Summary of the invention
The object of the invention is to provide a time-series-based method for predicting the citation count of scientific and technical literature. By obtaining and analyzing the citation time series of a document, the method predicts its citation count over a coming period, helps assess the document's scientific potential, and supports fast and efficient reading recommendations.
The technical solution of the invention is as follows:
Step 1: Collect the publication date and reference list of each document, and count the number of times each document is cited in each month after publication.
Step 2: For each month, compute the total number of citations received by all analyzed documents and the total number of documents cited in that month; dividing the former by the latter gives the average citation count of the month, avecitecount(month).
Step 3: For each document, starting from its month of publication, compute the difference between its citation count in each month and avecitecount(month) for that month; the resulting sequence is the document's citation time series.
Step 4: Cluster the document collection by the similarity of the citation time series, build several regression models for the time series in each cluster, and select the best-performing model by error analysis.
Step 5: Use vector similarity to compute the similarity between the document under test and the time series of each cluster, and use the regression model of the most similar cluster to compute the citation count of the document under test for the following month.
In step 1, the reference list of each document is retrieved from a database. From each document's identifier in the database and its publication date, the specific times and the number of times the document is cited are tallied, giving the citation count for each month after publication.
In step 4, the documents that take part in clustering are first screened by the length of their citation time series. A series longer than N is truncated to N; a series shorter than N is discarded. The value of N is set by the user.
In step 4, cluster analysis first computes the pairwise distance between citation time series, using the Euclidean distance, and then generates a clustering tree with the unweighted average distance method.
Citation time series $X_i = (X_{i1}, X_{i2}, \ldots, X_{i8})$: the citation time series vector of document i.
Citation time series $X_j = (X_{j1}, X_{j2}, \ldots, X_{j8})$: the citation time series vector of document j.
Distance $d(X_i, X_j)$: the Euclidean distance between the citation time series of documents i and j.
The distance is computed as:
$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$
Computing the distance between every pair of citation time series gives a distance matrix. Following the hierarchical clustering method, the clustering tree is generated with the unweighted average distance method.
Between-cluster distance $D_{pq}$: the distance between clusters $G_p$ and $G_q$, where $G_p$ has $n_p$ elements and $G_q$ has $n_q$ elements.
Element distance $d_{ij}$: the distance between time series i and j.
The between-cluster distance is computed as:
$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$
Cluster analysis assigns each document in the collection to a cluster.
In step 4, when a regression model is built for the time series in a cluster, the data are first divided into a training set and a validation set: a time point in the series is chosen, the data before it form the training set, and the data after it form the validation set. Models are fitted on the training set and their accuracy is assessed on the validation set. Finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data.
In step 5, let two documents p and $p_j$ be represented by the time series vectors $(X_{i1}, X_{i2}, \ldots, X_{i8})$ and $(X_{j1}, X_{j2}, \ldots, X_{j8})$. The time series similarity between the documents, $\mathrm{Similarity}(p, p_j)$, is computed as:
$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_k X_{ik} X_{jk}}{\sqrt{\left(\sum_k X_{ik}^2\right)\left(\sum_k X_{jk}^2\right)}}$$
From the similarity between individual documents, the similarity between the document under test and the time series of each cluster can then be computed:
$$\mathrm{Similarity}(p, C_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Similarity}(p, p_j)$$
$\mathrm{Similarity}(p, C_i)$ is the similarity between the document under test p and the time series of the documents in cluster $C_i$.
$\mathrm{Similarity}(p, p_j)$ is the similarity between the time series of the document under test p and document $p_j$, obtained from the cosine of the angle between the two vectors. Document $p_j \in C_i$, $j = 1, 2, \ldots, n$, where n is the total number of documents in cluster $C_i$.
Beneficial effects of the invention:
The invention uses a database to obtain the publication date of each scientific and technical document and its citation count in each month after publication. In the data preprocessing stage, the total number of citations received by all documents in each month and the total number of documents cited in that month are computed, and dividing the former by the latter gives the month's average citation count. For each document, starting from its month of publication, each month's citation count is normalized against that month's average, giving the document's citation time series. The document collection is clustered by the degree of correlation between citation time series. Within each cluster, the data are divided into a training set and a validation set, regression models are built, and error analysis selects the model with the best predictive performance. Finally, the similarity between the document under test and the time series of each cluster is analyzed, the most similar cluster is identified, and that cluster's best model computes the citation count of the document under test for the following month. The invention thus automatically analyzes how each document is cited after publication, obtains the average citation count for each month, discovers the distinct citation patterns of documents through clustering, and predicts future citation counts from the existing time series of the document under test.
The invention computes the average citation count of each month in the data preprocessing stage, that is, in step 2. When the citation time series of a document is built, the difference between the month's citation count and the month's average is used as the value for that month, which reduces the prediction error caused by seasonal differences in scientific activity and improves prediction accuracy. The clustering of citation time series and the construction of regression models in step 4 bring out the different citation patterns of documents. After error analysis has identified the best model, the training and validation sets are merged and the best model is run again, so that the most recent data are fully used in prediction and the accuracy of the prediction model improves.
Detailed description of the embodiments:
Step 1: Collect the publication date and reference list of each document, and count the number of times each document is cited in each month after publication.
The reference list of each document is retrieved from a database. From each document's identifier in the database and its publication date, the specific times and the number of times the document is cited are tallied, giving the citation count for each month after publication.
Each document in the collection is traversed, and its publication time (time) and the reference identifiers in its reference list (refid_1, refid_2, …, refid_n) are read. For each reference identifier refid_i, the number of documents citing refid_i in each month after its publication is counted; this number is the citation count for that month.
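The counting in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (`published`, `refs`), the function names, and the choice to date a citation by the citing document's publication month are all assumptions made for the example.

```python
from collections import defaultdict

def month_index(year, month):
    """Map a (year, month) pair to a single running month number."""
    return year * 12 + (month - 1)

def monthly_citation_counts(documents):
    """Return {doc_id: {months_since_publication: citation_count}}.

    `documents` maps a document id to a dict with a `published`
    (year, month) tuple and a `refs` list of cited document ids.
    This record layout is hypothetical.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for citing in documents.values():
        citing_month = month_index(*citing["published"])
        for refid in citing["refs"]:
            cited = documents.get(refid)
            if cited is None:
                continue  # the reference points outside the collection
            offset = citing_month - month_index(*cited["published"])
            if offset >= 0:
                counts[refid][offset] += 1
    return {doc: dict(per_month) for doc, per_month in counts.items()}

docs = {
    "A": {"published": (2013, 1), "refs": []},
    "B": {"published": (2013, 2), "refs": ["A"]},
    "C": {"published": (2013, 2), "refs": ["A"]},
    "D": {"published": (2013, 4), "refs": ["A", "B"]},
}
counts = monthly_citation_counts(docs)
```

Here document A is cited twice one month after publication and once three months after publication, and document B is cited once two months after publication.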
Step 2: For each month, compute the total number of citations received by all analyzed documents and the total number of documents cited in that month; dividing the former by the latter gives the average citation count of the month, avecitecount.
Average citation count Avecitecount(month): the average citation count in the month `month`.
Monthly citation count Citecount($P_i$, month), $P_i \in N$, where N is the set of documents cited in that month: the citation count of document $P_i$ in that month.
The monthly average citation count is computed as follows, where $|N|$ is the number of documents in N:
$$\mathrm{Avecitecount}(month) = \frac{1}{|N|} \sum_{P_i \in N} \mathrm{Citecount}(P_i, month)$$
This formula gives the average citation count of each month. When the time series of a document is built, the difference between the month's citation count and the month's average is used as the value for that month, which reduces the prediction error caused by seasonal differences in scientific activity.
Step 3: For each document, starting from its month of publication, compute the difference between its citation count in each month and avecitecount(month) for that month; the resulting sequence is the document's citation time series.
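Steps 2 and 3 together can be illustrated with a short sketch. The input layout (calendar months as integers, a `published` mapping) and the function names are hypothetical, and treating the average of a month in which nothing was cited as 0 is an assumption; the source does not cover that case.

```python
def average_citation_counts(cited_by_month):
    """avecitecount(month): total citations in a month divided by the
    number of documents cited at least once in that month."""
    totals, cited_docs = {}, {}
    for per_month in cited_by_month.values():
        for month, count in per_month.items():
            if count > 0:
                totals[month] = totals.get(month, 0) + count
                cited_docs[month] = cited_docs.get(month, 0) + 1
    return {m: totals[m] / cited_docs[m] for m in totals}

def citation_time_series(doc, cited_by_month, published, average, last_month):
    """Citation count minus the monthly average, from the publication
    month of `doc` through `last_month` (inclusive)."""
    per_month = cited_by_month.get(doc, {})
    return [
        # Assumption: the average is taken as 0 when nothing was cited.
        per_month.get(m, 0) - average.get(m, 0.0)
        for m in range(published[doc], last_month + 1)
    ]

# Hypothetical data: {document: {calendar month: citation count}}.
cited_by_month = {
    "A": {1: 4, 2: 1},
    "B": {1: 2, 3: 3},
}
published = {"A": 0, "B": 1}

average = average_citation_counts(cited_by_month)
series_a = citation_time_series("A", cited_by_month, published, average, 3)
series_b = citation_time_series("B", cited_by_month, published, average, 3)
```

In month 1 both documents are cited, so the average is (4 + 2) / 2 = 3; document A then scores 4 - 3 = 1 for that month and document B scores 2 - 3 = -1.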
Step 4: Cluster the document collection by the similarity of the citation time series, build several regression models for the time series in each cluster, and select the best-performing model by error analysis.
First, the documents that take part in clustering are screened by the length of their citation time series. A series longer than N is truncated to N; a series shorter than N is discarded. The value of N is set by the user. In this experiment, N = 8.
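The screening rule is small enough to state directly in code. In this sketch the function name is hypothetical, and keeping the first N values when truncating is an assumption: the source says only that the excess part of an overlong series is cut off.

```python
def screen_series(series_by_doc, n=8):
    """Keep series of length >= n, truncated to their first n values."""
    return {
        doc: list(series[:n])
        for doc, series in series_by_doc.items()
        if len(series) >= n
    }

# Small illustration with n = 3 instead of the experiment's N = 8.
screened = screen_series(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [1.0, 2.0], "C": [0.5, 0.0, -1.0]},
    n=3,
)
```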
Cluster analysis first computes the pairwise distance between citation time series, using the Euclidean distance, and then generates a clustering tree with the unweighted average distance method.
Citation time series $X_i = (X_{i1}, X_{i2}, \ldots, X_{i8})$: the citation time series vector of document i.
Citation time series $X_j = (X_{j1}, X_{j2}, \ldots, X_{j8})$: the citation time series vector of document j.
Distance $d(X_i, X_j)$: the Euclidean distance between the citation time series of documents i and j.
The distance is computed as:
$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$
Computing the distance between every pair of citation time series gives a distance matrix. Following the hierarchical clustering method, the clustering tree is generated with the unweighted average distance method.
Between-cluster distance $D_{pq}$: the distance between clusters $G_p$ and $G_q$, where $G_p$ has $n_p$ elements and $G_q$ has $n_q$ elements.
Element distance $d_{ij}$: the distance between time series i and j.
The between-cluster distance is computed as:
$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$
Cluster analysis assigns each document in the collection to a cluster. Several regression models are then built for the time series in each cluster. In this experiment, a linear trend model, an exponential trend model and a polynomial trend model were constructed.
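The distance matrix and the average-distance merging rule can be sketched in plain Python. This is an illustrative agglomerative loop under stated assumptions, not the inventors' code: stopping at a fixed `num_clusters` is an assumption (the source says only that a clustering tree is generated), and the two-element series stand in for the eight-element series of the experiment.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance d(X_i, X_j) between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(series, num_clusters):
    """Merge clusters by the smallest D_pq, the mean of d_ij over all
    i in G_p and j in G_q, until `num_clusters` clusters remain.

    Returns a sorted list of clusters, each a sorted list of indices
    into `series`.
    """
    n = len(series)
    d = [[euclidean(series[i], series[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def between(p, q):
        return sum(d[i][j] for i in p for j in q) / (len(p) * len(q))

    while len(clusters) > num_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: between(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)

series = [
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 10.0],
    [10.0, 11.0],
    [10.0, 12.0],
]
clusters = average_linkage(series, num_clusters=2)
```

The loop recomputes every between-cluster distance on each pass, which is simple to read but slow for large collections; a production version would more likely update the distances incrementally or use a library routine.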
Let the citation count of a month be the output variable $Y_t$ and the month t (t = 1, 2, 3, …) the predictor variable. The linear trend model is then:
$$Y_t = \beta_0 + \beta_1 \times t + \varepsilon$$
where $Y_t$ is the citation count in month t, and $\beta_0$, $\beta_1$ and $\varepsilon$ correspond to the level, trend and noise of the time series, respectively.
The exponential trend model is:
$$\log Y_t = \beta_0 + \beta_1 \times t + \varepsilon$$
The quadratic polynomial trend model is:
$$Y_t = \beta_0 + \beta_1 \times t + \beta_2 \times t^2$$
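One way to fit the three trend models is ordinary least squares, sketched below with the standard library only. The fitting procedure, the helper `polyfit` and the other function names are assumptions; the source names the models but not how they are estimated. Note also that the exponential model takes the logarithm of the series, so this sketch requires strictly positive values, and the source does not say how values that are zero or negative after subtracting the monthly average are handled.

```python
import math

def polyfit(ts, ys, degree):
    """Least-squares polynomial coefficients [b0, b1, ...] obtained by
    solving the normal equations with Gauss-Jordan elimination."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the basis 1, t, t^2, ...
    rows = [
        [sum(t ** (r + c) for t in ts) for c in range(m)]
        + [sum(y * t ** r for t, y in zip(ts, ys))]
        for r in range(m)
    ]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[r][m] / rows[r][r] for r in range(m)]

def fit_linear(ts, ys):
    b0, b1 = polyfit(ts, ys, 1)
    return lambda t: b0 + b1 * t

def fit_exponential(ts, ys):
    # Fitted on log(y), so every y must be strictly positive.
    b0, b1 = polyfit(ts, [math.log(y) for y in ys], 1)
    return lambda t: math.exp(b0 + b1 * t)

def fit_quadratic(ts, ys):
    b0, b1, b2 = polyfit(ts, ys, 2)
    return lambda t: b0 + b1 * t + b2 * t * t

months = [1, 2, 3, 4, 5]
linear = fit_linear(months, [3, 5, 7, 9, 11])             # y = 1 + 2t
quadratic = fit_quadratic(months, [2, 5, 10, 17, 26])     # y = 1 + t^2
exponential = fit_exponential(months, [2, 4, 8, 16, 32])  # y = 2^t
```

Each fitted model is returned as a function of the month t, so `linear(6)` extrapolates the series one month ahead.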
When a regression model is built for the time series in a cluster, the data are first divided into a training set and a validation set: a time point t in the series is chosen, the data before it form the training set, and the data after it form the validation set. Models are fitted on the training set and their accuracy is assessed on the validation set, using the root-mean-square error (RMSE) as the evaluation metric.
The root-mean-square error is computed as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{v} \sum_{t=1}^{v} e_t^2}$$
where $e_t$ is the difference between the actual value and the predicted value at time t, and v is the number of time periods in the validation set.
Finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data.
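The split, the RMSE comparison and the final refit can be sketched as one selection routine. Everything here is illustrative: the function names, the split position and the sample data are hypothetical, and only two of the candidate models are included to keep the example short.

```python
import math

def ols(ts, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(ts)
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    return y_mean - slope * t_mean, slope

def fit_linear(ts, ys):
    b0, b1 = ols(ts, ys)
    return lambda t: b0 + b1 * t

def fit_exponential(ts, ys):
    # Fitted on log(y), so every y must be strictly positive.
    b0, b1 = ols(ts, [math.log(y) for y in ys])
    return lambda t: math.exp(b0 + b1 * t)

def rmse(model, ts, ys):
    """Root-mean-square error of `model` over the given time points."""
    errors = [y - model(t) for t, y in zip(ts, ys)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def select_and_refit(ts, ys, split, fitters):
    """Fit each candidate on ts[:split], score it on ts[split:], then
    refit the lowest-RMSE candidate on the full series."""
    scores = {
        name: rmse(fit(ts[:split], ys[:split]), ts[split:], ys[split:])
        for name, fit in fitters.items()
    }
    best = min(scores, key=scores.get)
    return best, scores, fitters[best](ts, ys)

months = [1, 2, 3, 4, 5, 6, 7, 8]
values = [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
best, scores, model = select_and_refit(
    months, values, split=6,
    fitters={"linear": fit_linear, "exponential": fit_exponential},
)
next_month = model(9)
```

Refitting the winning model on the merged data means the most recent months, which were held out for validation, also inform the forecast for month 9.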
Step 5: Use vector similarity to compute the similarity between the document under test and the time series of each cluster, and use the regression model of the most similar cluster to compute the citation count of the document under test for the following month.
Let two documents p and $p_j$ be represented by the time series vectors $(X_{i1}, X_{i2}, \ldots, X_{i8})$ and $(X_{j1}, X_{j2}, \ldots, X_{j8})$. The time series similarity between the documents, $\mathrm{Similarity}(p, p_j)$, is computed as:
$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_k X_{ik} X_{jk}}{\sqrt{\left(\sum_k X_{ik}^2\right)\left(\sum_k X_{jk}^2\right)}}$$
From the similarity between individual documents, the similarity between the document under test and the time series of each cluster can then be computed:
$$\mathrm{Similarity}(p, C_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Similarity}(p, p_j)$$
$\mathrm{Similarity}(p, C_i)$ is the similarity between the document under test p and the time series of the documents in cluster $C_i$.
$\mathrm{Similarity}(p, p_j)$ is the similarity between the time series of the document under test p and document $p_j$. Document $p_j \in C_i$, $j = 1, 2, \ldots, n$, where n is the total number of documents in cluster $C_i$.
Once the most similar cluster has been identified, the existing time series of the document under test is used as the input, and that cluster's regression model predicts the document's future citation count.
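The similarity computation and the choice of cluster can be sketched as follows. The function names and sample series are hypothetical, and returning 0 for an all-zero vector is an assumption, since the cosine is undefined in that case and the source does not address it.

```python
import math

def cosine_similarity(x, y):
    """Similarity(p, p_j): cosine of the angle between two series."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

def class_similarity(p, members):
    """Similarity(p, C_i): mean cosine similarity between series p and
    every series in the cluster."""
    return sum(cosine_similarity(p, m) for m in members) / len(members)

def most_similar_class(p, classes):
    """Name of the cluster whose members are most similar to p."""
    return max(classes, key=lambda name: class_similarity(p, classes[name]))

classes = {
    "rising": [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    "falling": [[3.0, 2.0, 1.0], [6.0, 3.0, 1.0]],
}
target = [1.0, 3.0, 5.0]
best_class = most_similar_class(target, classes)
```

The regression model already selected for `best_class` would then be evaluated at the next month index to obtain the predicted citation count.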

Claims (2)

1. A time-series-based method for predicting the citation count of scientific and technical literature, characterized in that:
Step 1: the publication date and reference list of each document are collected, and the number of times each document is cited in each month after publication is counted;
Step 2: for each month, the total number of citations received by all analyzed documents and the total number of documents cited in that month are computed, and dividing the former by the latter gives the average citation count of the month, avecitecount(month);
Step 3: for each document, starting from its month of publication, the difference between its citation count in each month and avecitecount(month) is computed, giving the citation time series of the document;
Step 4: the documents that take part in clustering are screened by the length of their citation time series; a series longer than N is truncated to N; a series shorter than N is discarded; the value of N is set by the user;
during clustering, the pairwise distance between citation time series is first computed using the Euclidean distance, and a clustering tree is then generated with the unweighted average distance method;
citation time series $X_i = (X_{i1}, X_{i2}, \ldots, X_{i8})$: the citation time series vector of document i;
citation time series $X_j = (X_{j1}, X_{j2}, \ldots, X_{j8})$: the citation time series vector of document j;
distance $d(X_i, X_j)$: the Euclidean distance between the citation time series of documents i and j;
the distance is computed as:
$$d(X_i, X_j) = \left[ \sum_{k=1}^{8} (X_{ik} - X_{jk})^2 \right]^{1/2}$$
computing the distance between every pair of citation time series gives a distance matrix; following the hierarchical clustering method, the clustering tree is generated with the unweighted average distance method;
between-cluster distance $D_{pq}$: the distance between clusters $G_p$ and $G_q$, where $G_p$ has $n_p$ elements and $G_q$ has $n_q$ elements;
element distance $d_{ij}$: the distance between time series i and j;
the between-cluster distance is computed as:
$$D_{pq} = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}$$
cluster analysis assigns each document in the collection to a cluster; when a regression model is built for the time series in a cluster, the data are first divided into a training set and a validation set: a time point in the series is chosen, the data before it form the training set, and the data after it form the validation set; models are fitted on the training set and their accuracy is assessed on the validation set; finally, the training and validation data are merged into one data set, and the best prediction model found on the training set is run again on the merged data;
Step 5: vector similarity is used to compute the similarity between the document under test and the time series of each cluster, and the regression model of the most similar cluster computes the citation count of the document under test for the following month;
for two documents p and $p_j$, represented by the time series vectors $(X_{i1}, X_{i2}, \ldots, X_{i8})$ and $(X_{j1}, X_{j2}, \ldots, X_{j8})$, the time series similarity between the documents, $\mathrm{Similarity}(p, p_j)$, is computed as:
$$\mathrm{Similarity}(p, p_j) = \cos\theta = \frac{\sum_k X_{ik} X_{jk}}{\sqrt{\left(\sum_k X_{ik}^2\right)\left(\sum_k X_{jk}^2\right)}}$$
from the similarity between individual documents, the similarity between the document under test and the time series of each cluster can then be computed;
the similarity between the document under test and the time series of each cluster is computed as:
$$\mathrm{Similarity}(p, C_i) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Similarity}(p, p_j)$$
$\mathrm{Similarity}(p, C_i)$ is the similarity between the document under test p and the time series of the documents in cluster $C_i$;
$\mathrm{Similarity}(p, p_j)$ is the similarity between the time series of the document under test p and document $p_j$; document $p_j \in C_i$, $j = 1, 2, \ldots, n$, where n is the total number of documents in cluster $C_i$.
2. The time-series-based method for predicting the citation count of scientific and technical literature according to claim 1, characterized in that: in step 1, the reference list of each document is retrieved from a database; from each document's identifier in the database and its publication date, the specific times and the number of times the document is cited are tallied, giving the total citation count for each month after publication.
CN201410618173.XA 2014-11-05 2014-11-05 Time-series-based method for predicting the citation count of scientific and technical literature Expired - Fee Related CN104462215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410618173.XA CN104462215B (en) 2014-11-05 2014-11-05 Time-series-based method for predicting the citation count of scientific and technical literature

Publications (2)

Publication Number Publication Date
CN104462215A CN104462215A (en) 2015-03-25
CN104462215B true CN104462215B (en) 2017-07-11

Family

ID=52908251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410618173.XA Expired - Fee Related CN104462215B (en) 2014-11-05 2014-11-05 Time-series-based method for predicting the citation count of scientific and technical literature

Country Status (1)

Country Link
CN (1) CN104462215B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491830B (en) * 2017-07-03 2021-03-26 北京奇艺世纪科技有限公司 Method and device for processing time series curve
US10776891B2 (en) 2017-09-29 2020-09-15 The Mitre Corporation Policy disruption early warning system
CN108875327A * 2018-05-28 2018-11-23 阿里巴巴集团控股有限公司 Identity verification method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887460A (en) * 2010-07-14 2010-11-17 北京大学 Document quality assessment method and application
CN103208038A (en) * 2013-05-03 2013-07-17 武汉大学 Patent introduction predicted value calculation method
CN103729432A (en) * 2013-12-27 2014-04-16 河海大学 Method for analyzing and sequencing academic influence of theme literature in citation database

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
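A comparative study of three half-life algorithms for scientific and technical literature; 陈京莲; 《井冈山大学学报(自然科学版)》 (Journal of Jinggangshan University, Natural Science Edition); 2014-07-15 (2014, No. 04); full text *
Progress and prospects of identifying research fronts by citation analysis; 马楠, 官建成; 《中国科技论坛》 (Forum on Science and Technology in China); 2006-04-05 (2006, No. 04); full text *
Research on a comprehensive evaluation index system for literature influence; 徐建中, 王名扬; 《情报理论与实践》 (Information Studies: Theory & Application); 2014-05-09 (2014, No. 05); full text *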

Also Published As

Publication number Publication date
CN104462215A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
Chhay et al. Municipal solid waste generation in China: influencing factor analysis and multi-model forecasting
Swathi et al. An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis
CN106447285B (en) Recruitment information matching method based on multi-dimensional domain key knowledge
CN101556553B (en) Defect prediction method and system based on requirement change
Han et al. Intelligent decision model of road maintenance based on improved weight random forest algorithm
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN104657496A (en) Method and equipment for calculating information hot value
CN104573359A (en) Method for integrating crowdsource annotation data based on task difficulty and annotator ability
Uddin et al. Predicting the popularity of online news from content metadata
CN106127242A (en) Year of based on integrated study Extreme Precipitation prognoses system and Forecasting Methodology thereof
CN106294882A (en) Data digging method and device
CN107957929A (en) A kind of software deficiency report based on topic model repairs personnel assignment method
Aminyavari et al. Probabilistic streamflow forecast based on spatial post-processing of TIGGE precipitation forecasts
Hamad et al. Predicting freeway incident duration using machine learning
Omole et al. An approach to reaeration coefficient modeling in local surface water quality monitoring
CN104462215B (en) Time-series-based method for predicting the citation count of scientific and technical literature
Ekmekcioğlu et al. Tree-based nonlinear ensemble technique to predict energy dissipation in stepped spillways
CN109710725A (en) A kind of Chinese table column label restoration methods and system based on text classification
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
Abu Daoud et al. Validating the practicality of utilising an image classifier developed using TensorFlow framework in collecting corrugation data from gravel roads
Wang et al. A nonparametric mean estimator for judgment poststratified data
Sotomayor et al. Implications of macroinvertebrate taxonomic resolution for freshwater assessments using functional traits: The Paute River Basin (Ecuador) case
CN110827131A (en) Tax payer credit evaluation method based on distributed automatic feature combination
Chao Estimating project overheads rate in bidding: DSS approach using neural networks
Asghari et al. Spatial rainfall prediction using optimal features selection approaches

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170711

Termination date: 20191105
