CN112650847B

CN112650847B - Technological research hotspot theme prediction method

Info

Publication number: CN112650847B
Application number: CN201910961978.7A
Authority: CN
Inventors: 谢能付; 郝心宁; 熊炜; 徐倩; 吴蕾; 梁晓贺; 吴赛赛
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2019-10-11
Filing date: 2019-10-11
Publication date: 2023-05-09
Anticipated expiration: 2039-10-11
Also published as: CN112650847A

Abstract

The invention discloses a technological research hotspot topic prediction method, which comprises the steps of preprocessing subject documents according to a technological research topic word list related to a topic to be detected to obtain word segmentation documents of corresponding years, and converting the word segmentation documents into binary vector matrixes; processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent subject set; filtering the frequent topic sets to obtain hot topic sets; converting the hot topic set into time sequence data, training a plurality of prediction models according to the time sequence data, and obtaining topic prediction models by using a weighting processing method; predicting the occurrence frequency of the theme to be detected according to the theme prediction model. According to the invention, word filtering based on the domain subject vocabulary is adopted, so that the characteristics of the technical research domain are well summarized, and the hot topic in the technical research domain is identified by adopting a frequent item set algorithm, so that the hot topic in the future time can be accurately predicted.

Description

Technological research hotspot theme prediction method

Technical Field

The invention relates to the field of information processing, in particular to a technological research hotspot theme prediction method.

Background

Most of the prior art uses clustering methods to identify scientific hot topics, and some prediction methods rely only on high-frequency keywords. As a result, scientific hot topics in a future period cannot be predicted effectively, and the accuracy of hot topic prediction is low.

Disclosure of Invention

The invention aims to provide a technological research hot spot theme prediction method which can accurately predict the occurrence frequency of hot spot themes.

In order to achieve the above object, the present invention provides the following solutions:

a technological research hotspot topic prediction method, the prediction method comprising:

determining a database of the corresponding scientific and technological research field according to the subject to be tested, wherein the database comprises subject documents, network resources and expert knowledge;

constructing a subject word list of one-dimensional transverse vectors according to the database;

preprocessing the annual subject literature according to the subject vocabulary to obtain subject literature word segmentation documents of corresponding years;

obtaining binary vectors of corresponding years according to the occurrence condition of words in the subject word list in the subject document word segmentation document by utilizing the subject word list; the binary vectors of all years form a binary vector matrix;

processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent subject set;

filtering the frequent topic set to obtain a hot topic set;

converting the hot topic collection into time sequence data;

training a plurality of prediction models according to the time sequence data, and obtaining a theme prediction model by using a weighting processing method;

and predicting the occurrence frequency of the theme to be detected by using the theme prediction model.

Optionally, the preprocessing is performed on the annual subject literature according to the subject vocabulary to obtain the subject literature word segmentation document of the corresponding year, which specifically includes:

the following treatments are performed for every year scientific literature:

sentence division is carried out on the scientific literature to obtain a corresponding sentence set;

and performing word segmentation processing on the sentence set according to the subject word list to form a subject document word segmentation document of corresponding year.

Optionally, if the word in the subject document word segmentation document appears in the subject word list, the word is marked as 1, otherwise, the word is marked as 0, and a binary vector of the corresponding year is formed.

Optionally, the processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent topic set specifically includes:

taking binary vectors corresponding to word segmentation documents in any year as transactions, arranging subject words in the word segmentation documents according to the order of the support degree from large to small, and deleting frequent 1 item sets to obtain updated transaction data sets;

converting the transaction data set into a transaction linked list group, wherein each transaction linked list of the transaction linked list group stores information of each transaction with the same head element;

updating the transaction linked list group according to the increasing arrangement sequence of the supporting degree of the head element to obtain an updated transaction linked list group;

digging the updated transaction linked list group to obtain a frequent topic set of corresponding years;

and calculating each subject term by taking the frequent subject set of the last year as a reference, if the number of years of occurrence of the subject term exceeds a threshold value, reserving, otherwise, deleting, and obtaining the frequent hot subject set.

Optionally, the step of updating the transaction linked list group according to the increasing arrangement order of the supporting degree of the head element to obtain an updated transaction linked list group specifically includes:

recursively scanning the transaction linked list to find out frequent item sets;

deleting the transaction linked list from the transaction linked list group, and creating a transaction linked list group taking the head element of the transaction linked list as a prefix;

and merging the transaction linked list group and the transaction linked list group with the transaction linked list head element as a prefix to obtain the updated transaction linked list group.

Optionally, filtering the frequent topic set to obtain a hot topic set, which specifically includes:

dividing the frequent subject set to obtain related subject words;

constructing related subject phrases according to related subject words;

each related subject phrase is processed as follows:

deleting repeated subject words to obtain a subject phrase without repetition;

deleting the subtopic word and the subtopic word of each topic word in the topic word group without repetition to obtain a hot topic word group;

and forming the hot topic set according to each hot topic phrase.

Optionally, the converting the hotspot topic set into time sequence data specifically includes:

forming a vector set according to the frequencies of topics in the hot topic set in the corresponding year;

and arranging the vector values in the vector set in order from small to large according to the year to form the time sequence data.

Optionally, the topic prediction model is:

$$\mathrm{freq}(X) = w_1 M_1(X) + w_2 M_2(X) + \dots + w_j M_j(X) + \dots + w_J M_J(X);$$

wherein $X$ represents the topic to be tested, $\mathrm{freq}(X)$ represents the frequency of occurrence of the topic to be tested, $M_j(X)$ represents the value predicted by model $M_j$ for the topic to be tested, $w_j$ represents the weight of the prediction of model $M_j$, and $j = 1, 2, \dots, J$.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the invention discloses a technological research hotspot topic prediction method, which comprises the steps of preprocessing subject documents according to a technological research topic word list related to a topic to be detected to obtain word segmentation documents of corresponding years, and converting the word segmentation documents into binary vector matrixes; processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent subject set; filtering the frequent topic sets to obtain hot topic sets; converting the hot topic set into time sequence data, training a plurality of prediction models according to the time sequence data, and obtaining topic prediction models by using a weighting processing method; predicting the occurrence frequency of the theme to be detected according to the theme prediction model. According to the invention, word filtering based on the domain subject vocabulary is adopted, so that the characteristics of the technical research domain are well summarized, and the hot topic in the technical research domain is identified by adopting a frequent item set algorithm, so that the hot topic in the future time can be accurately predicted.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for predicting hot topics in technical research according to the present invention;

FIG. 2 is a binary vector matrix of the present invention;

FIG. 3 shows the time sequence data of the present invention;

FIG. 4 is a graph of the subject predictors of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention aims to provide a technological research hot spot theme prediction method, which improves the accuracy of hot spot theme prediction.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

Examples

As shown in fig. 1, the method for predicting the hot topic of scientific research of the present invention includes:

step 101: determining a database of the corresponding scientific and technological research field according to the subject to be tested, wherein the database comprises subject documents, network resources and expert knowledge;

step 102: constructing a subject word list of one-dimensional transverse vectors according to the database;

step 103: preprocessing the annual subject literature according to the subject vocabulary to obtain subject literature word segmentation documents of corresponding years;

step 104: obtaining binary vectors of corresponding years according to the occurrence condition of words in the subject word list in the subject document word segmentation document by utilizing the subject word list; the binary vectors of all years form a binary vector matrix;

step 105: processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent subject set;

step 106: filtering the frequent topic set to obtain a hot topic set;

step 107: converting the hot topic collection into time sequence data;

step 108: training a plurality of prediction models according to the time sequence data, and obtaining a theme prediction model by using a weighting processing method;

step 109: and predicting the occurrence frequency of the theme to be detected by using the theme prediction model.

When subject literature in the scientific research field is collected, the following should be satisfied: 1) the collection is a minimal literature collection capable of covering the research topics in the scientific research field; 2) each document comprises at least a title, an abstract and keywords; 3) the collected data cover at least 10 years of literature.

According to the subject vocabulary, the annual subject literature is preprocessed to obtain subject literature word segmentation documents of corresponding years, and the method specifically comprises the following steps:

the following treatments are performed for every year scientific literature:

According to whether each word in the subject word list occurs in the subject document word segmentation document, the word is marked as 1 if it occurs and 0 otherwise, forming the binary vector of the corresponding year.

In the specific implementation process, the specific steps for obtaining the frequent topic collection are as follows:

and, taking the frequent subject set of the last year as a reference, counting for each subject term the number of years in which it occurs; if the number of years exceeds a threshold value, the subject term is retained, otherwise it is deleted, and the frequent hot subject set is obtained.
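The year-count filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `filter_by_year_count`, the dictionary layout (year mapped to the set of frequent topics mined for that year) and the toy data are all assumptions made for the example.

```python
def filter_by_year_count(frequent_by_year, threshold):
    """Keep topics of the last year that occur in more than `threshold` years.

    frequent_by_year: dict mapping year -> set of frequent topics (assumed layout).
    """
    last_year = max(frequent_by_year)
    kept = set()
    for topic in frequent_by_year[last_year]:
        # Count the years whose frequent set contains this topic.
        years = sum(1 for topics in frequent_by_year.values() if topic in topics)
        if years > threshold:
            kept.add(topic)
    return kept


# Toy data (hypothetical): "a" occurs in 3 years, "b" in 2, "c" in 1.
toy = {2015: {"a", "b"}, 2016: {"a"}, 2017: {"a", "b", "c"}}
hot = filter_by_year_count(toy, threshold=1)
```

Here "exceeds a threshold" is read as a strict comparison, so a topic seen in exactly `threshold` years is dropped.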

The specific process of obtaining the updated transaction linked list group is as follows: recursively scanning the transaction linked list to find out frequent item sets; deleting the transaction linked list from the transaction linked list group, and creating a transaction linked list group taking the head element of the transaction linked list as a prefix; and merging the transaction linked list group and the transaction linked list group with the transaction linked list head element as a prefix to obtain the updated transaction linked list group.
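The scan, delete and merge cycle above follows the general idea of recursive elimination. The sketch below is a simplified illustration under stated assumptions: it uses Python dictionaries in place of linked lists, orders items by ascending support, and the names `relim` and `_mine` are hypothetical. It is not claimed to reproduce the exact data structures of the method.

```python
from collections import Counter, defaultdict


def relim(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent item sets."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [i for i, c in counts.items() if c >= min_support]
    # Ascending support order (ties broken by name) fixes the processing order.
    items = sorted(frequent, key=lambda i: (counts[i], i))
    rank = {item: r for r, item in enumerate(items)}
    # One "list" per head element: head -> Counter of remaining tails.
    groups = defaultdict(Counter)
    for t in transactions:
        kept = sorted((i for i in set(t) if i in rank), key=rank.get)
        if kept:
            groups[kept[0]][tuple(kept[1:])] += 1
    result = {}
    _mine(groups, items, (), min_support, result)
    return result


def _mine(groups, items, prefix, min_support, result):
    for item in items:
        tails = groups.pop(item, None)  # remove this head's list from the group
        if not tails:
            continue
        support = sum(tails.values())
        if support >= min_support:
            itemset = prefix + (item,)
            result[frozenset(itemset)] = support
            # Build the list group that has `itemset` as prefix, then recurse.
            sub = defaultdict(Counter)
            for tail, n in tails.items():
                if tail:
                    sub[tail[0]][tail[1:]] += n
            _mine(sub, items, itemset, min_support, result)
        # Merge the tails back into the lists of their new head elements.
        for tail, n in tails.items():
            if tail:
                groups[tail[0]][tail[1:]] += n


# Toy transactions (hypothetical): each one is the word set of a sentence.
toy = [["cow", "dairy"], ["cow", "dairy", "feed"], ["cow", "feed"], ["gene"]]
found = relim(toy, min_support=2)
```

Because every transaction is sorted by the same item order, a list for a given head holds every remaining transaction that contains that head once all earlier heads have been eliminated, which is why the list total equals the support.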

The specific steps for obtaining the hot topic collection comprise:

dividing the frequent subject set to obtain related subject words;

constructing related subject phrases according to related subject words;

each related subject phrase is processed as follows:

deleting repeated subject words to obtain a subject phrase without repetition;

and forming the hot topic set according to each hot topic phrase.

For a topic consisting of a single topic word, the frequency in a given year is computed directly from the binary vector matrix: the 1 entries in the column of that word are summed over the rows belonging to that year, which gives the number of sentences in which the topic word occurs in that year. For a topic composed of multiple topic words, the number of sentences of that year in which all topic words of the topic appear simultaneously is calculated directly from the binary vector matrix and used as the frequency of occurrence in that year. The topic vector is denoted $(F_0, F_1, \dots, F_i, \dots, F_n)$, where $F_0$ denotes the frequency of the topic in the starting year, $F_i$ denotes the frequency of the topic in year (starting year + i), and $F_n$ denotes the frequency of the topic in the ending year, i.e., the frequencies are arranged in ascending order of year.
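The frequency counting described above can be sketched as follows. The layout is an assumption for illustration: `rows_by_year` maps each year to the binary rows (one per sentence) of that year, and `columns` lists the topic words in column order. The name `topic_time_series` is hypothetical.

```python
def topic_time_series(rows_by_year, columns, topic_words):
    """Return the yearly frequencies of a topic, ordered by ascending year."""
    idx = [columns.index(word) for word in topic_words]
    series = []
    for year in sorted(rows_by_year):
        # A sentence counts when every word of the topic is marked 1 in its row.
        freq = sum(1 for row in rows_by_year[year] if all(row[i] == 1 for i in idx))
        series.append(freq)
    return series


# Toy matrix (hypothetical): columns are words, rows are sentences.
columns = ["cow", "dairy", "gene"]
rows_by_year = {
    2016: [[1, 1, 0], [1, 0, 0]],
    2017: [[1, 1, 0], [1, 1, 1], [0, 0, 1]],
}
single = topic_time_series(rows_by_year, columns, ["cow"])
multi = topic_time_series(rows_by_year, columns, ["cow", "dairy"])
```

A single-word topic reduces to a column sum per year, while a multi-word topic counts only the rows in which all of its columns are 1.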

The frequency of occurrence of the topic to be tested is predicted according to the prediction model

$$\mathrm{freq}(X) = w_1 M_1(X) + w_2 M_2(X) + \dots + w_j M_j(X) + \dots + w_J M_J(X).$$

Wherein $X$ represents the topic to be tested, $\mathrm{freq}(X)$ represents the frequency of occurrence of the topic to be tested, $M_j(X)$ represents the value predicted by model $M_j$ for the topic to be tested, $w_j$ represents the weight of the prediction of model $M_j$, and $j = 1, 2, \dots, J$.
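The weighted combination can be illustrated with a small sketch. The two base models here, a least-squares linear trend and a last-value model, are stand-ins chosen only to keep the example self-contained; they are not the models named in the embodiment, and the names `fit_linear_trend`, `fit_last_value` and `ensemble_predict` are hypothetical.

```python
def fit_linear_trend(series):
    """Fit y = a + b*t by least squares; return a one-step-ahead predictor."""
    n = len(series)  # requires at least two points
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda: intercept + slope * n


def fit_last_value(series):
    """Stand-in model that predicts the last observed frequency again."""
    return lambda: series[-1]


def ensemble_predict(series, fitters, weights):
    """freq(X) = sum_j w_j * M_j(X), with every model trained on the same series."""
    models = [fit(series) for fit in fitters]
    return sum(w * model() for w, model in zip(weights, models))


# Toy series (hypothetical yearly frequencies of one topic).
series = [10, 20, 30, 40]
prediction = ensemble_predict(series, [fit_linear_trend, fit_last_value], [0.5, 0.5])
```

With equal weights the result is the plain average of the individual predictions; other weightings only change the `weights` list.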

Potential hot topics may also be predicted according to the methods described above. A potential hot topic is a hot word that is not in the hot topic set RT but may become hot in the future; predicting it helps to understand the development trend of subject hotspots in the next year.

First, the time series vector $X = (x_1, x_2, \dots, x_n)$ of the potential hot topic is derived from its frequency of occurrence in the dataset each year, and its correlation with the time series vectors of all topics in RT is calculated. The invention uses the Pearson correlation coefficient to measure the relevance of two time series vectors, and the principle is as follows:

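Assume that there are two vectors $X = (x_1, x_2, \dots, x_n)$ and $Y = (y_1, y_2, \dots, y_n)$. The Pearson correlation coefficient between X and Y can be calculated using the following formula:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}$$

wherein the covariance between X and Y is defined as (with $\bar{x}$ and $\bar{y}$ denoting the means of X and Y):

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

the variance is defined as:

$$\operatorname{var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

the correlation coefficient can be written as:

$$\rho_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$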

The value of the correlation coefficient lies between -1 and +1. The closer the value is to +1, the stronger the positive correlation between the two variables; the closer the value is to -1, the stronger the negative correlation between the two variables.
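The correlation step can be sketched in a few lines. This uses the standard definition of the Pearson correlation coefficient; the helper `most_correlated` and the toy series are assumptions added for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of covariance and variance cancel, so they are omitted.
    return cov / math.sqrt(var_x * var_y)


def most_correlated(candidate, topic_series):
    """Name of the topic whose time series correlates most with `candidate`."""
    return max(topic_series, key=lambda name: pearson(candidate, topic_series[name]))


# Toy time series (hypothetical yearly frequencies).
topics = {"topic_a": [2, 4, 8], "topic_b": [5, 3, 1]}
best = most_correlated([1, 2, 4], topics)
```

A constant series has zero variance and would cause a division by zero, so such series would need to be excluded before calling `pearson`.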

According to the calculation result, the invention takes the topic TP having the highest correlation with the potential hot topic as the calculation basis, and uses the formula $F = TP_R \times P_L / TP_L$ to predict the occurrence frequency of the potential hot topic, wherein $TP_R$ is the predicted value of TP in the next year, $TP_L$ is the last component of the TP time series vector, i.e., the frequency of TP in the last year of the collected data, and $P_L$ is the frequency of occurrence of the topic to be tested in the last year.

The following is a specific embodiment of the present solution:

First step: constructing a subject vocabulary Dict for the specific subject field according to subject documents, network resources and expert knowledge.

Second step: taking English literature as an example, 36 representative journals in the field of animal genetics and breeding are selected, and 73990 documents published from 2000 to 2017 are collected.

Third step: filtering out irrelevant documents to finally obtain 71990 documents, splitting each article into sentences, and segmenting words according to the subject vocabulary Dict to form a sentence document set.
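A minimal sketch of sentence splitting followed by vocabulary-based word segmentation is given below. The punctuation-based splitting rule, the lowercase tokenization and the toy vocabulary `DICT` are assumptions of this example; real preprocessing for the subject vocabulary Dict would likely be more elaborate.

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment(sentence, vocabulary):
    """Return the vocabulary words that occur in the sentence, in reading order."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t in vocabulary]


# Toy stand-in for the subject vocabulary Dict (hypothetical content).
DICT = {"cow", "dairy", "feed", "intake", "gene", "expression"}
abstract = "Dairy cow feed intake was recorded. Gene expression was then measured!"
sentence_docs = [segment(s, DICT) for s in split_sentences(abstract)]
```

Words outside the vocabulary are discarded at this stage, which is what keeps the later binary vectors limited to domain terms.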

Fourth step: converting the result of the third step into a binary vector matrix of the documents. That is, each sentence forms one vector: if a vocabulary word appears in the sentence, the corresponding entry is marked as 1, otherwise it is marked as 0, as shown in FIG. 2, where each row is a sentence and each column is a word.
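The conversion to a binary vector matrix can be sketched as follows, assuming each sentence has already been reduced to a list of vocabulary words. The function name `to_binary_matrix` and the alphabetical column order are choices made for this example.

```python
def to_binary_matrix(sentence_docs, vocabulary):
    """Return (columns, matrix): one row per sentence, one column per word."""
    columns = sorted(vocabulary)  # fixed column order so rows are comparable
    matrix = []
    for doc in sentence_docs:
        present = set(doc)
        matrix.append([1 if word in present else 0 for word in columns])
    return columns, matrix


# Toy input (hypothetical segmented sentences).
docs = [["cow", "dairy"], ["feed"], []]
columns, matrix = to_binary_matrix(docs, {"cow", "dairy", "feed"})
```

Each row can then be treated as one transaction by the frequent item set mining step.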

Fifth step: on the basis of the fourth step, identifying hot topics by using the frequent item set mining algorithm Relim. The sentence vectors obtained from all documents are taken as transactions, and a minimum support threshold MinSupport is set. Under the minimum support threshold, the subject hot topic set St = { "animal_association_behavior", "animal_behavior", "concentrate_plasma", "cow_dairy", "feed_intake", … "gene_expression" } is obtained.

Sixth step: on the basis of the fifth step, filtering out topics that would cause repetition, finally forming the hot topic set St' = { "animal_association_behavior", "concentrate_plasma", "cow_dairy", "feed_intake", … "gene_expression" }.

Seventh step: generating time series data, as shown in FIG. 3. For each topic in St', the frequencies of the topic in each year are collected into a vector.

Eighth step: building the hot topic prediction model. Linear regression, a support vector machine, radial basis function regression and a radial basis function neural network model are each trained on the time series data, and the weight of each model is 1/4. The predicted value is given by the formula $\mathrm{freq}(\mathrm{TopicWord}) = w_1 M_1 + w_2 M_2 + w_3 M_3 + w_4 M_4$, where each $w_i$ takes the value 1/4.

Ninth step: predicting hot topics, namely using freq(TopicWord) to calculate the likely occurrence frequency of each topic in the hot topic set St', so as to understand how the hot topics change.

Tenth step: predicting potential hot topics. For a hot word Wordp that is not in St', the user may, according to business needs, want to learn about its future hotspot status. According to the Pearson correlation coefficient, the topic "concentrate_plasma" is found to be most correlated with the topic "gene_expression". The occurrence frequency of "gene_expression" in 2017 is 536 and its predicted value for 2018 is 612; the occurrence frequency of "concentrate_plasma" in 2017 is 146, so the predicted value of "concentrate_plasma" for 2018 is 146 × 612 / 536 ≈ 167. This indicates that attention to the topic "concentrate_plasma" tends to increase.
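The scaling formula $F = TP_R \times P_L / TP_L$ used in this step can be checked with a short sketch. The function and parameter names are hypothetical; the three input figures are the ones quoted in the embodiment.

```python
def predict_potential(tp_next, tp_last, p_last):
    """F = TP_R * P_L / TP_L: scale the candidate's last frequency by the
    growth ratio predicted for its most correlated hot topic."""
    return tp_next * p_last / tp_last


# Figures quoted in the embodiment: the reference topic goes from 536 (2017)
# to a predicted 612 (2018); the candidate topic had frequency 146 in 2017.
estimate = predict_potential(tp_next=612, tp_last=536, p_last=146)
```

With these inputs the formula evaluates to roughly 166.7, i.e. an increase over the 2017 value of 146.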

The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A technological research hotspot topic prediction method, characterized in that the prediction method comprises:

determining a database of the corresponding technical research field according to the subject to be tested; the database comprises discipline literature, network resources and expert knowledge;

filtering the frequent topic set to obtain a hot topic set;

converting the hot topic collection into time sequence data;

2. The method for predicting hot topics of scientific research according to claim 1, wherein the preprocessing the annual subject literature according to the topic word list to obtain subject literature word segmentation documents of corresponding years specifically comprises the following steps:

the following treatments are performed for every year scientific literature:

3. The method of claim 1, wherein if a word in the subject document appears in the subject vocabulary, the word is marked as 1, otherwise the word is marked as 0, and a binary vector of the corresponding year is formed.

4. The technological research hotspot topic prediction method of claim 1, wherein the processing the binary vector matrix by using a frequent item set mining algorithm to obtain a frequent topic set specifically comprises:

5. The method for predicting hot topics of scientific research according to claim 4, wherein the step of updating the transaction linked list group according to the increasing order of the supporting degree of the head element to obtain an updated transaction linked list group specifically comprises:

6. The technological research hotspot topic prediction method of claim 1, wherein the filtering the frequent topic set to obtain a hotspot topic set specifically comprises:

dividing the frequent subject set to obtain related subject words;

constructing related subject phrases according to related subject words;

each related subject phrase is processed as follows:

deleting repeated subject words to obtain a subject phrase without repetition;

and forming the hot topic set according to each hot topic phrase.

7. The technological research hotspot topic prediction method of claim 1, wherein the converting the hotspot topic set into time sequence data specifically comprises:

8. The scientific research hotspot topic prediction method of claim 1, wherein the topic prediction model is: $\mathrm{freq}(X) = w_1 M_1(X) + w_2 M_2(X) + \dots + w_j M_j(X) + \dots + w_J M_J(X)$;