CN111694930A - Dynamic knowledge hotspot evolution and trend analysis method

Info

Publication number: CN111694930A
Application number: CN202010528034.3A
Authority: CN
Inventors: 侯颖; 崔运鹏; 刘娟
Original assignee: Agricultural Information Institute of CAAS
Current assignee: Agricultural Information Institute of CAAS
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2020-09-22
Anticipated expiration: 2040-06-11
Also published as: CN111694930B

Abstract

The invention discloses a dynamic knowledge hotspot evolution and trend analysis method, which dynamically models the latent topics in a given set of documents over time and captures the dynamic evolution of those topics. The dynamic modeling also yields the topic preference of every document, so that a user can locate document information through the hot words under a topic. The method presents the changing trend of the words in a topic intuitively in graph form, helps a user understand or predict the development trend of topic words, helps the user locate the literature related to those words through the hot words in the topic, and makes it convenient for the user to quickly evaluate and understand the target subject field.

Description

Dynamic knowledge hotspot evolution and trend analysis method

Technical Field

The invention relates to the field of natural language processing and information extraction, in particular to a dynamic knowledge hotspot evolution and trend analysis method.

Background

With the continuous development of information technology, a large amount of information resources are emerging continuously, from scientific and technical literature, books, news, blogs, web pages and the like. In the face of massive information, in order to effectively extract useful information from explosively-growing electronic documents, new technologies and tools are urgently needed to help users analyze these massive data sets so as to help users quickly evaluate and understand the target subject field.

A large number of texts in a corpus (e.g., scientific literature) have temporal attributes, and some specific text information appears in a specific time period. The text visualization method extracts the key information by analyzing the text resources and displays the key information in a graphical mode, and is one of important branches of information visualization.

At present, dynamic topic modeling of text with time attributes cannot effectively visualize the evolution of hot words over the time series, nor can it retrieve the corresponding document metadata through the hot words. Therefore, for the literature collected by a user, a method is needed that helps the user quickly understand the target field and accurately find the corresponding document metadata from the hot words.

Disclosure of Invention

The invention aims to provide a dynamic knowledge hotspot evolution and trend analysis method. The method dynamically models text over time, captures the dynamic evolution of topics, analyzes how the words in different topics change over time or predicts and extracts the potential development trend of a topic, and can locate the literature related to the hot words through the topic.

The purpose of the invention is realized by the following technical scheme:

The invention comprises the following steps:

S10, the user collects document metadata as required and exports or assembles a tab-separated, UTF-8 encoded record file containing fields such as title and abstract;

S20, preprocessing the exported document metadata;

S30, selecting the abstracts and publication years from the preprocessed document metadata, and performing dynamic modeling analysis of the latent topics and calculation of the document topic preferences to obtain hot words;

S40, visualizing the topic clusters of the hot words, displaying the hot words most relevant to each topic in each year;

S50, visualizing the changing trend of the hot words in a topic: the user selects a word of interest in the topic, and the trend of that word over the time series is displayed as a curve graph.

Furthermore, the collected document metadata mainly comprises fields such as title, abstract and publication year; the file is stored as tab-separated, UTF-8 encoded plain text (csv or txt); and the data set may be exported in this format from the Web of Science core database or be another custom data set meeting the format requirements.

Further, the preprocessing comprises deleting invalid metadata, stemming words, removing stop words, removing meaningless characters and recognizing phrases.

Further, the topic modeling analysis employs variational inference to approximate a posterior distribution. The method is based on the following assumptions:

1) the data is divided into time slices;

2) the topics associated with time slice t evolve from the topics associated with time slice t-1;

3) the documents of each time slice are modeled with a K-component topic model.

Further, the visualization of the topic clusters of the hot words displays the hot words in the model analysis result: the hot words of each time slice (such as a year) are displayed by topic, ordered by the probabilities in the model analysis result.

Further, the visualization method comprises the following specific steps:

1) acquiring a hot word selected by a user;

2) based on a received first interactive instruction, computing an additional graph from the hot word information in the topic dynamic modeling analysis result, wherein the graph comprises data points, and rendering the hot word information in the analysis result to obtain the corresponding data point values;

3) based on a received second interactive instruction, rendering the additional graph by connecting the data points on the grid to obtain a trend curve.

Further, the topic dynamic modeling calculates coherence values for topic numbers of 5, 10, 15, 20 and 25, respectively, to determine the optimal number of topics.
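The selection of the topic number by coherence can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say which coherence measure is used, so a UMass-style measure is assumed here, and the names `umass_coherence`, `mean_coherence`, `pick_topic_number` and the toy data are hypothetical. In practice each candidate number (5, 10, 15, 20, 25) would come from a separately trained model.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic: sum of log((D(wi, wj) + 1) / D(wi))
    over word pairs, where wi is the higher-ranked word of the pair."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        denom = doc_freq(top_words[i])
        if denom:
            score += math.log((doc_freq(top_words[i], top_words[j]) + 1) / denom)
    return score

def mean_coherence(topics, documents):
    """Average coherence over all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)

def pick_topic_number(topics_by_k, documents):
    """Return the candidate topic number with the highest mean coherence."""
    scores = {k: mean_coherence(topics, documents) for k, topics in topics_by_k.items()}
    return max(scores, key=scores.get), scores

# Toy data (assumption): top words that two candidate models might produce.
documents = [["stream", "algorithm"], ["stream", "algorithm"], ["video", "network"]]
topics_by_k = {5: [["stream", "algorithm"]], 10: [["stream", "video"]]}
best, scores = pick_topic_number(topics_by_k, documents)
```

With real models, `topics_by_k` would hold the top words of every topic for each candidate number, and a library coherence implementation could replace `umass_coherence`.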

Further, the generative process for the sequence corpus at time slice $t$ in the topic dynamic modeling is as follows:

1) generate the topic-word probability distribution $\beta_t$ at time slice $t$ according to $\beta_t \mid \beta_{t-1} \sim \mathcal{N}(\beta_{t-1}, \sigma^2 I)$;

2) generate the prior topic distribution $\alpha_t$ at time slice $t$ according to $\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$;

3) for each document $d$ at time slice $t$, generate the document-topic probability distribution $\eta$ according to $\eta \sim \mathcal{N}(\alpha_t, a^2 I)$;

4) for each word $n$ in document $d$, generate the word-topic assignment $Z$ according to $Z \sim \mathrm{Mult}(\pi(\eta))$, and generate the word $W_{t,d,n}$ according to $W_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{t,Z}))$.
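The four generative steps above can be simulated for a single time slice as a rough sketch. It assumes that $\pi$ is the softmax mapping from natural parameters to multinomial probabilities, as in the standard dynamic topic model; the function names and the variance values are illustrative choices, not values from the text.

```python
import math
import random

def softmax(x):
    """The mapping pi (assumed): natural parameters -> multinomial probabilities."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def gaussian_step(prev, std, rng):
    """Draw one sample from N(prev, std^2 I)."""
    return [rng.gauss(p, std) for p in prev]

def generate_slice(beta_prev, alpha_prev, n_docs, n_words, rng,
                   sigma=0.1, delta=0.1, a=1.0):
    """Simulate time slice t given the parameters of time slice t-1."""
    beta_t = [gaussian_step(b, sigma, rng) for b in beta_prev]  # step 1: topics drift
    alpha_t = gaussian_step(alpha_prev, delta, rng)             # step 2: prior drifts
    docs = []
    for _ in range(n_docs):
        eta = gaussian_step(alpha_t, a, rng)                    # step 3: document-topic
        theta = softmax(eta)
        words = []
        for _ in range(n_words):                                # step 4: topic, then word
            z = rng.choices(range(len(theta)), weights=theta)[0]
            word_probs = softmax(beta_t[z])
            words.append(rng.choices(range(len(word_probs)), weights=word_probs)[0])
        docs.append(words)
    return beta_t, alpha_t, docs

rng = random.Random(0)
K, V = 3, 6  # toy sizes (assumption): 3 topics, vocabulary of 6 word ids
beta0 = [[0.0] * V for _ in range(K)]
alpha0 = [0.0] * K
beta1, alpha1, docs = generate_slice(beta0, alpha0, n_docs=2, n_words=5, rng=rng)
```

Calling `generate_slice` repeatedly, feeding each slice's `beta_t` and `alpha_t` into the next call, would chain the slices as in assumption 2).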

Further, the approximate variational posterior used in the topic dynamic modeling analysis and in the document preference calculation is as follows:

The variational approach described above optimizes the parameters of a distribution over the latent variables (the topics $\beta_{t,k}$, the mixture proportions $\theta_{t,d}$, and the topic indices $Z_{t,d,n}$). In the variational distribution of $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$, a dynamic model with Gaussian "variational observations" is posited, which preserves the sequential structure of the topics. In the variational distribution of the document-level latent variables, each proportion vector $\theta_{t,d}$ is given a free Dirichlet parameter $\gamma_{t,d}$, and each topic index $Z_{t,d,n}$ is given a free multinomial parameter $\phi_{t,d,n}$. The topic variational observations are optimized using the conjugate gradient method, and the resulting variational approximation of the natural topic parameters $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$ incorporates the temporal dynamics.
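One form of the posterior that is consistent with this description, following the standard dynamic topic model, is sketched below. It is an assumption rather than a formula taken from the text: the symbols $\hat{\beta}_{k,t}$ (the Gaussian variational observations), $D_t$ (the number of documents in time slice $t$) and $N_{t,d}$ (the number of words in document $d$) do not appear in the text.

```latex
\prod_{k=1}^{K} q\left(\beta_{k,1}, \ldots, \beta_{k,T} \mid \hat{\beta}_{k,1}, \ldots, \hat{\beta}_{k,T}\right)
\times
\prod_{t=1}^{T} \left(
  \prod_{d=1}^{D_t} q\left(\theta_{t,d} \mid \gamma_{t,d}\right)
  \prod_{n=1}^{N_{t,d}} q\left(Z_{t,d,n} \mid \phi_{t,d,n}\right)
\right)
```

The first factor carries the time dependence of each topic, while the second factorizes over documents and words within each time slice.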

One or more embodiments of the present invention may have the following advantages over the prior art:

The dynamic knowledge hotspot evolution and trend analysis method provided by the invention dynamically models text over time, visualizes the modeling analysis result, analyzes how the words in different topics change over time or predicts and extracts the potential development trend of a topic, and helps a user locate the document information related to the hot words through the topic, so that the user can quickly evaluate and understand the target subject field.

Drawings

FIG. 1 is a flow chart of the dynamic knowledge hotspot evolution and trend analysis method;

FIG. 2 is a flow chart of the preprocessing in the method;

FIG. 3 is a diagram of the generative process for the sequence corpus at time slice t in the topic dynamic modeling analysis of the method;

FIG. 4 is a visualization of the topic dynamic modeling analysis results of the method;

FIG. 5 is a graph of the changing trend of a hot word in the method;

FIG. 6 is a diagram of finding document metadata in the method.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

As shown in fig. 1, a dynamic knowledge hotspot evolution and trend analysis method includes:

Step S10: collecting document metadata;

The user collects document metadata as required. The metadata mainly comprises fields such as title, abstract and publication year; the file is stored as tab-separated, UTF-8 encoded plain text (csv or txt); and the data set may be exported in this format from the Web of Science core database or be another custom data set meeting the format requirements.
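Reading such a record file can be sketched as follows. This is a minimal illustration under stated assumptions: the column names `title`, `abstract` and `year` are hypothetical (an actual Web of Science export uses its own field tags), and an in-memory sample stands in for the file.

```python
import csv
import io

def read_metadata(stream):
    """Read tab-separated records into a list of dicts keyed by the header names."""
    reader = csv.DictReader(stream, delimiter="\t")
    return [dict(row) for row in reader]

# For a real file one would pass open(path, encoding="utf-8", newline="").
sample = (
    "title\tabstract\tyear\n"
    "Stream mining\tWe propose a stream algorithm.\t2008\n"
    "Video networks\tVideo delivery over networks.\t2009\n"
)
records = read_metadata(io.StringIO(sample))
```

Each record is kept as plain strings here; converting the year and discarding invalid records is left to the preprocessing step.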

Step S20: preprocessing the collected document metadata;

This step preprocesses the abstract and publication year fields so that they meet the format requirements of the next step, the dynamic modeling analysis of the latent topics of the text, as shown in FIG. 2. Preprocessing comprises deleting invalid metadata, stemming words, removing stop words, removing meaningless characters, and recognizing phrases.
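The preprocessing operations listed above can be sketched in pure Python as follows. Everything here is a simplified assumption: the stop word list, the phrase dictionary and the suffix-stripping stemmer are toy stand-ins, and a real system would likely use an established stemmer and a learned phrase model.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "we", "for", "is", "are", "to"}  # toy list
PHRASES = {("data", "stream"): "data_stream"}  # hypothetical phrase dictionary

def naive_stem(word):
    """Very rough suffix stripping, standing in for a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(abstract):
    # Remove meaningless characters: keep lowercase letters and whitespace only.
    text = re.sub(r"[^a-z\s]", " ", abstract.lower())
    # Remove stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Recognize phrases (merge listed adjacent pairs), stemming the other tokens.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(naive_stem(tokens[i]))
            i += 1
    return merged

def clean_records(records):
    """Delete invalid metadata (no abstract or non-numeric year), then tokenize."""
    cleaned = []
    for r in records:
        if r.get("abstract", "").strip() and str(r.get("year", "")).isdigit():
            cleaned.append({"year": int(r["year"]), "tokens": preprocess(r["abstract"])})
    return cleaned

tokens = preprocess("We propose algorithms for the data stream (2008)!")
cleaned = clean_records([{"abstract": "", "year": "2009"},
                         {"abstract": "Streams", "year": "2008"}])
```

The output of `clean_records` pairs each document's tokens with its publication year, which is the input the time-sliced modeling step needs.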

Step S30: dynamic modeling analysis of the topics;

This is the core analysis step of the system and performs its main computation.

The topic dynamic modeling analysis employs variational inference to approximate a posterior distribution. The method is based on the following assumptions:

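1) the data is divided into time slices, for example by year;

2) the topics associated with time slice t evolve from the topics associated with time slice t-1;

3) the documents of each time slice are modeled with a K-component topic model.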

The generative process for the sequence corpus at time slice t is as set out above and is shown in FIG. 3.

Therefore, the approximate variational posterior formula of the entire model is:

The dynamic model preserves the sequential structure of the topics. In the variational distribution of the document-level latent variables, each proportion vector $\theta_{t,d}$ is given a free Dirichlet parameter $\gamma_{t,d}$, and each topic index $Z_{t,d,n}$ is given a free multinomial parameter $\phi_{t,d,n}$. The topic variational observations are optimized using the conjugate gradient method, and the resulting variational approximation of the natural topic parameters $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$ incorporates the temporal dynamics.

The results of the topic dynamic modeling analysis in this step are exemplified as follows:

1) the time slice sequence, divided by year, e.g. [2008, 2009, 2010];

2) for each topic, the most relevant words in each time slice together with their probabilities, e.g. (because the actual hot words are too numerous, only the top 3 hot words of each time slice are listed here):

{0:['0.0140231014*application+0.0138825359*stream+0.0123572007*datum','0.0140471977*application+0.0138904899*stream+0.0124764708*datum','0.0139453390*stream+0.0138278045*application+0.0128339716*datum'],

1:['0.0125233824*video+0.0118972892*propose+0.0103776871*network','0.0128266652*video+0.0116539875*propose+0.0104339393*network','0.0132288953*video+0.0113926101*propose+0.0103314936*network'],

2:['0.0201108175*stream+0.0160505421*use+0.0143336972*compute','0.0204567699*stream+0.0159369303*use+0.0145109152*compute','0.0204072031*stream+0.0159959192*use+0.0144690685*compute'],

3:['0.0224408733*algorithm+0.0203485369*stream+0.0184875342*compute','0.0227468752*algorithm+0.0205000072*stream+0.0185545889*compute','0.0230975940*algorithm+0.0206288220*stream+0.0185272671*compute'],

4:['0.0209717427*use+0.0150956938*stream+0.0111105387*propose','0.0207826879*use+0.0151531082*stream+0.0112516701*propose','0.0203461357*use+0.0151239365*stream+0.0117703962*propose']

}
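The result strings above pack each word with its probability in the form `probability*word+probability*word`. A small sketch for turning them into per-year (word, probability) pairs follows; the function name `parse_topic_string` and the variable names are hypothetical, and the sample repeats topic 1 from the example.

```python
def parse_topic_string(s):
    """'0.0125*video+0.0119*propose' -> [('video', 0.0125), ('propose', 0.0119)]"""
    pairs = []
    for term in s.split("+"):
        prob, word = term.split("*", 1)
        pairs.append((word.strip(), float(prob)))
    return pairs

years = [2008, 2009, 2010]
result = {
    1: ['0.0125233824*video+0.0118972892*propose+0.0103776871*network',
        '0.0128266652*video+0.0116539875*propose+0.0104339393*network',
        '0.0132288953*video+0.0113926101*propose+0.0103314936*network'],
}
# Map topic -> year -> list of (word, probability), keeping the model's ordering.
parsed = {topic: {year: parse_topic_string(s) for year, s in zip(years, slices)}
          for topic, slices in result.items()}
```

Keeping the pairs in the order given preserves the probability ranking used when the hot words are displayed.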

3) document topic preferences, for example, topic distribution for document 20 is:

[1.17577895e-04,9.99529688e-01,1.17577895e-04,1.17577895e-04,1.17577895e-04]

It can be seen that, of the 5 topics, document 20 prefers topic 1. The topic preference of each document is computed in turn and stored in a table together with the document metadata.
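Reading the preference off a topic distribution amounts to taking the index of the largest probability. A minimal sketch, using the distribution of document 20 from the example (the function name and the record layout are hypothetical):

```python
def preferred_topic(distribution):
    """Index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda k: distribution[k])

doc_20 = [1.17577895e-04, 9.99529688e-01, 1.17577895e-04, 1.17577895e-04, 1.17577895e-04]

# Store the preference next to the document metadata (illustrative record layout).
record = {"doc_id": 20, "topic": preferred_topic(doc_20)}
```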

Step S40: visualizing the hot word topic clustering result;

The result returned by the topic dynamic modeling analysis in the previous step includes the time slice sequence and, for each topic and time slice, the most relevant words with their probabilities. Based on the hot words in the analysis result, the top 50 hot words most relevant to each topic in each year are displayed, as shown in FIG. 4.

Step S50: visualizing the changing trend of the hot words in a topic;

As shown in FIG. 5, the user selects a word of interest in a topic; using the time slice sequence, hot words and corresponding probabilities returned by the topic dynamic modeling analysis, the changing trend of the hot word over the time series is plotted as a curve.
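The data behind such a trend curve is the word's probability in each time slice of the chosen topic. A minimal sketch follows, using the values of topic 1 from the example; the function name `word_trend` is hypothetical, and treating a word that is absent from a slice as probability 0.0 is an assumption.

```python
def word_trend(slices, word):
    """slices: list of (year, {word: probability}) for one topic, in time order."""
    return [(year, probs.get(word, 0.0)) for year, probs in slices]

topic_1 = [
    (2008, {"video": 0.0125233824, "propose": 0.0118972892, "network": 0.0103776871}),
    (2009, {"video": 0.0128266652, "propose": 0.0116539875, "network": 0.0104339393}),
    (2010, {"video": 0.0132288953, "propose": 0.0113926101, "network": 0.0103314936}),
]
trend = word_trend(topic_1, "video")
```

The resulting (year, probability) pairs can be handed to any plotting library to draw the curve; in this example "video" rises over the three years while "propose" falls.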

As shown in FIG. 6, the user selects one or more words of interest in a topic and the relationship between the words (AND, OR), and the document metadata containing the selected words in that relationship under the selected topic is retrieved according to the calculated document topic preferences.
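The lookup described above can be sketched as a filter over the stored records. This is a simplified illustration: the record layout (`topic`, `tokens`) and the function name `find_documents` are hypothetical, and matching against the preprocessed abstract tokens is an assumption about where the words are searched.

```python
def find_documents(records, topic, words, mode="and"):
    """Return the records that prefer `topic` and contain the words under AND or OR."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    combine = all if mode == "and" else any
    return [r for r in records
            if r["topic"] == topic and combine(w in r["tokens"] for w in words)]

records = [
    {"title": "A", "topic": 1, "tokens": ["video", "network"]},
    {"title": "B", "topic": 1, "tokens": ["video", "propose"]},
    {"title": "C", "topic": 2, "tokens": ["video", "network"]},
]
both = find_documents(records, 1, ["video", "network"], mode="and")
either = find_documents(records, 1, ["video", "network"], mode="or")
```

Filtering on the preferred topic first restricts the search to documents under the selected topic, so document C is excluded even though it contains both words.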

After the topic dynamic modeling analysis, a metadata search request carrying search keywords is obtained, and the search keywords are matched against the retrieval keywords of the target documents. If the search keywords match the retrieval keywords of a target document, the description information of that target document is returned, the target document being the document matched by the search keywords. In this way the method captures the dynamic evolution of topics over time, obtains the topic preference of every document through dynamic modeling, presents the changing trend of the words in a topic intuitively as a curve graph, and finds the corresponding document metadata when the user selects one or more words of interest in a topic and the relationship among them. It helps the user understand or predict the development trend of topic words, helps the user locate the relevant document information through the hot words under a topic, and makes it convenient for the user to quickly evaluate and understand the target subject field.

Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

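1. A dynamic knowledge hotspot evolution and trend analysis method, characterized by comprising the following steps:

S10, the user collects document metadata as required and exports or assembles a tab-separated, UTF-8 encoded record file containing fields such as title and abstract;

S20, preprocessing the exported document metadata;

S30, selecting the abstracts and publication years from the preprocessed document metadata, and performing dynamic modeling analysis of the latent topics and calculation of the document topic preferences to obtain hot words;

S40, visualizing the topic clusters of the hot words, displaying the hot words most relevant to each topic in each year;

S50, visualizing the changing trend of the hot words in a topic: the user selects a word of interest in the topic, and the trend of that word over the time series is displayed as a curve graph.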

2. The method of claim 1, wherein the collected document metadata mainly comprises fields such as title, abstract and publication year; the file is stored as tab-separated, UTF-8 encoded plain text (csv or txt); and the data set may be exported in this format from the Web of Science core database or be another custom data set meeting the format requirements.

3. The method of claim 1, wherein the preprocessing comprises deleting invalid metadata, stemming words, removing stop words, removing meaningless characters, and recognizing phrases.

4. The method of claim 1, wherein the topic modeling analysis uses variational inference to approximate a posterior distribution, the method being based on the following assumptions:

1) the data is divided into time slices;

2) the topics associated with time slice t evolve from the topics associated with time slice t-1;

3) the documents of each time slice are modeled with a K-component topic model.

5. The dynamic knowledge hotspot evolution and trend analysis method of claim 1, wherein the visualization of the topic clusters of the hotspot words is a display of the hotspot words in the model analysis results, the hotspot words of each time slice (such as year) are displayed according to topic classification, and the words are displayed in order of probability of the model analysis results.

6. The dynamic knowledge hotspot evolution and trend analysis method of claim 1, wherein the visualization method comprises the following specific steps:

1) acquiring a hot word selected by a user;

7. The method of claim 1 or 4, wherein the topic dynamic modeling calculates coherence values with different topic numbers of 5, 10, 15, 20, and 25 to obtain the optimal topic number.

8. The method for dynamic knowledge hotspot evolution and trend analysis according to claim 1 or 4, wherein the generation process of the sequence corpus on the analysis time slice t in the topic dynamic modeling is as follows:

9. The method of claim 1, wherein the topic dynamic modeling analysis document or the approximate variational posterior formula used in the preference calculation is: