CN111339286B

CN111339286B - Method for exploring the research status of an institution based on topic visualization

Info

Publication number: CN111339286B
Application number: CN202010092905.1A
Authority: CN
Inventors: 秦红星; 曹鑫霞
Original assignee: Sichuan Chaoyihong Technology Co ltd
Current assignee: Guangzhou Southern Wanfang Data Co.,Ltd.
Priority date: 2020-02-14
Filing date: 2020-02-14
Publication date: 2024-02-09
Anticipated expiration: 2040-02-14
Also published as: CN111339286A

Abstract

The invention relates to a method for exploring the research status of an institution based on topic visualization, and belongs to the technical field of visualization. The method comprises the following steps: S1, acquiring and preprocessing research data: determining the institution to be studied and acquiring its SCI academic literature data, extracting the required research fields, and preprocessing the resulting research corpus; S2, processing the selected corpus with TF-IDF feature extraction and LDA topic-model text mining, extracting hot research topics and their topic words, and clustering the academic literature by topic; and S3, visually presenting the clustered topics together with the other dimensions of information in the academic literature data, and analyzing the results from multiple dimensions. The invention helps to grasp and track the current research status of the institution, so that researchers can better capture the frontiers and hot spots of disciplinary development and avoid duplicated research.

Description

Method for exploring the research status of an institution based on topic visualization

Technical Field

The invention belongs to the technical field of visualization, and relates to a method for exploring the research status of an institution based on topic visualization.

Background

In recent years the number of researchers has grown rapidly, and with the wide application of computer networks and information technology, academic literature has become increasingly massive, diverse and immediate, so that the development trends of research hot spots can no longer be tracked and processed manually. Visual analysis is an emerging technology that has grown out of information visualization and scientific visualization. It is an effective means for people to understand and interpret large-scale, complex situations: visualization algorithms produce graphical visual models that display multi-dimensional or high-dimensional data. Combined with human-computer interaction, such visual models also support dynamic, multi-angle analysis.

Literature hot-spot analysis based on topic models is an important way of exploring the state of research in a field. It is mainly carried out by analyzing the academic literature or patents published in the field, academic literature being an important reflection of research progress in the field. Existing literature analyses mostly improve the topic model, visualize the multi-scale information related to the topic model of a field after topic modeling, or design interactive operations on that multi-scale information.

The academic literature published by an institution carries its research results across many disciplines. Scientific research is currently becoming increasingly diversified, and research topics are numerous, heterogeneous and disordered. Researchers are many, and each research institution has its own emphasis. By topic-modeling the academic literature of a research institution and combining the topics with several other dimensions of information, visual analysis makes it possible to understand and track the institution's current research status and development.

Disclosure of Invention

In view of the above, the present invention aims to provide a method for exploring the research status of an institution based on topic visualization, addressing the problem that existing topic-model visual analysis systems do not examine the research status of a particular institution.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A method for exploring the research status of an institution based on topic visualization, the method comprising the following steps:

S1: acquiring and processing research data:

determining the institution to be studied, and acquiring the SCI academic literature data of that institution;

extracting the required research fields;

preprocessing the extracted fields;

S2: processing the selected corpus with TF-IDF feature extraction and LDA topic-model analysis:

extracting TF-IDF features from the preprocessed data, and building a feature vector space model of the whole corpus;

building a topic model with the LDA algorithm from the feature vector space model of the corpus, estimating the topic model by Gibbs sampling, and outputting and storing the topic-word matrix;

performing cluster analysis on the output topic-word matrix, and storing and outputting the clustering result;

S3: visually presenting the clustered topics and the other dimensions of information in the academic literature, and analyzing the results from multiple dimensions:

using a theme river chart, a word cloud and a line chart to show, respectively, the change of topic intensity over time, the research field represented by each topic, and the change in how often each topic is cited;

using a tree diagram and a bar chart to show, respectively, the hierarchical structure under each topic and the weight of each sub-institution's academic influence on the topic;

using a scatter plot and a line chart to show, respectively, the intensity change and the research trend of the different topics under each sub-institution, so as to identify disciplinary strengths.

Optionally, in step S1, the institution to be studied is determined on the basis of the SCI academic literature of the last five years whose author address contains the institution.

Optionally, in step S1, regular-expression matching is used, and the retained data include the title, authors, time, citation frequency, keywords and abstract of each document; the corpus consists of the keywords, titles and abstracts.

Optionally, in step S1, the preprocessing includes cleaning and denoising, English word segmentation, stop-word removal and stemming (reduction of words to their roots).

Optionally, in step S2, the TF-IDF algorithm is used to extract features and generate the text vector space, specifically as follows:

TF is the frequency with which a word occurs in a document, and IDF reflects how many documents of the document set contain the word; multiplying TF by IDF gives the importance of a specific word to a document. The importance is computed over all dimensions of each document, which yields the TF-IDF feature vector of each document:

$$\text{Feature-Vector} = \{f_1, f_2, f_3, \ldots, f_n\} \quad (1)$$

In formula (1), each TF-IDF feature of a document is computed as:

$$f_i = tf(w_i, d_i) \cdot idf(w_i, D) \quad (2)$$

In formula (2), the tf value is computed as:

$$tf(w_i, d_i) = \frac{n_i}{\sum_k n_k}$$

where $n_i$ is the number of times the word $w_i$ occurs in document $d_i$, and the denominator is the total number of occurrences of all words in document $d_i$;

the idf value in formula (2) is computed as:

$$idf(w_i, D) = \log \frac{|D|}{|\{d_i : w_i \in d_i\}|}$$

where $|D|$ is the total number of documents in the document set, $d_i$ is a particular document, and $w_i$ is a particular word, i.e. a feature;

the TF-IDF feature vectors of all documents form a (word, tfidf) matrix, which is the document feature vector space model.

Optionally, in step S2, the topic-word matrix is obtained with the LDA algorithm, which is based on the bag-of-words model, specifically as follows:

LDA assumes that documents are produced from a mixture of topics, each document being generated as follows:

generating the document length $N$ from a Poisson distribution with the global parameter $\beta$;

generating the topic distribution $\theta$ of the current document from a Dirichlet distribution with the global parameter $\alpha$;

for each of the $N$ words of the current document: generating a topic index $z_n$ from a multinomial distribution with parameter $\theta$, and generating a word $w_n$ from a multinomial distribution jointly parameterized by $\theta$ and $z_n$;
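The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `generate_document`, `mean_length` and the toy vocabulary are hypothetical, and the word is drawn from a per-topic word distribution `phi[z]`, which is how LDA is commonly formulated; the text itself only says that the word distribution is determined jointly with the topic index.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's method for drawing a Poisson-distributed integer."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, phi, vocab, mean_length, rng):
    n_words = sample_poisson(mean_length, rng)   # document length N
    theta = sample_dirichlet(alpha, rng)         # topic mixture of this document
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)       # topic index z_n
        w = sample_categorical(phi[z], rng)      # word index drawn under topic z_n
        words.append(vocab[w])
    return theta, words

# Toy example with two made-up topics over a four-word vocabulary.
rng = random.Random(0)
vocab = ["graph", "render", "gene", "protein"]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document([0.5, 0.5], phi, vocab, 8, rng)
```

Training runs this process in reverse: the words are observed, and the topic assignments and distributions are inferred from them.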

Training process (Gibbs sampling):

randomly assigning a topic number $z$ to each word $w$ of each document;

counting, for each topic $z_i$, the number of occurrences of each word $w$ under that topic, and, for each document $n$, the number of words $w$ assigned to topic $z_i$;

each time, excluding the topic assignment $z_i$ of the current word $w$ and estimating, from the topic assignments of all other words, the probability that the current word $w$ is assigned to each of the topics $z_1, z_2, \ldots, z_k$, i.e. computing $p(z_i \mid z_{-i}, d, w)$; once the probability distribution of the current word over all topics $z_1, z_2, \ldots, z_k$ has been obtained, resampling a new topic $z_1$ for the word according to that distribution; updating the topic of the next word in the same way, until the topic distribution $\theta_n$ of each document and the word distribution $\phi_k$ of each topic converge;

finally, outputting the parameters to be estimated, $\theta_n$ and $\phi_k$, and obtaining the topic $z_{k,n}$ of each word;

training the LDA model by Gibbs sampling yields the topic-word co-occurrence frequency matrix.
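For reference, the conditional $p(z_i \mid z_{-i}, d, w)$ is usually written in the following collapsed form when symmetric Dirichlet priors are used. The notation is an assumption and does not come from the text: $n_{d,k}^{-i}$ is the number of words in document $d$ assigned to topic $k$, $n_{k,w}^{-i}$ the number of times word $w$ is assigned to topic $k$, and $n_k^{-i}$ the total number of words assigned to topic $k$, each excluding the current word; $V$ is the vocabulary size, $K$ the number of topics, and $\eta$ the prior on the topic-word distributions.

$$p(z_i = k \mid z_{-i}, d, w) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right) \frac{n_{k,w}^{-i} + \eta}{n_k^{-i} + V\eta}$$

Once sampling has settled, the document-topic and topic-word distributions can be read off the counts:

$$\theta_{d,k} = \frac{n_{d,k} + \alpha}{n_d + K\alpha}, \qquad \phi_{k,w} = \frac{n_{k,w} + \eta}{n_k + V\eta}$$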

Optionally, in step S2, cluster analysis is performed on the obtained LDA topic model to obtain academic-literature clustering data, which comprise the academic documents contained in each topic cluster and the keywords contained in each topic.

Optionally, in step S3, the other dimensions of information include: time, author, citation frequency, and the sub-institution to which the author belongs.

Optionally, in step S3, the D3.js visual analysis technique is used to analyze the results from multiple dimensions, specifically as follows:

a theme river chart, a word cloud and a line chart show, respectively, the change of topic intensity over time, the research field represented by each topic, and the change in how often each topic is cited, giving an overview of the institution's research as a whole; a tree diagram and a bar chart show, respectively, the hierarchical structure under each topic and the weight of each sub-institution's academic influence on the topic, revealing the relationship between topics and sub-institutions; a scatter plot and a line chart show, respectively, the intensity change and the research trend of the different topics under each sub-institution, so as to identify disciplinary strengths.

The invention has the following beneficial effects:

The exploration of an institution's research status based on topic visualization builds on data visualization and combines visual analysis with topic modeling. It displays the relationships among the data intuitively by means of visual symbols, deepening the user's understanding of the patterns contained in the literature data. The object of study is the SCI academic literature of a specific institution, so the data are reasonably representative. A time dimension is added to the topic elements obtained by topic modeling, so that the evolution of the institution's research content over time can be analyzed; analyzing the frequency of each topic gives the degree of attention the institution pays to the research field that the topic represents, which supports analysis and explanation of the future research trend of that field; and analyzing the research status of each sub-institution along several dimensions helps researchers keep up with the development of their discipline, make timely decisions and avoid duplicated research.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.

Drawings

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of the method of the present invention for exploring the research status of an institution based on topic visualization;

FIG. 2 is a theme river chart of an institution in an example of the present invention;

FIG. 3 shows the word clouds representing the individual topics in an example of the present invention;

FIG. 4 is a partial line chart showing the citation frequency and the publication volume of each topic in an example of the present invention;

FIG. 5 is a scatter plot showing the annual research status of the topics within a sub-institution in an example of the present invention.

Detailed Description

Other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure below, which describes embodiments of the present invention by way of specific examples. The invention may also be implemented or applied in other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided with the following embodiments only illustrate the basic idea of the invention schematically, and that the following embodiments and the features in them may be combined with one another where no conflict arises.

The drawings are for illustrative purposes only; they are schematic rather than physical representations and are not intended to limit the invention. To better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product. Those skilled in the art will appreciate that certain well-known structures, and their descriptions, may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "front" and "rear" that indicate an orientation or positional relationship are based on the orientation or positional relationship shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore merely illustrative and should not be construed as limiting the present invention; their specific meaning can be understood by those of ordinary skill in the art according to the circumstances.

As shown in FIG. 1, the present invention provides a method for exploring the research status of an institution based on topic visualization, comprising the following steps:

S1: acquiring and processing research data;

S2: processing the selected corpus with TF-IDF feature extraction and LDA topic-model analysis;

S3: visually presenting the clustered topics and the other dimensions of information in the academic literature, and analyzing the results from multiple dimensions.

In this example, the data are acquired through the academic literature search of Web of Science: the scientific literature of the institution to be studied is taken to be the literature of the last five years, including academic papers, journal articles and the like, whose author address contains the institution, 2975 documents in total.

In this embodiment, the 2975 records in unstructured plain-text format are preprocessed: the required research fields are extracted by regular-expression matching, and the retained data comprise the title, authors, time, citation frequency, keywords and abstract of each document. The corpus consists of the keywords, titles and abstracts.

In this embodiment, text preprocessing operations such as English word segmentation, stop-word filtering and stemming are performed on the three dimensions of information (keywords, titles and abstracts) to obtain a denoised corpus.
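The field extraction described above might look like the following sketch. It assumes the records are in the tagged plain-text export format of Web of Science, with two-letter field tags (`TI`, `AU`, `PY`, `TC`, `DE`, `AB`), indented continuation lines and an `ER` line closing each record; the function name `parse_records` and the sample record are made up for illustration.

```python
import re

# Assumed two-letter field tags of a Web of Science plain-text export.
FIELD_TAGS = {"TI": "title", "AU": "authors", "PY": "year",
              "TC": "times_cited", "DE": "keywords", "AB": "abstract"}

TAG_LINE = re.compile(r"^([A-Z][A-Z0-9]) (.*)$")

def parse_records(text):
    """Split a tagged export into records, keeping only the fields in FIELD_TAGS."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if line.strip() == "ER":                 # end of one record
            if current:
                records.append(current)
            current, tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
        elif line.startswith("   ") and tag:     # continuation of the previous tag
            value = line.strip()
        else:
            tag = None
            continue
        if tag in FIELD_TAGS:
            current.setdefault(FIELD_TAGS[tag], []).append(value.strip())
    # Authors stay a list; every other field is joined into one string.
    return [{k: v if k == "authors" else " ".join(v) for k, v in r.items()}
            for r in records]

# Made-up record, for illustration only.
sample = """PT J
AU Doe, J
   Roe, R
TI Topic visualization of
   institutional research
DE topic model; visualization
AB We study topics.
PY 2019
TC 12
ER
"""
records = parse_records(sample)
```

Keeping the tag-to-field mapping in one dictionary makes it easy to retain further fields later, for example the author address used to identify the sub-institution.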
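A minimal sketch of these three preprocessing operations is given below. The stop-word list is a tiny illustrative subset and the suffix stripper is a crude stand-in for a real stemming algorithm such as Porter's; both are assumptions, since the text does not name the tools that were used.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
              "the", "to", "we", "with"}
SUFFIXES = ("ing", "ed", "es", "s")   # crude stand-in for a real stemmer

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words and very short tokens, then stem."""
    return [stem(t) for t in tokenize(text)
            if t not in STOP_WORDS and len(t) > 2]

tokens = preprocess("We are modeling the topics of visualized documents.")
# tokens == ['model', 'topic', 'visualiz', 'document']
```

Stemmed forms such as `visualiz` are not dictionary words; they only need to be consistent so that variants of the same word fall onto one feature.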

In this embodiment, the TF-IDF algorithm is used to extract and generate features for each long text sample in the denoised corpus. TF-IDF is a statistical method for weighting text features, commonly used to evaluate how important a word is to a particular document in a document set. It consists of two parts: TF and IDF.

TF is the frequency with which the word occurs in a document. Let $n_i$ be the number of times the word $w_i$ occurs in document $d_i$, and let the denominator be the total number of occurrences of all words in document $d_i$; the TF value of the word is then:

$$tf(w_i, d_i) = \frac{n_i}{\sum_k n_k}$$

IDF reflects how many documents of the document set contain the word. Let $|D|$ be the total number of documents in the document set, $d_i$ a particular document and $w_i$ a particular word; the IDF value of the word is then:

$$idf(w_i, D) = \log \frac{|D|}{|\{d_i : w_i \in d_i\}|}$$

Multiplying TF by IDF gives the importance of a specific word to a document; the TF-IDF value of the word for a document $d$ is:

$$f_i = tf(w_i, d_i) \cdot idf(w_i, D)$$

A document has many features, each of which is a word. The importance is computed over all dimensions of each document, which yields the TF-IDF feature vector of each document:

$$\text{Feature-Vector} = \{f_1, f_2, f_3, \ldots, f_n\}$$

After TF-IDF features have been extracted for all documents in the corpus, the TF-IDF feature vectors of all documents form a (word, tfidf) matrix, which is the feature vector space model of the corpus.
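The computation can be illustrated with a short, dependency-free sketch. It assumes the unsmoothed form of IDF, the logarithm of the number of documents divided by the number of documents containing the word; library implementations often add smoothing terms, so their numbers will differ. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        vec = []
        for w in vocab:
            tf = counts[w] / total if total else 0.0
            idf = math.log(n_docs / doc_freq[w])
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

# Toy corpus of three already-preprocessed documents.
docs = [["topic", "model", "topic"], ["topic", "visual"], ["gene", "model"]]
vocab, vectors = tfidf_vectors(docs)
# vocab == ['gene', 'model', 'topic', 'visual']
```

Each row of `vectors` is the feature vector of one document, and the rows together form the document-by-word matrix that the next step takes as input.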

In this embodiment, LDA topic modeling is then performed on the feature vector space model of the corpus. The topic model is trained by Gibbs sampling, as follows:

A topic number $z$ is randomly assigned to each word $w$ of each document.

Each time, the topic assignment $z_i$ of the current word $w$ is excluded, and the probability that the current word $w$ is assigned to each of the topics $z_1, z_2, \ldots, z_k$ is estimated from the topic assignments of all other words, i.e. $p(z_i \mid z_{-i}, d, w)$ is computed (the Gibbs updating rule). Once the probability distribution of the current word over all topics $z_1, z_2, \ldots, z_k$ has been obtained, a new topic $z_1$ is resampled for the word according to that distribution. The topic of the next word is updated in the same way, until the topic distribution $\theta_n$ of each document and the word distribution $\phi_k$ of each topic converge.

Finally, the parameters to be estimated, $\theta_n$ and $\phi_k$, are output, and the topic $z_{k,n}$ of each word is obtained as well.

Training the LDA model by Gibbs sampling yields the topic-word co-occurrence frequency matrix, and this matrix is the LDA model.
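The training loop can be sketched as a small collapsed Gibbs sampler. This is a simplified illustration rather than the embodiment's implementation: documents are lists of integer word ids, the priors `alpha` and `eta` are symmetric and their values are arbitrary, and a fixed number of sweeps stands in for a convergence check. All names are hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids. Returns (theta, phi, topic_word_counts)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    n_k = [0] * n_topics                                 # total words per topic
    z = []
    for d, doc in enumerate(docs):                       # random initial assignment
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                              # exclude the current word
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + eta)
                           / (n_k[t] + vocab_size * eta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k                              # record the resampled topic
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in n_dk[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + eta) / (n_k[k] + vocab_size * eta) for c in n_kw[k]]
           for k in range(n_topics)]
    return theta, phi, n_kw

# Toy corpus: word ids 0-1 and 2-3 tend to occur together.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi, topic_word_counts = lda_gibbs(docs, n_topics=2, vocab_size=4, n_iter=50)
```

Here `topic_word_counts` plays the role of the topic-word co-occurrence frequency matrix, while `theta` and `phi` are the smoothed document-topic and topic-word distributions derived from the counts.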

The LDA model is evaluated by perplexity: the number of topics is varied over a certain range, and the number of topics with the minimum perplexity is selected as the optimal number of clustering topics.
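Perplexity is the exponential of the negative average log-likelihood per word, so lower values indicate a better fit. The sketch below computes it from given document-topic and topic-word distributions and picks the topic count with the lowest value. The function names are hypothetical and the perplexity values in the usage example are made up; in practice perplexity is often measured on held-out documents rather than on the training corpus.

```python
import math

def perplexity(docs, theta, phi):
    """docs: lists of word ids; theta[d][k] and phi[k][w] are probabilities."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of word w in document d, mixed over all topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

def pick_topic_count(perplexity_by_k):
    """Return the number of topics with the smallest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)

# One topic that spreads its mass evenly over two words gives perplexity 2.
value = perplexity([[0, 1]], [[1.0]], [[0.5, 0.5]])

# Made-up perplexity values for three candidate topic counts.
best_k = pick_topic_count({5: 310.0, 10: 280.5, 15: 295.2})
```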

In this embodiment, cluster analysis is performed on the obtained LDA topic model to obtain academic-literature clustering data, which comprise the academic documents contained in each topic cluster and the keywords contained in each topic.

In this embodiment, the clustering result is fused with the other dimensions of information, which include: time, author, citation frequency, and the sub-institution to which the author belongs.
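The text does not spell out the clustering rule. One simple possibility, shown here purely as an assumption, is to assign each document to its most probable topic and to take the highest-weight words of each topic as its keywords; `cluster_documents` and `top_keywords` are hypothetical names.

```python
def cluster_documents(theta):
    """Assign each document to its highest-probability topic."""
    clusters = {}
    for d, dist in enumerate(theta):
        k = max(range(len(dist)), key=dist.__getitem__)
        clusters.setdefault(k, []).append(d)
    return clusters

def top_keywords(phi, vocab, n=3):
    """Return the n highest-weight words of each topic."""
    return [[vocab[w] for w in
             sorted(range(len(row)), key=row.__getitem__, reverse=True)[:n]]
            for row in phi]

# Made-up distributions for three documents, two topics and three words.
theta = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
phi = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
clusters = cluster_documents(theta)                            # {0: [0, 2], 1: [1]}
keywords = top_keywords(phi, ["graph", "render", "gene"], n=2)
# keywords == [['graph', 'render'], ['gene', 'render']]
```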
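As an illustration of this fusion step, the sketch below joins per-document topic assignments with their metadata and then counts documents per topic and year, which is the kind of series a theme river chart needs. The field names (`year`, `unit`, `citations`) and the sample values are made up.

```python
from collections import defaultdict

def fuse(assignments, metadata):
    """assignments: doc id -> topic id; metadata: doc id -> dict of other dimensions."""
    return [{"doc": d, "topic": k, **metadata[d]}
            for d, k in sorted(assignments.items())]

def topic_strength_by_year(rows):
    """Count documents per topic and year."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row["topic"]][row["year"]] += 1
    return {topic: dict(sorted(years.items()))
            for topic, years in sorted(counts.items())}

# Made-up assignments and metadata for three documents.
assignments = {0: 1, 1: 0, 2: 1}
metadata = {
    0: {"year": 2018, "unit": "Unit A", "citations": 4},
    1: {"year": 2019, "unit": "Unit B", "citations": 0},
    2: {"year": 2019, "unit": "Unit A", "citations": 7},
}
rows = fuse(assignments, metadata)
strength = topic_strength_by_year(rows)
# strength == {0: {2019: 1}, 1: {2018: 1, 2019: 1}}
```

Keeping one flat row per document means that the same table can be regrouped by sub-institution or by author for the other charts.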

In this embodiment, the D3.js visual analysis technique is finally used to analyze the results from multiple dimensions, as follows:

As shown in FIG. 2, the theme river chart shows how the intensity of the different topics changes. The horizontal axis is the time axis, with time running from left to right; different topics are rendered in different colors, and the width of a stream represents the development of the topic at different times: the greater the width, the stronger the topic. Moving the mouse over a topic highlights it.

As shown in FIG. 3, the word cloud shows the keywords of each topic, i.e. the research content of each topic, and the size of a word represents its importance within the topic.

As shown in FIG. 4, the line chart shows the annual publication volume of each topic and how often the topic is cited, from which the research trend and academic influence of a topic at the institution can be predicted.

As shown in FIG. 5, the scatter plot shows how the research status of each topic in each sub-institution changes over time; different topics are shown in different colors, and the size of each circle represents the topic's volume in that year.
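The charts themselves are drawn with D3.js in the browser, but the data behind them can be prepared beforehand. As a rough sketch under assumed field names, the following aggregates publications and citations per topic and year, as needed for the line chart of FIG. 4, and serializes the result to JSON that a front end could load; the rows are made up.

```python
import json
from collections import defaultdict

def line_chart_series(rows):
    """Aggregate publication and citation counts per (topic, year)."""
    totals = defaultdict(lambda: {"publications": 0, "citations": 0})
    for row in rows:
        cell = totals[(row["topic"], row["year"])]
        cell["publications"] += 1
        cell["citations"] += row["citations"]
    return [{"topic": t, "year": y, **totals[(t, y)]} for t, y in sorted(totals)]

# Made-up fused rows: one row per document.
rows = [
    {"topic": 0, "year": 2018, "citations": 3},
    {"topic": 0, "year": 2018, "citations": 5},
    {"topic": 0, "year": 2019, "citations": 1},
    {"topic": 1, "year": 2019, "citations": 0},
]
series = line_chart_series(rows)
payload = json.dumps(series)   # JSON text that a front end could load
```

Adding the sub-institution to the grouping key would give the per-unit series behind the scatter plot of FIG. 5 in the same way.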

The method for exploring the research status of an institution based on topic visualization rests on the theory of data visualization and topic modeling. The corpus to be studied is first preprocessed, and a text vector space is generated by feature extraction. Topic modeling then yields the academic-literature clustering result. The clustering result and the topic keywords are combined with the other dimensions of information in the academic literature data, such as publication time, sub-institution and author, and visual elements are used to analyze, present and predict the relationships among the data. This makes it easier to grasp and track the research status of the institution, so that researchers can better capture the frontiers and hot spots of disciplinary development and avoid duplicated research.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims

1. A method for exploring the research status of an institution based on topic visualization, characterized in that the method comprises the following steps:

S1: acquiring and processing research data:

extracting a required research field;

preprocessing the extracted fields;

a scatter plot and a line chart show, respectively, the intensity change and the research trend of the different topics under each sub-institution, so as to identify disciplinary strengths;

in step S1, the institution to be studied is determined on the basis of the SCI academic literature of the last five years whose author address contains the institution;

in step S1, regular-expression matching is used, and the retained data include the title, authors, time, citation frequency, keywords and abstract of each document; the corpus consists of the keywords, titles and abstracts;

in step S1, the preprocessing includes cleaning and denoising, English word segmentation, stop-word removal and stemming;

in step S2, the TF-IDF algorithm is used to extract features and generate the text vector space, specifically as follows:

$$\text{Feature-Vector} = \{f_1, f_2, f_3, \ldots, f_n\} \quad (1)$$

$$f_i = tf(w_i, d_i) \cdot idf(w_i, D) \quad (2)$$

in formula (2), the tf value is computed as:

$$tf(w_i, d_i) = \frac{n_i}{\sum_k n_k}$$

the idf value in formula (2) is computed as:

$$idf(w_i, D) = \log \frac{|D|}{|\{d_i : w_i \in d_i\}|}$$

the TF-IDF feature vectors of all documents form a (word, tfidf) matrix, which is the document feature vector space model;

in step S2, the topic-word matrix is obtained with the LDA algorithm, which is based on the bag-of-words model, specifically as follows:

Training process (Gibbs sampling):

randomly assigning a topic number $z$ to each word $w$ of each document;

training the LDA model by Gibbs sampling yields the topic-word co-occurrence frequency matrix;

in step S2, cluster analysis is performed on the obtained LDA topic model to obtain academic-literature clustering data, which comprise the academic documents contained in each topic cluster and the keywords contained in each topic;

in step S3, the other dimensions of information include: time, author, citation frequency, and the sub-institution to which the author belongs;

in step S3, the D3.js visual analysis technique is used to analyze the results from multiple dimensions, specifically as follows: