CN113344107B - Topic analysis method and system based on kernel principal component analysis and LDA - Google Patents

Topic analysis method and system based on kernel principal component analysis and LDA Download PDF

Info

Publication number
CN113344107B
CN113344107B (application number CN202110709322.3A)
Authority
CN
China
Prior art keywords
topic
word
document
lda
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110709322.3A
Other languages
Chinese (zh)
Other versions
CN113344107A (en)
Inventor
李秀
许菁
王梦凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202110709322.3A
Publication of CN113344107A
Application granted
Publication of CN113344107B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA, comprising the following steps: 1) acquiring a literature corpus and preprocessing each article in the corpus; 2) establishing a KPCA-LDA topic model from the preprocessed literature corpus; 3) performing topic analysis on the articles in the literature corpus with the established KPCA-LDA topic model and determining the text representation of the articles in the corpus; 4) training the KPCA-LDA topic model and estimating its parameters with a Gibbs sampling algorithm, solving for the parameters of the KPCA-LDA topic model and generating a number of topics represented by words.

Description

Topic analysis method and system based on kernel principal component analysis and LDA
Technical Field
The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA, and belongs to the field of text mining.
Background
At present, mining research topics and their evolution from the scientific literature has developed into a mature methodology, and the main research methods can be roughly divided into word-frequency analysis, co-word analysis, citation analysis and text mining. With the rapid development of natural language processing and the rapid growth of text data, the topic model, as an efficient text-analysis tool, has gradually become one of the core techniques in the field of text mining. By extracting topics from a scientific literature corpus, researchers obtain two probability distributions, the topic-word multinomial distribution φ and the document-topic multinomial distribution θ, and on this basis the generative probabilistic topic model LDA (Latent Dirichlet Allocation) was proposed. The LDA model overcomes the shortcoming that traditional text-mining models cannot adequately reflect the semantic relations among words, and it is widely applied in scientific information analysis and research. Griffiths et al. first applied the LDA model to abstracts of journal papers of the National Academy of Sciences of the United States, studying their topics and topic-change trends, and performed inference for the LDA model with a Gibbs sampling algorithm. Fuchs et al. proposed a semi-supervised method to extract topics from microblogs, explored the topics appearing in the text corpus through visual analytics, and proposed refining the global topics describing the microblogs by interactive iteration. Wang Yuefen et al. took the Chinese knowledge-flow field as the research object, applied the LDA model to study topic extraction and distribution from the perspective of subject classification, and analyzed the knowledge structures and research hotspots of the various disciplines under different topics. In addition, the LDA model is also widely applied to scientific-literature topic mining in many fields, such as text clustering, personalized recommendation, biomedicine, computer science and bibliometrics, and to analyzing research hotspots and trends in specific fields.
To optimize the modeling effect of the LDA model and improve the accuracy of its topic identification and the completeness of topic evolution paths, scholars have successively improved the LDA model in several respects, including the model algorithm, model attributes and theoretical basis. On the algorithmic side, Li et al. proposed a microblog topic extraction method based on the fuzzy C-means algorithm, using fuzzy sets to represent groups and topics, which yields more reasonable and more concentrated topic results. Liu et al. started from the model attributes and proposed a multi-attribute LDA model (MA-LDA) that incorporates the time and tag attributes of microblogs into the LDA model. Based on document cluster-analysis theory, Wang Shaopeng et al. used an LDA model to analyze network public opinion on college forums, performing a deeper analysis of the implicit semantics of the texts. In addition, some scholars have improved the traditional LDA model to account for differences among text data objects. Yan et al. established the BTM topic model to process short texts, addressing the document-level sparsity of word co-occurrence and thereby enabling data analysis based on biterm modeling. Zhong Qinghong et al. optimized the feature extraction of texts and pictures through the LDA2Vec and ResNet V2 models, addressing the semantic-gap problem between heterogeneous data. Yang Ling et al. located fluctuation points and their topics with a method combining principal component analysis (PCA) and a topic model: reducing the dimensionality of the feature matrix with PCA yields its principal components and unifies quantities originally scattered over many positions.
With the continuous improvement and optimization of the LDA model, text mining has shown good adaptability to functions such as topic discovery, trend analysis and topic evolution. However, current topic analysis methods are mainly aimed at short texts such as microblog comments, and algorithms that perform well on longer texts are lacking. In addition, most existing quantitative research uses bibliometric measures such as publication counts and citation rates to survey the literature from specific perspectives, such as a single subject field or an interdisciplinary field, and lacks a global perspective and studies of topic-trend evolution.
Disclosure of Invention
In view of the above problems, it is an object of the present invention to provide a topic analysis method and system based on kernel principal component analysis and LDA that can process longer texts and offers a global perspective.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a topic analysis method based on kernel principal component analysis and LDA comprises the following steps:
1) Acquiring a literature corpus, and preprocessing each article in the literature corpus;
2) Establishing a KPCA-LDA topic model according to the preprocessed literature corpus;
3) Performing topic analysis on articles in the literature corpus by adopting the established KPCA-LDA topic model, and determining text representation of the articles in the literature corpus;
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and a plurality of topics represented by words are generated.
Further, the specific process of the step 2) is as follows:
2.1) Extracting the vocabulary w_L = (w_1, …, w_j, …, w_W) of the articles in the preprocessed document corpus D, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L;
2.2) Generating a document-word matrix of the document corpus D;
2.3) Using a P-order polynomial kernel function, mapping the generated document-word matrix from its original two dimensions into a high-dimensional Hilbert space through a nonlinear mapping, reducing the dimensionality to obtain a low-dimensional topic-word matrix R with n rows and n columns, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
Further, the specific process of the step 2.2) is as follows:
2.2.1) Suppose there are M articles in the document corpus D, D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i, taken as the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus;
2.2.2) Computing in turn the weight of each word of the vocabulary w_L in each article, obtaining the document-word matrix of the document corpus D.
Further, the specific process of the step 3) is as follows:
3.1) Based on the definition of a topic, the generation probability p(w|d) of word w in article d is calculated:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
3.2) According to the parameter settings used in establishing the KPCA-LDA topic model, the probability p(w|d) that article d contains word w is obtained:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
3.3) From the probability p(w|d) that article d contains word w, the conditional probability distribution p(d|\alpha, \beta) of generating article d is obtained:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter for topic h and word j; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise.
Further, the specific process of the step 4) is as follows:
4.1) After inputting the extracted vocabulary, the document-word matrix of the document corpus D, the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β, carry out iterative computation with the Gibbs sampling algorithm, estimate the unknown parameter variables, and solve for and output the document-topic matrix θ and the topic-word matrix φ.

The document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M;
The topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K;
4.2) K topics, each represented by t words, are generated.
Further, in step 4.2), the optimal number of topics is determined using topic coherence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor that ensures the score returns a real number. The topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
A topic analysis system based on kernel principal component analysis and LDA, comprising:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus;
the model construction module is used for establishing a KPCA-LDA topic model according to the preprocessed document corpus;
the text representation determining module is used for carrying out topic analysis on the articles in the document corpus by adopting the established KPCA-LDA topic model to determine the text representations of the articles in the document corpus;
the topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
Further, the model building module includes:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
A processor comprising computer program instructions which, when executed by the processor, implement the steps corresponding to the above topic analysis method based on kernel principal component analysis and LDA.
A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps corresponding to the above topic analysis method based on kernel principal component analysis and LDA.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. In topic mining, because the literature in many fields is broad in scope, scattered in research topics and long in text, the resulting document-word matrix is high-dimensional and sparse, which is unfavorable for generating high-quality topics; the invention therefore applies kernel principal component analysis to reduce the dimensionality of the document-word matrix before topic modeling, alleviating sparsity and improving topic quality.
2. For literature characterized by broad scope, scattered research topics and long texts, the invention uses topic coherence to determine the optimal number of topics, making the analysis of literature topic evolution more comprehensive and accurate; the method can be widely applied in the field of text mining.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a KPCA-LDA topic model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of evolution of a literature topic according to an embodiment of the present invention;
FIG. 4 is a graph of the evolution trend of literature topic intensity according to an embodiment of the present invention, where the abscissa is the year and the ordinate is the topic intensity.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
Term interpretation:
1. LDA: Latent Dirichlet Allocation;
2. BTM: Biterm Topic Model, a topic model based on word co-occurrence pairs (biterms);
3. LDA2Vec: a model combining LDA with word2vec word embeddings;
4. ResNet V2: Residual Network, version 2;
5. PCA: Principal Component Analysis;
6. KPCA: Kernel Principal Component Analysis;
7. Gibbs Sampling: a Markov chain Monte Carlo sampling algorithm.
Aimed at literature characterized by long texts, broad scope and scattered topics, the embodiments of the present invention provide a topic analysis method and system based on kernel principal component analysis and LDA. In the KPCA-LDA topic model, KPCA is an improved PCA: a kernel-based nonlinear dimension-reduction method that uses a nonlinear mapping to map the data of the original space into a high-dimensional Hilbert space and then performs principal component analysis on the mapped data in that high-dimensional space.
Example 1
As shown in fig. 1, the present embodiment provides a topic analysis method based on kernel principal component analysis and LDA, which includes the following steps:
1) Obtain a literature corpus D and preprocess each article in it; the preprocessing includes deleting punctuation marks, deleting English characters, word segmentation, stop-word removal and the like.
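As a concrete illustration, a minimal preprocessing sketch in Python follows. The patent names no specific tokenizer or stop-word list, so the jieba segmenter and the tiny stop-word set here are illustrative assumptions only.

```python
import re

import jieba  # assumed Chinese word segmenter; the patent does not name a tool

STOPWORDS = {"的", "和", "在", "了", "与"}  # illustrative stop-word list

def preprocess(article: str) -> list[str]:
    # Delete punctuation and English characters by keeping only CJK characters
    text = re.sub(r"[^\u4e00-\u9fa5]", "", article)
    # Word segmentation, then stop-word removal (single characters also dropped)
    return [w for w in jieba.lcut(text) if w not in STOPWORDS and len(w) > 1]

raw_docs = ["高等教育国际化研究的主题分析……", "基于LDA的文献主题挖掘方法……"]
corpus = [preprocess(d) for d in raw_docs]
```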
2) According to the preprocessed literature corpus D, a KPCA-LDA topic model is built, specifically:
2.1 Extracting vocabulary of each article in the preprocessed document corpus D):
By scanning the document corpus D, mutually distinct words in the articles are appended to the vocabulary in turn, yielding the vocabulary w_L = (w_1, …, w_j, …, w_W) of the article set, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L.
2.2 Generating a document-term matrix of the document corpus D):
2.2.1) Assume there are M articles in the document corpus D, i.e., D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i; the weight takes the value of the term frequency (TF), i.e., d_{ij} is the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus.
2.2.2) Compute in turn the weight of each word of the vocabulary w_L in each article, obtaining the document-word matrix of the document corpus D.
2.3) Using a P-order polynomial kernel function, map the generated document-word matrix from two dimensions into a high-dimensional Hilbert space (H space) through a nonlinear mapping, then reduce the dimensionality to obtain a low-dimensional symmetric matrix R with n rows and n columns, i.e., the topic-word matrix, which serves as the input corpus of the KPCA-LDA topic model.
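The document-word matrix and the KPCA step can be sketched as follows, assuming scikit-learn's CountVectorizer and KernelPCA; the polynomial order P = 3 and the target dimension n are illustrative choices, not values fixed by the patent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import KernelPCA

# Pre-segmented articles, joined back into space-separated strings
docs = ["高等教育 国际化 研究", "国际化 教育 留学 研究", "文本 挖掘 主题 模型"]

# Document-word matrix: entry d_ij = term frequency of word j in article i
X = CountVectorizer().fit_transform(docs).toarray()

# P-order polynomial-kernel PCA: nonlinearly map into a high-dimensional
# Hilbert space, then keep the first n principal components
n = 2  # illustrative target dimension, bounded above by the number of articles
kpca = KernelPCA(n_components=n, kernel="poly", degree=3)  # P = 3 assumed
R = kpca.fit_transform(X)
```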
3) Perform topic analysis on the articles in the document corpus D using the established KPCA-LDA topic model and determine the text representation of the articles in the corpus, specifically:
The KPCA-LDA topic model based on the word co-occurrence matrix is shown in FIG. 2, and each parameter of FIG. 2 is explained in Table 1 below:

Table 1: parameter specification table

(Table 1 is reproduced as an image in the original publication; it defines each symbol of the KPCA-LDA topic model.)
3.1) Based on the definition of a topic, the generation probability p(w|d) of word w in article d is calculated as:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics.
3.2) According to the parameter settings used in establishing the KPCA-LDA topic model, the probability p(w|d) that article d contains word w is obtained as:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution.
3.3) From the probability p(w|d) that article d contains word w, the conditional probability distribution p(d|\alpha, \beta) of generating article d is obtained as:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d (d is the generic document symbol of the LDA formulas, while index i denotes the selected i-th article); \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise.
That is, the KPCA-LDA topic model generates an article through the following steps: select a latent topic z from the probability distribution θ, then select a word w from the topic-word distribution φ_z corresponding to the latent topic z, and repeat N_d times until an article containing N_d words is generated. The optimization goal of the KPCA-LDA topic model is to maximize the conditional probability distribution p(d|α, β).
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and K topics represented by t words are generated, wherein the method specifically comprises the following steps:
4.1) After inputting the extracted vocabulary, the document-word matrix of the corpus D and the related parameter values (namely the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β), carry out iterative computation with the Gibbs sampling algorithm, estimate the unknown parameter variables, and solve for and output the document-topic matrix θ and the topic-word matrix φ.

The document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M.
The topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K.
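The patent does not spell out the sampler's update equations, so the following is a minimal collapsed Gibbs sampler sketch for a plain LDA model with symmetric hyperparameters; θ and φ are recovered from the final count tables exactly as the matrices above describe. Function and variable names are illustrative.

```python
import numpy as np

def gibbs_lda(docs, W, K, alpha, beta, iters=200, seed=0):
    """docs: list of articles, each a list of word ids in [0, W).
    Returns the document-topic matrix theta (M x K) and the
    topic-word matrix phi (K x W)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_dk = np.zeros((M, K))  # topic counts per document
    n_kw = np.zeros((K, W))  # word counts per topic
    n_k = np.zeros(K)        # total word count per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random topic init
    for d, doc in enumerate(docs):  # seed the count tables
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | z_-i, w), up to normalization
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + W * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k  # resample, then restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)
    return theta, phi

theta, phi = gibbs_lda([[0, 1, 2, 1], [2, 3, 3, 0]], W=4, K=2, alpha=0.1, beta=0.01)
```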
4.2) Select the t most readable words to express each topic and generate K topics represented by t words, specifically as follows:

The optimal number of topics is determined using topic coherence (Topic Coherence), which offers a stronger guarantee of interpretability; the UMass index defines the score on the basis of document co-occurrence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor that ensures the score returns a real number. The topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
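A direct transcription of the UMass score into Python, assuming the corpus is given as lists of words; the helper names are illustrative.

```python
import math
from itertools import combinations

def umass_coherence(top_words, corpus, eps=1.0):
    """Sum of log((D(x, y) + eps) / D(x)) over the word pairs of one topic."""
    doc_sets = [set(doc) for doc in corpus]
    def D(*words):  # number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)
    return sum(
        math.log((D(x, y) + eps) / D(x))
        for x, y in combinations(top_words, 2)
        if D(x) > 0  # topic words normally occur in the corpus, so D(x) >= 1
    )
```

Sweeping candidate topic numbers, fitting one model per candidate and keeping the number with the highest coherence reproduces the selection rule above.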
The topic analysis method based on kernel principal component analysis and LDA is described in detail below, taking topic analysis and evolution of the literature in the field of higher-education research as a specific embodiment:
1. topic analysis
1) Establishing a KPCA-LDA topic model:
1.1) Obtain the literature in the field of higher-education research, collect the document abstracts to construct a document corpus, and preprocess the articles in the corpus (word segmentation, stop-word removal and the like) to form a standardized document corpus.
1.2) By scanning the standardized document corpus, obtain the vocabulary and the document-word matrix.
1.3) Perform KPCA dimension reduction on the document-word matrix to obtain the low-dimensional symmetric matrix R, and represent the literature corpus with the dimension-reduced matrix.
2) Perform topic analysis on each article in the document corpus using the established KPCA-LDA topic model to determine the text representation of the articles in the corpus; the prior parameters α and β of the model are determined from empirical values in the existing literature, and the topic number K is determined using topic coherence.
3) Train the KPCA-LDA topic model and estimate its parameters with the Gibbs sampling algorithm, solve for the parameters of the KPCA-LDA topic model to obtain the document-topic distribution matrix and the topic-word distribution matrix, and determine the best KPCA-LDA topic model.
2. Subject evolution:
1) Research framework
A topic evolution study is performed on the text set using the established KPCA-LDA model, as shown in FIG. 3. In topic evolution, topics are extracted first, and the study then proceeds from the following two aspects: (1) evolution of topic intensity, interpreted through the document-topic distributions of the text sets in different time windows; (2) evolution of topic content, measured by the similarity of topic distributions in different time windows and the topic-word distributions under similar topics.
2) Topic evolution research based on KPCA-LDA topic model
The evolution study based on the KPCA-LDA topic model is performed from the following aspects: (1) determining the optimal number of topics using topic coherence; (2) aligning topics using the edit-distance method. On this basis, the texts are first clustered by year, and the corresponding formulas are used to calculate topic intensity and similarity so as to analyze the evolution of topic intensity and of topic content, specifically:
2.1 Optimum topic count determination aspect
The optimal number of topics in the topic evolution study is determined using topic coherence (Topic Coherence).
2.2) Topic alignment aspect
Topics are aligned using the edit distance (Levenshtein distance):

Levenshtein.distance(str1, str2)    (8)

where str1 is the source string and str2 is the string into which it is to be converted.
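As a sketch, topics in adjacent time windows can be paired by the edit distance between their top-word strings, assuming the python-Levenshtein package; the example strings are illustrative.

```python
import Levenshtein  # pip install python-Levenshtein

topics_2017 = ["国际化 教育 留学", "科研 评价 指标"]
topics_2018 = ["国际 教育 合作", "科研 评价 体系"]

# Pair each 2017 topic with the 2018 topic at minimum edit distance
alignment = {
    t1: min(topics_2018, key=lambda t2: Levenshtein.distance(t1, t2))
    for t1 in topics_2017
}
```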
For the evolution of topic intensity, let \theta_{d,z} be the proportion of topic z in document d; over the text set of time window t, the intensity \bar{\theta}_z^t of topic z in window t is:

\bar{\theta}_z^{\,t} = \frac{1}{\lvert D_t \rvert} \sum_{d \in D_t} \theta_{d,z}

where D_t is the document set in time window t and \theta_{d,z} is the document-topic matrix entry for topic z in document d.
The intensity of topic z is calculated over the different time windows t, and an intensity-change graph is drawn in chronological order for studying and analyzing the trend of topic-intensity evolution.
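A small sketch of the intensity computation, reusing the document-topic matrix theta produced by the Gibbs sampler sketch above; the year labels are illustrative.

```python
import numpy as np

def topic_intensity(theta, years, z, window):
    """Mean document-topic weight of topic z over the documents whose
    year label equals the given time window."""
    idx = [i for i, y in enumerate(years) if y == window]
    return float(np.mean(theta[idx, z]))

years = [2014, 2015]  # one year label per row of theta
curve = [topic_intensity(theta, years, z=0, window=y) for y in (2014, 2015)]
```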
3. Analysis of results
All text sets are trained with the established KPCA-LDA topic model, and the document-topic probability distributions of the text sets are computed. The text set is divided into five time windows covering 2014-2018, and the topic intensity is computed for each of the five windows. The top 10 hot topics in the text set are selected and the keywords under each hot topic are listed; the resulting topic identification results are shown in Table 2 below:
table 2: literature topic and keywords thereof
(Table 2 is reproduced as an image in the original publication; it lists the identified literature topics and their keywords.)
As can be seen from Table 2, topic 11 concerns internationalized education, topic 25 concerns cross-border internationalized education, and topic 38 concerns studying abroad and Chinese-foreign cooperative education; the evolution trend of topic intensity can be derived from the probability distribution of the topics over the text sets of different time windows, as shown in FIG. 4.
For the evolution of topic content, Table 3 below gives the higher-education-related topics and their keywords in each time window:
table 3: each time window theme and keywords thereof
(Table 3 is reproduced as an image in the original publication; it lists the topics and keywords for each time window.)
In conclusion, the identified topics and their evolution trends agree closely with the actual situation, showing that the method provided by the invention performs well in tracking research development trends and research hotspots in a specific field.
Example 2
The embodiment provides a topic analysis system based on kernel principal component analysis and LDA, which comprises:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus.
The model construction module is used for establishing a KPCA-LDA topic model according to the preprocessed literature corpus.
And the text representation determining module is used for carrying out topic analysis on the articles in the document corpus by adopting the established KPCA-LDA topic model to determine the text representations of the articles in the document corpus.
The topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
In a preferred embodiment, the model building module comprises:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
Example 3
This embodiment provides a processing device corresponding to the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1; the processing device may be a client-side device, for example a mobile phone, notebook computer, tablet computer or desktop computer, that performs the method of Embodiment 1.
The processing device comprises a processor, a memory and a communication interface, which are connected through a bus to enable communication within the processing device. The memory stores a computer program executable on the processor, and when executing the program the processor performs the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1.
In some implementations, the memory may be high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
In other implementations, the processor may be a central processing unit (CPU), a digital signal processor (DSP) or another general-purpose processor, which is not limited herein.
Example 4
The topic analysis method based on kernel principal component analysis and LDA of Embodiment 1 may be embodied as a computer program product, which may include a computer-readable storage medium bearing computer-readable program instructions for performing the topic analysis method based on kernel principal component analysis and LDA of Embodiment 1.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding.
The foregoing embodiments are only for illustrating the present invention, wherein the structures, connection modes, manufacturing processes, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solutions of the present invention should not be excluded from the protection scope of the present invention.

Claims (6)

1. A topic analysis method based on kernel principal component analysis and LDA, which is characterized by comprising the following contents:
1) Acquiring a literature corpus, and preprocessing each article in the literature corpus;
2) According to the pretreated document corpus, a KPCA-LDA topic model is established, and the specific process is as follows:
2.1) extracting the vocabulary w_L = (w_1, …, w_j, …, w_W) of each article in the preprocessed document corpus D, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L;
2.2 A document-word matrix of the document corpus D is generated, and the specific process is as follows:
2.2.1) supposing there are M articles in the document corpus D, D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i, taken as the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus;
2.2.2) computing in turn the weight of each word of the vocabulary w_L in each article to obtain the document-word matrix of the document corpus D;
2.3) using a P-order polynomial kernel function, mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping, reducing the dimensionality to obtain a low-dimensional topic-word matrix R with n rows and n columns, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model;
3) Performing topic analysis on articles in a literature corpus by adopting the established KPCA-LDA topic model, and determining text representation of the articles in the literature corpus, wherein the specific process comprises the following steps of:
3.1) based on the definition of a topic, calculating the generation probability p(w|d) of word w in article d:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
3.2) according to the parameter settings used in establishing the KPCA-LDA topic model, obtaining the probability p(w|d) that article d contains word w:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
3.3) from the probability p(w|d) that article d contains word w, obtaining the conditional probability distribution p(d|\alpha, \beta) of generating article d:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise;
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and a plurality of topics represented by words are generated.
2. The method for analyzing the topic based on the kernel principal component analysis and the LDA as claimed in claim 1, wherein the specific process of the step 4) is as follows:
4.1) inputting the extracted vocabulary, the document-word matrix of the document corpus D, the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β, carrying out iterative computation with the Gibbs sampling algorithm, estimating the unknown parameter variables, and solving for and outputting the document-topic matrix θ and the topic-word matrix φ,

wherein the document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M;
and the topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K;
4.2) generating K topics, each represented by t words.
3. The topic analysis method based on kernel principal component analysis and LDA according to claim 2, wherein in step 4.2) the optimal number of topics is determined using topic coherence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor ensuring the score returns a real number; the topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
4. A topic analysis system based on kernel principal component analysis and LDA, comprising:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus;
the model building module is used for building a KPCA-LDA topic model according to the preprocessed literature corpus, and comprises the following steps:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model;
the text representation determining module is used for performing topic analysis on the articles in the document corpus with the established KPCA-LDA topic model and determining the text representation of the articles in the document corpus, the specific process being:
based on the definition of a topic, calculating the generation probability p(w|d) of word w in article d:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
obtaining the probability p(w|d) that article d contains word w according to the parameter settings used in establishing the KPCA-LDA topic model:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
obtaining the conditional probability distribution p(d|\alpha, \beta) of generating article d from the probability p(w|d) that article d contains word w:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise; and
the topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
5. A processor comprising computer program instructions, wherein the computer program instructions, when executed by the processor, implement the steps corresponding to the topic analysis method based on kernel principal component analysis and LDA of any of claims 1-3.
6. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps corresponding to the topic analysis method based on kernel principal component analysis and LDA of any of claims 1-3.
CN202110709322.3A 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA Active CN113344107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709322.3A CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110709322.3A CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Publications (2)

Publication Number Publication Date
CN113344107A CN113344107A (en) 2021-09-03
CN113344107B (en) 2023-07-11

Family

ID=77478609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709322.3A Active CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Country Status (1)

Country Link
CN (1) CN113344107B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8510257B2 (en) * 2010-10-19 2013-08-13 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629275A (en) * 2012-03-21 2012-08-08 复旦大学 Face and name aligning method and system facing to cross media news retrieval
CN102902700A (en) * 2012-04-05 2013-01-30 中国人民解放军国防科学技术大学 Online-increment evolution topic model based automatic software classifying method
CN105975499A (en) * 2016-04-27 2016-09-28 深圳大学 Text subject detection method and system
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107203958A (en) * 2017-05-25 2017-09-26 段云涛 A kind of hidden image analysis method based on multiple features combining
CN108519971A (en) * 2018-03-23 2018-09-11 中国传媒大学 A kind of across languages theme of news similarity comparison methods based on Parallel Corpus
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document
CN109325092A (en) * 2018-11-27 2019-02-12 中山大学 Merge the nonparametric parallelization level Di Li Cray process topic model system of phrase information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Effect of thermal oxidation on detection of adulteration at low concentrations in extra virgin olive oil: Study based on laser-induced fluorescence spectroscopy combined with KPCA–LDA";Yi Li.etc;《Food Chemistry》;全文 *

Also Published As

Publication number Publication date
CN113344107A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110442872B (en) Text element integrity checking method and device
CN111259153B (en) Attribute-level emotion analysis method of complete attention mechanism
CN114647741A (en) Process automatic decision and reasoning method, device, computer equipment and storage medium
CN115238029A (en) Construction method and device of power failure knowledge graph
CN112784591A (en) Data processing method and device, electronic equipment and storage medium
CN115578137A (en) Agricultural product future price prediction method and system based on text mining and deep learning model
CN113051932A (en) Method for detecting category of network media event of semantic and knowledge extension topic model
CN116775812A (en) Traditional Chinese medicine patent analysis and excavation tool based on natural voice processing
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN104679784A (en) O2B intelligent searching method and system
CN113344107B (en) Topic analysis method and system based on kernel principal component analysis and LDA
CN117131932A (en) Semi-automatic construction method and system for domain knowledge graph ontology based on topic model
US20170337484A1 (en) Scalable web data extraction
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
Pogorilyy et al. Assessment of Text Coherence by Constructing the Graph of Semantic, Lexical, and Grammatical Consistancy of Phrases of Sentences
Li et al. Text classification based on machine learning and natural language processing algorithms
Ye Translation mechanism of neural machine algorithm for online English resources
He An intelligent diagnosis system for English writing based on data feature extraction and fusion
Jiang et al. A Discourse Coherence Analysis Method Combining Sentence Embedding and Dimension Grid
Ong et al. A Comparative Study of Extractive Summary Algorithms Using Natural Language Processing
CN112989827A (en) Text data set quality evaluation method based on multi-source heterogeneous characteristics
Jiang et al. Python-Based Visual Classification Algorithm for Economic Text Big Data
Wang et al. A semantic path based approach to match subgraphs from large financial knowledge graph
Dai et al. A novel attention-based BiLSTM-CNN model in valence-arousal space
Liu et al. Practical Skills of Business English Correspondence Writing Based on Data Mining Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant