CN113344107B - Topic analysis method and system based on kernel principal component analysis and LDA - Google Patents

Topic analysis method and system based on kernel principal component analysis and LDA Download PDF

Info

Publication number
CN113344107B
CN113344107B (application number CN202110709322.3A)
Authority
CN
China
Prior art keywords
topic
word
document
lda
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110709322.3A
Other languages
Chinese (zh)
Other versions
CN113344107A (en)
Inventor
李秀
许菁
王梦凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202110709322.3A
Publication of CN113344107A
Application granted
Publication of CN113344107B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 - Feature extraction based on approximation criteria, e.g. principal component analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA, comprising the following steps: 1) acquiring a literature corpus and preprocessing each article in the corpus; 2) establishing a KPCA-LDA topic model from the preprocessed literature corpus; 3) performing topic analysis on the articles in the literature corpus with the established KPCA-LDA topic model and determining the text representation of the articles in the corpus; 4) training the KPCA-LDA topic model and estimating its parameters with a Gibbs sampling algorithm, solving for the parameters of the KPCA-LDA topic model and generating a number of topics represented by words.

Description

Topic analysis method and system based on kernel principal component analysis and LDA
Technical Field
The invention relates to a topic analysis method and system based on kernel principal component analysis and LDA, and belongs to the field of text mining.
Background
At present, mining research topics and their evolution from the scientific literature has developed into a mature methodology, and the main research methods can be roughly divided into word-frequency analysis, co-word analysis, citation analysis and text mining. With the rapid development of natural language processing and the rapid growth of text data, the topic model, as an efficient text-analysis tool, has gradually become one of the core techniques in the field of text mining. By extracting topics from a scientific literature corpus, researchers obtain two probability distributions, the topic-word multinomial distribution φ and the document-topic multinomial distribution θ, and on this basis the generative probabilistic topic model LDA (Latent Dirichlet Allocation) was proposed. The LDA model overcomes the shortcoming that traditional text-mining models cannot adequately reflect the semantic relations among words, and it is widely applied in scientific information analysis and research. Griffiths et al. first applied the LDA model to abstracts of journal papers of the National Academy of Sciences of the United States, studying their topics and topic-change trends, and performed inference for the LDA model with a Gibbs sampling algorithm. Fuchs et al. proposed a semi-supervised method to extract topics from microblogs, explored the topics appearing in the text corpus through visual analytics, and proposed refining the global topics describing the microblogs by interactive iteration. Wang Yuefen et al. took the Chinese knowledge-flow field as the research object, applied the LDA model to study topic extraction and distribution from the perspective of subject classification, and analyzed the knowledge structures and research hotspots of the various disciplines under different topics. In addition, the LDA model is also widely applied to scientific-literature topic mining in many fields, such as text clustering, personalized recommendation, biomedicine, computer science and bibliometrics, and to analyzing research hotspots and trends in specific fields.
To optimize the modeling effect of the LDA model and improve the accuracy of its topic identification and the completeness of topic evolution paths, scholars have successively improved the LDA model in several respects, including the model algorithm, model attributes and theoretical basis. On the algorithmic side, Li et al. proposed a microblog topic extraction method based on the fuzzy C-means algorithm, using fuzzy sets to represent groups and topics, which yields more reasonable and more concentrated topic results. Liu et al. started from the model attributes and proposed a multi-attribute LDA model (MA-LDA) that incorporates the time and tag attributes of microblogs into the LDA model. Based on document cluster-analysis theory, Wang Shaopeng et al. used an LDA model to analyze network public opinion on college forums, performing a deeper analysis of the implicit semantics of the texts. In addition, some scholars have improved the traditional LDA model to account for differences among text data objects. Yan et al. established the BTM topic model to process short texts, addressing the document-level sparsity of word co-occurrence and thereby enabling data analysis based on biterm modeling. Zhong Qinghong et al. optimized the feature extraction of texts and pictures through the LDA2Vec and ResNet V2 models, addressing the semantic-gap problem between heterogeneous data. Yang Ling et al. located fluctuation points and their topics with a method combining principal component analysis (PCA) and a topic model: reducing the dimensionality of the feature matrix with PCA yields its principal components and unifies quantities originally scattered over many positions.
With the continuous improvement and optimization of the LDA model, text mining has shown good adaptability to functions such as topic discovery, trend analysis and topic evolution. However, current topic analysis methods are mainly aimed at short texts such as microblog comments, and algorithms that perform well on longer texts are lacking. In addition, most existing quantitative research uses bibliometric measures such as publication counts and citation rates to survey the literature from specific perspectives, such as a single subject field or an interdisciplinary field, and lacks a global perspective and studies of topic-trend evolution.
Disclosure of Invention
In view of the above problems, it is an object of the present invention to provide a topic analysis method and system based on kernel principal component analysis and LDA that can process longer texts and offers a global perspective.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a topic analysis method based on kernel principal component analysis and LDA comprises the following steps:
1) Acquiring a literature corpus, and preprocessing each article in the literature corpus;
2) Establishing a KPCA-LDA topic model according to the preprocessed literature corpus;
3) Performing topic analysis on articles in the literature corpus by adopting the established KPCA-LDA topic model, and determining text representation of the articles in the literature corpus;
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and a plurality of topics represented by words are generated.
Further, the specific process of the step 2) is as follows:
2.1) Extracting the vocabulary w_L = (w_1, …, w_j, …, w_W) of the articles in the preprocessed document corpus D, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L;
2.2) Generating a document-word matrix of the document corpus D;
2.3) Using a P-order polynomial kernel function, mapping the generated document-word matrix from its original two dimensions into a high-dimensional Hilbert space through a nonlinear mapping, reducing the dimensionality to obtain a low-dimensional topic-word matrix R with n rows and n columns, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
Further, the specific process of the step 2.2) is as follows:
2.2.1) Suppose there are M articles in the document corpus D, D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i, taken as the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus;
2.2.2) Computing in turn the weight of each word of the vocabulary w_L in each article, obtaining the document-word matrix of the document corpus D.
Further, the specific process of the step 3) is as follows:
3.1) Based on the definition of a topic, the generation probability p(w|d) of word w in article d is calculated:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
3.2) According to the parameter settings used in establishing the KPCA-LDA topic model, the probability p(w|d) that article d contains word w is obtained:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
3.3) From the probability p(w|d) that article d contains word w, the conditional probability distribution p(d|\alpha, \beta) of generating article d is obtained:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter for topic h and word j; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise.
Further, the specific process of the step 4) is as follows:
4.1) After inputting the extracted vocabulary, the document-word matrix of the document corpus D, the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β, carry out iterative computation with the Gibbs sampling algorithm, estimate the unknown parameter variables, and solve for and output the document-topic matrix θ and the topic-word matrix φ.

The document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M;
The topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K;
4.2) K topics, each represented by t words, are generated.
Further, in step 4.2), the optimal number of topics is determined using topic coherence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor that ensures the score returns a real number. The topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
A topic analysis system based on kernel principal component analysis and LDA, comprising:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus;
the model construction module is used for establishing a KPCA-LDA topic model according to the preprocessed document corpus;
the text representation determining module is used for carrying out topic analysis on the articles in the document corpus by adopting the established KPCA-LDA topic model to determine the text representations of the articles in the document corpus;
the topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
Further, the model building module includes:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
A processor comprising computer program instructions which, when executed by the processor, implement the steps corresponding to the above topic analysis method based on kernel principal component analysis and LDA.
A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps corresponding to the above topic analysis method based on kernel principal component analysis and LDA.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. In topic mining, because the literature in many fields is broad in scope, scattered in research topics and long in text, the resulting document-word matrix is high-dimensional and sparse, which is unfavorable for generating high-quality topics; the invention therefore applies kernel principal component analysis to reduce the dimensionality of the document-word matrix before topic modeling, alleviating sparsity and improving topic quality.
2. For literature characterized by broad scope, scattered research topics and long texts, the invention uses topic coherence to determine the optimal number of topics, making the analysis of literature topic evolution more comprehensive and accurate; the method can be widely applied in the field of text mining.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a KPCA-LDA topic model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of evolution of a literature topic according to an embodiment of the present invention;
FIG. 4 is a graph of the evolution trend of literature topic intensity according to an embodiment of the present invention, where the abscissa is the year and the ordinate is the topic intensity.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
Term interpretation:
1. LDA: Latent Dirichlet Allocation;
2. BTM: Biterm Topic Model, a topic model based on word co-occurrence pairs (biterms);
3. LDA2Vec: a model combining LDA with word2vec word embeddings;
4. ResNet V2: Residual Network, version 2;
5. PCA: Principal Component Analysis;
6. KPCA: Kernel Principal Component Analysis;
7. Gibbs Sampling: a Markov chain Monte Carlo sampling algorithm.
Aimed at literature characterized by long texts, broad scope and scattered topics, the embodiments of the present invention provide a topic analysis method and system based on kernel principal component analysis and LDA. In the KPCA-LDA topic model, KPCA is an improved PCA: a kernel-based nonlinear dimension-reduction method that uses a nonlinear mapping to map the data of the original space into a high-dimensional Hilbert space and then performs principal component analysis on the mapped data in that high-dimensional space.
Example 1
As shown in fig. 1, the present embodiment provides a topic analysis method based on kernel principal component analysis and LDA, which includes the following steps:
1) Obtain a literature corpus D and preprocess each article in it; the preprocessing includes deleting punctuation marks, deleting English characters, word segmentation, stop-word removal and the like.
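As a concrete illustration, a minimal preprocessing sketch in Python follows. The patent names no specific tokenizer or stop-word list, so the jieba segmenter and the tiny stop-word set here are illustrative assumptions only.

```python
import re

import jieba  # assumed Chinese word segmenter; the patent does not name a tool

STOPWORDS = {"的", "和", "在", "了", "与"}  # illustrative stop-word list

def preprocess(article: str) -> list[str]:
    # Delete punctuation and English characters by keeping only CJK characters
    text = re.sub(r"[^\u4e00-\u9fa5]", "", article)
    # Word segmentation, then stop-word removal (single characters also dropped)
    return [w for w in jieba.lcut(text) if w not in STOPWORDS and len(w) > 1]

raw_docs = ["高等教育国际化研究的主题分析……", "基于LDA的文献主题挖掘方法……"]
corpus = [preprocess(d) for d in raw_docs]
```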
2) According to the preprocessed literature corpus D, a KPCA-LDA topic model is built, specifically:
2.1 Extracting vocabulary of each article in the preprocessed document corpus D):
By scanning the document corpus D, mutually distinct words in the articles are appended to the vocabulary in turn, yielding the vocabulary w_L = (w_1, …, w_j, …, w_W) of the article set, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L.
2.2 Generating a document-term matrix of the document corpus D):
2.2.1) Assume there are M articles in the document corpus D, i.e., D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i; the weight takes the value of the term frequency (TF), i.e., d_{ij} is the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus.
2.2.2) Compute in turn the weight of each word of the vocabulary w_L in each article, obtaining the document-word matrix of the document corpus D.
2.3) Using a P-order polynomial kernel function, map the generated document-word matrix from two dimensions into a high-dimensional Hilbert space (H space) through a nonlinear mapping, then reduce the dimensionality to obtain a low-dimensional symmetric matrix R with n rows and n columns, i.e., the topic-word matrix, which serves as the input corpus of the KPCA-LDA topic model.
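The document-word matrix and the KPCA step can be sketched as follows, assuming scikit-learn's CountVectorizer and KernelPCA; the polynomial order P = 3 and the target dimension n are illustrative choices, not values fixed by the patent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import KernelPCA

# Pre-segmented articles, joined back into space-separated strings
docs = ["高等教育 国际化 研究", "国际化 教育 留学 研究", "文本 挖掘 主题 模型"]

# Document-word matrix: entry d_ij = term frequency of word j in article i
X = CountVectorizer().fit_transform(docs).toarray()

# P-order polynomial-kernel PCA: nonlinearly map into a high-dimensional
# Hilbert space, then keep the first n principal components
n = 2  # illustrative target dimension, bounded above by the number of articles
kpca = KernelPCA(n_components=n, kernel="poly", degree=3)  # P = 3 assumed
R = kpca.fit_transform(X)
```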
3) Perform topic analysis on the articles in the document corpus D using the established KPCA-LDA topic model and determine the text representation of the articles in the corpus, specifically:
The KPCA-LDA topic model based on the word co-occurrence matrix is shown in FIG. 2, and each parameter of FIG. 2 is explained in Table 1 below:

Table 1: parameter specification table

(Table 1 is reproduced as an image in the original publication; it defines each symbol of the KPCA-LDA topic model.)
3.1) Based on the definition of a topic, the generation probability p(w|d) of word w in article d is calculated as:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics.
3.2) According to the parameter settings used in establishing the KPCA-LDA topic model, the probability p(w|d) that article d contains word w is obtained as:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution.
3.3) From the probability p(w|d) that article d contains word w, the conditional probability distribution p(d|\alpha, \beta) of generating article d is obtained as:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d (d is the generic document symbol of the LDA formulas, while index i denotes the selected i-th article); \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise.
That is, the KPCA-LDA topic model generates an article through the following steps: select a latent topic z from the probability distribution θ, then select a word w from the topic-word distribution φ_z corresponding to the latent topic z, and repeat N_d times until an article containing N_d words is generated. The optimization goal of the KPCA-LDA topic model is to maximize the conditional probability distribution p(d|α, β).
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and K topics represented by t words are generated, wherein the method specifically comprises the following steps:
4.1) After inputting the extracted vocabulary, the document-word matrix of the corpus D and the related parameter values (namely the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β), carry out iterative computation with the Gibbs sampling algorithm, estimate the unknown parameter variables, and solve for and output the document-topic matrix θ and the topic-word matrix φ.

The document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M.
The topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K.
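The patent does not spell out the sampler's update equations, so the following is a minimal collapsed Gibbs sampler sketch for a plain LDA model with symmetric hyperparameters; θ and φ are recovered from the final count tables exactly as the matrices above describe. Function and variable names are illustrative.

```python
import numpy as np

def gibbs_lda(docs, W, K, alpha, beta, iters=200, seed=0):
    """docs: list of articles, each a list of word ids in [0, W).
    Returns the document-topic matrix theta (M x K) and the
    topic-word matrix phi (K x W)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_dk = np.zeros((M, K))  # topic counts per document
    n_kw = np.zeros((K, W))  # word counts per topic
    n_k = np.zeros(K)        # total word count per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random topic init
    for d, doc in enumerate(docs):  # seed the count tables
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | z_-i, w), up to normalization
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + W * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k  # resample, then restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_dk + alpha) / (n_dk.sum(1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + W * beta)
    return theta, phi

theta, phi = gibbs_lda([[0, 1, 2, 1], [2, 3, 3, 0]], W=4, K=2, alpha=0.1, beta=0.01)
```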
4.2) Select the t most readable words to express each topic and generate K topics represented by t words, specifically as follows:

The optimal number of topics is determined using topic coherence (Topic Coherence), which offers a stronger guarantee of interpretability; the UMass index defines the score on the basis of document co-occurrence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor that ensures the score returns a real number. The topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
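A direct transcription of the UMass score into Python, assuming the corpus is given as lists of words; the helper names are illustrative.

```python
import math
from itertools import combinations

def umass_coherence(top_words, corpus, eps=1.0):
    """Sum of log((D(x, y) + eps) / D(x)) over the word pairs of one topic."""
    doc_sets = [set(doc) for doc in corpus]
    def D(*words):  # number of documents containing all of the given words
        return sum(all(w in s for w in words) for s in doc_sets)
    return sum(
        math.log((D(x, y) + eps) / D(x))
        for x, y in combinations(top_words, 2)
        if D(x) > 0  # topic words normally occur in the corpus, so D(x) >= 1
    )
```

Sweeping candidate topic numbers, fitting one model per candidate and keeping the number with the highest coherence reproduces the selection rule above.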
The topic analysis method based on kernel principal component analysis and LDA is described in detail below, taking topic analysis and evolution of the literature in the field of higher-education research as a specific embodiment:
1. topic analysis
1) Establishing a KPCA-LDA topic model:
1.1) Obtain the literature in the field of higher-education research, collect the document abstracts to construct a document corpus, and preprocess the articles in the corpus (word segmentation, stop-word removal and the like) to form a standardized document corpus.
1.2) By scanning the standardized document corpus, obtain the vocabulary and the document-word matrix.
1.3) Perform KPCA dimension reduction on the document-word matrix to obtain the low-dimensional symmetric matrix R, and represent the literature corpus with the dimension-reduced matrix.
2) Perform topic analysis on each article in the document corpus using the established KPCA-LDA topic model to determine the text representation of the articles in the corpus; the prior parameters α and β of the model are determined from empirical values in the existing literature, and the topic number K is determined using topic coherence.
3) Train the KPCA-LDA topic model and estimate its parameters with the Gibbs sampling algorithm, solve for the parameters of the KPCA-LDA topic model to obtain the document-topic distribution matrix and the topic-word distribution matrix, and determine the best KPCA-LDA topic model.
2. Subject evolution:
1) Research framework
A topic evolution study is performed on the text set using the established KPCA-LDA model, as shown in FIG. 3. In topic evolution, topics are extracted first, and the study then proceeds from the following two aspects: (1) evolution of topic intensity, interpreted through the document-topic distributions of the text sets in different time windows; (2) evolution of topic content, measured by the similarity of topic distributions in different time windows and the topic-word distributions under similar topics.
2) Topic evolution research based on KPCA-LDA topic model
The evolution study based on the KPCA-LDA topic model is performed from the following aspects: (1) determining the optimal number of topics using topic coherence; (2) aligning topics using the edit-distance method. On this basis, the texts are first clustered by year, and the corresponding formulas are used to calculate topic intensity and similarity so as to analyze the evolution of topic intensity and of topic content, specifically:
2.1 Optimum topic count determination aspect
The optimal number of topics in the topic evolution study is determined using topic coherence (Topic Coherence).
2.2) Topic alignment aspect
Topics are aligned using the edit distance (Levenshtein distance):

Levenshtein.distance(str1, str2)    (8)

where str1 is the source string and str2 is the string into which it is to be converted.
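As a sketch, topics in adjacent time windows can be paired by the edit distance between their top-word strings, assuming the python-Levenshtein package; the example strings are illustrative.

```python
import Levenshtein  # pip install python-Levenshtein

topics_2017 = ["国际化 教育 留学", "科研 评价 指标"]
topics_2018 = ["国际 教育 合作", "科研 评价 体系"]

# Pair each 2017 topic with the 2018 topic at minimum edit distance
alignment = {
    t1: min(topics_2018, key=lambda t2: Levenshtein.distance(t1, t2))
    for t1 in topics_2017
}
```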
For the evolution of topic intensity, let \theta_{d,z} be the proportion of topic z in document d; over the text set of time window t, the intensity \bar{\theta}_z^t of topic z in window t is:

\bar{\theta}_z^{\,t} = \frac{1}{\lvert D_t \rvert} \sum_{d \in D_t} \theta_{d,z}

where D_t is the document set in time window t and \theta_{d,z} is the document-topic matrix entry for topic z in document d.
The intensity of topic z is calculated over the different time windows t, and an intensity-change graph is drawn in chronological order for studying and analyzing the trend of topic-intensity evolution.
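A small sketch of the intensity computation, reusing the document-topic matrix theta produced by the Gibbs sampler sketch above; the year labels are illustrative.

```python
import numpy as np

def topic_intensity(theta, years, z, window):
    """Mean document-topic weight of topic z over the documents whose
    year label equals the given time window."""
    idx = [i for i, y in enumerate(years) if y == window]
    return float(np.mean(theta[idx, z]))

years = [2014, 2015]  # one year label per row of theta
curve = [topic_intensity(theta, years, z=0, window=y) for y in (2014, 2015)]
```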
3. Analysis of results
All text sets are trained with the established KPCA-LDA topic model, and the document-topic probability distributions of the text sets are computed. The text set is divided into five time windows covering 2014-2018, and the topic intensity is computed for each of the five windows. The top 10 hot topics in the text set are selected and the keywords under each hot topic are listed; the resulting topic identification results are shown in Table 2 below:
table 2: literature topic and keywords thereof
(Table 2 is reproduced as an image in the original publication; it lists the identified literature topics and their keywords.)
As can be seen from Table 2, topic 11 concerns internationalized education, topic 25 concerns cross-border internationalized education, and topic 38 concerns studying abroad and Chinese-foreign cooperative education; the evolution trend of topic intensity can be derived from the probability distribution of the topics over the text sets of different time windows, as shown in FIG. 4.
For the evolution of topic content, Table 3 below gives the higher-education-related topics and their keywords in each time window:
table 3: each time window theme and keywords thereof
(Table 3 is reproduced as an image in the original publication; it lists the topics and keywords for each time window.)
In conclusion, the identified topics and their evolution trends agree closely with the actual situation, showing that the method provided by the invention performs well in tracking research development trends and research hotspots in a specific field.
Example 2
The embodiment provides a topic analysis system based on kernel principal component analysis and LDA, which comprises:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus.
The model construction module is used for establishing a KPCA-LDA topic model according to the preprocessed literature corpus.
And the text representation determining module is used for carrying out topic analysis on the articles in the document corpus by adopting the established KPCA-LDA topic model to determine the text representations of the articles in the document corpus.
The topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
In a preferred embodiment, the model building module comprises:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model.
Example 3
This embodiment provides a processing device corresponding to the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1; the processing device may be a client-side device, for example a mobile phone, notebook computer, tablet computer or desktop computer, that performs the method of Embodiment 1.
The processing device comprises a processor, a memory and a communication interface, which are connected through a bus to enable communication within the processing device. The memory stores a computer program executable on the processor, and when executing the program the processor performs the topic analysis method based on kernel principal component analysis and LDA provided in Embodiment 1.
In some implementations, the memory may be high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
In other implementations, the processor may be a central processing unit (CPU), a digital signal processor (DSP) or another general-purpose processor, which is not limited herein.
Example 4
The topic analysis method based on kernel principal component analysis and LDA of Embodiment 1 may be embodied as a computer program product, which may include a computer-readable storage medium bearing computer-readable program instructions for performing the topic analysis method based on kernel principal component analysis and LDA of Embodiment 1.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding.
The foregoing embodiments are only for illustrating the present invention, wherein the structures, connection modes, manufacturing processes, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solutions of the present invention should not be excluded from the protection scope of the present invention.

Claims (6)

1. A topic analysis method based on kernel principal component analysis and LDA, which is characterized by comprising the following contents:
1) Acquiring a literature corpus, and preprocessing each article in the literature corpus;
2) According to the pretreated document corpus, a KPCA-LDA topic model is established, and the specific process is as follows:
2.1) extracting the vocabulary w_L = (w_1, …, w_j, …, w_W) of each article in the preprocessed document corpus D, where W is the vocabulary length and w_j is the j-th word of the vocabulary w_L;
2.2 A document-word matrix of the document corpus D is generated, and the specific process is as follows:
2.2.1) supposing there are M articles in the document corpus D, D = (D_1, D_2, …, D_M)^T, where D_i is the i-th article in the document corpus D and D_i = [d_{i1} d_{i2} … d_{iW}], in which d_{ij} is the weight of word w_j in D_i, taken as the number of occurrences of the j-th vocabulary word w_j in the i-th article of the corpus;
2.2.2) computing in turn the weight of each word of the vocabulary w_L in each article to obtain the document-word matrix of the document corpus D;
2.3) using a P-order polynomial kernel function, mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping, reducing the dimensionality to obtain a low-dimensional topic-word matrix R with n rows and n columns, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model;
3) Performing topic analysis on articles in a literature corpus by adopting the established KPCA-LDA topic model, and determining text representation of the articles in the literature corpus, wherein the specific process comprises the following steps of:
3.1) based on the definition of a topic, calculating the generation probability p(w|d) of word w in article d:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
3.2) according to the parameter settings used in establishing the KPCA-LDA topic model, obtaining the probability p(w|d) that article d contains word w:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
3.3) from the probability p(w|d) that article d contains word w, obtaining the conditional probability distribution p(d|\alpha, \beta) of generating article d:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise;
4) Training and parameter estimation are carried out on the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, parameters of the KPCA-LDA topic model are solved, and a plurality of topics represented by words are generated.
2. The method for analyzing the topic based on the kernel principal component analysis and the LDA as claimed in claim 1, wherein the specific process of the step 4) is as follows:
4.1) inputting the extracted vocabulary, the document-word matrix of the document corpus D, the document topic-distribution hyperparameter α and the topic word-distribution hyperparameter β, carrying out iterative computation with the Gibbs sampling algorithm, estimating the unknown parameter variables, and solving for and outputting the document-topic matrix θ and the topic-word matrix φ,

wherein the document-topic matrix θ is:

\theta = \begin{bmatrix} \theta_{D_1 z_1} & \cdots & \theta_{D_1 z_K} \\ \vdots & \ddots & \vdots \\ \theta_{D_M z_1} & \cdots & \theta_{D_M z_K} \end{bmatrix}

where \theta_{D_M z_K} is the probability of topic z_K in article D_M;
and the topic-word matrix φ is:

\varphi = \begin{bmatrix} \varphi_{z_1 w_1} & \cdots & \varphi_{z_1 w_W} \\ \vdots & \ddots & \vdots \\ \varphi_{z_K w_1} & \cdots & \varphi_{z_K w_W} \end{bmatrix}

where \varphi_{z_K w_W} is the probability of word w_W under topic z_K;
4.2) generating K topics, each represented by t words.
3. The topic analysis method based on kernel principal component analysis and LDA according to claim 2, wherein in step 4.2) the optimal number of topics is determined using topic coherence:

\mathrm{score}(x, y, \epsilon) = \log \frac{D(x, y) + \epsilon}{D(x)}

\mathrm{Coherence}(V) = \sum_{(x, y) \in V} \mathrm{score}(x, y, \epsilon)

where D(x, y) is the number of documents containing both words x and y; D(x) is the number of documents containing word x; V is the set of words describing a topic; and ε is a smoothing factor ensuring the score returns a real number; the topic number whose word set V maximizes Coherence(V) is taken as the optimal number of topics.
4. A topic analysis system based on kernel principal component analysis and LDA, comprising:
the data acquisition module is used for acquiring a literature corpus and preprocessing each article in the literature corpus;
the model building module is used for building a KPCA-LDA topic model according to the preprocessed literature corpus, and comprises the following steps:
the vocabulary extracting unit is used for extracting the vocabulary of each article in the preprocessed document corpus;
a matrix generation unit for generating a document-word matrix of the document corpus;
the dimension reduction unit is used for mapping the generated document-word matrix from two dimensions into a high-dimensional Hilbert space through a nonlinear mapping with a P-order polynomial kernel function, obtaining a low-dimensional topic-word matrix R with n rows and n columns through dimension reduction, and taking the topic-word matrix R as the input corpus of the KPCA-LDA topic model;
the text representation determining module is used for performing topic analysis on the articles in the document corpus with the established KPCA-LDA topic model and determining the text representation of the articles in the document corpus, the specific process being:
based on the definition of a topic, calculating the generation probability p(w|d) of word w in article d:

p(w \mid d) = \sum_{j=1}^{K} p(w \mid z = j)\, p(z = j \mid d)

where z denotes the latent topic from which word w is drawn; p(w|z=j) is the probability that word w is generated by latent topic j; p(z=j|d) is the probability that article d selects topic j; and K is the number of topics;
obtaining the probability p(w|d) that article d contains word w according to the parameter settings used in establishing the KPCA-LDA topic model:

p(w \mid d) = \sum_{z=1}^{K} \varphi_{z,w}\, \theta_{d,z}

where \varphi is the topic-word probability distribution and \theta is the document-topic probability distribution;
obtaining the conditional probability distribution p(d|\alpha, \beta) of generating article d from the probability p(w|d) that article d contains word w:

p(d \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{h=1}^{K} \alpha_h\right)}{\prod_{h=1}^{K} \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{K} \theta_h^{\alpha_h - 1} \right) \left( \prod_{n=1}^{N_d} \sum_{h=1}^{K} \prod_{j=1}^{W} \left( \theta_h \beta_{h,j} \right)^{w_j^n} \right) \mathrm{d}\theta

where \alpha_h is the document topic-distribution hyperparameter for topic h; N_d is the total number of words in article d; \theta_h is the probability of topic h in the article; \beta_{h,j} is the topic word-distribution hyperparameter; and w_j^n equals 1 if the n-th word of the article is the j-th vocabulary word and 0 otherwise; and
the topic generation module is used for training and parameter estimation of the KPCA-LDA topic model by adopting a Gibbs sampling algorithm, solving the parameters of the KPCA-LDA topic model and generating a plurality of topics represented by words.
5. A processor comprising computer program instructions, wherein the computer program instructions, when executed by the processor, implement the steps corresponding to the topic analysis method based on kernel principal component analysis and LDA of any of claims 1-3.
6. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps corresponding to the topic analysis method based on kernel principal component analysis and LDA of any of claims 1-3.
CN202110709322.3A 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA Active CN113344107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110709322.3A CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110709322.3A CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Publications (2)

Publication Number Publication Date
CN113344107A CN113344107A (en) 2021-09-03
CN113344107B (en) 2023-07-11

Family

ID=77478609

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110709322.3A Active CN113344107B (en) 2021-06-25 2021-06-25 Topic analysis method and system based on kernel principal component analysis and LDA

Country Status (1)

Country Link
CN (1) CN113344107B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8510257B2 (en) * 2010-10-19 2013-08-13 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102629275A (en) * 2012-03-21 2012-08-08 复旦大学 Face and name aligning method and system facing to cross media news retrieval
CN102902700A (en) * 2012-04-05 2013-01-30 中国人民解放军国防科学技术大学 Online-increment evolution topic model based automatic software classifying method
CN105975499A (en) * 2016-04-27 2016-09-28 深圳大学 Text subject detection method and system
CN106844416A (en) * 2016-11-17 2017-06-13 中国科学院计算技术研究所 A kind of sub-topic method for digging
CN107203958A (en) * 2017-05-25 2017-09-26 段云涛 A kind of hidden image analysis method based on multiple features combining
CN108519971A (en) * 2018-03-23 2018-09-11 中国传媒大学 A kind of across languages theme of news similarity comparison methods based on Parallel Corpus
CN109063030A (en) * 2018-07-16 2018-12-21 南京信息工程大学 A method of theme and descriptor are implied based on streaming LDA topic model discovery document
CN109325092A (en) * 2018-11-27 2019-02-12 中山大学 Merge the nonparametric parallelization level Di Li Cray process topic model system of phrase information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Effect of thermal oxidation on detection of adulteration at low concentrations in extra virgin olive oil: Study based on laser-induced fluorescence spectroscopy combined with KPCA–LDA";Yi Li.etc;《Food Chemistry》;全文 *

Also Published As

Publication number Publication date
CN113344107A (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN110442872B (en) Text element integrity checking method and device
CN111259153B (en) Attribute-level emotion analysis method of complete attention mechanism
CN114647741A (en) Process automatic decision and reasoning method, device, computer equipment and storage medium
CN115238029A (en) Construction method and device of power failure knowledge graph
CN112784591A (en) Data processing method and device, electronic equipment and storage medium
CN115578137A (en) Agricultural product future price prediction method and system based on text mining and deep learning model
CN113051932A (en) Method for detecting category of network media event of semantic and knowledge extension topic model
CN116775812A (en) Traditional Chinese medicine patent analysis and excavation tool based on natural voice processing
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN104679784A (en) O2B intelligent searching method and system
CN113344107B (en) Topic analysis method and system based on kernel principal component analysis and LDA
CN117131932A (en) Semi-automatic construction method and system for domain knowledge graph ontology based on topic model
US20170337484A1 (en) Scalable web data extraction
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
Pogorilyy et al. Assessment of Text Coherence by Constructing the Graph of Semantic, Lexical, and Grammatical Consistancy of Phrases of Sentences
Li et al. Text classification based on machine learning and natural language processing algorithms
Ye Translation mechanism of neural machine algorithm for online English resources
He An intelligent diagnosis system for English writing based on data feature extraction and fusion
Jiang et al. A Discourse Coherence Analysis Method Combining Sentence Embedding and Dimension Grid
Ong et al. A Comparative Study of Extractive Summary Algorithms Using Natural Language Processing
CN112989827A (en) Text data set quality evaluation method based on multi-source heterogeneous characteristics
Jiang et al. Python-Based Visual Classification Algorithm for Economic Text Big Data
Wang et al. A semantic path based approach to match subgraphs from large financial knowledge graph
Dai et al. A novel attention-based BiLSTM-CNN model in valence-arousal space
Liu et al. Practical Skills of Business English Correspondence Writing Based on Data Mining Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant