CN110134958B

CN110134958B - Short text topic mining method based on semantic word network

Info

Publication number: CN110134958B
Application number: CN201910400416.5A
Authority: CN
Inventors: 张雷; 经伟; 蔡洋; 陆恒杨; 徐鸣; 王崇骏
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2021-05-18
Anticipated expiration: 2039-05-14
Also published as: CN110134958A

Abstract

The invention discloses a short text topic mining method based on a semantic word network, which comprises the following steps of 1) model initialization stage: external corpus collection, corpus preprocessing, parameter setting and the like in the related field; 2) a subject unit construction stage: constructing a semantic word network, searching a specific word triangular structure, calculating model prior parameters and the like; 3) a model training stage: sampling the model variable by using a Gibbs sampling method, and judging whether the model reaches a convergence condition; 4) and a result output stage: and obtaining the topic distribution of each word triangle according to the sampling result of each variable after the model training is finished, and further calculating the topic distribution of the original document. The method combines semantic information learned by an external corpus with a word triangle topic structure, is applied to the aspect of short text topic mining, provides a solution for integrating external prior knowledge into a traditional topic model compared with the traditional word-to-topic model, and obviously improves the quality of the mined topic.

Description

Short text topic mining method based on semantic word network

Technical Field

The invention relates to a short text topic mining method, in particular to a short text topic mining method based on a semantic word network, which solves the problem that the topic quality is not high under the condition that the short text features are sparse in a common topic mining method.

Background

With the continuous acceleration of social development rhythm and the short and fast user experience brought by the intelligent mobile terminal, people tend to fragmentize more and more during communication on the network. Therefore, short text data is more and more important in today's network information interaction, for example, social network status, microblog text messages, traditional news headlines, short video headlines, question and answer websites, and the like are all in the form of short texts. And short text data is also generated and accumulated at great speed with the rise of mass companies such as microblog, cicada, Facebook, Twitter, etc. Therefore, the topic information mining from massive short text data has significant value, and for example, public opinion analysis, information retrieval, personalized recommendation, user interest clustering and the like are all application directions of topic mining. On the other hand, the traditional text mining method is difficult to mine the subject information of the short text, and the main reason is that the word co-occurrence information in the short text is very sparse.

At present, for a solution of sparse short text features, word co-occurrence relation is generally utilized. This solution is based on an assumption: the word pairs which co-occur in the same short text have topic relation. For example, two model word pair topic models and word network topic models are commonly used in the short text mining topic area. The former forms word pairs as basic subject units through co-occurrence words, and the latter forms a pseudo document for each word through co-occurrence words to assist in discovering the subject of the corresponding word. These methods ignore semantic relations between words, for example, "holiday" and "holiday" are two words with very close semantics, and their constituent words should contribute more to the subject than common co-occurrence words, but are ignored by the common model because of the fresh co-occurrence in the same short text.

The word vector is a method for representing words in a computer, and the words can be directly used as a characteristic input model based on the representation, so that great convenience is brought to the processing of natural language. Compared with the traditional one-hot expression word vector, the distributed expression word vector has the advantages that the vector dimension is lower and more controllable, and on the other hand, a large amount of external linguistic data are used for training through a neural language model, so that the included semantic information is richer. The invention utilizes the advantage of distributed word vectors for representing semantics, provides a new solution for the short text topic mining method by utilizing the word vectors to measure the semantic similarity of words and adding the semantic similarity into a word triangular topic model as prior knowledge.

Disclosure of Invention

The purpose of the invention is as follows: the technical problem to be solved by the invention is that when the word co-occurrence information is considered in response to the feature scarcity of short text data in the traditional topic model, the mined topic quality is not high enough due to the introduced noise information and the ignored semantic information. The invention discloses a method for mining a theme by introducing external semantic information and fusing word co-occurrence information to construct a semantic word network, which comprises the following steps: firstly, collecting external corpora from related fields and training word vectors through a word2vec model; then, traversing a target language database and combining word vector information to generate a semantic word network, and selecting a specific word triangular structure in the semantic word network; then, parameters are sampled by utilizing a Gibbs sampling method, and multiple iterations are carried out to achieve convergence; and finally, calculating the topic distribution of the word triangles according to the sampling result, and further calculating the main body distribution of the documents in the target language material library.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:

a short text topic mining method based on semantic word network comprises the following steps:

step 1, model initialization stage: collecting external corpora of related fields to construct an external corpus; preprocessing the external corpus and the target corpus to convert the corpora in the external corpus and the target corpus into a format which can be accepted by a word2vec model; taking an external corpus as input, training a word2vec model, and enabling the word2vec model to output an appointed word vector; extracting word vector data in a target corpus by a trained word2vec model;

step 2, a subject unit construction stage:

2) -a is from the target corpus D ═ { D ═ D₁，d₂，...，d_nThe co-occurrence relation of the Chinese words generates a basic word co-occurrence network, and the specific steps are as follows:

step 2) -a-1) establishing a point set V, an edge set E and an edge attribute set R, wherein the initial state is null;

steps 2) -a-2) for document d_k＝{w₁，w₂，...，w_mEvery word w in_iIf the word w_iIf the current vector does not appear in the set V, adding the current vector into the set V, wherein k belongs to {1, 2., n };

steps 2) -a-3) for document d_kAll word pairs w in_i，w_jIf not, add it to E and add the attribute pair R to the set R_ij＝<S_ij，s_ij>(ii) a Wherein S is_ijWhere denotes the set of document numbers containing the word pair, s_ijRepresents the word w_iAnd w_jSemantic similarity attributes between them; order S_ijIf e, if k_ijAlready present in the collection E, then the edge attribute r_ijDocument number attribute set S in (1)_ijAdding a document number k;

2) b, fusing semantic information on the basis of the word co-occurrence network to construct a semantic word network, and specifically comprising the following steps:

step 2) -b-1) comparing the word vector data of the words in the target corpus and the external corpus, and setting the corresponding word vector to be empty for the words which are not registered in the target corpus, namely the subsequent semantic-free information;

step 2) -b-2) setting a threshold value delta;

step 2) -b-3) for each pair of word nodes w in the word co-occurrence network_iAnd w_jCalculating semantic similarity between the word pairs according to the following formula:

wherein the content of the first and second substances,

and

respectively represent words w_iAnd w_jA corresponding word vector;

step 2) -b-4) judging each pair of word nodes w_iAnd w_jWhether there is an edge connection between them; if yes, go to step 2) -b-5) (ii) a Otherwise, go to step 2) -b-6);

step 2) -b-5) comparing the semantic similarity s_ijEntering an edge attribute r_ij＝<S_ij，s_ij>In which S is_ijA set of co-occurrence document numbers for the original word pairs;

step 2) -b-6) judging semantic similarity s_ijWhether or not s is satisfied_ijIs more than delta; if yes, turning to the step 2) -b-7); otherwise, the word does not perform any operation on the node;

step 2) -b-7) adding an edge set E into the edge set E_ijAnd adding attribute pairs R into the edge attribute set R_ij＝<S_ij，s_ij>Let us order

s_ij＝sim(w_i，w_j)；

Steps 2) -c for each word w in the semantic word network_iCalculating the inverse document frequency by the formula:

where | D ∈ D: w is a_iE d | represents a representation of a vector containing w_iNumber of documents of (2), N_DRepresenting the total number of documents in the corpus;

step 2) -d searching a semantic word triangle meeting the following conditions in the semantic word network:

three word nodes in the semantic word triangle are mutually connected in a side mode and are from connecting parts of different document sub-networks;

step 3, a model training stage: randomly initializing the theme distribution of the semantic word triangles for all the semantic word triangles obtained in the step (2); obtaining the triangular distribution of the current semantic words through Gibbs sampling, calculating document theme distribution and theme word distribution updating parameters, circularly iterating until reaching the maximum iteration times or Gibbs sampling convergence, and taking the finally obtained Gibbs sampling result as the triangular theme distribution of the words;

step 4, result output stage: and (4) calculating the theme distribution of the original document according to the semantic word triangular theme distribution obtained in the step (3).

Further, the specific steps of searching the semantic word triangle in the steps 2) -d include:

steps 2) -d-1) for any three words w in the set V_i，w_j，w_kJudging whether each node has an edge, i.e. whether e exists_ij，e_jk，e_ikE belongs to E; if yes, turning to the step 2) -d-2);

step 2) -d-2) determining whether S is satisfied_ij≠S_ik∧S_ik≠S_jk∧S_ij≠S_jk(ii) a If yes, turning to the step 2) -d-3);

step 2) -d-3) calculating word triangle prior knowledge

Wherein, γ_ijk＝(γ_ij+ γ_ik+γ_jk)/3，γ_ij，γ_ik，γ_jkThe calculation method is as described above;

step 2) -d-4) generating semantic word triangle t ═ (w)_i，w_j，w_k，l_ijk)。

Further, in step 3, the gibbs sampling specifically includes the following steps:

step 3) -a-1) initializing a sampling algorithm platform, and constructing a program for sampling from conditional probability distribution by using a machine learning method for use by a SWTTM model;

step 3) -a-2) randomly initializing a theme for each semantic word triangle;

step 3) -a-3) selecting a suitable iteration number T and initializing: t is 0;

steps 3) -a-4) determining whether T is less than T: if yes, turning to the step 3) -a-5); if not, as a transition to step 3) -a-13);

step 3) -a-5) randomly selecting word triangle t_q＝(w_m，w_n，w_l，l_mnl) Calculating word triangles from expansion informationDirichlet distribution hyperparameter beta_m，β_n，β_lThe concrete formula is as follows:

wherein the content of the first and second substances,

is a constant set to prevent the value of β from being too small;

step 3) -a-6) of removing word triangle t from the calculation model_qTopic distribution in the latter environment, the formula is as follows:

wherein K represents the topic number, K represents the total number of topics, V represents the total number of words in the corpus, z_qTriangle t representing word_qT denotes a full set of semantic word triangles, Z_-qRepresenting removed word triangles t_qLater distribution of topics, P (z)_q＝k|T，Z_-q) Triangle t with the term other than_qTriangular t of term derived from distribution of other subjects_qThe subject of (a) is the probability of k,

representing removed word triangles t_qRear belonging to subject z_kThe number of the word triangle is greater than the number of the word triangle,

representing removed word triangles t_qFuture topic z_kChinese word w_mThe frequency of the word distribution model is shown as alpha, beta and alpha, beta are the prior distribution hyperparameters of the document topics, beta is the distribution hyperparameters of the topic words except the current word triangle, and alpha and beta are model input parameters;

steps 3) -a-7) according to the conditional probability distribution P (z)_q＝k|T，Z_-q) Sampling a theme;

step 3) -a-8) updating the 'document-subject' distribution parameter, wherein the updating formula is as follows:

wherein the content of the first and second substances,

denotes the topic as z_kNumber of documents, N_BRepresenting the total number of documents in the corpus;

step 3) -a-9) updating word w in word triangle_mAt subject z_kThe following distribution parameters, according to the formula:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_mBeta is a dirichlet distribution hyperparameter;

step 3) -a-10) updating w in word triangle_nAt subject z_kThe following distribution parameters are updated as follows:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_nThe number of occurrences of (c);

step 3) -a-11) updating w in word triangle_lAt subject z_kThe following distribution parameters are updated as follows:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_lThe number of occurrences of (c);

step 3) -a-12) making T equal to T +1, and judging whether T is less than T: if yes, turning to the step 3) -a-5); if not, go to step 3) -a-13);

and 3) finishing the model training.

Further, the specific step of the original document theme inference in step 4 includes:

step 4) -a-1) splitting each original document in the target corpus into a word pair set;

step 4) -a-2) judging whether the word pairs have at least one related semantic word triangle; if yes, turning to the step 4) -a-3); if not, go to step 4) -a-5);

step 4) -a-3) calculating the semantic word triangle t in the document d_qThe specific formula of (c) is as follows:

wherein n is_d(t_q) Semantic word triangle set d represented in document d_tTriangle t of middle semantic word_qThe frequency of (2);

step 4) -a-3) calculating semantic word triangle t by using Bayesian formula_qThe specific formula of the theme distribution is as follows:

step 4) -a-4) to obtain the theme distribution of the document by calculation, wherein the specific formula is as follows:

wherein the content of the first and second substances,

representing the size of the triangular set of semantic words for document d;

step 4) -a-5) calculating the word w in the document d_iThe probability of occurrence is specifically as follows:

step 4) -a-6) calculating words w according to global theme distribution and word distribution under the theme_iThe specific formula of the theme distribution is as follows:

step 4) -a-7) obtaining the theme of the document according to the theme distribution of words in the document, wherein the specific formula is as follows:

compared with the prior art, the invention has the following beneficial effects:

aiming at the problem that the traditional word pair topic model is influenced by the topic word quality of the high-frequency words, the invention assumes that the representation capability of the words appearing in most documents to the topics is weak, and introduces IDF indexes and semantic similarity together as the prior knowledge of word distribution based on the assumption, thereby relieving the influence of the high-frequency words on the topic quality. Aiming at the neglect of a plurality of word pairs with close semantic relation and less co-occurrence in the common word co-occurrence network, the invention provides a novel semantic word network construction method, so that a topic model can pay more attention to topic relation among words, and the quality of the mined topics is obviously improved.

Drawings

FIG. 1 is a flow chart of a short text topic mining method based on a semantic word network;

FIG. 2 is a flow chart of semantic word network construction and semantic word triangle finding;

fig. 3 is a probabilistic graphical model of the SWTTM algorithm.

Detailed Description

The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.

Fig. 1 is a flowchart of a short text topic mining method based on a semantic word network according to an embodiment of the present invention. The specific steps are described as follows:

step 0 is the starting state of the present invention;

in the model initialization phase (step 1-3):

step 1, collecting external corpora of related fields, wherein the text length is not required;

step 2, performing preprocessing operations such as word segmentation and screening on the external corpus and the target corpus; the method mainly aims to implement the next algorithm with words as units by segmenting the word in the corpus, and comprises the following specific steps:

step 2-1) performing word segmentation processing on the two corpora respectively, and removing stop words at the same time;

step 2-2) deleting words of non-Chinese characters and non-Latin letters, and lowercase all Latin letters;

step 2-3) deleting the words with the word frequency less than 5 in the corpus, and deleting the documents with the words less than 3;

step 3, setting word2vec model parameters, taking an external corpus as input, and training a model to obtain word vector data;

in the subject unit construction phase (steps 4-6):

step 4 is based on corpus D ═ D₁，d₂，...，d_nConstructing a basic co-occurrence word network;

step 5, fusing semantic information on the basis of the word co-occurrence network to construct a semantic word network:

step 6, searching a word triangular structure meeting the conditions in the semantic word network and calculating the word inverse document frequency;

the word triangle structure satisfies the following conditions: the three word nodes are connected with edges and are from the connected parts of different document sub-networks.

In the model training phase (step 7-8):

and 7, sampling by using variables in the Gibbs sampling method model, wherein the variables represent model training according to the sample data obtained in the

steps

1 and 2, and the specific implementation process is as follows:

step 7-1, initializing a sampling algorithm platform, and constructing a program for sampling from conditional probability distribution by using a machine learning method, wherein the program is used by a SWTTM model, and the algorithm flow of the SWTTM model is shown in FIG. 3;

step 7-2 randomly initializes a topic for each semantic word triangle.

And 7-3, selecting a proper iteration number T, and initializing: t is 0;

step 7-4, judging whether T is smaller than T: yes, go to step 3) -a-5); no, go to step 3) -a-13);

step 7-5 randomly selecting word triangle t_q＝(w_m，w_n，w_l，l_mnl) Computing Dirichlet distribution hyper-parameter beta of word triangle according to expansion information_m，β_n，β_lThe concrete formula is as follows:

wherein the content of the first and second substances,

is a constant set according to the word vector sampling condition and manual evaluation in order to prevent the beta value from being too small.

Step 7-6, word triangle t is removed from the calculation model_iTopic distribution in the latter environment, the formula is as follows:

step 7-7 sampling a topic according to the conditional probability distribution;

step 7-8 updates the "document-topic" distribution parameter according to the formula as follows:

wherein the content of the first and second substances,

denotes the topic as z_kNumber of documents, N_BRepresents the total number of documents in the corpus, and K represents the total number of topics.

Step 7-9 update w in word triangle_mAt subject z_kThe following distribution parameters, according to the formula:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_mThe number of occurrences of (c).

Step 7-10 update w in word triangle_nAt subject z_kThe following distribution parameters, according to the formula:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_nThe number of occurrences of (c).

Step 7-11 update w in word triangle_lAt subject z_kThe following distribution parameters, according to the formula:

wherein the content of the first and second substances,

denotes the topic as z_kLower word w_lThe number of occurrences of (c).

And 7-12, enabling T to be T +1, judging whether T is smaller than T: yes, go to step 3) -a-5); no, go to step 3) -a-13);

and 7-13, finishing the model training.

Step 8, distributing the Gibbs sampling result as a semantic word triangular theme;

in the result output stage (steps 9-10):

step 9, splitting the original document into word pairs;

step 10, calculating the subject distribution of the original document by finding the associated semantic word triangle according to the word pair, wherein the specific method comprises the following steps:

step 10-1, judging whether the word pairs have at least one related semantic word triangle; if yes, go to step 10-2; if not, turning to the step 10-4;

step 10-2, calculating the semantic word triangle t in the document d_qThe specific formula of (c) is as follows:

wherein n is_d(t_i) Semantic word triangle set d represented in document d_tTriangle t of middle semantic word_qThe frequency of (2).

Step 10-3, semantic word triangle t is calculated by using Bayesian formula_qThe specific formula of the theme distribution is as follows:

step 10-4, calculating to obtain the theme distribution of the document, wherein the specific formula is as follows:

wherein the content of the first and second substances,

representing the size of the triangular set of semantic words for document d.

Step 10-5 calculating word w in document d_iThe probability of occurrence is specifically as follows:

step 10-6, words w are calculated according to global theme distribution and word distribution under the theme_iThe specific formula of the theme distribution is as follows:

step 10-7, obtaining the theme of the document according to the theme distribution of words in the document, wherein the specific formula is as follows:

and step 10-8 is finished.

Step 11 is the end state.

Fig. 2 is a detailed description of

steps

4 and 5 in fig. 1.

Step 12 is the start state.

Step 13 is to establish a basic co-occurrence word network, and the specific method is as follows:

step 4-1, initializing a node set V, an edge set E and an edge attribute set R

Step 4-2 for document d_k＝{w₁，w₂，...，w_mThe word w in_iIf the word does not appear in the set V, it is added to V;

step 4-3 for document d_kAll word pairs w in_i，w_jIf not, add it to E and add the attribute pair R to the set R_ij＝<S_ij，s_ij>And order S_ijK. If the edge relationship already exists in set E, S in set R_ijAdding the document number k.

Step 14, obtaining word vector data in the target language material library according to the training result in the step 3, and setting a threshold value delta related to word semantics;

step 15, for each pair of word nodes in the basic co-occurrence word network, calculating the semantic similarity between the word pairs according to the following formula:

wherein the content of the first and second substances,

the expression w_iThe corresponding word vector.

Step 16, judging whether edges are connected among the word pair nodes; yes, go to step 17. If not, go to step 18;

step 17 notes semantic similarity information into the edge attribute, i.e. s_ij＝sim(w_i，w_j)；

Step 18 judges whether the word pair semantic similarity satisfies sim (w)_i，w_j) Is more than delta; if yes, go to step 19;

step 19 adds attribute pairs R to the set R_ij＝<S_ij，s_ij>And order

s_ij＝sim(w_i，w_j)；

Step 20 calculates the inverse document frequency for each word in the semantic word network, as follows:

where | D ∈ D: w is a_iE d | represents a representation of a vector containing w_iThe number of documents.

Step 21 for any three words w_i，w_j，w_kE.g. V, judging whether each node has an edge, i.e. e_ij，e_jk，e_ikE belongs to E; the sets of side information in the attribute sets of the judgment sides are different from each other, i.e. S_ij≠S_ik∧S_ik≠S_jk∧S_ij≠S_jk(ii) a If yes, go to step 23;

step 22 of calculating word triangle prior knowledge

Step 23 generates semantic word triangle t ═ (w)_i，w_j，w_k，l_ijk)；

Step 24 is the end state.

Aiming at the problem that the traditional word pair topic model has equal treatment on word pairs with different importance, the probability that the words with tighter semantic relation belong to the same topic is higher, and the semantic relation of the words is measured by introducing word embedding trained by an external corpus. The prior knowledge of the distribution of the information subject words enables the model to attach more importance to the word pairs with larger semantic similarity. Aiming at the problem that the traditional word pair topic model is influenced by the topic word quality of high-frequency words, the invention introduces IDF indexes and semantic similarity together as prior knowledge of word distribution on the assumption that the representation capability of the words appearing in most documents to the topics is weak. The invention also provides a novel semantic word network construction method, which enables the word network to pay attention to the topic connection among words more comprehensively, and provides a basic unit with tighter topic connection, namely a semantic word triangular structure on the basis of the network, and the semantic word triangular structure is used as a topic mining unit to obtain higher topic quality.

In summary, the short text topic mining method based on the semantic word network comprehensively considers external semantic information, context word frequency information and word triangle structures, and provides a new solution for solving the problem of feature sparsity of a short text topic model during mining.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A short text topic mining method based on a semantic word network is characterized by comprising the following steps:

step 2, a subject unit construction stage:

steps 2) -a-2) for document d_k＝{w₁，w₂，...，w_mEvery word w in_iIf the word w_iIf the current vector does not appear in the set V, adding the current vector into the set V, wherein k belongs to {1, 2, …, n };

steps 2) -a-3) for document d_kAll word pairs w in_i，w_jIf not in the set E, add it to E and add the edge attribute R to the set R_ij＝<S_ij，s_ij>(ii) a Wherein S is_ijWhere denotes the set of document numbers containing the word pair, s_ijRepresents the word w_iAnd w_jSemantic similarity attributes between them; order S_ijIf e, if k_ijAlready present in the collection E, then the edge attribute r_ijDocument number set S in (1)_ijAdding a document number k;

step 2) -b-2) setting a threshold value delta;

step 2) -b-3) for each word pair w in the word co-occurrence network_iAnd w_jSemantic similarity between word pairs is calculated according to the following formula:

wherein the content of the first and second substances,

and

respectively represent words w_iAnd w_jA corresponding word vector;

step 2) -b-4) judging each word pair w_iAnd w_jWhether there is an edge connection between them; if yes, turning to the step 2) -b-5); otherwise, go to step 2) -b-6);

step 2) -b-5) comparing the semantic similarity s_ijEntering an edge attribute r_ij＝<S_ij，s_ij>Performing the following steps;

step 2) -b-6) judging semantic similarity s_ijWhether or not s is satisfied_ijIs more than delta; if yes, turning to the step 2) -b-7); otherwise, the word pair does not do any operation;

step 2) -b-7) adding the edge E into the edge set E_ijAnd adding an edge attribute R into the edge attribute set R_ij＝<S_ij，s_ij>Let us order

s_ij＝sim(w_i，w_j)；

step 3, a model training stage: randomly initializing the theme distribution of the semantic word triangles for all the semantic word triangles obtained in the step (2); obtaining the triangular distribution of the current semantic words through Gibbs sampling, calculating document theme distribution and theme word distribution updating parameters, circularly iterating until reaching the maximum iteration times or Gibbs sampling convergence, and taking the finally obtained Gibbs sampling result as the triangular theme distribution of the semantic words;

2. The method for mining short text topics based on semantic word network as claimed in claim 1, wherein: the specific steps of searching the semantic word triangle in the steps 2) -d comprise:

step 2) -d-3) calculating word triangle prior knowledge

Wherein, γ_ijk＝(s_ij+s_ik+s_jk)/3；

3. The method for mining short text topics based on semantic word network as claimed in claim 2, wherein: in the step 3, the gibbs sampling process is as follows:

step 3) -a-1) randomly initializing a theme for each semantic word triangle;

step 3) -a-2) selecting a proper iteration number T and initializing: pt is 0;

step 3) -a-3) judging whether pt is smaller than T: if yes, turning to the step 3) -a-4); if not, go to step 3) -a-12);

step 3) -a-4) randomly selecting word triangle t_q＝(w_m，w_n，w_l，l_mnl) Calculating a dirichlet distribution hyper-parameter beta of the word triangle according to the expansion information_m，β_n，β_lThe concrete formula is as follows:

wherein ∈ is a constant set to prevent the β value from being too small;

step 3) -a-5) of removing word triangle t from the calculation model_qTopic distribution in the latter environment, the formula is as follows:

wherein h represents the topic number, K represents the total number of topics, F represents the total number of words in the corpus, and z represents_qTriangle t representing word_qU denotes a full set of semantic word triangles, Z_-qRepresenting removed word triangles t_qLater distribution of topics, P (z)_q＝h|U，Z_-q) Triangle t with the term other than_qTriangular t of term derived from distribution of other subjects_qThe topic of (a) is the probability of h,

representing removed word triangles t_qRear belonging to subject z_hThe number of the word triangle is greater than the number of the word triangle,

representing removed word triangles t_qRear belonging to subject z_sThe number of the word triangle is greater than the number of the word triangle,

representing removed word triangles t_qFuture topic z_hChinese word w_mThe frequency of (2) is determined,

representing removed word triangles t_qFuture topic z_hChinese word w_nThe frequency of (2) is determined,

representing removed word triangles t_qFuture topic z_hChinese word w_lThe frequency of (2) is determined,

representing removed word triangles t_qFuture topic z_hChinese word w_fThe frequency of the word distribution model is shown as alpha, beta and alpha, beta are the prior distribution hyperparameters of the document topics, beta is the distribution hyperparameters of the topic words except the current word triangle, and alpha and beta are model input parameters;

steps 3) -a-6) according to the conditional probability distribution P (z)_q＝h|U，Z_-q) Sampling a theme;

step 3) -a-7) updating the 'document-subject' distribution parameter, wherein the updating formula is as follows:

wherein the content of the first and second substances,

denotes the topic as z_hNumber of documents, N_BRepresenting the total number of documents in the corpus;

step 3) -a-8) updating word w in word triangle_mAt subject z_hThe following distribution parameters, according to the formula:

wherein the content of the first and second substances,

denotes the topic as z_hLower word w_mThe number of occurrences of (a) is,

denotes the topic as z_hThe occurrence frequency of the lower word w, beta is a dirichlet distribution hyper-parameter;

step 3) -a-9) updating w in word triangle_nAt subject z_hThe following distribution parameters are updated as follows:

wherein the content of the first and second substances,

denotes the topic as z_hLower word w_nThe number of occurrences of (c);

step 3) -a-10) updating w in word triangle_lAt subject z_hThe following distribution parameters are updated as follows:

wherein the content of the first and second substances,

denotes the topic as z_hLower word w_lThe number of occurrences of (c);

step 3) -a-11) making pt equal to pt +1, and judging whether pt is smaller than T: if yes, turning to the step 3) -a-4); if not, go to step 3) -a-12);

and 3) finishing the model training.

4. The method for mining short text topics based on semantic word network as claimed in claim 3, wherein: the specific steps of the original document theme inference in the step 4 comprise:

step 4) -a-2) judging whether the word pairs split in the step 4) -a-1) have at least one related semantic word triangle; if yes, turning to the step 4) -a-3); if not, go to step 4) -a-5);

step 4) -a-4) calculating semantic word triangle t by using Bayesian formula_qThe specific formula of the theme distribution is as follows:

wherein, P (z)_h) Denotes the topic as z_hProbability of (d), P (w)_m|z_h) Representing a topic z_hChinese word w_mProbability of occurrence, P (w)_n|z_h) Representing a topic z_hChinese word w_nProbability of occurrence, P (w)_l|z_h) Representing a topic z_hChinese word w_lThe probability of occurrence; p (z)_p) Denotes the topic as z_pProbability of (d), P (w)_m|z_p) Representing a topic z_pChinese word w_mProbability of occurrence, P (w)_n|z_p) Representing a topic z_pChinese word w_nProbability of occurrence, P (w)_l|z_p) Representing a topic z_pChinese word w_lThe probability of occurrence;

step 4) -a-5) to obtain the theme distribution of the document by calculation, wherein the specific formula is as follows:

wherein the content of the first and second substances,

representing the size of the triangular set of semantic words for document d;

step 4) -a-6) calculating the word w in the document d_iThe probability of occurrence is specifically as follows:

wherein n is_d(w_i) Semantic word triangle set d represented in document d_tChinese word w_iFrequency of (n)_d(w_j) Semantic word triangle set d represented in document d_tChinese language w_jFrequency of (n)_dRepresenting the number of documents d;

step 4) -a-7) calculating words w according to global topic distribution and word distribution under topics_iThe specific formula of the theme distribution is as follows:

wherein P (z) represents the probability that the topic of the document is z, and P (w)_i| z) represents the word w in the topic z_iProbability of occurrence, P (w)_j| z) represents the word w in the topic z_jThe probability of occurrence;

step 4) -a-8) obtaining the theme of the document according to the theme distribution of words in the document, wherein the specific formula is as follows: