CN110134958A - Short text topic mining method based on semantic word network - Google Patents

Short text topic mining method based on semantic word network

Info

Publication number
CN110134958A
Authority
CN
China
Prior art keywords
word
theme
triangle
semantic
distribution
Legal status
Granted
Application number
CN201910400416.5A
Other languages
Chinese (zh)
Other versions
CN110134958B (en)
Inventor
张雷
经伟
蔡洋
陆恒杨
徐鸣
王崇骏
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-08-16
Application filed by Nanjing University
Priority to CN201910400416.5A
Publication of CN110134958A
Application granted
Publication of CN110134958B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/258 — Heading extraction; Automatic titling; Numbering
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 — Semantic analysis

Abstract

The invention discloses a short text topic mining method based on a semantic word network, comprising the following steps: 1) model initialization stage: collecting an external corpus from related fields, preprocessing the corpora, setting parameters, etc.; 2) topic unit construction stage: constructing the semantic word network, finding the specific word-triangle structures, and computing the model priors; 3) model training stage: sampling the model variables with Gibbs sampling and judging whether the model has reached the convergence condition; 4) result output stage: from the sampled values of each variable after model training, obtaining the topic distribution of each word triangle and then inferring the topic distribution of the original documents. The invention combines the semantic information learned from an external corpus with the word-triangle topic structure. For short text topic mining, compared with the traditional biterm topic model, this method provides a way to incorporate external prior knowledge into a traditional topic model, and the quality of the mined topics is clearly improved.

Description

Short text topic mining method based on semantic word network
Technical field
The present invention relates to a short text topic mining method, in particular to a short text topic mining method based on a semantic word network. The method addresses the problem that general topic mining methods produce low-quality topics when short text features are sparse.
Background art
With the ever faster pace of social development and the "short and fast" user experience brought by intelligent mobile terminals, communication on the network increasingly tends toward fragmentation. Short text data therefore occupies an increasingly important position in today's exchange of network information: social network statuses, microblog messages, traditional news headlines, short video titles, Q&A websites and the like all appear in the form of short text. With the rapid rise of large companies such as Weibo, Zhihu, Facebook and Twitter, short text data is also being generated and accumulated at tremendous speed. Mining topic information from massive short text data is thus of great value: public opinion analysis, information retrieval, personalized recommendation, user interest clustering and the like are all major application directions of topic mining. On the other hand, mining the topic information of short texts with traditional text mining methods is very difficult, mainly because word co-occurrence information in short texts is extremely sparse.
At present, solutions to short text feature sparsity typically rely on word co-occurrence relations. Such solutions are based on one assumption: words that co-occur in the same short text are topically related. Two widely used models in short text topic mining are the biterm topic model and word network topic models: the former takes word pairs formed by co-occurring words as the basic topic units, while the latter builds pseudo-documents from co-occurring words to help mine the topic formed for each word. These methods all ignore the semantic relations between words. For example, "vacation" and "holiday" are two semantically very close words; the word pair they form should contribute more to a topic than an ordinary co-occurring pair, but because they rarely co-occur in the same short text, they are ignored by the usual models.
Word vectors are a way of representing words inside a computer; based on this representation, words can be fed into models directly as features, which greatly facilitates natural language processing. Compared with earlier one-hot word vectors, distributed word vectors have lower and more controllable dimensionality on the one hand, and on the other hand, being trained with neural language models on large external corpora, they carry much richer semantic information. The present invention exploits the semantic expressiveness of distributed word vectors: it measures the semantic similarity of words with word vectors and injects it into a word-triangle topic model as prior knowledge, providing a new way of thinking for short text topic mining.
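As an illustration of measuring word semantics with trained vectors, the following minimal Python sketch trains word vectors with the gensim library and computes the cosine similarity of a word pair (the toy corpus and the function name are illustrative assumptions, not part of the patent text):

    from gensim.models import Word2Vec
    import numpy as np

    # Train distributed word vectors on an external corpus given as lists of tokens.
    external_corpus = [["vacation", "holiday", "beach"], ["holiday", "travel", "flight"]]
    w2v = Word2Vec(sentences=external_corpus, vector_size=100, window=5, min_count=1, sg=1)

    def semantic_similarity(wi, wj):
        # Cosine similarity of two word vectors; None marks an out-of-vocabulary word,
        # which the method later treats as carrying no semantic information.
        if wi not in w2v.wv or wj not in w2v.wv:
            return None
        vi, vj = w2v.wv[wi], w2v.wv[wj]
        return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))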
Summary of the invention
Goal of the invention: the technical problem to be solved by the present invention is that traditional topic models, when coping with the feature sparsity of short text data by considering word co-occurrence information, introduce noise and ignore semantic information, so that the quality of the mined topics is not high enough. The present invention introduces external semantic information and fuses it with word co-occurrence information to construct a semantic word network for topic mining: first, an external corpus is collected from related fields and word vectors are trained with the word2vec model; then the target corpus is traversed and, combined with the word vector information, the semantic word network is generated, from which the specific word-triangle structures are selected; next the parameters are sampled with Gibbs sampling, iterating until convergence; finally the topic distribution of each word triangle is computed from the sampling results, and from it the topic distribution of the documents in the target corpus.
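Read as a pipeline, the four stages amount to the following skeleton (a sketch only; every function name is a placeholder for the corresponding stage, not an identifier from the patent):

    def mine_short_text_topics(external_corpus, target_corpus, K, alpha, beta, delta, T):
        w2v = train_word2vec(external_corpus)                      # stage 1: model initialization
        network = build_semantic_word_network(target_corpus, w2v, delta)
        triangles = find_semantic_word_triangles(network)          # stage 2: topic unit construction
        samples = gibbs_sample(triangles, K, alpha, beta, T)       # stage 3: model training
        return infer_document_topics(target_corpus, triangles, samples)  # stage 4: result output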
Technical solution: to achieve the above object, the technical solution adopted by the present invention is as follows:
A short text topic mining method based on a semantic word network comprises the following steps:
Step 1, model initialization stage: collect corpora from related fields to build an external corpus; preprocess the external corpus and the target corpus so that both are converted into a format accepted by the word2vec model; with the external corpus as input, train the word2vec model so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model;
Step 2, topic unit construction stage:
2)-a Generate the basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}. The specific steps are as follows:
Step 2)-a-1) Create the node set V, the edge set E and the edge attribute set R, all initially empty;
Step 2)-a-2) For each word w_i in document d_k = {w_1, w_2, ..., w_m}, if w_i does not appear in the set V, add it to V, k ∈ {1, 2, ..., n};
Step 2)-a-3) For every word pair (w_i, w_j) in document d_k: if the pair is not in the set E, add it to E and add the attribute r_ij = <S_ij, s_ij> to the set R, where S_ij = {k} denotes the set of documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; if the edge e_ij already exists in E, add the document number k to the document set S_ij of the edge attribute r_ij;
2)-b On the basis of the word co-occurrence network, incorporate semantic information to build the semantic word network. The specific steps are as follows (a sketch of steps 2)-a and 2)-b is given after step 2)-b-7)):
Step 2)-b-1) Compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus not covered by the external vocabulary, set the corresponding word vector to empty, treating them subsequently as having no semantic information;
Step 2)-b-2) Set the threshold δ;
Step 2)-b-3) For every pair of word nodes w_i and w_j in the word co-occurrence network, compute the semantic similarity between the word pair as

    sim(w_i, w_j) = (v_i · v_j) / (||v_i|| · ||v_j||)

where v_i and v_j denote the word vectors of w_i and w_j respectively;
Step 2)-b-4) Judge whether there is an edge between each pair of word nodes w_i and w_j; if so, go to step 2)-b-5); otherwise, go to step 2)-b-6);
Step 2)-b-5) Record the semantic similarity s_ij in the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the original set of co-occurrence documents of the word pair;
Step 2)-b-6) Judge whether the semantic similarity satisfies s_ij > δ; if so, go to step 2)-b-7); otherwise, do nothing for this pair of word nodes;
Step 2)-b-7) Add the edge e_ij to the edge set E, add the attribute r_ij = <S_ij, s_ij> to the edge attribute set R, and let s_ij = sim(w_i, w_j);
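A compact sketch of steps 2)-a and 2)-b, assuming documents are lists of tokens and `semantic_similarity` is the word-vector cosine above; dictionaries stand in for the sets V, E and R, and initializing the document set of a purely semantic edge as empty is an assumption of this sketch:

    from itertools import combinations

    def build_semantic_word_network(docs, semantic_similarity, delta):
        S, s = {}, {}                      # edge attributes r_ij = <S_ij, s_ij>
        vocab = set()
        for k, doc in enumerate(docs):     # step 2)-a: co-occurrence edges
            vocab.update(doc)
            for wi, wj in combinations(sorted(set(doc)), 2):
                S.setdefault((wi, wj), set()).add(k)
        for wi, wj in combinations(sorted(vocab), 2):   # step 2)-b: semantic information
            sim = semantic_similarity(wi, wj)
            if sim is None:
                continue                   # out-of-vocabulary word: no semantic information
            if (wi, wj) in S:
                s[(wi, wj)] = sim          # step 2)-b-5: annotate an existing co-occurrence edge
            elif sim > delta:
                S[(wi, wj)] = set()        # step 2)-b-7: purely semantic edge (assumed empty S_ij)
                s[(wi, wj)] = sim
        return S, s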
Step 2)-c For each word w_i in the semantic word network, compute the inverse document frequency

    idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
Step 2)-d Find, in the semantic word network, the semantic word triangles satisfying the following condition:
pairwise edge connections exist between the three word nodes of the semantic word triangle, and the connections come from different parts of the document sub-networks;
Step 3, model training stage: for all semantic word triangles obtained in step 2, randomly initialize the topic assignment of each semantic word triangle; obtain the current topic assignment of the semantic word triangles by Gibbs sampling, and from it compute the update parameters of the document-topic distribution and the topic-word distribution; iterate until the maximum number of iterations is reached or the Gibbs sampling converges, and take the final Gibbs sampling result as the word triangle topic distribution;
Step 4, result output stage: from the semantic word triangle topic distributions obtained in step 3, infer the topic distribution of the original documents.
Further, the specific steps of finding semantic word triangles in step 2)-d include:
Step 2)-d-1) For any three words w_i, w_j, w_k in the set V, judge whether the three pairwise edges exist between the nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if so, go to step 2)-d-2);
Step 2)-d-2) Judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if so, go to step 2)-d-3);
Step 2)-d-3) Compute the word triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are computed as described above;
Step 2)-d-4) Generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk). A sketch of this search follows.
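A brute-force sketch of this search over the `S` dictionary above; `prior_knowledge` stands in for the computation of l_ijk, whose exact formula the patent gives via γ_ij, γ_ik, γ_jk:

    from itertools import combinations

    def find_semantic_word_triangles(S, prior_knowledge):
        words = sorted({w for e in S for w in e})
        def edge(a, b):
            return (a, b) if (a, b) in S else ((b, a) if (b, a) in S else None)
        triangles = []
        for wi, wj, wk in combinations(words, 3):
            eij, eik, ejk = edge(wi, wj), edge(wi, wk), edge(wj, wk)
            if not (eij and eik and ejk):
                continue                   # step 2)-d-1: pairwise edges must exist
            if S[eij] == S[eik] or S[eik] == S[ejk] or S[eij] == S[ejk]:
                continue                   # step 2)-d-2: document sets must all differ
            triangles.append((wi, wj, wk, prior_knowledge(wi, wj, wk)))
        return triangles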
Further, in step 3 the detailed procedure of the Gibbs sampling is as follows:
Step 3)-a-1) Initialize the sampling platform: using machine learning methods, build a program that samples from a conditional probability distribution, for use by the SWTTM model;
Step 3)-a-2) Randomly initialize one topic for each semantic word triangle;
Step 3)-a-3) Choose a suitable number of iterations T and initialize t = 0;
Step 3)-a-4) Judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-5) Randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and compute from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant set to prevent the β values from becoming too small;
Step 3)-a-6) Compute the topic distribution of the model after removing the word triangle t_q; with the definitions below, the conditional probability takes the collapsed-Gibbs form

    P(z_q = k | T, Z_-q) ∝ (n_{z_k}^{-q} + α) · Π_{w ∈ {w_m, w_n, w_l}} (n_{w|z_k}^{-q} + β_w) / (n_{·|z_k}^{-q} + V·β)

where k denotes a topic index, K the total number of topics, V the total number of words in the corpus, z_q the topic of word triangle t_q, T the set of all semantic word triangles, Z_-q the topic assignment after removing t_q, P(z_q = k | T, Z_-q) the probability that t_q has topic k given the topic assignment of all word triangles except t_q, n_{z_k}^{-q} the number of word triangles belonging to topic z_k after removing t_q, n_{w_m|z_k}^{-q} the frequency of word w_m in topic z_k after removing t_q, α the prior hyperparameter of the document-topic distribution, and β the topic-word distribution hyperparameter excluding the current word triangle; α and β are model input parameters;
Step 3)-a-7) Sample one topic according to the conditional probability distribution P(z_q = k | T, Z_-q);
Step 3)-a-8) Update the "document-topic" distribution parameter as

    θ_k = (n_{z_k} + α) / (N_B + K·α)

where n_{z_k} denotes the number of documents whose topic is z_k and N_B denotes the total number of documents in the corpus;
Step 3)-a-9) Update the distribution parameter of word w_m of the word triangle under topic z_k as

    φ_{z_k, w_m} = (n_{w_m|z_k} + β) / (n_{·|z_k} + V·β)

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k, n_{·|z_k} the total word count under topic z_k, and β the Dirichlet distribution hyperparameter;
Step 3)-a-10) Update the distribution parameter of w_n under topic z_k analogously, where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k;
Step 3)-a-11) Update the distribution parameter of w_l under topic z_k analogously, where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k;
Step 3)-a-12) Let t = t + 1 and judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-13) Model training ends. A sketch of this sampling loop follows.
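The loop can be sketched as a plain collapsed Gibbs sampler under the reconstruction above; using a single symmetric beta where the patent derives per-word β_m, β_n, β_l from the triangle prior is a simplifying assumption of this sketch:

    import random

    def gibbs_sample(triangles, K, V, alpha, beta, T):
        z = [random.randrange(K) for _ in triangles]   # step 3)-a-2: random initial topics
        n_k = [0] * K                                  # word triangles per topic
        n_wk = [{} for _ in range(K)]                  # per-topic word counts
        for q, (wm, wn, wl, _) in enumerate(triangles):
            n_k[z[q]] += 1
            for w in (wm, wn, wl):
                n_wk[z[q]][w] = n_wk[z[q]].get(w, 0) + 1
        for _ in range(T):
            for q, (wm, wn, wl, _) in enumerate(triangles):
                n_k[z[q]] -= 1                         # remove t_q from the counts
                for w in (wm, wn, wl):
                    n_wk[z[q]][w] -= 1
                weights = []
                for k in range(K):                     # step 3)-a-6: conditional P(z_q = k | T, Z_-q)
                    total_k = sum(n_wk[k].values())
                    p = n_k[k] + alpha
                    for w in (wm, wn, wl):
                        p *= (n_wk[k].get(w, 0) + beta) / (total_k + V * beta)
                    weights.append(p)
                z[q] = random.choices(range(K), weights=weights)[0]  # step 3)-a-7
                n_k[z[q]] += 1
                for w in (wm, wn, wl):
                    n_wk[z[q]][w] = n_wk[z[q]].get(w, 0) + 1
        return z, n_k, n_wk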
Further, the specific steps of the document topic inference in step 4 include:
Step 4)-a-1) For each original document of the target corpus, split it into its set of word pairs;
Step 4)-a-2) Judge whether these word pairs have at least one associated semantic word triangle; if so, go to step 4)-a-3); if not, go to step 4)-a-6);
Step 4)-a-3) Compute the frequency of semantic word triangle t_q in document d as

    P(t_q | d) = n_d(t_q) / Σ_{t ∈ d_t} n_d(t)

where n_d(t_q) denotes the frequency of semantic word triangle t_q in the semantic word triangle set d_t of document d;
Step 4)-a-4) Compute the topic distribution of semantic word triangle t_q with the Bayes formula, up to normalization:

    P(z_k | t_q) ∝ P(z_k) · P(w_m | z_k) · P(w_n | z_k) · P(w_l | z_k);

Step 4)-a-5) Compute the topic distribution of the document as

    P(z_k | d) = Σ_{t_q ∈ d_t} P(z_k | t_q) · P(t_q | d)

where |d_t| denotes the size of the semantic word triangle set of document d;
Step 4)-a-6) Compute the probability P(w_i | d) of word w_i appearing in document d;
Step 4)-a-7) Compute the topic distribution of word w_i from the global topic distribution and the per-topic word distributions, up to normalization:

    P(z_k | w_i) ∝ P(z_k) · P(w_i | z_k);

Step 4)-a-8) Obtain the topic of the document from the topic distributions of the words in the document:

    P(z_k | d) = Σ_{w_i ∈ d} P(z_k | w_i) · P(w_i | d).

A sketch of this inference follows.
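A sketch of this inference, assuming `theta[k] = P(z_k)` and `phi[k][w] = P(w | z_k)` come from the trained model, and treating P(t_q | d) as uniform over the matched triangles for brevity:

    from itertools import combinations

    def infer_document_topics(doc, triangles, theta, phi, K):
        pairs = {frozenset(p) for p in combinations(set(doc), 2)}
        matched = [t for t in triangles
                   if any(frozenset(e) in pairs for e in combinations(t[:3], 2))]
        p_d = [0.0] * K
        if matched:                                    # steps 4)-a-3 to 4)-a-5
            for wm, wn, wl, _ in matched:
                w_k = [theta[k] * phi[k].get(wm, 0.0) * phi[k].get(wn, 0.0) * phi[k].get(wl, 0.0)
                       for k in range(K)]              # Bayes: P(z_k | t_q) up to normalization
                norm = sum(w_k) or 1.0
                for k in range(K):
                    p_d[k] += w_k[k] / norm / len(matched)
        else:                                          # steps 4)-a-6 to 4)-a-8: word-level fallback
            for w in doc:
                w_k = [theta[k] * phi[k].get(w, 0.0) for k in range(K)]
                norm = sum(w_k) or 1.0
                for k in range(K):
                    p_d[k] += w_k[k] / norm / len(doc)
        return p_d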
Compared with the prior art, the present invention has the following advantages:
For the problem that the topic-word quality of traditional biterm topic models is dominated by high-frequency words, the present invention assumes that words appearing in most documents have weak topic-characterizing ability, and on this assumption introduces the IDF index, together with semantic similarity, as the prior knowledge of the word distribution, mitigating the influence of high-frequency words on topic quality. For the problem that ordinary word co-occurrence networks ignore word pairs that are semantically close but rarely co-occur, the invention proposes a novel construction method for semantic word networks, enabling the topic model to attend more comprehensively to the topical connections between words; the quality of the mined topics is also clearly improved.
Brief description of the drawings
Fig. 1 is the flow chart of the short text topic mining method based on the semantic word network;
Fig. 2 is the flow chart of semantic word network construction and semantic word triangle search;
Fig. 3 is the probabilistic graphical model of the SWTTM algorithm.
Specific embodiment
The present invention is further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples merely illustrate the invention and do not limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.
Fig. 1 is the flow chart of the short text topic mining method based on the semantic word network implemented by the present invention. The specific steps are described as follows:
Step 0 is the initial state of the invention;
In the model initialization stage (steps 1-3):
Step 1 collects an external corpus from related fields; there is no requirement on text length;
Step 2 preprocesses the external corpus and the target corpus, including word segmentation and filtering; the corpora are segmented mainly so that the next step can run the algorithm word by word. The specific steps are as follows (a sketch follows step 2-3):
Step 2-1) Segment the two corpora separately and remove stop words at the same time;
Step 2-2) Delete tokens that are neither Chinese characters nor Latin letters, and lowercase all Latin letters;
Step 2-3) Delete words whose frequency in the corpus is less than 5, and delete documents containing fewer than 3 words;
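A minimal sketch of this preprocessing, assuming the jieba segmenter and a caller-supplied stop word list; the thresholds follow steps 2-2 and 2-3:

    import re
    from collections import Counter
    import jieba

    def preprocess(raw_docs, stopwords):
        token_re = re.compile(r"^[\u4e00-\u9fff]+$|^[A-Za-z]+$")   # Chinese or Latin tokens only
        docs = [[w.lower() for w in jieba.lcut(text)
                 if token_re.match(w) and w.lower() not in stopwords]
                for text in raw_docs]
        freq = Counter(w for d in docs for w in d)
        docs = [[w for w in d if freq[w] >= 5] for d in docs]      # drop words with frequency < 5
        return [d for d in docs if len(d) >= 3]                    # drop documents with < 3 words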
Step 3 sets the word2vec model parameters and, with the external corpus as input, trains the model to obtain the word vector data;
In the topic unit construction stage (steps 4-6):
Step 4 constructs the basic co-occurrence word network based on the corpus D = {d_1, d_2, ..., d_n};
Step 5 incorporates semantic information on the basis of the word co-occurrence network to construct the semantic word network;
Step 6 finds the word-triangle structures satisfying the condition in the semantic word network and computes the inverse document frequency of each word;
the condition the word-triangle structures satisfy is: pairwise edge connections exist between the three word nodes, and the connections come from different parts of the document sub-networks.
In the model training stage (steps 7-8):
Step 7 samples the model variables with Gibbs sampling and performs model training on the sample data obtained from step 1 and step 2. The specific implementation process is as follows:
Step 7-1 initializes the sampling platform: using machine learning methods, a program that samples from the conditional probability distribution is built for use by the SWTTM model; the algorithm flow of the SWTTM model is shown in Fig. 3;
Step 7-2 randomly initializes one topic for each semantic word triangle.
Step 7-3 chooses a suitable number of iterations T and initializes t = 0;
Step 7-4 judges whether t < T: if yes, go to step 7-5; if no, go to step 7-13;
Step 7-5 randomly selects a word triangle t_q = (w_m, w_n, w_l, l_mnl) and computes from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant, set according to the word-vector sampling situation and manual evaluation, that prevents the β values from becoming too small.
Step 7-6 computes the topic distribution of the model after removing the word triangle t_q, with the formula of step 3)-a-6);
Step 7-7 samples one topic according to the conditional probability distribution;
Step 7-8 updates the "document-topic" distribution parameter with the formula of step 3)-a-8), where n_{z_k} denotes the number of documents whose topic is z_k, N_B the total number of documents in the corpus, and K the total number of topics.
Steps 7-9, 7-10 and 7-11 update the distribution parameters of the words w_m, w_n and w_l of the word triangle under topic z_k with the formulas of steps 3)-a-9) to 3)-a-11), where n_{w|z_k} denotes the number of occurrences of the respective word under topic z_k.
Step 7-12 lets t = t + 1 and judges whether t < T: if yes, go to step 7-5; if no, go to step 7-13;
Step 7-13: model training ends.
Step 8 takes the Gibbs sampling result as the semantic word triangle topic distribution;
In the result output stage (steps 9-10):
Step 9 first splits the original documents into word pairs;
Step 10 finds the associated semantic word triangles from the word pairs and computes the topic distribution of the original documents. The specific method is as follows:
Step 10-1 judges whether these word pairs have at least one associated semantic word triangle; if yes, go to step 10-2; if no, go to step 10-5;
Step 10-2 computes the frequency of semantic word triangle t_q in document d with the formula of step 4)-a-3), where n_d(t_q) denotes the frequency of t_q in the semantic word triangle set d_t of document d.
Step 10-3 computes the topic distribution of semantic word triangle t_q with the Bayes formula;
Step 10-4 computes the topic distribution of the document, where |d_t| denotes the size of the semantic word triangle set of document d.
Step 10-5 computes the probability of word w_i appearing in document d;
Step 10-6 computes the topic distribution of word w_i from the global topic distribution and the per-topic word distributions;
Step 10-7 obtains the topic of the document from the topic distributions of the words in the document;
Step 10-8: end.
Step 11 is the end state.
Fig. 2 is the detailed description of steps 4 and 5 in Fig. 1.
Step 12 is the initial state.
Step 13 establishes the basic co-occurrence word network. The specific method is as follows:
Step 4-1 initializes the node set V, the edge set E and the edge attribute set R;
Step 4-2: for each word w_i in document d_k = {w_1, w_2, ..., w_m}, if the word does not appear in the set V, add it to V;
Step 4-3: for every word pair (w_i, w_j) in document d_k, if the pair is not in the set E, add it to E, add the attribute r_ij = <S_ij, s_ij> to the set R and let S_ij = {k}; if the edge already exists in E, add the document number k to S_ij in R.
Step 14 obtains the word vector data of the target corpus from the training result of step 3 and sets the semantic-similarity threshold δ;
Step 15 computes, for every pair of word nodes in the basic co-occurrence word network, the semantic similarity between the word pair with the cosine formula of step 2)-b-3), where v_i denotes the word vector of w_i.
Step 16 judges whether there is an edge between the nodes; if yes, go to step 17; if no, go to step 18;
Step 17 records the semantic similarity information in the edge attribute, i.e. s_ij = sim(w_i, w_j);
Step 18 judges whether the semantic similarity satisfies sim(w_i, w_j) > δ; if yes, go to step 19;
Step 19 adds the attribute r_ij = <S_ij, s_ij> to the set R and lets s_ij = sim(w_i, w_j);
Step 20 computes the inverse document frequency for each word in the semantic word network with the formula of step 2)-c, where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i.
Step 21: for any three words w_i, w_j, w_k ∈ V, judge whether edges exist pairwise between the nodes, i.e. e_ij, e_jk, e_ik ∈ E, and judge whether the document sets in the edge attributes all differ, i.e. S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk; if so, go to step 22;
Step 22 computes the word triangle prior knowledge l_ijk;
Step 23 generates the semantic word triangle t = (w_i, w_j, w_k, l_ijk);
Step 24 is the end state.
For the problem that traditional biterm topic models treat word pairs of different importance equally, the present invention assumes that semantically closer words are more likely to belong to the same topic, and introduces externally trained word embeddings to measure the semantic relations of words. With this prior knowledge on the topic-word distribution, the model pays more attention to word pairs with higher semantic similarity. For the problem that the topic-word quality of traditional biterm topic models is dominated by high-frequency words, the invention assumes that words appearing in most documents characterize topics weakly, and introduces the IDF index, together with semantic similarity, as the prior knowledge of the word distribution. The invention also proposes a novel construction method for semantic word networks, enabling the word network to attend more comprehensively to the topical connections between words, and on the basis of this network proposes a basic unit with tighter topical connection, the semantic word-triangle structure; using it as the unit of topic mining yields higher topic quality.
In conclusion a kind of short text Topics Crawling method based on semantic word network of the invention believes external semantic Breath, context word frequency information and word three-legged structure comprehensively consider, and cope with its feature sparsity when excavating for short text topic model The problem of provide a kind of new resolving ideas.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A short text topic mining method based on a semantic word network, characterized by comprising the following steps:
Step 1, model initialization stage: collect corpora from related fields to build an external corpus; preprocess the external corpus and the target corpus so that both are converted into a format accepted by the word2vec model; with the external corpus as input, train the word2vec model so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model;
Step 2, topic unit construction stage:
2)-a Generate the basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}. The specific steps are as follows:
Step 2)-a-1) Create the node set V, the edge set E and the edge attribute set R, all initially empty;
Step 2)-a-2) For each word w_i in document d_k = {w_1, w_2, ..., w_m}, if w_i does not appear in the set V, add it to V, k ∈ {1, 2, ..., n};
Step 2)-a-3) For every word pair (w_i, w_j) in document d_k: if the pair is not in the set E, add it to E and add the attribute r_ij = <S_ij, s_ij> to the set R, where S_ij = {k} denotes the set of documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; if the edge e_ij already exists in E, add the document number k to the document set S_ij of the edge attribute r_ij;
2)-b On the basis of the word co-occurrence network, incorporate semantic information to build the semantic word network. The specific steps are as follows:
Step 2)-b-1) Compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus not covered by the external vocabulary, set the corresponding word vector to empty, treating them subsequently as having no semantic information;
Step 2)-b-2) Set the threshold δ;
Step 2)-b-3) For every pair of word nodes w_i and w_j in the word co-occurrence network, compute the semantic similarity between the word pair as

    sim(w_i, w_j) = (v_i · v_j) / (||v_i|| · ||v_j||)

where v_i and v_j denote the word vectors of w_i and w_j respectively;
Step 2)-b-4) Judge whether there is an edge between each pair of word nodes w_i and w_j; if so, go to step 2)-b-5); otherwise, go to step 2)-b-6);
Step 2)-b-5) Record the semantic similarity s_ij in the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the original set of co-occurrence documents of the word pair;
Step 2)-b-6) Judge whether the semantic similarity satisfies s_ij > δ; if so, go to step 2)-b-7); otherwise, do nothing for this pair of word nodes;
Step 2)-b-7) Add the edge e_ij to the edge set E, add the attribute r_ij = <S_ij, s_ij> to the edge attribute set R, and let s_ij = sim(w_i, w_j);
Step 2)-c For each word w_i in the semantic word network, compute the inverse document frequency

    idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
Step 2)-d Find, in the semantic word network, the semantic word triangles satisfying the following condition:
pairwise edge connections exist between the three word nodes of the semantic word triangle, and the connections come from different parts of the document sub-networks;
Step 3, model training stage: for all semantic word triangles obtained in step 2, randomly initialize the topic assignment of each semantic word triangle; obtain the current topic assignment of the semantic word triangles by Gibbs sampling, and from it compute the update parameters of the document-topic distribution and the topic-word distribution; iterate until the maximum number of iterations is reached or the Gibbs sampling converges, and take the final Gibbs sampling result as the word triangle topic distribution;
Step 4, result output stage: from the semantic word triangle topic distributions obtained in step 3, infer the topic distribution of the original documents.
2. The short text topic mining method based on a semantic word network according to claim 1, characterized in that the specific steps of finding semantic word triangles in step 2)-d include:
Step 2)-d-1) For any three words w_i, w_j, w_k in the set V, judge whether the three pairwise edges exist between the nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if so, go to step 2)-d-2);
Step 2)-d-2) Judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if so, go to step 2)-d-3);
Step 2)-d-3) Compute the word triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are computed as described above;
Step 2)-d-4) Generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk).
3. The short text topic mining method based on a semantic word network according to claim 2, characterized in that the detailed procedure of the Gibbs sampling in step 3 is as follows:
Step 3)-a-1) Initialize the sampling platform: using machine learning methods, build a program that samples from a conditional probability distribution, for use by the SWTTM model;
Step 3)-a-2) Randomly initialize one topic for each semantic word triangle;
Step 3)-a-3) Choose a suitable number of iterations T and initialize t = 0;
Step 3)-a-4) Judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-5) Randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and compute from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant set to prevent the β values from becoming too small;
Step 3)-a-6) Compute the topic distribution of the model after removing the word triangle t_q; with the definitions below, the conditional probability takes the collapsed-Gibbs form

    P(z_q = k | T, Z_-q) ∝ (n_{z_k}^{-q} + α) · Π_{w ∈ {w_m, w_n, w_l}} (n_{w|z_k}^{-q} + β_w) / (n_{·|z_k}^{-q} + V·β)

where k denotes a topic index, K the total number of topics, V the total number of words in the corpus, z_q the topic of word triangle t_q, T the set of all semantic word triangles, Z_-q the topic assignment after removing t_q, P(z_q = k | T, Z_-q) the probability that t_q has topic k given the topic assignment of all word triangles except t_q, n_{z_k}^{-q} the number of word triangles belonging to topic z_k after removing t_q, n_{w_m|z_k}^{-q} the frequency of word w_m in topic z_k after removing t_q, α the prior hyperparameter of the document-topic distribution, and β the topic-word distribution hyperparameter excluding the current word triangle; α and β are model input parameters;
Step 3)-a-7) Sample one topic according to the conditional probability distribution P(z_q = k | T, Z_-q);
Step 3)-a-8) Update the "document-topic" distribution parameter as

    θ_k = (n_{z_k} + α) / (N_B + K·α)

where n_{z_k} denotes the number of documents whose topic is z_k and N_B denotes the total number of documents in the corpus;
Step 3)-a-9) Update the distribution parameter of word w_m of the word triangle under topic z_k as

    φ_{z_k, w_m} = (n_{w_m|z_k} + β) / (n_{·|z_k} + V·β)

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k, n_{·|z_k} the total word count under topic z_k, and β the Dirichlet distribution hyperparameter;
Step 3)-a-10) Update the distribution parameter of w_n under topic z_k analogously, where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k;
Step 3)-a-11) Update the distribution parameter of w_l under topic z_k analogously, where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k;
Step 3)-a-12) Let t = t + 1 and judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-13) Model training ends.
4. The short text topic mining method based on a semantic word network according to claim 3, characterized in that the specific steps of the document topic inference in step 4 include:
Step 4)-a-1) For each original document of the target corpus, split it into its set of word pairs;
Step 4)-a-2) Judge whether these word pairs have at least one associated semantic word triangle; if so, go to step 4)-a-3); if not, go to step 4)-a-6);
Step 4)-a-3) Compute the frequency of semantic word triangle t_q in document d as

    P(t_q | d) = n_d(t_q) / Σ_{t ∈ d_t} n_d(t)

where n_d(t_q) denotes the frequency of semantic word triangle t_q in the semantic word triangle set d_t of document d;
Step 4)-a-4) Compute the topic distribution of semantic word triangle t_q with the Bayes formula, up to normalization:

    P(z_k | t_q) ∝ P(z_k) · P(w_m | z_k) · P(w_n | z_k) · P(w_l | z_k);

Step 4)-a-5) Compute the topic distribution of the document as

    P(z_k | d) = Σ_{t_q ∈ d_t} P(z_k | t_q) · P(t_q | d)

where |d_t| denotes the size of the semantic word triangle set of document d;
Step 4)-a-6) Compute the probability P(w_i | d) of word w_i appearing in document d;
Step 4)-a-7) Compute the topic distribution of word w_i from the global topic distribution and the per-topic word distributions, up to normalization:

    P(z_k | w_i) ∝ P(z_k) · P(w_i | z_k);

Step 4)-a-8) Obtain the topic of the document from the topic distributions of the words in the document:

    P(z_k | d) = Σ_{w_i ∈ d} P(z_k | w_i) · P(w_i | d).
CN201910400416.5A 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network Active CN110134958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910400416.5A CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network


Publications (2)

Publication Number Publication Date
CN110134958A 2019-08-16
CN110134958B CN110134958B (en) 2021-05-18

Family

ID=67574004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910400416.5A Active CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network

Country Status (1)

Country Link
CN (1) CN110134958B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154148A1 (en) * 2013-12-02 2015-06-04 Qbase, LLC Method of automated discovery of new topics
CN105608192A (en) * 2015-12-23 2016-05-25 南京大学 Short text recommendation method for user-based biterm topic model
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李思宇 et al.: "Biterm topic model based on biterm semantic extension" (基于双词语义扩展的Biterm主题模型), 《计算机工程》 (Computer Engineering) *
蔡洋: "A short text topic model algorithm based on word triangles" (基于词三角的短文本主题模型算法), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061866A (en) * 2019-08-20 2020-04-24 河北工程大学 Bullet screen text clustering method based on feature extension and T-oBTM
CN111061866B (en) * 2019-08-20 2024-01-02 河北工程大学 Barrage text clustering method based on feature expansion and T-oBTM
CN111339289A (en) * 2020-03-06 2020-06-26 西安工程大学 Topic model inference method based on commodity comments
CN111339289B (en) * 2020-03-06 2022-10-28 西安工程大学 Topic model inference method based on commodity comments
CN111723563A (en) * 2020-05-11 2020-09-29 华南理工大学 Topic modeling method based on word co-occurrence network
CN111723563B (en) * 2020-05-11 2023-09-26 华南理工大学 Topic modeling method based on word co-occurrence network
CN112183108A (en) * 2020-09-07 2021-01-05 哈尔滨工业大学(深圳) Inference method, system, computer equipment and storage medium for short text topic distribution
CN112487185A (en) * 2020-11-27 2021-03-12 国家电网有限公司客户服务中心 Data classification method in power customer field
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model

Also Published As

Publication number Publication date
CN110134958B (en) 2021-05-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant