CN110134958A - Short text topic mining method based on semantic word network - Google Patents

Short text topic mining method based on semantic word network

Info

Publication number
CN110134958A
Authority
CN
China
Prior art keywords
word
theme
triangle
semantic
distribution
Legal status
Granted
Application number
CN201910400416.5A
Other languages
Chinese (zh)
Other versions
CN110134958B (en)
Inventor
张雷
经伟
蔡洋
陆恒杨
徐鸣
王崇骏
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-08-16
Application filed by Nanjing University
Priority to CN201910400416.5A
Publication of CN110134958A
Application granted
Publication of CN110134958B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/258 — Heading extraction; Automatic titling; Numbering
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 — Semantic analysis

Abstract

The invention discloses a short text topic mining method based on a semantic word network, comprising the following steps: 1) model initialization stage: collecting an external corpus from related fields, preprocessing the corpora, setting parameters, etc.; 2) topic unit construction stage: constructing the semantic word network, finding the specific word-triangle structures, and computing the model priors; 3) model training stage: sampling the model variables with Gibbs sampling and judging whether the model has reached the convergence condition; 4) result output stage: from the sampled values of each variable after model training, obtaining the topic distribution of each word triangle and then inferring the topic distribution of the original documents. The invention combines the semantic information learned from an external corpus with the word-triangle topic structure. For short text topic mining, compared with the traditional biterm topic model, this method provides a way to incorporate external prior knowledge into a traditional topic model, and the quality of the mined topics is clearly improved.

Description

Short text topic mining method based on semantic word network
Technical field
The present invention relates to a short text topic mining method, in particular to a short text topic mining method based on a semantic word network. The method addresses the problem that general topic mining methods produce low-quality topics when short text features are sparse.
Background art
With the ever faster pace of social development and the "short and fast" user experience brought by intelligent mobile terminals, communication on the network increasingly tends toward fragmentation. Short text data therefore occupies an increasingly important position in today's exchange of network information: social network statuses, microblog messages, traditional news headlines, short video titles, Q&A websites and the like all appear in the form of short text. With the rapid rise of large companies such as Weibo, Zhihu, Facebook and Twitter, short text data is also being generated and accumulated at tremendous speed. Mining topic information from massive short text data is thus of great value: public opinion analysis, information retrieval, personalized recommendation, user interest clustering and the like are all major application directions of topic mining. On the other hand, mining the topic information of short texts with traditional text mining methods is very difficult, mainly because word co-occurrence information in short texts is extremely sparse.
At present, solutions to short text feature sparsity typically rely on word co-occurrence relations. Such solutions are based on one assumption: words that co-occur in the same short text are topically related. Two widely used models in short text topic mining are the biterm topic model and word network topic models: the former takes word pairs formed by co-occurring words as the basic topic units, while the latter builds pseudo-documents from co-occurring words to help mine the topic formed for each word. These methods all ignore the semantic relations between words. For example, "vacation" and "holiday" are two semantically very close words; the word pair they form should contribute more to a topic than an ordinary co-occurring pair, but because they rarely co-occur in the same short text, they are ignored by the usual models.
Word vectors are a way of representing words inside a computer; based on this representation, words can be fed into models directly as features, which greatly facilitates natural language processing. Compared with earlier one-hot word vectors, distributed word vectors have lower and more controllable dimensionality on the one hand, and on the other hand, being trained with neural language models on large external corpora, they carry much richer semantic information. The present invention exploits the semantic expressiveness of distributed word vectors: it measures the semantic similarity of words with word vectors and injects it into a word-triangle topic model as prior knowledge, providing a new way of thinking for short text topic mining.
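As an illustration of measuring word semantics with trained vectors, the following minimal Python sketch trains word vectors with the gensim library and computes the cosine similarity of a word pair (the toy corpus and the function name are illustrative assumptions, not part of the patent text):

    from gensim.models import Word2Vec
    import numpy as np

    # Train distributed word vectors on an external corpus given as lists of tokens.
    external_corpus = [["vacation", "holiday", "beach"], ["holiday", "travel", "flight"]]
    w2v = Word2Vec(sentences=external_corpus, vector_size=100, window=5, min_count=1, sg=1)

    def semantic_similarity(wi, wj):
        # Cosine similarity of two word vectors; None marks an out-of-vocabulary word,
        # which the method later treats as carrying no semantic information.
        if wi not in w2v.wv or wj not in w2v.wv:
            return None
        vi, vj = w2v.wv[wi], w2v.wv[wj]
        return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))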
Summary of the invention
Goal of the invention: the technical problem to be solved by the present invention is that traditional topic models, when coping with the feature sparsity of short text data by considering word co-occurrence information, introduce noise and ignore semantic information, so that the quality of the mined topics is not high enough. The present invention introduces external semantic information and fuses it with word co-occurrence information to construct a semantic word network for topic mining: first, an external corpus is collected from related fields and word vectors are trained with the word2vec model; then the target corpus is traversed and, combined with the word vector information, the semantic word network is generated, from which the specific word-triangle structures are selected; next the parameters are sampled with Gibbs sampling, iterating until convergence; finally the topic distribution of each word triangle is computed from the sampling results, and from it the topic distribution of the documents in the target corpus.
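Read as a pipeline, the four stages amount to the following skeleton (a sketch only; every function name is a placeholder for the corresponding stage, not an identifier from the patent):

    def mine_short_text_topics(external_corpus, target_corpus, K, alpha, beta, delta, T):
        w2v = train_word2vec(external_corpus)                      # stage 1: model initialization
        network = build_semantic_word_network(target_corpus, w2v, delta)
        triangles = find_semantic_word_triangles(network)          # stage 2: topic unit construction
        samples = gibbs_sample(triangles, K, alpha, beta, T)       # stage 3: model training
        return infer_document_topics(target_corpus, triangles, samples)  # stage 4: result output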
Technical solution: to achieve the above object, the technical solution adopted by the present invention is as follows:
A short text topic mining method based on a semantic word network comprises the following steps:
Step 1, model initialization stage: collect corpora from related fields to build an external corpus; preprocess the external corpus and the target corpus so that both are converted into a format accepted by the word2vec model; with the external corpus as input, train the word2vec model so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model;
Step 2, topic unit construction stage:
2)-a Generate the basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}. The specific steps are as follows:
Step 2)-a-1) Create the node set V, the edge set E and the edge attribute set R, all initially empty;
Step 2)-a-2) For each word w_i in document d_k = {w_1, w_2, ..., w_m}, if w_i does not appear in the set V, add it to V, k ∈ {1, 2, ..., n};
Step 2)-a-3) For every word pair (w_i, w_j) in document d_k: if the pair is not in the set E, add it to E and add the attribute r_ij = <S_ij, s_ij> to the set R, where S_ij = {k} denotes the set of documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; if the edge e_ij already exists in E, add the document number k to the document set S_ij of the edge attribute r_ij;
2)-b On the basis of the word co-occurrence network, incorporate semantic information to build the semantic word network. The specific steps are as follows (a sketch of steps 2)-a and 2)-b is given after step 2)-b-7)):
Step 2)-b-1) Compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus not covered by the external vocabulary, set the corresponding word vector to empty, treating them subsequently as having no semantic information;
Step 2)-b-2) Set the threshold δ;
Step 2)-b-3) For every pair of word nodes w_i and w_j in the word co-occurrence network, compute the semantic similarity between the word pair as

    sim(w_i, w_j) = (v_i · v_j) / (||v_i|| · ||v_j||)

where v_i and v_j denote the word vectors of w_i and w_j respectively;
Step 2)-b-4) Judge whether there is an edge between each pair of word nodes w_i and w_j; if so, go to step 2)-b-5); otherwise, go to step 2)-b-6);
Step 2)-b-5) Record the semantic similarity s_ij in the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the original set of co-occurrence documents of the word pair;
Step 2)-b-6) Judge whether the semantic similarity satisfies s_ij > δ; if so, go to step 2)-b-7); otherwise, do nothing for this pair of word nodes;
Step 2)-b-7) Add the edge e_ij to the edge set E, add the attribute r_ij = <S_ij, s_ij> to the edge attribute set R, and let s_ij = sim(w_i, w_j);
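A compact sketch of steps 2)-a and 2)-b, assuming documents are lists of tokens and `semantic_similarity` is the word-vector cosine above; dictionaries stand in for the sets V, E and R, and initializing the document set of a purely semantic edge as empty is an assumption of this sketch:

    from itertools import combinations

    def build_semantic_word_network(docs, semantic_similarity, delta):
        S, s = {}, {}                      # edge attributes r_ij = <S_ij, s_ij>
        vocab = set()
        for k, doc in enumerate(docs):     # step 2)-a: co-occurrence edges
            vocab.update(doc)
            for wi, wj in combinations(sorted(set(doc)), 2):
                S.setdefault((wi, wj), set()).add(k)
        for wi, wj in combinations(sorted(vocab), 2):   # step 2)-b: semantic information
            sim = semantic_similarity(wi, wj)
            if sim is None:
                continue                   # out-of-vocabulary word: no semantic information
            if (wi, wj) in S:
                s[(wi, wj)] = sim          # step 2)-b-5: annotate an existing co-occurrence edge
            elif sim > delta:
                S[(wi, wj)] = set()        # step 2)-b-7: purely semantic edge (assumed empty S_ij)
                s[(wi, wj)] = sim
        return S, s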
Step 2)-c For each word w_i in the semantic word network, compute the inverse document frequency

    idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
Step 2)-d Find, in the semantic word network, the semantic word triangles satisfying the following condition:
pairwise edge connections exist between the three word nodes of the semantic word triangle, and the connections come from different parts of the document sub-networks;
Step 3, model training stage: for all semantic word triangles obtained in step 2, randomly initialize the topic assignment of each semantic word triangle; obtain the current topic assignment of the semantic word triangles by Gibbs sampling, and from it compute the update parameters of the document-topic distribution and the topic-word distribution; iterate until the maximum number of iterations is reached or the Gibbs sampling converges, and take the final Gibbs sampling result as the word triangle topic distribution;
Step 4, result output stage: from the semantic word triangle topic distributions obtained in step 3, infer the topic distribution of the original documents.
Further, the specific steps of finding semantic word triangles in step 2)-d include:
Step 2)-d-1) For any three words w_i, w_j, w_k in the set V, judge whether the three pairwise edges exist between the nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if so, go to step 2)-d-2);
Step 2)-d-2) Judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if so, go to step 2)-d-3);
Step 2)-d-3) Compute the word triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are computed as described above;
Step 2)-d-4) Generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk). A sketch of this search follows.
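A brute-force sketch of this search over the `S` dictionary above; `prior_knowledge` stands in for the computation of l_ijk, whose exact formula the patent gives via γ_ij, γ_ik, γ_jk:

    from itertools import combinations

    def find_semantic_word_triangles(S, prior_knowledge):
        words = sorted({w for e in S for w in e})
        def edge(a, b):
            return (a, b) if (a, b) in S else ((b, a) if (b, a) in S else None)
        triangles = []
        for wi, wj, wk in combinations(words, 3):
            eij, eik, ejk = edge(wi, wj), edge(wi, wk), edge(wj, wk)
            if not (eij and eik and ejk):
                continue                   # step 2)-d-1: pairwise edges must exist
            if S[eij] == S[eik] or S[eik] == S[ejk] or S[eij] == S[ejk]:
                continue                   # step 2)-d-2: document sets must all differ
            triangles.append((wi, wj, wk, prior_knowledge(wi, wj, wk)))
        return triangles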
Further, in step 3 the detailed procedure of the Gibbs sampling is as follows:
Step 3)-a-1) Initialize the sampling platform: using machine learning methods, build a program that samples from a conditional probability distribution, for use by the SWTTM model;
Step 3)-a-2) Randomly initialize one topic for each semantic word triangle;
Step 3)-a-3) Choose a suitable number of iterations T and initialize t = 0;
Step 3)-a-4) Judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-5) Randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and compute from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant set to prevent the β values from becoming too small;
Step 3)-a-6) Compute the topic distribution of the model after removing the word triangle t_q; with the definitions below, the conditional probability takes the collapsed-Gibbs form

    P(z_q = k | T, Z_-q) ∝ (n_{z_k}^{-q} + α) · Π_{w ∈ {w_m, w_n, w_l}} (n_{w|z_k}^{-q} + β_w) / (n_{·|z_k}^{-q} + V·β)

where k denotes a topic index, K the total number of topics, V the total number of words in the corpus, z_q the topic of word triangle t_q, T the set of all semantic word triangles, Z_-q the topic assignment after removing t_q, P(z_q = k | T, Z_-q) the probability that t_q has topic k given the topic assignment of all word triangles except t_q, n_{z_k}^{-q} the number of word triangles belonging to topic z_k after removing t_q, n_{w_m|z_k}^{-q} the frequency of word w_m in topic z_k after removing t_q, α the prior hyperparameter of the document-topic distribution, and β the topic-word distribution hyperparameter excluding the current word triangle; α and β are model input parameters;
Step 3)-a-7) Sample one topic according to the conditional probability distribution P(z_q = k | T, Z_-q);
Step 3)-a-8) Update the "document-topic" distribution parameter as

    θ_k = (n_{z_k} + α) / (N_B + K·α)

where n_{z_k} denotes the number of documents whose topic is z_k and N_B denotes the total number of documents in the corpus;
Step 3)-a-9) Update the distribution parameter of word w_m of the word triangle under topic z_k as

    φ_{z_k, w_m} = (n_{w_m|z_k} + β) / (n_{·|z_k} + V·β)

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k, n_{·|z_k} the total word count under topic z_k, and β the Dirichlet distribution hyperparameter;
Step 3)-a-10) Update the distribution parameter of w_n under topic z_k analogously, where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k;
Step 3)-a-11) Update the distribution parameter of w_l under topic z_k analogously, where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k;
Step 3)-a-12) Let t = t + 1 and judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-13) Model training ends. A sketch of this sampling loop follows.
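The loop can be sketched as a plain collapsed Gibbs sampler under the reconstruction above; using a single symmetric beta where the patent derives per-word β_m, β_n, β_l from the triangle prior is a simplifying assumption of this sketch:

    import random

    def gibbs_sample(triangles, K, V, alpha, beta, T):
        z = [random.randrange(K) for _ in triangles]   # step 3)-a-2: random initial topics
        n_k = [0] * K                                  # word triangles per topic
        n_wk = [{} for _ in range(K)]                  # per-topic word counts
        for q, (wm, wn, wl, _) in enumerate(triangles):
            n_k[z[q]] += 1
            for w in (wm, wn, wl):
                n_wk[z[q]][w] = n_wk[z[q]].get(w, 0) + 1
        for _ in range(T):
            for q, (wm, wn, wl, _) in enumerate(triangles):
                n_k[z[q]] -= 1                         # remove t_q from the counts
                for w in (wm, wn, wl):
                    n_wk[z[q]][w] -= 1
                weights = []
                for k in range(K):                     # step 3)-a-6: conditional P(z_q = k | T, Z_-q)
                    total_k = sum(n_wk[k].values())
                    p = n_k[k] + alpha
                    for w in (wm, wn, wl):
                        p *= (n_wk[k].get(w, 0) + beta) / (total_k + V * beta)
                    weights.append(p)
                z[q] = random.choices(range(K), weights=weights)[0]  # step 3)-a-7
                n_k[z[q]] += 1
                for w in (wm, wn, wl):
                    n_wk[z[q]][w] = n_wk[z[q]].get(w, 0) + 1
        return z, n_k, n_wk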
Further, the specific steps of the document topic inference in step 4 include:
Step 4)-a-1) For each original document of the target corpus, split it into its set of word pairs;
Step 4)-a-2) Judge whether these word pairs have at least one associated semantic word triangle; if so, go to step 4)-a-3); if not, go to step 4)-a-6);
Step 4)-a-3) Compute the frequency of semantic word triangle t_q in document d as

    P(t_q | d) = n_d(t_q) / Σ_{t ∈ d_t} n_d(t)

where n_d(t_q) denotes the frequency of semantic word triangle t_q in the semantic word triangle set d_t of document d;
Step 4)-a-4) Compute the topic distribution of semantic word triangle t_q with the Bayes formula, up to normalization:

    P(z_k | t_q) ∝ P(z_k) · P(w_m | z_k) · P(w_n | z_k) · P(w_l | z_k);

Step 4)-a-5) Compute the topic distribution of the document as

    P(z_k | d) = Σ_{t_q ∈ d_t} P(z_k | t_q) · P(t_q | d)

where |d_t| denotes the size of the semantic word triangle set of document d;
Step 4)-a-6) Compute the probability P(w_i | d) of word w_i appearing in document d;
Step 4)-a-7) Compute the topic distribution of word w_i from the global topic distribution and the per-topic word distributions, up to normalization:

    P(z_k | w_i) ∝ P(z_k) · P(w_i | z_k);

Step 4)-a-8) Obtain the topic of the document from the topic distributions of the words in the document:

    P(z_k | d) = Σ_{w_i ∈ d} P(z_k | w_i) · P(w_i | d).

A sketch of this inference follows.
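A sketch of this inference, assuming `theta[k] = P(z_k)` and `phi[k][w] = P(w | z_k)` come from the trained model, and treating P(t_q | d) as uniform over the matched triangles for brevity:

    from itertools import combinations

    def infer_document_topics(doc, triangles, theta, phi, K):
        pairs = {frozenset(p) for p in combinations(set(doc), 2)}
        matched = [t for t in triangles
                   if any(frozenset(e) in pairs for e in combinations(t[:3], 2))]
        p_d = [0.0] * K
        if matched:                                    # steps 4)-a-3 to 4)-a-5
            for wm, wn, wl, _ in matched:
                w_k = [theta[k] * phi[k].get(wm, 0.0) * phi[k].get(wn, 0.0) * phi[k].get(wl, 0.0)
                       for k in range(K)]              # Bayes: P(z_k | t_q) up to normalization
                norm = sum(w_k) or 1.0
                for k in range(K):
                    p_d[k] += w_k[k] / norm / len(matched)
        else:                                          # steps 4)-a-6 to 4)-a-8: word-level fallback
            for w in doc:
                w_k = [theta[k] * phi[k].get(w, 0.0) for k in range(K)]
                norm = sum(w_k) or 1.0
                for k in range(K):
                    p_d[k] += w_k[k] / norm / len(doc)
        return p_d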
Compared with the prior art, the present invention has the following advantages:
For the problem that the topic-word quality of traditional biterm topic models is dominated by high-frequency words, the present invention assumes that words appearing in most documents have weak topic-characterizing ability, and on this assumption introduces the IDF index, together with semantic similarity, as the prior knowledge of the word distribution, mitigating the influence of high-frequency words on topic quality. For the problem that ordinary word co-occurrence networks ignore word pairs that are semantically close but rarely co-occur, the invention proposes a novel construction method for semantic word networks, enabling the topic model to attend more comprehensively to the topical connections between words; the quality of the mined topics is also clearly improved.
Brief description of the drawings
Fig. 1 is the flow chart of the short text topic mining method based on the semantic word network;
Fig. 2 is the flow chart of semantic word network construction and semantic word triangle search;
Fig. 3 is the probabilistic graphical model of the SWTTM algorithm.
Specific embodiment
The present invention is further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples merely illustrate the invention and do not limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.
Fig. 1 is the flow chart of the short text topic mining method based on the semantic word network implemented by the present invention. The specific steps are described as follows:
Step 0 is the initial state of the invention;
In the model initialization stage (steps 1-3):
Step 1 collects an external corpus from related fields; there is no requirement on text length;
Step 2 preprocesses the external corpus and the target corpus, including word segmentation and filtering; the corpora are segmented mainly so that the next step can run the algorithm word by word. The specific steps are as follows (a sketch follows step 2-3):
Step 2-1) Segment the two corpora separately and remove stop words at the same time;
Step 2-2) Delete tokens that are neither Chinese characters nor Latin letters, and lowercase all Latin letters;
Step 2-3) Delete words whose frequency in the corpus is less than 5, and delete documents containing fewer than 3 words;
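A minimal sketch of this preprocessing, assuming the jieba segmenter and a caller-supplied stop word list; the thresholds follow steps 2-2 and 2-3:

    import re
    from collections import Counter
    import jieba

    def preprocess(raw_docs, stopwords):
        token_re = re.compile(r"^[\u4e00-\u9fff]+$|^[A-Za-z]+$")   # Chinese or Latin tokens only
        docs = [[w.lower() for w in jieba.lcut(text)
                 if token_re.match(w) and w.lower() not in stopwords]
                for text in raw_docs]
        freq = Counter(w for d in docs for w in d)
        docs = [[w for w in d if freq[w] >= 5] for d in docs]      # drop words with frequency < 5
        return [d for d in docs if len(d) >= 3]                    # drop documents with < 3 words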
Step 3 sets the word2vec model parameters and, with the external corpus as input, trains the model to obtain the word vector data;
In the topic unit construction stage (steps 4-6):
Step 4 constructs the basic co-occurrence word network based on the corpus D = {d_1, d_2, ..., d_n};
Step 5 incorporates semantic information on the basis of the word co-occurrence network to construct the semantic word network;
Step 6 finds the word-triangle structures satisfying the condition in the semantic word network and computes the inverse document frequency of each word;
the condition the word-triangle structures satisfy is: pairwise edge connections exist between the three word nodes, and the connections come from different parts of the document sub-networks.
In the model training stage (steps 7-8):
Step 7 samples the model variables with Gibbs sampling and performs model training on the sample data obtained from step 1 and step 2. The specific implementation process is as follows:
Step 7-1 initializes the sampling platform: using machine learning methods, a program that samples from the conditional probability distribution is built for use by the SWTTM model; the algorithm flow of the SWTTM model is shown in Fig. 3;
Step 7-2 randomly initializes one topic for each semantic word triangle.
Step 7-3 chooses a suitable number of iterations T and initializes t = 0;
Step 7-4 judges whether t < T: if yes, go to step 7-5; if no, go to step 7-13;
Step 7-5 randomly selects a word triangle t_q = (w_m, w_n, w_l, l_mnl) and computes from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant, set according to the word-vector sampling situation and manual evaluation, that prevents the β values from becoming too small.
Step 7-6 computes the topic distribution of the model after removing the word triangle t_q, with the formula of step 3)-a-6);
Step 7-7 samples one topic according to the conditional probability distribution;
Step 7-8 updates the "document-topic" distribution parameter with the formula of step 3)-a-8), where n_{z_k} denotes the number of documents whose topic is z_k, N_B the total number of documents in the corpus, and K the total number of topics.
Steps 7-9, 7-10 and 7-11 update the distribution parameters of the words w_m, w_n and w_l of the word triangle under topic z_k with the formulas of steps 3)-a-9) to 3)-a-11), where n_{w|z_k} denotes the number of occurrences of the respective word under topic z_k.
Step 7-12 lets t = t + 1 and judges whether t < T: if yes, go to step 7-5; if no, go to step 7-13;
Step 7-13: model training ends.
Step 8 takes the Gibbs sampling result as the semantic word triangle topic distribution;
In the result output stage (steps 9-10):
Step 9 first splits the original documents into word pairs;
Step 10 finds the associated semantic word triangles from the word pairs and computes the topic distribution of the original documents. The specific method is as follows:
Step 10-1 judges whether these word pairs have at least one associated semantic word triangle; if yes, go to step 10-2; if no, go to step 10-5;
Step 10-2 computes the frequency of semantic word triangle t_q in document d with the formula of step 4)-a-3), where n_d(t_q) denotes the frequency of t_q in the semantic word triangle set d_t of document d.
Step 10-3 computes the topic distribution of semantic word triangle t_q with the Bayes formula;
Step 10-4 computes the topic distribution of the document, where |d_t| denotes the size of the semantic word triangle set of document d.
Step 10-5 computes the probability of word w_i appearing in document d;
Step 10-6 computes the topic distribution of word w_i from the global topic distribution and the per-topic word distributions;
Step 10-7 obtains the topic of the document from the topic distributions of the words in the document;
Step 10-8: end.
Step 11 is the end state.
Fig. 2 is the detailed description of steps 4 and 5 in Fig. 1.
Step 12 is the initial state.
Step 13 establishes the basic co-occurrence word network. The specific method is as follows:
Step 4-1 initializes the node set V, the edge set E and the edge attribute set R;
Step 4-2: for each word w_i in document d_k = {w_1, w_2, ..., w_m}, if the word does not appear in the set V, add it to V;
Step 4-3: for every word pair (w_i, w_j) in document d_k, if the pair is not in the set E, add it to E, add the attribute r_ij = <S_ij, s_ij> to the set R and let S_ij = {k}; if the edge already exists in E, add the document number k to S_ij in R.
Step 14 obtains the word vector data of the target corpus from the training result of step 3 and sets the semantic-similarity threshold δ;
Step 15 computes, for every pair of word nodes in the basic co-occurrence word network, the semantic similarity between the word pair with the cosine formula of step 2)-b-3), where v_i denotes the word vector of w_i.
Step 16 judges whether there is an edge between the nodes; if yes, go to step 17; if no, go to step 18;
Step 17 records the semantic similarity information in the edge attribute, i.e. s_ij = sim(w_i, w_j);
Step 18 judges whether the semantic similarity satisfies sim(w_i, w_j) > δ; if yes, go to step 19;
Step 19 adds the attribute r_ij = <S_ij, s_ij> to the set R and lets s_ij = sim(w_i, w_j);
Step 20 computes the inverse document frequency for each word in the semantic word network with the formula of step 2)-c, where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i.
Step 21: for any three words w_i, w_j, w_k ∈ V, judge whether edges exist pairwise between the nodes, i.e. e_ij, e_jk, e_ik ∈ E, and judge whether the document sets in the edge attributes all differ, i.e. S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk; if so, go to step 22;
Step 22 computes the word triangle prior knowledge l_ijk;
Step 23 generates the semantic word triangle t = (w_i, w_j, w_k, l_ijk);
Step 24 is the end state.
For the problem that traditional biterm topic models treat word pairs of different importance equally, the present invention assumes that semantically closer words are more likely to belong to the same topic, and introduces externally trained word embeddings to measure the semantic relations of words. With this prior knowledge on the topic-word distribution, the model pays more attention to word pairs with higher semantic similarity. For the problem that the topic-word quality of traditional biterm topic models is dominated by high-frequency words, the invention assumes that words appearing in most documents characterize topics weakly, and introduces the IDF index, together with semantic similarity, as the prior knowledge of the word distribution. The invention also proposes a novel construction method for semantic word networks, enabling the word network to attend more comprehensively to the topical connections between words, and on the basis of this network proposes a basic unit with tighter topical connection, the semantic word-triangle structure; using it as the unit of topic mining yields higher topic quality.
In conclusion a kind of short text Topics Crawling method based on semantic word network of the invention believes external semantic Breath, context word frequency information and word three-legged structure comprehensively consider, and cope with its feature sparsity when excavating for short text topic model The problem of provide a kind of new resolving ideas.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A short text topic mining method based on a semantic word network, characterized by comprising the following steps:
Step 1, model initialization stage: collect corpora from related fields to build an external corpus; preprocess the external corpus and the target corpus so that both are converted into a format accepted by the word2vec model; with the external corpus as input, train the word2vec model so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model;
Step 2, topic unit construction stage:
2)-a Generate the basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}. The specific steps are as follows:
Step 2)-a-1) Create the node set V, the edge set E and the edge attribute set R, all initially empty;
Step 2)-a-2) For each word w_i in document d_k = {w_1, w_2, ..., w_m}, if w_i does not appear in the set V, add it to V, k ∈ {1, 2, ..., n};
Step 2)-a-3) For every word pair (w_i, w_j) in document d_k: if the pair is not in the set E, add it to E and add the attribute r_ij = <S_ij, s_ij> to the set R, where S_ij = {k} denotes the set of documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; if the edge e_ij already exists in E, add the document number k to the document set S_ij of the edge attribute r_ij;
2)-b On the basis of the word co-occurrence network, incorporate semantic information to build the semantic word network. The specific steps are as follows:
Step 2)-b-1) Compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus not covered by the external vocabulary, set the corresponding word vector to empty, treating them subsequently as having no semantic information;
Step 2)-b-2) Set the threshold δ;
Step 2)-b-3) For every pair of word nodes w_i and w_j in the word co-occurrence network, compute the semantic similarity between the word pair as

    sim(w_i, w_j) = (v_i · v_j) / (||v_i|| · ||v_j||)

where v_i and v_j denote the word vectors of w_i and w_j respectively;
Step 2)-b-4) Judge whether there is an edge between each pair of word nodes w_i and w_j; if so, go to step 2)-b-5); otherwise, go to step 2)-b-6);
Step 2)-b-5) Record the semantic similarity s_ij in the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the original set of co-occurrence documents of the word pair;
Step 2)-b-6) Judge whether the semantic similarity satisfies s_ij > δ; if so, go to step 2)-b-7); otherwise, do nothing for this pair of word nodes;
Step 2)-b-7) Add the edge e_ij to the edge set E, add the attribute r_ij = <S_ij, s_ij> to the edge attribute set R, and let s_ij = sim(w_i, w_j);
Step 2)-c For each word w_i in the semantic word network, compute the inverse document frequency

    idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
Step 2)-d Find, in the semantic word network, the semantic word triangles satisfying the following condition:
pairwise edge connections exist between the three word nodes of the semantic word triangle, and the connections come from different parts of the document sub-networks;
Step 3, model training stage: for all semantic word triangles obtained in step 2, randomly initialize the topic assignment of each semantic word triangle; obtain the current topic assignment of the semantic word triangles by Gibbs sampling, and from it compute the update parameters of the document-topic distribution and the topic-word distribution; iterate until the maximum number of iterations is reached or the Gibbs sampling converges, and take the final Gibbs sampling result as the word triangle topic distribution;
Step 4, result output stage: from the semantic word triangle topic distributions obtained in step 3, infer the topic distribution of the original documents.
2. The short text topic mining method based on a semantic word network according to claim 1, characterized in that the specific steps of finding semantic word triangles in step 2)-d include:
Step 2)-d-1) For any three words w_i, w_j, w_k in the set V, judge whether the three pairwise edges exist between the nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if so, go to step 2)-d-2);
Step 2)-d-2) Judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if so, go to step 2)-d-3);
Step 2)-d-3) Compute the word triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are computed as described above;
Step 2)-d-4) Generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk).
3. The short text topic mining method based on a semantic word network according to claim 2, characterized in that the detailed procedure of the Gibbs sampling in step 3 is as follows:
Step 3)-a-1) Initialize the sampling platform: using machine learning methods, build a program that samples from a conditional probability distribution, for use by the SWTTM model;
Step 3)-a-2) Randomly initialize one topic for each semantic word triangle;
Step 3)-a-3) Choose a suitable number of iterations T and initialize t = 0;
Step 3)-a-4) Judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-5) Randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and compute from the expanded information the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle, where ∈ is a constant set to prevent the β values from becoming too small;
Step 3)-a-6) Compute the topic distribution of the model after removing the word triangle t_q; with the definitions below, the conditional probability takes the collapsed-Gibbs form

    P(z_q = k | T, Z_-q) ∝ (n_{z_k}^{-q} + α) · Π_{w ∈ {w_m, w_n, w_l}} (n_{w|z_k}^{-q} + β_w) / (n_{·|z_k}^{-q} + V·β)

where k denotes a topic index, K the total number of topics, V the total number of words in the corpus, z_q the topic of word triangle t_q, T the set of all semantic word triangles, Z_-q the topic assignment after removing t_q, P(z_q = k | T, Z_-q) the probability that t_q has topic k given the topic assignment of all word triangles except t_q, n_{z_k}^{-q} the number of word triangles belonging to topic z_k after removing t_q, n_{w_m|z_k}^{-q} the frequency of word w_m in topic z_k after removing t_q, α the prior hyperparameter of the document-topic distribution, and β the topic-word distribution hyperparameter excluding the current word triangle; α and β are model input parameters;
Step 3)-a-7) Sample one topic according to the conditional probability distribution P(z_q = k | T, Z_-q);
Step 3)-a-8) Update the "document-topic" distribution parameter as

    θ_k = (n_{z_k} + α) / (N_B + K·α)

where n_{z_k} denotes the number of documents whose topic is z_k and N_B denotes the total number of documents in the corpus;
Step 3)-a-9) Update the distribution parameter of word w_m of the word triangle under topic z_k as

    φ_{z_k, w_m} = (n_{w_m|z_k} + β) / (n_{·|z_k} + V·β)

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k, n_{·|z_k} the total word count under topic z_k, and β the Dirichlet distribution hyperparameter;
Step 3)-a-10) Update the distribution parameter of w_n under topic z_k analogously, where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k;
Step 3)-a-11) Update the distribution parameter of w_l under topic z_k analogously, where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k;
Step 3)-a-12) Let t = t + 1 and judge whether t < T: if so, go to step 3)-a-5); if not, go to step 3)-a-13);
Step 3)-a-13) Model training ends.
4. The short text topic mining method based on a semantic word network according to claim 3, characterized in that the specific steps of the document topic inference in step 4 include:
Step 4)-a-1) For each original document of the target corpus, split it into its set of word pairs;
Step 4)-a-2) Judge whether these word pairs have at least one associated semantic word triangle; if so, go to step 4)-a-3); if not, go to step 4)-a-6);
Step 4)-a-3) Compute the frequency of semantic word triangle t_q in document d as

    P(t_q | d) = n_d(t_q) / Σ_{t ∈ d_t} n_d(t)

where n_d(t_q) denotes the frequency of semantic word triangle t_q in the semantic word triangle set d_t of document d;
Step 4)-a-4) Compute the topic distribution of semantic word triangle t_q with the Bayes formula, up to normalization:

    P(z_k | t_q) ∝ P(z_k) · P(w_m | z_k) · P(w_n | z_k) · P(w_l | z_k);

Step 4)-a-5) Compute the topic distribution of the document as

    P(z_k | d) = Σ_{t_q ∈ d_t} P(z_k | t_q) · P(t_q | d)

where |d_t| denotes the size of the semantic word triangle set of document d;
Step 4)-a-6) Compute the probability P(w_i | d) of word w_i appearing in document d;
Step 4)-a-7) Compute the topic distribution of word w_i from the global topic distribution and the per-topic word distributions, up to normalization:

    P(z_k | w_i) ∝ P(z_k) · P(w_i | z_k);

Step 4)-a-8) Obtain the topic of the document from the topic distributions of the words in the document:

    P(z_k | d) = Σ_{w_i ∈ d} P(z_k | w_i) · P(w_i | d).
CN201910400416.5A 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network Active CN110134958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910400416.5A CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network


Publications (2)

Publication Number Publication Date
CN110134958A 2019-08-16
CN110134958B CN110134958B (en) 2021-05-18

Family

ID=67574004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910400416.5A Active CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network

Country Status (1)

Country Link
CN (1) CN110134958B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150154148A1 (en) * 2013-12-02 2015-06-04 Qbase, LLC Method of automated discovery of new topics
CN105608192A (en) * 2015-12-23 2016-05-25 南京大学 Short text recommendation method for user-based biterm topic model
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics
CN108197144A (en) * 2017-11-28 2018-06-22 河海大学 A kind of much-talked-about topic based on BTM and Single-pass finds method
CN108182176A (en) * 2017-12-29 2018-06-19 太原理工大学 Enhance BTM topic model descriptor semantic dependencies and theme condensation degree method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李思宇 et al.: "Biterm topic model based on biterm semantic extension" (基于双词语义扩展的Biterm主题模型), 《计算机工程》 (Computer Engineering) *
蔡洋: "A short text topic model algorithm based on word triangles" (基于词三角的短文本主题模型算法), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061866A (en) * 2019-08-20 2020-04-24 河北工程大学 Bullet screen text clustering method based on feature extension and T-oBTM
CN111061866B (en) * 2019-08-20 2024-01-02 河北工程大学 Barrage text clustering method based on feature expansion and T-oBTM
CN111339289A (en) * 2020-03-06 2020-06-26 西安工程大学 Topic model inference method based on commodity comments
CN111339289B (en) * 2020-03-06 2022-10-28 西安工程大学 Topic model inference method based on commodity comments
CN111723563A (en) * 2020-05-11 2020-09-29 华南理工大学 Topic modeling method based on word co-occurrence network
CN111723563B (en) * 2020-05-11 2023-09-26 华南理工大学 Topic modeling method based on word co-occurrence network
CN112183108A (en) * 2020-09-07 2021-01-05 哈尔滨工业大学(深圳) Inference method, system, computer equipment and storage medium for short text topic distribution
CN112487185A (en) * 2020-11-27 2021-03-12 国家电网有限公司客户服务中心 Data classification method in power customer field
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model

Also Published As

Publication number Publication date
CN110134958B (en) 2021-05-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant