CN110134958B - Short text topic mining method based on semantic word network - Google Patents
- Publication number: CN110134958B (application CN201910400416.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- topic
- semantic
- triangle
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications (all under G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data)
- G06F40/20—Natural language analysis; G06F40/258—Heading extraction; Automatic titling; Numbering
- G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities; G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F40/30—Semantic analysis
Abstract
The invention discloses a short text topic mining method based on a semantic word network, comprising the following stages: 1) a model initialization stage: collecting an external corpus of the related field, preprocessing the corpora, and setting parameters; 2) a topic unit construction stage: constructing the semantic word network, searching for the specific word-triangle structures, and calculating the model's prior parameters; 3) a model training stage: sampling the model variables with the Gibbs sampling method and judging whether the model has reached the convergence condition; 4) a result output stage: obtaining the topic distribution of each word triangle from the sampled variables after training finishes, and from these calculating the topic distribution of each original document. The method combines semantic information learned from an external corpus with the word-triangle topic structure and applies them to short text topic mining; compared with the traditional word-pair topic model, it provides a way to integrate external prior knowledge into a traditional topic model and markedly improves the quality of the mined topics.
Description
Technical Field
The invention relates to a short text topic mining method, in particular to a short text topic mining method based on a semantic word network, which addresses the low topic quality that common topic mining methods suffer when short text features are sparse.
Background
With the ever-accelerating pace of social life and the short, fast user experience brought by intelligent mobile terminals, online communication has become increasingly fragmented. Short text data therefore plays a growing role in today's network information exchange: social network statuses, microblog messages, traditional news headlines, short video titles, question-and-answer posts, and so on are all short texts. Short text data is also generated and accumulated at great speed with the rise of companies such as Weibo, Facebook, and Twitter. Mining topic information from massive short text data thus has significant value; public opinion analysis, information retrieval, personalized recommendation, and user interest clustering are all application directions of topic mining. On the other hand, traditional text mining methods struggle to mine the topic information of short texts, mainly because word co-occurrence information in short texts is very sparse.
Current solutions to sparse short text features generally exploit word co-occurrence relations, based on one assumption: word pairs that co-occur in the same short text are topically related. For example, two models commonly used in the short text topic mining area are the word-pair (biterm) topic model and the word network topic model. The former takes co-occurring word pairs as the basic topic units; the latter builds a pseudo-document for each word out of its co-occurring words to help discover that word's topics. These methods ignore the semantic relations between words. For example, "vacation" and "holiday" are two words with very close semantics, and word pairs formed from them should contribute more to a topic than ordinary co-occurring words; but because they rarely co-occur in the same short text, common models ignore them.
A word vector is a computer representation of a word; based on this representation, words can be fed directly into a model as features, which greatly simplifies natural language processing. Compared with the traditional one-hot word vector, the distributed word vector has lower, more controllable dimensionality, and because it is trained on a large external corpus through a neural language model, the semantic information it carries is richer. The invention exploits the ability of distributed word vectors to represent semantics: it uses word vectors to measure the semantic similarity of words and adds that similarity as prior knowledge to a word-triangle topic model, providing a new approach to short text topic mining.
Disclosure of Invention
The purpose of the invention is as follows: the technical problem to be solved is that, when traditional topic models rely on word co-occurrence information to cope with the feature scarcity of short text data, the mined topic quality is not high enough, owing to the noise this introduces and the semantic information it ignores. The invention discloses a topic mining method that introduces external semantic information and fuses it with word co-occurrence information to construct a semantic word network, as follows: first, collect an external corpus from the related field and train word vectors with a word2vec model; then traverse the target corpus, combine it with the word vector information to generate a semantic word network, and select the specific word-triangle structures in that network; then sample the parameters with the Gibbs sampling method, iterating until convergence; finally, calculate the topic distribution of the word triangles from the sampling results, and from these calculate the topic distribution of the documents in the target corpus.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
a short text topic mining method based on semantic word network comprises the following steps:
step 2)-a generating a basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}; the specific steps are as follows:
step 2) -a-1) establishing a point set V, an edge set E and an edge attribute set R, wherein the initial state is null;
step 2)-a-2) for every word w_i in document d_k = {w_1, w_2, ..., w_m}: if the word w_i does not appear in the set V, add it to V, where k ∈ {1, 2, ..., n};
step 2)-a-3) for every word pair (w_i, w_j) in document d_k: if the edge e_ij is not in the set E, add it to E and add the attribute pair r_ij = <S_ij, s_ij> to the set R, where S_ij denotes the set of numbers of the documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; let S_ij = {k}; if e_ij already exists in the set E, add the document number k to the document number set S_ij in the edge attribute r_ij;
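The network-construction steps 2)-a-1) through 2)-a-3) can be sketched as follows. The function name `build_cooccurrence_network` is illustrative, and edge attributes are stored as `[S_ij, s_ij]` pairs with the similarity slot left empty until the fusion steps of 2)-b:

```python
from itertools import combinations

def build_cooccurrence_network(docs):
    """Build the word co-occurrence network of steps 2)-a.

    docs: list of tokenized documents; returns (V, E, R) where V is
    the node set, E the edge set, and R maps each edge (w_i, w_j)
    to [S_ij, s_ij]: the set of co-occurring document numbers and a
    semantic-similarity slot to be filled later.
    """
    V, E, R = set(), set(), {}
    for k, doc in enumerate(docs):
        V.update(doc)
        for wi, wj in combinations(sorted(set(doc)), 2):
            edge = (wi, wj)
            if edge not in E:
                E.add(edge)
                R[edge] = [{k}, None]   # new edge: S_ij = {k}, s_ij unset
            else:
                R[edge][0].add(k)       # existing edge: add document number k
    return V, E, R
```

Edges are keyed by the sorted word pair so that (w_i, w_j) and (w_j, w_i) map to the same attribute entry.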
2) b, fusing semantic information on the basis of the word co-occurrence network to construct a semantic word network, and specifically comprising the following steps:
step 2)-b-1) comparing the words of the target corpus against the word vector data of the external corpus; for words of the target corpus that are not registered in the external corpus, set the corresponding word vector to empty, i.e. they carry no semantic information in the subsequent steps;
step 2) -b-2) setting a threshold value delta;
step 2)-b-3) for each pair of word nodes w_i and w_j in the word co-occurrence network, calculate the semantic similarity between the word pair according to the following (cosine) formula:

s_ij = sim(w_i, w_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)

where v_i and v_j respectively denote the word vectors corresponding to w_i and w_j;
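The similarity formula itself appears only as an image in the patent; assuming it is the cosine of the two word2vec vectors, a stdlib-only sketch of step 2)-b-3):

```python
import math

def cosine_similarity(vi, vj):
    """Semantic similarity s_ij between two word vectors (step 2)-b-3)."""
    dot = sum(a * b for a, b in zip(vi, vj))
    norm = math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj))
    return dot / norm if norm else 0.0  # zero vector: no semantic information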
step 2)-b-4) judging, for each pair of word nodes w_i and w_j, whether there is an edge connecting them; if yes, go to step 2)-b-5); otherwise, go to step 2)-b-6);
step 2)-b-5) writing the semantic similarity s_ij into the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the set of co-occurring document numbers of the original word pair;
step 2)-b-6) judging whether the semantic similarity s_ij satisfies s_ij > δ; if yes, go to step 2)-b-7); otherwise, perform no operation on this pair of word nodes;
step 2)-b-7) adding the edge e_ij to the edge set E and the attribute pair r_ij = <S_ij, s_ij> to the edge attribute set R, letting s_ij = sim(w_i, w_j);
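Steps 2)-b-4) through 2)-b-7) fuse the semantic information into the network. An illustrative sketch (function and parameter names are assumptions; `sim` is the similarity function of step 2)-b-3), and a purely semantic edge is assumed to start with an empty co-occurrence set S_ij):

```python
from itertools import combinations

def add_semantic_edges(V, E, R, vectors, delta, sim):
    """Fuse semantic information into the co-occurrence network (steps 2)-b).

    For every pair of word nodes: if an edge already exists, record the
    similarity in its attribute; otherwise add a purely semantic edge
    whenever the similarity exceeds the threshold delta. `vectors` maps
    words to their word2vec vectors; unregistered words carry no
    semantic information and are skipped.
    """
    for wi, wj in combinations(sorted(V), 2):
        if wi not in vectors or wj not in vectors:
            continue                      # step 2)-b-1): no word vector
        s = sim(vectors[wi], vectors[wj])
        edge = (wi, wj)
        if edge in E:
            R[edge][1] = s                # step 2)-b-5): fill s_ij
        elif s > delta:
            E.add(edge)                   # step 2)-b-7): new semantic edge
            R[edge] = [set(), s]          # assumed: empty S_ij, s_ij = sim
    return E, R
```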
step 2)-c for each word w_i in the semantic word network, calculating the inverse document frequency by the formula:

idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
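Assuming the standard logarithmic IDF form, step 2)-c can be sketched as:

```python
import math

def inverse_document_frequency(word, docs):
    """idf(w_i) = log(N_D / |{d in D : w_i in d}|), as in step 2)-c."""
    n_d = len(docs)
    containing = sum(1 for d in docs if word in d)
    return math.log(n_d / containing) if containing else 0.0
```

High-frequency words that appear in most documents get an IDF near zero, which is exactly why the patent uses IDF to down-weight their contribution to the word-distribution prior.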
step 2)-d searching the semantic word network for semantic word triangles that meet the following condition:
the three word nodes of a semantic word triangle are pairwise connected by edges and come from the connecting parts of different document sub-networks;
Further, the specific steps of searching for semantic word triangles in step 2)-d include:
step 2)-d-1) for any three words w_i, w_j, w_k in the set V, judging whether an edge exists between each pair of nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if yes, go to step 2)-d-2);
step 2)-d-2) judging whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk is satisfied; if yes, go to step 2)-d-3);
step 2)-d-3) calculating the word-triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are computed as described above;
step 2)-d-4) generating the semantic word triangle t = (w_i, w_j, w_k, l_ijk).
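The triangle search of steps 2)-d-1) through 2)-d-4) can be sketched as follows; the `prior` callable stands in for the l_ijk computation of step 2)-d-3), whose exact formula is given only in the original figure:

```python
from itertools import combinations

def find_semantic_word_triangles(V, E, R, prior):
    """Enumerate semantic word triangles (steps 2)-d).

    A triangle needs all three edges present (step 2)-d-1)) and
    pairwise-distinct co-occurring document sets S_ij (step 2)-d-2));
    `prior` computes the triangle's prior knowledge l_ijk from the
    three edge attributes.
    """
    triangles = []
    for wi, wj, wk in combinations(sorted(V), 3):
        edges = [(wi, wj), (wj, wk), (wi, wk)]
        if not all(e in E for e in edges):
            continue
        s_ij, s_jk, s_ik = (R[e][0] for e in edges)
        if s_ij != s_ik and s_ik != s_jk and s_ij != s_jk:
            triangles.append((wi, wj, wk, prior(*(R[e] for e in edges))))
    return triangles
```

Requiring distinct document sets enforces the condition that the three nodes come from the connecting parts of different document sub-networks.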
Further, in step 3, the Gibbs sampling specifically includes the following steps:
step 3)-a-1) initializing the sampling algorithm platform and constructing, with machine learning methods, a program that samples from the conditional probability distribution, for use by the SWTTM model;
step 3) -a-2) randomly initializing a theme for each semantic word triangle;
step 3)-a-3) selecting a suitable number of iterations T and initializing t = 0;
step 3)-a-4) judging whether t is less than T: if yes, go to step 3)-a-5); if not, go to step 3)-a-13);
step 3)-a-5) randomly selecting a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculating the word triangle's Dirichlet distribution hyperparameters β_m, β_n, β_l from the expansion information; the specific formula is as follows:
where ε is a constant set to prevent the β values from being too small;
step 3)-a-6) calculating the topic distribution in the environment after the word triangle t_q is removed from the model; the formula is as follows:
where k denotes a topic number, K the total number of topics, V the total number of words in the corpus, z_q the topic of the word triangle t_q, T the full set of semantic word triangles, and Z_-q the topic distribution after the word triangle t_q is removed; P(z_q = k | T, Z_-q) is the probability that the topic of t_q is k, derived from the topic distribution of all word triangles other than t_q; n_k^(-q) denotes the number of word triangles belonging to topic z_k after t_q is removed, and n_{k,m}^(-q) the frequency of the word w_m in topic z_k after t_q is removed; α is the prior hyperparameter of the document-topic distribution and β the hyperparameter of the topic-word distribution excluding the current word triangle; α and β are model input parameters;
step 3)-a-7) sampling a topic according to the conditional probability distribution P(z_q = k | T, Z_-q);
step 3)-a-8) updating the "document-topic" distribution parameter; the update formula is as follows:
where n_k denotes the number of documents whose topic is z_k and N_B denotes the total number of documents in the corpus;
step 3)-a-9) updating the distribution parameter of the word w_m in the word triangle under topic z_k according to the formula:
where n_{k,m} denotes the number of occurrences of the word w_m under topic z_k and β is the Dirichlet distribution hyperparameter;
step 3)-a-10) updating the distribution parameter of w_n in the word triangle under topic z_k; the update formula is as follows:
where n_{k,n} denotes the number of occurrences of the word w_n under topic z_k;
step 3)-a-11) updating the distribution parameter of w_l in the word triangle under topic z_k; the update formula is as follows:
where n_{k,l} denotes the number of occurrences of the word w_l under topic z_k;
step 3)-a-12) letting t = t + 1 and judging whether t is less than T: if yes, go to step 3)-a-5); if not, go to step 3)-a-13);
step 3)-a-13) model training ends.
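The sampling formulas of steps 3)-a-5) through 3)-a-11) appear only as images in the patent text. As an illustrative sketch, assuming a biterm-model-style collapsed Gibbs sampler extended from word pairs to word triples (and a single symmetric β instead of the patent's per-word β_m, β_n, β_l), the training loop might look like:

```python
import random
from collections import defaultdict

def gibbs_train(triangles, V, K, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling over word triangles (step 3 sketch).

    triangles: list of (w_m, w_n, w_l) triples. n_k counts triangles
    per topic; n_kw counts word occurrences per topic. The conditional
    P(z_q = k) follows the biterm-model form extended to triples.
    """
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in triangles]          # random topic init
    n_k = [0] * K
    n_kw = [defaultdict(int) for _ in range(K)]
    for q, (wm, wn, wl) in enumerate(triangles):
        n_k[z[q]] += 1
        for w in (wm, wn, wl):
            n_kw[z[q]][w] += 1
    for _ in range(iters):
        for q, (wm, wn, wl) in enumerate(triangles):
            k_old = z[q]                               # remove t_q from counts
            n_k[k_old] -= 1
            for w in (wm, wn, wl):
                n_kw[k_old][w] -= 1
            weights = []
            for k in range(K):                         # conditional for each topic
                denom = 3 * n_k[k] + V * beta
                p = n_k[k] + alpha
                for w in (wm, wn, wl):
                    p *= (n_kw[k][w] + beta) / denom
                weights.append(p)
            k_new = rng.choices(range(K), weights=weights)[0]
            z[q] = k_new                               # add t_q back under new topic
            n_k[k_new] += 1
            for w in (wm, wn, wl):
                n_kw[k_new][w] += 1
    return z, n_k, n_kw
```

The per-word priors β_m, β_n, β_l of the patent would replace the symmetric `beta` inside the loop; this sketch keeps the structure (remove counts, compute the conditional, resample, restore counts) without claiming the exact formula.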
Further, the specific steps of original-document topic inference in step 4 include:
step 4) -a-1) splitting each original document in the target corpus into a word pair set;
step 4)-a-2) judging whether the word pair has at least one related semantic word triangle; if yes, go to step 4)-a-3); if not, go to step 4)-a-5);
step 4)-a-3) calculating the probability of the semantic word triangle t_q in the document d; the specific formula is as follows:
where n_d(t_q) denotes the frequency of the semantic word triangle t_q in the semantic word triangle set d_t of document d;
step 4)-a-3) calculating the topic distribution of the semantic word triangle t_q by the Bayesian formula; the specific formula is as follows:
step 4)-a-4) calculating the topic distribution of the document; the specific formula is as follows:
where |d_t| denotes the size of the semantic word triangle set of document d;
step 4)-a-5) calculating the probability of occurrence of the word w_i in the document d; the specific formula is as follows:
step 4)-a-6) calculating the topic distribution of the word w_i from the global topic distribution and the word distributions under the topics; the specific formula is as follows:
step 4)-a-7) obtaining the topic of the document from the topic distribution of the words in the document; the specific formula is as follows:
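The triangle-based branch of step 4 combines P(t_q | d) with the triangle topic distributions. A minimal sketch, under the assumption that a document's topic distribution is the frequency-weighted average of its triangles' topic distributions (function and input names are illustrative):

```python
from collections import Counter

def document_topic_distribution(doc_triangles, triangle_topics, K):
    """Infer a document's topic distribution from its semantic word
    triangles (step 4 sketch): P(z=k | d) = sum_q P(z=k | t_q) * P(t_q | d),
    with P(t_q | d) estimated from triangle frequencies in the document.

    doc_triangles: triangle identifiers occurring in document d;
    triangle_topics: maps each identifier to its topic distribution.
    """
    counts = Counter(doc_triangles)
    total = sum(counts.values())
    dist = [0.0] * K
    for tq, n in counts.items():
        p_tq_d = n / total                 # P(t_q | d)
        for k, p_k_tq in enumerate(triangle_topics[tq]):
            dist[k] += p_k_tq * p_tq_d     # accumulate P(z=k | d)
    return dist
```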
compared with the prior art, the invention has the following beneficial effects:
aiming at the problem that the traditional word pair topic model is influenced by the topic word quality of the high-frequency words, the invention assumes that the representation capability of the words appearing in most documents to the topics is weak, and introduces IDF indexes and semantic similarity together as the prior knowledge of word distribution based on the assumption, thereby relieving the influence of the high-frequency words on the topic quality. Aiming at the neglect of a plurality of word pairs with close semantic relation and less co-occurrence in the common word co-occurrence network, the invention provides a novel semantic word network construction method, so that a topic model can pay more attention to topic relation among words, and the quality of the mined topics is obviously improved.
Drawings
FIG. 1 is a flow chart of a short text topic mining method based on a semantic word network;
FIG. 2 is a flow chart of semantic word network construction and semantic word triangle finding;
fig. 3 is a probabilistic graphical model of the SWTTM algorithm.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It is to be understood that these examples are given solely for the purpose of illustrating the invention and not for limiting its scope; after reading the present invention, modifications of its various equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
Fig. 1 is a flowchart of a short text topic mining method based on a semantic word network according to an embodiment of the present invention. The specific steps are described as follows:
in the model initialization phase (step 1-3):
step 2-1) perform word segmentation on the two corpora and remove stop words;
step 2-2) delete words that are neither Chinese characters nor Latin letters, and lowercase all Latin letters;
step 2-3) delete words whose frequency in the corpus is less than 5, and delete documents with fewer than 3 words;
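Steps 2-1 through 2-3 amount to a small filtering pipeline. A sketch assuming tokenized input (word segmentation and the non-Chinese/non-Latin character filter are done upstream):

```python
from collections import Counter

def preprocess(docs, stopwords, min_freq=5, min_len=3):
    """Corpus preprocessing of steps 2-1 to 2-3: drop stop words,
    lowercase Latin letters, remove rare words, and drop documents
    that end up too short.
    """
    docs = [[w.lower() for w in d if w.lower() not in stopwords] for d in docs]
    freq = Counter(w for d in docs for w in d)
    docs = [[w for w in d if freq[w] >= min_freq] for d in docs]
    return [d for d in docs if len(d) >= min_len]
```

The thresholds 5 and 3 are the ones given in the embodiment; they are passed as parameters so they can be tuned per corpus.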
in the subject unit construction phase (steps 4-6):
the word triangle structure satisfies the following conditions: the three word nodes are connected with edges and are from the connected parts of different document sub-networks.
In the model training phase (step 7-8):
Step 7: sample the model variables with the Gibbs sampling method, training the model on the data obtained in steps 1 and 2. The specific implementation process is as follows:
step 7-1 initialize the sampling algorithm platform and construct, with machine learning methods, a program that samples from the conditional probability distribution, for use by the SWTTM model; the probabilistic graphical model of the SWTTM algorithm is shown in FIG. 3;
step 7-2 randomly initializes a topic for each semantic word triangle.
step 7-3 select a suitable number of iterations T and initialize t = 0;
step 7-4 judge whether t is less than T: if yes, go to step 7-5; if not, go to step 7-13;
step 7-5 randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculate the word triangle's Dirichlet distribution hyperparameters β_m, β_n, β_l from the expansion information; the specific formula is as follows:
where ε is a constant, set according to the word vector sampling conditions and manual evaluation, to prevent the β values from being too small.
step 7-6 calculate the topic distribution in the environment after the word triangle t_q is removed from the model; the formula is as follows:
step 7-7 sampling a topic according to the conditional probability distribution;
step 7-8 update the "document-topic" distribution parameter according to the formula:
where n_k denotes the number of documents whose topic is z_k, N_B denotes the total number of documents in the corpus, and K denotes the total number of topics.
step 7-9 update the distribution parameter of w_m in the word triangle under topic z_k according to the formula:
where n_{k,m} denotes the number of occurrences of the word w_m under topic z_k.
step 7-10 update the distribution parameter of w_n in the word triangle under topic z_k according to the formula:
where n_{k,n} denotes the number of occurrences of the word w_n under topic z_k.
step 7-11 update the distribution parameter of w_l in the word triangle under topic z_k according to the formula:
where n_{k,l} denotes the number of occurrences of the word w_l under topic z_k.
step 7-12 let t = t + 1 and judge whether t is less than T: if yes, go to step 7-5; if not, go to step 7-13;
step 7-13 model training ends.
in the result output stage (steps 9-10):
step 10-1 judge whether the word pair has at least one related semantic word triangle; if yes, go to step 10-2; if not, go to step 10-4;
step 10-2 calculate the probability of the semantic word triangle t_q in the document d; the specific formula is as follows:
where n_d(t_q) denotes the frequency of the semantic word triangle t_q in the semantic word triangle set d_t of document d.
step 10-3 calculate the topic distribution of the semantic word triangle t_q by the Bayesian formula; the specific formula is as follows:
step 10-4 calculate the topic distribution of the document; the specific formula is as follows:
where |d_t| denotes the size of the semantic word triangle set of document d.
step 10-5 calculate the probability of occurrence of the word w_i in the document d; the specific formula is as follows:
step 10-6 calculate the topic distribution of the word w_i from the global topic distribution and the word distributions under the topics; the specific formula is as follows:
step 10-7 obtain the topic of the document from the topic distribution of the words in the document; the specific formula is as follows:
step 10-8 end.
Fig. 2 is a detailed description of steps 4 and 5 in fig. 1.
step 4-1 initialize the node set V, the edge set E, and the edge attribute set R;
step 4-2 for every word w_i in document d_k = {w_1, w_2, ..., w_m}: if the word does not appear in the set V, add it to V;
step 4-3 for every word pair (w_i, w_j) in document d_k: if the edge e_ij is not in the set E, add it to E, add the attribute pair r_ij = <S_ij, s_ij> to the set R, and let S_ij = {k}; if the edge relation already exists in the set E, add the document number k to S_ij in the set R.
where v_i denotes the word vector corresponding to w_i.
where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i.
Against the problem that traditional word-pair topic models treat word pairs of different importance equally, words with tighter semantic relations have a higher probability of belonging to the same topic, and the invention measures the semantic relations of words by introducing word embeddings trained on an external corpus. Using this information as prior knowledge of the topic-word distribution lets the model attach more importance to word pairs with greater semantic similarity. Against the problem that traditional word-pair topic models are affected by the topic-word quality of high-frequency words, the invention assumes that words appearing in most documents represent topics weakly, and introduces the IDF index together with semantic similarity as prior knowledge of the word distribution. The invention also provides a novel semantic word network construction method, which makes the word network attend more comprehensively to the topic connections between words, and provides, on the basis of this network, a basic unit with tighter topic connection, the semantic word triangle structure, which serves as the topic mining unit to obtain higher topic quality.
In summary, the short text topic mining method based on the semantic word network comprehensively considers external semantic information, contextual word frequency information, and the word triangle structure, and provides a new solution to the feature sparsity problem that short text topic models face during mining.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.
Claims (4)
1. A short text topic mining method based on a semantic word network is characterized by comprising the following steps:
step 1, model initialization stage: collecting an external corpus of the related field; preprocessing the external corpus and the target corpus so that their texts are converted into a format acceptable to a word2vec model; training the word2vec model with the external corpus as input so that it outputs the specified word vectors; extracting the word vector data of the target corpus with the trained word2vec model;
step 2, a subject unit construction stage:
step 2)-a generating a basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}; the specific steps are as follows:
step 2) -a-1) establishing a point set V, an edge set E and an edge attribute set R, wherein the initial state is null;
step 2)-a-2) for every word w_i in document d_k = {w_1, w_2, ..., w_m}: if the word w_i does not appear in the set V, add it to V, where k ∈ {1, 2, ..., n};
step 2)-a-3) for every word pair (w_i, w_j) in document d_k: if the edge e_ij is not in the set E, add it to E and add the edge attribute r_ij = <S_ij, s_ij> to the set R, where S_ij denotes the set of numbers of the documents containing the word pair and s_ij denotes the semantic similarity attribute between w_i and w_j; let S_ij = {k}; if e_ij already exists in the set E, add the document number k to the document number set S_ij in the edge attribute r_ij;
2) b, fusing semantic information on the basis of the word co-occurrence network to construct a semantic word network, and specifically comprising the following steps:
step 2)-b-1) comparing the words of the target corpus against the word vector data of the external corpus; for words of the target corpus that are not registered in the external corpus, set the corresponding word vector to empty, i.e. they carry no semantic information in the subsequent steps;
step 2) -b-2) setting a threshold value delta;
step 2)-b-3) for each word pair w_i and w_j in the word co-occurrence network, calculating the semantic similarity between the word pair according to the following (cosine) formula: s_ij = sim(w_i, w_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖), where v_i and v_j respectively denote the word vectors corresponding to w_i and w_j;
step 2)-b-4) judging, for each word pair w_i and w_j, whether there is an edge connecting them; if yes, go to step 2)-b-5); otherwise, go to step 2)-b-6);
step 2)-b-5) writing the semantic similarity s_ij into the edge attribute r_ij = <S_ij, s_ij>;
step 2)-b-6) judging whether the semantic similarity s_ij satisfies s_ij > δ; if yes, go to step 2)-b-7); otherwise, performing no operation on the word pair;
step 2)-b-7) adding the edge e_ij to the edge set E and the edge attribute r_ij = <S_ij, s_ij> to the edge attribute set R, letting s_ij = sim(w_i, w_j);
step 2)-c for each word w_i in the semantic word network, calculating the inverse document frequency by the formula:
idf(w_i) = log( N_D / |{d ∈ D : w_i ∈ d}| )
where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
step 2)-d searching the semantic word network for semantic word triangles that meet the following condition:
the three word nodes of a semantic word triangle are pairwise connected by edges and come from the connecting parts of different document sub-networks;
step 3, model training stage: randomly initializing the topic distribution of every semantic word triangle obtained in step 2; obtaining the current topic distribution of the semantic word triangles through Gibbs sampling, calculating the document-topic distribution and topic-word distribution update parameters, and iterating until the maximum number of iterations is reached or the Gibbs sampling converges; taking the final Gibbs sampling result as the semantic word triangle topic distribution;
step 4, result output stage: calculating the topic distribution of the original documents from the semantic word triangle topic distribution obtained in step 3.
2. The method for mining short text topics based on a semantic word network as claimed in claim 1, wherein the specific steps of searching for semantic word triangles in step 2)-d include:
step 2)-d-1) for any three words w_i, w_j, w_k in the set V, judging whether an edge exists between each pair of nodes, i.e. whether e_ij, e_jk, e_ik ∈ E; if yes, go to step 2)-d-2);
step 2)-d-2) judging whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk is satisfied; if yes, go to step 2)-d-3);
step 2)-d-4) generating the semantic word triangle t = (w_i, w_j, w_k, l_ijk).
3. The method for mining short text topics based on a semantic word network as claimed in claim 2, wherein in step 3 the Gibbs sampling process is as follows:
step 3) -a-1) randomly initializing a theme for each semantic word triangle;
step 3)-a-2) selecting a suitable number of iterations T and initializing pt = 0;
step 3)-a-3) judging whether pt is less than T: if yes, go to step 3)-a-4); if not, go to step 3)-a-12);
step 3)-a-4) randomly selecting a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculating the Dirichlet distribution hyperparameters β_m, β_n, β_l of the word triangle from the expansion information; the specific formula is as follows:
where ε is a constant set to prevent the β values from being too small;
step 3)-a-5) calculating the topic distribution in the environment after the word triangle t_q is removed from the model; the formula is as follows:
where h denotes a topic number, K the total number of topics, F the total number of words in the corpus, z_q the topic of the word triangle t_q, U the full set of semantic word triangles, and Z_-q the topic distribution after the word triangle t_q is removed; P(z_q = h | U, Z_-q) is the probability that the topic of t_q is h, derived from the topic distribution of all word triangles other than t_q; n_h^(-q) denotes the number of word triangles belonging to topic z_h after t_q is removed, and n_s^(-q) the number of word triangles belonging to topic z_s after t_q is removed; n_{h,m}^(-q), n_{h,n}^(-q), n_{h,l}^(-q) and n_{h,f}^(-q) respectively denote the frequencies of the words w_m, w_n, w_l and w_f in topic z_h after t_q is removed; α is the prior hyperparameter of the document-topic distribution and β the hyperparameter of the topic-word distribution excluding the current word triangle; α and β are model input parameters;
step 3)-a-6) sampling a topic according to the conditional probability distribution P(z_q = h | U, Z_-q);
step 3) -a-7) updating the 'document-subject' distribution parameter, wherein the updating formula is as follows:
wherein the content of the first and second substances,denotes the topic as zhNumber of documents, NBRepresenting the total number of documents in the corpus;
step 3)-a-8) update the distribution parameter of word w_m of the word triangle under topic z_h, according to the formula:
wherein n(w_m | z_h) denotes the number of occurrences of word w_m under topic z_h, n(w | z_h) denotes the number of occurrences of word w under topic z_h, and β is a Dirichlet distribution hyper-parameter;
step 3)-a-9) update the distribution parameter of word w_n of the word triangle under topic z_h; the update formula is as follows:
wherein n(w_n | z_h) denotes the number of occurrences of word w_n under topic z_h;
step 3)-a-10) update the distribution parameter of word w_l of the word triangle under topic z_h; the update formula is as follows:
wherein n(w_l | z_h) denotes the number of occurrences of word w_l under topic z_h;
step 3)-a-11) set pt = pt + 1 and judge whether pt is smaller than T: if yes, go to step 3)-a-4); if not, go to step 3)-a-12);
step 3)-a-12) finish the model training.
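Steps 3)-a-3) through 3)-a-12) describe a collapsed Gibbs sampling sweep over word triangles. The loop can be sketched as follows; the toy triangle list, the hyper-parameter values, and the single uniform β (standing in for the word-specific β_m, β_n, β_l of step 3)-a-4)) are illustrative assumptions, not the patented implementation:

```python
import random
from collections import defaultdict

def gibbs_train(triangles, K, alpha, beta, T, seed=0):
    """Collapsed Gibbs sampling over word triangles (sketch).

    triangles: list of (w_m, w_n, w_l) word-index tuples
    K: number of topics; alpha, beta: Dirichlet hyper-parameters
    T: number of sampling sweeps
    """
    rng = random.Random(seed)
    F = max(w for t in triangles for w in t) + 1   # vocabulary size
    n_topic = [0] * K                              # triangles per topic
    n_word = [defaultdict(int) for _ in range(K)]  # word counts per topic
    n_sum = [0] * K                                # total word count per topic
    z = []
    # random topic initialization
    for t in triangles:
        h = rng.randrange(K)
        z.append(h)
        n_topic[h] += 1
        for w in t:
            n_word[h][w] += 1
            n_sum[h] += 1
    for _ in range(T):
        for q, t in enumerate(triangles):
            # remove triangle q from the counts (the Z_-q environment)
            h = z[q]
            n_topic[h] -= 1
            for w in t:
                n_word[h][w] -= 1
                n_sum[h] -= 1
            # conditional probability for each topic (BTM-style, three words)
            probs = []
            for k in range(K):
                p = n_topic[k] + alpha
                for w in t:
                    p *= (n_word[k][w] + beta) / (n_sum[k] + F * beta)
                probs.append(p)
            # sample a new topic and restore the counts
            h = rng.choices(range(K), weights=probs)[0]
            z[q] = h
            n_topic[h] += 1
            for w in t:
                n_word[h][w] += 1
                n_sum[h] += 1
    return z, n_topic, n_word

z, n_topic, n_word = gibbs_train([(0, 1, 2), (0, 1, 3), (4, 5, 6)],
                                 K=2, alpha=1.0, beta=0.1, T=50)
print(len(z), sum(n_topic))  # → 3 3
```

The counts are conserved across sweeps, so the number of assigned triangles always equals the input size regardless of the random seed.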
4. The short text topic mining method based on a semantic word network as claimed in claim 3, wherein the specific steps of the original-document topic inference in step 4) comprise:
step 4)-a-1) split each original document in the target corpus into a set of word pairs;
step 4)-a-2) judge whether the word pairs obtained in step 4)-a-1) have at least one related semantic word triangle: if yes, go to step 4)-a-3); if not, go to step 4)-a-5);
step 4)-a-3) calculate the probability of semantic word triangle t_q in document d; the specific formula is as follows:
wherein n_d(t_q) represents the frequency of semantic word triangle t_q in the semantic word triangle set d_t of document d;
step 4)-a-4) calculate the topic distribution of semantic word triangle t_q using the Bayesian formula; the specific formula is as follows:
wherein P(z_h) denotes the probability that the topic is z_h, P(w_m | z_h) represents the probability that word w_m occurs in topic z_h, P(w_n | z_h) represents the probability that word w_n occurs in topic z_h, and P(w_l | z_h) represents the probability that word w_l occurs in topic z_h; P(z_p) denotes the probability that the topic is z_p, and P(w_m | z_p), P(w_n | z_p) and P(w_l | z_p) represent the probabilities that words w_m, w_n and w_l occur in topic z_p;
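The Bayes formula itself is an image in the source. Given the quantities defined above, and assuming the three words of a triangle are conditionally independent given the topic, it presumably takes the standard form:

```latex
P(z_h \mid t_q) =
\frac{P(z_h)\, P(w_m \mid z_h)\, P(w_n \mid z_h)\, P(w_l \mid z_h)}
     {\sum_{p=1}^{K} P(z_p)\, P(w_m \mid z_p)\, P(w_n \mid z_p)\, P(w_l \mid z_p)}
```

This reconstruction is a sketch consistent with the listed terms, not the patent's own rendering.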
step 4)-a-5) calculate the topic distribution of the document; the specific formula is as follows:
wherein |d_t| represents the size of the semantic word triangle set of document d;
step 4)-a-6) calculate the occurrence probability of word w_i in document d; the specific formula is as follows:
wherein n_d(w_i) represents the frequency of word w_i in the semantic word triangle set d_t of document d, n_d(w_j) represents the frequency of word w_j in the semantic word triangle set d_t of document d, and n_d represents the total number of words in document d;
step 4)-a-7) calculate the topic distribution of word w_i according to the global topic distribution and the word distribution under each topic; the specific formula is as follows:
wherein P(z) represents the probability that the topic of the document is z, P(w_i | z) represents the probability that word w_i occurs in topic z, and P(w_j | z) represents the probability that word w_j occurs in topic z;
step 4)-a-8) obtain the topic of the document according to the topic distribution of the words in the document; the specific formula is as follows:
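The inference chain of steps 4)-a-3) through 4)-a-5) (triangle probability in the document, Bayes posterior per triangle, and a frequency-weighted mixture over the document's triangles) can be sketched as follows; the toy distributions p_z and p_w_given_z and the smoothing constant are hypothetical illustrations, not values from the patent:

```python
from collections import Counter

def triangle_topic_dist(tri, p_z, p_w_given_z):
    """P(z_h | t_q) via Bayes, words conditionally independent given the topic."""
    K = len(p_z)
    scores = []
    for h in range(K):
        s = p_z[h]
        for w in tri:
            s *= p_w_given_z[h].get(w, 1e-12)  # smooth words unseen under a topic
        scores.append(s)
    total = sum(scores)
    return [s / total for s in scores]

def document_topic_dist(doc_triangles, p_z, p_w_given_z):
    """P(z | d): mixture of triangle posteriors weighted by P(t_q | d)."""
    counts = Counter(doc_triangles)
    n = len(doc_triangles)
    K = len(p_z)
    dist = [0.0] * K
    for tri, c in counts.items():
        p_tq = c / n                                # P(t_q | d), step 4)-a-3)
        post = triangle_topic_dist(tri, p_z, p_w_given_z)
        for h in range(K):
            dist[h] += p_tq * post[h]
    return dist

# hypothetical global distributions for two topics over a toy vocabulary
p_z = [0.5, 0.5]
p_w_given_z = [{0: 0.4, 1: 0.4, 2: 0.2}, {3: 0.5, 4: 0.3, 5: 0.2}]
dist = document_topic_dist([(0, 1, 2), (0, 1, 2), (3, 4, 5)], p_z, p_w_given_z)
print([round(p, 3) for p in dist])  # → [0.667, 0.333]
```

Two of the three triangles are strongly associated with topic 0, so the document's distribution lands at roughly two thirds on topic 0, as step 4)-a-5)'s weighted mixture implies.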
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910400416.5A CN110134958B (en) | 2019-05-14 | 2019-05-14 | Short text topic mining method based on semantic word network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134958A CN110134958A (en) | 2019-08-16 |
CN110134958B true CN110134958B (en) | 2021-05-18 |
Family
ID=67574004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910400416.5A Active CN110134958B (en) | 2019-05-14 | 2019-05-14 | Short text topic mining method based on semantic word network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134958B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061866B (en) * | 2019-08-20 | 2024-01-02 | 河北工程大学 | Barrage text clustering method based on feature expansion and T-oBTM |
CN111339289B (en) * | 2020-03-06 | 2022-10-28 | 西安工程大学 | Topic model inference method based on commodity comments |
CN111723563B (en) * | 2020-05-11 | 2023-09-26 | 华南理工大学 | Topic modeling method based on word co-occurrence network |
CN112183108B (en) * | 2020-09-07 | 2021-06-22 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Inference method, system, computer equipment and storage medium for short text topic distribution |
CN112487185B (en) * | 2020-11-27 | 2022-12-30 | 国家电网有限公司客户服务中心 | Data classification method in power customer field |
CN116432639B (en) * | 2023-05-31 | 2023-08-25 | 华东交通大学 | News element word mining method based on improved BTM topic model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955948A (en) * | 2016-04-22 | 2016-09-21 | 武汉大学 | Short text topic modeling method based on word semantic similarity |
CN108197144A (en) * | 2017-11-28 | 2018-06-22 | 河海大学 | A hot topic discovery method based on BTM and Single-pass |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9177262B2 (en) * | 2013-12-02 | 2015-11-03 | Qbase, LLC | Method of automated discovery of new topics |
CN105608192A (en) * | 2015-12-23 | 2016-05-25 | 南京大学 | User-oriented short text recommendation method based on the biterm topic model |
CN106055604B (en) * | 2016-05-25 | 2019-08-27 | 南京大学 | Short text topic model mining method with feature extension based on a word network |
CN108182176B (en) * | 2017-12-29 | 2021-08-10 | 太原理工大学 | Method for enhancing semantic relevance and topic aggregation of topic words of BTM topic model |
- 2019-05-14 CN CN201910400416.5A patent/CN110134958B/en active Active
Non-Patent Citations (2)
Title |
---|
Biterm topic model based on biterm semantic expansion; Li Siyu et al.; Computer Engineering (《计算机工程》); 2019-01-31; Vol. 45, No. 1; pp. 1-7 *
Short text topic model algorithm based on word triangles; Cai Yang; China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2017-08-15; No. 8; pp. 1-49 *
Also Published As
Publication number | Publication date |
---|---|
CN110134958A (en) | 2019-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134958B (en) | Short text topic mining method based on semantic word network | |
CN110162593B (en) | Search result processing and similarity model training method and device | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
CN108710611B (en) | Short text topic model generation method based on word network and word vector | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
CN108681557B (en) | Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint | |
CN113254599A (en) | Multi-label microblog text classification method based on semi-supervised learning | |
CN106354818B (en) | Social media-based dynamic user attribute extraction method | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN111723295A (en) | Content distribution method, device and storage medium | |
CN112699240A (en) | Intelligent dynamic mining and classifying method for Chinese emotional characteristic words | |
CN112966091A (en) | Knowledge graph recommendation system fusing entity information and heat | |
CN114896377A (en) | Knowledge graph-based answer acquisition method | |
Marujo et al. | Hourly traffic prediction of news stories | |
CN111400483B (en) | Time-weighting-based three-part graph news recommendation method | |
CN112906391A (en) | Meta-event extraction method and device, electronic equipment and storage medium | |
CN117057349A (en) | News text keyword extraction method, device, computer equipment and storage medium | |
CN109871429B (en) | Short text retrieval method integrating Wikipedia classification and explicit semantic features | |
CN110263344B (en) | Text emotion analysis method, device and equipment based on hybrid model | |
CN115329078B (en) | Text data processing method, device, equipment and storage medium | |
CN108427769B (en) | Character interest tag extraction method based on social network | |
CN114491296B (en) | Proposal affiliate recommendation method, system, computer device and readable storage medium | |
CN103744830A (en) | Semantic analysis based identification method of identity information in EXCEL document | |
CN115329850A (en) | Information comparison method and device, electronic equipment and storage medium | |
Liao et al. | TIRR: A code reviewer recommendation algorithm with topic model and reviewer influence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||