CN110134958B - Short text topic mining method based on semantic word network - Google Patents


Info

Publication number
CN110134958B
CN110134958B (application CN201910400416.5A)
Authority
CN
China
Prior art keywords
word
topic
semantic
triangle
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910400416.5A
Other languages
Chinese (zh)
Other versions
CN110134958A (en)
Inventor
张雷 (Zhang Lei)
经伟 (Jing Wei)
蔡洋 (Cai Yang)
陆恒杨 (Lu Hengyang)
徐鸣 (Xu Ming)
王崇骏 (Wang Chongjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201910400416.5A priority Critical patent/CN110134958B/en
Publication of CN110134958A publication Critical patent/CN110134958A/en
Application granted granted Critical
Publication of CN110134958B publication Critical patent/CN110134958B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a short text topic mining method based on a semantic word network, which comprises the following stages: 1) a model initialization stage: collecting an external corpus of the related field, preprocessing the corpora, setting parameters, and the like; 2) a topic unit construction stage: constructing the semantic word network, searching for the specific word triangle structures, calculating the model prior parameters, and the like; 3) a model training stage: sampling the model variables with the Gibbs sampling method and judging whether the model has reached the convergence condition; 4) a result output stage: obtaining the topic distribution of each word triangle from the sampling results of the variables after model training is finished, and further calculating the topic distribution of the original documents. The method combines semantic information learned from an external corpus with the word triangle topic structure and applies them to short text topic mining; compared with the traditional word-pair topic model, it provides a solution for integrating external prior knowledge into a traditional topic model and obviously improves the quality of the mined topics.

Description

Short text topic mining method based on semantic word network
Technical Field
The invention relates to a short text topic mining method, in particular to a short text topic mining method based on a semantic word network, which addresses the low topic quality that common topic mining methods suffer when short text features are sparse.
Background
With the ever-accelerating pace of social development and the short, fast user experience brought by intelligent mobile terminals, communication on the network tends more and more toward fragmentation. Short text data has therefore become increasingly important in today's network information interaction: social network statuses, microblog messages, traditional news headlines, short video titles, question-and-answer websites and the like all take the form of short texts, and short text data is generated and accumulated at great speed with the rise of platforms such as Weibo, Facebook and Twitter. Mining topic information from massive short text data thus has significant value; public opinion analysis, information retrieval, personalized recommendation and user interest clustering are all application directions of topic mining. On the other hand, traditional text mining methods have difficulty mining the topic information of short texts, mainly because the word co-occurrence information in short texts is very sparse.
At present, solutions to the sparsity of short text features generally exploit word co-occurrence relations. Such solutions rest on an assumption: word pairs that co-occur in the same short text are topically related. For example, two models are commonly used in short text topic mining: the word-pair (biterm) topic model and the word network topic model. The former takes word pairs formed from co-occurring words as the basic topic units, while the latter forms a pseudo-document for each word from its co-occurring words to assist in discovering the topic of the corresponding word. These methods ignore the semantic relations between words: for example, "vacation" and "holiday" are two words with very close semantics, and the word pair they form should contribute more to the topic than a common co-occurring pair, yet it is ignored by the common models because the two words rarely co-occur in the same short text.
A word vector is a representation of a word in a computer; based on this representation, words can be fed directly into a model as features, which greatly facilitates natural language processing. Compared with the traditional one-hot word vector, the distributed word vector has a lower and more controllable dimensionality, and because it is trained on a large amount of external corpus data with a neural language model, the semantic information it contains is richer. The invention exploits the strength of distributed word vectors in representing semantics: it measures the semantic similarity of words with word vectors and adds this similarity to a word triangle topic model as prior knowledge, providing a new solution for short text topic mining.
Disclosure of Invention
The purpose of the invention is as follows: the technical problem to be solved is that when traditional topic models counter the feature sparsity of short text data by considering word co-occurrence information, the mined topic quality is not high enough, owing to the noise information introduced and the semantic information ignored. The invention discloses a method that mines topics by introducing external semantic information and fusing it with word co-occurrence information to construct a semantic word network, comprising the following steps: first, collect external corpora from related fields and train word vectors with a word2vec model; then, traverse the target corpus and combine it with the word vector information to generate a semantic word network, and select the specific word triangle structures in the semantic word network; then, sample the parameters with the Gibbs sampling method, iterating repeatedly until convergence; finally, calculate the topic distribution of the word triangles from the sampling results, and further calculate the topic distribution of the documents in the target corpus.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
A short text topic mining method based on a semantic word network comprises the following steps:
step 1, model initialization stage: collect external corpora of related fields to construct an external corpus; preprocess the external corpus and the target corpus to convert their texts into a format accepted by the word2vec model; train the word2vec model with the external corpus as input so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model; a code sketch follows;
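As an illustrative sketch of this initialization stage, the snippet below trains a word2vec model on a pre-tokenized external corpus with gensim and looks up vectors for the words of the target corpus; the file name and the hyper-parameter values (vector size 100, window 5, minimum count 5) are assumptions for illustration, not values fixed by the invention.

```python
# Minimal sketch of step 1, assuming the external corpus is already
# segmented into one document per line with space-separated tokens.
# File name and hyper-parameter values are illustrative assumptions.
from gensim.models import Word2Vec

with open("external_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

def vector_of(word):
    """Word vector lookup for the target corpus; words not registered
    in the external corpus get None (no semantic information)."""
    return w2v.wv[word] if word in w2v.wv else None
```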
step 2, topic unit construction stage:
step 2)-a, generate a basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}, with the following specific steps (a code sketch is given after step 2)-a-3)):
step 2)-a-1) establish a node set V, an edge set E and an edge attribute set R, all initially empty;
step 2)-a-2) for every word w_i in document d_k = {w_1, w_2, ..., w_m}, where k ∈ {1, 2, ..., n}: if the word w_i does not appear in the set V, add it to V;
step 2)-a-3) for every word pair (w_i, w_j) in document d_k: if the edge e_ij is not in the set E, add it to E and add the attribute pair r_ij = <S_ij, s_ij> to the set R, where S_ij denotes the set of the numbers of the documents containing the word pair and s_ij denotes the semantic similarity attribute between the words w_i and w_j, and let S_ij = {k}; if e_ij is already present in the set E, add the document number k to the document number set S_ij in the edge attribute r_ij;
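A minimal sketch of steps 2)-a-1) to 2)-a-3): the network is held in plain Python containers, with undirected edges as frozensets and the attribute pair r_ij stored as [S_ij, s_ij]; the similarity slot is left unset here and is filled in step 2)-b.

```python
# Sketch of step 2)-a: basic word co-occurrence network from tokenized docs.
from itertools import combinations

def build_cooccurrence_network(docs):
    V = set()        # word node set
    E = set()        # edge set; an edge is the unordered pair {w_i, w_j}
    R = {}           # edge attributes: edge -> [S_ij, s_ij]
    for k, doc in enumerate(docs):
        V.update(doc)                                  # step 2)-a-2)
        for w_i, w_j in combinations(sorted(set(doc)), 2):
            e = frozenset((w_i, w_j))
            if e not in E:                             # step 2)-a-3)
                E.add(e)
                R[e] = [{k}, None]                     # S_ij = {k}
            else:
                R[e][0].add(k)                         # add doc number k
    return V, E, R
```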
step 2)-b, fuse semantic information on the basis of the word co-occurrence network to construct the semantic word network, with the following specific steps (a code sketch is given after step 2)-b-7)):
step 2)-b-1) compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus that are not registered in the external corpus, set the corresponding word vector to empty, i.e., such words carry no semantic information in the subsequent steps;
step 2) -b-2) setting a threshold value delta;
step 2)-b-3) for each pair of word nodes w_i and w_j in the word co-occurrence network, calculate the semantic similarity between the word pair as the cosine similarity of the corresponding word vectors:

$$\mathrm{sim}(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lVert \vec{w}_i \rVert \, \lVert \vec{w}_j \rVert}$$

where $\vec{w}_i$ and $\vec{w}_j$ denote the word vectors corresponding to the words w_i and w_j, respectively;
step 2)-b-4) judge whether there is an edge connection between each pair of word nodes w_i and w_j; if yes, go to step 2)-b-5); otherwise, go to step 2)-b-6);
step 2)-b-5) record the semantic similarity s_ij into the edge attribute r_ij = <S_ij, s_ij>, where S_ij is the set of co-occurrence document numbers of the original word pair;
step 2)-b-6) judge whether the semantic similarity satisfies s_ij > δ; if yes, go to step 2)-b-7); otherwise, perform no operation for this word pair;
step 2)-b-7) add the edge e_ij to the edge set E and add the attribute pair r_ij = <S_ij, s_ij> to the edge attribute set R, letting S_ij = ∅ (a purely semantic edge has no co-occurrence documents) and s_ij = sim(w_i, w_j);
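The following sketch implements steps 2)-b-3) to 2)-b-7) on top of the structures above, computing the similarity as the cosine of the word vectors (our reading of the formula) and adding purely semantic edges with an empty document-number set when the similarity exceeds δ.

```python
# Sketch of step 2)-b: fuse semantic information into the network.
import numpy as np

def add_semantic_information(V, E, R, vector_of, delta):
    def sim(u, v):                                   # cosine similarity
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    words = sorted(V)
    for i, w_i in enumerate(words):
        for w_j in words[i + 1:]:
            u, v = vector_of(w_i), vector_of(w_j)
            if u is None or v is None:
                continue                             # no semantic information
            s_ij = sim(u, v)
            e = frozenset((w_i, w_j))
            if e in E:
                R[e][1] = s_ij                       # step 2)-b-5)
            elif s_ij > delta:                       # step 2)-b-6)
                E.add(e)                             # step 2)-b-7)
                R[e] = [set(), s_ij]                 # S_ij is empty
```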
step 2)-c, for each word w_i in the semantic word network, calculate the inverse document frequency according to the formula:

$$\mathrm{idf}(w_i) = \log \frac{N_D}{|\{d \in D : w_i \in d\}|}$$

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus; a code sketch follows;
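A direct transcription of the IDF formula above; docs is the list of tokenized documents of the target corpus.

```python
# Sketch of step 2)-c: inverse document frequency of a word.
import math

def idf(word, docs):
    n_w = sum(1 for doc in docs if word in doc)   # |{d in D : w_i in d}|
    return math.log(len(docs) / n_w) if n_w > 0 else 0.0
```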
step 2) -d searching a semantic word triangle meeting the following conditions in the semantic word network:
the three word nodes in the semantic word triangle are pairwise connected by edges and come from the connecting parts of different document sub-networks;
step 3, model training stage: randomly initialize the topic assignment of every semantic word triangle obtained in step 2; obtain the topic assignment of the current semantic word triangle through Gibbs sampling, calculate the document-topic distribution and topic-word distribution and update the parameters; iterate in a loop until the maximum number of iterations is reached or the Gibbs sampling converges; take the finally obtained Gibbs sampling result as the semantic word triangle topic distribution;
step 4, result output stage: calculate the topic distribution of the original documents from the semantic word triangle topic distribution obtained in step 3.
Further, the specific steps of searching the semantic word triangle in the steps 2) -d include:
step 2)-d-1) for any three words w_i, w_j, w_k in the set V, judge whether every pair of nodes is connected by an edge, i.e., whether e_ij, e_jk, e_ik ∈ E; if yes, go to step 2)-d-2);
step 2)-d-2) judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if yes, go to step 2)-d-3);
step 2)-d-3) calculate the word triangle prior knowledge l_ijk, where γ_ijk = (γ_ij + γ_ik + γ_jk)/3 and γ_ij, γ_ik, γ_jk are calculated as described above;
step 2)-d-4) generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk); a code sketch of the triangle search follows.
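A brute-force sketch of the triangle search of steps 2)-d-1) to 2)-d-4). The pairwise term γ is passed in as a function because the patent's closed form for l_ijk is not reproduced here; the triangle filter and the averaging of γ follow the text.

```python
# Sketch of step 2)-d: enumerate semantic word triangles.
from itertools import combinations

def find_word_triangles(V, E, R, gamma):
    triangles = []
    for w_i, w_j, w_k in combinations(sorted(V), 3):
        edges = [frozenset(p) for p in ((w_i, w_j), (w_j, w_k), (w_i, w_k))]
        if not all(e in E for e in edges):
            continue                                   # step 2)-d-1)
        doc_sets = [frozenset(R[e][0]) for e in edges]
        if len(set(doc_sets)) != 3:
            continue                                   # step 2)-d-2)
        g = (gamma(w_i, w_j) + gamma(w_i, w_k) + gamma(w_j, w_k)) / 3
        triangles.append((w_i, w_j, w_k, g))           # step 2)-d-4)
    return triangles
```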
Further, in step 3, the Gibbs sampling specifically comprises the following steps:
step 3)-a-1) initialize the sampling algorithm platform and construct a program for sampling from the conditional probability distribution, for use by the SWTTM model;
step 3)-a-2) randomly initialize a topic for each semantic word triangle;
step 3)-a-3) select a suitable number of iterations T and initialize the counter: pt = 0;
step 3)-a-4) judge whether pt is less than T: if yes, go to step 3)-a-5); if not, go to step 3)-a-13);
step 3)-a-5) randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculate the Dirichlet distribution hyper-parameters β_m, β_n, β_l of the word triangle from the expansion information l_mnl, where a small constant ε is added to prevent the β values from becoming too small;
step 3)-a-6) calculate the topic distribution of the environment after the word triangle t_q is removed from the model:

$$P(z_q = k \mid T, Z_{-q}) \propto (n_{z_k,-q} + \alpha) \cdot \frac{(n_{w_m|z_k,-q} + \beta_m)(n_{w_n|z_k,-q} + \beta_n)(n_{w_l|z_k,-q} + \beta_l)}{\prod_{c=0}^{2} \Big( \sum_{w=1}^{V} n_{w|z_k,-q} + \sum_{w=1}^{V} \beta_w + c \Big)}$$

where k denotes the topic number, K the total number of topics, V the total number of words in the corpus, z_q the topic of word triangle t_q, T the full set of semantic word triangles, and Z_{-q} the topic assignment after removing word triangle t_q; P(z_q = k | T, Z_{-q}) is the probability that the topic of word triangle t_q is k given the topics of all other word triangles; n_{z_k,-q} denotes the number of word triangles belonging to topic z_k after removing t_q, and n_{w_m|z_k,-q} the frequency of word w_m in topic z_k after removing t_q; α is the document-topic prior distribution hyper-parameter and β is the topic-word distribution hyper-parameter excluding the current word triangle; α and β are model input parameters;
step 3)-a-7) sample a topic according to the conditional probability distribution P(z_q = k | T, Z_{-q});
step 3)-a-8) update the "document-topic" distribution parameter according to:

$$\theta_k = \frac{n_{z_k} + \alpha}{N_B + K\alpha}$$

where n_{z_k} denotes the number of documents with topic z_k and N_B denotes the total number of documents in the corpus;
step 3)-a-9) update the distribution parameter of word w_m in the word triangle under topic z_k according to:

$$\phi_{k,w_m} = \frac{n_{w_m|z_k} + \beta_m}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k and β is the Dirichlet distribution hyper-parameter;
step 3)-a-10) update the distribution parameter of w_n in the word triangle under topic z_k according to:

$$\phi_{k,w_n} = \frac{n_{w_n|z_k} + \beta_n}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k;
step 3)-a-11) update the distribution parameter of w_l in the word triangle under topic z_k according to:

$$\phi_{k,w_l} = \frac{n_{w_l|z_k} + \beta_l}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k;
step 3)-a-12) let pt = pt + 1 and judge whether pt is less than T: if yes, go to step 3)-a-5); if not, go to step 3)-a-13);
step 3)-a-13) the model training is finished; a compact code sketch of the training loop follows.
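The loop below is a compact, runnable sketch of the whole training stage: a collapsed Gibbs sampler in the style of the biterm topic model, extended from word pairs to word triangles per the conditional probability given in step 3)-a-6). For brevity it uses one symmetric β instead of the per-word β_m, β_n, β_l derived from the triangle prior knowledge; that simplification is ours, not the patent's.

```python
# Sketch of step 3: collapsed Gibbs sampling over word triangles.
# triangles: list of 3-tuples of word ids in range(vocab_size).
import random
from collections import defaultdict

def gibbs_train(triangles, vocab_size, K, alpha, beta, iterations):
    n_z = [0] * K                    # word triangles per topic
    n_wz = defaultdict(int)          # (word id, topic) occurrence counts
    n_z_total = [0] * K              # word tokens per topic
    z = [0] * len(triangles)

    def add(q, k, d):                # add/remove triangle q to/from topic k
        n_z[k] += d
        for w in triangles[q]:
            n_wz[(w, k)] += d
            n_z_total[k] += d

    for q in range(len(triangles)):  # random topic initialization
        z[q] = random.randrange(K)
        add(q, z[q], +1)

    for _ in range(iterations):
        for q in range(len(triangles)):
            add(q, z[q], -1)         # condition on all other triangles
            weights = []
            for k in range(K):
                p = n_z[k] + alpha
                den = n_z_total[k] + vocab_size * beta
                for c, w in enumerate(triangles[q]):
                    p *= (n_wz[(w, k)] + beta) / (den + c)
                weights.append(p)
            z[q] = random.choices(range(K), weights=weights)[0]
            add(q, z[q], +1)
    return z, n_z, n_wz, n_z_total
```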
Further, the specific steps of the original document topic inference in step 4 comprise:
step 4) -a-1) splitting each original document in the target corpus into a word pair set;
step 4)-a-2) judge whether the word pairs have at least one associated semantic word triangle; if yes, go to step 4)-a-3); if not, go to step 4)-a-6);
step 4)-a-3) calculate the probability of the semantic word triangle t_q in document d:

$$P(t_q \mid d) = \frac{n_d(t_q)}{\sum_{t \in d_t} n_d(t)}$$

where n_d(t_q) denotes the frequency of the semantic word triangle t_q in the semantic word triangle set d_t of document d;
step 4)-a-4) calculate the topic distribution of the semantic word triangle t_q with the Bayesian formula:

$$P(z_k \mid t_q) = \frac{P(z_k)\,P(w_m|z_k)\,P(w_n|z_k)\,P(w_l|z_k)}{\sum_{p=1}^{K} P(z_p)\,P(w_m|z_p)\,P(w_n|z_p)\,P(w_l|z_p)}$$
step 4)-a-5) calculate the topic distribution of the document:

$$P(z_k \mid d) = \sum_{t_q \in d_t} P(z_k \mid t_q)\,P(t_q \mid d)$$

where |d_t| denotes the size of the semantic word triangle set of document d;
step 4)-a-6) calculate the probability of the word w_i occurring in document d:

$$P(w_i \mid d) = \frac{n_d(w_i)}{\sum_{w_j \in d} n_d(w_j)}$$
step 4)-a-7) calculate the topic distribution of the word w_i from the global topic distribution and the word distributions under the topics:

$$P(z \mid w_i) = \frac{P(z)\,P(w_i \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')}$$
step 4)-a-8) obtain the topic of the document from the topic distributions of the words in the document:

$$P(z \mid d) = \sum_{w_i \in d} P(z \mid w_i)\,P(w_i \mid d)$$

A code sketch of this inference follows.
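A sketch of the step-4 inference chain P(t_q|d) → P(z_k|t_q) → P(z_k|d) for one document, assuming the per-triangle topic distributions P(z_k|t_q) have already been computed from the trained model.

```python
# Sketch of step 4: document topic distribution from its word triangles.
from collections import Counter

def document_topic_distribution(doc_triangles, p_z_given_t, K):
    """doc_triangles: the multiset of semantic word triangles found in
    the document; p_z_given_t: triangle -> list of K probabilities."""
    counts = Counter(doc_triangles)
    total = sum(counts.values())
    p_zd = [0.0] * K
    for t, n in counts.items():
        p_td = n / total                       # P(t_q | d)
        for k in range(K):
            p_zd[k] += p_z_given_t[t][k] * p_td
    return p_zd
```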
compared with the prior art, the invention has the following beneficial effects:
aiming at the problem that the traditional word pair topic model is influenced by the topic word quality of the high-frequency words, the invention assumes that the representation capability of the words appearing in most documents to the topics is weak, and introduces IDF indexes and semantic similarity together as the prior knowledge of word distribution based on the assumption, thereby relieving the influence of the high-frequency words on the topic quality. Aiming at the neglect of a plurality of word pairs with close semantic relation and less co-occurrence in the common word co-occurrence network, the invention provides a novel semantic word network construction method, so that a topic model can pay more attention to topic relation among words, and the quality of the mined topics is obviously improved.
Drawings
FIG. 1 is a flow chart of a short text topic mining method based on a semantic word network;
FIG. 2 is a flow chart of semantic word network construction and semantic word triangle finding;
fig. 3 is a probabilistic graphical model of the SWTTM algorithm.
Detailed Description
The present invention is further illustrated below in conjunction with the accompanying drawings and specific embodiments. It is to be understood that these examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading the present invention fall within the limits of the appended claims.
Fig. 1 is a flowchart of a short text topic mining method based on a semantic word network according to an embodiment of the present invention. The specific steps are described as follows:
step 0 is the starting state of the present invention;
in the model initialization phase (step 1-3):
step 1, collect external corpora of related fields, with no requirement on text length;
step 2, perform preprocessing operations such as word segmentation and screening on the external corpus and the target corpus; the main aim is to segment the texts so that the subsequent algorithm can operate on word units, with the following specific steps (a code sketch is given after step 2-3):
step 2-1) performing word segmentation processing on the two corpora respectively, and removing stop words at the same time;
step 2-2) deleting words of non-Chinese characters and non-Latin letters, and lowercase all Latin letters;
step 2-3) delete words whose frequency in the corpus is less than 5, and delete documents containing fewer than 3 words;
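A sketch of steps 2-1 to 2-3, assuming Chinese text segmented with jieba and a caller-supplied stop-word set; the regular expression keeps tokens that are entirely Chinese characters or entirely Latin letters, and lower-casing only affects the Latin ones.

```python
# Sketch of the preprocessing of steps 2-1 to 2-3.
import re
from collections import Counter
import jieba

def preprocess(raw_docs, stopwords, min_freq=5, min_len=3):
    token_re = re.compile(r"^[\u4e00-\u9fff]+$|^[A-Za-z]+$")
    docs = []
    for text in raw_docs:
        tokens = [w.lower() for w in jieba.lcut(text)
                  if w not in stopwords and token_re.match(w)]
        docs.append(tokens)
    freq = Counter(w for doc in docs for w in doc)   # corpus word frequency
    docs = [[w for w in doc if freq[w] >= min_freq] for doc in docs]
    return [doc for doc in docs if len(doc) >= min_len]
```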
step 3, setting word2vec model parameters, taking an external corpus as input, and training a model to obtain word vector data;
in the topic unit construction phase (steps 4-6):
step 4 constructs a basic co-occurrence word network from the corpus D = {d_1, d_2, ..., d_n};
step 5, fusing semantic information on the basis of the word co-occurrence network to construct a semantic word network:
step 6, searching a word triangular structure meeting the conditions in the semantic word network and calculating the word inverse document frequency;
the word triangle structure satisfies the following conditions: the three word nodes are connected with edges and are from the connected parts of different document sub-networks.
In the model training phase (step 7-8):
step 7, sample the model variables with the Gibbs sampling method, training the model on the data obtained in the preceding steps; the specific implementation process is as follows:
step 7-1, initialize the sampling algorithm platform and construct a program for sampling from the conditional probability distribution, for use by the SWTTM model; the probabilistic graphical model of the SWTTM algorithm is shown in FIG. 3;
step 7-2 randomly initializes a topic for each semantic word triangle.
step 7-3, select a suitable number of iterations T and initialize the counter: pt = 0;
step 7-4, judge whether pt is smaller than T: if yes, go to step 7-5; if not, go to step 7-13;
step 7-5, randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculate the Dirichlet distribution hyper-parameters β_m, β_n, β_l of the word triangle from the expansion information l_mnl, where ε is a constant, set according to the word vector sampling condition and manual evaluation, that prevents the β values from becoming too small.
step 7-6, calculate the topic distribution of the environment after the word triangle t_q is removed from the model:

$$P(z_q = k \mid T, Z_{-q}) \propto (n_{z_k,-q} + \alpha) \cdot \frac{(n_{w_m|z_k,-q} + \beta_m)(n_{w_n|z_k,-q} + \beta_n)(n_{w_l|z_k,-q} + \beta_l)}{\prod_{c=0}^{2} \Big( \sum_{w=1}^{V} n_{w|z_k,-q} + \sum_{w=1}^{V} \beta_w + c \Big)}$$

with the symbols defined as in step 3)-a-6) above.
step 7-7 sampling a topic according to the conditional probability distribution;
step 7-8, update the "document-topic" distribution parameter according to:

$$\theta_k = \frac{n_{z_k} + \alpha}{N_B + K\alpha}$$

where n_{z_k} denotes the number of documents with topic z_k, N_B denotes the total number of documents in the corpus, and K denotes the total number of topics.
step 7-9, update the distribution parameter of word w_m in the word triangle under topic z_k according to:

$$\phi_{k,w_m} = \frac{n_{w_m|z_k} + \beta_m}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_m|z_k} denotes the number of occurrences of word w_m under topic z_k.
step 7-10, update the distribution parameter of w_n in the word triangle under topic z_k according to:

$$\phi_{k,w_n} = \frac{n_{w_n|z_k} + \beta_n}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_n|z_k} denotes the number of occurrences of word w_n under topic z_k.
step 7-11, update the distribution parameter of w_l in the word triangle under topic z_k according to:

$$\phi_{k,w_l} = \frac{n_{w_l|z_k} + \beta_l}{\sum_{w=1}^{V} \big( n_{w|z_k} + \beta_w \big)}$$

where n_{w_l|z_k} denotes the number of occurrences of word w_l under topic z_k.
step 7-12, let pt = pt + 1 and judge whether pt is smaller than T: if yes, go to step 7-5; if not, go to step 7-13;
step 7-13, the model training is finished.
Step 8, distributing the Gibbs sampling result as a semantic word triangular theme;
in the result output stage (steps 9-10):
step 9, splitting the original document into word pairs;
step 10, calculate the topic distribution of the original document by finding the semantic word triangles associated with the word pairs, with the following specific method:
step 10-1, judge whether the word pairs have at least one associated semantic word triangle; if yes, go to step 10-2; if not, go to step 10-5;
step 10-2, calculate the probability of the semantic word triangle t_q in document d:

$$P(t_q \mid d) = \frac{n_d(t_q)}{\sum_{t \in d_t} n_d(t)}$$

where n_d(t_q) denotes the frequency of the semantic word triangle t_q in the semantic word triangle set d_t of document d.
step 10-3, calculate the topic distribution of the semantic word triangle t_q with the Bayesian formula:

$$P(z_k \mid t_q) = \frac{P(z_k)\,P(w_m|z_k)\,P(w_n|z_k)\,P(w_l|z_k)}{\sum_{p=1}^{K} P(z_p)\,P(w_m|z_p)\,P(w_n|z_p)\,P(w_l|z_p)}$$
step 10-4, calculate the topic distribution of the document:

$$P(z_k \mid d) = \sum_{t_q \in d_t} P(z_k \mid t_q)\,P(t_q \mid d)$$

where |d_t| denotes the size of the semantic word triangle set of document d.
step 10-5, calculate the probability of the word w_i occurring in document d:

$$P(w_i \mid d) = \frac{n_d(w_i)}{\sum_{w_j \in d} n_d(w_j)}$$
step 10-6, calculate the topic distribution of the word w_i from the global topic distribution and the word distributions under the topics:

$$P(z \mid w_i) = \frac{P(z)\,P(w_i \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')}$$
step 10-7, obtain the topic of the document from the topic distributions of the words in the document:

$$P(z \mid d) = \sum_{w_i \in d} P(z \mid w_i)\,P(w_i \mid d)$$
step 10-8 ends the procedure.
Step 11 is the end state.
Fig. 2 is a detailed description of steps 4 and 5 in fig. 1.
Step 12 is the start state.
Step 13 is to establish a basic co-occurrence word network, and the specific method is as follows:
step 4-1, initializing a node set V, an edge set E and an edge attribute set R
step 4-2, for each word w_i in document d_k = {w_1, w_2, ..., w_m}: if the word does not appear in the set V, add it to V;
step 4-3, for every word pair (w_i, w_j) in document d_k: if the corresponding edge is not in E, add it to E, add the attribute pair r_ij = <S_ij, s_ij> to the set R and let S_ij = {k}; if the edge relationship already exists in the set E, add the document number k to S_ij in the set R.
Step 14, obtaining word vector data in the target language material library according to the training result in the step 3, and setting a threshold value delta related to word semantics;
step 15, for each pair of word nodes in the basic co-occurrence word network, calculate the semantic similarity between the word pair as the cosine similarity of the word vectors:

$$\mathrm{sim}(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lVert \vec{w}_i \rVert \, \lVert \vec{w}_j \rVert}$$

where $\vec{w}_i$ denotes the word vector corresponding to w_i.
Step 16, judging whether edges are connected among the word pair nodes; yes, go to step 17. If not, go to step 18;
step 17, record the semantic similarity information into the edge attribute, i.e., s_ij = sim(w_i, w_j);
step 18, judge whether the word pair's semantic similarity satisfies sim(w_i, w_j) > δ; if yes, go to step 19;
step 19, add the attribute pair r_ij = <S_ij, s_ij> to the set R, letting S_ij = ∅ and s_ij = sim(w_i, w_j);
step 20, calculate the inverse document frequency of each word in the semantic word network:

$$\mathrm{idf}(w_i) = \log \frac{N_D}{|\{d \in D : w_i \in d\}|}$$

where |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i.
step 21, for any three words w_i, w_j, w_k ∈ V, judge whether every pair of nodes is connected by an edge, i.e., e_ij, e_jk, e_ik ∈ E, and judge whether the document-number sets in the edge attributes differ pairwise, i.e., S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk; if yes, go to step 22;
step 22 of calculating word triangle prior knowledge
Figure BDA0002058481130000105
step 23, generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk);
Step 24 is the end state.
Aiming at the problem that the traditional word-pair topic model treats word pairs of different importance equally, the invention assumes that words with a tighter semantic relation are more likely to belong to the same topic, and measures the semantic relation of words by introducing word embeddings trained on an external corpus; the resulting prior knowledge of the topic-word distribution makes the model attach more importance to word pairs with larger semantic similarity. Aiming at the problem that the traditional word-pair topic model is affected by the topic-word quality of high-frequency words, the invention assumes that words appearing in most documents have weak ability to represent topics, and introduces the IDF index together with semantic similarity as prior knowledge of the word distribution. The invention also provides a novel semantic word network construction method that lets the word network capture the topic connections between words more comprehensively, and on this network provides a basic unit with tighter topic connection, the semantic word triangle structure, which serves as the topic mining unit and yields higher topic quality.
In summary, the short text topic mining method based on the semantic word network comprehensively considers external semantic information, context word frequency information and word triangle structures, and provides a new solution for solving the problem of feature sparsity of a short text topic model during mining.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these are also intended to fall within the scope of the invention.

Claims (4)

1. A short text topic mining method based on a semantic word network is characterized by comprising the following steps:
step 1, model initialization stage: collect external corpora of related fields to construct an external corpus; preprocess the external corpus and the target corpus to convert their texts into a format accepted by the word2vec model; train the word2vec model with the external corpus as input so that it outputs the specified word vectors; extract the word vector data of the target corpus from the trained word2vec model;
step 2, topic unit construction stage:
step 2)-a, generate a basic word co-occurrence network from the co-occurrence relations of the words in the target corpus D = {d_1, d_2, ..., d_n}, with the following specific steps:
step 2)-a-1) establish a node set V, an edge set E and an edge attribute set R, all initially empty;
step 2)-a-2) for every word w_i in document d_k = {w_1, w_2, ..., w_m}, where k ∈ {1, 2, ..., n}: if the word w_i does not appear in the set V, add it to V;
step 2)-a-3) for every word pair (w_i, w_j) in document d_k: if the edge e_ij is not in the set E, add it to E and add the edge attribute r_ij = <S_ij, s_ij> to the set R, where S_ij denotes the set of the numbers of the documents containing the word pair and s_ij denotes the semantic similarity attribute between the words w_i and w_j, and let S_ij = {k}; if e_ij is already present in the set E, add the document number k to the document number set S_ij in the edge attribute r_ij;
step 2)-b, fuse semantic information on the basis of the word co-occurrence network to construct the semantic word network, with the following specific steps:
step 2)-b-1) compare the word vector data of the words in the target corpus and the external corpus; for words of the target corpus that are not registered in the external corpus, set the corresponding word vector to empty, i.e., such words carry no semantic information in the subsequent steps;
step 2) -b-2) setting a threshold value delta;
step 2)-b-3) for each word pair w_i and w_j in the word co-occurrence network, calculate the semantic similarity between the word pair as the cosine similarity of the corresponding word vectors:

$$\mathrm{sim}(w_i, w_j) = \frac{\vec{w}_i \cdot \vec{w}_j}{\lVert \vec{w}_i \rVert \, \lVert \vec{w}_j \rVert}$$

wherein $\vec{w}_i$ and $\vec{w}_j$ denote the word vectors corresponding to the words w_i and w_j, respectively;
step 2)-b-4) judge whether there is an edge connection between each word pair w_i and w_j; if yes, go to step 2)-b-5); otherwise, go to step 2)-b-6);
step 2)-b-5) record the semantic similarity s_ij into the edge attribute r_ij = <S_ij, s_ij>;
step 2)-b-6) judge whether the semantic similarity satisfies s_ij > δ; if yes, go to step 2)-b-7); otherwise, perform no operation for this word pair;
step 2)-b-7) add the edge e_ij to the edge set E and add the edge attribute r_ij = <S_ij, s_ij> to the edge attribute set R, letting S_ij = ∅ and s_ij = sim(w_i, w_j);
step 2)-c, for each word w_i in the semantic word network, calculate the inverse document frequency according to the formula:

$$\mathrm{idf}(w_i) = \log \frac{N_D}{|\{d \in D : w_i \in d\}|}$$

wherein |{d ∈ D : w_i ∈ d}| denotes the number of documents containing w_i and N_D denotes the total number of documents in the corpus;
step 2) -d searching a semantic word triangle meeting the following conditions in the semantic word network:
the three word nodes in the semantic word triangle are pairwise connected by edges and come from the connecting parts of different document sub-networks;
step 3, model training stage: randomly initialize the topic assignment of every semantic word triangle obtained in step 2; obtain the topic assignment of the current semantic word triangle through Gibbs sampling, calculate the document-topic distribution and topic-word distribution and update the parameters; iterate in a loop until the maximum number of iterations is reached or the Gibbs sampling converges; take the finally obtained Gibbs sampling result as the semantic word triangle topic distribution;
step 4, result output stage: calculate the topic distribution of the original documents from the semantic word triangle topic distribution obtained in step 3.
2. The method for mining short text topics based on semantic word network as claimed in claim 1, wherein: the specific steps of searching the semantic word triangle in the steps 2) -d comprise:
step 2)-d-1) for any three words w_i, w_j, w_k in the set V, judge whether every pair of nodes is connected by an edge, i.e., whether e_ij, e_jk, e_ik ∈ E; if yes, go to step 2)-d-2);
step 2)-d-2) judge whether S_ij ≠ S_ik ∧ S_ik ≠ S_jk ∧ S_ij ≠ S_jk holds; if yes, go to step 2)-d-3);
step 2)-d-3) calculate the word triangle prior knowledge l_ijk, where γ_ijk = (s_ij + s_ik + s_jk)/3;
step 2)-d-4) generate the semantic word triangle t = (w_i, w_j, w_k, l_ijk).
3. The method for mining short text topics based on a semantic word network as claimed in claim 2, wherein: in step 3, the Gibbs sampling process is as follows:
step 3)-a-1) randomly initialize a topic for each semantic word triangle;
step 3)-a-2) select a suitable number of iterations T and initialize: pt = 0;
step 3) -a-3) judging whether pt is smaller than T: if yes, turning to the step 3) -a-4); if not, go to step 3) -a-12);
step 3)-a-4) randomly select a word triangle t_q = (w_m, w_n, w_l, l_mnl) and calculate the Dirichlet distribution hyper-parameters β_m, β_n, β_l of the word triangle from the expansion information l_mnl, wherein ε is a constant added to prevent the β values from becoming too small;
step 3)-a-5) calculate the topic distribution of the environment after the word triangle t_q is removed from the model:

$$P(z_q = h \mid U, Z_{-q}) = \frac{n_{z_h,-q} + \alpha}{\sum_{s=1}^{K} n_{z_s,-q} + K\alpha} \cdot \frac{(n_{w_m|z_h,-q} + \beta_m)(n_{w_n|z_h,-q} + \beta_n)(n_{w_l|z_h,-q} + \beta_l)}{\prod_{c=0}^{2} \Big( \sum_{f=1}^{F} n_{w_f|z_h,-q} + \sum_{f=1}^{F} \beta_f + c \Big)}$$

wherein h denotes the topic number, K the total number of topics, F the total number of words in the corpus, z_q the topic of word triangle t_q, U the full set of semantic word triangles, and Z_{-q} the topic assignment after removing word triangle t_q; P(z_q = h | U, Z_{-q}) is the probability that the topic of word triangle t_q is h given the topics of all other word triangles; n_{z_h,-q} denotes the number of word triangles belonging to topic z_h after removing t_q, n_{z_s,-q} the number of word triangles belonging to topic z_s after removing t_q, n_{w_m|z_h,-q}, n_{w_n|z_h,-q} and n_{w_l|z_h,-q} the frequencies of words w_m, w_n and w_l in topic z_h after removing t_q, and n_{w_f|z_h,-q} the frequency of word w_f in topic z_h after removing t_q; α is the document-topic prior distribution hyper-parameter and β is the topic-word distribution hyper-parameter excluding the current word triangle; α and β are model input parameters;
step 3)-a-6) sample a topic according to the conditional probability distribution P(z_q = h | U, Z_{-q});
step 3)-a-7) update the "document-topic" distribution parameter according to:

$$\theta_h = \frac{n_{z_h} + \alpha}{N_B + K\alpha}$$

wherein n_{z_h} denotes the number of documents with topic z_h and N_B denotes the total number of documents in the corpus;
step 3)-a-8) update the distribution parameter of word w_m in the word triangle under topic z_h according to:

$$\phi_{h,w_m} = \frac{n_{w_m|z_h} + \beta_m}{\sum_{w=1}^{F} \big( n_{w|z_h} + \beta_w \big)}$$

wherein n_{w_m|z_h} denotes the number of occurrences of word w_m under topic z_h, n_{w|z_h} denotes the number of occurrences of word w under topic z_h, and β is the Dirichlet distribution hyper-parameter;
step 3)-a-9) update the distribution parameter of w_n in the word triangle under topic z_h according to:

$$\phi_{h,w_n} = \frac{n_{w_n|z_h} + \beta_n}{\sum_{w=1}^{F} \big( n_{w|z_h} + \beta_w \big)}$$

wherein n_{w_n|z_h} denotes the number of occurrences of word w_n under topic z_h;
step 3)-a-10) update the distribution parameter of w_l in the word triangle under topic z_h according to:

$$\phi_{h,w_l} = \frac{n_{w_l|z_h} + \beta_l}{\sum_{w=1}^{F} \big( n_{w|z_h} + \beta_w \big)}$$

wherein n_{w_l|z_h} denotes the number of occurrences of word w_l under topic z_h;
step 3) -a-11) making pt equal to pt +1, and judging whether pt is smaller than T: if yes, turning to the step 3) -a-4); if not, go to step 3) -a-12);
step 3)-a-12) the model training is finished.
4. The method for mining short text topics based on semantic word network as claimed in claim 3, wherein: the specific steps of the original document theme inference in the step 4 comprise:
step 4) -a-1) splitting each original document in the target corpus into a word pair set;
step 4)-a-2) judge whether the word pairs split in step 4)-a-1) have at least one associated semantic word triangle; if yes, go to step 4)-a-3); if not, go to step 4)-a-6);
step 4)-a-3) calculate the probability of the semantic word triangle t_q in document d:

$$P(t_q \mid d) = \frac{n_d(t_q)}{\sum_{t \in d_t} n_d(t)}$$

wherein n_d(t_q) denotes the frequency of the semantic word triangle t_q in the semantic word triangle set d_t of document d;
step 4)-a-4) calculate the topic distribution of the semantic word triangle t_q with the Bayesian formula:

$$P(z_h \mid t_q) = \frac{P(z_h)\,P(w_m|z_h)\,P(w_n|z_h)\,P(w_l|z_h)}{\sum_{p=1}^{K} P(z_p)\,P(w_m|z_p)\,P(w_n|z_p)\,P(w_l|z_p)}$$

wherein P(z_h) denotes the probability of topic z_h; P(w_m|z_h), P(w_n|z_h) and P(w_l|z_h) denote the probabilities that words w_m, w_n and w_l occur in topic z_h; P(z_p) denotes the probability of topic z_p; and P(w_m|z_p), P(w_n|z_p) and P(w_l|z_p) denote the probabilities that words w_m, w_n and w_l occur in topic z_p;
step 4)-a-5) calculate the topic distribution of the document:

$$P(z_h \mid d) = \sum_{t_q \in d_t} P(z_h \mid t_q)\,P(t_q \mid d)$$

wherein |d_t| denotes the size of the semantic word triangle set of document d;
step 4)-a-6) calculate the probability of the word w_i occurring in document d:

$$P(w_i \mid d) = \frac{n_d(w_i)}{\sum_{w_j \in d} n_d(w_j)}$$

wherein n_d(w_i) denotes the frequency of word w_i in the semantic word triangle set d_t of document d, n_d(w_j) the frequency of word w_j in d_t, and n_d the number of words in document d;
step 4)-a-7) calculate the topic distribution of the word w_i from the global topic distribution and the word distributions under the topics:

$$P(z \mid w_i) = \frac{P(z)\,P(w_i \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')}$$

wherein P(z) denotes the probability that the topic of the document is z, P(w_i|z) the probability that word w_i occurs in topic z, and P(w_j|z) the probability that word w_j occurs in topic z;
step 4)-a-8) obtain the topic of the document from the topic distributions of the words in the document:

$$P(z \mid d) = \sum_{w_i \in d} P(z \mid w_i)\,P(w_i \mid d)$$
CN201910400416.5A 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network Active CN110134958B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910400416.5A CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910400416.5A CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network

Publications (2)

Publication Number Publication Date
CN110134958A CN110134958A (en) 2019-08-16
CN110134958B (en) 2021-05-18

Family

ID=67574004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910400416.5A Active CN110134958B (en) 2019-05-14 2019-05-14 Short text topic mining method based on semantic word network

Country Status (1)

Country Link
CN (1) CN110134958B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061866B * 2019-08-20 2024-01-02 Hebei University of Engineering Barrage text clustering method based on feature expansion and T-oBTM
CN111339289B * 2020-03-06 2022-10-28 Xi'an Polytechnic University Topic model inference method based on commodity comments
CN111723563B * 2020-05-11 2023-09-26 South China University of Technology Topic modeling method based on word co-occurrence network
CN112183108B * 2020-09-07 2021-06-22 Harbin Institute of Technology (Shenzhen) Inference method, system, computer equipment and storage medium for short text topic distribution
CN112487185B * 2020-11-27 2022-12-30 State Grid Customer Service Center Data classification method in power customer field
CN116432639B * 2023-05-31 2023-08-25 East China Jiaotong University News element word mining method based on improved BTM topic model


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9177262B2 (en) * 2013-12-02 2015-11-03 Qbase, LLC Method of automated discovery of new topics
CN105608192A * 2015-12-23 2016-05-25 Nanjing University Short text recommendation method based on a user biterm topic model
CN106055604B * 2016-05-25 2019-08-27 Nanjing University Short text topic model mining method with feature extension based on a word network
CN108182176B * 2017-12-29 2021-08-10 Taiyuan University of Technology Method for enhancing semantic relevance and topic aggregation of topic words of BTM topic model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955948A * 2016-04-22 2016-09-21 Wuhan University Short text topic modeling method based on word semantic similarity
CN108197144A * 2017-11-28 2018-06-22 Hohai University Hot topic discovery method based on BTM and Single-pass

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Biterm Topic Model Based on Word-Pair Semantic Expansion; Li Siyu et al.; Computer Engineering (《计算机工程》); 2019-01-31; Vol. 45, No. 1; pp. 1-7 *
Short Text Topic Model Algorithm Based on Word Triangles; Cai Yang; China Masters' Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2017-08-15; No. 8; pp. 1-49 *

Also Published As

Publication number Publication date
CN110134958A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110134958B (en) Short text topic mining method based on semantic word network
CN110162593B (en) Search result processing and similarity model training method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN108681557B (en) Short text topic discovery method and system based on self-expansion representation and similar bidirectional constraint
CN113254599A (en) Multi-label microblog text classification method based on semi-supervised learning
CN106354818B (en) Social media-based dynamic user attribute extraction method
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN111723295A (en) Content distribution method, device and storage medium
CN112699240A (en) Intelligent dynamic mining and classifying method for Chinese emotional characteristic words
CN112966091A (en) Knowledge graph recommendation system fusing entity information and heat
CN114896377A (en) Knowledge graph-based answer acquisition method
Marujo et al. Hourly traffic prediction of news stories
CN111400483B (en) Time-weighting-based three-part graph news recommendation method
CN112906391A (en) Meta-event extraction method and device, electronic equipment and storage medium
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN110263344B (en) Text emotion analysis method, device and equipment based on hybrid model
CN115329078B (en) Text data processing method, device, equipment and storage medium
CN108427769B (en) Character interest tag extraction method based on social network
CN114491296B (en) Proposal affiliate recommendation method, system, computer device and readable storage medium
CN103744830A (en) Semantic analysis based identification method of identity information in EXCEL document
CN115329850A (en) Information comparison method and device, electronic equipment and storage medium
Liao et al. TIRR: A code reviewer recommendation algorithm with topic model and reviewer influence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant