CN113051932B - Category detection method for network media event of semantic and knowledge expansion theme model - Google Patents


Info

Publication number: CN113051932B
Authority: CN (China)
Legal status: Active
Application number: CN202110366951.0A
Other languages: Chinese (zh)
Other versions: CN113051932A (en)
Inventors: 薛峰, 缪乃阳, 张涛
Current Assignee: Hefei University of Technology
Original Assignee: Hefei University of Technology
Application filed by Hefei University of Technology
Priority to CN202110366951.0A
Publication of CN113051932A
Application granted
Publication of CN113051932B

Classifications

    • G06F40/30: Semantic analysis
    • G06F16/355: Class or cluster creation or modification
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G06F40/216: Parsing using statistical methods
    • G06F40/242: Dictionaries
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06Q50/01: Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a category detection method for network media events based on a semantic and knowledge expansion topic model, which comprises the following steps: 1. expanding the supervised topic model MedLDA to model the multi-modal data and label information of network media events in a unified model; 2. letting the multi-modal data of a network media event share one topic space, introducing internal semantics through part-of-speech tagging, and introducing external semantics through an expanded knowledge modality. By introducing the internal semantics and external knowledge of network media events, the invention effectively mines the semantic words in network media events and learns high-quality, interpretable topics, thereby realizing accurate and efficient large-scale multi-modal network media event category detection.

Description

Category detection method for network media event of semantic and knowledge expansion theme model
Technical Field
The invention belongs to the technical field of computer machine learning and artificial intelligence, and mainly relates to a network media event detection method based on a supervised topic model.
Background
With the rapid development of the mobile internet and the popularity of social networking sites, people can upload events occurring in real life through their mobile phones at any time and place and leave their own comments, so the data on social networking sites grows exponentially. When a significant event occurs, users post event-related multimedia content (e.g., text, pictures, video) to social media websites. However, user-contributed data is often noisy and unstructured, and it is difficult to analyze the network media events in it manually. Therefore, automatically organizing large amounts of social media data and mining the topics of hot network media events is particularly important for improving event analysis capability.
The mainstream methods for network media event analysis are based on topic models. PLSA and LDA are widely used for text modeling and analysis. Several supervised topic models have been developed on top of LDA; these models use label information to find better document representations. On the internet, social media consists of rich unstructured data with multiple modalities (text, pictures, video, etc.), which helps express the full meaning of network media events. Multi-modal topic models extend the original topic model's generation of text words with a visual modality that generates visual words; the words of the two modalities share the topic space of a document, making full use of the multi-modal information of network media events.
A topic model inserts the concept of a topic between documents and words, so it can learn the latent semantics in the document structure and cluster words at the topic level, achieving dimensionality reduction. The topic model assumes words follow a multinomial distribution, so it cannot distinguish the semantic differences of different words, only their word-frequency differences: the probability of generating a word in a document depends only on the frequency with which the word appears in the corpus. A topic model can thus be viewed as a complex TF model operating over the whole corpus. It models the corpus-wide word-frequency information and the latent semantics of documents well, but it is biased toward high-frequency words. As a result, learned topics always contain a large number of high-frequency words, and the words with the highest probability in a topic are sometimes meaningless or irrelevant to the event, so they cannot express the topic's meaning at all. Topic models without any human knowledge or prior semantic guidance therefore tend to produce topics that are difficult to interpret. Existing models such as mmSLDA and MMSTM ignore the rich internal semantics of network media events and the external semantics encoded in knowledge graphs, and cannot distinguish the semantic differences of words. These models focus excessively on high-frequency words, which limits further improvement of model performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a category detection method for network media events based on a semantic and knowledge expansion topic model. By introducing the internal semantics and external knowledge of network media events, the method effectively mines the semantic words in network media events and learns high-quality, interpretable topics, so as to realize accurate and efficient large-scale multi-modal network media event category detection.
The invention adopts the following technical scheme for solving the technical problems:
The invention relates to a category detection method for network media events based on a semantic and knowledge expansion topic model, which is characterized by comprising the following steps:
Step 1, acquire a dataset of network media events, and preprocess the text data of each document in the dataset with sentence segmentation, lemmatization and part-of-speech tagging, so as to construct a text dictionary;
Step 2, partition the image data corresponding to each document in the dataset into patches, take each patch as a visual word, and extract the image feature of each visual word, so as to construct a visual word dictionary;
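Steps 1 and 2 can be sketched with toy stand-ins for the preprocessing tools: a whitespace tokenizer in place of segmentation and lemmatization, and raw pixel patches in place of extracted image features. All function names are illustrative, not from the patent.

```python
# Sketch of steps 1-2: build a text dictionary from tokenized documents and a
# visual dictionary from image patches. Tokenization and patching are toy
# stand-ins for the preprocessing the patent assumes.

def build_text_dictionary(docs):
    """Map each distinct (lowercased) token to an integer id."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def image_to_patches(image, patch=2):
    """Split a 2-D grid (list of lists) into non-overlapping patch x patch
    blocks; each block, flattened to a tuple, acts as one visual word."""
    h, w = len(image), len(image[0])
    words = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            words.append(tuple(image[r + i][c + j]
                               for i in range(patch) for j in range(patch)))
    return words

def build_visual_dictionary(images, patch=2):
    """Map each distinct patch to an integer id."""
    vocab = {}
    for img in images:
        for wrd in image_to_patches(img, patch):
            vocab.setdefault(wrd, len(vocab))
    return vocab
```

In practice the patch features would come from an image descriptor and be quantized into a fixed-size codebook; the toy version keeps the structure of the two dictionaries without that machinery.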
Step 3, construct the classification loss function of network media events with formula (1):

$$\min_{q}\; \mathcal{L}(q) + c\sum_{d=1}^{D}\sum_{l=1}^{L}\mathbb{E}_{q}\big[\ell_d^{\,l}\big] \tag{1}$$

In formula (1), q denotes a posterior distribution, L(q) denotes an upper bound of the log likelihood of the posterior distribution q, c denotes a regularization parameter, D denotes the number of documents in the dataset, L denotes the number of network media event categories, E_q[·] denotes the mathematical expectation with respect to q, and ℓ_d^l denotes the hinge loss of the d-th document for the l-th category, with:

$$\ell_d^{\,l} = \max\big(0,\; \iota - y_d^{\,l}\,\eta_l^{\mathrm{T}}\,\bar{z}_d\big) \tag{2}$$

In formula (2), η_l denotes the discrimination coefficient vector of the l-th category, the superscript T denotes the transpose, ι denotes a predefined cost parameter, \bar{z}_d denotes the empirical topic proportion of document d, and y_d^l is the classification tag indicating whether the d-th document belongs to the l-th category, with:

$$y_d^{\,l} = \begin{cases} +1, & y_d = l \\ -1, & y_d \neq l \end{cases} \tag{3}$$

In formula (3), y_d denotes the actual category label of the d-th document;
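The loss in formulas (1) to (3) pairs an evidence bound with a multi-class hinge term. A minimal Python sketch of the hinge part follows, with the expectation replaced by a single point estimate of the topic proportions; all names (hinge_loss, eta, zbar, iota) are illustrative, not from the patent.

```python
# Minimal sketch of the per-document, per-class hinge loss of formulas (2)-(3).

def class_indicator(y_d, l):
    """Formula (3): +1 if document d's label equals class l, else -1."""
    return 1 if y_d == l else -1

def hinge_loss(eta_l, zbar_d, y_d, l, iota=1.0):
    """Formula (2): max(0, iota - y_d^l * eta_l^T zbar_d)."""
    score = sum(e * z for e, z in zip(eta_l, zbar_d))  # eta_l^T zbar_d
    return max(0.0, iota - class_indicator(y_d, l) * score)

def total_loss(eta, zbar, y, c=1.0, iota=1.0):
    """Hinge part of formula (1), expectation replaced by a point estimate."""
    D, L = len(zbar), len(eta)
    return c * sum(hinge_loss(eta[l], zbar[d], y[d], l, iota)
                   for d in range(D) for l in range(L))
```

For example, with eta = [[1.0, -0.5], [-1.0, 0.5]], one document zbar = [0.6, 0.4] labeled class 0 and iota = 1, each class contributes 0.6 to the loss.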
Step 4, data generation process:
Step 4.1, sample the topic distribution parameter θ_d of the d-th document from the Dirichlet distribution with prior parameter α;
Step 4.2, for the k-th topic:
(1) sample the word distribution of the text modality φ_k^w from the Dirichlet distribution with prior parameter β_w;
(2) sample the word distribution of the visual modality φ_k^v from the Dirichlet distribution with prior parameter β_v;
(3) sample the position parameter μ_k from the vMF distribution with prior parameters (μ_0, C_0);
(4) sample the width parameter κ_k from the lognormal distribution with prior parameters (m_0, σ_0²);
Step 4.3, let u = (d, m_0) denote the subscript of the m_0-th entity vector of the d-th document:
(1) sample a topic z_u from the multinomial distribution with parameter θ_d;
(2) sample the m_0-th entity vector e_u of the d-th document from the vMF distribution with parameters (μ_{z_u}, κ_{z_u});
Step 4.4, let i = (d, m_1) denote the subscript of the m_1-th text word of the d-th document:
(1) sample a topic z_i from the multinomial distribution with parameter θ_d;
(2) sample the m_1-th text word w_i from the multinomial distribution with parameter φ_{z_i}^w, repeating the draw S_p times according to the part-of-speech prior p of w_i;
Step 4.5, let j = (d, m_2) denote the subscript of the m_2-th visual word of the d-th document:
(1) sample a topic z_j from the multinomial distribution with parameter θ_d;
(2) sample the m_2-th visual word v_j from the multinomial distribution with parameter φ_{z_j}^v;
Step 4.6, sample the actual category label y_d of the d-th document:
(1) for the discrimination coefficient η, sample each k-th component η_k from the normal distribution N(0, σ²);
(2) sample the actual category label y_d from the max-margin distribution φ(y_d | \bar{z}_d, η);
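The generative steps above can be sketched as runnable code. The sketch below covers the text and visual modalities only; the vMF-distributed entity vectors and the max-margin label draw are omitted, and all dimensions and hyper-parameter values are illustrative.

```python
import random

# Runnable sketch of the generative process of step 4 (text and visual
# modalities only). Dirichlet draws are built from normalized gamma variates.

def dirichlet(alpha, dim, rng):
    """Symmetric Dirichlet(alpha) sample of length dim."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def generate_document(rng, K=3, M_w=10, M_v=6, n_w=8, n_v=4,
                      alpha=0.5, beta_w=0.1, beta_v=0.1):
    theta = dirichlet(alpha, K, rng)                         # step 4.1
    phi_w = [dirichlet(beta_w, M_w, rng) for _ in range(K)]  # step 4.2 (1)
    phi_v = [dirichlet(beta_v, M_v, rng) for _ in range(K)]  # step 4.2 (2)
    words, visuals = [], []
    for _ in range(n_w):                                     # step 4.4
        z = rng.choices(range(K), weights=theta)[0]
        words.append(rng.choices(range(M_w), weights=phi_w[z])[0])
    for _ in range(n_v):                                     # step 4.5
        z = rng.choices(range(K), weights=theta)[0]
        visuals.append(rng.choices(range(M_v), weights=phi_v[z])[0])
    return theta, words, visuals

theta, words, visuals = generate_document(random.Random(0))
```

The per-word topic draw followed by a term draw from that topic's word distribution is the core shared-topic-space mechanism the two modalities have in common.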
Step 5, constructing a joint distribution q (eta, lambda, z, theta, phi) shown in the formula (4) by utilizing a generating process wv ):
In the formula (4), ψ (y, w, v, E) represents a normalization constant, wherein y represents a category variable, w represents a text word vector, v represents a visual word vector, and E represents a knowledge entity matrix; p is p 0 (η,z,θ,Φ wv ) Represents a priori distribution, where z represents the topic distribution vector, θ represents the topic scale, Φ w Parameter matrix, Φ, representing text word distribution v Parameter matrix representing visual word distribution, p (w, v, e|z, phi) wv ) Is a conditional probability of the generation process;is a posterior distribution representing category information, where λ is an augmentation variable;
Step 6, obtain the probability of sampling the topic of an entity vector with formula (5):

$$p(z_u = k \mid z_{\neg u}, E) \propto \big(n_{d,k}^{\neg u} + \alpha\big)\; C_L(\kappa_k)\; \frac{C_L\big(\lVert \kappa_k\,\bar{e}_k^{\neg u} + C_0\mu_0 \rVert\big)}{C_L\big(\lVert \kappa_k(\bar{e}_k^{\neg u} + e_u) + C_0\mu_0 \rVert\big)} \tag{5}$$

In formula (5), p(z_u = k | z_{¬u}, E) denotes the probability of assigning the entity vector with subscript u to the k-th topic after removing its current topic assignment; n_{d,k}^{¬u} denotes the count under the k-th topic in the d-th document after removing the topic count of the entity vector with subscript u; α is the Dirichlet prior; C_L(·) denotes the coefficient function of the vMF distribution, C_L(κ) = κ^{L/2−1} / ((2π)^{L/2} I_{L/2−1}(κ)), where I_{L/2−1}(·) is the modified Bessel function of the first kind; ‖·‖ denotes the modulus of a vector; κ_k is the width parameter of the vMF distribution; e_u denotes the entity vector with subscript u in the d-th document; \bar{e}_k^{¬u} denotes the sum of all entity vectors assigned to the k-th topic after removing e_u, and \bar{e}_k^{¬u} + e_u the corresponding sum with e_u included; (μ_0, C_0) are the prior parameters of the vMF distribution;
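The vMF coefficient C_L(kappa) in formula (5) involves a modified Bessel function of the first kind. A small sketch follows, computing I_nu from its standard power series rather than a library routine; the numerics are purely illustrative, not the patent's implementation.

```python
import math

# Sketch of the vMF normalization coefficient of formula (5):
# C_L(kappa) = kappa^(L/2 - 1) / ((2*pi)^(L/2) * I_{L/2-1}(kappa)).

def bessel_i(nu, x, terms=60):
    """Modified Bessel function of the first kind via its power series:
    I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1))."""
    total = 0.0
    for m in range(terms):
        total += (x / 2.0) ** (2 * m + nu) / (
            math.factorial(m) * math.gamma(m + nu + 1))
    return total

def vmf_coefficient(kappa, L):
    """C_L(kappa) for an L-dimensional vMF distribution."""
    nu = L / 2.0 - 1.0
    return kappa ** nu / ((2 * math.pi) ** (L / 2.0) * bessel_i(nu, kappa))
```

As a sanity check, for L = 3 the coefficient reduces to the closed form kappa / (4 * pi * sinh(kappa)) of the vMF density on the unit sphere.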
Step 7, sample the width parameter of the vMF distribution with formula (6):

$$p(\kappa_k \mid E, z) \propto C_L(\kappa_k)^{n_k^{e}}\; \frac{C_L\big(\lVert C_0\mu_0 \rVert\big)}{C_L\big(\lVert \kappa_k\,\bar{e}_k + C_0\mu_0 \rVert\big)}\; \mathrm{logNormal}\big(\kappa_k;\, m_0, \sigma_0^2\big) \tag{6}$$

In formula (6), n_k^e denotes the entity vector count of the k-th topic; \bar{e}_k denotes the sum of all entity vectors assigned to the k-th topic; logNormal(·) denotes the probability density function of the lognormal distribution; (m_0, σ_0²) are the prior parameters of the lognormal distribution;
Step 8, sample the discrimination coefficient η with formula (7):

$$q(\eta \mid z, \lambda) \propto \mathcal{N}(\mu, \Sigma) \tag{7}$$

In formula (7), the prior of the discrimination coefficient η follows a Gaussian distribution, i.e. p_0(η_k) = N(0, σ²), where σ is a non-zero parameter; μ denotes the mean and Σ the covariance matrix, with, for each category l:

$$\Sigma_l = \Big(\frac{1}{\sigma^2}I + c^2\sum_{d=1}^{D}\frac{\bar{z}_d\,\bar{z}_d^{\mathrm{T}}}{\lambda_d^{\,l}}\Big)^{-1}, \qquad \mu_l = \Sigma_l\,\Big(c\sum_{d=1}^{D} y_d^{\,l}\,\frac{\lambda_d^{\,l} + c\iota}{\lambda_d^{\,l}}\,\bar{z}_d\Big) \tag{8}$$

In formula (8), \bar{z}_d denotes the empirical topic proportion of the d-th document; the superscript T denotes the transpose; I denotes the identity matrix; λ_d^l denotes the augmentation value of the d-th document under the l-th category;
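Formula (8) is a standard Gaussian posterior over the per-class discrimination coefficients. Below is a sketch assuming NumPy is available; the function signature and parameter names are illustrative, not from the patent.

```python
import numpy as np

# Sketch of formula (8): the Gaussian posterior over the class-l discrimination
# coefficients eta_l given topic proportions zbar and augmentation values lam.

def eta_posterior(zbar, y, lam, l, sigma=1.0, c=1.0, iota=1.0):
    """Return (mu_l, Sigma_l) of q(eta_l | z, lambda) = N(mu_l, Sigma_l).
    zbar: D x K topic proportions; y: labels; lam[d][l]: augmentation values."""
    zbar = np.asarray(zbar, dtype=float)
    D, K = zbar.shape
    prec = np.eye(K) / sigma**2          # prior precision I / sigma^2
    mean_term = np.zeros(K)
    for d in range(D):
        y_dl = 1.0 if y[d] == l else -1.0                  # formula (3)
        prec += c**2 * np.outer(zbar[d], zbar[d]) / lam[d][l]
        mean_term += c * y_dl * (lam[d][l] + c * iota) / lam[d][l] * zbar[d]
    Sigma_l = np.linalg.inv(prec)
    mu_l = Sigma_l @ mean_term
    return mu_l, Sigma_l
```

Because the precision matrix is a sum of the prior term and rank-one outer products, Sigma_l is symmetric positive definite whenever the lambda values are positive.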
Step 9, sample the topic of a text word with formula (9):

$$p(z_i = k \mid z_{\neg i}, w) \propto S_p\,\big(n_{d,k}^{\neg i} + \alpha\big)\,\frac{n_{k,t_0}^{\neg i} + \beta}{\sum_{t}\big(n_{k,t}^{\neg i} + \beta\big)}\,\exp\Big\{\sum_{l=1}^{L}\Big[\frac{c\,y_d^{\,l}(\lambda_d^{\,l} + c\iota)}{\lambda_d^{\,l}}\,\frac{\eta_{l,k}}{N_d^{w}} - \frac{c^2}{2\lambda_d^{\,l}}\Big(\frac{\eta_{l,k}^2}{(N_d^{w})^2} + \frac{2\,\eta_{l,k}\,\Lambda_{d,l}^{\neg i}}{N_d^{w}}\Big)\Big]\Big\} \tag{9}$$

In formula (9), z_{¬i} denotes the topic vector after the topic of the text word with subscript i is removed in the text modality; w_i = t_0 means that the text word w_i corresponds to the t_0-th term of the text dictionary; S_p denotes the part-of-speech sampling weight of w_i; n_{k,t_0}^{¬i} denotes the count of the t_0-th term under the k-th topic after removing the topic count of the text word with subscript i; n_{d,k}^{¬i} denotes the count under the k-th topic in the d-th document after removing the topic count of the text word with subscript i; α and β are Dirichlet priors; λ_d^l denotes the augmentation value of the d-th document under the l-th category; N_d^w denotes the number of text words in the d-th document; η_{l,k} denotes the value of the k-th dimension of the discrimination vector of the l-th category; Λ_{d,l}^{¬i} = η_l^T \bar{z}_d^{¬i} denotes the discriminant function value excluding the word with subscript i;
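Ignoring the supervised exponential factor and the part-of-speech weight S_p, the remaining collapsed Gibbs weight of formula (9) is the familiar LDA expression, which can be sketched deterministically; all names are illustrative.

```python
# Sketch of the unsupervised part of formula (9): the collapsed Gibbs weight
# for assigning text word i (term t0) to topic k, given counts with word i's
# current assignment removed. The exp{...} factor and S_p are omitted.

def topic_weight(n_dk, n_kt0, n_k, alpha, beta, M_w):
    """(n_dk + alpha) * (n_kt0 + beta) / (n_k + M_w * beta)."""
    return (n_dk + alpha) * (n_kt0 + beta) / (n_k + M_w * beta)

def sample_probs(counts, alpha=0.5, beta=0.1, M_w=100):
    """Normalize per-topic weights into a distribution over topics.
    counts: list of (n_dk, n_kt0, n_k) triples, one per topic."""
    w = [topic_weight(a, b, c, alpha, beta, M_w) for a, b, c in counts]
    s = sum(w)
    return [x / s for x in w]
```

The first factor favors topics already frequent in the document; the second favors topics under which the word's term is frequent across the corpus.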
Step 10, sample the topic of a visual word with formula (10):

$$p(z_j = k \mid z_{\neg j}, v) \propto \big(n_{d,k}^{\neg j} + \alpha\big)\,\frac{n_{k,t_1}^{\neg j} + \beta}{\sum_{t}\big(n_{k,t}^{\neg j} + \beta\big)}\,\exp\Big\{\sum_{l=1}^{L}\Big[\frac{c\,y_d^{\,l}(\lambda_d^{\,l} + c\iota)}{\lambda_d^{\,l}}\,\frac{\eta_{l,k}}{N_d^{v}} - \frac{c^2}{2\lambda_d^{\,l}}\Big(\frac{\eta_{l,k}^2}{(N_d^{v})^2} + \frac{2\,\eta_{l,k}\,\Lambda_{d,l}^{\neg j}}{N_d^{v}}\Big)\Big]\Big\} \tag{10}$$

In formula (10), z_{¬j} denotes the topic vector after the topic of the visual word with subscript j is removed in the visual modality; v_j = t_1 means that the visual word v_j corresponds to the t_1-th term of the visual dictionary; n_{k,t_1}^{¬j} denotes the count of the t_1-th term under the k-th topic after removing the topic count of the visual word with subscript j; n_{d,k}^{¬j} denotes the count under the k-th topic in the d-th document after removing the topic count of the visual word with subscript j; N_d^v denotes the number of visual words in the d-th document; Λ_{d,l}^{¬j} = η_l^T \bar{z}_d^{¬j} denotes the discriminant function value excluding the word with subscript j;
Step 11, sample the augmentation variable λ_d of the d-th document with formula (11):

$$q\big(\lambda_d^{\,l} \mid z, \eta\big) = \mathcal{GIG}\Big(\lambda_d^{\,l};\; \frac{1}{2},\; 1,\; c^2\big(\zeta_d^{\,l}\big)^2\Big), \qquad \zeta_d^{\,l} = \iota - y_d^{\,l}\,\eta_l^{\mathrm{T}}\bar{z}_d \tag{11}$$

In formula (11), GIG(x; p, a, b) denotes the generalized inverse Gaussian distribution;
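For p = 1/2, a GIG draw as in formula (11) can be realized through the inverse Gaussian distribution, since the reciprocal of the variable then follows an inverse Gaussian law. The sketch below uses the standard Michael, Schucany and Haas transformation; it is a generic sampler under those assumptions, not the patent's implementation, and it assumes zeta is non-zero.

```python
import random, math

# Sketch of step 11: if 1/lambda_d ~ InverseGaussian(1/(c*|zeta_d|), 1),
# then lambda_d ~ GIG(1/2, 1, c^2 * zeta_d^2).

def sample_inverse_gaussian(mu, lam, rng):
    """Michael-Schucany-Haas sampler for InverseGaussian(mu, lam)."""
    v = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu**2 * v) / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * v + mu**2 * v**2)
    if rng.random() <= mu / (mu + x):
        return x
    return mu**2 / x

def sample_lambda(zeta, c, rng):
    """One draw of lambda_d given margin residual zeta (must be non-zero)."""
    inv_lam = sample_inverse_gaussian(1.0 / (c * abs(zeta)), 1.0, rng)
    return 1.0 / inv_lam

rng = random.Random(0)
draws = [sample_lambda(0.5, 1.0, rng) for _ in range(1000)]
```

With zeta = 0.5 and c = 1 the reciprocal draws have mean 1/(c*|zeta|) = 2, which gives a quick empirical check on the sampler.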
Step 12, during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality φ_k^w and the word distribution parameters of the visual modality φ_k^v with formula (12):

$$\theta_{d,k} = \frac{n_{d,k} + \alpha}{N_d + K\alpha}, \qquad \varphi_{k,t_0}^{w} = \frac{n_{k,t_0}^{w} + \beta_w}{n_{k}^{w} + M_w\beta_w}, \qquad \varphi_{k,t_1}^{v} = \frac{n_{k,t_1}^{v} + \beta_v}{n_{k}^{v} + M_v\beta_v} \tag{12}$$

In formula (12), N_d = N_d^w + N_d^v + N_d^e is the total number of text words, visual words and entity vectors in the d-th document; K denotes the number of topics; M_w denotes the length of the text dictionary and M_v the length of the visual dictionary; n_{d,k} denotes the word and entity vector count under the k-th topic in the d-th document; n_{k,t_0}^w denotes the count of the t_0-th term under the k-th topic in the text modality and n_k^w the total word count under the k-th topic in the text modality; n_{k,t_1}^v denotes the count of the t_1-th term under the k-th topic in the visual modality and n_k^v the total word count under the k-th topic in the visual modality;
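The point estimates of formula (12) are smoothed ratios of Gibbs counts. A deterministic sketch for one document row and one topic row follows; names are illustrative.

```python
# Sketch of formula (12): point estimates from Gibbs counts.

def estimate_theta(n_d, alpha):
    """theta_{d,k} = (n_{d,k} + alpha) / (N_d + K*alpha) for one document.
    n_d: per-topic counts of document d."""
    K, N_d = len(n_d), sum(n_d)
    return [(n_d[k] + alpha) / (N_d + K * alpha) for k in range(K)]

def estimate_phi_row(n_kt, beta):
    """phi_{k,t} = (n_{k,t} + beta) / (n_k + M*beta) for one topic k.
    n_kt: per-term counts of topic k in one modality."""
    M, n_k = len(n_kt), sum(n_kt)
    return [(n_kt[t] + beta) / (n_k + M * beta) for t in range(M)]
```

Both estimates are proper distributions by construction: each numerator set sums to its denominator.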
Step 13, predict the single category with the largest discriminant function value with formula (13):

$$\hat{y}_d = \arg\max_{l \in \{1,\ldots,L\}}\; \eta_l^{\mathrm{T}}\,\bar{z}_d \tag{13}$$

In formula (13), L is the number of categories.
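Formula (13) is an argmax over per-class linear scores; a minimal sketch (names illustrative):

```python
# Sketch of formula (13): assign a test document to the class whose
# discrimination vector gives the largest score eta_l^T zbar_d.

def predict(eta, zbar_d):
    """Return the index l maximizing eta[l] . zbar_d."""
    scores = [sum(e * z for e, z in zip(eta_l, zbar_d)) for eta_l in eta]
    return max(range(len(eta)), key=lambda l: scores[l])
```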
Compared with the prior art, the invention has the beneficial effects that:
1. The invention makes full use of the multi-modal attributes and label information of network media events. By means of the max-margin principle behind the SVM, discriminative label information is introduced into the low-dimensional representation of documents, so that the model describes the observed data as well as possible while keeping the classification loss as small as possible. This trade-off increases the robustness of the model and improves its classification performance.
2. The invention introduces the internal semantics of the corpus in the text modality, which effectively helps the model recognize differences between words at the semantic level. From a linguistic perspective, variation in part of speech allows language to express more information; each part of speech plays a unique role in expression and conveys different information. In general, nouns, adjectives, verbs and adverbs contribute more to the semantic representation of a text than other parts of speech. The invention therefore tags the part of speech of each text word and measures the amount of semantics a word carries according to its part of speech. By correcting the sampling weights of the model, semantically rich nouns, adjectives and similar words are promoted within event topics, yielding more coherent topic representations and improving model performance.
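The part-of-speech weighting described above can be sketched with a toy tag-to-weight table; the tag set and the weight values are illustrative assumptions, not values prescribed by the patent.

```python
# Sketch of part-of-speech sampling weights S_p: content parts of speech
# (nouns, verbs, adjectives, adverbs) receive a larger weight than function
# words. Tag names and weights are illustrative.

POS_WEIGHT = {"NOUN": 3, "VERB": 2, "ADJ": 2, "ADV": 2}  # default weight 1

def semantic_weight(pos_tag):
    """S_p for a tagged word; unlisted tags (function words) get weight 1."""
    return POS_WEIGHT.get(pos_tag, 1)

def weighted_counts(tagged_words):
    """Accumulate S_p-weighted pseudo-counts for each word."""
    counts = {}
    for word, tag in tagged_words:
        counts[word] = counts.get(word, 0) + semantic_weight(tag)
    return counts
```

Scaling a word's count by S_p raises the relative influence of semantically rich words during topic sampling, which is the corrective effect the paragraph above describes.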
3. The invention expands a knowledge modality on top of the text and visual modalities, taking the knowledge entities linked from the text modality as samples of the new modality, which shares the topic space with the text and visual modalities within each document. Network media events contain many knowledge entities (e.g., person names, place names) encoded in fact-oriented knowledge bases, and fusing this existing human knowledge can further improve the performance of the topic model. The invention uses the TransE algorithm to obtain low-dimensional vector representations of knowledge entities in the knowledge graph (WN18), and adopts the vMF distribution to model such directional data. The knowledge modality, expanded in the form of knowledge embeddings, not only introduces fact-oriented human knowledge into the model and makes full use of the associated knowledge in the document structure, but also optimizes the topic representation of documents, so that the model can mine event topics with more consistent concepts.
Drawings
FIG. 1 is a diagram of a model structure of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
In this embodiment, the category detection method for network media events based on a semantic and knowledge expansion topic model fuses the internal semantics and external knowledge of the corpus, overcoming the insufficient interpretability of supervised topic models. The model structure is shown in FIG. 1, where gray nodes are observable variables, white nodes are hidden variables, and the peripheral black nodes are model hyper-parameters. The method extracts the internal semantics of the text modality to guide the generation of semantic text words, so that the model favors semantic words rather than high-frequency words. An expanded knowledge modality introduces fact-oriented human knowledge, which together with the text and visual information describes the latent semantic structure of the document in the same topic space. The method obtains a preliminary feature representation (word id vectors) of each document by processing the crawled dataset, obtains the embedding of each knowledge entity with the TransE algorithm, establishes a generative model describing the observed data, derives the conditional distributions of the hidden variables from the joint distribution of the generative model, and infers the model parameters with a Gibbs sampling algorithm. After the model converges, the word topics of a test document are sampled according to the text and visual word distributions, and the label whose binary discrimination coefficient gives the largest dot product with the document's topic representation is taken as the event the test document belongs to. In this way the detection problem for large-scale multi-modal network media events can be well solved. Specifically, as shown in FIG. 2, the method proceeds according to the following steps:
step 1, acquiring a data set of a network media event, and preprocessing sentence segmentation, word shape restoration and part-of-speech tagging of text data of each document in the data set so as to construct a text dictionary;
step 2, image data corresponding to each document in the data set are subjected to blocking processing, each small block after blocking is used as a visual word, and the image characteristics of each visual word are extracted, so that a visual word dictionary is constructed;
and 3, linking out the knowledge entity in the network media event document, obtaining the vector representation of the knowledge entity in the knowledge graph by using a TransE algorithm, and describing the knowledge entity vector by adopting vMF distribution.
Step 4, constructing a classification loss function of the network media event by using the formula (1):
in the formula (1), q represents posterior distribution, L () represents upper bound of log likelihood of posterior distribution q, c represents regularization parameter, D represents number of files in data set, L represents category number of network media event, E q []Representing a mathematical expectation regarding the posterior distribution q,represent the firstd documents belong to the hinge loss function of the first category and have:
in the formula (2), eta l The discrimination coefficient representing the first class, the superscript T representing the transpose, iota representing the predefined cost parameter,represents the subject experience scale of document d, < ->A classification tag indicating whether the d-th document belongs to the first category, and having:
in the formula (3), y d An actual category label representing the document of the d-th paragraph;
step 5, the data generation process:
step 5.1, sampling the theme distribution parameter theta of the d document from the dirichlet distribution with a priori parameter alpha d
Step 5.2, for the kth topic:
(1) From a priori parameters beta w Word distribution of text modality corresponding to a sample dataset in dirichlet distribution
(2) From a priori parameters beta v Word distribution of visual mode corresponding to sampling data set in dirichlet distribution
(3) From a priori parameters (μ) 0 ,C 0 ) Sample position parameter μ in vMF distribution of (2) k
(4) From a priori parameters ofThe width parameter k of the sampled vMF distribution in the lognormal distribution of (a) k
Step 5.3, let u= (d, m 0 ) Mth representing the d-th document 0 Subscript of individual entity vectors:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) From the parameters ofSample m of the d document in vMF distribution 0 Personal entity vector e u
Step 5.4, let i= (d, m) 1 ) Mth representing the d-th document 1 Subscript of individual text words:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) According to the m th 1 Word w of individual text i From the parameters of the part of speech a priori pSampling S in a polynomial distribution of (1) p Second order m 1 Word w of individual text i
Step 5.5, let j= (d, m) 2 ) Mth representing the d-th document 2 Subscript of individual visual words:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) From the parameters ofSample mth of the d-th document in polynomial distribution 2 Personal visual word v j
Step 5.6, sampling the actual class label y of the d-th document d
(1) For the discrimination coefficient η, the parameters are sequentially (0, σ) 2 ) Is sampled in its kth component eta in the normal distribution k
(2) From the parameters ofSampling the actual category labels y of the document of the d-th in the max-margin distribution d
Step 6, constructing a joint distribution q (eta, lambda, z, theta, phi) shown in the formula (4) by utilizing a generating process wv ):
In the formula (4), ψ (y, w, v, E) represents a normalization constant, wherein y represents a category variable, w represents a text word vector, v represents a visual word vector, and E represents a knowledge entity matrix; p is p 0 (η,z,θ,Φ wv ) Represents a priori distribution, where z represents the topic distribution vector, θ represents the topic scale, Φ w Parameter matrix, Φ, representing text word distribution v Parameter matrix representing visual word distribution, p (w, v, e|z, phi) wv ) Is a conditional probability of the generation process;is a posterior distribution representing category information, where λ is an augmentation variable;
step 7, obtaining the probability of the sampling entity vector theme by using the formula (5):
in the formula (5), the amino acid sequence of the compound,representing the probability of assigning the entity vector corresponding to the subscript u to the kth topic after assigning the topic of removing the entity vector corresponding to the subscript u>Representing the count under the kth topic in the d-th document after the topic count of the entity vector corresponding to the subscript u is removed; alpha is dirichlet pri; c (C) L (x) Representing a coefficient function of vMF distribution, anWherein I is L (. Cndot.) represents a modified L-th order Bessel function of the first class; the term represents a modulus of the vector; kappa (kappa) k Is a width parameter of vMF distribution; />Representing the sum of all entity vectors under the kth subject after removing entity vectors corresponding to the subscript u in the d-th document; />Representing the sum of all entity vectors assigned to the kth topic in the d-th document; (mu) 0 ,C 0 ) Is a priori parameter of vMF distribution;
step 8, sampling vMF distributed width parameters by using the formula (6):
in the formula (6), the amino acid sequence of the compound,entity vector counts representing the kth topic; logNormal (·) represents the probability density function of the logNormal distribution; />A priori parameters for a lognormal distribution;
step 9, sampling a discrimination coefficient eta by using the formula (7):
q(η|z,λ)∝N(μ,Σ) (7)
in equation (7), the prior of the discrimination coefficient η follows a gaussian distribution, i.e. p 0k )=N(0,σ 2 ) Wherein σ is a non-zero parameter; μ represents the mean, Σ represents the covariance matrix, and has:
in the formula (8), the amino acid sequence of the compound,representing the subject experience scale of the document of the d; the superscript T denotes a transpose; i represents an identity matrix.
Step 10, sampling the theme of the text word by using the formula (9):
in the formula (9), the amino acid sequence of the compound,a theme vector after the theme of the text word corresponding to the subscript i is removed in the text mode is represented; w (w) i =t 0 Representing text word w i T in corresponding text dictionary 0 A personal term; />Representing that the kth subject belongs to the kth subject after the subject count of text words corresponding to the subscript i is removed 0 Word count of individual terms; />Indicating removal of subscript iCounting under the kth topic in the d-th document after counting the topics corresponding to the text words; alpha and beta are dirichlet priors; />An augmentation value for the d-th document under the first category; />Representing the number of text words in the d-th document; η (eta) l,k A value representing a kth dimension of the discrimination vector corresponding to the ith category; />A discriminant function value representing the word excluding the subscript i, and +.>
Step 11: sample the topic of a visual word using formula (10):
In formula (10), the topic vector term represents the topic assignments of the visual modality after removing the topic of the visual word corresponding to subscript j; v_j = t_1 means that visual word v_j is the t_1-th term of the visual dictionary; the first count represents the word count of the t_1-th term under the k-th topic after removing the topic count of the visual word corresponding to subscript j; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the visual word corresponding to subscript j; the word-number term represents the number of visual words in the d-th document; the final term represents the discriminant function value excluding the word with subscript j.
Step 12: sample the augmentation variable λ_d of the d-th document using formula (11):
In formula (11), GIG(x; p, a, b) is the generalized inverse Gaussian distribution;
Step 13: during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality and the word distribution parameters of the visual modality using formula (12):
In formula (12), the three document-level counts are respectively the numbers of text words, visual words and entity vectors in the d-th document; K is the number of topics; M_w represents the length of the text dictionary and M_v represents the length of the visual dictionary; n_{d,k} represents the word and entity-vector count under the k-th topic in the d-th document; the remaining counts represent, in the text modality, the word count of the t_0-th term under the k-th topic and the total word count under the k-th topic, and, in the visual modality, the word count of the t_1-th term under the k-th topic and the total word count under the k-th topic;
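Formula (12) itself is not legible in this extraction. For reference, the standard Dirichlet-multinomial point estimates consistent with the counts described above take the following form (a hedged reconstruction, not necessarily the patent's exact expression):

```latex
\theta_{d,k} = \frac{n_{d,k} + \alpha}{\sum_{k'=1}^{K}\left(n_{d,k'} + \alpha\right)}, \qquad
\phi^{w}_{k,t_0} = \frac{n^{w}_{k,t_0} + \beta}{n^{w}_{k} + M_w\,\beta}, \qquad
\phi^{v}_{k,t_1} = \frac{n^{v}_{k,t_1} + \beta}{n^{v}_{k} + M_v\,\beta}
```

Here n_{d,k}, n^w_{k,t_0}, n^w_k, n^v_{k,t_1} and n^v_k denote the counts defined in the explanation of formula (12).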
Step 14: using formula (13), predict the document as belonging to the single category with the largest discriminant function value:
In formula (13), L is the number of categories.
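Step 14 reduces to an argmax over per-category linear scores. A minimal sketch with illustrative names (eta stands for the L×K matrix of discrimination coefficients, z_bar for a document's empirical topic proportion; both names are assumptions, not the patent's notation):

```python
def predict_category(eta, z_bar):
    """Predict the single category with the largest discriminant value,
    i.e. argmax over l of the inner product eta[l] . z_bar."""
    scores = [sum(e * z for e, z in zip(eta_l, z_bar)) for eta_l in eta]
    return max(range(len(scores)), key=scores.__getitem__)
```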
In summary, the method addresses the limited interpretability of existing topic-model-based network media event detection methods and handles large-scale multi-modal network media event detection well.

Claims (1)

1. A category detection method for network media events based on a semantic- and knowledge-extended topic model, characterized by comprising:
Step 1: acquire a dataset of network media events, and preprocess the text data of each document in the dataset with sentence segmentation, lemmatization and part-of-speech tagging, so as to construct a text dictionary;
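The dictionary-building part of this step can be sketched as follows. This is a minimal illustration in which a regex tokenizer stands in for the real sentence segmentation, lemmatization and POS-tagging pipeline (which would normally come from an NLP toolkit); `min_count` is an assumed frequency cutoff, not a parameter named in the claim:

```python
import re
from collections import Counter

def build_text_dictionary(documents, min_count=1):
    """Tokenize each document's text and assign every surviving term an
    integer id (term -> id), yielding the text dictionary of step 1."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    terms = sorted(t for t, c in counts.items() if c >= min_count)
    return {t: i for i, t in enumerate(terms)}
```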
Step 2: partition the image data corresponding to each document in the dataset into blocks, take each block as a visual word, and extract the image feature of each visual word so as to construct a visual dictionary;
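The blocking part of step 2 can be sketched as cutting an image into fixed-size patches, each of which becomes one visual word. This is a minimal sketch on a plain 2-D list of pixel values; the feature extraction (e.g. a descriptor or CNN) and the vector quantization into a visual dictionary are left out, and the patch size is an assumed parameter:

```python
def image_to_patches(image, patch, stride=None):
    """Cut a 2-D image (list of rows) into patch x patch blocks; each block
    is returned as a flattened pixel list and would serve as one visual word."""
    stride = stride or patch  # non-overlapping blocks by default
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch) for dc in range(patch)])
    return patches
```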
Step 3: construct the classification loss function of the network media event using formula (1):
In formula (1), q represents the posterior distribution, L(·) represents the upper bound of the log-likelihood of the posterior distribution q, c represents the regularization parameter, D represents the number of documents in the dataset, L represents the number of categories of network media events, E_q[·] represents the mathematical expectation with respect to the posterior distribution q, and the hinge-loss term represents the hinge loss of the d-th document under the l-th category, with:
In formula (2), η_l represents the discrimination coefficient of the l-th category, the superscript T represents a transpose, ℓ represents the predefined cost parameter, the topic term represents the empirical topic proportion of document d, and the label term indicates whether the d-th document belongs to the l-th category, with:
In formula (3), y_d represents the actual category label of the d-th document;
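The per-document, per-category hinge loss of formula (2) can be sketched directly. All names here are illustrative assumptions: eta_l is one category's discrimination vector, z_bar the document's empirical topic proportion, y_dl the ±1 label of formula (3), and cost the predefined cost parameter ℓ:

```python
def hinge_loss(eta_l, z_bar, y_dl, cost=1.0):
    """Max-margin hinge loss of one document under one category:
    max(0, cost - y_dl * (eta_l . z_bar))."""
    score = sum(e * z for e, z in zip(eta_l, z_bar))
    return max(0.0, cost - y_dl * score)
```

A correctly classified document with a margin above the cost parameter contributes zero loss; a misclassified one contributes linearly in the violated margin.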
Step 4: the data generation process:
Step 4.1: sample the topic distribution parameter θ_d of the d-th document from a Dirichlet distribution with prior parameter α;
Step 4.2: for the k-th topic:
(1) sample the word distribution of the text modality corresponding to the dataset from a Dirichlet distribution with prior parameter β_w;
(2) sample the word distribution of the visual modality corresponding to the dataset from a Dirichlet distribution with prior parameter β_v;
(3) sample the position parameter μ_k from a vMF distribution with prior parameters (μ_0, C_0);
(4) sample the width parameter κ_k of the vMF distribution from a lognormal distribution with the given prior parameters;
Step 4.3: let u = (d, m_0) denote the subscript of the m_0-th entity vector of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) sample the m_0-th entity vector e_u of the d-th document from the vMF distribution with the sampled parameters;
Step 4.4: let i = (d, m_1) denote the subscript of the m_1-th text word of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) according to the part of speech of the m_1-th text word w_i, sample the m_1-th text word w_i S_p times from the multinomial distribution with the part-of-speech prior p;
Step 4.5: let j = (d, m_2) denote the subscript of the m_2-th visual word of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) sample the m_2-th visual word v_j of the d-th document from the multinomial distribution with the sampled parameters;
Step 4.6: sample the actual category label y_d of the d-th document:
(1) for the discrimination coefficient η, sample each k-th component η_k in turn from a normal distribution with parameters (0, σ²);
(2) sample the actual category label y_d of the d-th document from the max-margin distribution with the given parameters;
Step 5: using the generation process, construct the joint distribution q(η, λ, z, θ, Φ_w, Φ_v) shown in formula (4):
In formula (4), ψ(y, w, v, E) represents a normalization constant, where y represents the category variable, w the text word vector, v the visual word vector, and E the knowledge-entity matrix; p_0(η, z, θ, Φ_w, Φ_v) represents the prior distribution, where z represents the topic distribution vector, θ the topic proportions, Φ_w the parameter matrix of the text word distribution, and Φ_v the parameter matrix of the visual word distribution; p(w, v, e | z, Φ_w, Φ_v) is the conditional probability of the generation process; the remaining factor is the posterior distribution carrying the category information, where λ is the augmentation variable;
Step 6: obtain the probability of sampling an entity vector's topic using formula (5):
In formula (5), the left-hand side represents the probability of assigning the entity vector corresponding to subscript u to the k-th topic, given all other assignments with that entity vector's topic removed; the count term represents the count under the k-th topic in the d-th document after removing the topic count of the entity vector corresponding to subscript u; α is the Dirichlet prior; C_L(·) represents the coefficient function of the vMF distribution, and ‖·‖ denotes the norm of a vector; κ_k is the width parameter of the vMF distribution; e_ii represents the ii-th entity vector in the d-th document; (μ_0, C_0) are the prior parameters of the vMF distribution;
Step 7: sample the width parameter of the vMF distribution using formula (6):
In formula (6), the count term represents the entity-vector count of the k-th topic; logNormal(·) represents the probability density function of the lognormal distribution; the remaining quantities are the prior parameters of the lognormal distribution;
Step 8: sample the discrimination coefficient η using formula (7):
q(η | z, λ) ∝ N(μ, Σ)    (7)
In formula (7), the prior of the discrimination coefficient η follows a Gaussian distribution, i.e. p_0(η_k) = N(0, σ²), where σ is a non-zero parameter; μ represents the mean and Σ represents the covariance matrix, with:
In formula (8), the topic term represents the empirical topic proportion of the d-th document; the superscript T denotes a transpose; I represents the identity matrix;
Step 9: sample the topic of a text word using formula (9):
In formula (9), the topic vector term represents the topic assignments of the text modality after removing the topic of the text word corresponding to subscript i; w_i = t_0 means that text word w_i is the t_0-th term of the text dictionary; the first count represents the word count of the t_0-th term under the k-th topic after removing the topic count of the text word corresponding to subscript i; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the text word corresponding to subscript i; α and β are Dirichlet priors; the augmentation term is the augmentation value of the d-th document under the l-th category; the word-number term represents the number of text words in the d-th document; η_{l,k} represents the value of the k-th dimension of the discrimination vector corresponding to the l-th category; the final term represents the discriminant function value excluding the word with subscript i.
Step 10: sample the topic of a visual word using formula (10):
In formula (10), the topic vector term represents the topic assignments of the visual modality after removing the topic of the visual word corresponding to subscript j; v_j = t_1 means that visual word v_j is the t_1-th term of the visual dictionary; the first count represents the word count of the t_1-th term under the k-th topic after removing the topic count of the visual word corresponding to subscript j; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the visual word corresponding to subscript j; the word-number term represents the number of visual words in the d-th document; the final term represents the discriminant function value excluding the word with subscript j.
Step 11: sample the augmentation variable λ_d of the d-th document using formula (11):
In formula (11), GIG(x; p, a, b) is the generalized inverse Gaussian distribution;
Step 12: during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality and the word distribution parameters of the visual modality using formula (12):
In formula (12), the three document-level counts are respectively the numbers of text words, visual words and entity vectors in the d-th document; K is the number of topics; M_w represents the length of the text dictionary and M_v represents the length of the visual dictionary; n_{d,k} represents the word and entity-vector count under the k-th topic in the d-th document; the remaining counts represent, in the text modality, the word count of the t_0-th term under the k-th topic and the total word count under the k-th topic, and, in the visual modality, the word count of the t_1-th term under the k-th topic and the total word count under the k-th topic;
Step 13: using formula (13), predict the document as belonging to the single category with the largest discriminant function value:
In formula (13), L is the number of categories.
CN202110366951.0A 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model Active CN113051932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110366951.0A CN113051932B (en) 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model

Publications (2)

Publication Number Publication Date
CN113051932A CN113051932A (en) 2021-06-29
CN113051932B (en) 2023-11-03

Family

Family ID: 76517588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110366951.0A Active CN113051932B (en) 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model

Country Status (1)

Country Link
CN (1) CN113051932B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343679B (en) * 2021-07-06 2024-02-13 合肥工业大学 Multi-mode subject mining method based on label constraint
CN113836939B (en) * 2021-09-24 2023-07-21 北京百度网讯科技有限公司 Text-based data analysis method and device
US12026199B1 (en) * 2022-03-09 2024-07-02 Amazon Technologies, Inc. Generating description pages for media entities
CN117808104B (en) * 2024-02-29 2024-04-30 南京邮电大学 Viewpoint mining method based on self-supervision expression learning and oriented to hot topics

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN111368068A (en) * 2020-03-18 2020-07-03 江苏鸿程大数据技术与应用研究院有限公司 Short text topic modeling method based on part-of-speech feature and semantic enhancement

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9176969B2 (en) * 2013-08-29 2015-11-03 Hewlett-Packard Development Company, L.P. Integrating and extracting topics from content of heterogeneous sources
US11636355B2 (en) * 2019-05-30 2023-04-25 Baidu Usa Llc Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN111368068A (en) * 2020-03-18 2020-07-03 江苏鸿程大数据技术与应用研究院有限公司 Short text topic modeling method based on part-of-speech feature and semantic enhancement

Non-Patent Citations (2)

Title
Short-text rumor analysis based on a multi-label biterm topic model; Wu Qingyuan, He Lingnan; Journal of Intelligence (No. 03); full text *
Social media topic detection based on word embedding and probabilistic topic models; Yu Chong, Li Jing, Sun Xudong, Fu Xianghua; Computer Engineering (No. 12); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant