CN113051932B - Category detection method for network media event of semantic and knowledge expansion theme model - Google Patents


Info

Publication number: CN113051932B
Authority: CN (China)
Legal status: Active
Application number: CN202110366951.0A
Other languages: Chinese (zh)
Other versions: CN113051932A (en)
Inventors: 薛峰, 缪乃阳, 张涛
Current Assignee: Hefei University of Technology
Original Assignee: Hefei University of Technology
Application filed by Hefei University of Technology
Priority to CN202110366951.0A
Publication of CN113051932A
Application granted
Publication of CN113051932B

Classifications

    • G06F40/30: Semantic analysis
    • G06F16/355: Class or cluster creation or modification
    • G06F16/9536: Search customisation based on social or collaborative filtering
    • G06F40/216: Parsing using statistical methods
    • G06F40/242: Dictionaries
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06Q50/01: Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Probability & Statistics with Applications (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a category detection method for network media events based on a semantic and knowledge expansion topic model, which comprises the following steps: 1. expanding the supervised topic model MedLDA to model the multi-modal data and label information of network media events in a unified model; 2. letting the multi-modal data of a network media event share one topic space, introducing internal semantics through part-of-speech tagging, and introducing external semantics through an expanded knowledge modality. By introducing the internal semantics and external knowledge of network media events, the invention effectively mines the semantic words in network media events and learns high-quality, interpretable topics, thereby realizing accurate and efficient large-scale multi-modal network media event category detection.

Description

Category detection method for network media event of semantic and knowledge expansion theme model
Technical Field
The invention belongs to the technical field of computer machine learning and artificial intelligence, and mainly relates to a network media event detection method based on a supervised topic model.
Background
With the rapid development of the mobile internet and the popularity of social networking sites, people can upload events occurring in real life through their mobile phones at any time and place and leave their own comments, so the data on social networking sites grows exponentially. When a significant event occurs, users post event-related multimedia content (e.g., text, pictures, video) to social media websites. However, user-contributed data is often noisy and unstructured, and it is difficult to analyze the network media events in it manually. Therefore, automatically organizing large amounts of social media data and mining the topics of hot network media events is particularly important for improving event analysis capability.
The mainstream methods for network media event analysis are based on topic models. PLSA and LDA are widely used for text modeling and analysis. Several supervised topic models have been developed on top of LDA; these models use label information to find better document representations. On the internet, social media consists of rich unstructured data with multiple modalities (text, pictures, video, etc.), which helps express the full meaning of network media events. Multi-modal topic models extend the original topic model's generation of text words with a visual modality that generates visual words; the words of the two modalities share the topic space of a document, making full use of the multi-modal information of network media events.
A topic model inserts the concept of a topic between documents and words, so it can learn the latent semantics in the document structure and cluster words at the topic level, achieving dimensionality reduction. The topic model assumes words follow a multinomial distribution, so it cannot distinguish the semantic differences of different words, only their word-frequency differences: the probability of generating a word in a document depends only on the frequency with which the word appears in the corpus. A topic model can thus be viewed as a complex TF model operating over the whole corpus. It models the corpus-wide word-frequency information and the latent semantics of documents well, but it is biased toward high-frequency words. As a result, learned topics always contain a large number of high-frequency words, and the words with the highest probability in a topic are sometimes meaningless or irrelevant to the event, so they cannot express the topic's meaning at all. Topic models without any human knowledge or prior semantic guidance therefore tend to produce topics that are difficult to interpret. Existing models such as mmSLDA and MMSTM ignore the rich internal semantics of network media events and the external semantics encoded in knowledge graphs, and cannot distinguish the semantic differences of words. These models focus excessively on high-frequency words, which limits further improvement of model performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a category detection method for network media events based on a semantic and knowledge expansion topic model. By introducing the internal semantics and external knowledge of network media events, the method effectively mines the semantic words in network media events and learns high-quality, interpretable topics, so as to realize accurate and efficient large-scale multi-modal network media event category detection.
The invention adopts the following technical scheme for solving the technical problems:
The invention relates to a category detection method for network media events based on a semantic and knowledge expansion topic model, which is characterized by comprising the following steps:
Step 1, acquire a dataset of network media events, and preprocess the text data of each document in the dataset with sentence segmentation, lemmatization and part-of-speech tagging, so as to construct a text dictionary;
Step 2, partition the image data corresponding to each document in the dataset into patches, take each patch as a visual word, and extract the image feature of each visual word, so as to construct a visual word dictionary;
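Steps 1 and 2 can be sketched with toy stand-ins for the preprocessing tools: a whitespace tokenizer in place of segmentation and lemmatization, and raw pixel patches in place of extracted image features. All function names are illustrative, not from the patent.

```python
# Sketch of steps 1-2: build a text dictionary from tokenized documents and a
# visual dictionary from image patches. Tokenization and patching are toy
# stand-ins for the preprocessing the patent assumes.

def build_text_dictionary(docs):
    """Map each distinct (lowercased) token to an integer id."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def image_to_patches(image, patch=2):
    """Split a 2-D grid (list of lists) into non-overlapping patch x patch
    blocks; each block, flattened to a tuple, acts as one visual word."""
    h, w = len(image), len(image[0])
    words = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            words.append(tuple(image[r + i][c + j]
                               for i in range(patch) for j in range(patch)))
    return words

def build_visual_dictionary(images, patch=2):
    """Map each distinct patch to an integer id."""
    vocab = {}
    for img in images:
        for wrd in image_to_patches(img, patch):
            vocab.setdefault(wrd, len(vocab))
    return vocab
```

In practice the patch features would come from an image descriptor and be quantized into a fixed-size codebook; the toy version keeps the structure of the two dictionaries without that machinery.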
Step 3, construct the classification loss function of network media events with formula (1):

$$\min_{q}\; \mathcal{L}(q) + c\sum_{d=1}^{D}\sum_{l=1}^{L}\mathbb{E}_{q}\big[\ell_d^{\,l}\big] \tag{1}$$

In formula (1), q denotes a posterior distribution, L(q) denotes an upper bound of the log likelihood of the posterior distribution q, c denotes a regularization parameter, D denotes the number of documents in the dataset, L denotes the number of network media event categories, E_q[·] denotes the mathematical expectation with respect to q, and ℓ_d^l denotes the hinge loss of the d-th document for the l-th category, with:

$$\ell_d^{\,l} = \max\big(0,\; \iota - y_d^{\,l}\,\eta_l^{\mathrm{T}}\,\bar{z}_d\big) \tag{2}$$

In formula (2), η_l denotes the discrimination coefficient vector of the l-th category, the superscript T denotes the transpose, ι denotes a predefined cost parameter, \bar{z}_d denotes the empirical topic proportion of document d, and y_d^l is the classification tag indicating whether the d-th document belongs to the l-th category, with:

$$y_d^{\,l} = \begin{cases} +1, & y_d = l \\ -1, & y_d \neq l \end{cases} \tag{3}$$

In formula (3), y_d denotes the actual category label of the d-th document;
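The loss in formulas (1) to (3) pairs an evidence bound with a multi-class hinge term. A minimal Python sketch of the hinge part follows, with the expectation replaced by a single point estimate of the topic proportions; all names (hinge_loss, eta, zbar, iota) are illustrative, not from the patent.

```python
# Minimal sketch of the per-document, per-class hinge loss of formulas (2)-(3).

def class_indicator(y_d, l):
    """Formula (3): +1 if document d's label equals class l, else -1."""
    return 1 if y_d == l else -1

def hinge_loss(eta_l, zbar_d, y_d, l, iota=1.0):
    """Formula (2): max(0, iota - y_d^l * eta_l^T zbar_d)."""
    score = sum(e * z for e, z in zip(eta_l, zbar_d))  # eta_l^T zbar_d
    return max(0.0, iota - class_indicator(y_d, l) * score)

def total_loss(eta, zbar, y, c=1.0, iota=1.0):
    """Hinge part of formula (1), expectation replaced by a point estimate."""
    D, L = len(zbar), len(eta)
    return c * sum(hinge_loss(eta[l], zbar[d], y[d], l, iota)
                   for d in range(D) for l in range(L))
```

For example, with eta = [[1.0, -0.5], [-1.0, 0.5]], one document zbar = [0.6, 0.4] labeled class 0 and iota = 1, each class contributes 0.6 to the loss.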
Step 4, data generation process:
Step 4.1, sample the topic distribution parameter θ_d of the d-th document from the Dirichlet distribution with prior parameter α;
Step 4.2, for the k-th topic:
(1) sample the word distribution of the text modality φ_k^w from the Dirichlet distribution with prior parameter β_w;
(2) sample the word distribution of the visual modality φ_k^v from the Dirichlet distribution with prior parameter β_v;
(3) sample the position parameter μ_k from the vMF distribution with prior parameters (μ_0, C_0);
(4) sample the width parameter κ_k from the lognormal distribution with prior parameters (m_0, σ_0²);
Step 4.3, let u = (d, m_0) denote the subscript of the m_0-th entity vector of the d-th document:
(1) sample a topic z_u from the multinomial distribution with parameter θ_d;
(2) sample the m_0-th entity vector e_u of the d-th document from the vMF distribution with parameters (μ_{z_u}, κ_{z_u});
Step 4.4, let i = (d, m_1) denote the subscript of the m_1-th text word of the d-th document:
(1) sample a topic z_i from the multinomial distribution with parameter θ_d;
(2) sample the m_1-th text word w_i from the multinomial distribution with parameter φ_{z_i}^w, repeating the draw S_p times according to the part-of-speech prior p of w_i;
Step 4.5, let j = (d, m_2) denote the subscript of the m_2-th visual word of the d-th document:
(1) sample a topic z_j from the multinomial distribution with parameter θ_d;
(2) sample the m_2-th visual word v_j from the multinomial distribution with parameter φ_{z_j}^v;
Step 4.6, sample the actual category label y_d of the d-th document:
(1) for the discrimination coefficient η, sample each k-th component η_k from the normal distribution N(0, σ²);
(2) sample the actual category label y_d from the max-margin distribution φ(y_d | \bar{z}_d, η);
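The generative steps above can be sketched as runnable code. The sketch below covers the text and visual modalities only; the vMF-distributed entity vectors and the max-margin label draw are omitted, and all dimensions and hyper-parameter values are illustrative.

```python
import random

# Runnable sketch of the generative process of step 4 (text and visual
# modalities only). Dirichlet draws are built from normalized gamma variates.

def dirichlet(alpha, dim, rng):
    """Symmetric Dirichlet(alpha) sample of length dim."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def generate_document(rng, K=3, M_w=10, M_v=6, n_w=8, n_v=4,
                      alpha=0.5, beta_w=0.1, beta_v=0.1):
    theta = dirichlet(alpha, K, rng)                         # step 4.1
    phi_w = [dirichlet(beta_w, M_w, rng) for _ in range(K)]  # step 4.2 (1)
    phi_v = [dirichlet(beta_v, M_v, rng) for _ in range(K)]  # step 4.2 (2)
    words, visuals = [], []
    for _ in range(n_w):                                     # step 4.4
        z = rng.choices(range(K), weights=theta)[0]
        words.append(rng.choices(range(M_w), weights=phi_w[z])[0])
    for _ in range(n_v):                                     # step 4.5
        z = rng.choices(range(K), weights=theta)[0]
        visuals.append(rng.choices(range(M_v), weights=phi_v[z])[0])
    return theta, words, visuals

theta, words, visuals = generate_document(random.Random(0))
```

The per-word topic draw followed by a term draw from that topic's word distribution is the core shared-topic-space mechanism the two modalities have in common.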
Step 5, constructing a joint distribution q (eta, lambda, z, theta, phi) shown in the formula (4) by utilizing a generating process wv ):
In the formula (4), ψ (y, w, v, E) represents a normalization constant, wherein y represents a category variable, w represents a text word vector, v represents a visual word vector, and E represents a knowledge entity matrix; p is p 0 (η,z,θ,Φ wv ) Represents a priori distribution, where z represents the topic distribution vector, θ represents the topic scale, Φ w Parameter matrix, Φ, representing text word distribution v Parameter matrix representing visual word distribution, p (w, v, e|z, phi) wv ) Is a conditional probability of the generation process;is a posterior distribution representing category information, where λ is an augmentation variable;
Step 6, obtain the probability of sampling the topic of an entity vector with formula (5):

$$p(z_u = k \mid z_{\neg u}, E) \propto \big(n_{d,k}^{\neg u} + \alpha\big)\; C_L(\kappa_k)\; \frac{C_L\big(\lVert \kappa_k\,\bar{e}_k^{\neg u} + C_0\mu_0 \rVert\big)}{C_L\big(\lVert \kappa_k(\bar{e}_k^{\neg u} + e_u) + C_0\mu_0 \rVert\big)} \tag{5}$$

In formula (5), p(z_u = k | z_{¬u}, E) denotes the probability of assigning the entity vector with subscript u to the k-th topic after removing its current topic assignment; n_{d,k}^{¬u} denotes the count under the k-th topic in the d-th document after removing the topic count of the entity vector with subscript u; α is the Dirichlet prior; C_L(·) denotes the coefficient function of the vMF distribution, C_L(κ) = κ^{L/2−1} / ((2π)^{L/2} I_{L/2−1}(κ)), where I_{L/2−1}(·) is the modified Bessel function of the first kind; ‖·‖ denotes the modulus of a vector; κ_k is the width parameter of the vMF distribution; e_u denotes the entity vector with subscript u in the d-th document; \bar{e}_k^{¬u} denotes the sum of all entity vectors assigned to the k-th topic after removing e_u, and \bar{e}_k^{¬u} + e_u the corresponding sum with e_u included; (μ_0, C_0) are the prior parameters of the vMF distribution;
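The vMF coefficient C_L(kappa) in formula (5) involves a modified Bessel function of the first kind. A small sketch follows, computing I_nu from its standard power series rather than a library routine; the numerics are purely illustrative, not the patent's implementation.

```python
import math

# Sketch of the vMF normalization coefficient of formula (5):
# C_L(kappa) = kappa^(L/2 - 1) / ((2*pi)^(L/2) * I_{L/2-1}(kappa)).

def bessel_i(nu, x, terms=60):
    """Modified Bessel function of the first kind via its power series:
    I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1))."""
    total = 0.0
    for m in range(terms):
        total += (x / 2.0) ** (2 * m + nu) / (
            math.factorial(m) * math.gamma(m + nu + 1))
    return total

def vmf_coefficient(kappa, L):
    """C_L(kappa) for an L-dimensional vMF distribution."""
    nu = L / 2.0 - 1.0
    return kappa ** nu / ((2 * math.pi) ** (L / 2.0) * bessel_i(nu, kappa))
```

As a sanity check, for L = 3 the coefficient reduces to the closed form kappa / (4 * pi * sinh(kappa)) of the vMF density on the unit sphere.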
Step 7, sample the width parameter of the vMF distribution with formula (6):

$$p(\kappa_k \mid E, z) \propto C_L(\kappa_k)^{n_k^{e}}\; \frac{C_L\big(\lVert C_0\mu_0 \rVert\big)}{C_L\big(\lVert \kappa_k\,\bar{e}_k + C_0\mu_0 \rVert\big)}\; \mathrm{logNormal}\big(\kappa_k;\, m_0, \sigma_0^2\big) \tag{6}$$

In formula (6), n_k^e denotes the entity vector count of the k-th topic; \bar{e}_k denotes the sum of all entity vectors assigned to the k-th topic; logNormal(·) denotes the probability density function of the lognormal distribution; (m_0, σ_0²) are the prior parameters of the lognormal distribution;
Step 8, sample the discrimination coefficient η with formula (7):

$$q(\eta \mid z, \lambda) \propto \mathcal{N}(\mu, \Sigma) \tag{7}$$

In formula (7), the prior of the discrimination coefficient η follows a Gaussian distribution, i.e. p_0(η_k) = N(0, σ²), where σ is a non-zero parameter; μ denotes the mean and Σ the covariance matrix, with, for each category l:

$$\Sigma_l = \Big(\frac{1}{\sigma^2}I + c^2\sum_{d=1}^{D}\frac{\bar{z}_d\,\bar{z}_d^{\mathrm{T}}}{\lambda_d^{\,l}}\Big)^{-1}, \qquad \mu_l = \Sigma_l\,\Big(c\sum_{d=1}^{D} y_d^{\,l}\,\frac{\lambda_d^{\,l} + c\iota}{\lambda_d^{\,l}}\,\bar{z}_d\Big) \tag{8}$$

In formula (8), \bar{z}_d denotes the empirical topic proportion of the d-th document; the superscript T denotes the transpose; I denotes the identity matrix; λ_d^l denotes the augmentation value of the d-th document under the l-th category;
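Formula (8) is a standard Gaussian posterior over the per-class discrimination coefficients. Below is a sketch assuming NumPy is available; the function signature and parameter names are illustrative, not from the patent.

```python
import numpy as np

# Sketch of formula (8): the Gaussian posterior over the class-l discrimination
# coefficients eta_l given topic proportions zbar and augmentation values lam.

def eta_posterior(zbar, y, lam, l, sigma=1.0, c=1.0, iota=1.0):
    """Return (mu_l, Sigma_l) of q(eta_l | z, lambda) = N(mu_l, Sigma_l).
    zbar: D x K topic proportions; y: labels; lam[d][l]: augmentation values."""
    zbar = np.asarray(zbar, dtype=float)
    D, K = zbar.shape
    prec = np.eye(K) / sigma**2          # prior precision I / sigma^2
    mean_term = np.zeros(K)
    for d in range(D):
        y_dl = 1.0 if y[d] == l else -1.0                  # formula (3)
        prec += c**2 * np.outer(zbar[d], zbar[d]) / lam[d][l]
        mean_term += c * y_dl * (lam[d][l] + c * iota) / lam[d][l] * zbar[d]
    Sigma_l = np.linalg.inv(prec)
    mu_l = Sigma_l @ mean_term
    return mu_l, Sigma_l
```

Because the precision matrix is a sum of the prior term and rank-one outer products, Sigma_l is symmetric positive definite whenever the lambda values are positive.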
Step 9, sample the topic of a text word with formula (9):

$$p(z_i = k \mid z_{\neg i}, w) \propto S_p\,\big(n_{d,k}^{\neg i} + \alpha\big)\,\frac{n_{k,t_0}^{\neg i} + \beta}{\sum_{t}\big(n_{k,t}^{\neg i} + \beta\big)}\,\exp\Big\{\sum_{l=1}^{L}\Big[\frac{c\,y_d^{\,l}(\lambda_d^{\,l} + c\iota)}{\lambda_d^{\,l}}\,\frac{\eta_{l,k}}{N_d^{w}} - \frac{c^2}{2\lambda_d^{\,l}}\Big(\frac{\eta_{l,k}^2}{(N_d^{w})^2} + \frac{2\,\eta_{l,k}\,\Lambda_{d,l}^{\neg i}}{N_d^{w}}\Big)\Big]\Big\} \tag{9}$$

In formula (9), z_{¬i} denotes the topic vector after the topic of the text word with subscript i is removed in the text modality; w_i = t_0 means that the text word w_i corresponds to the t_0-th term of the text dictionary; S_p denotes the part-of-speech sampling weight of w_i; n_{k,t_0}^{¬i} denotes the count of the t_0-th term under the k-th topic after removing the topic count of the text word with subscript i; n_{d,k}^{¬i} denotes the count under the k-th topic in the d-th document after removing the topic count of the text word with subscript i; α and β are Dirichlet priors; λ_d^l denotes the augmentation value of the d-th document under the l-th category; N_d^w denotes the number of text words in the d-th document; η_{l,k} denotes the value of the k-th dimension of the discrimination vector of the l-th category; Λ_{d,l}^{¬i} = η_l^T \bar{z}_d^{¬i} denotes the discriminant function value excluding the word with subscript i;
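Ignoring the supervised exponential factor and the part-of-speech weight S_p, the remaining collapsed Gibbs weight of formula (9) is the familiar LDA expression, which can be sketched deterministically; all names are illustrative.

```python
# Sketch of the unsupervised part of formula (9): the collapsed Gibbs weight
# for assigning text word i (term t0) to topic k, given counts with word i's
# current assignment removed. The exp{...} factor and S_p are omitted.

def topic_weight(n_dk, n_kt0, n_k, alpha, beta, M_w):
    """(n_dk + alpha) * (n_kt0 + beta) / (n_k + M_w * beta)."""
    return (n_dk + alpha) * (n_kt0 + beta) / (n_k + M_w * beta)

def sample_probs(counts, alpha=0.5, beta=0.1, M_w=100):
    """Normalize per-topic weights into a distribution over topics.
    counts: list of (n_dk, n_kt0, n_k) triples, one per topic."""
    w = [topic_weight(a, b, c, alpha, beta, M_w) for a, b, c in counts]
    s = sum(w)
    return [x / s for x in w]
```

The first factor favors topics already frequent in the document; the second favors topics under which the word's term is frequent across the corpus.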
Step 10, sample the topic of a visual word with formula (10):

$$p(z_j = k \mid z_{\neg j}, v) \propto \big(n_{d,k}^{\neg j} + \alpha\big)\,\frac{n_{k,t_1}^{\neg j} + \beta}{\sum_{t}\big(n_{k,t}^{\neg j} + \beta\big)}\,\exp\Big\{\sum_{l=1}^{L}\Big[\frac{c\,y_d^{\,l}(\lambda_d^{\,l} + c\iota)}{\lambda_d^{\,l}}\,\frac{\eta_{l,k}}{N_d^{v}} - \frac{c^2}{2\lambda_d^{\,l}}\Big(\frac{\eta_{l,k}^2}{(N_d^{v})^2} + \frac{2\,\eta_{l,k}\,\Lambda_{d,l}^{\neg j}}{N_d^{v}}\Big)\Big]\Big\} \tag{10}$$

In formula (10), z_{¬j} denotes the topic vector after the topic of the visual word with subscript j is removed in the visual modality; v_j = t_1 means that the visual word v_j corresponds to the t_1-th term of the visual dictionary; n_{k,t_1}^{¬j} denotes the count of the t_1-th term under the k-th topic after removing the topic count of the visual word with subscript j; n_{d,k}^{¬j} denotes the count under the k-th topic in the d-th document after removing the topic count of the visual word with subscript j; N_d^v denotes the number of visual words in the d-th document; Λ_{d,l}^{¬j} = η_l^T \bar{z}_d^{¬j} denotes the discriminant function value excluding the word with subscript j;
Step 11, sample the augmentation variable λ_d of the d-th document with formula (11):

$$q\big(\lambda_d^{\,l} \mid z, \eta\big) = \mathcal{GIG}\Big(\lambda_d^{\,l};\; \frac{1}{2},\; 1,\; c^2\big(\zeta_d^{\,l}\big)^2\Big), \qquad \zeta_d^{\,l} = \iota - y_d^{\,l}\,\eta_l^{\mathrm{T}}\bar{z}_d \tag{11}$$

In formula (11), GIG(x; p, a, b) denotes the generalized inverse Gaussian distribution;
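For p = 1/2, a GIG draw as in formula (11) can be realized through the inverse Gaussian distribution, since the reciprocal of the variable then follows an inverse Gaussian law. The sketch below uses the standard Michael, Schucany and Haas transformation; it is a generic sampler under those assumptions, not the patent's implementation, and it assumes zeta is non-zero.

```python
import random, math

# Sketch of step 11: if 1/lambda_d ~ InverseGaussian(1/(c*|zeta_d|), 1),
# then lambda_d ~ GIG(1/2, 1, c^2 * zeta_d^2).

def sample_inverse_gaussian(mu, lam, rng):
    """Michael-Schucany-Haas sampler for InverseGaussian(mu, lam)."""
    v = rng.gauss(0.0, 1.0) ** 2
    x = mu + (mu**2 * v) / (2 * lam) - (mu / (2 * lam)) * math.sqrt(
        4 * mu * lam * v + mu**2 * v**2)
    if rng.random() <= mu / (mu + x):
        return x
    return mu**2 / x

def sample_lambda(zeta, c, rng):
    """One draw of lambda_d given margin residual zeta (must be non-zero)."""
    inv_lam = sample_inverse_gaussian(1.0 / (c * abs(zeta)), 1.0, rng)
    return 1.0 / inv_lam

rng = random.Random(0)
draws = [sample_lambda(0.5, 1.0, rng) for _ in range(1000)]
```

With zeta = 0.5 and c = 1 the reciprocal draws have mean 1/(c*|zeta|) = 2, which gives a quick empirical check on the sampler.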
Step 12, during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality φ_k^w and the word distribution parameters of the visual modality φ_k^v with formula (12):

$$\theta_{d,k} = \frac{n_{d,k} + \alpha}{N_d + K\alpha}, \qquad \varphi_{k,t_0}^{w} = \frac{n_{k,t_0}^{w} + \beta_w}{n_{k}^{w} + M_w\beta_w}, \qquad \varphi_{k,t_1}^{v} = \frac{n_{k,t_1}^{v} + \beta_v}{n_{k}^{v} + M_v\beta_v} \tag{12}$$

In formula (12), N_d = N_d^w + N_d^v + N_d^e is the total number of text words, visual words and entity vectors in the d-th document; K denotes the number of topics; M_w denotes the length of the text dictionary and M_v the length of the visual dictionary; n_{d,k} denotes the word and entity vector count under the k-th topic in the d-th document; n_{k,t_0}^w denotes the count of the t_0-th term under the k-th topic in the text modality and n_k^w the total word count under the k-th topic in the text modality; n_{k,t_1}^v denotes the count of the t_1-th term under the k-th topic in the visual modality and n_k^v the total word count under the k-th topic in the visual modality;
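The point estimates of formula (12) are smoothed ratios of Gibbs counts. A deterministic sketch for one document row and one topic row follows; names are illustrative.

```python
# Sketch of formula (12): point estimates from Gibbs counts.

def estimate_theta(n_d, alpha):
    """theta_{d,k} = (n_{d,k} + alpha) / (N_d + K*alpha) for one document.
    n_d: per-topic counts of document d."""
    K, N_d = len(n_d), sum(n_d)
    return [(n_d[k] + alpha) / (N_d + K * alpha) for k in range(K)]

def estimate_phi_row(n_kt, beta):
    """phi_{k,t} = (n_{k,t} + beta) / (n_k + M*beta) for one topic k.
    n_kt: per-term counts of topic k in one modality."""
    M, n_k = len(n_kt), sum(n_kt)
    return [(n_kt[t] + beta) / (n_k + M * beta) for t in range(M)]
```

Both estimates are proper distributions by construction: each numerator set sums to its denominator.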
Step 13, predict the single category with the largest discriminant function value with formula (13):

$$\hat{y}_d = \arg\max_{l \in \{1,\ldots,L\}}\; \eta_l^{\mathrm{T}}\,\bar{z}_d \tag{13}$$

In formula (13), L is the number of categories.
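Formula (13) is an argmax over per-class linear scores; a minimal sketch (names illustrative):

```python
# Sketch of formula (13): assign a test document to the class whose
# discrimination vector gives the largest score eta_l^T zbar_d.

def predict(eta, zbar_d):
    """Return the index l maximizing eta[l] . zbar_d."""
    scores = [sum(e * z for e, z in zip(eta_l, zbar_d)) for eta_l in eta]
    return max(range(len(eta)), key=lambda l: scores[l])
```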
Compared with the prior art, the invention has the beneficial effects that:
1. The invention makes full use of the multi-modal attributes and label information of network media events. By means of the max-margin principle behind the SVM, discriminative label information is introduced into the low-dimensional representation of documents, so that the model describes the observed data as well as possible while keeping the classification loss as small as possible. This trade-off increases the robustness of the model and improves its classification performance.
2. The invention introduces the internal semantics of the corpus in the text modality, which effectively helps the model recognize differences between words at the semantic level. From a linguistic perspective, variation in part of speech allows language to express more information; each part of speech plays a unique role in expression and conveys different information. In general, nouns, adjectives, verbs and adverbs contribute more to the semantic representation of a text than other parts of speech. The invention therefore tags the part of speech of each text word and measures the amount of semantics a word carries according to its part of speech. By correcting the sampling weights of the model, semantically rich nouns, adjectives and similar words are promoted within event topics, yielding more coherent topic representations and improving model performance.
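The part-of-speech weighting described above can be sketched with a toy tag-to-weight table; the tag set and the weight values are illustrative assumptions, not values prescribed by the patent.

```python
# Sketch of part-of-speech sampling weights S_p: content parts of speech
# (nouns, verbs, adjectives, adverbs) receive a larger weight than function
# words. Tag names and weights are illustrative.

POS_WEIGHT = {"NOUN": 3, "VERB": 2, "ADJ": 2, "ADV": 2}  # default weight 1

def semantic_weight(pos_tag):
    """S_p for a tagged word; unlisted tags (function words) get weight 1."""
    return POS_WEIGHT.get(pos_tag, 1)

def weighted_counts(tagged_words):
    """Accumulate S_p-weighted pseudo-counts for each word."""
    counts = {}
    for word, tag in tagged_words:
        counts[word] = counts.get(word, 0) + semantic_weight(tag)
    return counts
```

Scaling a word's count by S_p raises the relative influence of semantically rich words during topic sampling, which is the corrective effect the paragraph above describes.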
3. The invention expands a knowledge modality on top of the text and visual modalities, taking the knowledge entities linked from the text modality as samples of the new modality, which shares the topic space with the text and visual modalities within each document. Network media events contain many knowledge entities (e.g., person names, place names) encoded in fact-oriented knowledge bases, and fusing this existing human knowledge can further improve the performance of the topic model. The invention uses the TransE algorithm to obtain low-dimensional vector representations of knowledge entities in the knowledge graph (WN18), and adopts the vMF distribution to model such directional data. The knowledge modality, expanded in the form of knowledge embeddings, not only introduces fact-oriented human knowledge into the model and makes full use of the associated knowledge in the document structure, but also optimizes the topic representation of documents, so that the model can mine event topics with more consistent concepts.
Drawings
FIG. 1 is a diagram of a model structure of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
In this embodiment, the category detection method for network media events based on a semantic and knowledge expansion topic model fuses the internal semantics and external knowledge of the corpus, overcoming the insufficient interpretability of supervised topic models. The model structure is shown in FIG. 1, where gray nodes are observable variables, white nodes are hidden variables, and the peripheral black nodes are model hyper-parameters. The method extracts the internal semantics of the text modality to guide the generation of semantic text words, so that the model favors semantic words rather than high-frequency words. An expanded knowledge modality introduces fact-oriented human knowledge, which together with the text and visual information describes the latent semantic structure of the document in the same topic space. The method obtains a preliminary feature representation (word id vectors) of each document by processing the crawled dataset, obtains the embedding of each knowledge entity with the TransE algorithm, establishes a generative model describing the observed data, derives the conditional distributions of the hidden variables from the joint distribution of the generative model, and infers the model parameters with a Gibbs sampling algorithm. After the model converges, the word topics of a test document are sampled according to the text and visual word distributions, and the label whose binary discrimination coefficient gives the largest dot product with the document's topic representation is taken as the event the test document belongs to. In this way the detection problem for large-scale multi-modal network media events can be well solved. Specifically, as shown in FIG. 2, the method proceeds according to the following steps:
step 1, acquiring a data set of a network media event, and preprocessing sentence segmentation, word shape restoration and part-of-speech tagging of text data of each document in the data set so as to construct a text dictionary;
step 2, image data corresponding to each document in the data set are subjected to blocking processing, each small block after blocking is used as a visual word, and the image characteristics of each visual word are extracted, so that a visual word dictionary is constructed;
and 3, linking out the knowledge entity in the network media event document, obtaining the vector representation of the knowledge entity in the knowledge graph by using a TransE algorithm, and describing the knowledge entity vector by adopting vMF distribution.
Step 4, constructing a classification loss function of the network media event by using the formula (1):
in the formula (1), q represents posterior distribution, L () represents upper bound of log likelihood of posterior distribution q, c represents regularization parameter, D represents number of files in data set, L represents category number of network media event, E q []Representing a mathematical expectation regarding the posterior distribution q,represent the firstd documents belong to the hinge loss function of the first category and have:
in the formula (2), eta l The discrimination coefficient representing the first class, the superscript T representing the transpose, iota representing the predefined cost parameter,represents the subject experience scale of document d, < ->A classification tag indicating whether the d-th document belongs to the first category, and having:
in the formula (3), y d An actual category label representing the document of the d-th paragraph;
step 5, the data generation process:
step 5.1, sampling the theme distribution parameter theta of the d document from the dirichlet distribution with a priori parameter alpha d
Step 5.2, for the kth topic:
(1) From a priori parameters beta w Word distribution of text modality corresponding to a sample dataset in dirichlet distribution
(2) From a priori parameters beta v Word distribution of visual mode corresponding to sampling data set in dirichlet distribution
(3) From a priori parameters (μ) 0 ,C 0 ) Sample position parameter μ in vMF distribution of (2) k
(4) From a priori parameters ofThe width parameter k of the sampled vMF distribution in the lognormal distribution of (a) k
Step 5.3, let u= (d, m 0 ) Mth representing the d-th document 0 Subscript of individual entity vectors:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) From the parameters ofSample m of the d document in vMF distribution 0 Personal entity vector e u
Step 5.4, let i= (d, m) 1 ) Mth representing the d-th document 1 Subscript of individual text words:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) According to the m th 1 Word w of individual text i From the parameters of the part of speech a priori pSampling S in a polynomial distribution of (1) p Second order m 1 Word w of individual text i
Step 5.5, let j= (d, m) 2 ) Mth representing the d-th document 2 Subscript of individual visual words:
(1) From the subject distribution parameter θ d Sampling a topic in a polynomial distribution of (a)
(2) From the parameters ofSample mth of the d-th document in polynomial distribution 2 Personal visual word v j
Step 5.6, sampling the actual class label y of the d-th document d
(1) For the discrimination coefficient η, the parameters are sequentially (0, σ) 2 ) Is sampled in its kth component eta in the normal distribution k
(2) From the parameters ofSampling the actual category labels y of the document of the d-th in the max-margin distribution d
Step 6, constructing a joint distribution q (eta, lambda, z, theta, phi) shown in the formula (4) by utilizing a generating process wv ):
In the formula (4), ψ (y, w, v, E) represents a normalization constant, wherein y represents a category variable, w represents a text word vector, v represents a visual word vector, and E represents a knowledge entity matrix; p is p 0 (η,z,θ,Φ wv ) Represents a priori distribution, where z represents the topic distribution vector, θ represents the topic scale, Φ w Parameter matrix, Φ, representing text word distribution v Parameter matrix representing visual word distribution, p (w, v, e|z, phi) wv ) Is a conditional probability of the generation process;is a posterior distribution representing category information, where λ is an augmentation variable;
step 7, obtaining the probability of the sampling entity vector theme by using the formula (5):
in the formula (5), the amino acid sequence of the compound,representing the probability of assigning the entity vector corresponding to the subscript u to the kth topic after assigning the topic of removing the entity vector corresponding to the subscript u>Representing the count under the kth topic in the d-th document after the topic count of the entity vector corresponding to the subscript u is removed; alpha is dirichlet pri; c (C) L (x) Representing a coefficient function of vMF distribution, anWherein I is L (. Cndot.) represents a modified L-th order Bessel function of the first class; the term represents a modulus of the vector; kappa (kappa) k Is a width parameter of vMF distribution; />Representing the sum of all entity vectors under the kth subject after removing entity vectors corresponding to the subscript u in the d-th document; />Representing the sum of all entity vectors assigned to the kth topic in the d-th document; (mu) 0 ,C 0 ) Is a priori parameter of vMF distribution;
step 8, sampling vMF distributed width parameters by using the formula (6):
in the formula (6), the amino acid sequence of the compound,entity vector counts representing the kth topic; logNormal (·) represents the probability density function of the logNormal distribution; />A priori parameters for a lognormal distribution;
step 9, sampling a discrimination coefficient eta by using the formula (7):
q(η|z,λ)∝N(μ,Σ) (7)
in equation (7), the prior of the discrimination coefficient η follows a gaussian distribution, i.e. p 0k )=N(0,σ 2 ) Wherein σ is a non-zero parameter; μ represents the mean, Σ represents the covariance matrix, and has:
in the formula (8), the amino acid sequence of the compound,representing the subject experience scale of the document of the d; the superscript T denotes a transpose; i represents an identity matrix.
Step 10, sampling the theme of the text word by using the formula (9):
in the formula (9), the amino acid sequence of the compound,a theme vector after the theme of the text word corresponding to the subscript i is removed in the text mode is represented; w (w) i =t 0 Representing text word w i T in corresponding text dictionary 0 A personal term; />Representing that the kth subject belongs to the kth subject after the subject count of text words corresponding to the subscript i is removed 0 Word count of individual terms; />Indicating removal of subscript iCounting under the kth topic in the d-th document after counting the topics corresponding to the text words; alpha and beta are dirichlet priors; />An augmentation value for the d-th document under the first category; />Representing the number of text words in the d-th document; η (eta) l,k A value representing a kth dimension of the discrimination vector corresponding to the ith category; />A discriminant function value representing the word excluding the subscript i, and +.>
Step 11: sample the topic of a visual word using formula (10):
In formula (10), the topic vector term represents the topic assignments of the visual modality after removing the topic of the visual word corresponding to subscript j; v_j = t_1 means that visual word v_j is the t_1-th term of the visual dictionary; the first count represents the word count of the t_1-th term under the k-th topic after removing the topic count of the visual word corresponding to subscript j; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the visual word corresponding to subscript j; the word-number term represents the number of visual words in the d-th document; the final term represents the discriminant function value excluding the word with subscript j.
Step 12: sample the augmentation variable λ_d of the d-th document using formula (11):
In formula (11), GIG(x; p, a, b) is the generalized inverse Gaussian distribution;
Step 13: during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality and the word distribution parameters of the visual modality using formula (12):
In formula (12), the three document-level counts are respectively the numbers of text words, visual words and entity vectors in the d-th document; K is the number of topics; M_w represents the length of the text dictionary and M_v represents the length of the visual dictionary; n_{d,k} represents the word and entity-vector count under the k-th topic in the d-th document; the remaining counts represent, in the text modality, the word count of the t_0-th term under the k-th topic and the total word count under the k-th topic, and, in the visual modality, the word count of the t_1-th term under the k-th topic and the total word count under the k-th topic;
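Formula (12) itself is not legible in this extraction. For reference, the standard Dirichlet-multinomial point estimates consistent with the counts described above take the following form (a hedged reconstruction, not necessarily the patent's exact expression):

```latex
\theta_{d,k} = \frac{n_{d,k} + \alpha}{\sum_{k'=1}^{K}\left(n_{d,k'} + \alpha\right)}, \qquad
\phi^{w}_{k,t_0} = \frac{n^{w}_{k,t_0} + \beta}{n^{w}_{k} + M_w\,\beta}, \qquad
\phi^{v}_{k,t_1} = \frac{n^{v}_{k,t_1} + \beta}{n^{v}_{k} + M_v\,\beta}
```

Here n_{d,k}, n^w_{k,t_0}, n^w_k, n^v_{k,t_1} and n^v_k denote the counts defined in the explanation of formula (12).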
Step 14: using formula (13), predict the document as belonging to the single category with the largest discriminant function value:
In formula (13), L is the number of categories.
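Step 14 reduces to an argmax over per-category linear scores. A minimal sketch with illustrative names (eta stands for the L×K matrix of discrimination coefficients, z_bar for a document's empirical topic proportion; both names are assumptions, not the patent's notation):

```python
def predict_category(eta, z_bar):
    """Predict the single category with the largest discriminant value,
    i.e. argmax over l of the inner product eta[l] . z_bar."""
    scores = [sum(e * z for e, z in zip(eta_l, z_bar)) for eta_l in eta]
    return max(range(len(scores)), key=scores.__getitem__)
```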
In summary, the method addresses the limited interpretability of existing topic-model-based network media event detection methods and handles large-scale multi-modal network media event detection well.

Claims (1)

1. A category detection method for network media events based on a semantic- and knowledge-extended topic model, characterized by comprising:
Step 1: acquire a dataset of network media events, and preprocess the text data of each document in the dataset with sentence segmentation, lemmatization and part-of-speech tagging, so as to construct a text dictionary;
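The dictionary-building part of this step can be sketched as follows. This is a minimal illustration in which a regex tokenizer stands in for the real sentence segmentation, lemmatization and POS-tagging pipeline (which would normally come from an NLP toolkit); `min_count` is an assumed frequency cutoff, not a parameter named in the claim:

```python
import re
from collections import Counter

def build_text_dictionary(documents, min_count=1):
    """Tokenize each document's text and assign every surviving term an
    integer id (term -> id), yielding the text dictionary of step 1."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    terms = sorted(t for t, c in counts.items() if c >= min_count)
    return {t: i for i, t in enumerate(terms)}
```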
Step 2: partition the image data corresponding to each document in the dataset into blocks, take each block as a visual word, and extract the image feature of each visual word so as to construct a visual dictionary;
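The blocking part of step 2 can be sketched as cutting an image into fixed-size patches, each of which becomes one visual word. This is a minimal sketch on a plain 2-D list of pixel values; the feature extraction (e.g. a descriptor or CNN) and the vector quantization into a visual dictionary are left out, and the patch size is an assumed parameter:

```python
def image_to_patches(image, patch, stride=None):
    """Cut a 2-D image (list of rows) into patch x patch blocks; each block
    is returned as a flattened pixel list and would serve as one visual word."""
    stride = stride or patch  # non-overlapping blocks by default
    h, w = len(image), len(image[0])
    patches = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch) for dc in range(patch)])
    return patches
```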
Step 3: construct the classification loss function of the network media event using formula (1):
In formula (1), q represents the posterior distribution, L(·) represents the upper bound of the log-likelihood of the posterior distribution q, c represents the regularization parameter, D represents the number of documents in the dataset, L represents the number of categories of network media events, E_q[·] represents the mathematical expectation with respect to the posterior distribution q, and the hinge-loss term represents the hinge loss of the d-th document under the l-th category, with:
In formula (2), η_l represents the discrimination coefficient of the l-th category, the superscript T represents a transpose, ℓ represents the predefined cost parameter, the topic term represents the empirical topic proportion of document d, and the label term indicates whether the d-th document belongs to the l-th category, with:
In formula (3), y_d represents the actual category label of the d-th document;
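The per-document, per-category hinge loss of formula (2) can be sketched directly. All names here are illustrative assumptions: eta_l is one category's discrimination vector, z_bar the document's empirical topic proportion, y_dl the ±1 label of formula (3), and cost the predefined cost parameter ℓ:

```python
def hinge_loss(eta_l, z_bar, y_dl, cost=1.0):
    """Max-margin hinge loss of one document under one category:
    max(0, cost - y_dl * (eta_l . z_bar))."""
    score = sum(e * z for e, z in zip(eta_l, z_bar))
    return max(0.0, cost - y_dl * score)
```

A correctly classified document with a margin above the cost parameter contributes zero loss; a misclassified one contributes linearly in the violated margin.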
Step 4: the data generation process:
Step 4.1: sample the topic distribution parameter θ_d of the d-th document from a Dirichlet distribution with prior parameter α;
Step 4.2: for the k-th topic:
(1) sample the word distribution of the text modality corresponding to the dataset from a Dirichlet distribution with prior parameter β_w;
(2) sample the word distribution of the visual modality corresponding to the dataset from a Dirichlet distribution with prior parameter β_v;
(3) sample the position parameter μ_k from a vMF distribution with prior parameters (μ_0, C_0);
(4) sample the width parameter κ_k of the vMF distribution from a lognormal distribution with the given prior parameters;
Step 4.3: let u = (d, m_0) denote the subscript of the m_0-th entity vector of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) sample the m_0-th entity vector e_u of the d-th document from the vMF distribution with the sampled parameters;
Step 4.4: let i = (d, m_1) denote the subscript of the m_1-th text word of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) according to the part of speech of the m_1-th text word w_i, sample the m_1-th text word w_i S_p times from the multinomial distribution with the part-of-speech prior p;
Step 4.5: let j = (d, m_2) denote the subscript of the m_2-th visual word of the d-th document:
(1) sample a topic from the multinomial distribution with topic distribution parameter θ_d;
(2) sample the m_2-th visual word v_j of the d-th document from the multinomial distribution with the sampled parameters;
Step 4.6: sample the actual category label y_d of the d-th document:
(1) for the discrimination coefficient η, sample each k-th component η_k in turn from a normal distribution with parameters (0, σ²);
(2) sample the actual category label y_d of the d-th document from the max-margin distribution with the given parameters;
Step 5: using the generation process, construct the joint distribution q(η, λ, z, θ, Φ_w, Φ_v) shown in formula (4):
In formula (4), ψ(y, w, v, E) represents a normalization constant, where y represents the category variable, w the text word vector, v the visual word vector, and E the knowledge-entity matrix; p_0(η, z, θ, Φ_w, Φ_v) represents the prior distribution, where z represents the topic distribution vector, θ the topic proportions, Φ_w the parameter matrix of the text word distribution, and Φ_v the parameter matrix of the visual word distribution; p(w, v, e | z, Φ_w, Φ_v) is the conditional probability of the generation process; the remaining factor is the posterior distribution carrying the category information, where λ is the augmentation variable;
Step 6: obtain the probability of sampling an entity vector's topic using formula (5):
In formula (5), the left-hand side represents the probability of assigning the entity vector corresponding to subscript u to the k-th topic, given all other assignments with that entity vector's topic removed; the count term represents the count under the k-th topic in the d-th document after removing the topic count of the entity vector corresponding to subscript u; α is the Dirichlet prior; C_L(·) represents the coefficient function of the vMF distribution, and ‖·‖ denotes the norm of a vector; κ_k is the width parameter of the vMF distribution; e_ii represents the ii-th entity vector in the d-th document; (μ_0, C_0) are the prior parameters of the vMF distribution;
Step 7: sample the width parameter of the vMF distribution using formula (6):
In formula (6), the count term represents the entity-vector count of the k-th topic; logNormal(·) represents the probability density function of the lognormal distribution; the remaining quantities are the prior parameters of the lognormal distribution;
Step 8: sample the discrimination coefficient η using formula (7):
q(η | z, λ) ∝ N(μ, Σ)    (7)
In formula (7), the prior of the discrimination coefficient η follows a Gaussian distribution, i.e. p_0(η_k) = N(0, σ²), where σ is a non-zero parameter; μ represents the mean and Σ represents the covariance matrix, with:
In formula (8), the topic term represents the empirical topic proportion of the d-th document; the superscript T denotes a transpose; I represents the identity matrix;
Step 9: sample the topic of a text word using formula (9):
In formula (9), the topic vector term represents the topic assignments of the text modality after removing the topic of the text word corresponding to subscript i; w_i = t_0 means that text word w_i is the t_0-th term of the text dictionary; the first count represents the word count of the t_0-th term under the k-th topic after removing the topic count of the text word corresponding to subscript i; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the text word corresponding to subscript i; α and β are Dirichlet priors; the augmentation term is the augmentation value of the d-th document under the l-th category; the word-number term represents the number of text words in the d-th document; η_{l,k} represents the value of the k-th dimension of the discrimination vector corresponding to the l-th category; the final term represents the discriminant function value excluding the word with subscript i.
Step 10: sample the topic of a visual word using formula (10):
In formula (10), the topic vector term represents the topic assignments of the visual modality after removing the topic of the visual word corresponding to subscript j; v_j = t_1 means that visual word v_j is the t_1-th term of the visual dictionary; the first count represents the word count of the t_1-th term under the k-th topic after removing the topic count of the visual word corresponding to subscript j; the second count represents the count under the k-th topic in the d-th document after removing the topic count of the visual word corresponding to subscript j; the word-number term represents the number of visual words in the d-th document; the final term represents the discriminant function value excluding the word with subscript j.
Step 11: sample the augmentation variable λ_d of the d-th document using formula (11):
In formula (11), GIG(x; p, a, b) is the generalized inverse Gaussian distribution;
Step 12: during Gibbs sampling, estimate the topic distribution parameter θ_d, the word distribution parameters of the text modality and the word distribution parameters of the visual modality using formula (12):
In formula (12), the three document-level counts are respectively the numbers of text words, visual words and entity vectors in the d-th document; K is the number of topics; M_w represents the length of the text dictionary and M_v represents the length of the visual dictionary; n_{d,k} represents the word and entity-vector count under the k-th topic in the d-th document; the remaining counts represent, in the text modality, the word count of the t_0-th term under the k-th topic and the total word count under the k-th topic, and, in the visual modality, the word count of the t_1-th term under the k-th topic and the total word count under the k-th topic;
Step 13: using formula (13), predict the document as belonging to the single category with the largest discriminant function value:
In formula (13), L is the number of categories.
CN202110366951.0A 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model Active CN113051932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110366951.0A CN113051932B (en) 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model

Publications (2)

Publication Number Publication Date
CN113051932A CN113051932A (en) 2021-06-29
CN113051932B (en) 2023-11-03

Family

Family ID: 76517588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110366951.0A Active CN113051932B (en) 2021-04-06 2021-04-06 Category detection method for network media event of semantic and knowledge expansion theme model

Country Status (1)

Country Link
CN (1) CN113051932B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343679B (en) * 2021-07-06 2024-02-13 合肥工业大学 Multi-mode subject mining method based on label constraint
CN113836939B (en) * 2021-09-24 2023-07-21 北京百度网讯科技有限公司 Text-based data analysis method and device
US12026199B1 (en) * 2022-03-09 2024-07-02 Amazon Technologies, Inc. Generating description pages for media entities
CN117808104B (en) * 2024-02-29 2024-04-30 南京邮电大学 Viewpoint mining method based on self-supervision expression learning and oriented to hot topics

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN111368068A (en) * 2020-03-18 2020-07-03 江苏鸿程大数据技术与应用研究院有限公司 Short text topic modeling method based on part-of-speech feature and semantic enhancement

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9176969B2 (en) * 2013-08-29 2015-11-03 Hewlett-Packard Development Company, L.P. Integrating and extracting topics from content of heterogeneous sources
US11636355B2 (en) * 2019-05-30 2023-04-25 Baidu Usa Llc Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN111368068A (en) * 2020-03-18 2020-07-03 江苏鸿程大数据技术与应用研究院有限公司 Short text topic modeling method based on part-of-speech feature and semantic enhancement

Non-Patent Citations (2)

Title
Short-text rumor analysis based on a multi-label biterm topic model; Wu Qingyuan, He Lingnan; Journal of Intelligence (No. 03); full text *
Social media topic detection based on word embedding and probabilistic topic models; Yu Chong, Li Jing, Sun Xudong, Fu Xianghua; Computer Engineering (No. 12); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant