CN113343679A - Multi-modal topic mining method based on label constraint - Google Patents

Multi-modal topic mining method based on label constraint

Info

Publication number
CN113343679A
CN113343679A (application number CN202110762186.4A)
Authority
CN
China
Prior art keywords
document
text
topic
label
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110762186.4A
Other languages
Chinese (zh)
Other versions
CN113343679B (en)
Inventor
姜元春
李浩
钱洋
柴一栋
刘业政
孙见山
周凡
袁昆
梁瑞成
陶守正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN202110762186.4A
Publication of CN113343679A
Application granted
Publication of CN113343679B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal topic mining method based on label constraint, which comprises the following steps: 1. constructing a data set of multi-modal documents; 2. modeling the document label-topic distribution; 3. modeling the text label topics and visual label topics in the documents; 4. establishing a multi-modal topic model based on label constraint; and 5. learning the parameters with a collapsed Gibbs sampling algorithm. When dealing with tagged, associated text and image data, the method can learn multi-modal topics quickly and accurately, thereby providing favorable support for data mining tasks such as recommendation and retrieval.

Description

Multi-modal topic mining method based on label constraint
Technical Field
The invention relates to the technical field of topic mining of multi-modal data, in particular to a multi-modal topic mining method based on label constraint.
Background
The data mining task is a typical data-driven process, and a large amount of data is of great significance for learning accurate results. With the rapid development of internet technology and the widespread use of various website platforms (e.g., Facebook, Twitter), the amount of multi-modal data is increasing. Some typical websites, such as Weibo, Sina and Taobao, not only allow users to upload and share their multi-modal data, but also allow them to provide relevant semantic description tags. Moreover, associated text and images not only correspond well to each other, but also make each other's semantic content easier to understand. For some data mining tasks, such as recommendation, image retrieval and classification, jointly modeling tagged text and images is necessary.
In recent years, there has been increasing research on data mining. For example, the document [Understanding Large-Scale Dynamic Purchase Behavior, 2021] performs topic modeling on consumers' historical purchase data to understand their purchase behavior; the document [Probabilistic Topic Model for Hybrid Recommender Systems: A Stochastic Variational Bayesian Approach, 2018] performs topic modeling on product data, concisely describes products in terms of hidden topics, and discovers consumer preferences through the topics so as to design a recommender system; the document [Discriminative Sketch Topic Model With Structural Constraint for SAR Image Classification, 2020] classifies radar images using a topic model; the document [Online Multi-modal Multi-expert Learning for Social Event Tracking, 2018] analyzes media data using a multi-modal topic model and automatically identifies events; the document [Image Tag Refinement by Regularized Latent Dirichlet Allocation, 2014] refines tags using a topic model to accomplish the image retrieval task. However, none of these methods can process tagged, associated text and image data. Furthermore, learning large-scale data through the Gibbs sampling algorithm results in a slow learning process.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the invention provides a multi-modal topic mining method based on label constraint, so that multi-modal topics can be learned rapidly and accurately when dealing with large-scale multi-modal data, improving the speed and accuracy of data mining.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention relates to a multi-modal topic mining method based on label constraint, which is characterized by comprising the following steps of:
Step 1, construct a data set D of multi-modal documents;
Step 1.1, construct the text content set of the multi-modal documents, denoted W = {W_1, ..., W_m, ..., W_M}, where W_m = {w_{m,1}, ..., w_{m,t}, ..., w_{m,N_m}} represents the text data of the m-th piece of text content, w_{m,t} is the t-th text word in the m-th piece of text content, N_m is the number of words in the m-th piece of text content, and M is the number of multi-modal documents;
Step 1.2, construct the visual content set of the multi-modal documents, denoted V = {V_1, ..., V_m, ..., V_M}, where V_m = {v_{m,1}, ..., v_{m,p}, ..., v_{m,L_m}} represents the image data of the m-th piece of visual content, v_{m,p} is the p-th visual word in the m-th piece of visual content, and L_m is the number of visual words in the m-th piece of visual content;
Step 1.3, construct the label content set of the multi-modal documents, denoted Λ = {Λ_1, Λ_2, ..., Λ_m, ..., Λ_M}, where Λ_m is the set of labels of the m-th multi-modal document; define the label space as {1, 2, ..., l, ..., L}, where L is the number of distinct labels and l is any label index in the label space;
Step 1.4, construct the data set D = {W, V, Λ}, comprising the text content set W, the visual content set V and the label content set Λ;
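For concreteness, the data set of step 1 can be held in plain containers. The following is a minimal sketch, not part of the patent; all class and field names are hypothetical, and words are assumed to be pre-mapped to vocabulary indices:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class MultimodalDocument:
    text_words: List[int]    # w_{m,1..N_m}: indices into a text vocabulary of size T
    visual_words: List[int]  # v_{m,1..L_m}: indices into a visual vocabulary of size P
    labels: Set[int]         # Λ_m: a subset of the label space {0, ..., L-1}

@dataclass
class Dataset:
    docs: List[MultimodalDocument] = field(default_factory=list)

    @property
    def M(self) -> int:
        # number of multi-modal documents
        return len(self.docs)
```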
Step 2, model the label-topic distribution of the multi-modal documents;
Define the topic distributions of the multi-modal documents as θ = {θ_1, ..., θ_m, ..., θ_M}, where θ_m represents the topic distribution of the m-th multi-modal document and obeys a Dirichlet distribution with parameter α;
Define the topic distribution of label j in the m-th multi-modal document as θ_{m,j} = {θ_{m,j,1}, ..., θ_{m,j,k}, ..., θ_{m,j,K_j}}, where K_j is the number of topics associated with label j and θ_{m,j,k} is the interest weight of label j in the m-th multi-modal document on the k-th topic, k ∈ {1, 2, ..., K_j}, j ∈ Λ_m; each label can be associated with multiple topics, but each topic can only be assigned to one label;
Step 3, model the text label topics and the visual label topics in the multi-modal documents;
Step 3.1, set the number of topics in the multi-modal documents to K;
Step 3.2, define the text probability distribution of the k-th topic under label j as φ^w_{j,k} = {φ^w_{j,k,1}, ..., φ^w_{j,k,t}, ..., φ^w_{j,k,T}}, which obeys a Dirichlet distribution with parameter β^w, where T is the number of distinct text words in the text content set and φ^w_{j,k,t} is the interest weight of the k-th topic under label j on the t-th text word;
Step 3.3, define the visual probability distribution of the k-th topic under label j as φ^v_{j,k} = {φ^v_{j,k,1}, ..., φ^v_{j,k,p}, ..., φ^v_{j,k,P}}, which obeys a Dirichlet distribution with parameter β^v, where P is the number of distinct visual words in the visual content set and φ^v_{j,k,p} is the interest weight of the k-th topic under label j on the p-th visual word;
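The priors of steps 2-3 are ordinary Dirichlet draws; the following is a sketch with numpy, assuming symmetric scalar hyper-parameters and purely illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta_w, beta_v = 0.1, 0.01, 0.01   # symmetric hyper-parameters (assumed scalars)
K_j, T, P = 5, 10_000, 2_000              # example sizes only

# theta_{m,j} ~ Dirichlet(alpha): topic weights of label j in document m
theta_mj = rng.dirichlet(np.full(K_j, alpha))
# phi^w_{j,k} ~ Dirichlet(beta_w): text-word distribution of topic k under label j
phi_w_jk = rng.dirichlet(np.full(T, beta_w))
# phi^v_{j,k} ~ Dirichlet(beta_v): visual-word distribution of topic k under label j
phi_v_jk = rng.dirichlet(np.full(P, beta_v))
```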
Step 4, establish the multi-modal topic model based on label constraint;
Step 4.1, define the topic assignments of all text words in the m-th multi-modal document as z^w_m = {z^w_{m,1}, ..., z^w_{m,t}, ..., z^w_{m,N_m}}, where z^w_{m,t} is the topic assignment of the t-th text word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^w_{m,t} form a Dirichlet-multinomial conjugate pair; define λ_{m,t} = j to denote that the t-th text word w_{m,t} in the m-th multi-modal document belongs to label j;
Step 4.2, define the topic assignments of all visual words in the m-th multi-modal document as z^v_m = {z^v_{m,1}, ..., z^v_{m,p}, ..., z^v_{m,L_m}}, where z^v_{m,p} is the topic assignment of the p-th visual word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^v_{m,p} form a Dirichlet-multinomial conjugate pair; define λ_{m,p} = j to denote that the p-th visual word v_{m,p} in the m-th multi-modal document belongs to label j;
Step 5, apply the collapsed Gibbs sampling method to learn the three interest weights θ_{m,j,k}, φ^w_{j,k,t} and φ^v_{j,k,p};
Step 5.1, compute the joint probability distribution p(W, V, z^w, z^v, l | α, β^w, β^v) of the observed text words W, the unobserved labels l and the topic assignments z^w of the text content of all multi-modal documents using formula (1):

p(W, V, z^w, z^v, l | α, β^w, β^v) = p(W | z^w, l, β^w) · p(V | z^v, l, β^v) · p(z^w, z^v, l | α)   (1)

In formula (1), z^v denotes the topic assignments of the visual content of all multi-modal documents and α denotes a hyper-parameter;
Step 5.1.1, compute the generation probability p(W | z^w, l, β^w) of all text words in the multi-modal documents using formula (2):

p(W | z^w, l, β^w) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(n_{·,j,k} + β^w) / Δ(β^w)   (2)

In formula (2), n_{·,j,k,b} denotes the number of times text word b is generated by the k-th topic under label j, n_{·,j,k} = (n_{·,j,k,1}, ..., n_{·,j,k,T}), and β^w inside Δ(·) is understood as the T-dimensional symmetric vector; Δ is the operator such that for any K-dimensional vector X = (x_1, ..., x_K), Δ(X) = ∏_{k=1}^{K} Γ(x_k) / Γ(∑_{k=1}^{K} x_k), where x_k is the k-th component of X and Γ(·) is the gamma function;
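In implementations, the Δ operator of formula (2) is usually evaluated in log space. A sketch with scipy (an assumption; the patent does not prescribe a library):

```python
import numpy as np
from scipy.special import gammaln  # log Gamma function

def log_delta(x: np.ndarray) -> float:
    """log Δ(X) = Σ_k log Γ(x_k) − log Γ(Σ_k x_k), for positive components."""
    return float(np.sum(gammaln(x)) - gammaln(np.sum(x)))

# e.g. the contribution of one (j, k) pair to log p(W | z^w, l, β^w):
# log_delta(n_jk + beta_w) - log_delta(np.full(T, beta_w))
```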
Step 5.1.2, compute the generation probability p(V | z^v, l, β^v) of all visual words in the multi-modal documents using formula (3):

p(V | z^v, l, β^v) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(d_{·,j,k} + β^v) / Δ(β^v)   (3)

In formula (3), d_{·,j,k,c} denotes the number of times visual word c is generated by the k-th topic under label j, and d_{·,j,k} = (d_{·,j,k,1}, ..., d_{·,j,k,P});
Step 5.1.3, compute the generation probability p(z^w, z^v, l | α) of the label topics of all multi-modal documents using formula (4):

p(z^w, z^v, l | α) = ∏_{m=1}^{M} Δ(n_m + d_m + α) / Δ(α)   (4)

In formula (4), n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document, d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document, and n_m + d_m collects these counts over all label-topic pairs (j, k) with j ∈ Λ_m;
Step 5.2, compute the probability that the t-th text word e in the m-th multi-modal document is assigned to the k-th topic under label j using formula (5):

p(λ_{m,t} = j, z^w_{m,t} = k | l^{¬(m,t)}, z^{w,¬(m,t)}, w_{m,t} = e, ·) ∝ I(j ∈ Λ_m) · (n^{¬(m,t)}_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n^{¬(m,t)}_{·,j,k,b} + T·β^w) · (n^{¬(m,t)}_{m,j,k,·} + d_{m,j,k,·} + α)   (5)

In formula (5), ∝ denotes "proportional to" and I(·) denotes the indicator function; λ_{m,t} = j indicates that the label of the t-th text word in the m-th multi-modal document is j; z^w_{m,t} = k indicates that the topic assignment of the t-th text word in the m-th multi-modal document is k; l^{¬(m,t)} denotes the labels of all text words except the t-th text word in the m-th multi-modal document; z^{w,¬(m,t)} denotes the topic assignments of all text words except the t-th text word in the m-th multi-modal document; w_{m,t} = e means that the t-th text word in the m-th multi-modal document is e; n^{¬(m,t)}_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j, excluding the t-th text word of the m-th multi-modal document; n^{¬(m,t)}_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in document m, excluding the t-th text word of the m-th multi-modal document; d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document;
Step 5.3, compute the probability that the p-th visual word f in the m-th multi-modal document is assigned to the k-th topic under label j using formula (6):

p(λ_{m,p} = j, z^v_{m,p} = k | l^{¬(m,p)}, z^{v,¬(m,p)}, v_{m,p} = f, ·) ∝ I(j ∈ Λ_m) · (d^{¬(m,p)}_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d^{¬(m,p)}_{·,j,k,c} + P·β^v) · (n_{m,j,k,·} + d^{¬(m,p)}_{m,j,k,·} + α)   (6)

In formula (6), z^v_{m,p} = k indicates that the topic assignment of the p-th visual word in the m-th multi-modal document is k; l^{¬(m,p)} denotes the labels of all visual words except the p-th visual word in the m-th multi-modal document; z^{v,¬(m,p)} denotes the topic assignments of all visual words except the p-th visual word in the m-th multi-modal document; v_{m,p} = f means that the p-th visual word in the m-th multi-modal document is f; d^{¬(m,p)}_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j, excluding the p-th visual word of the m-th multi-modal document; d^{¬(m,p)}_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in document m, excluding the p-th visual word of the m-th multi-modal document; n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document;
Step 5.4, repeat step 5.2 and step 5.3 in a loop, assigning label topics to all text words and visual words in the multi-modal documents by the collapsed Gibbs sampling method until the iteration condition is met;
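A minimal sketch of one update of step 5.2 under formula (5); array names and layouts (global topic indices with a per-topic label lookup) are assumptions, not the patent's notation:

```python
import numpy as np

def sample_text_word(m, t, e, z_w, n_word, n_doc, d_doc,
                     label_of_topic, doc_labels, alpha, beta_w, T, rng):
    """One collapsed Gibbs update (formula (5)) for text word e at position t
    of document m. n_word: K x T topic-word counts; n_doc, d_doc: M x K
    per-document text/visual topic counts; label_of_topic[k] is the unique
    label owning topic k; doc_labels[m] is the label set Λ_m."""
    K = n_word.shape[0]
    k_old = z_w[m][t]
    # remove the current assignment to obtain the "¬(m,t)" counts
    n_word[k_old, e] -= 1
    n_doc[m, k_old] -= 1
    probs = np.zeros(K)
    for k in range(K):
        if label_of_topic[k] not in doc_labels[m]:
            continue  # the indicator I(j ∈ Λ_m) in formula (5)
        word_term = (n_word[k, e] + beta_w) / (n_word[k].sum() + T * beta_w)
        doc_term = n_doc[m, k] + d_doc[m, k] + alpha
        probs[k] = word_term * doc_term
    k_new = rng.choice(K, p=probs / probs.sum())
    # record the new assignment and restore the counts
    n_word[k_new, e] += 1
    n_doc[m, k_new] += 1
    z_w[m][t] = k_new
```

The visual-word update of formula (6) is symmetric, with the visual count arrays, β^v and P in place of n_word, β^w and T.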
Step 5.5, compute the interest weight θ_{m,j,k} of the k-th topic under label j in the m-th multi-modal document using formula (7):

θ_{m,j,k} = (n_{m,j,k,·} + d_{m,j,k,·} + α) / ∑_{j'∈Λ_m} ∑_{k'=1}^{K_{j'}} (n_{m,j',k',·} + d_{m,j',k',·} + α)   (7)

Step 5.6, compute the interest weight φ^w_{j,k,e} of the k-th topic under label j on text word e using formula (8):

φ^w_{j,k,e} = (n_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n_{·,j,k,b} + T·β^w)   (8)

In formula (8), n_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j;
Step 5.7, compute the interest weight φ^v_{j,k,f} of the k-th topic under label j on visual word f using formula (9):

φ^v_{j,k,f} = (d_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d_{·,j,k,c} + P·β^v)   (9)

In formula (9), d_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j;
The topic distributions of the multi-modal documents, the text topic-word distributions and the visual topic-word distributions obtained from these interest weights are taken as the topic mining result.
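The final estimates of formulas (7)-(9) are simple normalized counts; a sketch consistent with the sampler above, whose array layouts are assumptions:

```python
import numpy as np

def estimate_parameters(n_word, d_word, n_doc, d_doc, alpha, beta_w, beta_v):
    """Formulas (7)-(9) from the final count arrays.
    n_word: K x T, d_word: K x P, n_doc and d_doc: M x K."""
    T, P = n_word.shape[1], d_word.shape[1]
    # formula (7); for brevity this normalizes over all K topics, whereas the
    # patent sums only over the topics of the labels in Λ_m
    theta = n_doc + d_doc + alpha
    theta = theta / theta.sum(axis=1, keepdims=True)
    # formula (8)
    phi_w = (n_word + beta_w) / (n_word.sum(axis=1, keepdims=True) + T * beta_w)
    # formula (9)
    phi_v = (d_word + beta_v) / (d_word.sum(axis=1, keepdims=True) + P * beta_v)
    return theta, phi_w, phi_v
```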
Compared with the prior art, the invention has the beneficial effects that:
1. The invention jointly models text, image and label information, i.e., tagged, associated text-image data sets, which makes it more convenient and valuable in practical applications. Secondly, the multi-modal topics learned by the topic model can be associated with the labels, which effectively narrows the semantic gap and facilitates understanding and interpreting the meaning of the topics. In addition, visual information is integrated into the model, giving the model good interpretability.
2. A collapsed Gibbs sampling method is designed, making the approach more efficient, more accurate and easier to scale to big data; when dealing with large-scale multi-modal data, valuable topics can be learned more quickly.
Drawings
FIG. 1 is a probability model diagram of a multi-modal topic mining method based on tag constraint according to the present invention.
FIG. 2 is a flow chart of the collapsed Gibbs sampling algorithm of the multi-modal topic mining method based on label constraint.
Detailed Description
In this embodiment, a multi-modal topic mining method based on label constraint is designed for tagged, associated text-and-image data sets. Labels are introduced into the topic model as supervision information, a background label covering the whole data set is introduced, and a collapsed Gibbs sampling method is adopted to approximate the model, so that valuable multi-modal topics can be learned and applied to data mining tasks such as recommendation or classification. The specific steps are as follows:
Step 1, construct a data set D of multi-modal documents;
The probability model diagram shown in FIG. 1 uses the following symbols: W denotes the text content set, V the visual content set and Λ the label content set; l is any label index in the label space; M is the number of multi-modal documents; N_m is the number of words in the m-th piece of text content; L_m is the number of words in the m-th piece of visual content; K is the number of topics in the multi-modal documents; K_j is the number of topics associated with label j; θ denotes the topic distributions of the multi-modal documents; φ^w is a K × T matrix denoting the text topic-word distributions; φ^v is a K × P matrix denoting the visual topic-word distributions; α is the parameter of the document topic distributions, β^w is the text topic-word distribution parameter and β^v is the visual topic-word distribution parameter.
Step 1.1, construct the text content set of the multi-modal documents, denoted W = {W_1, ..., W_m, ..., W_M}, where W_m = {w_{m,1}, ..., w_{m,t}, ..., w_{m,N_m}} represents the text data of the m-th piece of text content, w_{m,t} is the t-th text word in the m-th piece of text content, N_m is the number of words in the m-th piece of text content, and M is the number of multi-modal documents;
Step 1.2, construct the visual content set of the multi-modal documents. Each image is encoded by the statistical frequency of the visual words it contains, using a bag-of-visual-words (BOVW) model. Since visual words are not as readily available in an image as words are in text, they need to be extracted from the image. The scale-invariant feature transform (SIFT) algorithm is currently the most widely used algorithm for extracting local invariant features from images; therefore, the SIFT algorithm is used to extract invariant feature points from the images as visual words. The specific steps of representing an image with the BOVW model are as follows (a code sketch follows the set notation below):
Step 1: extract feature points from the image, which is important for understanding the image. Visual features are extracted from the image using the SIFT algorithm, and all visual features are labeled.
Step 2: after feature extraction is finished, build a dictionary for the extracted image feature information by dictionary learning. To make the dictionary representative and effective, a large number of samples are randomly selected from the images in the data set, and the dictionary is then learned by K-means clustering.
According to the feature points extracted by SIFT, H cluster centers are selected at random and the K-means clustering algorithm is iterated until convergence, finally yielding H cluster centers. Each cluster center is a visual word, and together they form the visual dictionary. The sum of squared Euclidean distances is defined as the distance measure of the K-means clustering algorithm.
Step 3: through dictionary learning, a vocabulary for image feature representation is obtained. With the SIFT algorithm, a number of feature points can be extracted from each image, and each feature point can be approximately replaced by a visual word in the dictionary. Each image can thus be converted into a visual histogram, in which the abscissa represents the visual words and the ordinate represents the number of times each visual word occurs.
The visual content set is denoted V = {V_1, ..., V_m, ..., V_M}, where V_m = {v_{m,1}, ..., v_{m,p}, ..., v_{m,L_m}} represents the image data of the m-th piece of visual content, v_{m,p} is the p-th visual word in the m-th piece of visual content, and L_m is the number of visual words in the m-th piece of visual content;
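A sketch of this BOVW pipeline, assuming OpenCV (with SIFT support) and scikit-learn are available; H and all function names are illustrative:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(image_paths, H=500):
    """Steps 1-2: extract SIFT descriptors and cluster them into H visual words."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)
    all_desc = np.vstack(descriptors)
    # K-means with the squared Euclidean distance, as in the text above
    return KMeans(n_clusters=H, n_init=10, random_state=0).fit(all_desc)

def image_to_visual_words(path, kmeans):
    """Step 3: map each SIFT feature point to its nearest visual word."""
    sift = cv2.SIFT_create()
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    return kmeans.predict(desc) if desc is not None else np.array([], dtype=int)
```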
Step 1.3, construct the label content set of the multi-modal documents, denoted Λ = {Λ_1, Λ_2, ..., Λ_m, ..., Λ_M}, where Λ_m is the set of labels of the m-th multi-modal document; define the label space as {1, 2, ..., l, ..., L}, where L is the number of distinct labels and l is any label index in the label space; the label space additionally contains a global hidden background label B;
Step 1.4, construct the data set D = {W, V, Λ}, comprising the text content set W, the visual content set V and the label content set Λ;
Step 2, model the label-topic distribution of the multi-modal documents;
Define the topic distributions of the multi-modal documents as θ = {θ_1, ..., θ_m, ..., θ_M}, where θ_m represents the topic distribution of the m-th multi-modal document and obeys a Dirichlet distribution with parameter α;
Define the topic distribution of label j in the m-th multi-modal document as θ_{m,j} = {θ_{m,j,1}, ..., θ_{m,j,k}, ..., θ_{m,j,K_j}}, where K_j is the number of topics associated with label j and θ_{m,j,k} is the interest weight of label j in the m-th multi-modal document on the k-th topic, k ∈ {1, 2, ..., K_j}, j ∈ Λ_m; each label can be associated with multiple topics, but each topic can only be assigned to one label;
Step 3, model the text label topics and the visual label topics in the multi-modal documents;
Step 3.1, set the number of topics in the multi-modal documents to K;
Step 3.2, define the text probability distribution of the k-th topic under label j as φ^w_{j,k} = {φ^w_{j,k,1}, ..., φ^w_{j,k,t}, ..., φ^w_{j,k,T}}, which obeys a Dirichlet distribution with parameter β^w, where T is the number of distinct text words in the text content set and φ^w_{j,k,t} is the interest weight of the k-th topic under label j on the t-th text word;
Step 3.3, define the visual probability distribution of the k-th topic under label j as φ^v_{j,k} = {φ^v_{j,k,1}, ..., φ^v_{j,k,p}, ..., φ^v_{j,k,P}}, which obeys a Dirichlet distribution with parameter β^v, where P is the number of distinct visual words in the visual content set and φ^v_{j,k,p} is the interest weight of the k-th topic under label j on the p-th visual word;
Step 4, establish the multi-modal topic model based on label constraint;
Step 4.1, define the topic assignments of all text words in the m-th multi-modal document as z^w_m = {z^w_{m,1}, ..., z^w_{m,t}, ..., z^w_{m,N_m}}, where z^w_{m,t} is the topic assignment of the t-th text word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^w_{m,t} form a Dirichlet-multinomial conjugate pair; define λ_{m,t} = j to denote that the t-th text word w_{m,t} in the m-th multi-modal document belongs to label j;
Step 4.2, define the topic assignments of all visual words in the m-th multi-modal document as z^v_m = {z^v_{m,1}, ..., z^v_{m,p}, ..., z^v_{m,L_m}}, where z^v_{m,p} is the topic assignment of the p-th visual word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^v_{m,p} form a Dirichlet-multinomial conjugate pair; define λ_{m,p} = j to denote that the p-th visual word v_{m,p} in the m-th multi-modal document belongs to label j;
Step 5, apply the collapsed Gibbs sampling method to learn the three interest weights θ_{m,j,k}, φ^w_{j,k,t} and φ^v_{j,k,p};
Compute the joint probability distribution p(W, V, z^w, z^v, l | α, β^w, β^v) of the observed text words W, the unobserved labels l and the topic assignments z^w of the text content of all multi-modal documents using formula (1):

p(W, V, z^w, z^v, l | α, β^w, β^v) = p(W | z^w, l, β^w) · p(V | z^v, l, β^v) · p(z^w, z^v, l | α)   (1)

In formula (1), z^v denotes the topic assignments of the visual content of all multi-modal documents and α denotes a hyper-parameter;
Step 5.1, compute the generation probability p(W | z^w, l, β^w) of all text words in the multi-modal documents using formula (2):

p(W | z^w, l, β^w) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(n_{·,j,k} + β^w) / Δ(β^w)   (2)

In formula (2), n_{·,j,k,b} denotes the number of times text word b is generated by the k-th topic under label j, n_{·,j,k} = (n_{·,j,k,1}, ..., n_{·,j,k,T}), and β^w inside Δ(·) is understood as the T-dimensional symmetric vector; Δ is the operator such that for any K-dimensional vector X = (x_1, ..., x_K), Δ(X) = ∏_{k=1}^{K} Γ(x_k) / Γ(∑_{k=1}^{K} x_k), where x_k is the k-th component of X and Γ(·) is the gamma function;
Step 5.2, compute the generation probability p(V | z^v, l, β^v) of all visual words in the multi-modal documents using formula (3):

p(V | z^v, l, β^v) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(d_{·,j,k} + β^v) / Δ(β^v)   (3)

In formula (3), d_{·,j,k,c} denotes the number of times visual word c is generated by the k-th topic under label j, and d_{·,j,k} = (d_{·,j,k,1}, ..., d_{·,j,k,P});
Step 5.3, compute the generation probability p(z^w, z^v, l | α) of the label topics of all multi-modal documents using formula (4):

p(z^w, z^v, l | α) = ∏_{m=1}^{M} Δ(n_m + d_m + α) / Δ(α)   (4)

In formula (4), n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document, d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document, and n_m + d_m collects these counts over all label-topic pairs (j, k) with j ∈ Λ_m;
As shown in FIG. 2, the flow of the collapsed Gibbs sampling algorithm comprises the following specific steps:
In the first step, the probability that the t-th text word e in the m-th multi-modal document is assigned to the k-th topic under label j is solved using formula (5):

p(λ_{m,t} = j, z^w_{m,t} = k | l^{¬(m,t)}, z^{w,¬(m,t)}, w_{m,t} = e, ·) ∝ I(j ∈ Λ_m) · (n^{¬(m,t)}_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n^{¬(m,t)}_{·,j,k,b} + T·β^w) · (n^{¬(m,t)}_{m,j,k,·} + d_{m,j,k,·} + α)   (5)

In formula (5), ∝ denotes "proportional to" and I(·) denotes the indicator function; λ_{m,t} = j indicates that the label of the t-th text word in the m-th multi-modal document is j; z^w_{m,t} = k indicates that the topic assignment of the t-th text word in the m-th multi-modal document is k; l^{¬(m,t)} denotes the labels of all text words except the t-th text word in the m-th multi-modal document; z^{w,¬(m,t)} denotes the topic assignments of all text words except the t-th text word in the m-th multi-modal document; w_{m,t} = e means that the t-th text word in the m-th multi-modal document is e; n^{¬(m,t)}_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j, excluding the t-th text word of the m-th multi-modal document; n^{¬(m,t)}_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in document m, excluding the t-th text word of the m-th multi-modal document; d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document;
In the second step, the probability that the p-th visual word f in the m-th multi-modal document is assigned to the k-th topic under label j is solved using formula (6):

p(λ_{m,p} = j, z^v_{m,p} = k | l^{¬(m,p)}, z^{v,¬(m,p)}, v_{m,p} = f, ·) ∝ I(j ∈ Λ_m) · (d^{¬(m,p)}_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d^{¬(m,p)}_{·,j,k,c} + P·β^v) · (n_{m,j,k,·} + d^{¬(m,p)}_{m,j,k,·} + α)   (6)

In formula (6), z^v_{m,p} = k indicates that the topic assignment of the p-th visual word in the m-th multi-modal document is k; l^{¬(m,p)} denotes the labels of all visual words except the p-th visual word in the m-th multi-modal document; z^{v,¬(m,p)} denotes the topic assignments of all visual words except the p-th visual word in the m-th multi-modal document; v_{m,p} = f means that the p-th visual word in the m-th multi-modal document is f; d^{¬(m,p)}_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j, excluding the p-th visual word of the m-th multi-modal document; d^{¬(m,p)}_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in document m, excluding the p-th visual word of the m-th multi-modal document; n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document;
In the third step, the first step and the second step are repeated in a loop, and label topics are assigned to all text words and visual words in the multi-modal documents by the collapsed Gibbs sampling method until the iteration condition is met;
In the fourth step, the different interest weights are calculated:
The interest weight θ_{m,j,k} of the k-th topic under label j in the m-th multi-modal document is calculated using formula (7):

θ_{m,j,k} = (n_{m,j,k,·} + d_{m,j,k,·} + α) / ∑_{j'∈Λ_m} ∑_{k'=1}^{K_{j'}} (n_{m,j',k',·} + d_{m,j',k',·} + α)   (7)

The interest weight φ^w_{j,k,e} of the k-th topic under label j on text word e is calculated using formula (8):

φ^w_{j,k,e} = (n_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n_{·,j,k,b} + T·β^w)   (8)

In formula (8), n_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j;
The interest weight φ^v_{j,k,f} of the k-th topic under label j on visual word f is calculated using formula (9):

φ^v_{j,k,f} = (d_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d_{·,j,k,c} + P·β^v)   (9)

In formula (9), d_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j;
The topic distributions of the multi-modal documents, the text topic-word distributions and the visual topic-word distributions obtained from these interest weights are taken as the topic mining result.

Claims (1)

1. A multi-modal topic mining method based on label constraint, characterized by comprising the following steps:
step 1, constructing a data set D of multi-modal documents;
step 1.1, constructing the text content set of the multi-modal documents, denoted W = {W_1, ..., W_m, ..., W_M}, where W_m = {w_{m,1}, ..., w_{m,t}, ..., w_{m,N_m}} represents the text data of the m-th piece of text content, w_{m,t} is the t-th text word in the m-th piece of text content, N_m is the number of words in the m-th piece of text content, and M is the number of multi-modal documents;
step 1.2, constructing the visual content set of the multi-modal documents, denoted V = {V_1, ..., V_m, ..., V_M}, where V_m = {v_{m,1}, ..., v_{m,p}, ..., v_{m,L_m}} represents the image data of the m-th piece of visual content, v_{m,p} is the p-th visual word in the m-th piece of visual content, and L_m is the number of visual words in the m-th piece of visual content;
step 1.3, constructing the label content set of the multi-modal documents, denoted Λ = {Λ_1, Λ_2, ..., Λ_m, ..., Λ_M}, where Λ_m is the set of labels of the m-th multi-modal document; defining the label space as {1, 2, ..., l, ..., L}, where L is the number of distinct labels and l is any label index in the label space;
step 1.4, constructing the data set D = {W, V, Λ}, which comprises the text content set W, the visual content set V and the label content set Λ;
step 2, modeling the label-topic distribution of the multi-modal documents;
defining the topic distributions of the multi-modal documents as θ = {θ_1, ..., θ_m, ..., θ_M}, where θ_m represents the topic distribution of the m-th multi-modal document and obeys a Dirichlet distribution with parameter α;
defining the topic distribution of label j in the m-th multi-modal document as θ_{m,j} = {θ_{m,j,1}, ..., θ_{m,j,k}, ..., θ_{m,j,K_j}}, where K_j is the number of topics associated with label j and θ_{m,j,k} is the interest weight of label j in the m-th multi-modal document on the k-th topic, k ∈ {1, 2, ..., K_j}, j ∈ Λ_m; each label can be associated with multiple topics, but each topic can only be assigned to one label;
step 3, modeling the text label topics and the visual label topics in the multi-modal documents;
step 3.1, determining the number of topics in the multi-modal documents as K;
step 3.2, defining the text probability distribution of the k-th topic under label j as φ^w_{j,k} = {φ^w_{j,k,1}, ..., φ^w_{j,k,t}, ..., φ^w_{j,k,T}}, which obeys a Dirichlet distribution with parameter β^w, where T is the number of distinct text words in the text content set and φ^w_{j,k,t} is the interest weight of the k-th topic under label j on the t-th text word;
step 3.3, defining the visual probability distribution of the k-th topic under label j as φ^v_{j,k} = {φ^v_{j,k,1}, ..., φ^v_{j,k,p}, ..., φ^v_{j,k,P}}, which obeys a Dirichlet distribution with parameter β^v, where P is the number of distinct visual words in the visual content set and φ^v_{j,k,p} is the interest weight of the k-th topic under label j on the p-th visual word;
step 4, establishing a multi-modal topic model based on label constraint;
step 4.1, defining the topic assignments of all text words in the m-th multi-modal document as z^w_m = {z^w_{m,1}, ..., z^w_{m,t}, ..., z^w_{m,N_m}}, where z^w_{m,t} is the topic assignment of the t-th text word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^w_{m,t} form a Dirichlet-multinomial conjugate pair; defining λ_{m,t} = j to denote that the t-th text word w_{m,t} in the m-th multi-modal document belongs to label j;
step 4.2, defining the topic assignments of all visual words in the m-th multi-modal document as z^v_m = {z^v_{m,1}, ..., z^v_{m,p}, ..., z^v_{m,L_m}}, where z^v_{m,p} is the topic assignment of the p-th visual word in the m-th multi-modal document and obeys a multinomial distribution with parameter θ_{m,j}; θ_{m,j} and z^v_{m,p} form a Dirichlet-multinomial conjugate pair; defining λ_{m,p} = j to denote that the p-th visual word v_{m,p} in the m-th multi-modal document belongs to label j;
step 5, applying the collapsed Gibbs sampling method to learn the three interest weights θ_{m,j,k}, φ^w_{j,k,t} and φ^v_{j,k,p};
step 5.1, computing the joint probability distribution p(W, V, z^w, z^v, l | α, β^w, β^v) of the observed text words W, the unobserved labels l and the topic assignments z^w of the text content of all multi-modal documents using formula (1):

p(W, V, z^w, z^v, l | α, β^w, β^v) = p(W | z^w, l, β^w) · p(V | z^v, l, β^v) · p(z^w, z^v, l | α)   (1)

in formula (1), z^v denotes the topic assignments of the visual content of all multi-modal documents and α denotes a hyper-parameter;
step 5.1.1, computing the generation probability p(W | z^w, l, β^w) of all text words in the multi-modal documents using formula (2):

p(W | z^w, l, β^w) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(n_{·,j,k} + β^w) / Δ(β^w)   (2)

in formula (2), n_{·,j,k,b} denotes the number of times text word b is generated by the k-th topic under label j, n_{·,j,k} = (n_{·,j,k,1}, ..., n_{·,j,k,T}), and β^w inside Δ(·) is understood as the T-dimensional symmetric vector; Δ is the operator such that for any K-dimensional vector X = (x_1, ..., x_K), Δ(X) = ∏_{k=1}^{K} Γ(x_k) / Γ(∑_{k=1}^{K} x_k), where x_k is the k-th component of X and Γ(·) is the gamma function;
step 5.1.2, computing the generation probability p(V | z^v, l, β^v) of all visual words in the multi-modal documents using formula (3):

p(V | z^v, l, β^v) = ∏_{j=1}^{L} ∏_{k=1}^{K_j} Δ(d_{·,j,k} + β^v) / Δ(β^v)   (3)

in formula (3), d_{·,j,k,c} denotes the number of times visual word c is generated by the k-th topic under label j, and d_{·,j,k} = (d_{·,j,k,1}, ..., d_{·,j,k,P});
step 5.1.3, computing the generation probability p(z^w, z^v, l | α) of the label topics of all multi-modal documents using formula (4):

p(z^w, z^v, l | α) = ∏_{m=1}^{M} Δ(n_m + d_m + α) / Δ(α)   (4)

in formula (4), n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document, d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document, and n_m + d_m collects these counts over all label-topic pairs (j, k) with j ∈ Λ_m;
step 5.2, solving the probability that the t-th text word e in the m-th multi-modal document is assigned to the k-th topic under label j using formula (5):

p(λ_{m,t} = j, z^w_{m,t} = k | l^{¬(m,t)}, z^{w,¬(m,t)}, w_{m,t} = e, ·) ∝ I(j ∈ Λ_m) · (n^{¬(m,t)}_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n^{¬(m,t)}_{·,j,k,b} + T·β^w) · (n^{¬(m,t)}_{m,j,k,·} + d_{m,j,k,·} + α)   (5)

in formula (5), ∝ denotes "proportional to" and I(·) denotes the indicator function; λ_{m,t} = j indicates that the label of the t-th text word in the m-th multi-modal document is j; z^w_{m,t} = k indicates that the topic assignment of the t-th text word in the m-th multi-modal document is k; l^{¬(m,t)} denotes the labels of all text words except the t-th text word in the m-th multi-modal document; z^{w,¬(m,t)} denotes the topic assignments of all text words except the t-th text word in the m-th multi-modal document; w_{m,t} = e means that the t-th text word in the m-th multi-modal document is e; n^{¬(m,t)}_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j, excluding the t-th text word of the m-th multi-modal document; n^{¬(m,t)}_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in document m, excluding the t-th text word of the m-th multi-modal document; d_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in the m-th multi-modal document;
step 5.3, solving the probability that the p-th visual word f in the m-th multi-modal document is assigned to the k-th topic under label j using formula (6):

p(λ_{m,p} = j, z^v_{m,p} = k | l^{¬(m,p)}, z^{v,¬(m,p)}, v_{m,p} = f, ·) ∝ I(j ∈ Λ_m) · (d^{¬(m,p)}_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d^{¬(m,p)}_{·,j,k,c} + P·β^v) · (n_{m,j,k,·} + d^{¬(m,p)}_{m,j,k,·} + α)   (6)

in formula (6), z^v_{m,p} = k indicates that the topic assignment of the p-th visual word in the m-th multi-modal document is k; l^{¬(m,p)} denotes the labels of all visual words except the p-th visual word in the m-th multi-modal document; z^{v,¬(m,p)} denotes the topic assignments of all visual words except the p-th visual word in the m-th multi-modal document; v_{m,p} = f means that the p-th visual word in the m-th multi-modal document is f; d^{¬(m,p)}_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j, excluding the p-th visual word of the m-th multi-modal document; d^{¬(m,p)}_{m,j,k,·} denotes the number of visual words assigned to the k-th topic under label j in document m, excluding the p-th visual word of the m-th multi-modal document; n_{m,j,k,·} denotes the number of text words assigned to the k-th topic under label j in the m-th multi-modal document;
step 5.4, repeating step 5.2 and step 5.3 in a loop, assigning label topics to all text words and visual words in the multi-modal documents by the collapsed Gibbs sampling method until the iteration condition is met;
step 5.5, calculating the interest weight θ_{m,j,k} of the k-th topic under label j in the m-th multi-modal document using formula (7):

θ_{m,j,k} = (n_{m,j,k,·} + d_{m,j,k,·} + α) / ∑_{j'∈Λ_m} ∑_{k'=1}^{K_{j'}} (n_{m,j',k',·} + d_{m,j',k',·} + α)   (7)

step 5.6, calculating the interest weight φ^w_{j,k,e} of the k-th topic under label j on text word e using formula (8):

φ^w_{j,k,e} = (n_{·,j,k,e} + β^w) / (∑_{b=1}^{T} n_{·,j,k,b} + T·β^w)   (8)

in formula (8), n_{·,j,k,e} denotes the number of times text word e is generated by the k-th topic under label j;
step 5.7, calculating the interest weight φ^v_{j,k,f} of the k-th topic under label j on visual word f using formula (9):

φ^v_{j,k,f} = (d_{·,j,k,f} + β^v) / (∑_{c=1}^{P} d_{·,j,k,c} + P·β^v)   (9)

in formula (9), d_{·,j,k,f} denotes the number of times visual word f is generated by the k-th topic under label j;
and taking the topic distributions of the multi-modal documents, the text topic-word distributions and the visual topic-word distributions obtained from these interest weights as the topic mining result.
CN202110762186.4A 2021-07-06 2021-07-06 Multi-modal topic mining method based on label constraint Active CN113343679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110762186.4A CN113343679B (en) 2021-07-06 2021-07-06 Multi-modal topic mining method based on label constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110762186.4A CN113343679B (en) 2021-07-06 2021-07-06 Multi-modal topic mining method based on label constraint

Publications (2)

Publication Number Publication Date
CN113343679A true CN113343679A (en) 2021-09-03
CN113343679B CN113343679B (en) 2024-02-13

Family

ID=77482659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110762186.4A Active CN113343679B (en) 2021-07-06 2021-07-06 Multi-mode subject mining method based on label constraint

Country Status (1)

Country Link
CN (1) CN113343679B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070050388A1 (en) * 2005-08-25 2007-03-01 Xerox Corporation Device and method for text stream mining
US20080319974A1 (en) * 2007-06-21 2008-12-25 Microsoft Corporation Mining geographic knowledge using a location aware topic model
US8630975B1 (en) * 2010-12-06 2014-01-14 The Research Foundation For The State University Of New York Knowledge discovery from citation networks
CN105005558A (en) * 2015-08-14 2015-10-28 武汉大学 Multi-modal data fusion method based on crowd sensing
CN105354280A (en) * 2015-10-30 2016-02-24 中国科学院自动化研究所 Social event tracking and evolving method based on social media platform
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
KR20190008699A (en) * 2017-07-17 2019-01-25 경희대학교 산학협력단 Method, system and computer program for semantic image retrieval based on topic modeling
CN113051932A (en) * 2021-04-06 2021-06-29 合肥工业大学 Method for detecting category of network media event of semantic and knowledge extension topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张志远; 杨宏敬; 赵越: "Topic text network construction method based on Gibbs sampling results" (基于吉布斯采样结果的主题文本网络构建方法), 计算机工程 (Computer Engineering), no. 06 *
赵臣升; 吴国文; 胡福玲: "Joint topic mining of microblogs based on comments and reposts" (基于评论与转发的微博联合主题挖掘), 智能计算机与应用 (Intelligent Computer and Applications), no. 01 *

Also Published As

Publication number Publication date
CN113343679B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN107832663B (en) Multi-modal emotion analysis method based on quantum theory
CN107590177B (en) Chinese text classification method combined with supervised learning
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN111966917A (en) Event detection and summarization method based on pre-training language model
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
EP3166020A1 (en) Method and apparatus for image classification based on dictionary learning
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN106778878B (en) Character relation classification method and device
CN110825850B (en) Natural language theme classification method and device
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110008365B (en) Image processing method, device and equipment and readable storage medium
Zhou et al. Comparing the interpretability of deep networks via network dissection
Patel et al. Dynamic lexicon generation for natural scene images
He et al. Deep learning in natural language generation from images
CN115965818A (en) Small sample image classification method based on similarity feature fusion
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
Phukan et al. An efficient technique for image captioning using deep neural network
Annisa et al. Analysis and Implementation of CNN in Real-time Classification and Translation of Kanji Characters
CN107291686B (en) Method and system for identifying emotion identification
CN113343679B (en) Multi-mode subject mining method based on label constraint
CN110674293A (en) Text classification method based on semantic migration
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN115687576A (en) Keyword extraction method and device represented by theme constraint
Mousavi et al. Collaborative learning of semi-supervised clustering and classification for labeling uncurated data
Yang et al. Automatic metadata information extraction from scientific literature using deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant