CN115391522A

CN115391522A - Text topic modeling method and system based on social platform metadata

Info

Publication number: CN115391522A
Application number: CN202210921496.0A
Authority: CN
Inventors: 高金华; 赵鑫; 沈华伟; 王永庆; 庞亮; 孟剑; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2022-08-02
Filing date: 2022-08-02
Publication date: 2022-11-25

Abstract

The invention provides a text topic modeling method and system based on social platform metadata, comprising: constructing a bag-of-words representation of text data based on keywords of the text data; training attribute value prediction tasks of the corresponding categories, based on the metadata categories of the text data, to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and extracting a text semantic representation of the text data using the target semantic extraction model; constructing a semantic constraint target based on the text semantic representation and, with the semantic constraint target as guidance and the bag-of-words representation as both input and reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, and deriving the topic-keyword distribution and topic embedding representation from the model. The method and system can perform topic modeling on the short text messages that are widespread in mobile applications, extract the keywords of topics, and learn embedding representations of the topics.

Description

Text topic modeling method and system based on social platform metadata

Technical Field

The invention is applicable to the field of big data analysis for mobile applications, relates to a topic modeling method and system for mobile application metadata, and particularly relates to a topic classification method and system for social application platform metadata.

Background

The topic modeling task aims to build a probabilistic model of a corpus and discover a set of latent topics; the resulting topics can be used in fields such as user profiling, public opinion analysis and tracking, and human-machine dialogue. Each topic describes an interpretable semantic concept and corresponds to a probability distribution over the vocabulary. Given a document, the topic model can also infer its topic distribution. As a powerful unsupervised text analysis technique, topic modeling can extract the topics discussed in massive text collections and cluster or classify the texts according to their topic distributions.

Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic topic model proposed in 2003 that infers the topic distribution of a document by modeling the document's generation process. As shown in FIG. 5, in the LDA topic model there are M document-topic Dirichlet prior distributions, corresponding to M document-topic multinomial distributions, such that α → θ_d → z_d forms a Dirichlet-multinomial conjugate structure; the document-topic posterior distribution, which is again a Dirichlet distribution, is obtained using Gibbs sampling.
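As a reminder of what "modeling the generation process" means, the standard LDA generative story for document d can be written as below. The symbols α, θ_d and z_d follow the paragraph above; β_k (the topic-word distribution of topic k), w_{d,n} (the n-th word of document d) and N_d (the document length) are notation assumed here for illustration.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), \qquad n = 1, \dots, N_d, \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}).
\end{aligned}
```

Because the Dirichlet prior is conjugate to the multinomial, the posterior over θ_d given the topic assignments is again a Dirichlet distribution, which is what makes Gibbs sampling convenient here.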

Although LDA models have had great success in the topic modeling task, their inference efficiency is too low for large-scale analysis scenarios. To address this efficiency problem, neural topic models based on the variational autoencoder use an encoder and a decoder built on deep neural networks to fit the processes of topic inference and document generation. The encoder and decoder are trained by maximizing the evidence lower bound (ELBO) of the marginal likelihood of the input data, and are then used for the subsequent topic inference task:

where p represents a posterior probability distribution, q represents a prior probability distribution, x represents a word distribution of the document, and z represents a topic distribution of the document.
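For reference, the evidence lower bound of a variational autoencoder is commonly written as follows. The notation is the usual textbook convention and is an assumption here: q_φ(z | x) is the encoder's approximate posterior, p(z) is the prior over the topic distribution, and p_θ(x | z) is the decoder likelihood. This labeling of the distributions may differ from the one in the sentence above.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
\;-\; \mathrm{KL}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The first term rewards accurate reconstruction of the document's word distribution x from its topic distribution z; the second keeps the inferred topic distribution close to the prior.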

On social platforms, the words of short text messages are sparsely distributed and the content takes varied forms, so existing topic modeling methods still have several shortcomings.

Firstly, in terms of computational efficiency, traditional topic modeling methods represented by the LDA model must infer the topic of a new document by sampling, which is inefficient and cannot be adapted to large-scale data analysis scenarios, particularly social media streaming data. The advent of neural topic modeling approaches has alleviated this deficiency to some extent.

Secondly, existing neural topic models mainly take the bag-of-words representation of a document as input. In a short text scenario, however, the word distribution of a single document is very sparse, and it is difficult to infer the topic of the document accurately from discrete word distribution information alone. To solve this problem, some existing models add constraints to the topic distribution of a short text, for example constraining all words in a short text to belong to the same topic. Although this relieves the problem caused by corpus sparsity to a certain extent, it greatly reduces the modeling capability of the model.

In addition, existing neural topic models use text content as the main basis for topic modeling. In mobile applications, especially social applications, short texts often carry rich attribute information, such as hashtags and URLs, which can provide important clues for topic modeling. Although prior models have introduced covariates and tag information closely related to the corpus documents, they require the side information of every document to be complete, which greatly limits the application range of the topic model. In practice, attribute information in short texts is often largely missing, and using the existing models directly introduces a large amount of noise, so that topic modeling fails.

In summary, existing topic modeling methods have two main problems: (1) the bag-of-words representation they adopt cannot fully express the semantics and topics of sparse short texts; (2) they cannot effectively use the rich attribute information that short texts carry.

Disclosure of Invention

The invention provides a topic modeling method and system for mobile application metadata, and aims to solve the problems that, in conventional topic models, the bag-of-words model cannot fully express the semantics of short texts and the multi-attribute information of short texts cannot be effectively used. The method and system can perform topic modeling on the short text messages that are widespread in mobile applications, extract the keywords of topics, and learn embedding representations of the topics; they can effectively fuse the sparse multi-attribute information of short texts to further improve topic modeling, and they support analysis of how each topic's attribute values are distributed over each attribute.

To address the shortcomings of the prior art, the invention provides a text topic modeling method based on social platform metadata, comprising the following steps:

step 1, obtaining, from a social platform, text data to be topic-modeled and metadata of the text data;

step 2, constructing a bag-of-words representation of the text data based on the keywords of the text data;

step 3, training attribute value prediction tasks of the corresponding categories based on the categories of the metadata to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and extracting a text semantic representation of the text data using the target semantic extraction model;

step 4, constructing a semantic constraint target based on the text semantic representation and, with the semantic constraint target as guidance and the bag-of-words representation as both input and reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, and deriving the topic-keyword distribution and topic embedding representation from the model;

step 5, inputting the topic embedding representation into the attribute value prediction tasks to obtain the attribute value distribution of each topic on the corresponding attributes, merging identical topics according to the attribute value distribution, the topic-keyword distribution and the topic embedding representation, and taking the merged result as the topic model of the text data.

In the text topic modeling method based on social platform metadata, step 3 comprises:

classifying the attributes of the metadata into discrete attributes, continuous attributes and text attributes;

for each discrete attribute, counting the attribute values that appear in the corpus and, following the same process as vocabulary construction, taking the attribute values whose number of occurrences exceeds a preset threshold to form an attribute value set, constructing a classification task for predicting the attribute value based on the attribute value set, and adopting cross entropy as the loss function of the classification task;

for each continuous attribute, standardizing the attribute values to a distribution with mean 0 and variance 1, constructing a regression task for predicting the standardized attribute value, and adopting mean squared error (MSE) as the loss function of the regression task;

concatenating the text data with the text attributes to obtain a concatenated text, and inputting the concatenated text into the pre-trained semantic extraction model to generate a text semantic vector;

and constructing an adversarial classification task for judging the attribute category of the text semantic vector, adopting cross entropy as its loss function.
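The loss functions named in step 3 can be illustrated with a small pure-Python sketch. Several things here are assumptions made for illustration and are not stated in the text: the attribute names, the dictionary layout of `outputs` and `labels`, the unweighted sum, and in particular the choice to skip attributes whose label is missing (`None`). How the adversarial term enters the overall objective is not specified either, so the sketch only computes it separately.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mse(pred, target):
    """Squared error for one standardized continuous attribute value."""
    return (pred - target) ** 2


def multitask_loss(outputs, labels):
    """Sum per-attribute losses, skipping attributes with a missing label.

    outputs: name -> ("discrete", logits) or ("continuous", prediction)
    labels:  name -> class index, float, or None when the value is missing
    """
    total = 0.0
    for name, label in labels.items():
        if label is None:  # assumption: missing attributes contribute no loss
            continue
        kind, pred = outputs[name]
        if kind == "discrete":
            total += cross_entropy(pred, label)
        else:
            total += mse(pred, label)
    return total


# Hypothetical prediction-head outputs for one short text.
outputs = {
    "hashtag": ("discrete", [2.0, 0.0, 0.0]),
    "publish_time": ("continuous", 0.5),
    "user_id": ("discrete", [0.0, 0.0]),
}
labels = {"hashtag": 0, "publish_time": 1.0, "user_id": None}
task_loss = multitask_loss(outputs, labels)

# Adversarial auxiliary task: a discriminator guesses which attribute the
# semantic vector corresponds to; its cross-entropy is computed separately.
adversarial_loss = cross_entropy([0.0, 0.0, 0.0], 1)
```

In a real fine-tuning setup these scalars would come from differentiable prediction heads on top of the pre-trained model; the sketch only shows how the individual terms are formed.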

In the text topic modeling method based on social platform metadata, step 5 comprises: constructing an attribute value list for each attribute of a topic according to the attribute value distribution; constructing a keyword list according to the topic-keyword distribution;

when merging topics, using the Jaccard coefficient to measure the similarity between the keyword lists and between the attribute value lists of the topics, obtaining a first similarity and a second similarity, and using cosine similarity to measure the similarity between the embedding representations of the topics, obtaining a third similarity; and taking a weighted average of the first, second and third similarities to obtain the final similarity between topics, and merging the topics whose final similarity is greater than a preset value.

In the text topic modeling method based on social platform metadata, the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

The invention also provides a text topic modeling system based on social platform metadata, comprising:

an initial module, configured to obtain, from a social platform, text data to be topic-modeled and metadata of the text data, and to construct a bag-of-words representation of the text data based on the keywords of the text data;

a fine-tuning module, configured to train attribute value prediction tasks of the corresponding categories according to the categories of the metadata so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and to extract a text semantic representation of the text data using the target semantic extraction model;

an extraction module, configured to construct a semantic constraint target according to the text semantic representation and, with the semantic constraint target as guidance and the bag-of-words representation as both input and reconstruction target, to train a neural topic model based on a variational autoencoder to obtain a topic extraction model, and to derive the topic-keyword distribution and topic embedding representation from the model;

and a merging module, configured to input the topic embedding representation into the attribute value prediction tasks to obtain the attribute value distribution of each topic on the corresponding attributes, to merge identical topics according to the attribute value distribution, the topic-keyword distribution and the topic embedding representation, and to take the merged result as the topic model of the text data.

In the text topic modeling system based on social platform metadata, the fine-tuning module is specifically configured to:

for each discrete attribute, count the attribute values that appear in the corpus and, following the same process as vocabulary construction, take the attribute values whose number of occurrences exceeds a preset threshold to form an attribute value set, construct a classification task for predicting the attribute value based on the attribute value set, and adopt cross entropy as the loss function of the classification task;

In the text topic modeling system based on social platform metadata, the merging module is configured to: construct an attribute value list for each attribute of a topic according to the attribute value distribution; construct a keyword list according to the topic-keyword distribution;

when merging topics, use the Jaccard coefficient to measure the similarity between the keyword lists and between the attribute value lists of the topics, obtaining a first similarity and a second similarity, and use cosine similarity to measure the similarity between the embedding representations of the topics, obtaining a third similarity; and take a weighted average of the first, second and third similarities to obtain the final similarity between topics, and merge the topics whose final similarity is greater than a preset value.

In the text topic modeling system based on social platform metadata, the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

The invention also provides a storage medium for storing a program for executing the text topic modeling method based on the social platform metadata.

The invention further provides a client for use with any of the above text topic modeling systems based on social platform metadata.

According to the above scheme, the advantages of the invention are:

1. To address the problem that a neural topic model using the bag-of-words model cannot fully express short text semantics, the invention provides a constraint method based on the text semantic representation of a pre-trained model. It introduces external knowledge to guide topic modeling, can effectively improve the modeling performance of the neural topic model, simultaneously learns a semantic representation of each topic, and projects the topics into the semantic representation space of the pre-trained model.

2. To address the problem that neural topic models cannot effectively use the multi-attribute information of short texts, the invention provides a preprocessing scheme for different types of attribute data. The multi-attribute information is fused into the text semantic representation through multi-task learning and an auxiliary adversarial task, so that the text's own metadata guides topic modeling. This can effectively improve the modeling performance of the neural topic model, and it allows the attribute value distribution of each topic on each attribute to be modeled and analyzed.

3. The invention constructs a text topic analysis system for mobile application metadata. The system implements a complete pipeline: preprocessing of short text content in mobile applications (word segmentation, part-of-speech tagging, entity recognition, vocabulary construction and conversion to the bag-of-words representation); preprocessing of metadata (conversion of discrete, continuous and text attributes); fine-tuning of the semantic representation of the pre-trained model; extraction of topic keywords (subjects (persons, institutions, organizational entities, etc.), actions (verbs, verbal nouns, etc.), times, places, etc.); derivation of topic embedding representations; and analysis of the attribute value distribution of topics on each attribute. It can thereby comprehensively and efficiently mine and analyze the keywords, embedding representations and attribute value distributions of the topics of short texts in mobile applications.

Drawings

FIG. 1 is a schematic diagram of a topic modeling method based on semantic constraints;

FIG. 2 is a schematic diagram of a topic modeling method for fusing metadata information;

FIG. 3 is a basic flow diagram of a mobile application metadata oriented topic modeling approach;

FIG. 4 is a business logic diagram of a mobile application metadata oriented topic modeling system;

FIG. 5 is a system block diagram of a prior art LDA model.

Detailed Description

In order to achieve the above technical effects, the invention comprises the following key technical points:

Key point 1: To address the problem that the bag-of-words model used by existing neural topic models cannot fully express short text semantics, a neural topic modeling method based on the text semantic constraint of a pre-trained model is proposed. It significantly improves the topic modeling performance of the neural topic model and learns embedding representations of the topics.

Key point 2: To address the problem that existing neural topic models cannot effectively use the multi-attribute information of short texts, a multi-attribute conditional topic modeling scheme based on multi-task learning is proposed. It overcomes the adverse effect of missing attribute values on topic modeling, significantly improves the topic modeling performance of the neural topic model on attribute-rich short texts, and supports analysis of the attribute value distribution of each topic on each attribute.

Key point 3: A text topic analysis system for mobile application metadata is constructed. The system implements a complete pipeline: preprocessing of short text content in mobile applications (word segmentation, part-of-speech tagging, entity recognition, vocabulary construction and conversion to the bag-of-words representation); preprocessing of metadata (conversion of discrete, continuous and text attributes); fine-tuning of the semantic representation of the pre-trained model; extraction of topic keywords (subjects (persons, institutions, organizational entities, etc.), actions (verbs, verbal nouns, etc.), times, places, etc.); derivation of topic embedding representations; and analysis of the attribute value distribution of topics on each attribute. It can thereby comprehensively and efficiently mine and analyze the keywords, embedding representations and attribute value distributions of the topics of online short texts.

In order to make the aforementioned features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

To address the problem that the bag-of-words representation cannot fully express short text semantics, the invention proposes taking the fixed-dimension text semantic vector generated by a pre-trained model as the semantic representation of a document and fusing it into the encoding and decoding process of the neural topic model. In a short text corpus, where the word distribution of each document is sparse, the constraint of a dense semantic representation containing word order and context information lets the neural topic model overcome the limitations of the bag-of-words representation and infer and model the topics of short texts more accurately. The principle of the method is shown in FIG. 1.
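The text does not give the exact form of the semantic constraint. One possible reading, offered purely as an assumption, is a penalty added to the usual variational autoencoder objective that asks the model to reconstruct the pre-trained semantic vector s of the document from its topic distribution z through a topic embedding matrix E. All symbols below (λ, s, E, and the squared-error form) are hypothetical, and the ELBO uses the common convention in which q_φ is the encoder and p_θ the decoder.

```latex
\mathcal{L}(x, s) \;=\;
-\,\mathbb{E}_{q_\phi(z \mid x, s)}\bigl[\log p_\theta(x \mid z)\bigr]
\;+\; \mathrm{KL}\bigl(q_\phi(z \mid x, s) \,\|\, p(z)\bigr)
\;+\; \lambda \,\bigl\lVert E^{\top} z - s \bigr\rVert_2^2
```

Under this reading, each row of E lives in the same space as the pre-trained semantic vectors, which would be one way to obtain topic embeddings projected into that space.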

To address the problem that neural topic models cannot effectively use the multi-attribute information of short texts, the inventors observe that, in real online short text scenarios, short texts discussing the same topic exhibit locality in the distribution of multiple attribute values such as publisher profile, publication time, hashtag, URL address and user mention. That is, short texts that are sent by similar publishers within a short period of time and carry similar attribute information often discuss the same topic. The invention therefore constructs multiple tasks for predicting short text attribute values from the sparse multi-attribute information of the short texts, and fine-tunes the pre-trained model through multi-task learning so that it generates text semantic vectors more relevant to topic information for use as the semantic representation. The multi-attribute information of the short texts is then fused indirectly through the semantic constraint, which avoids the noise introduced by a large number of missing attribute values. A schematic diagram of the method is shown in FIG. 2.

The topic modeling method for mobile application metadata comprises the following processing flow:

1. Clean the text content, identify the entity words and predicate words in the text, construct a vocabulary, and convert the text content into a bag-of-words representation.

2. Clean and convert the metadata of the text according to attribute type, construct the corresponding attribute value prediction tasks, fine-tune the semantic representation of the pre-trained model through multi-task learning and an auxiliary adversarial task, and thereby fuse the multi-attribute information. Use the fine-tuned pre-trained model to convert the concatenation of the text content and the text attribute values into a text semantic representation.

3. Construct a semantic constraint target based on the text semantic representation, use it to guide the learning of the neural topic model, and, after training is finished, derive the modeled topic-word distribution and topic embedding representation. The topic-word distribution is different from the document-word bag-of-words representation described above. The input to the topic modeling task is a document collection containing multiple documents, and the task is to mine which topics the collection contains. A topic-word distribution describes a topic by several keywords. A topic embedding representation, a common form of representation in deep learning, represents a topic by a real-valued vector.

4. Input the topic embedding representations into the corresponding attribute value prediction networks to obtain the attribute value distribution of each topic on each attribute. A topic is a semantic concept that can be described by a set of keywords, by its participating users, or by the hashtags that users employ to discuss it. Analyzing the attribute distribution of a topic describes the topic along more dimensions, so that the user can better understand the text keywords related to the topic, its main participating users, its mainstream hashtags, the time range it covers, and so on.
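Step 1 of the flow above (vocabulary construction and conversion to a bag-of-words representation) can be sketched in a few lines. This is a minimal illustration that assumes the entity and predicate words have already been extracted; the function names, the `min_count` parameter and the example tokens are hypothetical.

```python
from collections import Counter


def build_vocab(token_lists, min_count=1):
    """Map each keyword occurring at least min_count times to an index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}


def to_bow(tokens, vocab):
    """Return a count vector over the vocabulary; unknown tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Hypothetical, already-extracted entity and predicate words per short text.
docs = [
    ["team", "wins", "final"],
    ["team", "loses", "final", "final"],
]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
```

With only a handful of keywords per short text, most entries of each vector are zero, which is the sparsity problem the semantic constraint is meant to offset.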

The specific flow is shown in FIG. 3:

1. Data acquisition: collect, by means such as data collection or data query, the text data published on a specific social platform over a period of time together with the corresponding metadata, including but not limited to: publication time, publishing user ID, publishing user profile, @User, #Tag, URL, etc.

2. Text preprocessing: clean the text content, identify the entity words (persons, organizations, times, places, other entities, etc.) and predicate words (event trigger words) in the text, construct a vocabulary, and convert the text content into the corresponding bag-of-words representation.

3. Metadata preprocessing: divide the metadata into three categories (discrete attributes, continuous attributes and text attributes) and process each category separately.

1) Discrete attributes: for example publishing user ID, #Tag, URL, etc. For each discrete attribute, count the attribute values that appear in the corpus and, following the same process as vocabulary construction, take the attribute values whose number of occurrences exceeds a set threshold to form an attribute value set. In particular, for URL addresses, the Host field of the URL replaces the original URL address as the value of the URL attribute and represents the information source of the online short text.

2) Continuous attributes: for example publication time, rating scores, etc. For each continuous attribute, the attribute values are standardized to a distribution with mean 0 and variance 1. In particular, time attribute values are first converted to timestamps and then standardized.

3) Text attributes: for example publishing user profile, etc. For text attributes, the text content is concatenated with the text attribute and used as the input from which the pre-trained model produces the text semantic representation.

4. Metadata fusion: construct multiple attribute value prediction tasks with the preprocessed metadata attribute values as labels, and fine-tune the pre-trained model through multi-task learning so that it fuses the metadata information. Each discrete attribute corresponds to a classification task that predicts the attribute value ID, with cross entropy as the loss function. Each continuous attribute corresponds to a regression task that predicts the standardized attribute value, with MSE as the loss function. To prevent the text semantic vectors generated by the fine-tuned pre-trained model from being tightly coupled to any single attribute, and to make the shared parameters of the pre-trained model more inclined to capture the co-occurrence patterns among multiple attributes and learn semantic representations more relevant to topic information, the invention also adds an auxiliary adversarial classification task: judging which attribute a text semantic vector generated by the pre-trained model corresponds to, with cross entropy as the loss function.

5. Topic modeling: take the text semantic representation fused with the metadata information as the semantic constraint, guide the training of the neural topic model, and derive the modeled topic-keyword distribution and topic embedding representation.

6. Post-processing: input the topic embedding representations into the attribute value prediction task networks from step 4 to obtain the attribute value distribution of each topic on the corresponding attributes, and take the Top K values as the attribute value list of each attribute of the topic. Divide the topic-keyword distribution of a topic by keyword category (entity words (persons, organizations, times, places, other entities, etc.) and predicate words (event trigger words)), and take the Top K keywords as the keyword list of each category of the topic. Merge similar topics according to the keyword lists, the attribute value lists and the topic embedding representations.
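The metadata preprocessing in step 3 of the flow above can be sketched with the Python standard library. The function names, the example URLs and timestamps, the space separator used for concatenation, and the handling of missing values (kept as `None`) are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, pstdev
from urllib.parse import urlparse


def url_host(url):
    """Return the host part of a URL, used in place of the full address."""
    return urlparse(url).hostname


def build_value_set(values, threshold):
    """Map discrete values occurring more than `threshold` times to IDs."""
    counts = Counter(v for v in values if v is not None)
    kept = sorted(v for v, c in counts.items() if c > threshold)
    return {v: i for i, v in enumerate(kept)}


def standardize(values):
    """Z-score the non-missing values; missing entries (None) stay None."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    if sigma == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sigma for v in values]


def join_text(content, profile, sep=" "):
    """Concatenate text content with a text attribute, if one is present."""
    return content if not profile else content + sep + profile


# Discrete attribute: URLs reduced to their host, then thresholded.
urls = [
    "https://news.example.com/a",
    "https://news.example.com/b",
    "https://blog.example.org/x",
    None,
]
hosts = [url_host(u) if u else None for u in urls]
host_ids = build_value_set(hosts, threshold=1)

# Continuous attribute: publication times -> timestamps -> standardized.
times = [
    datetime(2022, 8, 1, tzinfo=timezone.utc),
    None,
    datetime(2022, 8, 2, tzinfo=timezone.utc),
    datetime(2022, 8, 3, tzinfo=timezone.utc),
]
timestamps = [t.timestamp() if t else None for t in times]
z_scores = standardize(timestamps)

# Text attribute: concatenated with the text content.
model_input = join_text("hypothetical short text", "hypothetical user profile")
```

Keeping missing values as `None` rather than imputing them leaves it to the downstream prediction tasks to decide how to treat them.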

The merging above matters because multiple documents may all describe the same topic. The purpose of topic modeling is to compress a set of texts and find the main events that it describes. The goal of topic modeling is therefore to obtain a description of each topic, such as the topic-keyword distributions used here.

A common form of topic-keyword distribution describes a topic by its K most important keywords. For example, the keyword distribution for a topic about some geopolitical conflict may be "country XX, country ZZ, leader of country XX, head of state of country ZZ, participating third-party organization, military operations, sanctions".
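Selecting the K most important keywords from a topic's distribution over the vocabulary, optionally split by keyword category as in the post-processing step, can be sketched as follows. The vocabulary, probabilities and category labels are made-up illustrative values, and the function names are hypothetical.

```python
def top_k_keywords(topic_word_probs, vocab, k):
    """Return the k vocabulary words with the highest probability."""
    ranked = sorted(
        range(len(vocab)), key=lambda i: topic_word_probs[i], reverse=True
    )
    return [vocab[i] for i in ranked[:k]]


def top_k_by_category(topic_word_probs, vocab, category_of, k):
    """Return the top-k keywords separately for each keyword category."""
    ranked = top_k_keywords(topic_word_probs, vocab, len(vocab))
    result = {}
    for word in ranked:
        bucket = result.setdefault(category_of[word], [])
        if len(bucket) < k:
            bucket.append(word)
    return result


# Hypothetical topic-word distribution over a tiny vocabulary.
vocab = ["country_x", "country_z", "sanctions", "military_operations", "weather"]
probs = [0.30, 0.25, 0.20, 0.15, 0.10]
category_of = {
    "country_x": "entity",
    "country_z": "entity",
    "weather": "entity",
    "sanctions": "predicate",
    "military_operations": "predicate",
}

keywords = top_k_keywords(probs, vocab, 3)
keywords_by_category = top_k_by_category(probs, vocab, category_of, 2)
```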

When two topics are merged, the Jaccard coefficient is used to measure the similarity between their keyword lists and between their attribute value lists, and cosine similarity is used to measure the similarity between their embedding representations. The three similarity scores are then combined by a weighted average to obtain the final similarity between the two topics, and topics whose similarity exceeds a specified threshold are merged.
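The merge criterion can be sketched as below. The equal weights, the threshold of 0.6, the dictionary layout of a topic, and the choice to merge transitively with a union-find structure are all assumptions made for illustration; the text only specifies the three similarity measures, the weighted average and the threshold test.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient of two lists treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def topic_similarity(t1, t2, weights=(1.0, 1.0, 1.0)):
    """Weighted average of keyword, attribute-value and embedding similarity."""
    scores = (
        jaccard(t1["keywords"], t2["keywords"]),
        jaccard(t1["attr_values"], t2["attr_values"]),
        cosine(t1["embedding"], t2["embedding"]),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def merge_topics(topics, threshold, weights=(1.0, 1.0, 1.0)):
    """Group topic indices whose pairwise similarity exceeds the threshold."""
    parent = list(range(len(topics)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if topic_similarity(topics[i], topics[j], weights) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(topics)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical topics: the first two overlap heavily, the third does not.
topics = [
    {"keywords": ["election", "vote", "candidate"],
     "attr_values": ["#vote", "news.example.com"],
     "embedding": [1.0, 0.0]},
    {"keywords": ["election", "vote", "poll"],
     "attr_values": ["#vote", "news.example.com"],
     "embedding": [0.8, 0.6]},
    {"keywords": ["match", "goal"],
     "attr_values": ["#football"],
     "embedding": [0.0, 1.0]},
]
groups = merge_topics(topics, threshold=0.6)
```

For the first two topics the three scores are 0.5 (keywords), 1.0 (attribute values) and 0.8 (embeddings), giving an average of about 0.77, so they fall into one group while the third topic stays on its own.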

To infer the topic distribution of a given text, the bag-of-words representation and the semantic representation of the text can be input directly into the topic model to obtain the topic distribution; the topic corresponding to the dimension with the largest value in the topic distribution is taken as the topic of the text.
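The final assignment is an argmax over the inferred topic distribution. A minimal sketch, with a made-up distribution and a hypothetical function name:

```python
def assign_topic(topic_distribution):
    """Return the index of the topic with the largest probability."""
    return max(range(len(topic_distribution)), key=topic_distribution.__getitem__)


# Hypothetical topic distribution inferred for one short text.
topic_id = assign_topic([0.1, 0.7, 0.2])
```

When several topics tie for the largest value, this sketch returns the lowest index; how ties should be handled is not stated in the text.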

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Likewise, the technical details mentioned in this embodiment can also be applied to the above embodiment.

The business logic of the text topic analysis system for mobile application metadata is shown in FIG. 4. The system comprises at least the following modules:

1. Preprocessing module: mainly responsible for reading the text data to be analyzed from a file, database or data warehouse and carrying out the following preprocessing operations: (1) cleaning the text content; (2) identifying the entity words and predicate words in the text and constructing a vocabulary; (3) extracting the attribute values of the various attributes of the text and cleaning and converting them according to attribute type.

2. Topic modeling learning module: mainly responsible for training the neural topic model and outputting the keywords, embedding representations and attribute value distributions of the modeled topics. This includes converting and loading the bag-of-words representation and the pre-trained semantic representation used for topic modeling, training the neural topic model, deriving the keywords, embedding representations and attribute value distributions of the modeled topics, and inferring the topic distribution of a text.

3. Post-processing module: mainly responsible for processing and converting the derived keywords, embedding representations and attribute value distributions of the topics. This includes merging similar topics, extracting the title and description information of the topics, assigning cluster labels to all texts, and so on.

4. Data storage module: mainly responsible for storing the output results in a database. This includes creating topic, keyword and text objects, handling the foreign key relationships among them, connecting to the database, and writing the data into the corresponding data tables.

5. Metadata fusion module: mainly responsible for fine-tuning the pre-trained model so that it generates semantic representations more relevant to topic information. This includes converting and loading the text data and multi-attribute data for the fine-tuning tasks, fine-tuning the pre-trained model, and exporting and storing the parameters of the fine-tuned pre-trained model.
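As an illustration of the data storage module described in item 4 above, the following sketch creates topic, keyword and text tables linked by foreign keys in an in-memory SQLite database. The table and column names, the choice of SQLite, and the example rows are hypothetical; the text only states that such objects and their foreign key relationships are written to a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE topic (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE keyword (
        id INTEGER PRIMARY KEY,
        topic_id INTEGER NOT NULL REFERENCES topic(id),
        word TEXT NOT NULL,
        weight REAL NOT NULL
    );
    CREATE TABLE short_text (
        id INTEGER PRIMARY KEY,
        topic_id INTEGER REFERENCES topic(id),
        content TEXT NOT NULL
    );
    """
)

with conn:  # commits on success, rolls back on error
    topic_id = conn.execute(
        "INSERT INTO topic (title) VALUES (?)", ("example topic",)
    ).lastrowid
    conn.executemany(
        "INSERT INTO keyword (topic_id, word, weight) VALUES (?, ?, ?)",
        [(topic_id, "sanctions", 0.20), (topic_id, "military operations", 0.15)],
    )
    conn.execute(
        "INSERT INTO short_text (topic_id, content) VALUES (?, ?)",
        (topic_id, "hypothetical short text"),
    )

# Read back the keywords of the topic, most important first.
keywords = [
    row[0]
    for row in conn.execute(
        "SELECT word FROM keyword WHERE topic_id = ? ORDER BY weight DESC",
        (topic_id,),
    )
]
```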

The fine-tuning module is configured to train attribute value prediction tasks of the corresponding categories according to the categories of the metadata so as to fine-tune a pre-trained semantic extraction model and obtain a target semantic extraction model, and to extract a text semantic representation of the text data using the target semantic extraction model.

The invention also provides a storage medium for storing a program for executing any one of the above text topic modeling methods based on social platform metadata.

Claims

1. A text topic modeling method based on social platform metadata is characterized by comprising the following steps:

step 4, constructing a semantic constraint target based on the text semantic representation and, with the semantic constraint target as guidance and the bag-of-words representation as both input and reconstruction target, training a neural topic model based on a variational autoencoder to obtain a topic extraction model, and deriving the topic-keyword distribution and topic embedding representation from the model;

2. The method of claim 1, wherein the step 3 comprises:

and constructing an adversarial classification task for judging the attribute category of the text semantic vector, adopting cross entropy as its loss function.

3. The method of claim 1, wherein the step 5 comprises: constructing an attribute value list for each attribute of a topic according to the attribute value distribution; constructing a keyword list according to the topic-keyword distribution;

4. The social platform metadata based text topic modeling method of claim 1, wherein the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

5. A text topic modeling system based on social platform metadata, comprising:

6. The social platform metadata based text topic modeling system of claim 5, wherein the fine-tuning module is specifically configured to:

7. The social platform metadata based text topic modeling system of claim 5, wherein the merging module is configured to: construct an attribute value list for each attribute of a topic according to the attribute value distribution; construct a keyword list according to the topic-keyword distribution;

8. The social platform metadata based text topic modeling system of claim 5, wherein the metadata comprises: publication time, publishing user ID, publishing user profile, @User, #Tag, and URL.

9. A storage medium storing a program for executing the method for modeling a text topic based on metadata of a social platform according to any one of claims 1 to 4.

10. A client for use in the social platform metadata based text topic modeling system of any one of claims 5 to 8.