CN116258600A - Multi-modal feature fusion social media content propagation prediction method - Google Patents


Info

Publication number
CN116258600A
CN116258600A
Authority
CN
China
Prior art keywords
data
text
social media
image
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310255939.1A
Other languages
Chinese (zh)
Inventor
郑博仑
徐逸杰
张权
潘航佳
颜成钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310255939.1A priority Critical patent/CN116258600A/en
Publication of CN116258600A publication Critical patent/CN116258600A/en
Pending legal-status Critical Current

Classifications

    • G06Q50/01 Social networking
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Biomedical Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal feature fusion social media content propagation prediction method. First, a data crawler acquires raw data on the target social media platform's content and the publishers' social attributes; the acquired data are then preprocessed per modality; features of the image and text modalities are extracted by a feature extraction module; the data of the three modalities are adjusted to the same dimension; and finally the neural network is trained with supervision. The method ensures that the regression fit is not overly biased toward any single modality, thereby improving the accuracy and robustness of the prediction.

Description

Multi-modal feature fusion social media content propagation prediction method
Technical Field
The invention relates to the field of deep learning, in particular to a social media content propagation prediction method based on multi-modal feature fusion.
Background
With the progress of Internet communication technology and the development of communication infrastructure in recent years, social media has risen rapidly and become widely integrated into everyday life and production, and its propagation characteristics are attracting attention. Unlike the top-down, vertical propagation of traditional media such as newspapers, radio and television, social media content propagates laterally between self-media accounts and users across the social network, and is characterized by easy content publishing, strong interactivity, fast propagation and broad influence. However, faced with the special properties of social media information and large amounts of data noise, most existing propagation prediction schemes produce results that lack credibility and cannot accurately predict how widely social media content will spread.
Social media content contains diverse types of information; the multi-modal data include images, text, publisher information and so on. Because of the heterogeneity of multi-modal data, it is difficult to combine data across modalities reasonably and efficiently and to accurately predict the propagation outcome of social media content (related indices such as clicks, views and plays). Prediction tasks based on deep learning require a large amount of supporting data, yet social media platforms are numerous, each with its own platform structure and content presentation style, and obtaining the propagation outcome of newly published content requires a long wait. The resulting problems of long acquisition time, high difficulty and noisy data for the target platform's content have a strongly negative effect on deep learning prediction tasks.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a social media content propagation prediction method based on multi-modal feature fusion. The method addresses how to extract and fuse high-dimensional features from the multi-modal data of social media content through a neural network, thereby improving the accuracy and robustness of social media propagation prediction.
To achieve the above objective, a data crawler first crawls published-content data from the target social media platform. Data of the different modalities are separated from the content data and normalized. They are then fed into modality-specific feature extraction networks; the extracted features are dimension-reduced and aligned, merged, and passed through fully connected layers for regression prediction, ultimately improving the accuracy of social media content propagation prediction.
A multi-modal feature fusion social media content propagation prediction method comprises the following steps:
and step 1, acquiring original data of the target social media platform content and the social attribute of the publisher by utilizing a data crawler.
And 2, respectively preprocessing the acquired data according to different modes.
And 3, extracting the characteristics of the image mode and the text mode through a characteristic extraction module.
And 4, adjusting the data of different modes to the same dimension.
And 5, supervising and training by the neural network.
And monitoring the neural network regression task by adopting the mean square error.
Further, the specific method in the step 1 is as follows:
Raw data on the target platform's content and the publishers' social attributes are acquired with a data crawler and comprise: images, text, pre-propagation values and post-propagation values. The images comprise pictures attached to posts and video cover frames; the text comprises titles, body content, section information, the posts' custom tags, and the publishers' verification information and personal profiles; the pre-propagation values are numerical information about the publisher, comprising follower count, following count, total number of posts and total number of likes; the post-propagation values comprise the clicks, views, comments, reposts and favorites within a certain time after publication, together with the publication time.
Further, the specific method in the step 2 is as follows:
The data comprise three modalities: images, text and numerical values. For the image modality, images are uniformly scaled to 224×224 pixels, and missing or invalid images are replaced with a 224×224 blank image. For the text modality, missing or invalid text items are replaced with the string "0", and the text is then converted into text vectors with a text tokenizer. For the numerical modality, missing or invalid numerical items are replaced with 0; numerical items whose overall scale exceeds 10^6 undergo a natural logarithmic transformation, and the data are then Z-Score normalized. After preprocessing, the multi-modal data P, T, N are obtained, where P denotes image data, T text data and N numerical data. The popularity value of a post is computed as the prediction target:
p = log2(v/d + 1)
where p is the popularity value, v the index that most directly reflects the post's propagation, and d the number of days since the post's content was published.
Further, the specific method in the step 3 is as follows:
For the image modality, a pre-trained convolutional neural network is used as the image feature extraction network; for the text modality, a Transformer-based attention network is used as the text feature extraction network, and the text attention is computed:
Attention(T) = σ( (T·W_Q)(T·W_K)^T / √d_k ) · (T·W_V)
where σ denotes the softmax function, T the input text vector, W_Q, W_K, W_V the query, key and value parameter matrices of the attention mechanism, and d_k the dimension of W_K.
After feature extraction, the image features F_p and text features F_t are output.
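As an illustrative sketch (not part of the original disclosure), the scaled dot-product attention above can be written in NumPy; the sequence length and dimensions below are arbitrary toy values:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax (the σ in the formula above)
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_attention(T, W_Q, W_K, W_V):
    """Attention(T) = softmax( (T W_Q)(T W_K)^T / sqrt(d_k) ) (T W_V)."""
    Q, K, V = T @ W_Q, T @ W_K, T @ W_V
    d_k = W_K.shape[1]                       # key dimension d_k
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) attention logits
    return softmax(scores, axis=-1) @ V      # weighted sum of values

# toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))
out = text_attention(T, W_Q, W_K, W_V)
print(out.shape)  # (4, 8)
```

In a real implementation these projections would be the learned parameters inside BERT's attention layers rather than random matrices.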
Further, the specific method in the step 4 is as follows:
after the steps, three types of modes F p 、F t Inconsistent N dimensions will cause the final prediction result to be too biased to a certain class of modes, and three classes of mode data need to be aligned to the same dimension. And respectively using a full connection layer for the three types of modal data, and outputting consistent dimensionality. The output vectors are F p * 、F t * And N * Combine to x= { F p * ,F t * ,N * As input to the neural network regression task.
Further, the specific method in step 5 is as follows:
and carrying out regression prediction on the X input by a neural network consisting of three full-connection layers, and carrying out supervision training by adopting the mean square error loss to obtain a predicted value y based on the neural network.
Mean square error loss:
MSE = (1/n) · Σ_{i=1}^{n} (y_i - y'_i)^2
where y_i is the true label of the data, y'_i the value predicted by the neural network, and n the number of samples.
After each supervised training, the gradients of the network parameters need to be tailored in order to prevent gradient explosions:
g* = max(-c, min(c, g))
coef_clip = norm_max / (norm_total + ε)
grad_out = coef_clip · g* if coef_clip < 1, otherwise grad_out = g*
where g is the input gradient, c the set clipping threshold, g* the gradient after clipping at threshold c, norm_max the set maximum gradient norm, norm_total the 2-norm of the gradients of all parameters, coef_clip the clipping coefficient, grad_out the gradient output after clipping, and ε a small constant for numerical stability.
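An illustrative NumPy sketch of this two-stage clipping (elementwise clipping at threshold c, then global norm rescaling); it is not the patent's implementation, which would use the corresponding PyTorch utilities:

```python
import numpy as np

def clip_gradients(grads, c=1.0, norm_max=10.0, eps=1e-6):
    """Value-clip each gradient entry at threshold c, then rescale so the
    global 2-norm does not exceed norm_max (c=1, norm_max=10 as set above)."""
    # stage 1: elementwise clipping at threshold c -> g*
    clipped = [np.clip(g, -c, c) for g in grads]
    # stage 2: global norm clipping with coef_clip = norm_max / (norm_total + eps)
    norm_total = np.sqrt(sum((g ** 2).sum() for g in clipped))
    coef_clip = norm_max / (norm_total + eps)
    if coef_clip < 1.0:
        clipped = [g * coef_clip for g in clipped]
    return clipped

# two parameter tensors with exploding gradients
grads = [np.full((4, 4), 5.0), np.full((4,), -3.0)]
out = clip_gradients(grads)
print(max(abs(g).max() for g in out))  # every entry is within [-c, c]
```

In PyTorch the same effect is obtained with `torch.nn.utils.clip_grad_value_` followed by `torch.nn.utils.clip_grad_norm_` after each backward pass.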
The training parameters are as follows: under the PyTorch library, the Adam adaptive moment estimation optimizer is selected, with an initial learning rate of 0.001 and a batch size of 128; c in step 5 is set to 1 and norm_max to 10.
Furthermore, the image feature extraction network adopts ResNet-101, and the text feature extraction network adopts BERT.
The beneficial effects of the invention are as follows:
the invention can efficiently and stably crawl the social media content data by utilizing the data crawler framework, and the obtained invalid value and the missing value of the data occupy less. The pre-trained ResNet and BERT are used as an image feature extraction network and a text feature extraction network respectively, feature information in image and text modes can be fully extracted, feature dimensions of different modes are aligned, and a regression fitting result is not too biased to a certain mode, so that accuracy and robustness of a prediction result are improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a network architecture employed in an embodiment of the present invention;
Detailed Description
The invention will be further described in detail below with reference to the attached drawings and by means of specific examples.
The network structure adopted by the method is shown in fig. 2. First, the data crawler crawls published-content data from the target social media platform; after data preprocessing, the images and text pass through their respective feature extraction networks, and the regression part then computes the prediction result.
As shown in fig. 1, the embodiment of the present invention and the implementation process thereof are as follows:
the specific implementation process comprises two stages of training and testing of the deep neural network model:
step 1: the method comprises the steps of obtaining original data of target social media platform content and social attributes of publishers by utilizing a data crawler, wherein the original data comprise the following data: images (pictures attached to manuscripts, video cover frames, etc.), texts (titles, text contents, partition information, custom labels, authentication information of publishers, personal profiles, etc.), values before transmission (numerical information of publishers, such as number of fans, attention number, total number of manuscripts, total endorsements, etc.), values after transmission (click count, browsing count, comment count, forwarding number, collection number, release time, etc. within a certain time after the publication of the manuscripts).
Because different social media platforms have different platform structures and content presentation styles, the data crawler frameworks used also differ. In this embodiment the crawler framework for the video social media platform bilibili consists of four parts: a scheduler, a URL manager, a web page downloader and a web page parser. The scheduler coordinates the work of the URL manager, the downloader and the parser. The URL manager manages the URLs to be crawled and those already crawled, preventing repeated or cyclic crawling of URLs; it is implemented mainly through memory, a database or a cache database. The web page downloader is built with the Python third-party library Requests; given a URL, it downloads the page and converts it into a string. Finally, the web page parser is built on the JSON facilities of the Requests library; it parses the strings output by the downloader with regular expressions and extracts valuable information by fuzzy matching. Most social media sites have their own anti-crawler mechanisms that reject access requests suspected to come from crawler programs. To crawl data reasonably and effectively, a user agent is added to the request header as a disguise when sending access requests, after which the above crawler framework is used to crawl the data.
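A minimal sketch of the parser stage using only the standard library; the JSON payload and its field names below are hypothetical stand-ins, since real bilibili API endpoints and response schemas are not reproduced in the patent:

```python
import json
import re

# hypothetical downloader output (a real response string would come from
# requests.get(url, headers={"User-Agent": "..."}).text -- network call omitted)
RAW = '{"data": {"title": "demo video", "stat": {"view": 1024, "like": 99}}}'

def parse_item(raw: str) -> dict:
    """Parse the downloader's string output and extract the fields of interest."""
    payload = json.loads(raw)["data"]
    return {
        "title": payload["title"],
        "view": payload["stat"]["view"],
        "like": payload["stat"]["like"],
    }

# fuzzy matching with a regular expression, as the parser stage describes
views = re.search(r'"view":\s*(\d+)', RAW)
print(parse_item(RAW), int(views.group(1)))
```

The structured `json.loads` path is preferable when the response schema is known; the regular-expression path serves as the fallback fuzzy match.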
Step 2: and preprocessing the acquired data according to different modes. The data comprises three modes of images, texts and numerical values, for the image modes, the image size is uniformly scaled to 224 multiplied by 224 pixels, and the missing or invalid image is replaced by a 224 multiplied by 224 blank image; for the text mode, the missing or invalid text items are replaced by character strings of 0, and then the text is converted into text vectors by using a text word segmentation device; for numerical mode, the missing or invalid numerical terms are replaced with 0 and exceed 10 for the overall scale range 6 Natural logarithmic transformation is performed on the numerical values (such as vermicelli number, praise number, etc.):
x * =ln(x+1)
where x is the original value and x* the value after the natural logarithmic transformation.
The data were then Z-Score normalized:
N = (x* - μ) / σ
where x* is the value after the natural logarithmic transformation, μ the population mean of the item's values, σ their population standard deviation, and N the numerical data output after Z-Score normalization.
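The numerical-modality preprocessing above can be sketched as follows (an illustration, not the disclosed implementation; the follower counts are invented sample values):

```python
import numpy as np

def preprocess_numeric(values, scale_threshold=1e6):
    """Replace missing items with 0, log-transform items whose overall scale
    exceeds 10^6, then Z-Score normalize: N = (x* - mu) / sigma."""
    x = np.array([0.0 if v is None else float(v) for v in values])  # missing -> 0
    if x.max() > scale_threshold:          # overall scale exceeds 10^6
        x = np.log(x + 1.0)                # x* = ln(x + 1)
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma

follower_counts = [1_500_000, None, 320, 9_800_000]
N = preprocess_numeric(follower_counts)
print(N.mean().round(6), N.std().round(6))  # ≈ 0.0 and 1.0
```

After normalization each numerical item has zero mean and unit variance, so no single large-scale count dominates the regression input.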
After pretreatment, multi-mode data are obtained: p, T, N, where P represents image data, T represents text data, and N represents numerical data.
Because scales such as click and view counts span too large a range, using them directly as the prediction target would make the model deviate too much. A logarithmic transformation therefore converts the index that most directly reflects the propagation of social media content into a popularity value used as the prediction target:
p = log2(v/d + 1)
where p is the popularity value, v the index that most directly reflects the post's propagation (such as clicks, views or plays), and d the number of days since the post's content was published.
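Assuming the popularity formula is p = log2(v/d + 1) (reconstructed here from the surrounding description), the target computation is a one-liner:

```python
import math

def popularity(v: float, d: float) -> float:
    """p = log2(v/d + 1): views-per-day squashed by a base-2 logarithm."""
    return math.log2(v / d + 1)

print(popularity(1023, 1))   # 10.0
print(popularity(1023, 31))  # the same views accumulated over a month score lower
```

The logarithm compresses the heavy-tailed view-count distribution, while dividing by d normalizes for how long the post has been live.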
Step 3: and extracting the characteristics of the image mode and the text mode through a characteristic extraction module. Feature extraction is performed on the image using a convolutional neural network. Whereas for text, a transducer-based attention network is used for feature extraction:
Attention(T) = σ( (T·W_Q)(T·W_K)^T / √d_k ) · (T·W_V)
where σ denotes the softmax function, T the input text vector, W_Q, W_K, W_V the query, key and value parameter matrices of the attention mechanism, and d_k the dimension of W_K.
After feature extraction, the image features F_p and text features F_t are output. The image feature extraction network is ResNet-101 and the text feature extraction network is BERT.
Step 4: and adjusting the data of the three types of modes to the same dimension.
After the above steps, the dimensions of the three modality features F_p, F_t and N are inconsistent, which would bias the final prediction toward one modality, so the three must be aligned to the same dimension. A fully connected layer is applied to each of the three modalities, each outputting a 128-dimensional vector. The output vectors F_p*, F_t* and N* are merged into X = {F_p*, F_t*, N*} as input to the neural network regression task.
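An illustrative NumPy sketch of this alignment-and-merge step. The input dimensions (2048 for ResNet-101 pooled features, 768 for BERT, 14 numerical fields) are assumptions for the example, and the random weights stand in for the trained fully connected layers:

```python
import numpy as np

rng = np.random.default_rng(1)

def fc(x, out_dim=128):
    """One fully connected layer projecting a feature vector to out_dim."""
    W = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    b = np.zeros(out_dim)
    return x @ W + b

F_p = rng.standard_normal(2048)   # e.g. ResNet-101 pooled image features
F_t = rng.standard_normal(768)    # e.g. BERT [CLS] text features
N   = rng.standard_normal(14)     # numerical modality after normalization

# X = {F_p*, F_t*, N*}: each modality aligned to 128-d, then concatenated
X = np.concatenate([fc(F_p), fc(F_t), fc(N)])
print(X.shape)  # (384,)
```

Because each modality contributes an equal 128 dimensions, no single modality dominates the regression input by sheer width.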
Step 5: the X input was regressed to a neural network consisting of three fully connected layers.
And performing supervision training by adopting the mean square error loss to obtain a predicted value y based on the neural network.
Mean square error loss:
MSE = (1/n) · Σ_{i=1}^{n} (y_i - y'_i)^2
where y_i is the true label of the data, y'_i the value predicted by the neural network, and n the number of samples.
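For concreteness, the loss above reduces to a few lines of plain Python (a worked illustration, not the training code):

```python
def mse(y_true, y_pred):
    """MSE = (1/n) * sum_i (y_i - y'_i)^2 over n samples."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# per-sample squared errors: 0, 0.25 and 1.0, averaged over n = 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # 1.25/3 ≈ 0.4167
```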
After each supervised training, the gradients of the network parameters need to be tailored in order to prevent gradient explosions:
g* = max(-c, min(c, g))
coef_clip = norm_max / (norm_total + ε)
grad_out = coef_clip · g* if coef_clip < 1, otherwise grad_out = g*
where g is the input gradient, c the clipping threshold, g* the gradient after clipping at threshold c, norm_max the set maximum gradient norm, norm_total the 2-norm of the gradients of all parameters, grad_out the gradient output after clipping, and ε a small constant for numerical stability.
The training parameters are as follows: under the PyTorch library, the Adam adaptive moment estimation optimizer is selected, with an initial learning rate of 0.001 and a batch size of 128; c in step 5 is set to 1 and norm_max to 10.
Testing:
step 6: preprocessing the test data according to a preprocessing mode of a training stage, namely scaling the image size to 224×224 pixels for an image mode, wherein the missing or invalid image is replaced by a 224×224 blank image; for the text mode, the missing or invalid text items are replaced by character strings of 0, and then the text is converted into text vectors by using a text word segmentation device; for the numerical modality, the missing or invalid numerical terms are replaced with 0, and then the data is Z-Score normalized. And inputting the preprocessed data into a network for feature extraction and fusion, and outputting a predicted value after calculation by a neural network.
The foregoing further describes the invention in detail with reference to specific/preferred embodiments, but the invention is not limited to this description. Those skilled in the art may make several substitutions or modifications to the described embodiments without departing from the spirit of the invention, and such substitutions or modifications should be considered within the scope of the invention.
Parts of the invention not described in detail are within the ordinary skill of those in the art.

Claims (7)

1. A multi-modal feature fusion social media content propagation prediction method is characterized by comprising the following steps:
step 1, acquiring original data of target social media platform content and social attributes of publishers by utilizing a data crawler;
step 2, preprocessing the acquired data according to different modes respectively;
step 3, extracting features of the image mode and the text mode through a feature extraction module;
step 4, adjusting the data of different modes to the same dimension;
step 5, training the neural network with supervision;
the neural network regression task being supervised with the mean squared error.
2. The method for predicting social media content propagation by multi-modal feature fusion according to claim 1, wherein the specific method in step 1 is as follows:
raw data on the target platform's content and the publishers' social attributes are acquired with a data crawler and comprise: images, text, pre-propagation values and post-propagation values; the images comprise pictures attached to posts and video cover frames; the text comprises titles, body content, section information, the posts' custom tags, and the publishers' verification information and personal profiles; the pre-propagation values are numerical information about the publisher, comprising follower count, following count, total number of posts and total number of likes; the post-propagation values comprise the clicks, views, comments, reposts and favorites within a certain time after publication, together with the publication time.
3. The method for predicting social media content propagation by multi-modal feature fusion according to claim 2, wherein the specific method in step 2 is as follows:
the data comprise three modalities: images, text and numerical values; for the image modality, images are uniformly scaled to 224×224 pixels, and missing or invalid images are replaced with a 224×224 blank image; for the text modality, missing or invalid text items are replaced with the string "0", and the text is then converted into text vectors with a text tokenizer; for the numerical modality, missing or invalid numerical items are replaced with 0, numerical items whose overall scale exceeds 10^6 undergo a natural logarithmic transformation, and the data are then Z-Score normalized; after preprocessing, the multi-modal data P, T, N are obtained, wherein P denotes image data, T text data and N numerical data; and the popularity value of a post is computed as the prediction target:
p = log2(v/d + 1)
wherein p denotes the popularity value, v the index that most directly reflects the post's propagation, and d the number of days since the post's content was published.
4. The method for predicting social media content propagation by multi-modal feature fusion according to claim 3, wherein the specific method in step 3 is as follows:
for the image modality, a pre-trained convolutional neural network is used as the image feature extraction network, while for the text modality a Transformer-based attention network is used as the text feature extraction network and the text attention is computed:
Attention(T) = σ( (T·W_Q)(T·W_K)^T / √d_k ) · (T·W_V)
wherein σ denotes the softmax function, T the input text vector, W_Q, W_K, W_V the query, key and value parameter matrices of the attention mechanism, and d_k the dimension of W_K;
and the image features F_p and text features F_t are output after feature extraction.
5. The method for predicting social media content propagation by multimodal feature fusion as claimed in claim 4, wherein the specific method in step 4 is as follows:
after the preceding steps, the dimensions of the three modality features F_p, F_t and N are inconsistent, which would bias the final prediction toward one modality, so the three must be aligned to the same dimension; a fully connected layer is applied to each of the three modalities to output a consistent dimension; and the output vectors F_p*, F_t* and N* are merged into X = {F_p*, F_t*, N*} as input to the neural network regression task.
6. The method for predicting social media content propagation by multimodal feature fusion as claimed in claim 5, wherein the specific method in step 5 is as follows:
X is fed into a neural network composed of three fully connected layers for regression prediction, and supervised training with the mean squared error loss yields the neural-network-based prediction y;
mean square error loss:
MSE = (1/n) · Σ_{i=1}^{n} (y_i - y'_i)^2
wherein y_i denotes the true label of the data, y'_i the value predicted by the neural network, and n the number of samples;
after each supervised training, the gradients of the network parameters need to be tailored in order to prevent gradient explosions:
g* = max(-c, min(c, g))
coef_clip = norm_max / (norm_total + ε)
grad_out = coef_clip · g* if coef_clip < 1, otherwise grad_out = g*
wherein g is the input gradient, c the set clipping threshold, g* the gradient after clipping at threshold c, norm_max the set maximum gradient norm, norm_total the 2-norm of the gradients of all parameters, coef_clip the clipping coefficient, grad_out the gradient output after clipping, and ε a small constant for numerical stability;
the training parameters being: under the PyTorch library, the Adam adaptive moment estimation optimizer with an initial learning rate of 0.001 and a batch size of 128, with c set to 1 and norm_max set to 10.
7. The multi-modal feature fusion social media content propagation prediction method according to claim 4, 5 or 6, wherein the image feature extraction network is ResNet-101 and the text feature extraction network is BERT.
CN202310255939.1A 2023-03-16 2023-03-16 Multi-modal feature fusion social media content propagation prediction method Pending CN116258600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255939.1A CN116258600A (en) 2023-03-16 2023-03-16 Multi-modal feature fusion social media content propagation prediction method


Publications (1)

Publication Number Publication Date
CN116258600A true CN116258600A (en) 2023-06-13

Family

ID=86679239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255939.1A Pending CN116258600A (en) 2023-03-16 2023-03-16 Multi-modal feature fusion social media content propagation prediction method

Country Status (1)

Country Link
CN (1) CN116258600A (en)

Similar Documents

Publication Publication Date Title
KR101721338B1 (en) Search engine and implementation method thereof
CN106599022B (en) User portrait forming method based on user access data
CA3088695C (en) Method and system for decoding user intent from natural language queries
US11410031B2 (en) Dynamic updating of a word embedding model
CN111371806A (en) Web attack detection method and device
CN112119388A (en) Training image embedding model and text embedding model
US20060287988A1 (en) Keyword charaterization and application
US11586927B2 (en) Training image and text embedding models
CN109918621B (en) News text infringement detection method and device based on digital fingerprints and semantic features
CN110287314B (en) Long text reliability assessment method and system based on unsupervised clustering
US11874798B2 (en) Smart dataset collection system
CN112912873A (en) Dynamically suppressing query replies in a search
CN113553510B (en) Text information recommendation method and device and readable medium
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN115775349A (en) False news detection method and device based on multi-mode fusion
CN114357204B (en) Media information processing method and related equipment
CN112269906A (en) Automatic extraction method and device of webpage text
CN116258600A (en) Multi-modal feature fusion social media content propagation prediction method
CN113163234B (en) Pirate video website detection method and system based on third-party service
CN113806536B (en) Text classification method and device, equipment, medium and product thereof
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
CN115129902A (en) Media data processing method, device, equipment and storage medium
CN114691850A (en) Method for generating question-answer pairs, training method and device of neural network model
CN111950717A (en) Public opinion quantification method based on neural network
US20240020476A1 (en) Determining linked spam content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination