CN117077085B

CN117077085B - Multi-mode harmful social media content identification method combining large model with two-way memory

Info

Publication number: CN117077085B
Application number: CN202311339502.2A
Authority: CN
Inventors: 宋彦; 张勇东; 田元贺
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2023-10-17
Filing date: 2023-10-17
Publication date: 2024-02-09
Anticipated expiration: 2043-10-17
Also published as: CN117077085A

Abstract

The invention relates to the technical field of social media content identification, and discloses a multi-mode harmful social media content identification method combining a large model with two-way memory, which comprises the following steps: extracting image features of the image using an image encoder; extracting text characteristics of the text by using a text embedding module; modeling the image features and the text features by using a dual-channel memory module to obtain image vectors and text vectors; tag prediction using a large model: the image vector and the text vector are input into the large model, and the label is generated. According to the invention, through the dual-channel memory module, weights can be calculated for different memory vectors according to visual characteristics, and the weight distribution enables the model to align and fuse information more accurately. In addition, the visual channel and the text channel are processed by the same program, so that the two information sources are equally and effectively considered.

Description

Multi-mode harmful social media content identification method combining large model with two-way memory

Technical Field

The invention relates to the technical field of social media content identification, in particular to a multi-mode harmful social media content identification method combining a large model with two-way memory.

Background

There are a large number of multi-modal information (such as expression packages and the like) in social media, and the information may include harmful information (such as expression packages and the accompanying text thereof include harmful information), and identification of the information requires effective processing and combination of images and texts.

The prior art does not utilize the text modeling capability of a large language model, and meanwhile, the prior cross-modal information combination method uses serial multi-modal characteristics or calculates the outer product of the multi-modal characteristics to realize multi-modal information combination, so that the requirements of tasks on the common understanding of the multi-modal information are not satisfied.

Disclosure of Invention

In order to solve the technical problems, the invention provides a multi-mode harmful social media content identification method combining a large model with two-way memory. According to the invention, the proposed dual-channel memory network is utilized to realize effective alignment of multi-mode information in a memory semantic space, and the recognition performance of multi-mode harmful contents is improved by utilizing the language modeling capability of a large model.

In order to solve the technical problems, the invention adopts the following technical scheme:

a multi-mode harmful social media content identification method combining large model with two-way memory inputs given imagesAnd text->Outputting a tag for judging whether the multimodal input is harmful +.>The method specifically comprises the following steps:

step one: extracting an image using an image encoderImage characteristics of->；

Step two: extracting text using text embedding moduleText feature of->；

Step three: image feature by using dual-channel memory moduleText feature->Modeling, obtaining image vectors respectively>Text vector +.>The method comprises the steps of carrying out a first treatment on the surface of the The dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel respectively encode image features and text features; the memory vector is a parameter of the dual-channel memory module, which represents memory semantic space and is marked as +.>，/>Is->A plurality of memory vectors;

a process for visual channel encoding image features, comprising the steps of:

s31: computing is based on image featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Visual weight score +.>；

S32: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned image feature +.>；

S33: tandem aligned image featuresAnd image feature->Obtaining the image vector of visual channel output>；

A process for text passage encoding text features, comprising the steps of:

s34: computing is based on text featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Text weight score +.>；

S35: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned text feature +.>；

S36: text features aligned in seriesAnd text feature->Obtaining text vector of text channel output>；

Step four, label prediction is carried out by using a large model: vector of imageText vector +.>Inputting into a large model, generating tag +.>。

Further, in step S31, the first calculation is performedMemory vector->Visual weight score +.>When (1):

；

is a trainable parameter matrix;

in step S32Applied to the corresponding memory vector, calculating the weighted sum of all memory vectors and obtaining the aligned image features +.>When (1):

；

in step S33, the series-aligned image featuresAnd image feature->Obtaining the image vector of visual channel output>When (1):

；

wherein,representing the series of vectors.

Further, in step S34, the first calculation is performedMemory vector->Text weight score +.>When (1):

；

wherein,is a trainable parameter matrix;

in step S35Applied to the corresponding memory vector, calculating the weighted sum of all memory vectors and obtaining the aligned text feature +.>When (1):

；

in step S36, the series-aligned text featuresAnd text feature->Obtaining text vector of text channel output>When (1):

；

wherein,representing the series of vectors.

Compared with the prior art, the invention has the beneficial technical effects that:

and the excellent text coding and characterization capability of the large model is utilized, so that the performance of multi-mode harmful content identification is improved.

Through the dual-channel memory module, the model can calculate weights for different memory vectors according to visual features. This weight assignment enables the model to more accurately align and fuse information. In addition, the visual channel and the text channel are processed by the same program, so that the two information sources are equally and effectively considered.

Drawings

FIG. 1 is a diagram of the structure of the model of the present invention.

Detailed Description

A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

The model structure of the present invention is shown in fig. 1. The task of the invention is to provide a method for displaying a given imageAnd text->Identifying a tag for judging whether the multimodal input is harmful>The method specifically comprises the following steps:

step one: extracting an image using an image encoderImage characteristics of->. The image encoder of the present invention adopts a Vision Transformer encoder.

Step two: extracting text using text embedding moduleText feature of->. The text embedding module maps each word in the text to a vector using a matrix, the vector corresponding to the i-th word in the vocabulary being the i-th row in the matrix, howeverAnd then averaging vectors of all words to obtain the text feature e.

Step three: image feature by using dual-channel memory moduleText feature->Modeling, obtaining an image vector->Text vector +.>。

Step four, label prediction is carried out by using a large model: image vectorText vector +.>Inputting into a large model, generating tag +.>. The large model in the invention adopts Chinese-aplara.

The two-channel memory module related to the third step comprises a group of memory vectors and two independent channels, namely a visual channel and a text channel, which respectively encode image features and text features. The memory vector is a parameter of the dual-channel memory module, and represents memory semantic space and is recorded as，/>Is->And memorizing the vectors.

A process for visual channel encoding image features, comprising the steps of:

s31: computing is based on image featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Visual weight score +.>：

；

Wherein,is a trainable parameter matrix.

S32: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned image feature +.>：

。

S33: tandem aligned image featuresAnd image feature->Obtaining the image vector of visual channel output>：

；

Output image vectorWill be used in large models.

A process for text passage encoding text features, comprising the steps of:

s34: computing is based on text featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Text weight score +.>：

；

Wherein,is a trainable parameter matrix.

S35: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned text feature +.>：

。

S36: text features aligned in seriesAnd text feature->Obtaining text vector of text channel output>：

；

Output text vectorWill be used in large models.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Furthermore, it should be understood that although the present disclosure describes embodiments, not every embodiment is provided with a single embodiment, and that this description is provided for clarity only, and that the disclosure is not limited to specific embodiments, and that the embodiments may be combined appropriately to form other embodiments that will be understood by those skilled in the art.

Claims

1. A multi-mode harmful social media content identification method combining large model with two-way memory inputs given imagesAnd text->Outputting a tag for judging whether the multimodal input is harmful +.>The method specifically comprises the following steps:

Step two: extracting text using text embedding moduleText feature of->；

Step three: image feature by using dual-channel memory moduleText feature->Modeling, obtaining image vectors respectively>Text vector +.>The method comprises the steps of carrying out a first treatment on the surface of the The dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel respectively encode image features and text features; the memory vector is a parameter of the dual-channel memory module and represents the memory languageThe sense space is marked as->，/>Is->A plurality of memory vectors;

a process for visual channel encoding image features, comprising the steps of:

；

A process for text passage encoding text features, comprising the steps of:

；

Wherein,is a trainable parameter matrix;

2. The method for identifying multi-modal harmful social media content by combining large models with two-way memory according to claim 1, wherein the method is characterized by comprising the following steps of:

；

wherein,representing the series of vectors.

3. The method for identifying multi-modal harmful social media content by combining large models with two-way memory according to claim 1, wherein the method is characterized by comprising the following steps of:

；

wherein,representing the series of vectors.