CN117077085B - Multi-mode harmful social media content identification method combining large model with two-way memory - Google Patents


Info

Publication number
CN117077085B
CN117077085B CN202311339502.2A CN202311339502A
Authority
CN
China
Prior art keywords
text
memory
image
vector
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311339502.2A
Other languages
Chinese (zh)
Other versions
CN117077085A (en)
Inventor
宋彦 (Yan Song)
张勇东 (Yongdong Zhang)
田元贺 (Yuanhe Tian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202311339502.2A priority Critical patent/CN117077085B/en
Publication of CN117077085A publication Critical patent/CN117077085A/en
Application granted granted Critical
Publication of CN117077085B publication Critical patent/CN117077085B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • G06F18/15Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of social media content identification and discloses a multi-modal harmful social media content identification method combining a large model with two-way memory, comprising the following steps: extracting image features from the image using an image encoder; extracting text features from the text using a text embedding module; modeling the image features and text features using a dual-channel memory module to obtain an image vector and a text vector; and performing label prediction using a large model: the image vector and the text vector are input into the large model, which generates the label. Through the dual-channel memory module, the model can compute weights for the different memory vectors according to the visual features, and this weight assignment enables the model to align and fuse cross-modal information more accurately. In addition, the visual channel and the text channel are processed by the same procedure, so that the two information sources are considered equally and effectively.

Description

Multi-mode harmful social media content identification method combining large model with two-way memory
Technical Field
The invention relates to the technical field of social media content identification, in particular to a multi-mode harmful social media content identification method combining a large model with two-way memory.
Background
Social media contains a large amount of multimodal content (such as memes and the like), and such content may carry harmful information (for example, a meme image and its accompanying text together conveying harmful content). Identifying such content requires effective processing and combination of images and text.
The prior art does not exploit the text modeling capability of large language models. Meanwhile, existing cross-modal combination methods concatenate the multimodal features or compute their outer product, which does not satisfy the task's requirement for joint understanding of multimodal information.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a multi-modal harmful social media content identification method combining a large model with two-way memory. The proposed dual-channel memory network achieves effective alignment of multimodal information in a memory semantic space, and the language modeling capability of a large model is exploited to improve the identification performance on multimodal harmful content.
In order to solve the technical problems, the invention adopts the following technical scheme:
a multi-mode harmful social media content identification method combining large model with two-way memory inputs given imagesAnd text->Outputting a tag for judging whether the multimodal input is harmful +.>The method specifically comprises the following steps:
step one: extracting an image using an image encoderImage characteristics of->
Step two: extracting text using text embedding moduleText feature of->
Step three: image feature by using dual-channel memory moduleText feature->Modeling, obtaining image vectors respectively>Text vector +.>The method comprises the steps of carrying out a first treatment on the surface of the The dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel respectively encode image features and text features; the memory vector is a parameter of the dual-channel memory module, which represents memory semantic space and is marked as +.>,/>Is->A plurality of memory vectors;
a process for visual channel encoding image features, comprising the steps of:
s31: computing is based on image featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Visual weight score +.>
S32: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned image feature +.>
S33: tandem aligned image featuresAnd image feature->Obtaining the image vector of visual channel output>
A process for text passage encoding text features, comprising the steps of:
s34: computing is based on text featuresWeights of different memory vectors of (4) to obtain +.>Memory vector->Text weight score +.>
S35: will beApplying to the corresponding memory vectors, calculating the weighted sum of all memory vectors to obtain the aligned text feature +.>
S36: text features aligned in seriesAnd text feature->Obtaining text vector of text channel output>
Step four, label prediction is carried out by using a large model: vector of imageText vector +.>Inputting into a large model, generating tag +.>
Further, in step S31, the visual weight score λ_n of the n-th memory vector m_n is computed as:
λ_n = exp(vᵀ · W_v · m_n) / Σ_{n'=1..N} exp(vᵀ · W_v · m_{n'}),
where W_v is a trainable parameter matrix;
in step S32, λ_n is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned image feature a_v:
a_v = Σ_{n=1..N} λ_n · m_n;
in step S33, the aligned image feature a_v and the image feature v are concatenated to obtain the image vector h_v output by the visual channel:
h_v = [a_v ; v],
where [· ; ·] denotes vector concatenation.
Further, in step S34, the text weight score μ_n of the n-th memory vector m_n is computed as:
μ_n = exp(eᵀ · W_t · m_n) / Σ_{n'=1..N} exp(eᵀ · W_t · m_{n'}),
where W_t is a trainable parameter matrix;
in step S35, μ_n is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned text feature a_t:
a_t = Σ_{n=1..N} μ_n · m_n;
in step S36, the aligned text feature a_t and the text feature e are concatenated to obtain the text vector h_t output by the text channel:
h_t = [a_t ; e],
where [· ; ·] denotes vector concatenation.
Compared with the prior art, the invention has the following beneficial technical effects:
The excellent text encoding and representation capability of the large model is exploited, improving the performance of multimodal harmful content identification.
Through the dual-channel memory module, the model can compute weights for the different memory vectors according to the visual features; this weight assignment enables the model to align and fuse information more accurately. In addition, the visual channel and the text channel are processed by the same procedure, so that the two information sources are considered equally and effectively.
Drawings
FIG. 1 is a diagram of the structure of the model of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The model structure of the present invention is shown in fig. 1. The task of the invention is, for a given image I and text T, to identify a label y judging whether the multimodal input is harmful. The method specifically comprises the following steps:
Step one: extracting the image features v of the image I using an image encoder. The image encoder of the present invention adopts a Vision Transformer encoder.
Step two: extracting the text features e of the text T using a text embedding module. The text embedding module maps each word in the text to a vector using a matrix, where the vector of the i-th word in the vocabulary is the i-th row of the matrix; the vectors of all words are then averaged to obtain the text feature e.
Step three: image feature by using dual-channel memory moduleText feature->Modeling, obtaining an image vector->Text vector +.>
Step four, label prediction is carried out by using a large model: image vectorText vector +.>Inputting into a large model, generating tag +.>. The large model in the invention adopts Chinese-aplara.
The two-channel memory module related to the third step comprises a group of memory vectors and two independent channels, namely a visual channel and a text channel, which respectively encode image features and text features. The memory vector is a parameter of the dual-channel memory module, and represents memory semantic space and is recorded as,/>Is->And memorizing the vectors.
The process by which the visual channel encodes the image features comprises the following steps:
S31: computing weights for the different memory vectors based on the image features v, obtaining the visual weight score λ_n of the n-th memory vector m_n:
λ_n = exp(vᵀ · W_v · m_n) / Σ_{n'=1..N} exp(vᵀ · W_v · m_{n'}),
where W_v is a trainable parameter matrix.
S32: applying λ_n to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned image feature a_v:
a_v = Σ_{n=1..N} λ_n · m_n.
S33: concatenating the aligned image feature a_v and the image feature v to obtain the image vector h_v output by the visual channel:
h_v = [a_v ; v],
where [· ; ·] denotes vector concatenation.
The output image vector h_v will be used in the large model.
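The visual-channel computation of S31 through S33 can be sketched in numpy as below. The normalised (softmax) form of the weight score is an assumption, since the published formula images are not reproduced here; the weighted sum and the concatenation follow the steps directly, and all dimensions and random values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

d, N = 8, 16                         # feature dimension and memory size (illustrative)
M = rng.standard_normal((N, d))      # memory vectors m_1..m_N (trainable parameters)
W_v = rng.standard_normal((d, d))    # trainable parameter matrix of the visual channel

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def visual_channel(v):
    lam = softmax(M @ (W_v @ v))     # S31: weight score per memory vector (assumed softmax)
    a_v = lam @ M                    # S32: weighted sum of memory vectors -> aligned feature
    return np.concatenate([a_v, v])  # S33: concatenate aligned and original image features

v = rng.standard_normal(d)           # image feature from the image encoder (stand-in)
h_v = visual_channel(v)              # image vector passed on to the large model
```

Because the aligned feature a_v lives in the shared memory semantic space while v keeps the raw visual information, the concatenation h_v carries both; the text channel produces h_t the same way.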
The process by which the text channel encodes the text features comprises the following steps:
S34: computing weights for the different memory vectors based on the text features e, obtaining the text weight score μ_n of the n-th memory vector m_n:
μ_n = exp(eᵀ · W_t · m_n) / Σ_{n'=1..N} exp(eᵀ · W_t · m_{n'}),
where W_t is a trainable parameter matrix.
S35: applying μ_n to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned text feature a_t:
a_t = Σ_{n=1..N} μ_n · m_n.
S36: concatenating the aligned text feature a_t and the text feature e to obtain the text vector h_t output by the text channel:
h_t = [a_t ; e],
where [· ; ·] denotes vector concatenation.
The output text vector h_t will be used in the large model.
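Step four feeds the two channel outputs to the large model (Chinese-Alpaca), which generates the label token. The sketch below substitutes a simple linear head over the concatenated vectors purely to illustrate the interface; the head, its random weights, and the two-way decision are assumptions, not the patent's actual large-model prediction.

```python
import numpy as np

rng = np.random.default_rng(2)

d = 8
h_v = rng.standard_normal(2 * d)     # image vector from the visual channel (stand-in)
h_t = rng.standard_normal(2 * d)     # text vector from the text channel (stand-in)

# Hypothetical stand-in for the large model: project the concatenated channel
# outputs to {harmless, harmful} logits and take the argmax as the label y.
W_out = rng.standard_normal((2, 4 * d))
logits = W_out @ np.concatenate([h_v, h_t])
y = "harmful" if int(logits.argmax()) == 1 else "harmless"
```

In the actual method the large model consumes h_v and h_t as inputs and generates y autoregressively, so the decision benefits from the model's language understanding rather than a fixed linear boundary.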
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted merely for clarity. The specification should be taken as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (3)

1. A multi-mode harmful social media content identification method combining a large model with two-way memory, which takes as input a given image I and text T and outputs a label y judging whether the multimodal input is harmful, the method specifically comprising the following steps:
step one: extracting the image features v of the image I using an image encoder;
step two: extracting the text features e of the text T using a text embedding module;
step three: modeling the image features v and the text features e using a dual-channel memory module, obtaining an image vector h_v and a text vector h_t respectively; the dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel encode the image features and the text features respectively; the memory vectors are parameters of the dual-channel memory module, representing the memory semantic space, denoted M = {m_1, …, m_N}, where m_n is the n-th memory vector;
the process by which the visual channel encodes the image features comprises the following steps:
S31: computing weights for the different memory vectors based on the image features v, obtaining the visual weight score λ_n of the n-th memory vector m_n:
λ_n = exp(vᵀ · W_v · m_n) / Σ_{n'=1..N} exp(vᵀ · W_v · m_{n'}),
where W_v is a trainable parameter matrix;
S32: applying λ_n to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned image feature a_v;
S33: concatenating the aligned image feature a_v and the image feature v to obtain the image vector h_v output by the visual channel;
the process by which the text channel encodes the text features comprises the following steps:
S34: computing weights for the different memory vectors based on the text features e, obtaining the text weight score μ_n of the n-th memory vector m_n:
μ_n = exp(eᵀ · W_t · m_n) / Σ_{n'=1..N} exp(eᵀ · W_t · m_{n'}),
where W_t is a trainable parameter matrix;
S35: applying μ_n to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned text feature a_t;
S36: concatenating the aligned text feature a_t and the text feature e to obtain the text vector h_t output by the text channel;
step four, performing label prediction using a large model: the image vector h_v and the text vector h_t are input into the large model, which generates the label y.
2. The method for identifying multi-mode harmful social media content combining a large model with two-way memory according to claim 1, characterized in that:
in step S32, λ_n is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned image feature a_v:
a_v = Σ_{n=1..N} λ_n · m_n;
in step S33, the aligned image feature a_v and the image feature v are concatenated to obtain the image vector h_v output by the visual channel:
h_v = [a_v ; v],
where [· ; ·] denotes vector concatenation.
3. The method for identifying multi-mode harmful social media content combining a large model with two-way memory according to claim 1, characterized in that:
in step S35, μ_n is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned text feature a_t:
a_t = Σ_{n=1..N} μ_n · m_n;
in step S36, the aligned text feature a_t and the text feature e are concatenated to obtain the text vector h_t output by the text channel:
h_t = [a_t ; e],
where [· ; ·] denotes vector concatenation.
CN202311339502.2A 2023-10-17 2023-10-17 Multi-mode harmful social media content identification method combining large model with two-way memory Active CN117077085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311339502.2A CN117077085B (en) 2023-10-17 2023-10-17 Multi-mode harmful social media content identification method combining large model with two-way memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311339502.2A CN117077085B (en) 2023-10-17 2023-10-17 Multi-mode harmful social media content identification method combining large model with two-way memory

Publications (2)

Publication Number Publication Date
CN117077085A CN117077085A (en) 2023-11-17
CN117077085B true CN117077085B (en) 2024-02-09

Family

ID=88704676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311339502.2A Active CN117077085B (en) 2023-10-17 2023-10-17 Multi-mode harmful social media content identification method combining large model with two-way memory

Country Status (1)

Country Link
CN (1) CN117077085B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291175B (en) * 2023-11-27 2024-03-29 中国科学技术大学 Method for detecting generated text based on statistical feature fusion of multiple large language models
CN117542538A (en) * 2024-01-10 2024-02-09 中国科学技术大学 Medical multi-mode content analysis and generation method based on reinforcement learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814454A (en) * 2020-07-10 2020-10-23 重庆大学 Multi-modal network spoofing detection model on social network
EP3754548A1 (en) * 2019-06-17 2020-12-23 Sap Se A method for recognizing an object in an image using features vectors of an encoding neural network
CN112395442A (en) * 2020-10-12 2021-02-23 杭州电子科技大学 Automatic identification and content filtering method for popular pictures on mobile internet
CN114491289A (en) * 2021-12-31 2022-05-13 南京信息工程大学 Social content depression detection method of bidirectional gated convolutional network
CN115964482A (en) * 2022-05-24 2023-04-14 西北工业大学 Multi-mode false news detection method based on user cognitive consistency reasoning
CN116450819A (en) * 2023-03-10 2023-07-18 西安交通大学 Multi-mode emotion recognition method and system based on self-adaptive fusion
CN116563854A (en) * 2023-05-11 2023-08-08 中国联合网络通信集团有限公司 Text recognition method, device, equipment and storage medium based on double channels
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055537B2 (en) * 2016-04-26 2021-07-06 Disney Enterprises, Inc. Systems and methods for determining actions depicted in media contents based on attention weights of media content frames
CN112036513B (en) * 2020-11-04 2021-03-09 成都考拉悠然科技有限公司 Image anomaly detection method based on memory-enhanced potential spatial autoregression

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3754548A1 (en) * 2019-06-17 2020-12-23 Sap Se A method for recognizing an object in an image using features vectors of an encoding neural network
CN111814454A (en) * 2020-07-10 2020-10-23 重庆大学 Multi-modal network spoofing detection model on social network
CN112395442A (en) * 2020-10-12 2021-02-23 杭州电子科技大学 Automatic identification and content filtering method for popular pictures on mobile internet
CN114491289A (en) * 2021-12-31 2022-05-13 南京信息工程大学 Social content depression detection method of bidirectional gated convolutional network
CN115964482A (en) * 2022-05-24 2023-04-14 西北工业大学 Multi-mode false news detection method based on user cognitive consistency reasoning
CN116450819A (en) * 2023-03-10 2023-07-18 西安交通大学 Multi-mode emotion recognition method and system based on self-adaptive fusion
CN116719930A (en) * 2023-04-28 2023-09-08 西安工程大学 Multi-mode emotion analysis method based on visual attention
CN116563854A (en) * 2023-05-11 2023-08-08 中国联合网络通信集团有限公司 Text recognition method, device, equipment and storage medium based on double channels

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Abhinav Kumar et al. A deep multi-modal neural network for informative Twitter content classification during emergencies. 《https://doi.org/10.1007/s10479-020-03514-x》. 2022, full text. *
Sequential Prediction of Social Media Popularity with Deep Temporal Context Networks; Bo Wu et al.; 《arXiv:1712.04443v1》; full text *
Multi-modal fake news detection with dual-branch cue deep perception and adaptive collaborative optimization; Zhong Shannan et al.; 《https://link.cnki.net/urlid/11.1826.TP.20230814.1352.002》; full text *

Also Published As

Publication number Publication date
CN117077085A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN117077085B (en) Multi-mode harmful social media content identification method combining large model with two-way memory
CN107680579A (en) Text regularization model training method and device, text regularization method and device
US11487952B2 (en) Method and terminal for generating a text based on self-encoding neural network, and medium
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
JP2022177218A (en) Virtual image generation model training method and virtual image generation method
CN107731228A (en) The text conversion method and device of English voice messaging
CN110188348B (en) Chinese language processing model and method based on deep neural network
CN109712108B (en) Visual positioning method for generating network based on diversity discrimination candidate frame
CN110210027B (en) Fine-grained emotion analysis method, device, equipment and medium based on ensemble learning
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN108108468A (en) A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN114462567A (en) Attention mechanism-based neural network model
CN114969338A (en) Image-text emotion classification method and system based on heterogeneous fusion and symmetric translation
CN114694224A (en) Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product
CN116721176B (en) Text-to-face image generation method and device based on CLIP supervision
CN111859925B (en) Emotion analysis system and method based on probability emotion dictionary
CN112966503A (en) Aspect level emotion analysis method
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN116663566A (en) Aspect-level emotion analysis method and system based on commodity evaluation
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN117688936B (en) Low-rank multi-mode fusion emotion analysis method for graphic fusion
CN116975298B (en) NLP-based modernized society governance scheduling system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant