CN117077085B - Multi-mode harmful social media content identification method combining large model with two-way memory - Google Patents
- Publication number: CN117077085B
- Application number: CN202311339502.2A
- Authority: CN (China)
- Prior art keywords: text, memory, image, vector, vectors
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of social media content identification and discloses a multi-modal harmful social media content identification method that combines a large model with a dual-channel (two-way) memory. The method comprises the following steps: extract image features of the image using an image encoder; extract text features of the text using a text embedding module; model the image features and text features with a dual-channel memory module to obtain an image vector and a text vector; and perform label prediction with a large model by inputting the image vector and the text vector into the large model to generate the label. Through the dual-channel memory module, the invention can compute weights for the different memory vectors according to the visual features, and this weight distribution enables the model to align and fuse information more accurately. In addition, the visual channel and the text channel are processed by the same procedure, so both information sources are considered equally and effectively.
Description
Technical Field
The invention relates to the technical field of social media content identification, and in particular to a multi-modal harmful social media content identification method combining a large model with a dual-channel (two-way) memory.
Background
Social media carries a large amount of multi-modal information, such as memes (expression packages), and this information may be harmful, for example when a meme image and its accompanying text together convey harmful content. Identifying such content requires effective processing and combination of images and text.
The prior art does not exploit the text modeling capability of large language models. Moreover, existing cross-modal combination methods concatenate the multi-modal features or compute their outer product, which fails to meet the task's requirement for a joint understanding of the multi-modal information.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a multi-modal harmful social media content identification method combining a large model with a dual-channel memory. The proposed dual-channel memory network effectively aligns the multi-modal information in a memory semantic space, and the language modeling capability of the large model improves the recognition performance on multi-modal harmful content.
In order to solve the technical problems, the invention adopts the following technical scheme:
A multi-modal harmful social media content identification method combining a large model with a dual-channel memory takes a given image I and text T as input and outputs a label y indicating whether the multimodal input is harmful. The method specifically comprises the following steps:
Step 1: extract the image feature v of the image I using an image encoder;
Step 2: extract the text feature e of the text T using a text embedding module;
Step 3: model the image feature v and the text feature e with a dual-channel memory module to obtain the image vector h_v and the text vector h_t, respectively. The dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel encode the image feature and the text feature, respectively. The memory vectors are parameters of the dual-channel memory module that represent the memory semantic space, denoted M = {m_1, m_2, ..., m_N}, where m_i is the i-th memory vector;
The process by which the visual channel encodes the image feature comprises the following steps:
S31: compute weights over the memory vectors based on the image feature v to obtain the visual weight score α_i of the i-th memory vector m_i;
S32: apply α_i to the corresponding memory vectors and compute the weighted sum of all memory vectors to obtain the aligned image feature v̂;
S33: concatenate the aligned image feature v̂ and the image feature v to obtain the image vector h_v output by the visual channel.
The process by which the text channel encodes the text feature comprises the following steps:
S34: compute weights over the memory vectors based on the text feature e to obtain the text weight score β_i of the i-th memory vector m_i;
S35: apply β_i to the corresponding memory vectors and compute the weighted sum of all memory vectors to obtain the aligned text feature ê;
S36: concatenate the aligned text feature ê and the text feature e to obtain the text vector h_t output by the text channel.
Step 4: perform label prediction with the large model: input the image vector h_v and the text vector h_t into the large model to generate the label y.
Further, in step S31, the visual weight score α_i of the i-th memory vector m_i is computed as:
α_i = softmax_i(v · W_v · m_i),
where W_v is a trainable parameter matrix.
In step S32, α_i is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned image feature:
v̂ = Σ_{i=1}^{N} α_i · m_i.
In step S33, the aligned image feature v̂ and the image feature v are concatenated to obtain the image vector output by the visual channel:
h_v = [v̂ ; v],
where [· ; ·] denotes vector concatenation.
Further, in step S34, the text weight score β_i of the i-th memory vector m_i is computed as:
β_i = softmax_i(e · W_t · m_i),
where W_t is a trainable parameter matrix.
In step S35, β_i is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned text feature:
ê = Σ_{i=1}^{N} β_i · m_i.
In step S36, the aligned text feature ê and the text feature e are concatenated to obtain the text vector output by the text channel:
h_t = [ê ; e],
where [· ; ·] denotes vector concatenation.
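Collecting the computations of steps S31 through S36, and assuming d-dimensional features and memory vectors (the patent does not state the dimensions), the visual channel can be summarized as follows; the text channel is identical with v, W_v, α replaced by e, W_t, β:

```latex
% Assumed dimensions: v, e, m_i \in \mathbb{R}^{d}, \; W_v, W_t \in \mathbb{R}^{d \times d}.
\alpha_i = \frac{\exp\!\left(v^{\top} W_v\, m_i\right)}{\sum_{j=1}^{N} \exp\!\left(v^{\top} W_v\, m_j\right)},
\qquad
\hat{v} = \sum_{i=1}^{N} \alpha_i\, m_i,
\qquad
h_v = [\,\hat{v}\,;\,v\,] \in \mathbb{R}^{2d}.
```

Because h_v concatenates the aligned feature with the original feature, the channel output has twice the width of its input under these assumed dimensions.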
Compared with the prior art, the invention has the following beneficial technical effects:
The excellent text encoding and representation capability of the large model is exploited, improving the performance of multi-modal harmful content identification.
Through the dual-channel memory module, the model can compute weights for the different memory vectors according to the visual features. This weight assignment enables the model to align and fuse information more accurately. In addition, the visual channel and the text channel are processed by the same procedure, so both information sources are considered equally and effectively.
Drawings
FIG. 1 is a diagram of the structure of the model of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The model structure of the present invention is shown in Fig. 1. The task of the invention is, given an image I and a text T, to identify a label y indicating whether the multimodal input is harmful. The method specifically comprises the following steps:
Step 1: extract the image feature v of the image I using an image encoder. The image encoder of the present invention adopts a Vision Transformer encoder.
Step 2: extract the text feature e of the text T using the text embedding module. The text embedding module maps each word in the text to a vector using a matrix, where the vector corresponding to the i-th word in the vocabulary is the i-th row of the matrix. The vectors of all words are then averaged to obtain the text feature e.
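The text embedding module described above can be sketched as an embedding-matrix lookup followed by mean pooling. The vocabulary, words, and dimensions below are hypothetical placeholders, since the patent does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"this": 0, "meme": 1, "is": 2, "harmless": 3}  # hypothetical vocabulary
d = 8                                    # embedding dimension (assumed)
E = rng.normal(size=(len(vocab), d))     # row i = vector of the i-th word in the vocabulary

def text_feature(words):
    """Map each word to its row in E, then average over all words (mean pooling)."""
    ids = [vocab[w] for w in words]
    return E[ids].mean(axis=0)           # text feature e, shape (d,)

e = text_feature(["this", "meme", "is", "harmless"])
```

Mean pooling keeps the text feature at a fixed width d regardless of the text length, which lets the downstream text channel use a single trainable matrix.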
Step 3: model the image feature v and the text feature e with the dual-channel memory module to obtain the image vector h_v and the text vector h_t.
Step 4: perform label prediction with the large model: input the image vector h_v and the text vector h_t into the large model to generate the label y. The large model in the invention adopts Chinese-Alpaca.
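The patent does not disclose how h_v and h_t are fed into the large model. One common interface, shown here purely as an assumption, is to project each vector into the language model's embedding width and prepend the results as soft prefix tokens before the embedded prompt. All names, projection matrices, and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_lm = 8, 32                          # feature width / LM embedding width (assumed)
h_v = rng.normal(size=2 * d)             # image vector from the visual channel
h_t = rng.normal(size=2 * d)             # text vector from the text channel

# Hypothetical trainable projections into the LM embedding space.
P_v = rng.normal(size=(2 * d, d_lm))
P_t = rng.normal(size=(2 * d, d_lm))

prompt_embeds = rng.normal(size=(5, d_lm))   # stand-in for embedded prompt tokens

# Prepend the two projected vectors as soft prefix tokens of the LM input.
prefix = np.stack([h_v @ P_v, h_t @ P_t])    # shape (2, d_lm)
lm_input = np.concatenate([prefix, prompt_embeds], axis=0)
```

Under this sketch the language model then decodes the label y conditioned on the two prefix tokens; the actual conditioning mechanism used with Chinese-Alpaca is not specified in the patent.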
The dual-channel memory module involved in step 3 comprises a group of memory vectors and two independent channels, namely a visual channel and a text channel, which encode the image feature and the text feature, respectively. The memory vectors are parameters of the dual-channel memory module representing the memory semantic space, denoted M = {m_1, m_2, ..., m_N}, where m_i is the i-th memory vector.
The process by which the visual channel encodes the image feature comprises the following steps:
S31: compute weights over the memory vectors based on the image feature v to obtain the visual weight score α_i of the i-th memory vector m_i:
α_i = softmax_i(v · W_v · m_i),
where W_v is a trainable parameter matrix.
S32: apply α_i to the corresponding memory vectors and compute the weighted sum of all memory vectors to obtain the aligned image feature v̂:
v̂ = Σ_{i=1}^{N} α_i · m_i.
S33: concatenate the aligned image feature v̂ and the image feature v to obtain the image vector h_v output by the visual channel:
h_v = [v̂ ; v].
The output image vector h_v will be used by the large model.
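Steps S31 through S33 can be sketched as a single channel-encoding function; because the text channel (S34 through S36) runs the same procedure, the same function serves both channels. This is a minimal sketch assuming d-dimensional features, a bilinear memory score, and a softmax normalization; the function name and all dimensions are assumptions:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())              # subtract max for numerical stability
    return z / z.sum()

def encode_channel(feat, M, W):
    """Encode one channel of the dual-channel memory module.

    feat : feature vector, shape (d,), image feature v or text feature e
    M    : memory vectors, shape (N, d), the memory semantic space
    W    : trainable parameter matrix, shape (d, d)
    """
    scores = M @ (W @ feat)              # S31: one bilinear score per memory vector
    alpha = softmax(scores)              # weight distribution over the N memory vectors
    aligned = alpha @ M                  # S32: weighted sum of all memory vectors
    return np.concatenate([aligned, feat])   # S33: concatenation, shape (2d,)

rng = np.random.default_rng(2)
N, d = 16, 8
M = rng.normal(size=(N, d))              # memory semantic space (assumed sizes)
W_v = rng.normal(size=(d, d))
v = rng.normal(size=d)                   # image feature from the encoder
h_v = encode_channel(v, M, W_v)          # image vector fed to the large model
```

Reusing `encode_channel` for both channels mirrors the patent's point that the visual and text channels are processed by the same procedure, differing only in their inputs and trainable matrices.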
The process by which the text channel encodes the text feature comprises the following steps:
S34: compute weights over the memory vectors based on the text feature e to obtain the text weight score β_i of the i-th memory vector m_i:
β_i = softmax_i(e · W_t · m_i),
where W_t is a trainable parameter matrix.
S35: apply β_i to the corresponding memory vectors and compute the weighted sum of all memory vectors to obtain the aligned text feature ê:
ê = Σ_{i=1}^{N} β_i · m_i.
S36: concatenate the aligned text feature ê and the text feature e to obtain the text vector h_t output by the text channel:
h_t = [ê ; e].
The output text vector h_t will be used by the large model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted for clarity only. The specification should be taken as a whole, and the technical solutions in the embodiments may be suitably combined to form other embodiments that will be understood by those skilled in the art.
Claims (3)
1. A multi-modal harmful social media content identification method combining a large model with a dual-channel memory, which takes a given image I and text T as input and outputs a label y indicating whether the multimodal input is harmful, the method comprising the following steps:
Step 1: extracting the image feature v of the image I using an image encoder;
Step 2: extracting the text feature e of the text T using a text embedding module;
Step 3: modeling the image feature v and the text feature e with a dual-channel memory module to obtain the image vector h_v and the text vector h_t, respectively; the dual-channel memory module comprises N memory vectors, a visual channel and a text channel; the visual channel and the text channel encode the image feature and the text feature, respectively; the memory vectors are parameters of the dual-channel memory module representing the memory semantic space, denoted M = {m_1, m_2, ..., m_N}, where m_i is the i-th memory vector;
the process by which the visual channel encodes the image feature comprising the following steps:
S31: computing weights over the memory vectors based on the image feature v to obtain the visual weight score α_i of the i-th memory vector m_i:
α_i = softmax_i(v · W_v · m_i),
where W_v is a trainable parameter matrix;
S32: applying α_i to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned image feature v̂;
S33: concatenating the aligned image feature v̂ and the image feature v to obtain the image vector h_v output by the visual channel;
the process by which the text channel encodes the text feature comprising the following steps:
S34: computing weights over the memory vectors based on the text feature e to obtain the text weight score β_i of the i-th memory vector m_i:
β_i = softmax_i(e · W_t · m_i),
where W_t is a trainable parameter matrix;
S35: applying β_i to the corresponding memory vectors and computing the weighted sum of all memory vectors to obtain the aligned text feature ê;
S36: concatenating the aligned text feature ê and the text feature e to obtain the text vector h_t output by the text channel;
Step 4: performing label prediction with the large model: inputting the image vector h_v and the text vector h_t into the large model to generate the label y.
2. The multi-modal harmful social media content identification method combining a large model with a dual-channel memory according to claim 1, characterized in that:
in step S32, α_i is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned image feature:
v̂ = Σ_{i=1}^{N} α_i · m_i;
in step S33, the aligned image feature v̂ and the image feature v are concatenated to obtain the image vector output by the visual channel:
h_v = [v̂ ; v],
where [· ; ·] denotes vector concatenation.
3. The multi-modal harmful social media content identification method combining a large model with a dual-channel memory according to claim 1, characterized in that:
in step S35, β_i is applied to the corresponding memory vectors and the weighted sum of all memory vectors is computed to obtain the aligned text feature:
ê = Σ_{i=1}^{N} β_i · m_i;
in step S36, the aligned text feature ê and the text feature e are concatenated to obtain the text vector output by the text channel:
h_t = [ê ; e],
where [· ; ·] denotes vector concatenation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311339502.2A CN117077085B (en) | 2023-10-17 | 2023-10-17 | Multi-mode harmful social media content identification method combining large model with two-way memory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311339502.2A CN117077085B (en) | 2023-10-17 | 2023-10-17 | Multi-mode harmful social media content identification method combining large model with two-way memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117077085A CN117077085A (en) | 2023-11-17 |
CN117077085B true CN117077085B (en) | 2024-02-09 |
Family
ID=88704676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311339502.2A Active CN117077085B (en) | 2023-10-17 | 2023-10-17 | Multi-mode harmful social media content identification method combining large model with two-way memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117077085B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117291175B (en) * | 2023-11-27 | 2024-03-29 | 中国科学技术大学 | Method for detecting generated text based on statistical feature fusion of multiple large language models |
CN117542538A (en) * | 2024-01-10 | 2024-02-09 | 中国科学技术大学 | Medical multi-mode content analysis and generation method based on reinforcement learning |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814454A (en) * | 2020-07-10 | 2020-10-23 | 重庆大学 | Multi-modal network spoofing detection model on social network |
EP3754548A1 (en) * | 2019-06-17 | 2020-12-23 | Sap Se | A method for recognizing an object in an image using features vectors of an encoding neural network |
CN112395442A (en) * | 2020-10-12 | 2021-02-23 | 杭州电子科技大学 | Automatic identification and content filtering method for popular pictures on mobile internet |
CN114491289A (en) * | 2021-12-31 | 2022-05-13 | 南京信息工程大学 | Social content depression detection method of bidirectional gated convolutional network |
CN115964482A (en) * | 2022-05-24 | 2023-04-14 | 西北工业大学 | Multi-mode false news detection method based on user cognitive consistency reasoning |
CN116450819A (en) * | 2023-03-10 | 2023-07-18 | 西安交通大学 | Multi-mode emotion recognition method and system based on self-adaptive fusion |
CN116563854A (en) * | 2023-05-11 | 2023-08-08 | 中国联合网络通信集团有限公司 | Text recognition method, device, equipment and storage medium based on double channels |
CN116719930A (en) * | 2023-04-28 | 2023-09-08 | 西安工程大学 | Multi-mode emotion analysis method based on visual attention |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11055537B2 (en) * | 2016-04-26 | 2021-07-06 | Disney Enterprises, Inc. | Systems and methods for determining actions depicted in media contents based on attention weights of media content frames |
CN112036513B (en) * | 2020-11-04 | 2021-03-09 | 成都考拉悠然科技有限公司 | Image anomaly detection method based on memory-enhanced potential spatial autoregression |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3754548A1 (en) * | 2019-06-17 | 2020-12-23 | Sap Se | A method for recognizing an object in an image using features vectors of an encoding neural network |
CN111814454A (en) * | 2020-07-10 | 2020-10-23 | 重庆大学 | Multi-modal network spoofing detection model on social network |
CN112395442A (en) * | 2020-10-12 | 2021-02-23 | 杭州电子科技大学 | Automatic identification and content filtering method for popular pictures on mobile internet |
CN114491289A (en) * | 2021-12-31 | 2022-05-13 | 南京信息工程大学 | Social content depression detection method of bidirectional gated convolutional network |
CN115964482A (en) * | 2022-05-24 | 2023-04-14 | 西北工业大学 | Multi-mode false news detection method based on user cognitive consistency reasoning |
CN116450819A (en) * | 2023-03-10 | 2023-07-18 | 西安交通大学 | Multi-mode emotion recognition method and system based on self-adaptive fusion |
CN116719930A (en) * | 2023-04-28 | 2023-09-08 | 西安工程大学 | Multi-mode emotion analysis method based on visual attention |
CN116563854A (en) * | 2023-05-11 | 2023-08-08 | 中国联合网络通信集团有限公司 | Text recognition method, device, equipment and storage medium based on double channels |
Non-Patent Citations (3)
Title |
---|
A deep multi-modal neural network for informative Twitter content classification during emergencies; Abhinav Kumar et al.; 《https://doi.org/10.1007/s10479-020-03514-x》; 2022; full text. * |
Sequential Prediction of Social Media Popularity with Deep Temporal Context Networks; Bo Wu et al.; 《arXiv:1712.04443v1》; full text * |
Multi-modal Fake News Detection via Dual-branch Cue Deep Perception and Adaptive Collaborative Optimization; Zhong Shannan et al.; 《https://link.cnki.net/urlid/11.1826.TP.20230814.1352.002》; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN117077085A (en) | 2023-11-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||