CN116796251A - Poor website classification method, system and equipment based on image-text multi-mode - Google Patents
- Publication number: CN116796251A
- Application number: CN202311078357.7A
- Authority: CN (China)
- Prior art keywords: text, image, feature vector, model, website
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/22—Matching criteria, e.g. proximity measures
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
Abstract
The application relates to the technical field of network security, and in particular to an image-text multimodal method, system and device for classifying bad websites. Features are extracted from webpage screenshots and website meta titles and their similarity is computed; a CLIP model is optimized through an InfoNCE loss function and trained with ontology-based random sampling; the features of the image and text modalities are fused, and bad websites are classified according to the fused features, improving the accuracy and coverage of website classification. By adopting small-batch training sampling, a similar training effect is reached with only about 1% of the data; neither a large manually labeled training set nor a large TPU computing cluster is required, and no specific classification rules or dictionaries need to be designed for different languages or topics, which improves the efficiency and feasibility of website classification. Pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
Description
Technical Field
The application relates to the technical field of network security, and in particular to a bad website classification method, system and device based on image-text multimodality.
Background
The vigorous development of information technology and the mobile internet has made daily life more convenient, but it also makes it necessary to identify bad websites promptly and accurately so that users can be warned before they visit them.
Existing bad website identification methods fall into two categories. One matches text keywords of a website using matching techniques based on the site's textual content; the other builds a deep-learning classification model that identifies bad websites from single-modality information, either the site's text or its pictures. The single-modality approach considers only picture information or only text information, or directly extracts the text inside pictures with OCR techniques, so pictures without text information are handled insufficiently, even though information is carried in many forms: text, images, video, audio and so on. Existing bad website identification methods therefore suffer from low identification efficiency and frequent missed reports.
Multimodal learning builds a multimodal model that fuses information from several modalities, such as pictures and text, so that the network can learn the information of each modality, improving the learning capacity of the model and yielding more accurate results. The application therefore provides an image-text multimodal bad website classification method, system and device to solve the problems in the prior art.
Disclosure of Invention
Aiming at the problems of the existing bad website classification methods, namely insufficient handling of pictures without text information, low recognition efficiency and frequent missed reports, the application provides a bad website classification method, system and device based on image-text multimodality. Based on a pre-trained image-text model, a CLIP model extracts the information of the two modalities of the bad website, screenshot and text, obtaining the visual features and the text features of the bad website; the features of the image and text modalities are fused, and bad websites are classified according to the fused features. A good classification effect can be achieved with little training data and low hardware requirements.
In order to achieve the above object, the present application is realized by the following technical scheme:
A bad website classification method based on image-text multimodality comprises the following steps:
S1: acquiring a webpage screenshot and the website meta title, cleaning the data of the website meta title, and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model and converting it into an image feature vector I;
S3: encoding the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title through a Bert model, converting them into a text feature vector T with the same dimension as the image feature vector I, and carrying out L2 normalization on the image feature vector I and the text feature vector T;
S4: constructing a CLIP model, wherein each input to the CLIP model is a batch of triplets, each triplet (screenshot, category c, keyword w) generated by the combination of an image-text pair; computing the similarity of the image feature vector I and the text feature vector T, namely the image-text joint representation score s, and introducing the InfoNCE function to calculate the cross entropy loss;
S5: carrying out multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vector with the text feature vector, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the image feature vector generated by a bad website and the text feature vector generated by a normal website is increased;
S6: using the trained CLIP model for multimodal website classification, and classifying bad websites according to the semantic information aligned with the webpage screenshot.
As a preferred embodiment of the present application, in step S1 the data cleaning specifically comprises missing-value cleaning, format-and-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the range of missing values, removes unneeded fields and fills missing values; format-and-content cleaning removes unneeded characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content.
In a preferred embodiment of the present application, in step S2 the method of encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector I specifically comprises the following steps:
importing a pre-trained ResNet50 model as the image feature extractor through a deep learning framework such as PyTorch, TensorFlow or Keras;
removing the last fully connected layer of the ResNet50 model and taking the penultimate layer as the output layer;
preprocessing the webpage screenshot, the preprocessing comprising resizing the image to 224×224, normalizing the image pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector I.
As a preferred embodiment of the present application, in step S3 the method of converting into the text feature vector T specifically comprises the following steps:
inputting the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title into the pretrained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and sending it into the Transformer Encoder for encoding; the Transformer Encoder consists of several identical layers, each containing two sub-layers, a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, letting the Bert model capture the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; this is repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector T.
As a preferred embodiment of the present application, computing the similarity of the image feature vector I and the text feature vector T means: performing a dot product of the image feature vector I and the text feature vector T;
the method of computing the image-text joint representation score s_{i,j} specifically comprises the following steps:
taking the text feature vector T_i as the query and the image feature vectors as keys and values, computing, under the i-th description, the attention a_{i,j} of the text feature vector T_i with respect to image j, and thereby obtaining the image-text joint representation score s_{i,j}:
s_{i,j} = a_{i,j} = T_i · I_j.
As a preferred embodiment of the present application, the method of introducing the InfoNCE function to calculate the cross entropy loss specifically comprises:
given the text feature vector T_i, minimizing the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross entropy loss is called the picture-retrieval loss function, denoted L_I^(i):
L_I^(i) = -log( exp(s_{i,i}) / Σ_{j=1}^{M} exp(s_{i,j}) );
where s_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i with respect to its own corresponding image i; L_I^(i) is the picture-retrieval loss function under the i-th description; j denotes the j-th image I_j, and M is the total number of images;
similarly, the text-retrieval loss function L_T^(i) is defined:
L_T^(i) = -log( exp(s_{i,i}) / Σ_{j=1}^{M} exp(s_{j,i}) );
where L_T^(i) is the text-retrieval loss function under the i-th description;
the CLIP model is trained using the sum of the two losses, L = L_I + L_T.
As a preferred embodiment of the present application, in step S5 the method of carrying out multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data comprises:
adopting ontology-based random small-batch sampling, wherein the samples comprise N categories of bad websites and a corresponding number of normal websites; randomly drawing a small batch of bad-picture samples together with a corresponding number of samples of all other types to train the CLIP model, so that text semantics and image semantics are consistent at the level of the major categories; then randomly drawing small batches of data according to the keywords in the meta titles of different websites, and continuously refining the CLIP model parameters by distinguishing the different keywords under the same ontology, thereby achieving fine-grained optimization.
A bad website classification system based on image-text multimodality, the system comprising:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the data of the website meta title and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used to encode the webpage screenshot and convert it into an image feature vector; the Bert model unit is used to encode the Chinese category corresponding to the webpage screenshot and the keyword in the website meta title and convert them into a text feature vector with the same dimension as the image feature vector; and the L2 normalization unit is used to carry out L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used to construct the CLIP model and carry out multiple rounds of iteration and optimization with randomly sampled batches of positive and negative sample data, so that the CLIP model aligns the image feature vector with the text feature vector;
and the website classification output module is used to classify multimodal websites through the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
A bad website classification device based on image-text multimodality comprises:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the above bad website classification method based on image-text multimodality.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the bad website classification method based on image-text multimodality described above.
Compared with the prior art, the application classifies websites with an improved CLIP model, with the following beneficial effects: the text and image information obtained from the website content, and the correlation between them, are used to judge which category a website belongs to; the effect reaches the standard of the original model, and the accuracy and coverage of website classification are improved. Compared with traditional CLIP, which needs hundreds of millions of data, the small-batch training sampling scheme reaches a similar training effect with only about 1% of the data, requires neither a large manually labeled training set nor a large TPU computing cluster, and needs no classification rules or dictionaries designed for specific languages or topics, which improves the efficiency and feasibility of website classification; pictures that cannot be analyzed with OCR techniques can be analyzed and explicitly classified.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is an overall structure diagram of a CLIP model in an embodiment of the application;
FIG. 3 is a diagram of a CLIP model sample training step in an embodiment of the present application;
fig. 4 is a system configuration diagram of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which are obtained by a person skilled in the art based on the described embodiments of the application, fall within the scope of protection of the application.
Multimodal learning builds a multimodal model so that the system can learn the information of each modality from multiple modalities, improving the learning capacity of the model.
Multimodal characterization is the basis of multimodal tasks. According to how the multimodal representation is learned, current multimodal model structures fall into two types: one is Joint, also known as the single-tower structure; the other is Coordinated, i.e. the collaborative or twin-tower structure. The single-tower structure focuses on capturing the complementarity between the modalities: the features of the individual modalities are fused into a multimodal feature representation, which is then used to complete a prediction task, and the network is optimized for performance on that specific task. The twin-tower model does not seek fusion; instead it models the correlation between multiple (usually two) modalities, maps the multimodal data into a coordinated space and finds the collaboration relations between them, and the network is optimized for the similarity between the two modalities. The twin-tower structure suits downstream tasks that take only one modality as input, such as cross-modal retrieval and translation; the single-tower structure suits application scenarios that take several modalities as input, such as visual question answering and sentiment analysis.
The full English name of CLIP is Contrastive Language-Image Pre-training. It is a multimodal model based on contrastive learning, and its training data are text-image pairs: an image and its corresponding text description. Through contrastive learning, the model is expected to learn the matching relationship of text-image pairs. CLIP comprises two models, a text encoder and an image encoder; the cosine similarity between the features of the two modalities is computed so that the similarity of the N matched image-text pairs is maximized and the similarity of unmatched pairs minimized, and the model parameters are optimized and adjusted through a symmetric cross entropy loss.
CLIP can learn a generic visual-language representation from large amounts of unlabeled text and images, and can therefore process website content in multiple languages, topics and media types; websites can also be categorized according to natural-language queries entered by users, providing more flexible and accurate classification results. However, it faces the problem of very large data requirements: training a model from scratch takes 400 million image-text pairs and 200 GPUs to reach the reported model effect.
The method uses an improved CLIP model: the image encoder adopts a pre-trained ResNet50 model and the text encoder a pre-trained Bert model. Features are extracted from the webpage screenshot and the website meta title and their similarity is computed; the CLIP model is optimized through an InfoNCE loss function and trained with ontology-based random sampling, realizing multimodal website classification.
Referring to FIGS. 1-3, an embodiment of the present application provides an image-text multimodal method for classifying bad websites based on an optimized CLIP model, which specifically comprises the following steps:
S1: acquiring a webpage screenshot and the website meta title, cleaning the data of the website meta title, and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
In one embodiment, the data cleaning specifically comprises missing-value cleaning, format-and-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the range of missing values, removes unneeded fields and fills missing values; format-and-content cleaning removes unneeded characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content. Invalid characters and missing fields are corrected by the data cleaning.
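The title-validity rule above (more than 4 Chinese characters, or a Chinese-character ratio above 50%) can be sketched in Python; the function names and the sample structure are illustrative, not from the patent:

```python
import re

HAN = re.compile(r'[\u4e00-\u9fff]')  # CJK Unified Ideographs range

def title_is_valid(title: str) -> bool:
    """Keep a title if it has more than 4 Chinese characters,
    or Chinese characters make up over 50% of its length."""
    if not title:
        return False
    han = HAN.findall(title)
    return len(han) > 4 or len(han) / len(title) > 0.5

def clean_titles(samples):
    """Drop samples whose meta title is missing or invalid."""
    return [s for s in samples if title_is_valid(s.get("title", ""))]
```

Missing-value and format cleaning would run before this filter; only the validity check is shown here.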
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model and converting it into an image feature vector I;
ResNet50 is a deep residual network that can be used for image classification, object detection and semantic segmentation tasks. Its main feature is the use of residual connections to solve the vanishing-gradient and degradation problems of deep networks. The idea of a residual connection is to add a skip connection between certain layers of the network so that the input can be passed directly to the output, preserving the input information and enhancing the expressive power of the network.
In a specific embodiment, the method for extracting the image feature vector by using the ResNet50 model specifically comprises the following steps:
The pre-trained ResNet50 model is imported; it can be loaded through the APIs provided by deep learning frameworks such as PyTorch, TensorFlow or Keras, or the model file can be downloaded from the Internet. The pre-trained ResNet50 model has already learned generic features on a large image dataset (e.g. ImageNet) and can be used as an image feature extractor.
The last fully connected layer of the ResNet50 model is removed. That layer serves the classification task and outputs a probability distribution over categories, which is not needed here; the output of the penultimate layer is used instead, a 2048-dimensional vector that represents high-level semantic features of the image.
The webpage screenshot is preprocessed: the image is resized to 224×224, its pixel values are normalized to between 0 and 1, the training-set mean is subtracted and the result divided by the standard deviation. These operations ensure that the format and distribution of the image are consistent with the data used in model training and improve the effect of feature extraction.
The preprocessed webpage screenshot is input into the ResNet50 model, and the output of the penultimate layer is taken as the image feature vector I. This vector can be used for subsequent image retrieval, clustering, classification and so on.
S3: encoding the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title through the Bert model, converting them into a text feature vector T with the same dimension as the image feature vector I, and carrying out L2 normalization on the image feature vector I and the text feature vector T;
meta is the keyword content under the website title, and the label of a sample is its Chinese category c.
The Bert model is a pre-trained language model based on the Transformer architecture that can encode text into a vector representation. Specifically, the Bert model first embeds the input text sequence and then encodes it through multiple layers of the Transformer Encoder.
In a specific embodiment, the method for extracting the text feature vector through the Bert model specifically comprises the following steps:
The Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title are input into the pretrained Bert model as a text sequence and embedded into a fixed-length vector space; this is realized through a technique called Word Embedding, which maps each word to a low-dimensional real vector so that semantically similar words are also close to each other in the vector space.
Next, the input text sequence is sent into the Transformer Encoder for encoding; the Transformer Encoder consists of several identical layers, each containing two sub-layers, a Multi-Head Self-Attention mechanism and a Feed-Forward Neural Network;
In the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, letting the Bert model better capture the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; this is repeated until the whole text sequence is converted into a fixed-length vector representation;
Finally, the Bert model outputs this vector representation as the text feature vector T.
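The attention-and-weighted-sum step described above can be sketched as a single self-attention head in numpy. Real Bert uses learned query/key/value projections, multiple heads and the feed-forward sub-layer; the identity projections here are a simplification for illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray) -> np.ndarray:
    """One self-attention head over a sequence x of shape (seq_len, d):
    every token attends to every other token, and the output is an
    attention-weighted sum of the value vectors."""
    q, k, v = x, x, x                           # identity projections (simplified)
    scores = q @ k.T / np.sqrt(x.shape[-1])     # (seq_len, seq_len) scaled dot products
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v                          # weighted sum -> new representation
```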
S4: constructing the CLIP model, wherein each input to the CLIP model is a batch of triplets, each triplet (screenshot, category c, keyword w) generated by the combination of an image-text pair; computing the similarity of the image feature vector I and the text feature vector T, namely the image-text joint representation score s, and introducing the InfoNCE function to calculate the cross entropy loss;
Computing the similarity of the image feature vector I and the text feature vector T means: performing a dot product of the image feature vector I and the text feature vector T;
The method of computing the image-text joint representation score s_{i,j} specifically comprises the following steps:
the text feature vector T_i is taken as the query and the image feature vectors as keys and values; under the i-th description, the attention a_{i,j} of the text feature vector T_i with respect to image j is computed, giving the image-text joint representation score s_{i,j}:
s_{i,j} = a_{i,j} = T_i · I_j.
The method for introducing the InfoNCE function to compute the cross-entropy loss specifically comprises the following steps:
given a text feature vector T_i, minimize the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross-entropy loss is called the picture retrieval loss function, denoted L_pic^{(i)}:
$$L_{pic}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{i,j})}$$
where S_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i relative to its own corresponding image i; L_pic^{(i)} is the picture retrieval loss function under the i-th description; j denotes the j-th image I_j; N is the total number of images;
to minimize the picture retrieval loss, S_{i,i} should be maximized and S_{i,j} (j != i) minimized. Similarly, define the text retrieval loss function L_txt^{(i)}:
$$L_{txt}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{j,i})}$$
where L_txt^{(i)} is the text retrieval loss function under the i-th description;
its meaning is: the image feature vector I_i is used to retrieve the correct text feature vector T_i; the CLIP model is trained using the sum of the two losses L_pic^{(i)} and L_txt^{(i)}.
S5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the feature vectors generated by bad websites and those generated by normal websites is enlarged;
the method for performing multi-round iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples comprises the following steps:
adopting an ontology-based random mini-batch sampling mode, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extract a mini-batch of bad-picture samples together with a corresponding number of samples from all other types, for example 5000 gambling samples and 5000 samples of other types, to train the CLIP model, so that text semantics and image semantics become consistent at the coarse-class level;
on this basis, within the same gambling class, randomly extract mini-batches of data according to keywords in the meta titles of different websites; under the same gambling class there may be keywords carrying different information, such as the casino brand "Xin Pujing" and lottery-related terms, and the CLIP model parameters are continuously refined through the differences between keywords under the same ontology type, for example by randomly extracting 5000 samples containing "Xin Pujing" and 5000 samples containing "lottery"; this achieves finer-grained optimization, aligns more fine-grained picture information, and yields a better small-sample training effect.
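The two-stage, ontology-based sampling can be sketched in plain Python; the sample schema (dicts with 'class' and 'keyword' fields) and the counts are assumptions for illustration:

```python
import random

def sample_contrastive_batch(pool, field, value, k, seed=None):
    """Draw k samples whose metadata `field` equals `value` (positives) and
    k samples where it differs (the contrast set)."""
    rng = random.Random(seed)
    pos = [s for s in pool if s[field] == value]
    neg = [s for s in pool if s[field] != value]
    return rng.sample(pos, min(k, len(pos))) + rng.sample(neg, min(k, len(neg)))

# Stage 1 (coarse): sample by ontology class, e.g. gambling vs. all other types.
# Stage 2 (fine): within one class, sample by meta-title keyword to refine
# the model on keyword differences under the same ontology type.
```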
S6: using the trained CLIP model for multi-modal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshot.
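At inference time, classification reduces to comparing the screenshot embedding against a text embedding per candidate category and taking the best match; a sketch with made-up vectors and category names:

```python
import numpy as np

def classify_screenshot(image_vec, class_text_vecs, class_names):
    """Return the category whose text feature vector is most similar to the
    screenshot's image feature vector (all vectors assumed L2-normalized)."""
    scores = class_text_vecs @ image_vec  # cosine similarity per category
    return class_names[int(np.argmax(scores))]
```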
As shown in fig. 4, another embodiment of the present application provides a bad website classification system based on image-text multi-modality, which specifically includes:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the website meta title data and filtering samples with valid title content as the training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector; the Bert model unit is used for encoding the Chinese category corresponding to the webpage screenshot and the keywords in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector; the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used for constructing a CLIP model and performing multi-round iteration and optimization on it with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors;
and the website classification output module is used for performing multi-modal website classification with the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
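The data cleaning module's validity rule (Chinese-character length of at least 4, or a Chinese-character share above 50%) can be sketched as follows; measuring the share against the full title length is an assumption:

```python
import re

CJK = re.compile(r'[\u4e00-\u9fff]')  # basic CJK Unified Ideographs block

def is_valid_title(title: str) -> bool:
    """Keep a meta title for training when it contains at least 4 Chinese
    characters, or Chinese characters make up more than half of it."""
    if not title:
        return False
    n_cjk = len(CJK.findall(title))
    return n_cjk >= 4 or n_cjk / len(title) > 0.5
```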
The embodiment further provides a bad website classification device based on image-text multi-modality, comprising:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the above bad website classification method based on image-text multi-modality.
The embodiment also provides a readable storage medium, which may be a read-only memory, a magnetic disk or an optical disk, etc., on which a computer program is stored; when executed by a processor, the computer program implements the bad website classification method based on image-text multi-modality described above.
In summary, the text and image information obtained from website content, together with the correlation between them, is used to judge which category a website belongs to; the effect reaches the standard of the original model while improving the accuracy and coverage of website classification. Compared with traditional CLIP training, which requires hundred-million-scale data, the mini-batch training sampling mode adopted here achieves a similar training effect with only about 1% of the data, without manually labeling large amounts of training data or large TPU clusters, and without designing specific classification rules or dictionaries for different languages or topics, which improves the efficiency and feasibility of website classification; pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method description in a flowchart or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed in a substantially simultaneous manner or in an opposite order from that shown or discussed, including in accordance with the functions that are involved.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (10)
1. The bad website classification method based on the image-text multi-mode is characterized by comprising the following steps:
s1: acquiring a webpage screenshot and a website meta title, cleaning the website meta title data, and filtering samples with valid title content as a training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
s2: compressing and resizing the webpage screenshot, encoding it through a pre-trained ResNet50 model, and converting it into an image feature vector I;
S3: chinese type corresponding to webpage screenshot through Bert modelAnd keywords in website meta titlewCoding, converting to +.>Text feature vectors with the same dimensions +.>And +.>And the text feature vector->Carrying out L2 normalization treatment;
s4: constructing a CLIP model, where each input to the CLIP model is a batch of triplets, each triplet (c, w, screenshot) generated by combining image-text pairs; computing the similarity between the image feature vector I and the text feature vector T, obtaining the image-text joint representation score S, and introducing the InfoNCE function to compute the cross-entropy loss;
s5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the feature vectors generated by bad websites and those generated by normal websites is enlarged;
s6: using the trained CLIP model for multi-modal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshot.
2. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S1, the data cleaning specifically comprises: missing-value cleaning, format-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the missing-value range, removes unnecessary fields and fills missing values; format-content cleaning removes unnecessary characters; and logic-error cleaning performs deduplication, removes unreasonable values and corrects contradictory content.
3. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S2, the method for encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector I specifically comprises the following steps:
importing a pre-trained ResNet50 model as the image feature extractor through a deep learning framework, wherein the deep learning framework comprises PyTorch, TensorFlow and Keras;
removing the last full-connection layer of the ResNet50 model, and taking the penultimate layer as an output layer;
preprocessing the webpage screenshot, including resizing the image to 224 × 224, normalizing the pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector I.
4. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S3, the method for converting into the text feature vector T specifically comprises the following steps:
inputting the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title into a pre-trained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and feeding it into the Transformer encoder for encoding; the Transformer encoder consists of a plurality of identical layers, each layer comprising two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, so that the Bert model captures the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; these operations are repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector T.
5. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein computing the similarity between the image feature vector I and the text feature vector T means: performing a dot product between the image feature vector I and the text feature vector T;
the method for computing the image-text joint representation score S_{i,j} specifically comprises the following steps:
with the text feature vector T_i as the query and the image feature vectors I_j as keys and values, compute, under the i-th description, the attention expression a_{i,j} of the text feature vector T_i with respect to image j, which gives the image-text joint representation score S_{i,j}:
$$S_{i,j} = a_{i,j} = T_i \cdot I_j$$
6. The method for classifying bad websites based on image-text multi-modality according to claim 5, wherein the method for introducing the InfoNCE function to compute the cross-entropy loss specifically comprises:
given a text feature vector T_i, minimize the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross-entropy loss is called the picture retrieval loss function, denoted L_pic^{(i)}:
$$L_{pic}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{i,j})}$$
where S_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i relative to its own corresponding image i; L_pic^{(i)} is the picture retrieval loss function under the i-th description; j denotes the j-th image I_j; N is the total number of images;
similarly, a text retrieval loss function L_txt^{(i)} is defined:
$$L_{txt}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{j,i})}$$
where L_txt^{(i)} is the text retrieval loss function under the i-th description;
the image feature vector I_i is used to retrieve the correct text feature vector T_i; the CLIP model is trained using the sum of the two losses L_pic^{(i)} and L_txt^{(i)}.
7. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S5, the method for performing multi-round iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples comprises:
adopting an ontology-based random mini-batch sampling mode, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extracting a mini-batch of bad-picture samples and a corresponding number of samples from all other types to train the CLIP model, so that text semantics and image semantics become consistent at the coarse-class level; then randomly extracting mini-batches of data according to keywords in the meta titles of different websites, and continuously refining the CLIP model parameters through the differences between keywords under the same ontology type, thereby achieving fine-grained optimization.
8. A system based on the bad website classification method based on image-text multi-modality according to any one of claims 1 to 7, characterized in that the system comprises:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the website meta title data and filtering samples with valid title content as the training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector; the Bert model unit is used for encoding the Chinese category corresponding to the webpage screenshot and the keywords in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector; the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used for constructing a CLIP model and performing multi-round iteration and optimization on it with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors;
and the website classification output module is used for performing multi-modal website classification with the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
9. Poor website classification equipment based on picture and text multimode is characterized by comprising:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the bad website classification method based on image-text multi-modality according to any one of claims 1-7.
10. A readable storage medium, characterized in that a computer program is stored thereon, and when executed by a processor, the computer program implements the bad website classification method based on image-text multi-modality according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311078357.7A CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311078357.7A CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116796251A true CN116796251A (en) | 2023-09-22 |
Family
ID=88046827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311078357.7A Pending CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116796251A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117198514A (en) * | 2023-11-08 | 2023-12-08 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117235532A (en) * | 2023-11-09 | 2023-12-15 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117435739A (en) * | 2023-12-21 | 2024-01-23 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635835A (en) * | 2018-11-08 | 2019-04-16 | 深圳蓝韵医学影像有限公司 | A kind of breast lesion method for detecting area based on deep learning and transfer learning |
CN110020335A (en) * | 2017-07-28 | 2019-07-16 | 北京搜狗科技发展有限公司 | The treating method and apparatus of collection |
CN110121118A (en) * | 2019-06-17 | 2019-08-13 | 腾讯科技(深圳)有限公司 | Video clip localization method, device, computer equipment and storage medium |
CN114743630A (en) * | 2022-04-01 | 2022-07-12 | 杭州电子科技大学 | Medical report generation method based on cross-modal contrast learning |
CN114862811A (en) * | 2022-05-19 | 2022-08-05 | 湖南大学 | Defect detection method based on variational automatic encoder |
CN115222845A (en) * | 2022-08-01 | 2022-10-21 | 北京元亦科技有限公司 | Method and device for generating style font picture, electronic equipment and medium |
CN115374325A (en) * | 2022-05-31 | 2022-11-22 | 国家计算机网络与信息安全管理中心 | Website classification method and device, classification equipment and storage medium |
CN115563342A (en) * | 2022-10-19 | 2023-01-03 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system, equipment and storage medium for video theme retrieval |
CN115659175A (en) * | 2022-10-13 | 2023-01-31 | 国网辽宁省电力有限公司信息通信分公司 | Multi-mode data analysis method, device and medium for micro-service resources |
CN115661594A (en) * | 2022-10-19 | 2023-01-31 | 海南港航控股有限公司 | Image-text multi-mode feature representation method and system based on alignment and fusion |
CN116206239A (en) * | 2023-01-30 | 2023-06-02 | 北京达佳互联信息技术有限公司 | Video feature extraction network training method and device, electronic equipment and storage medium |
- 2023-08-25 CN CN202311078357.7A patent/CN116796251A/en active Pending
Non-Patent Citations (2)
Title |
---|
ALEC RADFORD等: ""Learning Transferable Visual Models From Natural Language Supervision"", 《ARXIV》, pages 1 - 48 * |
TEJAS SRINIVASAN等: ""Curriculum Learning for Data-Efficient Vision-Language Alignment"", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, pages 5619 - 5620 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117198514A (en) * | 2023-11-08 | 2023-12-08 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117198514B (en) * | 2023-11-08 | 2024-01-30 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117235532A (en) * | 2023-11-09 | 2023-12-15 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117235532B (en) * | 2023-11-09 | 2024-01-26 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117435739A (en) * | 2023-12-21 | 2024-01-23 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
CN117435739B (en) * | 2023-12-21 | 2024-03-15 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109165563B (en) | Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product | |
CN116796251A (en) | Poor website classification method, system and equipment based on image-text multi-mode | |
CN111985239A (en) | Entity identification method and device, electronic equipment and storage medium | |
CN113672693B (en) | Label recommendation method of online question-answering platform based on knowledge graph and label association | |
CN113657115B (en) | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN114691864A (en) | Text classification model training method and device and text classification method and device | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN111581964A (en) | Theme analysis method for Chinese ancient books | |
CN115408488A (en) | Segmentation method and system for novel scene text | |
CN115687571A (en) | Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash | |
Yang et al. | A comparative study of language transformers for video question answering | |
Krishnan et al. | Bringing semantics into word image representation | |
CN117540039A (en) | Data retrieval method based on unsupervised cross-modal hash algorithm | |
CN117033558A (en) | BERT-WWM and multi-feature fused film evaluation emotion analysis method | |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions | |
CN117033626A (en) | Text auditing method, device, equipment and storage medium | |
CN114020871B (en) | Multi-mode social media emotion analysis method based on feature fusion | |
CN114386412B (en) | Multi-mode named entity recognition method based on uncertainty perception | |
CN116451699A (en) | Segment extraction type machine reading and understanding method based on attention mechanism | |
El-Gayar | Automatic generation of image caption based on semantic relation using deep visual attention prediction | |
CN116702094B (en) | Group application preference feature representation method | |
Liu | IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images | |
CN117875266B (en) | Training method and device for text coding model, electronic equipment and storage medium | |
Gao et al. | Team gzw at Factify 2: Multimodal Attention and Fusion Networks for Multi-Modal Fact Verification. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |