CN116796251A - Poor website classification method, system and equipment based on image-text multi-mode - Google Patents
- Publication number: CN116796251A
- Application number: CN202311078357.7A
- Authority: CN (China)
- Prior art keywords: text, image, feature vector, model, website
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/22—Matching criteria, e.g. proximity measures
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
Abstract
The application relates to the technical field of network security, and in particular to an image-text multimodal method, system and device for classifying bad websites. Features are extracted from webpage screenshots and website meta titles and their similarity is computed; a CLIP model is optimized through an InfoNCE loss function and trained with ontology-based random sampling; the features of the image and text modalities are fused, and bad websites are classified according to the fused features, improving the accuracy and coverage of website classification. By adopting small-batch training sampling, a similar training effect is reached with only about 1% of the data; neither a large manually labeled training set nor a large TPU computing cluster is required, and no specific classification rules or dictionaries need to be designed for different languages or topics, which improves the efficiency and feasibility of website classification. Pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
Description
Technical Field
The application relates to the technical field of network security, and in particular to a bad website classification method, system and device based on image-text multimodality.
Background
The vigorous development of information technology and the mobile internet has made daily life more convenient, but it also makes it necessary to identify bad websites promptly and accurately so that users can be warned before they visit them.
Existing bad website identification methods fall into two categories. One matches text keywords of a website using matching techniques based on the site's textual content; the other builds a deep-learning classification model that identifies bad websites from single-modality information, either the site's text or its pictures. The single-modality approach considers only picture information or only text information, or directly extracts the text inside pictures with OCR techniques, so pictures without text information are handled insufficiently, even though information is carried in many forms: text, images, video, audio and so on. Existing bad website identification methods therefore suffer from low identification efficiency and frequent missed reports.
Multimodal learning builds a multimodal model that fuses information from several modalities, such as pictures and text, so that the network can learn the information of each modality, improving the learning capacity of the model and yielding more accurate results. The application therefore provides an image-text multimodal bad website classification method, system and device to solve the problems in the prior art.
Disclosure of Invention
Aiming at the problems of the existing bad website classification methods, namely insufficient handling of pictures without text information, low recognition efficiency and frequent missed reports, the application provides a bad website classification method, system and device based on image-text multimodality. Based on a pre-trained image-text model, a CLIP model extracts the information of the two modalities of the bad website, screenshot and text, obtaining the visual features and the text features of the bad website; the features of the image and text modalities are fused, and bad websites are classified according to the fused features. A good classification effect can be achieved with little training data and low hardware requirements.
In order to achieve the above object, the present application is realized by the following technical scheme:
A bad website classification method based on image-text multimodality comprises the following steps:
S1: acquiring a webpage screenshot and the website meta title, cleaning the data of the website meta title, and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model and converting it into an image feature vector I;
S3: encoding the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title through a Bert model, converting them into a text feature vector T with the same dimension as the image feature vector I, and carrying out L2 normalization on the image feature vector I and the text feature vector T;
S4: constructing a CLIP model, wherein each input to the CLIP model is a batch of triplets, each triplet (screenshot, category c, keyword w) generated by the combination of an image-text pair; computing the similarity of the image feature vector I and the text feature vector T, namely the image-text joint representation score s, and introducing the InfoNCE function to calculate the cross entropy loss;
S5: carrying out multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vector with the text feature vector, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the image feature vector generated by a bad website and the text feature vector generated by a normal website is increased;
S6: using the trained CLIP model for multimodal website classification, and classifying bad websites according to the semantic information aligned with the webpage screenshot.
As a preferred embodiment of the present application, in step S1 the data cleaning specifically comprises missing-value cleaning, format-and-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the range of missing values, removes unneeded fields and fills missing values; format-and-content cleaning removes unneeded characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content.
In a preferred embodiment of the present application, in step S2 the method of encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector I specifically comprises the following steps:
importing a pre-trained ResNet50 model as the image feature extractor through a deep learning framework such as PyTorch, TensorFlow or Keras;
removing the last fully connected layer of the ResNet50 model and taking the penultimate layer as the output layer;
preprocessing the webpage screenshot, the preprocessing comprising resizing the image to 224×224, normalizing the image pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector I.
As a preferred embodiment of the present application, in step S3 the method of converting into the text feature vector T specifically comprises the following steps:
inputting the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title into the pretrained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and sending it into the Transformer Encoder for encoding; the Transformer Encoder consists of several identical layers, each containing two sub-layers, a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, letting the Bert model capture the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; this is repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector T.
As a preferred embodiment of the present application, computing the similarity of the image feature vector I and the text feature vector T means: performing a dot product of the image feature vector I and the text feature vector T;
the method of computing the image-text joint representation score s_{i,j} specifically comprises the following steps:
taking the text feature vector T_i as the query and the image feature vectors as keys and values, computing, under the i-th description, the attention a_{i,j} of the text feature vector T_i with respect to image j, and thereby obtaining the image-text joint representation score s_{i,j}:
s_{i,j} = a_{i,j} = T_i · I_j.
As a preferred embodiment of the present application, the method of introducing the InfoNCE function to calculate the cross entropy loss specifically comprises:
given the text feature vector T_i, minimizing the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross entropy loss is called the picture-retrieval loss function, denoted L_I^(i):
L_I^(i) = -log( exp(s_{i,i}) / Σ_{j=1}^{M} exp(s_{i,j}) );
where s_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i with respect to its own corresponding image i; L_I^(i) is the picture-retrieval loss function under the i-th description; j denotes the j-th image I_j, and M is the total number of images;
similarly, the text-retrieval loss function L_T^(i) is defined:
L_T^(i) = -log( exp(s_{i,i}) / Σ_{j=1}^{M} exp(s_{j,i}) );
where L_T^(i) is the text-retrieval loss function under the i-th description;
the CLIP model is trained using the sum of the two losses, L = L_I + L_T.
As a preferred embodiment of the present application, in step S5 the method of carrying out multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data comprises:
adopting ontology-based random small-batch sampling, wherein the samples comprise N categories of bad websites and a corresponding number of normal websites; randomly drawing a small batch of bad-picture samples together with a corresponding number of samples of all other types to train the CLIP model, so that text semantics and image semantics are consistent at the level of the major categories; then randomly drawing small batches of data according to the keywords in the meta titles of different websites, and continuously refining the CLIP model parameters by distinguishing the different keywords under the same ontology, thereby achieving fine-grained optimization.
A bad website classification system based on image-text multimodality, the system comprising:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the data of the website meta title and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used to encode the webpage screenshot and convert it into an image feature vector; the Bert model unit is used to encode the Chinese category corresponding to the webpage screenshot and the keyword in the website meta title and convert them into a text feature vector with the same dimension as the image feature vector; and the L2 normalization unit is used to carry out L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used to construct the CLIP model and carry out multiple rounds of iteration and optimization with randomly sampled batches of positive and negative sample data, so that the CLIP model aligns the image feature vector with the text feature vector;
and the website classification output module is used to classify multimodal websites through the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
A bad website classification device based on image-text multimodality comprises:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the above bad website classification method based on image-text multimodality.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the bad website classification method based on image-text multimodality described above.
Compared with the prior art, the application classifies websites with an improved CLIP model, with the following beneficial effects: the text and image information obtained from the website content, and the correlation between them, are used to judge which category a website belongs to; the effect reaches the standard of the original model, and the accuracy and coverage of website classification are improved. Compared with traditional CLIP, which needs hundreds of millions of data, the small-batch training sampling scheme reaches a similar training effect with only about 1% of the data, requires neither a large manually labeled training set nor a large TPU computing cluster, and needs no classification rules or dictionaries designed for specific languages or topics, which improves the efficiency and feasibility of website classification; pictures that cannot be analyzed with OCR techniques can be analyzed and explicitly classified.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is an overall structure diagram of a CLIP model in an embodiment of the application;
FIG. 3 is a diagram of a CLIP model sample training step in an embodiment of the present application;
fig. 4 is a system configuration diagram of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which are obtained by a person skilled in the art based on the described embodiments of the application, fall within the scope of protection of the application.
Multimodal learning builds a multimodal model so that the system can learn the information of each modality from multiple modalities, improving the learning capacity of the model.
Multimodal characterization is the basis of multimodal tasks. According to how the multimodal representation is learned, current multimodal model structures fall into two types: one is Joint, also known as the single-tower structure; the other is Coordinated, i.e. the collaborative or twin-tower structure. The single-tower structure focuses on capturing the complementarity between the modalities: the features of the individual modalities are fused into a multimodal feature representation, which is then used to complete a prediction task, and the network is optimized for performance on that specific task. The twin-tower model does not seek fusion; instead it models the correlation between multiple (usually two) modalities, maps the multimodal data into a coordinated space and finds the collaboration relations between them, and the network is optimized for the similarity between the two modalities. The twin-tower structure suits downstream tasks that take only one modality as input, such as cross-modal retrieval and translation; the single-tower structure suits application scenarios that take several modalities as input, such as visual question answering and sentiment analysis.
The full English name of CLIP is Contrastive Language-Image Pre-training. It is a multimodal model based on contrastive learning, and its training data are text-image pairs: an image and its corresponding text description. Through contrastive learning, the model is expected to learn the matching relationship of text-image pairs. CLIP comprises two models, a text encoder and an image encoder; the cosine similarity between the features of the two modalities is computed so that the similarity of the N matched image-text pairs is maximized and the similarity of unmatched pairs minimized, and the model parameters are optimized and adjusted through a symmetric cross entropy loss.
CLIP can learn a generic visual-language representation from large amounts of unlabeled text and images, and can therefore process website content in multiple languages, topics and media types; websites can also be categorized according to natural-language queries entered by users, providing more flexible and accurate classification results. However, it faces the problem of very large data requirements: training a model from scratch takes 400 million image-text pairs and 200 GPUs to reach the reported model effect.
The method uses an improved CLIP model: the image encoder adopts a pre-trained ResNet50 model and the text encoder a pre-trained Bert model. Features are extracted from the webpage screenshot and the website meta title and their similarity is computed; the CLIP model is optimized through an InfoNCE loss function and trained with ontology-based random sampling, realizing multimodal website classification.
Referring to FIGS. 1-3, an embodiment of the present application provides an image-text multimodal method for classifying bad websites based on an optimized CLIP model, which specifically comprises the following steps:
S1: acquiring a webpage screenshot and the website meta title, cleaning the data of the website meta title, and keeping the samples whose title content is valid as the training set; valid title content means the title contains more than 4 Chinese characters, or Chinese characters account for more than 50% of it;
In one embodiment, the data cleaning specifically comprises missing-value cleaning, format-and-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the range of missing values, removes unneeded fields and fills missing values; format-and-content cleaning removes unneeded characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content. Invalid characters and missing fields are corrected by the data cleaning.
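The title-validity rule above (more than 4 Chinese characters, or a Chinese-character ratio above 50%) can be sketched in Python; the function names and the sample structure are illustrative, not from the patent:

```python
import re

HAN = re.compile(r'[\u4e00-\u9fff]')  # CJK Unified Ideographs range

def title_is_valid(title: str) -> bool:
    """Keep a title if it has more than 4 Chinese characters,
    or Chinese characters make up over 50% of its length."""
    if not title:
        return False
    han = HAN.findall(title)
    return len(han) > 4 or len(han) / len(title) > 0.5

def clean_titles(samples):
    """Drop samples whose meta title is missing or invalid."""
    return [s for s in samples if title_is_valid(s.get("title", ""))]
```

Missing-value and format cleaning would run before this filter; only the validity check is shown here.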
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model and converting it into an image feature vector I;
ResNet50 is a deep residual network that can be used for image classification, object detection and semantic segmentation tasks. Its main feature is the use of residual connections to solve the vanishing-gradient and degradation problems of deep networks. The idea of a residual connection is to add a skip connection between certain layers of the network so that the input can be passed directly to the output, preserving the input information and enhancing the expressive power of the network.
In a specific embodiment, the method for extracting the image feature vector by using the ResNet50 model specifically comprises the following steps:
The pre-trained ResNet50 model is imported; it can be loaded through the APIs provided by deep learning frameworks such as PyTorch, TensorFlow or Keras, or the model file can be downloaded from the Internet. The pre-trained ResNet50 model has already learned generic features on a large image dataset (e.g. ImageNet) and can be used as an image feature extractor.
The last fully connected layer of the ResNet50 model is removed. That layer serves the classification task and outputs a probability distribution over categories, which is not needed here; the output of the penultimate layer is used instead, a 2048-dimensional vector that represents high-level semantic features of the image.
The webpage screenshot is preprocessed: the image is resized to 224×224, its pixel values are normalized to between 0 and 1, the training-set mean is subtracted and the result divided by the standard deviation. These operations ensure that the format and distribution of the image are consistent with the data used in model training and improve the effect of feature extraction.
The preprocessed webpage screenshot is input into the ResNet50 model, and the output of the penultimate layer is taken as the image feature vector I. This vector can be used for subsequent image retrieval, clustering, classification and so on.
S3: encoding the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title through the Bert model, converting them into a text feature vector T with the same dimension as the image feature vector I, and carrying out L2 normalization on the image feature vector I and the text feature vector T;
meta is the keyword content under the website title, and the label of a sample is its Chinese category c.
The Bert model is a pre-trained language model based on the Transformer architecture that can encode text into a vector representation. Specifically, the Bert model first embeds the input text sequence and then encodes it through multiple layers of the Transformer Encoder.
In a specific embodiment, the method for extracting the text feature vector through the Bert model specifically comprises the following steps:
The Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title are input into the pretrained Bert model as a text sequence and embedded into a fixed-length vector space; this is realized through a technique called Word Embedding, which maps each word to a low-dimensional real vector so that semantically similar words are also close to each other in the vector space.
Next, the input text sequence is sent into the Transformer Encoder for encoding; the Transformer Encoder consists of several identical layers, each containing two sub-layers, a Multi-Head Self-Attention mechanism and a Feed-Forward Neural Network;
In the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, letting the Bert model better capture the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; this is repeated until the whole text sequence is converted into a fixed-length vector representation;
Finally, the Bert model outputs this vector representation as the text feature vector T.
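The attention-and-weighted-sum step described above can be sketched as a single self-attention head in numpy. Real Bert uses learned query/key/value projections, multiple heads and the feed-forward sub-layer; the identity projections here are a simplification for illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x: np.ndarray) -> np.ndarray:
    """One self-attention head over a sequence x of shape (seq_len, d):
    every token attends to every other token, and the output is an
    attention-weighted sum of the value vectors."""
    q, k, v = x, x, x                           # identity projections (simplified)
    scores = q @ k.T / np.sqrt(x.shape[-1])     # (seq_len, seq_len) scaled dot products
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v                          # weighted sum -> new representation
```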
S4: constructing the CLIP model, wherein each input to the CLIP model is a batch of triplets, each triplet (screenshot, category c, keyword w) generated by the combination of an image-text pair; computing the similarity of the image feature vector I and the text feature vector T, namely the image-text joint representation score s, and introducing the InfoNCE function to calculate the cross entropy loss;
Computing the similarity of the image feature vector I and the text feature vector T means: performing a dot product of the image feature vector I and the text feature vector T;
The method of computing the image-text joint representation score s_{i,j} specifically comprises the following steps:
the text feature vector T_i is taken as the query and the image feature vectors as keys and values; under the i-th description, the attention a_{i,j} of the text feature vector T_i with respect to image j is computed, giving the image-text joint representation score s_{i,j}:
s_{i,j} = a_{i,j} = T_i · I_j.
The method for introducing the InfoNCE function to compute the cross-entropy loss specifically comprises the following steps:
given a text feature vector T_i, minimize the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross-entropy loss is called the picture retrieval loss function, denoted L_pic^{(i)}:
$$L_{pic}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{i,j})}$$
where S_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i relative to its own corresponding image i; L_pic^{(i)} is the picture retrieval loss function under the i-th description; j denotes the j-th image I_j; N is the total number of images;
to minimize the picture retrieval loss, S_{i,i} should be maximized and S_{i,j} (j != i) minimized. Similarly, define the text retrieval loss function L_txt^{(i)}:
$$L_{txt}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{j,i})}$$
where L_txt^{(i)} is the text retrieval loss function under the i-th description;
its meaning is: the image feature vector I_i is used to retrieve the correct text feature vector T_i; the CLIP model is trained using the sum of the two losses L_pic^{(i)} and L_txt^{(i)}.
S5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the feature vectors generated by bad websites and those generated by normal websites is enlarged;
the method for performing multi-round iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples comprises the following steps:
adopting an ontology-based random mini-batch sampling mode, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extract a mini-batch of bad-picture samples together with a corresponding number of samples from all other types, for example 5000 gambling samples and 5000 samples of other types, to train the CLIP model, so that text semantics and image semantics become consistent at the coarse-class level;
on this basis, within the same gambling class, randomly extract mini-batches of data according to keywords in the meta titles of different websites; under the same gambling class there may be keywords carrying different information, such as the casino brand "Xin Pujing" and lottery-related terms, and the CLIP model parameters are continuously refined through the differences between keywords under the same ontology type, for example by randomly extracting 5000 samples containing "Xin Pujing" and 5000 samples containing "lottery"; this achieves finer-grained optimization, aligns more fine-grained picture information, and yields a better small-sample training effect.
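The two-stage, ontology-based sampling can be sketched in plain Python; the sample schema (dicts with 'class' and 'keyword' fields) and the counts are assumptions for illustration:

```python
import random

def sample_contrastive_batch(pool, field, value, k, seed=None):
    """Draw k samples whose metadata `field` equals `value` (positives) and
    k samples where it differs (the contrast set)."""
    rng = random.Random(seed)
    pos = [s for s in pool if s[field] == value]
    neg = [s for s in pool if s[field] != value]
    return rng.sample(pos, min(k, len(pos))) + rng.sample(neg, min(k, len(neg)))

# Stage 1 (coarse): sample by ontology class, e.g. gambling vs. all other types.
# Stage 2 (fine): within one class, sample by meta-title keyword to refine
# the model on keyword differences under the same ontology type.
```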
S6: using the trained CLIP model for multi-modal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshot.
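At inference time, classification reduces to comparing the screenshot embedding against a text embedding per candidate category and taking the best match; a sketch with made-up vectors and category names:

```python
import numpy as np

def classify_screenshot(image_vec, class_text_vecs, class_names):
    """Return the category whose text feature vector is most similar to the
    screenshot's image feature vector (all vectors assumed L2-normalized)."""
    scores = class_text_vecs @ image_vec  # cosine similarity per category
    return class_names[int(np.argmax(scores))]
```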
As shown in fig. 4, another embodiment of the present application provides a bad website classification system based on image-text multi-modality, which specifically includes:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the website meta title data and filtering samples with valid title content as the training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector; the Bert model unit is used for encoding the Chinese category corresponding to the webpage screenshot and the keywords in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector; the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used for constructing a CLIP model and performing multi-round iteration and optimization on it with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors;
and the website classification output module is used for performing multi-modal website classification with the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
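The data cleaning module's validity rule (Chinese-character length of at least 4, or a Chinese-character share above 50%) can be sketched as follows; measuring the share against the full title length is an assumption:

```python
import re

CJK = re.compile(r'[\u4e00-\u9fff]')  # basic CJK Unified Ideographs block

def is_valid_title(title: str) -> bool:
    """Keep a meta title for training when it contains at least 4 Chinese
    characters, or Chinese characters make up more than half of it."""
    if not title:
        return False
    n_cjk = len(CJK.findall(title))
    return n_cjk >= 4 or n_cjk / len(title) > 0.5
```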
The embodiment further provides a bad website classification device based on image-text multi-modality, comprising:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the above bad website classification method based on image-text multi-modality.
The embodiment also provides a readable storage medium, which may be a read-only memory, a magnetic disk or an optical disk, etc., on which a computer program is stored; when executed by a processor, the computer program implements the bad website classification method based on image-text multi-modality described above.
In summary, the text and image information obtained from website content, together with the correlation between them, is used to judge which category a website belongs to; the effect reaches the standard of the original model while improving the accuracy and coverage of website classification. Compared with traditional CLIP training, which requires hundred-million-scale data, the mini-batch training sampling mode adopted here achieves a similar training effect with only about 1% of the data, without manually labeling large amounts of training data or large TPU clusters, and without designing specific classification rules or dictionaries for different languages or topics, which improves the efficiency and feasibility of website classification; pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method description in a flowchart or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed in a substantially simultaneous manner or in an opposite order from that shown or discussed, including in accordance with the functions that are involved.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (10)
1. The bad website classification method based on the image-text multi-mode is characterized by comprising the following steps:
s1: acquiring a webpage screenshot and a website meta title, cleaning the website meta title data, and filtering samples with valid title content as a training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
s2: compressing and resizing the webpage screenshot, encoding it through a pre-trained ResNet50 model, and converting it into an image feature vector I;
S3: chinese type corresponding to webpage screenshot through Bert modelAnd keywords in website meta titlewCoding, converting to +.>Text feature vectors with the same dimensions +.>And +.>And the text feature vector->Carrying out L2 normalization treatment;
s4: constructing a CLIP model, where each input to the CLIP model is a batch of triplets, each triplet (c, w, screenshot) generated by combining image-text pairs; computing the similarity between the image feature vector I and the text feature vector T, obtaining the image-text joint representation score S, and introducing the InfoNCE function to compute the cross-entropy loss;
s5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is shortened, and the difference between the feature vectors generated by bad websites and those generated by normal websites is enlarged;
s6: using the trained CLIP model for multi-modal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshot.
2. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S1, the data cleaning specifically comprises: missing-value cleaning, format-content cleaning and logic-error cleaning, wherein missing-value cleaning determines the missing-value range, removes unnecessary fields and fills missing values; format-content cleaning removes unnecessary characters; and logic-error cleaning performs deduplication, removes unreasonable values and corrects contradictory content.
3. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S2, the method for encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector I specifically comprises the following steps:
importing a pre-trained ResNet50 model as the image feature extractor through a deep learning framework, wherein the deep learning framework comprises PyTorch, TensorFlow and Keras;
removing the last full-connection layer of the ResNet50 model, and taking the penultimate layer as an output layer;
preprocessing the webpage screenshot, including resizing the image to 224 × 224, normalizing the pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector I.
4. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S3, the method for converting into the text feature vector T specifically comprises the following steps:
inputting the Chinese category c corresponding to the webpage screenshot and the keyword w in the website meta title into a pre-trained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and feeding it into the Transformer encoder for encoding; the Transformer encoder consists of a plurality of identical layers, each layer comprising two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words, so that the Bert model captures the contextual information in the text sequence; in the feed-forward neural network, the text sequence is weighted, summed and non-linearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; these operations are repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector T.
5. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein computing the similarity between the image feature vector I and the text feature vector T means: performing a dot product between the image feature vector I and the text feature vector T;
the method for computing the image-text joint representation score S_{i,j} specifically comprises the following steps:
with the text feature vector T_i as the query and the image feature vectors I_j as keys and values, compute, under the i-th description, the attention expression a_{i,j} of the text feature vector T_i with respect to image j, which gives the image-text joint representation score S_{i,j}:
$$S_{i,j} = a_{i,j} = T_i \cdot I_j$$
6. The method for classifying bad websites based on image-text multi-modality according to claim 5, wherein the method for introducing the InfoNCE function to compute the cross-entropy loss specifically comprises:
given a text feature vector T_i, minimize the loss of correctly retrieving the image feature vector I_i from the set of all images in the batch, with the other instances in the batch serving as negative samples; this cross-entropy loss is called the picture retrieval loss function, denoted L_pic^{(i)}:
$$L_{pic}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{i,j})}$$
where S_{i,i} denotes, under the i-th description, the joint representation score of the text feature vector T_i relative to its own corresponding image i; L_pic^{(i)} is the picture retrieval loss function under the i-th description; j denotes the j-th image I_j; N is the total number of images;
similarly, a text retrieval loss function L_txt^{(i)} is defined:
$$L_{txt}^{(i)} = -\log\frac{\exp(S_{i,i})}{\sum_{j=1}^{N}\exp(S_{j,i})}$$
where L_txt^{(i)} is the text retrieval loss function under the i-th description;
the image feature vector I_i is used to retrieve the correct text feature vector T_i; the CLIP model is trained using the sum of the two losses L_pic^{(i)} and L_txt^{(i)}.
7. The method for classifying bad websites based on image-text multi-modality according to claim 1, wherein in step S5, the method for performing multi-round iteration and optimization on the CLIP model with randomly sampled batches of positive and negative samples comprises:
adopting an ontology-based random mini-batch sampling mode, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extracting a mini-batch of bad-picture samples and a corresponding number of samples from all other types to train the CLIP model, so that text semantics and image semantics become consistent at the coarse-class level; then randomly extracting mini-batches of data according to keywords in the meta titles of different websites, and continuously refining the CLIP model parameters through the differences between keywords under the same ontology type, thereby achieving fine-grained optimization.
8. A system based on the bad website classification method based on image-text multi-modality according to any one of claims 1 to 7, characterized in that the system comprises:
the data acquisition module is used for acquiring a webpage screenshot and a website meta title;
the data cleaning module is used for cleaning the website meta title data and filtering samples with valid title content as the training set; valid title content means that the Chinese-character length is at least 4 or Chinese characters account for more than 50% of the title;
the feature extraction module comprises a ResNet50 model unit, a Bert model unit and an L2 normalization unit, wherein the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector; the Bert model unit is used for encoding the Chinese category corresponding to the webpage screenshot and the keywords in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector; the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module is used for constructing a CLIP model and performing multi-round iteration and optimization on it with randomly sampled batches of positive and negative samples, so that the CLIP model aligns the image feature vectors with the text feature vectors;
and the website classification output module is used for performing multi-modal website classification with the trained CLIP model, classifying bad websites according to the semantic information aligned with the webpage screenshot.
9. Poor website classification equipment based on picture and text multimode is characterized by comprising:
a memory: for storing a computer program;
a processor: for executing the computer program to implement the bad website classification method based on image-text multi-modality according to any one of claims 1-7.
10. A readable storage medium, characterized in that a computer program is stored thereon, and when executed by a processor, the computer program implements the bad website classification method based on image-text multi-modality according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311078357.7A CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311078357.7A CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116796251A true CN116796251A (en) | 2023-09-22 |
Family
ID=88046827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311078357.7A Pending CN116796251A (en) | 2023-08-25 | 2023-08-25 | Poor website classification method, system and equipment based on image-text multi-mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116796251A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117198514A (en) * | 2023-11-08 | 2023-12-08 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117235532A (en) * | 2023-11-09 | 2023-12-15 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117435739A (en) * | 2023-12-21 | 2024-01-23 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635835A (en) * | 2018-11-08 | 2019-04-16 | 深圳蓝韵医学影像有限公司 | A kind of breast lesion method for detecting area based on deep learning and transfer learning |
CN110020335A (en) * | 2017-07-28 | 2019-07-16 | 北京搜狗科技发展有限公司 | The treating method and apparatus of collection |
CN110121118A (en) * | 2019-06-17 | 2019-08-13 | 腾讯科技(深圳)有限公司 | Video clip localization method, device, computer equipment and storage medium |
CN114743630A (en) * | 2022-04-01 | 2022-07-12 | 杭州电子科技大学 | Medical report generation method based on cross-modal contrast learning |
CN114862811A (en) * | 2022-05-19 | 2022-08-05 | 湖南大学 | Defect detection method based on variational automatic encoder |
CN115222845A (en) * | 2022-08-01 | 2022-10-21 | 北京元亦科技有限公司 | Method and device for generating style font picture, electronic equipment and medium |
CN115374325A (en) * | 2022-05-31 | 2022-11-22 | 国家计算机网络与信息安全管理中心 | Website classification method and device, classification equipment and storage medium |
CN115563342A (en) * | 2022-10-19 | 2023-01-03 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system, equipment and storage medium for video theme retrieval |
CN115659175A (en) * | 2022-10-13 | 2023-01-31 | 国网辽宁省电力有限公司信息通信分公司 | Multi-mode data analysis method, device and medium for micro-service resources |
CN115661594A (en) * | 2022-10-19 | 2023-01-31 | 海南港航控股有限公司 | Image-text multi-mode feature representation method and system based on alignment and fusion |
CN116206239A (en) * | 2023-01-30 | 2023-06-02 | 北京达佳互联信息技术有限公司 | Video feature extraction network training method and device, electronic equipment and storage medium |
- 2023-08-25 CN CN202311078357.7A patent/CN116796251A/en active Pending
Non-Patent Citations (2)
Title |
---|
ALEC RADFORD等: ""Learning Transferable Visual Models From Natural Language Supervision"", 《ARXIV》, pages 1 - 48 * |
TEJAS SRINIVASAN等: ""Curriculum Learning for Data-Efficient Vision-Language Alignment"", 《PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, pages 5619 - 5620 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117198514A (en) * | 2023-11-08 | 2023-12-08 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117198514B (en) * | 2023-11-08 | 2024-01-30 | 中国医学科学院北京协和医院 | Vulnerable plaque identification method and system based on CLIP model |
CN117235532A (en) * | 2023-11-09 | 2023-12-15 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117235532B (en) * | 2023-11-09 | 2024-01-26 | 西南民族大学 | Training and detecting method for malicious website detection model based on M-Bert |
CN117435739A (en) * | 2023-12-21 | 2024-01-23 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
CN117435739B (en) * | 2023-12-21 | 2024-03-15 | 深圳须弥云图空间科技有限公司 | Image text classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109165563B (en) | Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product | |
CN116796251A (en) | Poor website classification method, system and equipment based on image-text multi-mode | |
CN111985239A (en) | Entity identification method and device, electronic equipment and storage medium | |
CN113672693B (en) | Label recommendation method of online question-answering platform based on knowledge graph and label association | |
CN113657115B (en) | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN114691864A (en) | Text classification model training method and device and text classification method and device | |
CN116304984A (en) | Multi-modal intention recognition method and system based on contrast learning | |
CN111581964A (en) | Theme analysis method for Chinese ancient books | |
CN115408488A (en) | Segmentation method and system for novel scene text | |
CN115687571A (en) | Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash | |
Yang et al. | A comparative study of language transformers for video question answering | |
Krishnan et al. | Bringing semantics into word image representation | |
CN117540039A (en) | Data retrieval method based on unsupervised cross-modal hash algorithm | |
CN117033558A (en) | BERT-WWM and multi-feature fused film evaluation emotion analysis method | |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions | |
CN117033626A (en) | Text auditing method, device, equipment and storage medium | |
CN114020871B (en) | Multi-mode social media emotion analysis method based on feature fusion | |
CN114386412B (en) | Multi-mode named entity recognition method based on uncertainty perception | |
CN116451699A (en) | Segment extraction type machine reading and understanding method based on attention mechanism | |
El-Gayar | Automatic generation of image caption based on semantic relation using deep visual attention prediction | |
CN116702094B (en) | Group application preference feature representation method | |
Liu | IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images | |
CN117875266B (en) | Training method and device for text coding model, electronic equipment and storage medium | |
Gao et al. | Team gzw at Factify 2: Multimodal Attention and Fusion Networks for Multi-Modal Fact Verification. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |