CN116796251A - Poor website classification method, system and equipment based on image-text multi-mode - Google Patents

Poor website classification method, system and equipment based on image-text multi-mode

Info

Publication number
CN116796251A
Authority
CN
China
Prior art keywords
text
image
feature vector
model
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311078357.7A
Other languages
Chinese (zh)
Inventor
徐艺丹
韦芹余
栾鹏林
李永成
盛响
倪正国
韩晓华
褚连杰
高陆云
朱琳彤
周恬
张志元
张�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Internet Industry Management Service Center
Original Assignee
Jiangsu Internet Industry Management Service Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Internet Industry Management Service Center filed Critical Jiangsu Internet Industry Management Service Center
Priority to CN202311078357.7A priority Critical patent/CN116796251A/en
Publication of CN116796251A publication Critical patent/CN116796251A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of network security, and in particular to a bad website classification method, system and device based on image-text multimodality. Features are extracted from webpage screenshots and website meta titles and their similarity is computed; a CLIP model is optimized through an InfoNCE loss function and trained with ontology-based random sampling; the features of the image and text modalities are fused, and bad websites are classified according to the fused features, which improves the accuracy and coverage of website classification. By adopting small-batch training sampling, a comparable training effect can be achieved with only about 1% of the data, without manually labeling a large amount of training data, without large TPU compute, and without designing specific classification rules or dictionaries for different languages or topics, which improves the efficiency and practicability of website classification. Pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.

Description

Bad website classification method, system and device based on image-text multimodality
Technical Field
The application relates to the technical field of network security, and in particular to a bad website classification method, system and device based on image-text multimodality.
Background
The vigorous development of information technology and the mobile internet has made people's lives more convenient; at the same time, bad websites need to be identified timely and accurately so that visiting users can be warned in advance.
Existing bad website identification methods fall mainly into two categories: one matches text keywords of a website with matching techniques based on the website's text content; the other builds a deep learning classification model that identifies bad websites from the single-modality text or picture information of the website. Methods that build a deep learning classification model from single-modality information consider only picture information or only text information, or directly extract the text in pictures with OCR technology, so pictures without text information are handled insufficiently, even though the carriers of information are varied and include text, images, video, audio and more. Existing bad website identification methods therefore suffer from low identification efficiency and are prone to missed reports.
Multimodal learning establishes a multimodal model that fuses the information of multiple modalities such as pictures and text, so that the network can learn the information of each modality, improving the learning capacity of the model and obtaining more accurate results. The application therefore provides a bad website classification method, system and device based on image-text multimodality to solve the problems in the prior art.
Disclosure of Invention
Aiming at the problems of the existing bad website classification methods, such as insufficient handling of pictures without text information, low recognition efficiency and frequent missed reports, the application provides a bad website classification method, system and device based on image-text multimodality. Based on a pre-trained image-text model, a CLIP model extracts the information of the two modalities of bad-website screenshots and text, extracts the visual features and text features of bad websites, fuses the features of the two modalities, and classifies bad websites according to the fused features, achieving a good classification effect with less training data and lower hardware requirements.
In order to achieve the above object, the present application is realized by the following technical solution:
A bad website classification method based on image-text multimodality comprises the following steps:
S1: acquiring webpage screenshots and website meta titles, cleaning the website meta title data, and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model, and converting it into an image feature vector $I$;
S3: encoding, through a Bert model, the Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title, converting them into a text feature vector $T$ with the same dimension as $I$, and performing L2 normalization on the image feature vector $I$ and the text feature vector $T$;
S4: constructing a CLIP model whose input at each step is a batch of triplets, each triplet $(c, w, I)$ generated from an image-text pair; computing the similarity of the image feature vector $I$ and the text feature vector $T$ to obtain the image-text joint representation score $s_{ij}$, and introducing the InfoNCE function to calculate the cross entropy loss;
S5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is reduced, and the difference between the image features generated by bad websites and the text feature vectors generated by normal websites is increased;
S6: using the trained CLIP model for multimodal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshots.
As a preferred embodiment of the present application, in step S1 the data cleaning specifically includes missing-value cleaning, format and content cleaning, and logic-error cleaning, where missing-value cleaning determines the range of missing values, removes unnecessary fields and fills missing values; format and content cleaning removes unnecessary characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content.
As a preferred embodiment of the present application, in step S2 the method of encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector $I$ specifically comprises the following steps:
importing a pre-trained ResNet50 model as an image feature extractor through a deep learning framework such as PyTorch, TensorFlow or Keras;
removing the last fully connected layer of the ResNet50 model and taking the penultimate layer as the output layer;
preprocessing the webpage screenshot, including resizing the image to 224×224, normalizing the pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector $I$.
As a preferred embodiment of the present application, in step S3 the method of converting into the text feature vector $T$ specifically comprises the following steps:
inputting the Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title into a pre-trained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and sending it into the Transformer Encoder for encoding, where the Transformer Encoder consists of several identical layers, each containing two sub-layers, a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words so that the Bert model captures the contextual information in the text sequence; in the feed-forward neural network, the representations are weighted, summed and nonlinearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer, and the above operations are repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector $T$.
As a preferred embodiment of the present application, computing the similarity of the image feature vector $I$ and the text feature vector $T$ means performing a dot product of the image feature vector $I$ and the text feature vector $T$;
the method of computing the image-text joint representation score $s_{ij}$ specifically comprises the following steps:
with the text feature vector $T_i$ as the query and the image feature vectors $I_j$ as the keys and values, computing, under the $i$-th description, the attention representation $a_{ij}$ of the text feature vector $T_i$ with respect to image $j$, and then obtaining the image-text joint representation score $s_{ij}$.
As a preferred embodiment of the present application, the method of introducing the InfoNCE function to calculate the cross entropy loss specifically comprises:
given a text feature vector $T_i$, minimizing the cross entropy loss of correctly retrieving the image feature vector $I_i$ from the set of all images in the batch, with the other instances in the batch acting as negative samples; this cross entropy loss is called the picture retrieval loss function, denoted $L_i^{img}$:
$L_i^{img} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$
where $s_{ii}$ is the joint representation score, under the $i$-th description, of the text feature vector $T_i$ with respect to its own corresponding image $i$; $L_i^{img}$ is the picture retrieval loss function under the $i$-th description; $j$ denotes the $j$-th image; and $N$ is the total number of images;
similarly, a text retrieval loss function $L_i^{txt}$ is defined:
$L_i^{txt} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$
where $L_i^{txt}$ is the text retrieval loss function under the $i$-th description;
the CLIP model is trained using the sum of the two losses, $L = L^{img} + L^{txt}$.
As a preferred embodiment of the present application, in step S5 the method of performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data comprises:
adopting ontology-based random small-batch sampling, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extracting small batches of bad picture samples together with a corresponding number of samples of all other types to train the CLIP model, so that text semantics and image semantics become consistent at the level of the major classes; then randomly extracting small batches of data according to the keywords in the meta titles of different websites, and continuously refining the CLIP model parameters by distinguishing the different keywords under the same-type ontology, thereby achieving fine-grained optimization.
A bad website classification system based on image-text multimodality, the system comprising:
the data acquisition module, used for acquiring webpage screenshots and website meta titles;
the data cleaning module, used for cleaning the website meta title data and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
the feature extraction module, comprising a ResNet50 model unit, a Bert model unit and an L2 normalization unit, where the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector, the Bert model unit is used for encoding the Chinese type corresponding to the webpage screenshot and the keyword in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector, and the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module, used for constructing a CLIP model and performing multiple rounds of iteration and optimization on it with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors;
and the website classification output module, used for performing multimodal website classification through the trained CLIP model and classifying bad websites according to the semantic information aligned with the webpage screenshots.
A bad website classification device based on image-text multimodality comprises:
a memory, used for storing a computer program;
a processor, used for executing the computer program to implement the bad website classification method based on image-text multimodality described above.
A readable storage medium, on which a computer program is stored; when executed by a processor, the computer program implements the bad website classification method based on image-text multimodality described above.
Compared with the prior art, the application classifies websites with an improved CLIP model and has the following beneficial effects: the text and image information obtained from the website content, together with the correlation between them, is used to judge which category a website belongs to; the effect reaches the standard of the original model, and the accuracy and coverage of website classification can be improved. Compared with the traditional CLIP, which requires data at the scale of hundreds of millions of pairs, the application adopts small-batch training sampling and achieves a comparable training effect with only about 1% of the data; it requires neither a large amount of manually labeled training data nor large TPU compute, and no specific classification rules or dictionaries need to be designed for different languages or topics, which improves the efficiency and practicability of website classification. Pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort. In the drawings:
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is an overall structure diagram of a CLIP model in an embodiment of the application;
FIG. 3 is a diagram of a CLIP model sample training step in an embodiment of the present application;
fig. 4 is a system configuration diagram of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by a person skilled in the art based on the described embodiments fall within the scope of protection of the application.
Multimodal learning establishes a multimodal model so that the network can learn information of each modality from multiple modalities, improving the learning capacity of the model.
Multimodal representation is the basis of multimodal tasks. According to how the multimodal representation is learned, current multimodal model structures can be divided into two types: one is Joint, i.e. the joint or single-tower structure; the other is Coordinated, i.e. the collaborative or twin-tower structure. The single-tower structure focuses on capturing the complementarity among modalities: a multimodal feature representation is obtained by fusing the features of the individual modalities, and the fused multimodal feature is then used to complete the prediction task, with network optimization aiming at performance on the specific task. The twin-tower model does not seek fusion; instead it models the correlations between multiple (usually two) modalities, maps the multimodal data into a collaboration space and finds the collaborative relations between them, with network optimization aiming at the similarity between the two modalities. The twin-tower structure suits downstream tasks that take only one modality as input, such as cross-modal retrieval and translation; the single-tower model suits application scenarios that take several modalities as input, such as visual question answering and sentiment analysis.
The full English name of CLIP is Contrastive Language-Image Pre-training. It is a multimodal model based on contrastive learning whose training data are text-image pairs, each an image together with its corresponding text description; the hope is that, through contrastive learning, the model learns the matching relationship of text-image pairs. CLIP includes two models, a text encoder and an image encoder; the cosine similarity between the two features is computed so that the similarity of the N matched image-text pairs is maximized and the similarity of unmatched pairs is minimized, and the model parameters are optimized and adjusted with a symmetric cross entropy loss.
CLIP can learn a generic visual-language representation from large amounts of unlabeled text and images, and can therefore process website content in multiple languages, topics and media types; it can also categorize websites according to natural language queries entered by users, providing more flexible and accurate classification results. It faces the problem, however, that its data requirements are enormous: training a model from scratch requires 400 million image-text pairs and about 200 GPUs to reach the reported model performance.
The application uses an improved CLIP model: the image encoder adopts a pre-trained ResNet50 model, the text encoder adopts a pre-trained Bert model, features are extracted from webpage screenshots and website meta titles and their similarity is computed, the CLIP model is optimized through an InfoNCE loss function, and the CLIP model is trained with ontology-based random sampling, realizing multimodal website classification.
As shown in FIGS. 1-3, an embodiment of the present application provides a method for classifying bad websites based on image-text multimodality with an optimized CLIP model, which specifically comprises the following steps:
S1: acquiring webpage screenshots and website meta titles, cleaning the website meta title data, and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
In one embodiment, the data cleaning specifically includes missing-value cleaning, format and content cleaning, and logic-error cleaning: missing-value cleaning determines the range of missing values, removes unnecessary fields and fills missing values; format and content cleaning removes unnecessary characters; logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content. Invalid characters and missing fields are corrected by data cleaning.
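The title-validity rule above can be written as a small filter. A minimal sketch, assuming the rule reads "more than 4 Chinese characters, or Chinese characters above 50% of the title"; the helper name title_is_valid and the sample layout are illustrative, not from the patent:

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK unified ideographs

def title_is_valid(title: str) -> bool:
    """Validity rule from S1, as read here: more than 4 Chinese
    characters, or Chinese characters over 50% of the title."""
    title = (title or "").strip()
    if not title:
        return False
    n_cjk = len(CJK.findall(title))
    return n_cjk > 4 or n_cjk / len(title) > 0.5

# Filter meta-title samples into a training set
samples = [{"title": "某某在线商城首页", "url": "a.example"},
           {"title": "404 Not Found", "url": "b.example"}]
train_set = [s for s in samples if title_is_valid(s["title"])]
```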
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through the pre-trained ResNet50 model and converting it into an image feature vector $I$.
ResNet50 is a deep residual network that can be used for image classification, object detection and semantic segmentation tasks. Its main feature is the use of residual connections to solve the gradient vanishing and degradation problems of deep networks. The idea of a residual connection is to add a skip connection between certain layers of the network so that the input can be passed directly to the output, preserving the information of the input and enhancing the expressive power of the network.
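As a minimal illustration of the residual idea only (a generic two-convolution block, not the exact bottleneck block that ResNet50 stacks from 1×1, 3×3 and 1×1 convolutions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual connection: the input x is added back to the
    transformed output, so gradients can flow through the skip path."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```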
In a specific embodiment, the method of extracting the image feature vector with the ResNet50 model specifically comprises the following steps:
The pre-trained ResNet50 model is imported; it can be loaded with the APIs provided by deep learning frameworks such as PyTorch, TensorFlow or Keras, or a ResNet50 model file can be downloaded from the internet. The pre-trained ResNet50 model has already learned generic features on a large image dataset (e.g. ImageNet) and can be used as an image feature extractor.
The last fully connected layer of the ResNet50 model is removed. That layer serves the classification task and outputs a probability distribution over classes, which is not needed here; what is needed is the output of the penultimate layer, a 2048-dimensional vector that represents the high-level semantic features of the image.
The webpage screenshot is preprocessed, including resizing the image to 224×224, normalizing the pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation. These operations ensure that the format and distribution of the image are consistent with the data used in model training, improving the effect of feature extraction.
The preprocessed webpage screenshot is input into the ResNet50 model, and the output of the penultimate layer is taken as the image feature vector $I$; this vector can be used for subsequent image retrieval, clustering, classification and other tasks.
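A sketch of this extraction pipeline in PyTorch, assuming a recent torchvision; the ImageNet mean and standard deviation are the usual defaults rather than values stated in the patent, and the helper name image_feature is illustrative:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load pre-trained ResNet50 and drop the final fully connected layer,
# keeping everything up to the global-average-pool output (2048-d).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Preprocessing as described: resize to 224x224, scale to [0, 1],
# then normalize with the ImageNet training-set mean/std.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        feat = backbone(preprocess(img).unsqueeze(0))  # (1, 2048, 1, 1)
    return feat.flatten(1).squeeze(0)                  # (2048,)
```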
S3: encoding, through the Bert model, the Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title, converting them into a text feature vector $T$ with the same dimension as the image feature vector $I$, and performing L2 normalization on the image feature vector $I$ and the text feature vector $T$;
Here meta refers to the keyword content under the website title, and the labeled samples carry the Chinese type $c$.
The Bert model is a pre-trained language model based on the Transformer architecture that can encode text into vector representations. Specifically, the Bert model first embeds (Embedding) the input text sequence and then encodes it through multiple layers of the Transformer Encoder.
In a specific embodiment, the method of extracting the text feature vector with the Bert model specifically comprises the following steps:
The Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title are input into the pre-trained Bert model, and the text sequence is embedded into a fixed-length vector space; this is realized by a technique called Word Embedding, which maps each word to a low-dimensional real vector so that semantically similar words are also close in the vector space.
Next, the input text sequence is sent into the Transformer Encoder for encoding. The Transformer Encoder consists of several identical layers, each containing two sub-layers: a multi-head self-attention mechanism (Multi-Head Self-Attention) and a feed-forward neural network (Feed-Forward Neural Network).
In the multi-head self-attention mechanism, each word in the text sequence attends to (Attention) and influences the other words, letting the Bert model better capture the contextual information in the text sequence; in the feed-forward neural network, the representations are weighted, summed and nonlinearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer; these operations are repeated until the whole text sequence is converted into a fixed-length vector representation.
Finally, the Bert model outputs this vector representation as the text feature vector $T$.
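A sketch of the text-encoding step with the Hugging Face transformers library. The bert-base-chinese checkpoint is an assumption (the patent names no checkpoint), and the linear projection to the 2048-dimensional image space is likewise assumed; the patent only states that the text feature is given the same dimension as the image feature:

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

# Assumed: a linear projection from BERT's 768-d output to the 2048-d
# image feature space, so the two modalities share one dimension.
proj = torch.nn.Linear(768, 2048)

def text_feature(chinese_type: str, keyword: str) -> torch.Tensor:
    # Type c and keyword w are encoded together as one sequence pair.
    inputs = tok(chinese_type, keyword, return_tensors="pt",
                 truncation=True, max_length=64)
    with torch.no_grad():
        out = bert(**inputs).last_hidden_state[:, 0]  # [CLS] vector, (1, 768)
    t = proj(out).squeeze(0)                          # (2048,)
    return torch.nn.functional.normalize(t, dim=0)    # L2 normalization
```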
S4: constructing a CLIP model whose input at each step is a batch of triplets, each triplet $(c, w, I)$ generated from an image-text pair; computing the similarity of the image feature vector $I$ and the text feature vector $T$ to obtain the image-text joint representation score $s_{ij}$, and introducing the InfoNCE function to calculate the cross entropy loss;
Computing the similarity of the image feature vector $I$ and the text feature vector $T$ means performing a dot product of the image feature vector $I$ and the text feature vector $T$.
The method of computing the image-text joint representation score $s_{ij}$ specifically comprises the following steps:
With the text feature vector $T_i$ as the query and the image feature vectors $I_j$ as the keys and values, the attention representation $a_{ij}$ of the text feature vector $T_i$ with respect to image $j$ under the $i$-th description is computed, and the image-text joint representation score $s_{ij}$ is then obtained.
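A sketch of the batch score matrix. With a single image vector per sample, attention with $T_i$ as the query over $I_j$ as keys and values reduces to a (scaled) dot product, which is the simplification taken here; the function name joint_scores is illustrative:

```python
import torch
import torch.nn.functional as F

def joint_scores(T: torch.Tensor, I: torch.Tensor) -> torch.Tensor:
    """T: (N, d) L2-normalized text features; I: (N, d) image features.
    Returns the (N, N) matrix s with s[i, j] the joint representation
    score of text i and image j (dot-product simplification of the
    query/key-value attention described in the text)."""
    T = F.normalize(T, dim=1)
    I = F.normalize(I, dim=1)
    return T @ I.t()
```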
The method of introducing the InfoNCE function to calculate the cross entropy loss specifically comprises the following steps:
Given a text feature vector $T_i$, the cross entropy loss of correctly retrieving the image feature vector $I_i$ from the set of all images in the batch is minimized, with the other instances in the batch acting as negative samples; this cross entropy loss is called the picture retrieval loss function, denoted $L_i^{img}$:
$L_i^{img} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$
where $s_{ii}$ is the joint representation score, under the $i$-th description, of the text feature vector $T_i$ with respect to its own corresponding image $i$; $L_i^{img}$ is the picture retrieval loss function under the $i$-th description; $j$ denotes the $j$-th image; and $N$ is the total number of images.
To minimize the value of the picture retrieval loss function, $s_{ii}$ should be made as large as possible and the scores $s_{ij}$ with $j \neq i$ as small as possible. Similarly, a text retrieval loss function $L_i^{txt}$ is defined:
$L_i^{txt} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$
where $L_i^{txt}$ is the text retrieval loss function under the $i$-th description.
Its meaning is that the image feature vector $I_i$ is used to retrieve the correct text feature vector $T_i$; the CLIP model is trained using the sum of the two losses, $L = L^{img} + L^{txt}$.
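A sketch of the symmetric loss over the score matrix from joint_scores. The temperature is part of standard CLIP training and is assumed here; the patent itself does not state one:

```python
import torch
import torch.nn.functional as F

def clip_loss(scores: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over the (N, N) score matrix. Row-wise
    softmax gives the picture retrieval loss L^img (text i retrieves
    image i); column-wise softmax gives the text retrieval loss L^txt.
    Matched pairs sit on the diagonal."""
    temperature = 0.07  # assumed; standard CLIP learns this value
    logits = scores / temperature
    labels = torch.arange(scores.size(0))
    loss_img = F.cross_entropy(logits, labels)      # L^img
    loss_txt = F.cross_entropy(logits.t(), labels)  # L^txt
    return loss_img + loss_txt                      # L = L^img + L^txt
```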
S5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is reduced, and the difference between the image features generated by bad websites and the text feature vectors generated by normal websites is increased;
The method of performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data comprises:
adopting ontology-based random small-batch sampling, where the samples comprise N classes of bad websites and a corresponding number of normal websites; small batches of bad picture samples are randomly extracted together with a corresponding number of samples of all other types (for example, 5000 gambling samples and 5000 samples of other types) to train the CLIP model, so that text semantics and image semantics become consistent at the level of the major classes;
on this basis, within the same gambling class, small batches of data are randomly extracted according to the keywords in the meta titles of different websites; under the same gambling class there may be keywords carrying different information, such as a particular casino brand name or lottery-related terms, and by distinguishing the different keywords under the same-type ontology (for example, randomly extracting 5000 samples containing one brand keyword and 5000 samples containing lottery keywords) the CLIP model parameters are continuously refined, achieving finer-grained optimization, aligning more fine-grained picture information and obtaining a better small-sample training effect.
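A sketch of the two-stage, ontology-based sampling. The sample layout (dicts with class and title keys), the helper name sample_minibatch and the default batch size are illustrative assumptions:

```python
import random

def sample_minibatch(samples, target_class, batch=5000, keyword=None, seed=None):
    """Ontology-based random small-batch sampling. Draws up to `batch`
    positives from `target_class` (optionally only titles containing
    `keyword`, for the fine-grained stage) plus an equal number of
    samples from all other classes as negatives."""
    rng = random.Random(seed)
    pos = [s for s in samples if s["class"] == target_class
           and (keyword is None or keyword in s["title"])]
    neg = [s for s in samples if s["class"] != target_class]
    pos = rng.sample(pos, min(batch, len(pos)))
    neg = rng.sample(neg, min(len(pos), len(neg)))  # matched negative count
    return pos + neg

# Coarse stage: one bad class against all others; fine stage: one
# keyword within that class against the rest.
# coarse = sample_minibatch(data, "gambling")
# fine = sample_minibatch(data, "gambling", keyword="彩票")
```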
S6: using the trained CLIP model for multimodal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshots.
As shown in FIG. 4, another embodiment of the present application provides a bad website classification system based on image-text multimodality, which specifically comprises:
the data acquisition module, used for acquiring webpage screenshots and website meta titles;
the data cleaning module, used for cleaning the website meta title data and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
the feature extraction module, comprising a ResNet50 model unit, a Bert model unit and an L2 normalization unit, where the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector, the Bert model unit is used for encoding the Chinese type corresponding to the webpage screenshot and the keyword in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector, and the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module, used for constructing a CLIP model and performing multiple rounds of iteration and optimization on it with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors;
and the website classification output module, used for performing multimodal website classification through the trained CLIP model and classifying bad websites according to the semantic information aligned with the webpage screenshots.
The embodiment further provides a bad website classification device based on image-text multimodality, comprising:
a memory, used for storing a computer program;
a processor, used for executing the computer program to implement the bad website classification method based on image-text multimodality described above.
The embodiment also provides a readable storage medium, which may be a read-only memory, a magnetic disk or an optical disc, on which a computer program is stored; when executed by a processor, the computer program implements the bad website classification method based on image-text multimodality described above.
In summary, the text and image information obtained from the website content, together with the correlation between them, is used to judge which category a website belongs to; the effect reaches the standard of the original model, and the accuracy and coverage of website classification can be improved. Compared with the traditional CLIP, which requires data at the scale of hundreds of millions of pairs, the application adopts small-batch training sampling and achieves a comparable training effect with only about 1% of the data; it requires neither a large amount of manually labeled training data nor large TPU compute, and no specific classification rules or dictionaries need to be designed for different languages or topics, which improves the efficiency and practicability of website classification. Pictures that cannot be analyzed with OCR techniques can also be analyzed and explicitly classified.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application also includes additional implementations in which functions may be performed in a substantially simultaneous manner, or in the reverse of the order shown or discussed, depending on the functions involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them.
The foregoing is merely illustrative of the present application and does not limit it; any person skilled in the art will readily conceive of changes and substitutions within the technical scope disclosed by the application, and these shall all be covered by its protection scope. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A bad website classification method based on image-text multimodality, characterized by comprising the following steps:
S1: acquiring webpage screenshots and website meta titles, cleaning the website meta title data, and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
S2: compressing and correcting the image size of the webpage screenshot, encoding the screenshot through a pre-trained ResNet50 model, and converting it into an image feature vector $I$;
S3: encoding, through a Bert model, the Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title, converting them into a text feature vector $T$ with the same dimension as $I$, and performing L2 normalization on the image feature vector $I$ and the text feature vector $T$;
S4: constructing a CLIP model whose input at each step is a batch of triplets, each triplet $(c, w, I)$ generated from an image-text pair; computing the similarity of the image feature vector $I$ and the text feature vector $T$ to obtain the image-text joint representation score $s_{ij}$, and introducing the InfoNCE function to calculate the cross entropy loss;
S5: performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors, namely: the distance between the image and text feature vectors generated by a bad website is reduced, and the difference between the image features generated by bad websites and the text feature vectors generated by normal websites is increased;
S6: using the trained CLIP model for multimodal website classification, classifying bad websites according to the semantic information aligned with the webpage screenshots.
2. The bad website classification method based on image-text multimodality according to claim 1, wherein in step S1 the data cleaning specifically includes missing-value cleaning, format and content cleaning, and logic-error cleaning, where missing-value cleaning determines the range of missing values, removes unnecessary fields and fills missing values; format and content cleaning removes unnecessary characters; and logic-error cleaning removes duplicates and unreasonable values and corrects contradictory content.
3. The bad website classification method based on image-text multimodality according to claim 1, wherein in step S2 the method of encoding the webpage screenshot through the pre-trained ResNet50 model and converting it into the image feature vector $I$ specifically comprises the following steps:
importing a pre-trained ResNet50 model as an image feature extractor through a deep learning framework such as PyTorch, TensorFlow or Keras;
removing the last fully connected layer of the ResNet50 model and taking the penultimate layer as the output layer;
preprocessing the webpage screenshot, including resizing the image to 224×224, normalizing the pixel values to between 0 and 1, subtracting the training-set mean and dividing by the standard deviation;
inputting the preprocessed webpage screenshot into the ResNet50 model and taking the output of the penultimate layer as the image feature vector $I$.
4. The bad website classification method based on image-text multimodality according to claim 1, wherein in step S3 the method of converting into the text feature vector $T$ specifically comprises the following steps:
inputting the Chinese type $c$ corresponding to the webpage screenshot and the keyword $w$ in the website meta title into a pre-trained Bert model as a text sequence, embedding the text sequence into a fixed-length vector space, and sending it into the Transformer Encoder for encoding, where the Transformer Encoder consists of several identical layers, each containing two sub-layers, a multi-head self-attention mechanism and a feed-forward neural network;
in the multi-head self-attention mechanism, each word in the text sequence attends to and influences the other words so that the Bert model captures the contextual information in the text sequence; in the feed-forward neural network, the representations are weighted, summed and nonlinearly transformed to obtain a new vector representation, which is fed into the multi-head self-attention mechanism of the next layer, and the above operations are repeated until the whole text sequence is converted into a fixed-length vector representation;
the Bert model outputs this vector representation as the text feature vector $T$.
5. The bad website classification method based on image-text multimodality according to claim 1, wherein computing the similarity of the image feature vector $I$ and the text feature vector $T$ means performing a dot product of the image feature vector $I$ and the text feature vector $T$;
the method of computing the image-text joint representation score $s_{ij}$ specifically comprises the following steps:
with the text feature vector $T_i$ as the query and the image feature vectors $I_j$ as the keys and values, computing, under the $i$-th description, the attention representation $a_{ij}$ of the text feature vector $T_i$ with respect to image $j$, and then obtaining the image-text joint representation score $s_{ij}$.
6. The bad website classification method based on image-text multimodality according to claim 5, wherein the method of introducing the InfoNCE function to calculate the cross entropy loss specifically comprises:
given a text feature vector $T_i$, minimizing the cross entropy loss of correctly retrieving the image feature vector $I_i$ from the set of all images in the batch, with the other instances in the batch acting as negative samples; this cross entropy loss is called the picture retrieval loss function, denoted $L_i^{img}$:
$L_i^{img} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}$
where $s_{ii}$ is the joint representation score, under the $i$-th description, of the text feature vector $T_i$ with respect to its own corresponding image $i$; $L_i^{img}$ is the picture retrieval loss function under the $i$-th description; $j$ denotes the $j$-th image; and $N$ is the total number of images;
similarly, a text retrieval loss function $L_i^{txt}$ is defined:
$L_i^{txt} = -\log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}$
where $L_i^{txt}$ is the text retrieval loss function under the $i$-th description;
the CLIP model is trained using the sum of the two losses, $L = L^{img} + L^{txt}$.
7. The bad website classification method based on image-text multimodality according to claim 1, wherein in step S5 the method of performing multiple rounds of iteration and optimization on the CLIP model with randomly sampled batches of positive and negative sample data comprises:
adopting ontology-based random small-batch sampling, where the samples comprise N classes of bad websites and a corresponding number of normal websites; randomly extracting small batches of bad picture samples together with a corresponding number of samples of all other types to train the CLIP model, so that text semantics and image semantics become consistent at the level of the major classes; then randomly extracting small batches of data according to the keywords in the meta titles of different websites, and continuously refining the CLIP model parameters by distinguishing the different keywords under the same-type ontology, thereby achieving fine-grained optimization.
8. A system based on the bad website classification method based on image-text multimodality according to any one of claims 1-7, wherein the system comprises:
the data acquisition module, used for acquiring webpage screenshots and website meta titles;
the data cleaning module, used for cleaning the website meta title data and filtering the samples with valid title content as the training set, where valid title content means the Chinese character length is greater than 4 or the proportion of Chinese characters exceeds 50%;
the feature extraction module, comprising a ResNet50 model unit, a Bert model unit and an L2 normalization unit, where the ResNet50 model unit is used for encoding the webpage screenshot and converting it into an image feature vector, the Bert model unit is used for encoding the Chinese type corresponding to the webpage screenshot and the keyword in the website meta title and converting them into a text feature vector with the same dimension as the image feature vector, and the L2 normalization unit is used for performing L2 normalization on the image feature vector and the text feature vector;
the CLIP model training module, used for constructing a CLIP model and performing multiple rounds of iteration and optimization on it with randomly sampled batches of positive and negative sample data so that the CLIP model aligns the image feature vectors and the text feature vectors;
and the website classification output module, used for performing multimodal website classification through the trained CLIP model and classifying bad websites according to the semantic information aligned with the webpage screenshots.
9. A bad website classification device based on image-text multimodality, characterized by comprising:
a memory, used for storing a computer program;
a processor, used for executing the computer program to implement the bad website classification method based on image-text multimodality according to any one of claims 1-7.
10. A readable storage medium, characterized in that a computer program is stored on the readable storage medium, and when executed by a processor the computer program implements the bad website classification method based on image-text multimodality according to any one of claims 1-7.
CN202311078357.7A 2023-08-25 2023-08-25 Poor website classification method, system and equipment based on image-text multi-mode Pending CN116796251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311078357.7A CN116796251A (en) 2023-08-25 2023-08-25 Poor website classification method, system and equipment based on image-text multi-mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311078357.7A CN116796251A (en) 2023-08-25 2023-08-25 Poor website classification method, system and equipment based on image-text multi-mode

Publications (1)

Publication Number Publication Date
CN116796251A true CN116796251A (en) 2023-09-22

Family

ID=88046827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311078357.7A Pending CN116796251A (en) 2023-08-25 2023-08-25 Poor website classification method, system and equipment based on image-text multi-mode

Country Status (1)

Country Link
CN (1) CN116796251A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117235532A (en) * 2023-11-09 2023-12-15 西南民族大学 Training and detecting method for malicious website detection model based on M-Bert
CN117435739A (en) * 2023-12-21 2024-01-23 深圳须弥云图空间科技有限公司 Image text classification method and device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635835A (en) * 2018-11-08 2019-04-16 深圳蓝韵医学影像有限公司 A kind of breast lesion method for detecting area based on deep learning and transfer learning
CN110020335A (en) * 2017-07-28 2019-07-16 北京搜狗科技发展有限公司 The treating method and apparatus of collection
CN110121118A (en) * 2019-06-17 2019-08-13 腾讯科技(深圳)有限公司 Video clip localization method, device, computer equipment and storage medium
CN114743630A (en) * 2022-04-01 2022-07-12 杭州电子科技大学 Medical report generation method based on cross-modal contrast learning
CN114862811A (en) * 2022-05-19 2022-08-05 湖南大学 Defect detection method based on variational automatic encoder
CN115222845A (en) * 2022-08-01 2022-10-21 北京元亦科技有限公司 Method and device for generating style font picture, electronic equipment and medium
CN115374325A (en) * 2022-05-31 2022-11-22 国家计算机网络与信息安全管理中心 Website classification method and device, classification equipment and storage medium
CN115563342A (en) * 2022-10-19 2023-01-03 国家计算机网络与信息安全管理中心广东分中心 Method, system, equipment and storage medium for video theme retrieval
CN115659175A (en) * 2022-10-13 2023-01-31 国网辽宁省电力有限公司信息通信分公司 Multi-mode data analysis method, device and medium for micro-service resources
CN115661594A (en) * 2022-10-19 2023-01-31 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN116206239A (en) * 2023-01-30 2023-06-02 北京达佳互联信息技术有限公司 Video feature extraction network training method and device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020335A (en) * 2017-07-28 2019-07-16 北京搜狗科技发展有限公司 The treating method and apparatus of collection
CN109635835A (en) * 2018-11-08 2019-04-16 深圳蓝韵医学影像有限公司 A kind of breast lesion method for detecting area based on deep learning and transfer learning
CN110121118A (en) * 2019-06-17 2019-08-13 腾讯科技(深圳)有限公司 Video clip localization method, device, computer equipment and storage medium
CN114743630A (en) * 2022-04-01 2022-07-12 杭州电子科技大学 Medical report generation method based on cross-modal contrast learning
CN114862811A (en) * 2022-05-19 2022-08-05 湖南大学 Defect detection method based on variational automatic encoder
CN115374325A (en) * 2022-05-31 2022-11-22 国家计算机网络与信息安全管理中心 Website classification method and device, classification equipment and storage medium
CN115222845A (en) * 2022-08-01 2022-10-21 北京元亦科技有限公司 Method and device for generating style font picture, electronic equipment and medium
CN115659175A (en) * 2022-10-13 2023-01-31 国网辽宁省电力有限公司信息通信分公司 Multi-mode data analysis method, device and medium for micro-service resources
CN115563342A (en) * 2022-10-19 2023-01-03 国家计算机网络与信息安全管理中心广东分中心 Method, system, equipment and storage medium for video theme retrieval
CN115661594A (en) * 2022-10-19 2023-01-31 海南港航控股有限公司 Image-text multi-mode feature representation method and system based on alignment and fusion
CN116206239A (en) * 2023-01-30 2023-06-02 北京达佳互联信息技术有限公司 Video feature extraction network training method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEC RADFORD et al.: "Learning Transferable Visual Models From Natural Language Supervision", arXiv, pages 1-48 *
TEJAS SRINIVASAN et al.: "Curriculum Learning for Data-Efficient Vision-Language Alignment", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5619-5620 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117198514B (en) * 2023-11-08 2024-01-30 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117235532A (en) * 2023-11-09 2023-12-15 西南民族大学 Training and detecting method for malicious website detection model based on M-Bert
CN117235532B (en) * 2023-11-09 2024-01-26 西南民族大学 Training and detecting method for malicious website detection model based on M-Bert
CN117435739A (en) * 2023-12-21 2024-01-23 深圳须弥云图空间科技有限公司 Image text classification method and device
CN117435739B (en) * 2023-12-21 2024-03-15 深圳须弥云图空间科技有限公司 Image text classification method and device

Similar Documents

Publication Publication Date Title
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN116796251A (en) Poor website classification method, system and equipment based on image-text multi-mode
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN113672693B (en) Label recommendation method of online question-answering platform based on knowledge graph and label association
CN113657115B (en) Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN113220890A (en) Deep learning method combining news headlines and news long text contents based on pre-training
CN114691864A (en) Text classification model training method and device and text classification method and device
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN111581964A (en) Theme analysis method for Chinese ancient books
CN115408488A (en) Segmentation method and system for novel scene text
CN115687571A (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
Yang et al. A comparative study of language transformers for video question answering
Krishnan et al. Bringing semantics into word image representation
CN117540039A (en) Data retrieval method based on unsupervised cross-modal hash algorithm
CN117033558A (en) BERT-WWM and multi-feature fused film evaluation emotion analysis method
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN114020871B (en) Multi-mode social media emotion analysis method based on feature fusion
CN114386412B (en) Multi-mode named entity recognition method based on uncertainty perception
CN116451699A (en) Segment extraction type machine reading and understanding method based on attention mechanism
El-Gayar Automatic generation of image caption based on semantic relation using deep visual attention prediction
CN116702094B (en) Group application preference feature representation method
Liu IntelliExtract: An End-to-End Framework for Chinese Resume Information Extraction from Document Images
CN117875266B (en) Training method and device for text coding model, electronic equipment and storage medium
Gao et al. Team gzw at Factify 2: Multimodal Attention and Fusion Networks for Multi-Modal Fact Verification.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination