CN112148943A - Webpage classification method and device, electronic equipment and readable storage medium - Google Patents

Publication number
CN112148943A
Authority
CN
China
Prior art keywords
sequence
dom tree
webpage
word vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011037160.5A
Other languages
Chinese (zh)
Inventor
董叶豪
刘盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202011037160.5A priority Critical patent/CN112148943A/en
Publication of CN112148943A publication Critical patent/CN112148943A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/954Navigation, e.g. using categorised browsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577Optimising the visualization of content, e.g. distillation of HTML documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a webpage classification method, a webpage classification device, electronic equipment and a readable storage medium. The method comprises the following steps: acquiring a DOM tree generated according to webpage data of a webpage to be classified; traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree; representing the target sequence as a target word vector sequence through a pre-trained word vector model; and inputting the target word vector sequence into a trained classification model to obtain a classification result of the webpage to be classified. In the above embodiment, the target sequence is obtained by extracting the overall correlation of the nodes in the DOM tree, so that the target sequence can more completely express the structural features of the DOM tree, and the accuracy of webpage classification is improved when the target word vector sequence is used for classifying the webpage to be classified.

Description

Webpage classification method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of computer data processing, in particular to a webpage classification method and device, electronic equipment and a readable storage medium.
Background
There are generally two approaches to web page classification: classification according to the text features of a web page and classification according to its structural features. At present, when structural features are used, the number of occurrences of each HTML (HyperText Markup Language) tag and the proportion each tag accounts for are typically counted as the structural features. Classification is therefore limited by the structural features that are captured, which introduces deviation into the classification of web pages and reduces its accuracy.
Disclosure of Invention
The application provides a webpage classification method, a webpage classification device, an electronic device and a readable storage medium, which can solve the problem that the accuracy of classification is affected due to deviation of webpage classification.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a method for classifying web pages, where the method includes:
acquiring a DOM tree generated according to webpage data of a webpage to be classified;
traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree;
representing the target sequence as a target word vector sequence through a pre-trained word vector model;
and inputting the target word vector sequence into a trained classification model to obtain a classification result of the webpage to be classified.
In the above embodiment, the target sequence is obtained by extracting the overall correlation of the nodes in the DOM tree, so that the target sequence can more completely express the structural features of the DOM tree, and the accuracy of webpage classification is improved when the target word vector sequence is used for classifying the webpage to be classified.
With reference to the first aspect, in some optional implementations, the extraction algorithm includes a first extraction algorithm and a second extraction algorithm, and traversing the DOM tree through the extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree includes:
traversing the DOM tree through the first extraction algorithm to obtain a first sequence for representing the correlation of nodes at the same level in the DOM tree;
traversing the DOM tree through the second extraction algorithm to obtain a second sequence for representing the correlation of parent and child nodes in the DOM tree;
and connecting the first sequence and the second sequence to obtain the target sequence.
In the above embodiment, the extraction algorithms extract from the DOM tree a first sequence representing the correlation between nodes at the same level and a second sequence representing the correlation between parent and child nodes, so that together the two sequences express the structural features of the DOM tree more completely. When the target sequence is then used for webpage classification, few structural features are missing, which improves the accuracy and reliability of the classification.
With reference to the first aspect, in some optional embodiments, the first extraction algorithm comprises a BFS algorithm and the second extraction algorithm comprises a DFS algorithm.
With reference to the first aspect, in some optional implementations, before representing the target sequence as a target word vector sequence by a pre-trained word vector model, the method further includes:
acquiring webpage data of the webpage to be classified, wherein the webpage data comprises at least one of an HTML document and a URL link;
the acquiring of a DOM tree generated according to webpage data of a webpage to be classified comprises:
acquiring a DOM tree generated according to an HTML document of a webpage to be classified, and extracting a designated field in a URL link;
the representing of the target sequence as a target word vector sequence through a pre-trained word vector model comprises:
representing the target sequence and the specified field as the target word vector sequence through a pre-trained word vector model.
In the above embodiment, the electronic device acquires the webpage data of the webpage to be classified, and then the electronic device can generate the DOM tree according to the webpage data, so that the electronic device does not need to acquire the DOM tree of the webpage to be classified from other devices, and the flow is simplified.
With reference to the first aspect, in some optional implementations, the classification model includes a feature extraction unit and a classifier, and the inputting the target word vector sequence into the trained classification model to obtain a classification result of the web page to be classified includes:
inputting the target word vector sequence into the feature extraction unit in the trained classification model;
and inputting the feature vector obtained by performing feature extraction on the target word vector sequence by the feature extraction unit into the classifier to obtain a classification result of the webpage to be classified output by the classifier.
With reference to the first aspect, in some optional embodiments, the classification model includes an LSTM model, the feature extraction unit is an LSTM unit, and the classifier is Softmax.
With reference to the first aspect, in some optional implementations, the Word vector model includes a Word2Vec model.
In a second aspect, an embodiment of the present application further provides a web page classification apparatus, where the apparatus includes:
the acquisition unit is used for acquiring a DOM tree generated according to the webpage data of the webpage to be classified;
the correlation extraction unit is used for traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall correlation of the nodes in the DOM tree;
a vector representing unit, configured to represent the target sequence as a target word vector sequence through a pre-trained word vector model;
and the classification unit is used for inputting the target word vector sequence into the trained classification model to obtain a classification result of the webpage to be classified.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory and a processor coupled to each other, and a computer program is stored in the memory, and when the computer program is executed by the processor, the electronic device is caused to perform the above-mentioned web page classification method.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the above-mentioned web page classification method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings show only certain embodiments of the application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a webpage classification method according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of the sub-steps included in step S220 of the webpage classification method in Fig. 2.
Fig. 4 is a block diagram of a web page classification apparatus according to an embodiment of the present application.
Reference numerals: 10-electronic device; 11-processing module; 12-storage module; 100-web page classification apparatus; 110-acquisition unit; 120-correlation extraction unit; 130-vector representation unit; 140-classification unit.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that the terms "first," "second," and the like are used merely to distinguish one description from another, and are not intended to indicate or imply relative importance.
The applicant finds that, at present, when a webpage is classified by using the structural features of a DOM (Document Object Model) tree, generally only structural features and element features are extracted, that is, only the structural information of the DOM tree and the number of tags are counted. This approach considers neither the order among the nodes, nor the correlation between sibling nodes and between parent and child nodes of the DOM tree, nor the semantic and syntactic correlation among the HTML tags in the DOM tree, so the accuracy of webpage classification is not high.
In view of the above problems, the applicant proposes the following embodiments to solve the above problems. The embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, an embodiment of the present application provides an electronic device 10, which can be used for classifying web pages. The types of the web pages can be divided in advance according to actual situations, and include but are not limited to shopping web pages, news web pages, video playing web pages, image library web pages, enterprise web pages, school web pages and the like.
In this embodiment, the electronic device 10 may include a processing module 11, a storage module 12, and a web page classification apparatus 100. The storage module 12 stores therein a computer program which, when executed by the processing module 11, may enable the electronic device 10 to perform the steps of the web page classification method described below.
Understandably, the electronic device 10 may also include other modules. For example, the electronic device 10 may further include a communication module for establishing a communication connection of the electronic device 10 with a user terminal. The processing module 11, the storage module 12, the communication module and the elements of the web page classification device 100 are directly or indirectly electrically connected to each other, so as to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The electronic device 10 may be, but is not limited to, a personal computer, a server, etc. The user terminal may be, but is not limited to, a smart phone, a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like.
Referring to fig. 2, an embodiment of the present application further provides a web page classification method, which can be applied to the electronic device 10, and the electronic device 10 executes or implements the steps of the method. The method may comprise the steps of:
step S210, acquiring a DOM tree generated according to webpage data of a webpage to be classified;
step S220, traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree;
step S230, representing the target sequence as a target word vector sequence through a pre-trained word vector model;
step S240, inputting the target word vector sequence into the trained classification model to obtain the classification result of the webpage to be classified.
In the above embodiment, the target sequence is obtained by extracting the overall correlation of the nodes in the DOM tree of the webpage, so that the target sequence can more completely express the structural features of the DOM tree, and the accuracy of webpage classification can be improved when the webpage to be classified is classified by using the target word vector sequence.
The individual steps of the process are explained in detail below, as follows:
In step S210, when a web page needs to be classified, the electronic device may acquire the DOM tree of the webpage to be classified. The electronic device may generate the DOM tree itself, or obtain it from another device (e.g., a server).
For example, the electronic device may parse the web page data (such as an HTML document) of the webpage to be classified to obtain its DOM tree. Alternatively, another device may parse the web page data to obtain the DOM tree, and the electronic device then acquires the DOM tree of the webpage to be classified from that device.
In the DOM tree, the basic elements of the DOM tree structure are "nodes", and the structural features of the DOM tree are the tree topology structure formed by each node, and can be embodied by the sequence representing the correlation of parent and child nodes and the sequence representing the correlation of peer nodes. When the DOM tree is generated by using the webpage data, the webpage data can be interpreted by the interpreter, so that a DOM tree structure is obtained.
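As an illustration of turning an HTML document into a tree of nodes, the following is a minimal sketch using Python's standard `html.parser` module. The `Node` and `DomTreeBuilder` names are hypothetical helpers introduced here, and the sketch keeps only tag names (text and attributes are ignored); the interpreter used in an actual embodiment may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML (void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element node of the simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomTreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def build_dom_tree(html_text):
    builder = DomTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


tree = build_dom_tree("<html><head><title>t</title></head>"
                      "<body><div><p>a</p><img src='x'></div></body></html>")
html_node = tree.children[0]
print([child.tag for child in html_node.children])  # ['head', 'body']
```

Each `Node` records its parent and its ordered children, which is the information the parent-child and same-level sequences described above rely on.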
In step S220, the electronic device may store a corresponding extraction algorithm in advance. The extraction algorithm may simultaneously include a plurality of extraction algorithms for extracting sequences of correlations of different dimensions from the DOM tree. By traversing the entire DOM tree, a target sequence can be obtained that more completely represents the relevance of each node in the DOM tree. For example, the extraction algorithms may include an algorithm for extracting parent-child node dependencies in the DOM tree to obtain a corresponding sequence, and an algorithm for extracting sibling node dependencies in the DOM tree to obtain a corresponding sequence. A sequence is understood to be a text content for describing the relevance of the nodes. Based on this, the structural features of the DOM tree can be converted into a sequence, and the structural features of the DOM tree are described by the sequence, so that the web pages can be classified by the sequence.
In step S230, the electronic device stores a pre-trained word vector model in advance. The pre-trained word vector model may be understood as the model obtained after an initial word vector model has been pre-trained. The pre-training method is well known to those skilled in the art and will not be described herein. The word vector model may be selected according to the actual situation; for example, the word vector model may be a Word2Vec model.
Understandably, the Word2Vec model can represent the target sequence as a multidimensional target Word vector sequence as an input vector to the classification model, which can capture semantic and syntactic dependencies between web page data. That is, the Word2Vec model can capture semantic and syntactic dependencies among various types of tags (e.g., HTML tags) in the web page data from the target sequence, and can capture structural rules existing in the DOM tree. The dimension of the target word vector sequence is usually a fixed value and can be determined according to actual conditions. Namely, after different target sequences are converted by the Word2Vec model, the obtained target Word vector sequences have the same dimensionality, so that the classification and identification of the webpage can be conveniently carried out by utilizing the classification model subsequently.
Because the target sequence comprises the sequence among the nodes, when the Word2Vec model converts the target sequence into the target Word vector sequence, the characteristics of the node sequence in the DOM tree, the semantic and syntactic relevance among the labels and the like can be extracted, and therefore the accuracy of webpage classification can be improved.
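To see why the order of nodes in the target sequence matters, recall that Word2Vec learns from tokens that occur near each other within a context window. The sketch below generates the (center, context) pairs that the skip-gram variant of Word2Vec would train on. The choice of skip-gram, the window size, and the example tag sequence are assumptions made for illustration; the source does not specify them.

```python
def skipgram_pairs(sequence, window=2):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs


# Hypothetical target sequence made of HTML tag names.
target_sequence = ["html", "head", "body", "div"]
pairs = skipgram_pairs(target_sequence, window=1)
print(pairs)
```

Reordering the sequence changes which tags fall into the same window, and therefore which tags end up with similar vectors.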
The Word2Vec model can be used as a pre-training model for learning Word vectors of input words, and a target Word vector sequence obtained by pre-training is used for representing words in the target sequence instead of a One-Hot vector. Understandably, if the words in the target sequence are represented by the One-Hot vector, the obtained features are sparse and discrete, and sequence information of the words in the sequence is easy to lose, so that the represented vector lacks the relevant features of the sequence, and the accuracy of subsequent webpage classification is influenced.
In this embodiment, a Word2Vec model is used. Pre-training separates the learning of the word vectors from the training of the classification model, which reduces the complexity of the model and accelerates model training. In addition, the pre-trained word vector model can reach a better local optimum, so that the output target word vector sequence represents the semantics of the target sequence more completely, and the accuracy of webpage classification is improved.
In step S240, the electronic device stores a trained classification model in advance. The classification model is a neural network model that determines the type of the web page from the word vectors. For example, the classification model may be an LSTM (Long Short-Term Memory) model.
In this embodiment, the trained classification model can automatically detect and classify the target word vector sequence, so as to obtain the type of the web page. In the training process, the webpage data are associated with the webpage types, so that the classification model can obtain the corresponding relation between the feature vectors of the word vectors and the webpage types, and the classification model has the function of identifying the webpage types according to the word vectors. The training method of the classification model is well known to those skilled in the art, and is not particularly limited herein.
Referring to fig. 3, as an alternative embodiment, step S220 may include the following sub-steps:
substep S221, traversing the DOM tree through the first extraction algorithm to obtain a first sequence for representing the correlation of nodes at the same level in the DOM tree;
substep S222, traversing the DOM tree through the second extraction algorithm to obtain a second sequence for representing the correlation of parent and child nodes in the DOM tree;
and a substep S223 of connecting the first sequence and the second sequence to obtain the target sequence.
In this embodiment, nodes at the same level may be referred to as sibling nodes. The first extraction algorithm and the second extraction algorithm may be selected according to the actual situation. For example, the first extraction algorithm may be a BFS (Breadth First Search) algorithm, and the second extraction algorithm may be a DFS (Depth First Search) algorithm.
The electronic device may perform traversal search on the DOM tree through the first extraction algorithm, extract the correlation of the peer nodes from the DOM tree, and describe or represent the correlation of each peer node in the DOM tree by using a sequence, where the sequence is a first sequence, and the content of the first sequence may be determined according to an actual situation, and is not specifically limited herein. Similarly, the electronic device may perform traversal search on the DOM tree through the second extraction algorithm, extract the correlation of the parent node and the child node from the DOM tree, and describe or represent the correlation of each parent node and child node in the DOM tree by using a sequence, where the sequence is the second sequence, and the content of the second sequence may be determined according to an actual situation, and is not specifically limited herein.
After the first sequence and the second sequence are obtained, the first sequence and the second sequence may be concatenated or spliced to obtain the target sequence. The connection or splicing manner of the first sequence and the second sequence may be: when the first sequence and the second sequence are in the same row, the first sequence can be on the left side of the second sequence, and the second sequence is adjacent to the first sequence, so that a target sequence is obtained; alternatively, the first sequence is to the right of the second sequence, and the second sequence is immediately adjacent to the first sequence, thereby obtaining the target sequence.
In the above embodiment, the extraction algorithms extract from the DOM tree a first sequence representing the correlation between nodes at the same level and a second sequence representing the correlation between parent and child nodes, so that together the two sequences express the structural features of the DOM tree more completely. When the target sequence is then used for webpage classification, few structural features are missing, which improves the accuracy and reliability of the classification.
It should be noted that the execution order of sub-step S221 and sub-step S222 may be interchanged. For example, in other embodiments, the electronic device may execute sub-step S222 first and then execute sub-step S221; the execution order of the two sub-steps is not specifically limited here.
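Sub-steps S221 to S223 can be sketched as follows, assuming that each sequence simply lists the tag names of the nodes in visiting order. That encoding is an assumption, since the text leaves the content of the sequences open, and the `Node` class and the example tree are hypothetical.

```python
from collections import deque


class Node:
    """Minimal DOM-like node holding a tag name and ordered children."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


def bfs_sequence(root):
    """Level-order traversal: nodes of the same level end up next to each other."""
    sequence, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        sequence.append(node.tag)
        queue.extend(node.children)
    return sequence


def dfs_sequence(root):
    """Pre-order traversal: each parent is followed by its descendants."""
    sequence, stack = [], [root]
    while stack:
        node = stack.pop()
        sequence.append(node.tag)
        stack.extend(reversed(node.children))  # keep left-to-right child order
    return sequence


root = Node("html", [
    Node("head", [Node("title")]),
    Node("body", [Node("div", [Node("p"), Node("a")]), Node("footer")]),
])

first_sequence = bfs_sequence(root)    # S221: same-level correlation
second_sequence = dfs_sequence(root)   # S222: parent-child correlation
target_sequence = first_sequence + second_sequence  # S223: concatenation
print(target_sequence)
```

Because the two traversals are independent, they can run in either order, in line with the note above; only the concatenation order has to be kept consistent between training and classification.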
As an optional implementation manner, before step S230, the method may further include: acquiring webpage data of the webpage to be classified, wherein the webpage data comprises at least one of an HTML document and a URL link;
step S210 may include: acquiring a DOM tree generated according to an HTML document of a webpage to be classified, and extracting a designated field in a URL link;
step S230 may include: representing the target sequence and the specified field as the target word vector sequence through a pre-trained word vector model.
Understandably, the electronic device can acquire the webpage data of the webpage to be classified according to the website of the webpage to be classified. The web page data includes, but is not limited to, HTML documents, URL (Uniform Resource Locator) links, etc. of the web page. An HTML document is understood to be the HTML source code of a web page. After obtaining the web page data, the electronic device may interpret the HTML document through the interpreter, thereby obtaining the DOM tree. In addition, the electronic device may parse the URL link to extract the specified field in the URL link. The designated field may be a field in the URL link, which is used to indicate information such as a sub-domain name, a top-level domain name, a directory, etc. that may reflect the type of the website corresponding to the web page.
Understandably, the way of interpreting the HTML document data by an interpreter to obtain a DOM tree is well known to those skilled in the art and will not be described herein. In addition, the manner of extracting the designated field from the URL link can be determined according to actual conditions. For example, through a regular expression, specified fields in the URL link are extracted from the URL link.
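A minimal sketch of regular-expression-based field extraction is given below. The pattern, the `extract_url_fields` helper, and the choice of fields (sub-domain labels, top-level domain, and directory names) are assumptions for illustration. The host is split naively on dots, so multi-part suffixes such as "co.uk" are not handled.

```python
import re

URL_PATTERN = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?"   # optional scheme, e.g. "https://"
    r"(?P<host>[^/:?#]+)"           # host name
    r"(?::\d+)?"                    # optional port
    r"(?P<path>/[^?#]*)?",          # optional path, without query or fragment
    re.IGNORECASE,
)


def extract_url_fields(url):
    """Return sub-domain labels, top-level domain, and directory names of a URL."""
    match = URL_PATTERN.match(url)
    if match is None:
        return {"subdomain": [], "tld": "", "directories": []}
    labels = match.group("host").lower().split(".")
    segments = [s for s in (match.group("path") or "").split("/") if s]
    if segments and "." in segments[-1]:
        segments = segments[:-1]  # drop a trailing file name such as index.html
    return {
        "subdomain": labels[:-2],
        "tld": labels[-1] if len(labels) > 1 else "",
        "directories": segments,
    }


fields = extract_url_fields("https://news.example.com/sports/football/index.html")
print(fields)
# {'subdomain': ['news'], 'tld': 'com', 'directories': ['sports', 'football']}
```

The extracted strings can then be treated as tokens and passed to the word vector model together with the target sequence.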
In this embodiment, the Word2Vec model may represent the first sequence and the second sequence as a first Word vector sequence and a second Word vector sequence, respectively. Additionally, the Word2Vec model may convert specified fields extracted from URL links into URL feature sequences. Generally, the vector dimension of the URL feature sequence is larger than the vector dimensions of the first word vector sequence and the second word vector sequence (the vector dimensions of the first word vector sequence and the second word vector sequence are generally the same). The electronic device can convert the vector dimension in the URL feature sequence through a corresponding conversion function, so that the converted URL feature sequence is the same as the vector dimensions of the first word vector sequence and the second word vector sequence. The vector dimension conversion process is well known to those skilled in the art. The conversion function may be determined according to actual conditions, and may be, for example, numpy.
For example, if the first word vector sequence contains m elements and the word vector dimension is 20, the matrix dimension of the first word vector sequence is (20, m). If the URL feature sequence contains n elements and its word vector dimension is 300, the matrix dimension of the URL feature sequence is (300, n). Each 300-dimensional word vector then needs to be converted into a matrix of dimension (20, 15), since 300 = 20 × 15, so that the vector dimension of the converted URL feature sequence matches that of the first word vector sequence. Here m and n are integers greater than 0 and may be determined according to actual situations.
Then, the first word vector sequence, the second word vector sequence and the dimension-converted URL feature sequence are connected through a connection function to obtain the target word vector sequence, which is input into the classification model to classify the webpage. The connection function may be selected according to actual conditions, and may be, for example, numpy.
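The dimension conversion and connection can be sketched as follows. The text points to numpy for these steps without naming specific functions, so plain Python lists are used here to keep the sketch dependency-free. The row-by-row layout used when reshaping, the sequence lengths, and the placeholder values are all assumptions, not real embeddings.

```python
def reshape_vector(vector, rows, cols):
    """Lay a flat vector of rows * cols values out as `rows` rows of `cols` values."""
    if len(vector) != rows * cols:
        raise ValueError("vector length must equal rows * cols")
    return [vector[r * cols:(r + 1) * cols] for r in range(rows)]


def concat_columns(*matrices):
    """Join matrices that share a row count side by side."""
    rows = len(matrices[0])
    if any(len(matrix) != rows for matrix in matrices):
        raise ValueError("all matrices must have the same number of rows")
    return [[value for matrix in matrices for value in matrix[r]]
            for r in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


WORD_DIM, URL_DIM = 20, 300   # dimensions quoted in the text
m, n = 8, 2                   # hypothetical sequence lengths

first_vectors = zeros(WORD_DIM, m)     # (20, m), placeholder values
second_vectors = zeros(WORD_DIM, m)    # (20, m), placeholder values
url_vectors = [[float(i)] * URL_DIM for i in range(n)]  # n vectors of 300 values

# Each 300-value URL vector becomes a (20, 15) block; the blocks sit side by side.
url_blocks = [reshape_vector(v, WORD_DIM, URL_DIM // WORD_DIM) for v in url_vectors]
url_matrix = concat_columns(*url_blocks)              # (20, 15 * n)

target_vectors = concat_columns(first_vectors, second_vectors, url_matrix)
print(len(target_vectors), len(target_vectors[0]))    # 20 46
```

After the conversion every column has 20 entries, so the three parts can be joined into a single (20, m + m + 15n) matrix.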
In the above embodiment, by using the first word vector sequence, the second word vector sequence, and the URL feature sequence after the dimension conversion, and by combining the structural features of the DOM tree with the features of the URL, it is beneficial to more completely express the information of the web page to be classified, thereby being beneficial to improving the accuracy of web page classification.
In this embodiment, the webpage data acquired each time may be webpage data of one webpage to be classified, or data of multiple webpages to be classified. That is, the electronic device may perform classification processing on one or more web pages to be classified at the same time.
When the number of webpages to be classified is large, the electronic device can arrange the webpages to be classified into a classification queue. The electronic device then acquires the webpage data of each webpage in turn according to its web address in the queue, and classifies and identifies the webpages to be classified in sequence.
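The queue-based processing described above may, for example, be sketched as follows. The function `classify_in_order` and its `fetch` and `classify` callables are hypothetical placeholders for the data acquisition and classification steps; they are not defined by the present application.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def classify_in_order(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Process the webpages to be classified in first-in, first-out order."""
    queue = deque(urls)  # the classification queue
    results = []
    while queue:
        url = queue.popleft()          # next web address in the queue
        page_data = fetch(url)         # acquire the webpage data
        results.append((url, classify(page_data)))
    return results
```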
In the above embodiment, the electronic device acquires the webpage data of the webpage to be classified, and then the electronic device can generate the DOM tree according to the webpage data, so that the electronic device does not need to acquire the DOM tree of the webpage to be classified from other devices, and the flow is simplified.
Step S240 may include: inputting the target word vector sequence into the feature extraction unit in the trained classification model;
and inputting the feature vector obtained by performing feature extraction on the target word vector sequence by the feature extraction unit into the classifier to obtain a classification result of the webpage to be classified output by the classifier.
In this embodiment, the classification result may be determined according to actual conditions and includes the type of the webpage to be classified; for example, the type may be a shopping webpage, a news webpage, a video playing webpage, and the like. The feature vector is obtained by the feature extraction unit performing iterative processing on the target word vector sequence; this process is well known to those skilled in the art and is not described here again.
The feature extraction unit and the classifier can be selected according to actual conditions. For example, when the classification model is an LSTM model, the feature extraction unit is an LSTM unit and the classifier is Softmax.
When the classification model is the LSTM model, the electronic device may use the feature vector that the LSTM unit extracts from the target word vector sequence at the last time step as the feature vector input to the classifier. The last time step can be understood as the last cycle of the LSTM unit, where in each cycle the result of the previous cycle and the current element of the sequence together serve as the input of that cycle. For example, if a target sequence is (x0, x1, x2) and the total number of time steps is 3, the processing of the LSTM unit may be as follows:
the processing procedure of the first time step is as follows: the LSTM unit performs feature extraction on x0 to obtain y0;
the processing procedure of the second time step is as follows: the LSTM unit performs feature extraction on x1 and y0 to obtain y1;
the processing procedure of the third time step is as follows: the LSTM unit performs feature extraction on x2 and y1 to obtain y2. The resulting feature vector y2 is the feature vector obtained at the last time step and is used as the feature vector finally input to the classifier. The electronic device may discard the outputs of all previous time steps, for example y0 and y1, and input only y2 to the classifier, which is intended to improve the accuracy and validity of the feature vector input to the classifier and thus the accuracy of webpage classification.
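The time-step processing above may be sketched with a minimal single-layer LSTM written in numpy. This is an illustrative sketch only: the weight layout (input, forget and output gates followed by the candidate), the parameter names `W`, `U`, `b`, `V`, `d`, and the convention that the sequence is supplied with one time step per row are all assumptions, and a practical implementation would use trained weights from a deep learning framework.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()


def lstm_hidden_states(xs, W, U, b):
    """Return the hidden state of every time step, shape (T, hidden).

    xs: (T, input_dim); W: (4*hidden, input_dim); U: (4*hidden, hidden); b: (4*hidden,)
    """
    hidden = U.shape[1]
    h = np.zeros(hidden)  # previous output (y_{t-1})
    c = np.zeros(hidden)  # cell state
    outputs = []
    for x in xs:
        z = W @ x + U @ h + b  # current input combined with previous result
        i, f, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def classify_last_step(xs, W, U, b, V, d):
    """Feed only the last time step's feature vector to a Softmax classifier."""
    last = lstm_hidden_states(xs, W, U, b)[-1]  # earlier outputs are discarded
    return softmax(V @ last + d)
```

For the sequence (x0, x1, x2), `lstm_hidden_states` returns (y0, y1, y2) and `classify_last_step` passes only y2 to the classifier.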
Referring to fig. 4, an embodiment of the present application further provides a web page classification apparatus 100, which can be applied to the electronic device described above and is used to execute or implement each step of the method. The web page classification apparatus 100 includes at least one software functional module that can be stored in the storage module in the form of software or firmware, or solidified in an operating system (OS) of the electronic device. The processing module is used to execute the executable modules stored in the storage module, such as the software functional modules and computer programs included in the web page classification apparatus 100.
In this embodiment, the web page classification apparatus 100 may include an obtaining unit 110, a correlation extracting unit 120, a vector representing unit 130, and a classifying unit 140, and the executed operation contents may be as follows:
an obtaining unit 110, configured to obtain a DOM tree generated according to web page data of a web page to be classified;
a correlation extraction unit 120, configured to traverse the DOM tree through an extraction algorithm to obtain a target sequence used for representing the overall correlation of nodes in the DOM tree;
a vector representing unit 130, configured to represent the target sequence as a target word vector sequence through a pre-trained word vector model;
and the classification unit 140 is configured to input the target word vector sequence into a trained classification model, so as to obtain a classification result of the web page to be classified.
Optionally, the extraction algorithm includes a first extraction algorithm and a second extraction algorithm, and the correlation extraction unit 120 is further configured to:
traversing the DOM tree through the first extraction algorithm to obtain a first sequence for representing the correlation of nodes at the same level in the DOM tree;
traversing the DOM tree through the second extraction algorithm to obtain a second sequence for representing the correlation of parent and child nodes in the DOM tree;
and connecting the first sequence and the second sequence to obtain the target sequence.
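By way of illustration, the two traversals and their connection may be sketched in Python as follows, taking a breadth-first traversal (BFS) as the first extraction algorithm and a depth-first traversal (DFS) as the second. The `Node` and `DomBuilder` classes, the use of tag names as sequence elements, and the `VOID_TAGS` set are assumptions of this sketch; a production system would typically rely on a full HTML parser.

```python
from collections import deque
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class DomBuilder(HTMLParser):
    """Build a simplified DOM tree that keeps element tags only."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def bfs_tags(root):
    """First sequence: level-by-level order, reflecting same-level nodes."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.tag)
        queue.extend(node.children)
    return order


def dfs_tags(root):
    """Second sequence: pre-order, reflecting parent-child relationships."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node.tag)
        stack.extend(reversed(node.children))
    return order


def target_sequence(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return bfs_tags(builder.root) + dfs_tags(builder.root)
```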
Before the vector representing unit 130 performs step S230, the obtaining unit 110 may be further configured to acquire webpage data of the webpage to be classified, wherein the webpage data comprises at least one of an HTML document and a URL link. The correlation extraction unit 120 is further configured to obtain a DOM tree generated according to the HTML document of the web page to be classified, and to extract a specified field in the URL link. The vector representing unit 130 may also be configured to represent the target sequence and the specified field as the target word vector sequence through a pre-trained word vector model.
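As one possible illustration of extracting fields from a URL link, the following sketch uses the Python standard library. Which fields count as the "specified field" is not fixed here; treating the host name labels and path segments as the fields, and the function name `url_fields`, are assumptions made only for this example.

```python
from urllib.parse import urlsplit


def url_fields(url: str) -> list:
    """Split a URL into host name labels followed by path segments."""
    parts = urlsplit(url)
    host_labels = (parts.hostname or "").split(".")
    path_segments = parts.path.split("/")
    # Drop the empty strings produced by leading, trailing or doubled separators.
    return [token for token in host_labels + path_segments if token]
```

The resulting tokens could then be passed to the word vector model in the same way as the elements of the target sequence.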
Optionally, the classification unit 140 may further be configured to:
inputting the target word vector sequence into the feature extraction unit in the trained classification model;
and inputting the feature vector obtained by performing feature extraction on the target word vector sequence by the feature extraction unit into the classifier to obtain a classification result of the webpage to be classified output by the classifier.
In this embodiment, the processing module may be an integrated circuit chip having signal processing capability. The processing module may be a general-purpose processor, for example a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Network Processor (NP), or the like. It may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application.
The storage module may be, but is not limited to, a random access memory, a read only memory, a programmable read only memory, an erasable programmable read only memory, an electrically erasable programmable read only memory, and the like. In this embodiment, the storage module may be configured to store the web page data. The storage module may also be used to store a program, which the processing module executes after receiving an execution instruction.
The communication module is used for establishing communication connection between the electronic equipment and other equipment (such as a user terminal) through a network and receiving and transmitting data through the network.
It is understood that the structure shown in fig. 1 is only a schematic structural diagram of an electronic device, and the electronic device may further include more components than those shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the electronic device and the web page classification apparatus 100 described above may refer to the corresponding processes of the steps in the foregoing method, and are not described in detail herein.
The embodiment of the application also provides a computer readable storage medium. The readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to execute the web page classification method as described in the above embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) and includes several instructions to enable a computer device (which can be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments of the present application.
In summary, the present application provides a method and an apparatus for classifying web pages, an electronic device and a readable storage medium. The method comprises the following steps: acquiring a DOM tree generated according to webpage data of a webpage to be classified; traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree; representing the target sequence as a target word vector sequence through a pre-trained word vector model; and inputting the target word vector sequence into a trained classification model to obtain a classification result of the webpage to be classified. In the above embodiment, the target sequence is obtained by extracting the overall correlation of the nodes in the DOM tree, so that the target sequence can more completely express the structural features of the DOM tree, and the accuracy of webpage classification is improved when the target word vector sequence is used for classifying the webpage to be classified.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system, and method may be implemented in other ways. The apparatus, system, and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for classifying web pages, the method comprising:
acquiring a DOM tree generated according to webpage data of a webpage to be classified;
traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree;
representing the target sequence as a target word vector sequence through a pre-trained word vector model;
and inputting the target word vector sequence into a trained classification model to obtain a classification result of the webpage to be classified.
2. The method according to claim 1, wherein the extraction algorithm comprises a first extraction algorithm and a second extraction algorithm, and traversing the DOM tree through the extraction algorithm to obtain a target sequence for representing the overall relevance of the nodes in the DOM tree comprises:
traversing the DOM tree through the first extraction algorithm to obtain a first sequence for representing the correlation of nodes at the same level in the DOM tree;
traversing the DOM tree through the second extraction algorithm to obtain a second sequence for representing the correlation of parent and child nodes in the DOM tree;
and connecting the first sequence and the second sequence to obtain the target sequence.
3. The method of claim 2, wherein the first extraction algorithm comprises a BFS algorithm and the second extraction algorithm comprises a DFS algorithm.
4. The method of claim 1, wherein prior to representing the target sequence as a target word vector sequence by a pre-trained word vector model, the method further comprises:
acquiring webpage data of the webpage to be classified, wherein the webpage data comprises at least one of an HTML document and a URL link;
the acquiring of a DOM tree generated according to webpage data of a webpage to be classified comprises:
acquiring a DOM tree generated according to an HTML document of the webpage to be classified, and extracting a specified field in the URL link;
the representing of the target sequence as a target word vector sequence through a pre-trained word vector model comprises:
representing the target sequence and the specified field as the target word vector sequence through a pre-trained word vector model.
5. The method according to any one of claims 1 to 4, wherein the classification model comprises a feature extraction unit and a classifier, and the step of inputting the target word vector sequence into the trained classification model to obtain the classification result of the web page to be classified comprises:
inputting the target word vector sequence into the feature extraction unit in the trained classification model;
and inputting the feature vector obtained by performing feature extraction on the target word vector sequence by the feature extraction unit into the classifier to obtain a classification result of the webpage to be classified output by the classifier.
6. The method of claim 5, wherein the classification model comprises an LSTM model, wherein the feature extraction unit is an LSTM unit, and wherein the classifier is Softmax.
7. The method of claim 1, wherein the word vector model comprises a Word2Vec model.
8. An apparatus for classifying web pages, the apparatus comprising:
the acquisition unit is used for acquiring a DOM tree generated according to the webpage data of the webpage to be classified;
the correlation extraction unit is used for traversing the DOM tree through an extraction algorithm to obtain a target sequence for representing the overall correlation of the nodes in the DOM tree;
a vector representing unit, configured to represent the target sequence as a target word vector sequence through a pre-trained word vector model;
and the classification unit is used for inputting the target word vector sequence into the trained classification model to obtain a classification result of the webpage to be classified.
9. An electronic device, characterized in that the electronic device comprises a memory and a processor coupled to each other, the memory storing a computer program which, when executed by the processor, causes the electronic device to perform the method for classifying web pages according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to carry out the method for classifying web pages according to any one of claims 1 to 7.
CN202011037160.5A 2020-09-27 2020-09-27 Webpage classification method and device, electronic equipment and readable storage medium Pending CN112148943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011037160.5A CN112148943A (en) 2020-09-27 2020-09-27 Webpage classification method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011037160.5A CN112148943A (en) 2020-09-27 2020-09-27 Webpage classification method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112148943A true CN112148943A (en) 2020-12-29

Family

ID=73895597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011037160.5A Pending CN112148943A (en) 2020-09-27 2020-09-27 Webpage classification method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112148943A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049562A (en) * 2012-12-31 2013-04-17 华为技术有限公司 Method and device for recognizing similar webpages
CN108733405A (en) * 2017-04-13 2018-11-02 富士通株式会社 The method and apparatus that training webpage distribution indicates model
CN110110075A (en) * 2017-12-25 2019-08-09 中国电信股份有限公司 Web page classification method, device and computer readable storage medium
CN110309304A (en) * 2019-06-04 2019-10-08 平安科技(深圳)有限公司 A kind of file classification method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817639A (en) * 2022-05-18 2022-07-29 山东大学 Webpage graph convolution document ordering method and system based on comparison learning
CN114817639B (en) * 2022-05-18 2024-05-10 山东大学 Webpage diagram convolution document ordering method and system based on contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination