CN113806667A - Method and system for supporting webpage classification - Google Patents

Method and system for supporting webpage classification

Info

Publication number
CN113806667A
Authority
CN
China
Prior art keywords
file
feature vector
html
webpage
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111129758.1A
Other languages
Chinese (zh)
Other versions
CN113806667B (en)
Inventor
陈超凡
王轶骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111129758.1A
Publication of CN113806667A
Application granted
Publication of CN113806667B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a method and a system for supporting webpage classification. The HTML files and JS (JavaScript) files of the webpages in a data set are obtained; a feature vector is calculated from the DOM tree of each HTML file; a feature vector is calculated from the control flow graph (CFG) of each JS file; the feature vectors of the HTML file and the JS file are combined to obtain a webpage feature vector; and the webpage feature vectors are used as input to train a neural network. The feature vector of a webpage to be detected is obtained in the same way and input into the neural network, which outputs the classification. Compared with the prior art, the method improves the identification accuracy for dynamically loaded webpages, supports large-scale webpage classification detection, and overcomes drawbacks of content-based webpage classification such as language differences.

Description

Method and system for supporting webpage classification
Technical Field
The invention relates to the technical field of internet communication, in particular to a method and a system for supporting webpage classification.
Background
Web pages have always been a central part of the internet. They first appeared with the W3C World Wide Web as a means of sharing information. In the course of that sharing, various forms of information theft emerged, typified by copying webpage source code, so that similar websites carry similar patterns. Today, with the second wave of the mobile internet and the ease of generating web pages at scale, the number of web pages has grown explosively.
Classifying these web pages mainly involves processing the pages themselves. Some approaches identify a page by its text content, some by displayed content such as images and pictures, and some by a screenshot of the page. As web development technology continues to evolve, static web pages are becoming less common, and more and more websites load their pages dynamically.
A static web page is one whose page data and DOM tree structure are stored directly in the HTML file. Dynamic loading is a programming technique that enhances static pages: the DOM tree and the rendered page are adjusted dynamically by the JS code during DOM tree generation and rendering. Consequently, crawling the webpage source code directly with a crawler does not yield the real data.
Different types of web pages often have different structures, so the page structure itself carries category information, and using it avoids problems such as webpage localization and content filling. However, dynamically loaded pages currently cannot be identified well by simple HTML DOM tree computation, and the identification accuracy is low.
Disclosure of Invention
The purpose of the present invention is to provide a method and system for supporting webpage classification that overcome the above shortcomings of the prior art.
The purpose of the invention can be achieved by the following technical solution:
A first aspect of the present invention provides a method for supporting webpage classification, which comprises the following steps:
a resource obtaining step:
acquiring an HTML file and a JS file of a dynamically loaded webpage;
an HTML file feature vector calculation step:
calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;
a JS file feature vector calculation step:
acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;
a webpage feature vector calculation step:
combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage;
a neural network training step:
converting labeled webpages into feature vectors according to the above steps, and using the feature vectors as input to train the neural network;
a neural network identification step:
converting the webpage to be detected into a feature vector according to the above steps, inputting the feature vector into the trained neural network, and obtaining a classification result.
Further, the resource obtaining step specifically includes:
crawling and parsing: crawling the HTML file of the dynamically loaded webpage, parsing its HTML tags, and obtaining the files with the JS suffix, the HTML files in frame tags, and the file paths of the HTML files;
classified downloading: determining each HTML file obtained from a frame tag to be an embedded HTML file, dividing the JS files into source JS files and embedded JS files according to whether the file path of the embedded HTML file belongs to the source domain name, and downloading the JS files.
Further, the HTML file feature vector calculation step specifically includes:
tag identification: identifying the HTML tags in the DOM tree and converting them into tag units;
tag unit merging: merging tag units with the same attributes to obtain a merged set of tag units;
vector calculation: calculating the weight value of the corresponding tag for each merged tag unit set, and constructing the feature vector of the HTML file from the tag weight values.
Further, the JS file feature vector calculation step specifically includes:
file parsing: parsing each function of the JS file to obtain a control flow graph;
function vector calculation: converting the basic blocks of the control flow graph into fixed-length vectors according to preset attribute parameter features, and calculating the vector of each function;
vector calculation: constructing the JS file feature vector from the function vectors.
Another aspect of the present invention provides a system for supporting webpage classification, including:
a resource acquisition module: acquiring an HTML file and a JS file of a dynamically loaded webpage;
an HTML file feature vector calculation module: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;
a JS file feature vector calculation module: acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;
a webpage feature vector calculation module: combining the feature vector of the HTML file and the feature vector of the JS file to obtain the feature vector of the dynamically loaded webpage;
a neural network training module: converting labeled webpages into vectors in the above manner, and using the vectors as input to train the neural network;
a neural network identification module: converting the webpage to be detected into a vector by means of the feature vector calculation modules, inputting the vector into the trained neural network, and obtaining a classification result.
Compared with the prior art, the method and the system for supporting webpage classification provided by the invention have at least the following beneficial effects:
1) Compared with text-based techniques, the invention uses statistical information about HTML tags as part of the webpage features and therefore adapts better. For example, a text-based classifier built on a Chinese bag-of-words model cannot process English webpages, whereas tag statistics are insensitive to the language of the text. This eliminates the influence of inconsistent text languages and overcomes drawbacks of text-based webpage classification such as language differences.
2) The invention uses JS file information as part of the webpage features and therefore adapts better than classification techniques based on content such as text and images. For example, the text and images of a dynamically loaded webpage cannot be obtained directly from the page source, and the dynamically loaded code usually resides in JS files. Extracting features from the JS files therefore overcomes the inability of the prior art to classify dynamically loaded webpages well.
3) The invention distinguishes source files from embedded files and takes link information into account, which can improve classification accuracy. For example, among video websites, pages of a certain type use embedded frames to embed HTML files and dedicated JS files in order to evade conventional classification techniques and supervision, whereas a regular video website generally stores its video resources on a specific server, embeds no new HTML file, and uses generic JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of websites, which improves classification accuracy and allows finer-grained website classification.
4) The invention statically analyzes the webpage structure and the JS files, so dynamically loaded webpages can be classified with a neural network without executing the JS code, which supports large-scale webpage classification detection.
Drawings
FIG. 1 is a schematic diagram of the method for supporting webpage classification according to an embodiment;
FIG. 2 is a schematic flow chart of an implementation of the method for supporting webpage classification in the embodiment.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
The invention relates to a method for supporting webpage classification, which comprises the following steps:
A resource obtaining step: acquiring the HTML file and the JS file of a dynamically loaded webpage.
An HTML file feature vector calculation step: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file.
A JS file feature vector calculation step: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks.
A webpage feature vector calculation step: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage.
A neural network training step: converting labeled webpages into vectors in the above manner, and using the vectors as input to train the neural network.
A neural network identification step: converting the webpage to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.
Specifically, the resource obtaining step is as follows:
crawling and parsing: crawling the HTML (hypertext markup language) file of the dynamically loaded webpage, parsing its HTML tags, and obtaining the files with the JS suffix, the HTML files in frame tags, and the file paths of the HTML files;
classified downloading: determining each HTML file obtained from a frame tag to be an embedded HTML file, dividing the JS files into source JS files and embedded JS files according to whether the file path belongs to the source domain name, and downloading the files.
Specifically, the HTML file feature vector calculation step includes:
tag identification: identifying the HTML tags in the DOM tree and converting them into tag attribute units according to a rule, namely according to the type and position of each node in the DOM tree;
tag unit merging: merging tag units with the same attributes;
vector calculation: calculating the weight value of the corresponding tag for each merged tag unit set according to a preset rule, and constructing the HTML file feature vector. Specifically, the initial weight value of each tag type is set according to the importance of the tag type in the webpage, the weight value of each tag is calculated on the principle that the weight decreases with the depth difference, and the HTML file feature vector is constructed from these weight values.
Specifically, the JS file feature vector calculation step includes:
file parsing: parsing each function of the JS file to obtain a control flow graph. Specifically, an open-source toolkit is used to parse each function of the source JS files and the embedded JS files to obtain control flow graphs, and both kinds of file are processed in the same way. First, JS files contain information about the website, so they can assist webpage classification. Second, the source JS files and embedded JS files of different websites have different characteristics, and separating the two expresses the characteristics carried by a website's JS files more clearly, which better supports webpage classification;
function vector calculation: converting the basic blocks of the control flow graph into fixed-length vectors according to preset attribute parameter features, and calculating the vector of each function;
vector calculation: constructing the JS file feature vector from the function vectors according to a preset rule, for example by addition or averaging.
Specifically, in the webpage feature vector calculation step, the combination may be performed in any one or more ways; for example, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector may be spliced together in order.
Specifically, in the neural network training step, the webpages are annotated with labels, and each webpage may carry one or more labels. The network structure is not limited to a single form; for example, a simple stack of fully connected layers with the ReLU activation function may be used.
The invention also provides a system for supporting webpage classification, which comprises:
a resource acquisition module: used for acquiring the HTML file and the JS file of a dynamically loaded webpage;
an HTML file feature vector calculation module: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;
a JS file feature vector calculation module: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the basic block feature vectors;
a webpage feature vector calculation module: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage;
a neural network training module: converting labeled webpages into vectors in the above manner, and using the vectors as input to train the neural network;
a neural network identification module: converting the webpage to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.
The system for supporting webpage classification provided by the invention can be realized by the step flow of the method for supporting webpage classification. Those skilled in the art can understand the method for supporting webpage classification as a preferred example of the system for supporting webpage classification.
Preferred examples are further described below.
As shown in FIG. 1, the method for supporting webpage classification according to the embodiment of the present invention includes:
Step 101: calculating partial page features from the HTML file structure and from the JS code of the page, respectively;
Step 102: training a webpage classifier on the page features and performing detection.
Through this processing, the type of a dynamically loaded webpage can be identified: when an unknown web page is input, its category can be found quickly. As shown in FIG. 2, the method specifically includes the following processing steps:
Step 201: acquiring the HTML files and JS files. The dynamically loaded page is crawled, and files with the HTML and JS suffixes are obtained by parsing the src attributes of the iframe and script tags in the HTML; for example, the corresponding JS file is obtained by parsing an HTML tag such as <script src="view/JS/jquery-3.1.0.JS1.5"></script>. At the same time, the source HTML file, embedded HTML files, source JS files, and embedded JS files are distinguished according to the tag and the file path.
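The parsing part of step 201 can be sketched with the Python standard library as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ResourceCollector` and `classify_resources` are hypothetical, downloading is omitted, and "source" versus "embedded" JS is decided here simply by comparing the host of the script URL with the host of the page, which is only one possible reading of the domain-name rule.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects the src attributes of script and frame tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []  # src values of external scripts
        self.frames = []   # src values of embedded frames

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src:
            return  # inline script or frame without a src attribute
        if tag == "script":
            self.scripts.append(src)
        elif tag in ("iframe", "frame"):
            self.frames.append(src)


def classify_resources(page_url, html_text):
    """Assumption: a JS file is 'source' when its host equals the page host."""
    collector = ResourceCollector()
    collector.feed(html_text)
    source_host = urlparse(page_url).netloc
    result = {"source_js": [], "embedded_js": [], "embedded_html": []}
    for src in collector.scripts:
        absolute = urljoin(page_url, src)  # resolve relative paths
        same_host = urlparse(absolute).netloc == source_host
        result["source_js" if same_host else "embedded_js"].append(absolute)
    for src in collector.frames:
        result["embedded_html"].append(urljoin(page_url, src))
    return result
```

Calling `classify_resources(page_url, html_text)` returns three lists of absolute URLs that a downloader could then fetch. JS files referenced from inside an embedded HTML file would need a second pass over that file, which this sketch leaves out.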
Step 202: calculating a feature vector according to the DOM tree of the webpage:
The node tag types and node positions in the DOM tree of the HTML file are divided into tag units.
The following is an example of a webpage DOM tree. The root node html has two child nodes, head and body, each of which contains multiple tags.
<html>
<head>
<title>Web Page title</title>
<link href="https://www.test.com/1.png">
<link rel="stylesheet" href="https://www.test.com/1.css">
<script>...</script>
<style type="text/css">...</style>
<style type="text/css">...</style>
<style type="text/css">...</style>
<style type="text/css">...</style>
</head>
<body>
<div id="nuxt">...</div>
<script>...</script>
<script src="/static/1.js" defer></script>
<script src="/static/2.js" defer></script>
<script src="/static/3.js" defer></script>
<div class="popup">
<img src="/static/2.png">
<div class="byte">
<img src="/dynamic/1.png">
</div>
</div>
</body>
</html>
For example, with html at level 0, any script tag in the second level can be expressed as <script,2>, and likewise the first img tag that appears can be expressed as <img,3>.
After the DOM tree is divided into tag units, the tag units are converted into a feature vector through the following steps:
First, identical tag units are counted, where identical tag units are tag units with the same node tag type located at the same level. For example, with html at level 0, the units <script src="/static/1.js" defer> and <script src="/static/2.js" defer> under the body node both have the script tag type and both belong to the second level, so they are treated as the same tag unit. In this embodiment, tag units are merged in the form <type, level, count>, which gives the following tag feature units:
<html,0,1>
<head,1,1>
<body,1,1>
<title,2,1>
<link,2,2>
<style,2,4>
<script,2,5>
<div,2,2>
<div,3,1>
<img,3,1>
<img,4,1>
Since every web page has the following fixed format:
<html>
<head>
<title></title>
</head>
<body></body>
</html>
the html, head, title, and body tags carry no distinguishing information and are eliminated.
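The division into tag units, the merging into <type, level, count>, and the removal of the skeleton tags can be sketched as below. This is an illustrative sketch only: the class name `TagUnitCounter`, the function `tag_units`, and the short `VOID_TAGS` list are assumptions, and the depth tracking assumes reasonably well-formed markup in which every non-void tag is closed.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"img", "link", "meta", "br", "hr", "input"}  # tags with no closing tag
SKELETON = {"html", "head", "title", "body"}              # present on every page


class TagUnitCounter(HTMLParser):
    """Counts (tag type, level) pairs while walking the markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # html sits at level 0
        self.units = Counter()  # (tag, level) -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.units[(tag, self.depth)] += 1
        if tag not in VOID_TAGS:
            self.depth += 1     # children of this tag are one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def tag_units(html_text):
    """Returns merged tag units as {(type, level): count}, skeleton tags removed."""
    parser = TagUnitCounter()
    parser.feed(html_text)
    return {unit: n for unit, n in parser.units.items() if unit[0] not in SKELETON}
```

Applied to the example page above, the dictionary should correspond to the listed units without the html, head, title, and body entries, for instance `("script", 2)` mapped to 5 and `("img", 4)` mapped to 1.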
Secondly, a feature vector representation is constructed and the weight value of each corresponding tag is calculated.
The dimension of the feature vector is a fixed value preset according to the tag types. Here it may be 6, with each dimension holding the weight value of one tag type: <link, style, script, div, img, video>.
The weight value corresponding to each tag type is determined according to a preset rule. Specifically, the weight value represents the importance of the corresponding tag type in the webpage. Each occurrence of a tag type is assigned a weight value, and the final weight value is the sum of the weight values of its feature units. The weight value of a feature unit appearing in the webpage is determined by a predetermined rule, which includes:
the weight value of a tag type decreases with the depth difference, in the DOM tree, from the level of its first appearance. Because different tag types have different probabilities of appearing at a given level of the DOM tree, decreasing the weight by absolute depth would make the weight values of some tags too low. In practical applications, the weight values of the feature units can be decreased geometrically, and only feature units within a limited depth are considered.
For example, the weight value at the highest level of each tag type is preset to 1.0. If the second occurrence of the tag type is also at that highest level, its weight value is still 1.0; if the second occurrence is 2 levels below the highest level, its weight value is multiplied by the square of the attenuation factor (assuming the attenuation factor is preset to 0.5), giving 0.25.
Finally, the HTML file feature vector is calculated.
After the weight values of the tag features are determined, the value of each tag type in its dimension of the feature vector is determined from the weight values of that tag type in the DOM tree, and the feature vector corresponding to the webpage DOM tree is determined accordingly.
For example, in the above example the link, style, and script tags each appear in a single level and the video tag does not appear at all, so simple superposition gives the partial feature vector <2,4,5,div,img,0>. The div tag occurs 2 times in the second level, contributing 1 × 2 = 2, and 1 time in the third level, contributing 1 × 0.5 × 1 = 0.5, so its final weight is 2 + 0.5 = 2.5. Similarly, the final weight of img is 1.5. The final feature vector is <2,4,5,2.5,1.5,0>.
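The worked example can be reproduced with a short sketch. It assumes that the "first appearance" of a tag type is its shallowest level, that each occurrence at a level `offset` levels deeper contributes `decay ** offset`, and that the optional depth limit is expressed relative to that first level; the names `TAG_ORDER`, `html_feature_vector`, and `max_relative_depth` are hypothetical.

```python
TAG_ORDER = ("link", "style", "script", "div", "img", "video")


def html_feature_vector(units, decay=0.5, max_relative_depth=None):
    """units maps (tag, level) -> count; returns one weight per tag in TAG_ORDER."""
    vector = []
    for tag in TAG_ORDER:
        levels = {lvl: n for (t, lvl), n in units.items() if t == tag}
        if not levels:
            vector.append(0.0)  # tag type absent from the page
            continue
        first = min(levels)     # assumed: shallowest level = first appearance
        total = 0.0
        for lvl, n in levels.items():
            offset = lvl - first
            if max_relative_depth is not None and offset > max_relative_depth:
                continue        # ignore occurrences beyond the depth limit
            total += n * decay ** offset  # geometric decrease with depth difference
        vector.append(total)
    return vector


# Merged tag units of the example page, skeleton tags already removed.
units = {("link", 2): 2, ("style", 2): 4, ("script", 2): 5,
         ("div", 2): 2, ("div", 3): 1, ("img", 3): 1, ("img", 4): 1}
print(html_feature_vector(units))  # [2.0, 4.0, 5.0, 2.5, 1.5, 0.0]
```

Because the decay is relative to each tag type's own first level, img starts with weight 1.0 at level 3 even though it sits deeper than div, which is the effect the text attributes to avoiding an absolute-depth rule.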
Step 202: calculating a feature vector according to the CFG (Control Flow Graph) of the JS file. The implementation is based on an open-source toolkit; for example, parsing a JS file with ast-flow-graph directly generates the control flow graphs of all its functions.
After the control flow graphs are obtained, the feature vector of the JS file is determined as follows. The control flow graph of each function consists of basic blocks and the edges connecting them. Each basic block is characterized using statistical features, such as the number of string constants, the number of numeric constants, and the number of calls, and structural features, such as the number of successor basic blocks.
Here, the vector of a single basic block may have 4 dimensions, each holding the weight value of one feature of the basic block: <number of string constants, number of numeric constants, number of calls, number of successor basic blocks>.
The dimension of the feature vector generated for each basic block equals the number of features selected. The feature vectors of all basic blocks of a function are combined by some calculation, such as addition or averaging, to give the feature vector of the function, and the function feature vectors are combined in the same way to give the feature vector of the JS file.
The weight value corresponding to each feature is determined according to a predetermined rule. Specifically, the weight value represents the importance of the corresponding feature in the function.
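The aggregation from basic blocks to functions to a JS file can be sketched as follows. The dictionary layout of a basic block (`string_constants`, `numeric_constants`, `calls`, `successors`) is a hypothetical stand-in for whatever a CFG tool actually emits, and the per-feature weight values mentioned in the text are left at 1 for simplicity.

```python
def block_vector(block):
    """4-dimensional vector of one basic block (hypothetical dict layout)."""
    return [
        float(block["string_constants"]),
        float(block["numeric_constants"]),
        float(block["calls"]),
        float(len(block["successors"])),  # structural feature: outgoing edges
    ]


def combine(vectors, mode="sum"):
    """Element-wise sum or mean of equally long vectors."""
    if not vectors:
        return [0.0] * 4  # no blocks or functions: zero vector
    summed = [sum(column) for column in zip(*vectors)]
    if mode == "mean":
        return [value / len(vectors) for value in summed]
    return summed


def js_file_vector(functions, mode="sum"):
    """functions: list of control flow graphs, each a list of basic blocks."""
    function_vectors = [
        combine([block_vector(block) for block in cfg], mode) for cfg in functions
    ]
    return combine(function_vectors, mode)
```

Addition keeps information about the size of a file, while averaging makes files of different sizes comparable; the text allows either, so the choice is exposed as the `mode` parameter.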
Step 203: combining the obtained feature vectors to obtain the webpage feature vector. Specifically, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector are combined, for example spliced, to form the final feature vector of the webpage.
The simplest splicing method can be chosen here: the final feature vector of the webpage is the source HTML file feature vector, followed by the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector.
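The splicing can be sketched in a few lines. The dimensions 6 and 4 are taken from the example vectors in the text; the function name `page_feature_vector` and the length check are illustrative assumptions.

```python
HTML_DIM = 6  # <link, style, script, div, img, video>
JS_DIM = 4    # <string constants, numeric constants, calls, successors>


def page_feature_vector(source_html, embedded_html, source_js, embedded_js):
    """Concatenates the four partial vectors in a fixed order."""
    parts = [(source_html, HTML_DIM), (embedded_html, HTML_DIM),
             (source_js, JS_DIM), (embedded_js, JS_DIM)]
    vector = []
    for part, expected in parts:
        if len(part) != expected:
            raise ValueError(f"expected {expected} values, got {len(part)}")
        vector.extend(part)
    return vector
```

A page without embedded files could pass zero vectors of the right length for the embedded parts, so that every page yields a 20-dimensional vector; that convention is an assumption of this sketch rather than something the text specifies.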
Step 204: inputting the features into a neural network for training. The labeled webpage data set is converted into feature vectors, which are input into the neural network, and a suitable loss function is selected for training.
Because the dimensions of the HTML file feature vector and the JS file feature vector are determined by preset rules, the dimension of the webpage feature vector is fixed, and vector padding does not need to be considered.
In practice, take binary classification as an example, i.e. determining whether a webpage belongs to the blog class. A simple stack of fully connected layers with the ReLU activation function can be used, with a sigmoid function compressing the output to a probability in [0,1] that serves as the prediction result.
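The forward pass of such a network can be sketched in plain Python as follows. The layer sizes (20, 16, 8, 1), the random initialization, and all function names are assumptions made for illustration; training by backpropagation with a loss such as binary cross-entropy is omitted, and a real system would more likely use a deep learning framework.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(n_in, n_out, rng):
    """Small random weights; a trained model would supply learned values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, layers):
    """ReLU on the hidden layers, sigmoid on the single output unit."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(activations, weights, biases)[0])


rng = random.Random(0)
layers = [init_layer(20, 16, rng), init_layer(16, 8, rng), init_layer(8, 1, rng)]
probability = predict([1.0] * 20, layers)  # untrained, so the value is arbitrary
```

With untrained weights the output carries no meaning; the sketch only shows how a fixed-length webpage feature vector flows through the stacked layers to a single probability that can be thresholded, for example at 0.5, to decide whether the page belongs to the blog class.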
Step 205: inputting the webpage to be detected into the neural network, whose output is the category of that webpage.
Compared with text-based techniques, the invention uses statistical information about HTML tags as part of the webpage features and therefore adapts better. For example, a text-based classifier built on a Chinese bag-of-words model cannot process English webpages, whereas tag statistics are insensitive to the language of the text. This eliminates the influence of inconsistent text languages and overcomes drawbacks of text-based webpage classification such as language differences. The invention also uses JS file information as part of the webpage features and therefore adapts better than classification techniques based on content such as text and images. For example, the text and images of a dynamically loaded webpage cannot be obtained directly from the page source, and the dynamically loaded code usually resides in JS files. Extracting features from the JS files therefore overcomes the inability of the prior art to classify dynamically loaded webpages well.
The invention distinguishes source files from embedded files and takes link information into account, which can improve classification accuracy. For example, video websites that belong to the pornographic category use embedded frames to embed HTML files and proprietary JS files in order to evade conventional classification techniques and supervision, whereas regular video websites generally store their video resources on a specific server, embed no new HTML file, and use generic JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of websites, which improves classification accuracy and allows finer-grained website classification.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A method for supporting classification of web pages, comprising:
a resource obtaining step:
acquiring an HTML file and a JS file of a dynamically loaded webpage;
an HTML file feature vector calculation step:
calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;
a JS file feature vector calculation step:
acquiring a control flow graph containing attribute parameters in the JS file, converting basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;
a webpage feature vector calculation step:
combining the obtained HTML file feature vector and the JS file feature vector to obtain a feature vector of the webpage;
a neural network training step:
converting the labeled web pages into feature vectors of the labeled web pages according to the above steps, and taking the feature vectors as input to train the neural network;
a neural network identification step:
converting the web page to be detected into the feature vector of the web page to be detected according to the above steps, inputting the feature vector into the trained neural network, and acquiring a classification result.
2. The method for supporting webpage classification according to claim 1, wherein the resource obtaining step specifically comprises:
crawling and parsing: crawling the HTML file of the dynamically loaded webpage, parsing the HTML tags of the dynamically loaded webpage, and obtaining the file with the JS suffix, the HTML file in the frame tag, and the file path of the HTML file;
classified downloading: determining the HTML file obtained from the frame tag to be an embedded HTML file, dividing the JS files into source JS files and embedded JS files according to whether the file path of the embedded HTML file belongs to the source domain name, and downloading the JS files.
3. The method for supporting webpage classification according to claim 1, wherein the HTML file feature vector calculation step specifically comprises:
tag identification: identifying HTML tags in the DOM tree and converting the HTML tags into tag units;
tag unit merging: merging the tag units with the same attribute to obtain a merged tag unit set;
vector calculation: calculating the weight values of the corresponding tags for the merged tag unit set, and constructing the feature vector of the HTML file according to the weight values of the tags.
4. The method for supporting webpage classification according to claim 1, wherein the JS file feature vector calculation step specifically comprises:
file analysis: analyzing each function of the JS file to obtain a control flow graph;
function vector calculation: converting the basic blocks of the control flow graph into vectors of fixed length according to the preset attribute parameter features, and calculating the vectors of the functions;
vector calculation: constructing the JS file feature vector from the vectors of the functions.
5. A system for supporting classification of web pages, comprising:
a resource acquisition module: acquiring an HTML file and a JS file of a dynamically loaded webpage;
an HTML file feature vector calculation module: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;
a JS file feature vector calculation module: acquiring a control flow graph containing attribute parameters of the JS file, converting basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;
a webpage feature vector calculation module: combining the feature vector corresponding to the HTML file and the feature vector of the JS file to obtain the feature vector of the dynamically loaded webpage;
a neural network training module: converting the labeled web pages into vectors according to the calculation mode of each feature vector calculation module, and using the vectors as input to train the neural network;
a neural network identification module: converting the web page to be detected into a vector in the same manner, inputting the vector into the trained neural network, and acquiring a classification result.
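As a rough illustration of the JS file feature vector calculation and the combination of feature vectors recited in the claims above, the following Python sketch turns basic blocks into fixed-length vectors, aggregates them per function and per file, and concatenates the result with an HTML vector. The attribute names in `ATTRIBUTES`, the use of a simple mean for aggregation, and all function names are assumptions made for illustration only; the claims do not fix these choices, and the control flow graph is assumed to have been extracted already by a JS parser.

```python
from dataclasses import dataclass, field

# Hypothetical per-block attribute parameters.
ATTRIBUTES = ("num_statements", "num_calls", "num_string_constants")
BLOCK_DIM = len(ATTRIBUTES) + 1  # plus the number of successors


@dataclass
class BasicBlock:
    num_statements: int
    num_calls: int
    num_string_constants: int
    successors: list = field(default_factory=list)  # indices of next blocks


def block_vector(block):
    """Fixed-length vector: the attributes followed by the out-degree."""
    values = [float(getattr(block, name)) for name in ATTRIBUTES]
    return values + [float(len(block.successors))]


def mean_vectors(vectors, length):
    """Element-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * length
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def function_vector(blocks):
    return mean_vectors([block_vector(b) for b in blocks], BLOCK_DIM)


def js_file_vector(functions):
    return mean_vectors([function_vector(f) for f in functions], BLOCK_DIM)


def webpage_vector(html_vector, js_vector):
    """Combine the two feature vectors by concatenation (assumed)."""
    return list(html_vector) + list(js_vector)


# Two hypothetical functions of one JS file.
f1 = [BasicBlock(2, 1, 0, [1]), BasicBlock(4, 1, 2, [])]
f2 = [BasicBlock(1, 0, 0, [])]
js_vec = js_file_vector([f1, f2])
print(js_vec)                              # [2.0, 0.5, 0.5, 0.25]
print(webpage_vector([0.5, 0.5], js_vec))  # [0.5, 0.5, 2.0, 0.5, 0.5, 0.25]
```

A mean is used here only because it keeps the vector length fixed regardless of how many blocks or functions a file contains; a learned graph embedding would be another option.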
CN202111129758.1A 2021-09-26 2021-09-26 Method and system for supporting webpage classification Active CN113806667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111129758.1A CN113806667B (en) 2021-09-26 2021-09-26 Method and system for supporting webpage classification

Publications (2)

Publication Number Publication Date
CN113806667A true CN113806667A (en) 2021-12-17
CN113806667B CN113806667B (en) 2023-10-03

Family

ID=78938543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111129758.1A Active CN113806667B (en) 2021-09-26 2021-09-26 Method and system for supporting webpage classification

Country Status (1)

Country Link
CN (1) CN113806667B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223616B1 (en) * 2018-06-30 2019-03-05 Figleaf Limited System and method identification and classification of internet advertising
CN109684584A (en) * 2018-11-15 2019-04-26 北京海泰方圆科技股份有限公司 A kind of intelligent switch method of browser kernel, device, terminal and storage medium
CN111783016A (en) * 2020-07-03 2020-10-16 支付宝(杭州)信息技术有限公司 Website classification method, device and equipment
CN111881398A (en) * 2020-06-29 2020-11-03 腾讯科技(深圳)有限公司 Page type determination method, device and equipment and computer storage medium
CN112287272A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guo Miaoxia: "A Survey of Research on Chinese Web Page Classification", Journal of Chifeng University (Natural Science Edition), no. 12, pages 51 - 53 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127236A (en) * 2023-04-19 2023-05-16 远江盛邦(北京)网络安全科技股份有限公司 Webpage web component identification method and device based on parallel structure
CN116127236B (en) * 2023-04-19 2023-07-21 远江盛邦(北京)网络安全科技股份有限公司 Webpage web component identification method and device based on parallel structure

Also Published As

Publication number Publication date
CN113806667B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
US8046681B2 (en) Techniques for inducing high quality structural templates for electronic documents
US20210303641A1 (en) Artificial intelligence for product data extraction
US11907644B2 (en) Detecting compatible layouts for content-based native ads
CN103810251B (en) Method and device for extracting text
CN112417338B (en) Page adaptation method, system and equipment
CN108446136B (en) Element code extraction method and system
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
WO2023155303A1 (en) Webpage data extraction method and apparatus, computer device, and storage medium
CN109033282A (en) A kind of Web page text extracting method and device based on extraction template
CN116049597B (en) Pre-training method and device for multi-task model of webpage and electronic equipment
CN111881398A (en) Page type determination method, device and equipment and computer storage medium
JP2020098596A (en) Method, device and storage medium for extracting information from web page
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
CN114021042A (en) Webpage content extraction method and device, computer equipment and storage medium
CN114398138B (en) Interface generation method, device, computer equipment and storage medium
CN113806667B (en) Method and system for supporting webpage classification
CN111061975B (en) Method and device for processing irrelevant content in page
CN111914199A (en) Page element filtering method, device, equipment and storage medium
CN112632421B (en) Self-adaptive structured document extraction method
CN114625658A (en) APP stability test method, device, equipment and computer readable storage medium
US20210397663A1 (en) Data reduction in a tree data structure for a wireframe
EP2096561B1 (en) Method for extracting relevant content from a markup language file, in particular from a HTML file
AL-Ghuribi et al. Bi-languages mining algorithm for extraction useful web contents (BiLEx)
Thanadechteemapat et al. Automatic web content extraction for generating tag clouds from thai web sites
CN110990671B (en) Page type discrimination device and method and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant