CN112749360A - Webpage classification method and device - Google Patents

Info

Publication number
CN112749360A
Authority
CN
China
Prior art keywords
page
http request
attribute information
classification model
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911043232.4A
Other languages
Chinese (zh)
Inventor
赵一飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201911043232.4A priority Critical patent/CN112749360A/en
Publication of CN112749360A publication Critical patent/CN112749360A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a webpage classification method and a webpage classification device. Attribute information of each dimension is extracted from the HTTP request information corresponding to a web page to be classified, and a classification model obtained by pre-training analyzes that information to obtain the target page type of the web page. The scheme needs only the HTTP request information corresponding to the page, not the page content itself, and because a pre-trained classification model analyzes the HTTP request information, no rules have to be compiled manually for each page type. The scheme is therefore applicable to web pages with any page structure, and the page type can be determined accurately for such pages, which improves the accuracy of the scheme.

Description

Webpage classification method and device
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a webpage classification method and device.
Background
A web crawler is a program or script that automatically crawls the World Wide Web according to certain rules. When a web crawler crawls web pages, each discovered page needs to be classified as either a directory page or a content page, and pages of different types are processed in different ways. A directory page contains links to specific content pages, and its content is updated continually, whereas the content of a content page is hardly ever updated. For example, the home page of a news site is a directory page, and each link on it corresponds to a content page containing a news report.
Existing web crawlers generally classify a web page by analyzing the structure and content of the page after it has been crawled. Because the crawler must actually crawl the page before it can classify it, this approach is inefficient.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a method and an apparatus for classifying web pages, so as to solve the problem of low efficiency of the conventional web page classification method, and the specific technical solution is as follows:
In a first aspect, the present invention provides a method for classifying web pages, including:
acquiring HTTP request information corresponding to a webpage to be classified;
extracting attribute information of a preset dimension from the HTTP request information, wherein the attribute information is used for representing the characteristics of the HTTP request information related to the webpage type;
analyzing the attribute information of the preset dimension based on a classification model obtained by pre-training to obtain the page type of the webpage to be classified, wherein the page type is one of a directory page and a content page;
wherein the classification model is trained using HTTP request information corresponding to a plurality of pages of different page types, together with the page types of those pages.
In a possible implementation manner of the present invention, the process of training the classification model includes:
acquiring HTTP request sample data marked with a page type;
extracting attribute information of a preset dimension from each HTTP request sample data;
analyzing attribute information corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and iteratively optimizing the model parameters of the preset classification model, according to the labeled page type and the prediction type result corresponding to the same HTTP request sample data, until a preset convergence condition is met, to obtain a final classification model.
In another possible implementation manner of the present invention, analyzing the attribute information of the preset dimension to obtain the page type corresponding to the web page to be classified based on a classification model obtained by pre-training includes:
analyzing the attribute information of each dimension by using a classification model obtained by pre-training to obtain confidence coefficients that the web pages to be classified respectively belong to the directory page and the content page;
and determining the page type with the maximum confidence coefficient as a target page type to which the webpage to be classified belongs.
In another possible implementation manner of the present invention, the attribute information of the preset dimension in the HTTP request information includes at least one of the following:
the number of sections of the URL contained in the HTTP request information;
the character string length of the file name contained in the URL;
the ratio of symbol characters in the file name contained in the URL;
the ratio of numeric characters in the file name contained in the URL;
the extension of the file name contained in the URL;
whether the URL carries a date;
whether the URL contains a specified keyword;
the number of query parameters contained in the URL;
whether a query parameter name contained in the URL includes the characters "ID";
and the request predicate contained in the HTTP request information.
In another possible implementation manner of the present invention, the analyzing the attribute information of the preset dimension to obtain the page type of the web page to be classified based on the classification model obtained by pre-training includes:
converting attribute information of a preset dimension of the HTTP request information into a corresponding characteristic value;
and inputting the characteristic value corresponding to the HTTP request information into the classification model obtained by pre-training for analysis to obtain the page type of the webpage to be classified.
In another possible implementation manner of the present invention, converting attribute information of a preset dimension of the HTTP request information into a corresponding feature value includes:
discretizing the attribute information belonging to the numerical type to obtain corresponding characteristic values;
and mapping the non-numerical attribute information into corresponding characteristic values according to a preset mapping rule.
In a second aspect, the present invention further provides a web page classification apparatus, including:
the acquisition module is used for acquiring HTTP request information corresponding to the webpage to be classified;
the information extraction module is used for extracting attribute information of a preset dimension from the HTTP request information, wherein the attribute information is used for representing the characteristics of the HTTP request information related to the webpage type;
the analysis module is used for analyzing the attribute information of the preset dimension based on a classification model obtained through pre-training to obtain the page type of the webpage to be classified, wherein the page type is one of a directory page and a content page;
wherein the classification model is trained using HTTP request information corresponding to a plurality of pages of different page types, together with the page types of those pages.
In a possible implementation manner of the second aspect, the apparatus further includes:
the training sample acquisition module is used for acquiring HTTP request sample data marked with page types;
the sample information extraction module is used for extracting attribute information of a preset dimension from each HTTP request sample data;
the sample information analysis module is used for analyzing the attribute information corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and the model optimization module is used for iteratively optimizing the model parameters of the preset classification model, according to the labeled page type and the prediction type result corresponding to the same HTTP request sample data, until a preset convergence condition is met, to obtain a final classification model.
In a third aspect, the present invention also provides an apparatus, comprising: at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete mutual communication through the bus;
the processor is configured to call program instructions in the memory to execute the web page classification method according to any one of the possible implementations of the first aspect.
In a fourth aspect, the present invention further provides a storage medium, on which a program is stored, where the program is loaded and executed by a processor to implement the method for classifying web pages according to any one of the possible implementation manners of the first aspect.
According to the webpage classification method provided by the invention, the HTTP request information of a web page is analyzed to obtain the page type of the web page before the page itself is obtained. Attribute information of each dimension is extracted from the HTTP request information corresponding to the web page to be classified, and a classification model obtained by pre-training analyzes that information to obtain the target page type of the web page. The scheme needs only the HTTP request information corresponding to the page, not the page content itself, and because a pre-trained classification model analyzes the HTTP request information, no rules have to be compiled manually for each page type. The scheme is therefore applicable to web pages with any page structure, and the page type can be determined accurately for such pages, which improves the accuracy of the scheme.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a method for classifying web pages according to the present invention;
FIG. 2 is a flow chart of a process for training a classification model provided by the present invention;
FIG. 3 is a schematic structural diagram of a web page classification apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of another apparatus for classifying web pages according to the present invention;
FIG. 5 is a schematic structural diagram of an apparatus provided by the present invention.
Detailed Description
A web crawler finds web pages through their link addresses, with the route determined by a specific search algorithm. It generally starts from a certain page of a website, finds the other link addresses in that page, and then uses those link addresses to find the next web page; the process repeats until all web pages of the website have been crawled. Many current web crawler technologies therefore need to determine the type of a web page, that is, whether it is a directory page or a content page.
The conventional web page classification approach analyzes the structure and content of a crawled web page and classifies it accordingly. Its accuracy is high, but its efficiency is low because the page must actually be crawled. To improve classification efficiency, another approach determines the page type from the Hypertext Transfer Protocol (HTTP) request information before the specific page is crawled.
However, the inventor has found through research that current classification based on HTTP request information usually relies on rules compiled manually from experience. Web page layouts are diverse, and so are the HTTP requests of web pages, so manually compiled rules do not generalize well. Such rules also do not adapt to website redesigns: rules compiled from the old version of a site may not apply to the redesigned site, which results in low accuracy.
In order to solve this technical problem, the invention provides a webpage classification method that acquires the HTTP request information corresponding to a web page to be classified, extracts attribute information of each dimension from the HTTP request information, and analyzes it with a classification model obtained by pre-training to obtain the target page type to which the web page belongs. The scheme does not need to obtain the page itself, only the HTTP request information corresponding to the page, and because the HTTP request information is analyzed by a pre-trained classification model, no rules have to be compiled manually for each page type. The scheme is therefore applicable to web pages with any page structure, that is, it has high applicability and high accuracy.
Referring to fig. 1, a flowchart of a method for classifying web pages according to the present invention is shown, where the method is applied to a computer device, and the computer device may be a PC, a server, a mobile phone, or other devices. The method obtains the page type of the webpage by analyzing HTTP request information corresponding to the webpage.
As shown in fig. 1, the method may include the steps of:
S110, obtaining HTTP request information corresponding to the webpage to be classified.
The webpage to be classified refers to any webpage of which the page type needs to be determined.
A client requests access to a resource on a server, for example to open a web page of a website, by sending an HTTP request to the server. The HTTP request includes the Uniform Resource Locator (URL) of the resource to be accessed on the server.
In the application scenario of a web crawler crawling pages, after the web crawler obtains the HTTP request of a page to be crawled, it first analyzes that HTTP request to judge whether the page is a directory page or a content page, and then executes the corresponding crawling strategy.
S120, extracting attribute information of each dimension from the HTTP request information.
The attribute information is used for representing the characteristic of the HTTP request information related to the webpage type.
The HTTP request mainly includes the URL of the resource to be accessed and a request predicate (GET/POST). The components of the HTTP request are described below using an example. For the HTTP request http://www.aspxfans.com:80/news/index.asp?boardID=5&ID=24618&page=1, the components are as follows:
1) the transport protocol, which is used to transmit the information exchanged between the client and the server; the transport protocol used in this example is HTTP;
2) the domain name, which is the unique name of the physical server to be accessed and is mapped to a unique IP address by a domain name server; the domain name in this example is www.aspxfans.com;
3) the port, which distinguishes different services on the same server; the domain name and the port are separated by ":". The port is not a required part of a URL, and if it is omitted, a default port is used; the port in this example is "80";
4) the path of the resource, which is the location of the resource on the requested server and describes the directory hierarchy of the web server in Unix directory format; the path is not a required part of the URL; the path in this example is "/news/";
5) the name of the requested file, which may be any document type served by the server, such as an HTML page, a PDF file, an image, an audio file, or a video file; the file name in this example is "index.asp". In embodiments of the present invention, a web page is requested via HTTP, so the requested file is an HTML page. The file name is also not a required part of the URL, and if it is omitted, a default file name is used;
6) the query string, also called the search part or query part. The query string in this example is "boardID=5&ID=24618&page=1"; each parameter appears in the form "name=value", and the parameters are separated by "&".
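The decomposition above can be sketched with Python's standard `urllib.parse` module. This is only an illustration of the six components, not part of the described scheme; the function name `split_request_url` and the dictionary keys are assumptions introduced here.

```python
import posixpath
from urllib.parse import urlsplit, parse_qsl


def split_request_url(url):
    """Break a URL into the six components listed above (illustrative only)."""
    parts = urlsplit(url)
    # Separate the directory part of the path from the requested file name.
    directory, file_name = posixpath.split(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "path": directory,
        "file_name": file_name,
        "query": parse_qsl(parts.query),  # list of (name, value) pairs
    }


components = split_request_url(
    "http://www.aspxfans.com:80/news/index.asp?boardID=5&ID=24618&page=1"
)
```

For the example request, `components["path"]` is "/news/" and `components["file_name"]` is "index.asp", matching items 4) and 5).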
According to the above description of the composition of the HTTP request, determining the relatively important dimensions in the HTTP request mainly includes:
1. Number of URL sections;
The number of URL sections refers to the number of sections contained in the URL body, i.e., the number of pieces into which the URL body is divided by "/". The number of URL sections reflects the page hierarchy to some extent. For example, the HTTP request example http://www.aspxfans.com:80/news/index.asp?boardID=5&ID=24618&page=1 includes 3 sections.
2. The length of the string of the filename in the URL;
3. The ratio of numeric characters in the file name in the URL;
The larger the ratio of numeric characters in the file name, the more likely the page is a content page. In actual operation, the classification model ultimately determines the page type by combining the attribute information of all dimensions.
Similarly, to meet the requirements of model training, the ratio of numeric characters in the file name can be converted into a corresponding feature value.
4. The ratio of symbol characters in the file name in the URL;
This is the ratio of the number of symbol characters in the file name to the total number of characters in the file name in the URL.
Similarly, to meet the requirements of model training, the ratio of symbol characters in the file name can be discretized and converted into a corresponding feature value.
5. An extension of the file name in the URL;
For example, in the above example of an HTTP request, the extension of the file name is ".asp".
6. Whether the URL has a date;
Usually, a content page includes the release date of its content, so the URL corresponding to a content page has a high probability of carrying a date.
7. Whether the URL includes common keywords;
Commonly used keywords in URLs, such as list, category, and index, are analyzed, and a separate dimension indicating whether each keyword is present is established.
8. Number of URL query parameters; the number of query parameters contained in the URL.
9. Whether a URL query parameter name contains the characters "ID"; the query parameter "ID" here characterizes the context parameters of the page.
10. The HTTP request predicate (e.g., GET/POST).
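As a minimal sketch of how the ten dimensions above might be extracted, the following Python function uses only the standard library. Several details are assumptions made for illustration rather than taken from the description: the function name and dictionary keys, the date pattern, the keyword list (the text names list/category/index only as examples), counting the host as one section of the URL body so that the example yields 3 sections, and computing the character ratios over the full file name including its extension.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Assumed keyword list; the description gives list/category/index only as examples.
KEYWORDS = ("list", "category", "index")
# Assumed date pattern: YYYY, MM, DD with optional "-", "/" or "_" separators.
DATE_PATTERN = re.compile(
    r"(?:19|20)\d{2}[-/_]?(?:0[1-9]|1[0-2])[-/_]?(?:0[1-9]|[12]\d|3[01])"
)


def extract_attributes(url, method="GET"):
    """Extract the ten attribute dimensions from a URL and a request predicate."""
    parts = urlsplit(url)
    # Sections of the URL body (host plus path) separated by "/".
    sections = [s for s in (parts.netloc + parts.path).split("/") if s]
    file_name = parts.path.rsplit("/", 1)[-1]
    length = len(file_name)
    digits = sum(c.isdigit() for c in file_name)
    symbols = sum(not c.isalnum() for c in file_name)
    extension = "." + file_name.rsplit(".", 1)[1] if "." in file_name else ""
    params = parse_qsl(parts.query, keep_blank_values=True)
    path_lower = parts.path.lower()
    return {
        "section_count": len(sections),
        "file_name_length": length,
        "symbol_ratio": symbols / length if length else 0.0,
        "digit_ratio": digits / length if length else 0.0,
        "extension": extension,
        "has_date": DATE_PATTERN.search(parts.path + "?" + parts.query) is not None,
        "keywords": {kw: kw in path_lower for kw in KEYWORDS},
        "query_param_count": len(params),
        "has_id_param": any("id" in name.lower() for name, _ in params),
        "method": method.upper(),
    }


attrs = extract_attributes(
    "http://www.aspxfans.com:80/news/index.asp?boardID=5&ID=24618&page=1"
)
```

For the example request this gives 3 sections, a file name length of 9, the extension ".asp", 3 query parameters, and a parameter name containing "ID". The simple date pattern can produce false positives on long numeric identifiers, so a real deployment would likely need a stricter rule.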
In an embodiment of the present invention, the machine learning model cannot directly process the attribute information, and therefore the attribute information needs to be converted into feature values, and a conversion manner for converting the attribute information of each dimension into corresponding feature values may be determined according to the specification requirement of the adopted classification model on the input feature values.
For example, in one application scenario, numerical attribute information such as the string length of the file name, the ratio of numeric characters in the file name, and the ratio of symbol characters in the file name may be discretized to obtain the corresponding feature values. For example, the string length of the file name may be divided into the ranges less than 5, 5 to 10, 10 to 15, 15 to 20, and 20 or more.
For non-numerical attribute information, such as the extension of the file name and the common keywords, corresponding mapping rules can be set that specify the feature value for each extension or keyword, and a dimension indicating whether each keyword is present can thereby be established.
For yes/no attributes, such as whether a query parameter name contains the characters "ID" or whether the URL carries a date, the corresponding field may be set to "1" if yes and to "0" if no; other values may of course be set according to actual requirements.
Of course, in other application scenarios, the numerical attribute information may also be converted into the corresponding feature value in other manners, which is not limited herein.
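The conversion described above can be sketched as follows. The length ranges follow the example in the text; everything else is an assumption introduced for illustration: the ratio boundaries, the extension and predicate codes, the dictionary keys, and the function names `bucket` and `to_feature_vector`.

```python
from bisect import bisect_right

LENGTH_BOUNDS = (5, 10, 15, 20)   # range boundaries from the example in the text
RATIO_BOUNDS = (0.25, 0.5, 0.75)  # assumed boundaries for the character ratios
# Assumed mapping rule for non-numerical values; any consistent coding would do.
EXTENSION_CODES = {"": 0, ".html": 1, ".htm": 1, ".asp": 2, ".php": 3}
UNKNOWN_EXTENSION = 9
METHOD_CODES = {"GET": 0, "POST": 1}


def bucket(value, bounds):
    """Discretize a number into the index of the range it falls in."""
    return bisect_right(bounds, value)


def to_feature_vector(attrs):
    """Convert attribute information into a list of numeric feature values."""
    return [
        bucket(attrs["file_name_length"], LENGTH_BOUNDS),
        bucket(attrs["digit_ratio"], RATIO_BOUNDS),
        bucket(attrs["symbol_ratio"], RATIO_BOUNDS),
        EXTENSION_CODES.get(attrs["extension"], UNKNOWN_EXTENSION),
        1 if attrs["has_date"] else 0,
        1 if attrs["has_id_param"] else 0,
        METHOD_CODES.get(attrs["method"], len(METHOD_CODES)),
    ]


example = {
    "file_name_length": 9,
    "digit_ratio": 0.0,
    "symbol_ratio": 1 / 9,
    "extension": ".asp",
    "has_date": False,
    "has_id_param": True,
    "method": "GET",
}
vector = to_feature_vector(example)
```

With these assumed rules, a file name of length 9 falls into range index 1 (5 to 10), and the yes/no attributes become 1 or 0 as described.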
It should be noted that the process of converting the information of each dimension into the feature value may be integrated into the process of extracting the attribute information, and in this case, the extracted attribute information is in the form of the feature value.
In another possible implementation manner, the process of converting the information of each dimension into the feature value may be independent from the attribute information extraction process, in this case, the attribute information extracted from the HTTP request information is in the form of feature information, and then, the feature value conversion program converts the feature information into the corresponding feature value. In addition, the feature value conversion program can also be integrated in the process of analyzing the attribute information by using a machine learning model, and the specific conversion process is the same, and is not described in detail herein.
S130, analyzing the attribute information of each dimension based on a classification model obtained by pre-training to obtain the target page type to which the webpage to be classified belongs.
The classification model is trained using the HTTP request information corresponding to a plurality of pages of different page types, together with the page types of those pages. The purpose of training is to enable the classification model to learn the fitting relationship between the attribute information of each dimension of the HTTP request information and the page type.
The classification model in the embodiment of the present invention may be trained using a machine learning tool or algorithm, for example, Spark MLlib. MLlib includes general machine learning algorithms and tools, such as classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives. The scheme mainly uses the Spark MLlib tools to implement the classification function.
After the classification model has been trained, it applies the learned fitting relationship between the attribute information of each dimension and the page type to the attribute information extracted in the previous step from the HTTP request information of the web page to be classified, and thereby obtains the page type of that web page.
In an embodiment of the present invention, the classification model analyzes the feature values of each attribute information in the HTTP request information to obtain the confidence levels that the web pages corresponding to the HTTP request information respectively belong to the directory page and the content page, and determines the page type with the maximum confidence level as the page type to which the web page belongs.
For example, the classification model determines through analysis that the confidence that the web page is a content page is 0.65 and the confidence that it is a directory page is 0.36, so the web page is finally determined to be a content page.
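The selection rule in this example amounts to taking the page type with the largest confidence. A minimal sketch, in which the function name `pick_page_type` and the dictionary keys are assumptions introduced here:

```python
def pick_page_type(confidences):
    """Return the page type whose confidence is largest."""
    return max(confidences, key=confidences.get)


# Confidence values taken from the example in the text.
page_type = pick_page_type({"content": 0.65, "directory": 0.36})
```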
In the webpage classification method provided by this embodiment, the HTTP request information of a web page is analyzed to obtain the page type of the web page before the page itself is obtained. Attribute information of each dimension is extracted from the HTTP request information corresponding to the web page to be classified, and a classification model obtained by pre-training then analyzes the corresponding feature values to obtain the target page type of the web page. The scheme needs only the HTTP request information corresponding to the page, not the page itself, and because a pre-trained classification model analyzes the HTTP request information, no rules have to be compiled manually for each page type. The scheme is therefore applicable to web pages with any page structure, and the page type can be determined accurately for such pages, which improves the accuracy of the scheme.
In one embodiment of the present invention, as shown in fig. 2, the process of training the classification model is as follows:
S210, acquiring HTTP request sample data marked with the page type.
Acquiring sample data of a large number of webpages, wherein the sample data comprises HTTP requests of the webpages and page types labeled for the webpages. The page type of the web page can be manually marked, or can be determined by using a traditional web page classification mode and is manually verified. That is, it is necessary to ensure that the page types labeled for the respective web pages are accurate.
S220, extracting the attribute information of each dimension from each HTTP request sample data, and converting each piece of attribute information into a corresponding feature value.
The process of extracting the attribute information of each dimension from the HTTP request sample data is the same as the process of extracting the attribute information of each dimension in S120 described above, and is not described herein again.
It should be noted that, in an application scenario, the machine learning model may directly utilize the extracted attribute information of each dimension, and in this case, the extracted attribute information of each dimension may be directly utilized to train the classification model.
S230, analyzing the feature values corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data.
The feature values corresponding to the attribute information of each HTTP request sample data are input into a preset classification model, which may be implemented with Spark MLlib.
S240, repeatedly optimizing the model parameters of the preset classification model according to the labeled page type and the prediction type result corresponding to the same HTTP request sample data until a preset convergence condition is met, to obtain a final classification model.
The prediction type result and the labeled page type of the same HTTP request sample data are used to adjust the hyper-parameters (such as step size and number of iterations) of the preset classification model. This is an iterative optimization process that continues until a preset convergence condition is met, for example a limit on the number of iterations or a target accuracy of the classification result, and it finally yields a classification model capable of prediction.
The result of training the classification model is to enable the model to learn to obtain a fitting relationship between the attribute information of each dimension of the HTTP request information of the web page and the page type of the web page, and the fitting relationship can be embodied on the weight coefficient of the attribute information of each dimension. In other words, the result of training the classification model is to determine the weight coefficients corresponding to the attribute information of each dimension.
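The training steps S210 to S240 can be illustrated with a small self-contained sketch. The description does not name a specific algorithm, so logistic regression trained by gradient descent is used here purely as an assumed stand-in because it yields one weight coefficient per dimension; the function names, the three toy feature columns, the sample values, and the stopping tolerance are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, step=0.5, max_iters=5000, tol=1e-6):
    """Fit one weight per feature; stop when the loss no longer improves."""
    n, n_features = len(samples), len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    prev_loss = float("inf")
    for _ in range(max_iters):
        grad_w, grad_b, loss = [0.0] * n_features, 0.0, 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for j, xi in enumerate(x):
                grad_w[j] += (p - y) * xi
            grad_b += p - y
        weights = [w - step * g / n for w, g in zip(weights, grad_w)]
        bias -= step * grad_b / n
        loss /= n
        if abs(prev_loss - loss) < tol:  # assumed convergence condition
            break
        prev_loss = loss
    return weights, bias


def predict(weights, bias, x):
    """Return a confidence for each page type."""
    p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return {"content": p, "directory": 1.0 - p}


# Hypothetical feature values: [has_date, digit_ratio_range, has_keyword].
samples = [[1, 3, 0], [1, 2, 0], [0, 3, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = content page, 0 = directory page
weights, bias = train(samples, labels)
```

The returned `weights` list plays the role of the weight coefficients mentioned above, and `predict` produces the two confidences from which the larger one is chosen.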
In the training method for the classification model provided in this embodiment, the preset classification model is trained using the HTTP request information in the web page sample data and the pre-labeled page types, so that the model learns the fitting relationship between the HTTP request information and the page type of a web page. The HTTP request information can therefore be analyzed with the classification model, without manually compiling rules for each page type, so the scheme is applicable to web pages with any page structure, that is, it has high applicability and higher classification accuracy.
Corresponding to the embodiment of the webpage classification method, the invention also provides an embodiment of a webpage classification device.
Fig. 3 shows a schematic structural diagram of a web page classification apparatus provided by the present invention. The apparatus is applied to a computer device, for example a PC or a server. As shown in fig. 3, the apparatus may include: an acquisition module 110, an information extraction module 120, and an analysis module 130.
The acquisition module 110 is configured to obtain HTTP request information corresponding to a web page to be classified.
An information extraction module 120, configured to extract attribute information of a preset dimension from the HTTP request information.
The attribute information is used for representing the characteristic of the HTTP request information related to the webpage type.
In one embodiment of the invention, the attribute information may be the extracted information itself; alternatively, in other embodiments of the present invention, the attribute information may be a feature value converted from the extracted information.
In a possible implementation manner of the present invention, the attribute information of each dimension may include the following contents:
the number of segments of the URL contained in the HTTP request information;
the length of the character string of the file name contained in the URL;
the ratio of the symbol characters in the file name contained in the URL;
the ratio of the number characters in the file name contained in the URL;
an extension of the file name contained in the URL;
whether the URL carries a date;
whether the URL contains a specified keyword;
the number of query parameters contained by the URL;
whether the query parameter name contained in the URL contains the character "ID";
the request method (verb) contained in the HTTP request information.
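The attribute dimensions listed above can be sketched with Python's standard `urllib.parse` module. This is only an illustration under stated assumptions: the last path segment is treated as the file name, the date pattern and the keyword list are hypothetical, and the function name `extract_attributes` does not come from the original text.

```python
import posixpath
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical date pattern: YYYY, MM, DD with optional -, / or _ separators.
DATE_PATTERN = re.compile(
    r"(19|20)\d{2}[-/_]?(0[1-9]|1[0-2])[-/_]?(0[1-9]|[12]\d|3[01])"
)
# Hypothetical "specified keywords".
KEYWORDS = ("list", "index", "category")


def extract_attributes(method, url):
    """Return one value per attribute dimension for a single HTTP request."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the last path segment is the file name.
    file_name = segments[-1] if segments else ""
    _, ext = posixpath.splitext(file_name)
    length = len(file_name)
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "segment_count": len(segments),
        "file_name_length": length,
        "symbol_ratio": sum(not c.isalnum() for c in file_name) / length if length else 0.0,
        "digit_ratio": sum(c.isdigit() for c in file_name) / length if length else 0.0,
        "extension": ext.lstrip(".").lower(),
        "has_date": bool(DATE_PATTERN.search(url)),
        "has_keyword": any(k in url.lower() for k in KEYWORDS),
        "query_param_count": len(params),
        "has_id_param": any("id" in name.lower() for name, _ in params),
        "method": method.upper(),
    }


attrs = extract_attributes(
    "get", "https://example.com/news/2019/10/30/article-12345.html?id=7&page=2"
)
```

Note that only the request line is inspected; the page body is never downloaded.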
The analysis module 130 is configured to analyze the attribute information of the preset dimension, based on a pre-trained classification model, to obtain the page type of the web page to be classified.
The page type includes one of a directory page and a content page. The classification model is trained using the HTTP request information and the page types corresponding to a plurality of pages with different page types.
In a possible implementation manner of the present invention, the classification model may directly utilize the extracted information of each dimension, and in this case, the extracted attribute information of each dimension may be directly input into the classification model to be analyzed to obtain the page type of the web page to be classified.
In another possible implementation manner of the present invention, the classification model cannot directly use the information extracted by the information extraction module 120, and therefore, the information of each dimension needs to be converted into a corresponding feature value, and then the feature value is input into the classification model to be analyzed to obtain the page type corresponding to the web page to be classified. The process of converting into the characteristic value can be determined according to the requirements of the classification model, for example, in an application scenario, the attribute information belonging to the numerical type can be subjected to discretization processing to obtain the corresponding characteristic value; and mapping the non-numerical attribute information into corresponding characteristic values according to a preset mapping rule.
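The conversion into feature values can be sketched as follows. The bin edges and the mapping tables are hypothetical placeholders for the "preset mapping rule"; the text does not specify concrete values.

```python
import bisect

# Hypothetical bin edges for discretizing numerical attributes.
LENGTH_BIN_EDGES = [5, 10, 20, 40]
RATIO_BIN_EDGES = [0.1, 0.3, 0.6]

# Hypothetical preset mapping rules for non-numerical attributes.
EXTENSION_CODES = {"": 0, "html": 1, "htm": 1, "php": 2, "jsp": 3}
METHOD_CODES = {"GET": 0, "POST": 1}
UNKNOWN_CODE = -1  # used for values absent from a mapping table


def discretize(value, edges):
    # Index of the bin that the numerical value falls into.
    return bisect.bisect_right(edges, value)


def map_category(value, table):
    # Look up the feature value of a non-numerical attribute.
    return table.get(value, UNKNOWN_CODE)


def to_feature_values(attrs):
    return [
        discretize(attrs["file_name_length"], LENGTH_BIN_EDGES),
        discretize(attrs["digit_ratio"], RATIO_BIN_EDGES),
        map_category(attrs["extension"], EXTENSION_CODES),
        map_category(attrs["method"], METHOD_CODES),
        int(attrs["has_date"]),
    ]


features = to_feature_values(
    {
        "file_name_length": 18,
        "digit_ratio": 0.28,
        "extension": "html",
        "method": "GET",
        "has_date": True,
    }
)
```

Discretizing into a small number of bins keeps the numerical dimensions on a comparable scale, which is one plausible reason for this step.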
In a possible implementation manner, the classification model analyzes the characteristic value of each attribute information in the HTTP request information to obtain the confidence levels that the web page corresponding to the HTTP request information belongs to the directory page and the content page, respectively, and determines the page type with the maximum confidence level as the page type to which the web page belongs.
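Selecting the page type with the maximum confidence can be sketched as below. Converting raw model scores into confidences with a softmax is an assumption made for illustration; the names `PAGE_TYPES` and `pick_page_type` are hypothetical.

```python
import math

PAGE_TYPES = ("directory", "content")


def softmax(scores):
    # Subtract the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_page_type(scores):
    """Return the page type with the highest confidence and all confidences."""
    confidences = dict(zip(PAGE_TYPES, softmax(scores)))
    # On a tie, max() keeps the first type in PAGE_TYPES.
    best = max(confidences, key=confidences.get)
    return best, confidences


page_type, confidences = pick_page_type([0.2, 1.5])
```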
The web page classification device provided in this embodiment analyzes the HTTP request information of a web page to obtain its page type before the page itself is retrieved. It extracts attribute information of each dimension from the HTTP request information corresponding to the web page to be classified, and then analyzes the corresponding feature values with the pre-trained classification model to obtain the target page type of the web page to be classified. The scheme therefore needs only the HTTP request information corresponding to the page, not the page itself, and because the pre-trained classification model performs the analysis, the rules corresponding to each page type do not need to be worked out manually. As a result, the scheme is applicable to web pages with any page structure and can determine their page types with higher accuracy.
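The three-module structure can be sketched end to end as follows. This is only an illustrative outline: the function names mirror the acquisition, information extraction, and analysis modules, while `toy_model` is a hand-set stand-in for a pre-trained classification model and its coefficients are invented for the example.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit


@dataclass
class HttpRequestInfo:
    method: str
    url: str


def acquire(raw_request_line):
    # Acquisition module: parse "METHOD URL VERSION" without fetching the page.
    method, url = raw_request_line.split()[:2]
    return HttpRequestInfo(method.upper(), url)


def extract(info):
    # Information extraction module: a few of the listed dimensions.
    parts = urlsplit(info.url)
    segments = [s for s in parts.path.split("/") if s]
    params = parse_qsl(parts.query, keep_blank_values=True)
    has_id = any("id" in name.lower() for name, _ in params)
    return [float(len(segments)), float(len(params)), float(has_id)]


def toy_model(features):
    # Stand-in for a trained model; the coefficients are hypothetical.
    score = 0.2 * features[0] + 0.6 * features[2]
    content = min(max(score, 0.0), 1.0)
    return {"directory": 1.0 - content, "content": content}


def analyze(features, model):
    # Analysis module: pick the page type with the highest confidence.
    confidences = model(features)
    return max(confidences, key=confidences.get)


def classify_request(raw_request_line, model=toy_model):
    return analyze(extract(acquire(raw_request_line)), model)


result = classify_request("GET https://example.com/news/2019/item.html?id=42 HTTP/1.1")
```

Passing the model in as a parameter keeps the pipeline independent of how the classification model was trained.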
Fig. 4 shows a schematic structural diagram of another web page classification apparatus provided by the present invention. On the basis of the embodiment shown in fig. 3, the apparatus may further include: a training sample acquisition module 210, a sample information extraction module 220, a sample information analysis module 230, and a model optimization module 240.
A training sample acquisition module 210, configured to obtain HTTP request sample data labeled with a page type;
a sample information extraction module 220, configured to extract attribute information of a preset dimension from each HTTP request sample data;
the sample information analysis module 230 is configured to analyze the attribute information corresponding to each piece of HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and the model optimization module 240 is configured to iteratively optimize the model parameters of the preset classification model until a preset convergence condition is met according to the labeled page type corresponding to the same HTTP request sample data and the prediction type result, so as to obtain a final classification model.
Training enables the classification model to learn the fitting relationship between the attribute information of each dimension of a web page's HTTP request information and the page type of that web page; this relationship is embodied in the weight coefficient of the attribute information of each dimension. In other words, training the classification model determines the weight coefficient corresponding to the attribute information of each dimension.
The web page classification device provided in this embodiment trains a preset classification model using the HTTP request information in the web page sample data and the pre-labeled page types of the web pages, so that the model learns the fitting relationship between the HTTP request information and the page type of a web page. Because the HTTP request information is then analyzed by the classification model, the rules corresponding to each page type do not need to be worked out manually. The scheme is therefore applicable to web pages with any page structure and achieves higher classification accuracy.
The web page classification device comprises a processor and a memory. The acquisition module, the information extraction module, the analysis module, the training sample acquisition module, the sample information extraction module, the sample information analysis module, the model optimization module and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting the kernel parameters, the page type of the web page to be classified is obtained by analyzing the HTTP request information corresponding to that web page.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the web page classification method when executed by a processor.
An embodiment of the invention provides a processor configured to run a program, wherein the web page classification method is executed when the program runs.
An embodiment of the invention provides a device, which may be a server, a PC, a PAD, a mobile phone, or the like. As shown in fig. 5, the device includes at least one processor 310, at least one memory 320 connected to the processor 310, and a bus 330; the processor 310 and the memory 320 communicate with each other through the bus 330; the processor 310 is configured to call program instructions in the memory 320 to execute the web page classification method described above.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
acquiring HTTP request information corresponding to a webpage to be classified;
extracting attribute information of a preset dimension from the HTTP request information, wherein the attribute information is used for representing the characteristic of the HTTP request information related to the webpage type;
analyzing the attribute information of the preset dimension based on a pre-trained classification model to obtain the page type of the webpage to be classified, wherein the page type comprises one of a directory page and a content page;
the classification model is trained using the HTTP request information and the page types corresponding to a plurality of pages with different page types.
In a possible implementation manner of the present invention, the process of training the classification model includes:
acquiring HTTP request sample data marked with a page type;
extracting attribute information of a preset dimension from each HTTP request sample data;
analyzing attribute information corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and iteratively optimizing the model parameters of the preset classification model until a preset convergence condition is met according to the labeled page type corresponding to the same HTTP request sample data and the prediction type result to obtain a final classification model.
In another possible implementation manner of the present invention, analyzing the attribute information of the preset dimension to obtain the page type corresponding to the web page to be classified based on a classification model obtained by pre-training includes:
analyzing the attribute information of each dimension by using a classification model obtained by pre-training to obtain confidence coefficients that the web pages to be classified respectively belong to the directory page and the content page;
and determining the page type with the maximum confidence coefficient as a target page type to which the webpage to be classified belongs.
In another possible implementation manner of the present invention, the attribute information of the preset dimension in the HTTP request information includes at least one of the following:
the number of segments of the URL contained in the HTTP request information;
the character string length of the file name contained in the URL;
the ratio of the symbol characters in the file name contained in the URL;
the ratio of the number characters in the file name contained in the URL;
an extension of a file name contained in the URL;
whether the URL carries a date;
whether the URL contains a specified keyword;
the number of query parameters contained by the URL;
whether a query parameter name contained in the URL contains the characters "ID";
and the request method (verb) contained in the HTTP request information.
In another possible implementation manner of the present invention, the analyzing the attribute information of the preset dimension to obtain the page type of the web page to be classified based on a classification model obtained by pre-training includes:
converting attribute information of a preset dimension of the HTTP request information into a corresponding characteristic value;
and inputting the characteristic value corresponding to the HTTP request information into the classification model obtained by pre-training for analysis to obtain the page type of the webpage to be classified.
In another possible implementation manner of the present invention, converting the attribute information of the preset dimension of the HTTP request information into a corresponding feature value includes:
discretizing the attribute information belonging to the numerical type to obtain corresponding characteristic values;
and mapping the non-numerical attribute information into corresponding characteristic values according to a preset mapping rule.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for classifying web pages, comprising:
acquiring HTTP request information corresponding to a webpage to be classified;
extracting attribute information of a preset dimension from the HTTP request information, wherein the attribute information is used for representing the characteristic of the HTTP request information related to the webpage type;
analyzing the attribute information of the preset dimension based on a pre-trained classification model to obtain the page type of the webpage to be classified, wherein the page type comprises one of a directory page and a content page;
the classification model is trained using the HTTP request information and the page types corresponding to a plurality of pages with different page types.
2. The method of claim 1, wherein the process of training the classification model comprises:
acquiring HTTP request sample data marked with a page type;
extracting attribute information of a preset dimension from each HTTP request sample data;
analyzing attribute information corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and iteratively optimizing the model parameters of the preset classification model until a preset convergence condition is met according to the labeled page type corresponding to the same HTTP request sample data and the prediction type result to obtain a final classification model.
3. The method according to claim 1, wherein analyzing the attribute information of the preset dimension based on a pre-trained classification model to obtain the page type corresponding to the web page to be classified comprises:
analyzing the attribute information of each dimension by using a classification model obtained by pre-training to obtain confidence coefficients that the web pages to be classified respectively belong to the directory page and the content page;
and determining the page type with the maximum confidence coefficient as a target page type to which the webpage to be classified belongs.
4. The method according to claim 1, wherein the attribute information of the preset dimension in the HTTP request information includes at least one of:
the number of segments of the URL contained in the HTTP request information;
the character string length of the file name contained in the URL;
the ratio of the symbol characters in the file name contained in the URL;
the ratio of the number characters in the file name contained in the URL;
an extension of a file name contained in the URL;
whether the URL carries a date;
whether the URL contains a specified keyword;
the number of query parameters contained by the URL;
whether a query parameter name contained in the URL contains the characters "ID";
and the request method (verb) contained in the HTTP request information.
5. The method according to any one of claims 1 to 4, wherein the analyzing the attribute information of the preset dimension to obtain the page type of the web page to be classified based on a classification model obtained by pre-training comprises:
converting attribute information of a preset dimension of the HTTP request information into a corresponding characteristic value;
and inputting the characteristic value corresponding to the HTTP request information into the classification model obtained by pre-training for analysis to obtain the page type of the webpage to be classified.
6. The method according to claim 5, wherein converting the attribute information of the preset dimension of the HTTP request information into a corresponding feature value comprises:
discretizing the attribute information belonging to the numerical type to obtain corresponding characteristic values;
and mapping the non-numerical attribute information into corresponding characteristic values according to a preset mapping rule.
7. A web page classification apparatus, comprising:
the acquisition module is used for acquiring HTTP request information corresponding to the webpage to be classified;
the information extraction module is used for extracting attribute information of a preset dimension from the HTTP request information, wherein the attribute information is used for representing the characteristics of the HTTP request information related to the webpage type;
the analysis module is used for analyzing the attribute information of the preset dimensionality to obtain the page type of the webpage to be classified based on a classification model obtained through pre-training, wherein the page type comprises one of a directory page and a content page;
the classification model is trained using the HTTP request information and the page types corresponding to a plurality of pages with different page types.
8. The apparatus of claim 7, further comprising:
the training sample acquisition module is used for acquiring HTTP request sample data marked with page types;
the sample information extraction module is used for extracting attribute information of a preset dimension from each HTTP request sample data;
the sample information analysis module is used for analyzing the attribute information corresponding to each HTTP request sample data by using a preset classification model to obtain a prediction type result corresponding to the HTTP request sample data;
and the model optimization module is used for iteratively optimizing the model parameters of the preset classification model until a preset convergence condition is met according to the labeled page type corresponding to the same HTTP request sample data and the prediction type result to obtain a final classification model.
9. An apparatus, comprising: at least one processor, at least one memory connected to the processor, and a bus;
the processor and the memory complete mutual communication through the bus;
the processor is configured to call program instructions in the memory to perform the web page classification method of any of claims 1-6.
10. A storage medium having a program stored thereon, wherein the program, when loaded and executed by a processor, implements the method of classifying web pages according to any one of claims 1 to 6.
CN201911043232.4A 2019-10-30 2019-10-30 Webpage classification method and device Pending CN112749360A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911043232.4A CN112749360A (en) 2019-10-30 2019-10-30 Webpage classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911043232.4A CN112749360A (en) 2019-10-30 2019-10-30 Webpage classification method and device

Publications (1)

Publication Number Publication Date
CN112749360A true CN112749360A (en) 2021-05-04

Family

ID=75641668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911043232.4A Pending CN112749360A (en) 2019-10-30 2019-10-30 Webpage classification method and device

Country Status (1)

Country Link
CN (1) CN112749360A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190779A (en) * 2021-05-08 2021-07-30 北京百度网讯科技有限公司 Webpage evaluation method and device
CN115168755A (en) * 2022-07-26 2022-10-11 北京永信至诚科技股份有限公司 Abnormal data processing method and system based on URL (Uniform resource locator) characteristics
WO2023282848A1 (en) * 2021-07-07 2023-01-12 脸萌有限公司 Web page classification method and apparatus, storage medium, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
WO2016045378A1 (en) * 2014-09-26 2016-03-31 中兴通讯股份有限公司 Web page classifying method and device
CN107025230A (en) * 2016-01-29 2017-08-08 北京国双科技有限公司 The processing method and processing device of web crawlers
CN109284465A (en) * 2018-09-04 2019-01-29 暨南大学 A kind of Web page classifying device construction method and its classification method based on URL

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190779A (en) * 2021-05-08 2021-07-30 北京百度网讯科技有限公司 Webpage evaluation method and device
CN113190779B (en) * 2021-05-08 2023-07-28 北京百度网讯科技有限公司 Webpage evaluation method and device
WO2023282848A1 (en) * 2021-07-07 2023-01-12 脸萌有限公司 Web page classification method and apparatus, storage medium, and electronic device
CN115168755A (en) * 2022-07-26 2022-10-11 北京永信至诚科技股份有限公司 Abnormal data processing method and system based on URL (Uniform resource locator) characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination