CN110377810B

CN110377810B - Classification method of mobile terminal web pages

Info

Publication number: CN110377810B
Application number: CN201910554829.9A
Authority: CN
Inventors: 沈继忠; 邓立; 杜歆
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2022-04-08
Anticipated expiration: 2039-06-25
Also published as: CN110377810A

Abstract

The invention provides a classification method for mobile-side web pages. Mobile-side web pages have a simple list-type structure, their content mostly appears in the form of an information flow, and important information appears first. Based on these features, subject information is first extracted from the <title>, <meta> description, and <meta> keywords of the web page. An information flow is then positioned, where an information flow satisfies the conditions that the same HTML tag structure is repeated at least m times and the text content within the structure is longer than n characters. If the positioning succeeds, the information before the positioning position of the information flow is extracted and defined as header information; the content within the repeated structures of the information flow is subject-related, so the information in the first m repeated structures is extracted and defined as information flow information; the content after the first m repeated structures is similar, so it is defined as noise information and discarded. If the positioning fails, the information in HTML tags such as <Hn> and <a> is extracted directly. The extracted information is converted into vectors, which are input to a classifier to train a classification model and then perform classification.

Description

Classification method of mobile terminal web pages

Technical Field

The invention relates to the field of webpage classification, in particular to a mobile terminal webpage classification method.

Background

There are abundant information resources on the network, and the amount of information has grown explosively over time. Web page classification facilitates web page information retrieval and management, for example developing and maintaining web page catalogs, improving search engine quality, and filtering web page content. According to the China Internet Development Report 2018 published by the China Internet Association, the number of Chinese internet users reached 772 million by 2017, of which 753 million were mobile phone users. The proportion of internet users who go online through mobile phones is as high as 97.5%, while the proportions for desktop computers and notebook computers are 53.0% and 35.8%, respectively; mobile internet users have reached, and even surpassed, the number of personal computer users. We refer to the personal computer as the desktop side, in contrast to the mobile side. In the mobile internet era, people increasingly browse web pages on mobile devices such as mobile phones and tablet computers, so a classification method for mobile-side web pages has great application value.

At present, most mobile-side web pages differ markedly from desktop-side web pages, including in page structure. Because the screen of a mobile terminal is smaller than that of a desktop terminal, the content of a mobile-side web page is more prominent and its structure is simpler. FIG. 1 is the desktop-side web page of Sina Finance and FIG. 2 is the mobile-side web page of Sina Finance. Comparison shows that the desktop-side page has a complex, multi-level nested structure that includes top-bottom, two-column, and three-column layouts, whereas the mobile-side page has a simple list-type structure in which content appears line by line like a list, so that it can be presented clearly on the small screen of a mobile terminal. FIG. 3 shows the mobile-side web pages of Zhejiang University, Sina Finance, and Hupu Football; the content in the boxes repeats with the same structure, and the information appears in the form of a 'stream'. This repeated appearance of the same structure on the mobile side is called an information flow. The screen of a mobile terminal is vertical and the web page is very long, and important information appears at the front, because web page designers tend to put important content first to attract users and keep them browsing.

Technologies related to mobile-side web pages mainly include mobile-side web page design, web page reorganization, and zoom-based interaction. Mobile-side web page design uses various mobile markup languages and interactive tools to help web designers manually create and optimize mobile-side websites; web page reorganization intelligently adapts desktop-side web pages to the mobile side so that they can be browsed conveniently there; zoom-based interaction displays a web page on the mobile screen as a summary and allows the user to zoom in on a particular section to read it in detail. These technologies mainly concern browsing mobile-side web pages; at present there is no classification method designed for the features of mobile-side web pages.

Disclosure of Invention

In view of the current lack of a classification method for mobile-side web pages, the invention provides a classification method based on the features of mobile-side web pages, which improves the accuracy of mobile-side web page classification and solves the problem that desktop-side web page classification methods are not suitable for the mobile side.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows: a classification method for mobile terminal web pages comprises the following steps:

Step (1), extracting the subject information of the web page, the subject information being located in the <title>, <meta> description, and <meta> keywords of the web page, and setting its weight to w₁;

Step (2), positioning an information flow, wherein an information flow satisfies the conditions that the same HTML tag structure appears repeatedly at least m times and the text content within the structure is longer than n characters; m is an integer from 3 to 5 inclusive, and n is an integer of 6 or more;

Step (3), if the positioning succeeds, extracting the information before the positioning position of the information flow, defining it as header information, and setting its weight to w₂; the content within the repeated structures of the information flow is subject-related information, so the information in the first m repeated structures of the information flow is extracted, defined as information flow information, and given the weight w₃; the content after the first m repeated structures of the information flow is similar, so it is defined as noise information and discarded. If the positioning fails, the information in HTML tags such as <Hn> and <a> is extracted directly. The weights w₁, w₂, and w₃ are used to measure the importance of the information;

Step (4), converting the extracted information into vectors, inputting the vectors into a classifier to train a classification model, and then classifying.

Further, the step (1) is specifically as follows:

<title> is the title of the web page, <meta> keywords provides its keywords, and <meta> description describes its content; all three are provided by the website designer, and the subject information of the web page is extracted from them.

Further, the step (2) is specifically as follows:

The HTML tags of the web page are traversed, and a tag is added to the list List_all if the text length within the tag is greater than n. List_all is then traversed to check whether HTML tags with the same structure exist; if tags with the same structure appear, it is checked whether they repeat m times, and if so, the positioning succeeds. If the traversal of List_all finishes and no HTML tag structure has repeated m times, the positioning fails.

Further, m is preferably 3, and n is preferably 8.

Further, the step (3) is specifically as follows:

In the HTML document, the text before the positioning position of the information flow is defined as header information, the text in the m repeated structures starting at the positioning position is defined as information flow information, and the content after those m repeated structures is discarded.

Further, the step (4) is specifically as follows:

Text information is converted into vectors using word2vec, a classification model is trained using a support vector machine (SVM), and classification is then performed.
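The text does not say how the weights enter the vector representation. The following is a minimal sketch of one assumed option, a weight-scaled average of word vectors. The function name `page_vector`, the toy `embeddings` table (standing in for a trained word2vec model), and whitespace tokenization are all assumptions for illustration; the SVM training step is not shown.

```python
def page_vector(weighted_texts, embeddings, dim):
    """Combine (text, weight) pairs into one page vector.

    Assumption: each known token contributes its embedding scaled by the
    weight of the text it came from, and the sum is normalized by the
    total weight. Unknown tokens are skipped.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for text, weight in weighted_texts:
        # Whitespace tokenization is a simplification; Chinese text would
        # need a word segmenter first.
        for token in text.lower().split():
            vec = embeddings.get(token)
            if vec is None:
                continue
            for i in range(dim):
                total[i] += weight * vec[i]
            weight_sum += weight
    if weight_sum == 0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Toy 2-dimensional embedding table, hypothetical values.
toy_embeddings = {"football": [1.0, 0.0], "news": [0.0, 1.0]}
vector = page_vector([("football", 3), ("news", 1)], toy_embeddings, dim=2)
```

In a fuller pipeline, vectors produced this way for labeled pages would be passed to an SVM for training, and vectors for new pages would be passed to the trained model for prediction.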

The beneficial effects of the invention are as follows: the method effectively extracts information from mobile-side web pages based on their features, thereby improving mobile-side web page classification. The method solves the problem that desktop-side web page classification methods are not suitable for the mobile side, and compared with applying a desktop-side classification method to mobile-side web pages, it achieves higher classification accuracy.

Drawings

FIG. 1 is the desktop-side web page of Sina Finance;

FIG. 2 is the mobile-side web page of Sina Finance;

FIG. 3 shows the mobile-side web pages of Zhejiang University, Sina Finance, and Hupu Football;

FIG. 4 shows the classification method for mobile-side web pages;

FIG. 5 is a flow chart of information flow positioning;

FIG. 6 shows the information flow positioning for the mobile-side web page of Zhejiang University;

FIG. 7 shows the information flow positioning for the mobile-side web page of Sina Finance;

FIG. 8 shows the information flow positioning for the mobile-side web page of Hupu Football.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. In the following description and in the drawings, the same numbers in different drawings identify the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of methods consistent with certain aspects of the invention, as detailed in the appended claims. Various embodiments of the present description are described in an incremental manner.

As shown in fig. 4, the present invention provides a method for classifying a mobile-end web page, which includes the following steps:

(1) Extracting subject information

The subject information is located in the <title>, <meta> description, and <meta> keywords of the web page. It is provided by the web page designer, expresses what the designer intends to convey, and is relatively more accurate than other information in the web page. The weight w₁ is set to 3 based on empirical values and experiments.
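The extraction in this step can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only; the class name `SubjectInfoParser` and the function `extract_subject_info` are hypothetical, and the default weight of 3 follows the value of w₁ given above.

```python
from html.parser import HTMLParser


class SubjectInfoParser(HTMLParser):
    """Collects <title> text and the content of <meta name="description">
    and <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.subject = {"title": "", "description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.subject[name] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.subject["title"] += data.strip()


def extract_subject_info(html, w1=3):
    """Return (text, weight) pairs for the non-empty subject fields."""
    parser = SubjectInfoParser()
    parser.feed(html)
    return [(text, w1) for text in parser.subject.values() if text]


page = (
    "<html><head><title>Campus News</title>"
    '<meta name="description" content="Latest news">'
    '<meta name="keywords" content="news,campus">'
    "</head><body></body></html>"
)
subject_info = extract_subject_info(page)
```

Each returned pair carries the weight w₁ so that later steps can treat subject information as more important than header or information flow text.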

(2) Positioning information flow

An information flow is positioned, where an information flow satisfies the conditions that the same HTML tag structure is repeated 3 times and the text content within the structure is longer than 8 characters. As shown in FIG. 5, the HTML tags of the web page are traversed, and a tag is added to the list List_all if the text length within the tag is greater than 8. List_all is then traversed to check whether HTML tags with the same structure exist; if tags with the same structure appear, it is checked whether they repeat 3 times, and if so, the positioning succeeds. If the traversal of List_all finishes and no HTML tag structure has been found to repeat 3 times, the positioning fails.

The information flow positioning algorithm comprises the following steps:

Algorithm 1: Information flow positioning

Input: HTML document of the web page

Output: list of HTML tags; sequence number of the positioning position; text at the positioning position

Parameters: List_all is the list of all HTML tags; List is the list of tags in the repeated structure
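The body of Algorithm 1 is not reproduced in this text, so the following is only a sketch of how the positioning described above might be implemented, not the patented algorithm itself. It assumes that "same structure" means an identical nesting of tag names and that the repeats are consecutive sibling elements; the names `Node`, `TreeBuilder`, and `locate_information_flow` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.items = tag, parent, []

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        return "".join(i.text() if isinstance(i, Node) else i for i in self.items)

    def signature(self):
        # Assumed notion of "same structure": tag name plus child signatures.
        return (self.tag, tuple(c.signature() for c in self.children))


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.items.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.items.append(data.strip())


def locate_information_flow(html, m=3, n=8):
    """Return (index in List_all, the m repeated nodes), or None on failure."""
    builder = TreeBuilder()
    builder.feed(html)
    # List_all: elements in document order whose text is longer than n.
    list_all, stack = [], [builder.root]
    while stack:
        node = stack.pop()
        if node is not builder.root and len(node.text()) > n:
            list_all.append(node)
        stack.extend(reversed(node.children))
    for idx, node in enumerate(list_all):
        siblings = node.parent.children
        start = siblings.index(node)
        run = siblings[start:start + m]
        if len(run) == m and all(
            s.signature() == node.signature() and len(s.text()) > n for s in run
        ):
            return idx, run
    return None


page = (
    "<html><body><h1>Hupu Football</h1><ul>"
    "<li><a>First headline of the day</a></li>"
    "<li><a>Second headline of the day</a></li>"
    "<li><a>Third headline of the day</a></li>"
    "<li><a>Fourth headline of the day</a></li>"
    "</ul></body></html>"
)
result = locate_information_flow(page)
```

With m = 3 and n = 8, the sketch reports the first `<li>` as the positioning position and returns the first three `<li>` elements as the repeated structure; a page with no such run yields `None`, which corresponds to a positioning failure.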

(3) Extracting other information

If the positioning succeeds, the information before the positioning position of the information flow is extracted and defined as header information, and the weight w₂ is set to 2. The content within the repeated structures of the information flow is subject-related information, so the information in the first 3 repeated structures of the information flow is extracted and defined as information flow information, and the weight w₃ is set to 1. The content after the first 3 repeated structures of the information flow is similar, so it is defined as noise information and discarded. If the positioning fails, the information in HTML tags such as <Hn> and <a> is extracted directly.
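The split into header information, information flow information, and noise can be sketched as a simple slicing step. This is a minimal sketch under stated assumptions: the function `weight_page_information` and its inputs are hypothetical, `blocks` is assumed to be the page's text blocks in document order, and the weight of 1 used for the fallback text is an assumption because the text does not specify one.

```python
def weight_page_information(subject, blocks, location, m=3,
                            w1=3, w2=2, w3=1, fallback=()):
    """Return (text, weight) pairs for one page.

    subject:  subject texts from <title> and <meta> (weight w1)
    blocks:   text blocks of the page body in document order
    location: index in blocks where the repeated structure starts,
              or None if positioning failed
    fallback: texts from the key tags, used only when positioning failed
    """
    weighted = [(text, w1) for text in subject]
    if location is None:
        # Positioning failed: use the key-tag text directly.
        # Weight 1 here is an assumption, not stated in the source.
        weighted += [(text, 1) for text in fallback]
        return weighted
    # Header information: everything before the positioning position.
    weighted += [(text, w2) for text in blocks[:location]]
    # Information flow information: the first m repeated structures.
    weighted += [(text, w3) for text in blocks[location:location + m]]
    # blocks[location + m:] is treated as noise and dropped.
    return weighted


blocks = ["Site banner", "Item 1", "Item 2", "Item 3", "Item 4", "Item 5"]
weighted = weight_page_information(["Page title"], blocks, location=1)
```

In this example the banner is kept as header information with weight 2, the first three items are kept with weight 1, and the remaining items are discarded as noise.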

(4) Training classification models to classify

Using the information flow positioning algorithm, the positioning results for the mobile-side web pages of Zhejiang University, Sina Finance, and Hupu Football are shown in FIGS. 6, 7, and 8, which show the repeated HTML tags, the sequence number of the positioning position, and the text at the positioning position. List_all stores the list of all HTML tags, and the sequence number of the positioning position is the index in List_all of the first HTML tag of the repeated structure.

Results and analysis of the experiments

To verify the effectiveness of the method, mobile-side web page data were collected for experiments. Following the classification labels of Amazon's Alexa website, 6000 mobile-side web pages were collected from well-known websites such as Tencent, NetEase, Sohu, Sina, Taobao, JD.com, and Hupu to form a data set.

The classification accuracy ACC is:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

In formula (1), TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FP is the number of negative samples classified as positive, and FN is the number of positive samples classified as negative.

The method of extracting the text information in key HTML tags such as <Hn> and <a> can be used for both desktop-side and mobile-side web pages, and in the experiments it is compared with the method designed for mobile-side web page features. The experimental results are shown in Table 1: the method that classifies using text extracted from key tags achieves an accuracy of 95.0%, while the method that classifies using information extracted based on mobile-side web page features is more effective, with an accuracy of 97.2%.

TABLE 1 Mobile end Web experiment accuracy

The two methods are further compared on three metrics: precision, recall, and F-measure. Precision is the proportion of samples classified as positive that are truly positive, and evaluates the correctness of the classification results for a given category. Recall is the proportion of truly positive samples that are classified as positive, and measures whether a given category is retrieved completely. The F-measure is a combined evaluation metric of precision and recall. The precision, recall, and F-measure of the method that extracts text in key tags are 94.7%, 94.3%, and 94.5%, respectively; those of the method that extracts information based on mobile-side web page features are 96.9%, 97.5%, and 97.2%, respectively, so the latter performs better on all three metrics.
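As a worked illustration of these metrics, the sketch below computes accuracy, precision, recall, and F-measure from confusion counts. It assumes the F-measure is the balanced harmonic mean of precision and recall (F1), which the text does not state explicitly but which is consistent with the reported figures; the function names and the example counts are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); assumed form, see lead-in."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F-measure) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of true positives that are found
    return accuracy, precision, recall, f_measure(precision, recall)


# Hypothetical counts, for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=8, tn=9, fp=1, fn=2)

# Consistency check against the precision and recall reported in the text.
f_key_tags = round(f_measure(0.947, 0.943), 3)
f_mobile = round(f_measure(0.969, 0.975), 3)
```

Under the F1 assumption, the reported precision and recall values reproduce the reported F-measures of 94.5% and 97.2% after rounding.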

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A classification method for mobile terminal web pages is characterized by comprising the following steps:

Step (2), positioning an information stream, wherein the information stream meets the conditions that the same structure of an HTML label is continuously repeated for at least m times, and the text content in the structure is more than n characters; wherein m is an integer of 3 or more and 5 or less, and n is an integer of 6 or more;

Step (3), if the positioning succeeds, taking the position where the repeated HTML tag structure starts as the positioning position, extracting all text information before the positioning position of the information flow, defining it as header information, and setting its weight to w₂; the content within the repeated structures of the information flow is subject-related information, and the information in the first m repeated structures of the information flow is extracted, defined as information flow information, and given the weight w₃; the content after the first m repeated structures of the information flow is similar, is defined as noise information, and is discarded; if the positioning fails, directly extracting the text information in HTML tags such as <Hn> and <a>; the weights w₁, w₂, and w₃ are used to measure the importance of the information;

2. The method for classifying mobile-end web pages according to claim 1, wherein the step (1) is specifically as follows:

3. The method for classifying mobile-end web pages according to claim 2, wherein the step (2) is specifically as follows:

traversing the HTML tags of the web page, and if the text length within a tag is greater than n, adding the tag to the list List_all; traversing List_all and checking whether HTML tags with the same structure exist; if tags with the same structure appear, checking whether they repeat m times, and if so, the positioning succeeds; if the traversal of List_all finishes and no HTML tag structure has repeated m times, the positioning fails.

4. The method for classifying mobile web pages according to claim 1, wherein m is preferably 3, and n is preferably 8.

5. The method for classifying mobile-end web pages according to claim 3, wherein the step (3) is specifically as follows:

6. The method for classifying mobile-end web pages according to claim 5, wherein the step (4) is specifically as follows: