CN110377810B - Classification method of mobile terminal web pages - Google Patents
Classification method of mobile terminal web pages
- Publication number
- CN110377810B (application number CN201910554829.9A)
- Authority
- CN
- China
- Prior art keywords
- information
- same structure
- positioning
- html
- web pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method for classifying mobile terminal web pages. A mobile terminal web page has a simple list-type structure: its content mostly appears in the form of an information stream, and important information appears first. To exploit these features, subject information is first extracted from the web page's <title>, <meta> description, and <meta> keywords. The information stream is then located; an information stream is a region in which the same HTML tag structure repeats at least m times and the text content in each structure is longer than n characters. If locating succeeds, the information before the located position is extracted and defined as header information; the content of the repeated structures is related to the page's subject, so the information in the first m structures is extracted and defined as information stream information; the content after the first m structures is similar to it, so it is defined as noise information and discarded. If locating fails, the information in the HTML tags <Hn>, <a>, <b>, and <p> is extracted directly. The extracted information is converted into vectors, which are input to a classifier to train a classification model that is then used for classification.
Description
Technical Field
The invention relates to the field of webpage classification, in particular to a mobile terminal webpage classification method.
Background
The network holds abundant information resources, and the amount of information has grown explosively over time. Web page classification facilitates web page information retrieval and management, for example developing and maintaining web page catalogs, improving search engine quality, and filtering web page content. According to the China Internet Development Report 2018 published by the Internet Society of China, the number of Chinese internet users reached 772 million by 2017, of whom 753 million were mobile phone internet users. The proportion of internet users going online through mobile phones is as high as 97.5%, while the proportions for desktop computers and notebook computers are 53.0% and 35.8%, respectively; the number of mobile internet users has reached, and even surpassed, the number of personal computer users. We refer to the personal computer as the desktop end, in contrast to the mobile end. In the mobile internet era, people increasingly browse web pages on mobile terminal devices such as mobile phones and tablet computers, so a classification method for mobile-end web pages has great application value.
At present, most mobile-end web pages differ markedly from desktop-end web pages, including in page structure. Because the screen of a mobile terminal is smaller than that of a desktop, the content of a mobile-end web page is more prominent and its structure is simpler. FIG. 1 is the desktop-end web page of Sina Finance and FIG. 2 is the mobile-end web page of Sina Finance. Comparison shows that the desktop-end web page has a complex, multi-level nested structure, including top-bottom, two-column, and three-column layouts, whereas the mobile-end web page has a simple list-type structure in which content appears line by line like a list, so that the content is presented clearly on the small screen of a mobile terminal. FIG. 3 shows the mobile-end web pages of Zhejiang University, Sina Finance, and Hupu Football; the content in the boxes repeats with the same structure, and the information appears in the form of a "stream". This repeated appearance of the same structure on the mobile end is called an information stream. The screen of a mobile terminal is vertical and the web page is very long, so important information appears first: web page designers tend to place important content at the front to attract users and keep them browsing.
The technologies related to mobile-end web pages mainly include mobile-end web page design, web page reorganization, and zoom-based interaction. Mobile-end web page design uses various mobile markup languages and interactive tools to help a web designer manually create and optimize a mobile-end website; web page reorganization automatically adapts a desktop web page to the mobile end so that it can be browsed conveniently there; zoom-based interaction displays a web page on the mobile screen as a summary and allows the user to zoom in on a particular section for detailed reading. These technologies mainly concern browsing mobile-end web pages; at present there is no classification method designed for the features of mobile-end web pages.
Disclosure of Invention
To address the current lack of a mobile-end web page classification method, the invention provides a classification method based on the features of mobile-end web pages. It improves the accuracy of mobile-end web page classification and solves the problem that desktop-end web page classification methods are not applicable to the mobile end.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows: a classification method for mobile terminal web pages comprises the following steps:
Step (1): extract the subject information of the web page; the subject information is located in the web page's <title>, <meta> description, and <meta> keywords, and its weight is set to w1;
Step (2): locate the information stream, where an information stream satisfies the conditions that the same HTML tag structure repeats at least m times and the text content in the structure is longer than n characters; m is an integer from 3 to 5 inclusive, and n is an integer of 6 or more;
Step (3): if locating succeeds, extract the information before the located position of the information stream, define it as header information, and set its weight to w2; the content in the repeated structures of the information stream is related to the subject, so extract the information in the first m structures, define it as information stream information, and set its weight to w3; the content after the first m structures is similar to it, so it is defined as noise information and discarded. If locating fails, directly extract the information in the HTML tags <Hn>, <a>, <b>, and <p>. The weights w1, w2, and w3 measure the importance of the information;
Step (4): convert the extracted information into vectors, input the vectors into a classifier to train a classification model, and then classify.
Further, the step (1) is specifically as follows:
<title> is the title of the web page, <meta> keywords gives the keywords of the web page, and <meta> description describes the content of the web page. All three are provided by the website designer, and the subject information of the web page is extracted from them.
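The extraction in step (1) can be sketched with Python's standard `html.parser` module. This is a minimal illustration rather than the patented implementation: the class name `SubjectInfoParser` and the function `extract_subject_info` are hypothetical, and the sketch assumes that the description and keywords are carried in `<meta name="..." content="...">` elements.

```python
from html.parser import HTMLParser


class SubjectInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {"description": "", "keywords": ""}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Assumption: the field is identified by the "name" attribute.
            name = (attr.get("name") or "").lower()
            if name in self.meta:
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_subject_info(html):
    """Return the subject information of a page as a dict of three strings."""
    parser = SubjectInfoParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}
```

Each of the three returned strings would then carry the weight w1 in the later steps.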
Further, the step (2) is specifically as follows:
Traverse the HTML tags of the web page; if the text length in a tag is greater than n, add the tag to the list List_all. Then traverse List_all and check whether HTML tags with the same structure exist; if they do, check whether the structure repeats m times, and if so, locating succeeds. If the traversal of List_all finishes without the same HTML tag structure repeating m times, locating fails.
Further, m is preferably 3, and n is preferably 8.
Further, the step (3) is specifically as follows:
In the HTML document, the text before the located position of the information stream is defined as header information, the text in the first m same-structure segments starting from the located position is defined as information stream information, and the content after those m segments is discarded.
Further, the step (4) is specifically as follows:
Convert the text information into vectors using word2vec, train a classification model using a support vector machine (SVM), and then classify.
The beneficial effects of the invention are as follows: the method effectively extracts information from mobile-end web pages based on their features, thereby improving mobile-end web page classification. It solves the problem that desktop-end web page classification methods are not suitable for the mobile end, and it achieves higher classification accuracy on mobile-end web pages than a classification method that is not designed for them.
Drawings
FIG. 1 is the desktop-end web page of Sina Finance;
FIG. 2 is the mobile-end web page of Sina Finance;
FIG. 3 shows the mobile-end web pages of Zhejiang University, Sina Finance, and Hupu Football;
FIG. 4 is the classification method for mobile-end web pages;
FIG. 5 is a flow chart of information stream locating;
FIG. 6 shows information stream locating on the Zhejiang University mobile-end web page;
FIG. 7 shows information stream locating on the Sina Finance mobile-end web page;
FIG. 8 shows information stream locating on the Hupu Football mobile-end web page.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. In the following description and in the drawings, the same numbers in different drawings identify the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of methods consistent with certain aspects of the invention, as detailed in the appended claims. Various embodiments of the present description are described in an incremental manner.
As shown in fig. 4, the present invention provides a method for classifying a mobile-end web page, which includes the following steps:
(1) Extracting subject information
The subject information is located in the web page's <title>, <meta> description, and <meta> keywords. It is provided by the web page designer, reflects what the designer intends to express, and is more accurate than other information in the web page. Its weight w1 is set to 3 based on empirical values and experiments.
(2) Locating the information stream
Locate the information stream, where an information stream satisfies the conditions that the same HTML tag structure repeats 3 times and the text content in the structure is longer than 8 characters. As shown in FIG. 5, traverse the HTML tags of the web page; if the text length in a tag is greater than 8, add the tag to the list List_all. Then traverse List_all and check whether HTML tags with the same structure exist; if they do, check whether the structure repeats 3 times, and if so, locating succeeds. If the traversal of List_all finishes without the same HTML tag structure repeating 3 times, locating fails.
The information stream locating algorithm is as follows:
Algorithm 1: Information stream locating
Input: the HTML document of the web page
Output: the list of repeated HTML tags; the sequence number of the located position; the text at the located position
Parameters: List_all is the list of all HTML tags; List is the list of tags in the repeating structure
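The body of Algorithm 1 is not reproduced in the text, so the following is only a sketch of one way the described inputs and outputs could be realized. It rests on several assumptions that are not in the source: a tag is identified by its name alone, the "text in a tag" is the text directly inside it, and the "same structure" is detected as a short tag sequence that repeats back to back in List_all. The names `TagTextCollector`, `build_list_all`, `locate_information_stream`, and the `max_period` parameter are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, plus elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}


class TagTextCollector(HTMLParser):
    """Records [tag, direct text parts] for every element, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []  # every opened element, in order of opening
        self.stack = []     # elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        entry = [tag, []]
        self.elements.append(entry)
        self.stack.append(entry)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)


def build_list_all(html, n=8):
    """Return (tag, text) pairs for tags whose direct text is longer than n."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    list_all = []
    for tag, parts in collector.elements:
        text = "".join(parts).strip()
        if tag not in SKIPPED_TAGS and len(text) > n:
            list_all.append((tag, text))
    return list_all


def locate_information_stream(list_all, m=3, max_period=5):
    """Find the first tag sequence that repeats m times in a row.

    Returns (repeating tag list, sequence number in list_all, text at that
    position), or None when locating fails.
    """
    tags = [tag for tag, _ in list_all]
    for start in range(len(tags)):
        for period in range(1, max_period + 1):
            end = start + period * m
            if end > len(tags):
                break
            pattern = tags[start:start + period]
            if tags[start:end] == pattern * m:
                return pattern, start, list_all[start][1]
    return None
```

With the preferred values m = 3 and n = 8, a page consisting of a heading followed by three list items that each contain a link and a timestamp would be located at the first link, with the repeating structure reported as the link tag followed by the timestamp tag.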
(3) Extracting other information
If locating succeeds, extract the information before the located position of the information stream, define it as header information, and set its weight w2 to 2. The content in the repeated structures of the information stream is related to the subject, so extract the information in the first 3 structures, define it as information stream information, and set its weight w3 to 1. The content after the first 3 structures is similar to it, so it is defined as noise information and discarded. If locating fails, directly extract the information in the HTML tags <Hn>, <a>, <b>, and <p>.
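The split into header information, information stream information, and noise can be illustrated as follows. This is a sketch under assumptions: the page is given as a list of (tag, text) pairs in document order, the located result is a pair of the repeating tag list and its starting index, and the function name `split_page_text` is hypothetical. The source does not give a weight for the fallback case, so `fallback_weight` is an assumption as well.

```python
W_HEADER = 2  # w2 in the description
W_STREAM = 1  # w3 in the description
FALLBACK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "a", "b", "p"}


def split_page_text(list_all, located, m=3, fallback_weight=1):
    """Return (weight, text) pairs for the text that is kept.

    list_all: (tag, text) pairs in document order.
    located:  (repeating tag list, start index) or None when locating failed.
    fallback_weight: assumption; the source states no weight for this case.
    """
    if located is None:
        # Locating failed: keep only text from <Hn>, <a>, <b>, and <p>.
        return [(fallback_weight, text) for tag, text in list_all
                if tag in FALLBACK_TAGS]
    pattern, start = located
    stream_end = start + len(pattern) * m
    header = [(W_HEADER, text) for _, text in list_all[:start]]
    stream = [(W_STREAM, text) for _, text in list_all[start:stream_end]]
    # Everything from stream_end onward is treated as noise and dropped.
    return header + stream
```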
(4) Training classification models to classify
Convert the text information into vectors using word2vec, train a classification model using a support vector machine (SVM), and then classify.
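The text does not state how the weights w1, w2, and w3 enter the vector conversion. One possible reading, shown here purely as an assumption, is a weighted average of word vectors: each token contributes its vector scaled by the weight of the part of the page it came from. The function name `weighted_document_vector` is hypothetical, and the word vectors are assumed to come from an already trained word2vec model, represented here as a plain mapping from token to a list of floats. Training the SVM on the resulting vectors is not shown.

```python
def weighted_document_vector(weighted_tokens, word_vectors):
    """Weighted average of word vectors for one page.

    weighted_tokens: iterable of (weight, token) pairs.
    word_vectors:    mapping token -> list of floats (assumed word2vec output).
    Tokens without a vector are skipped; returns None if nothing is left.
    """
    total = None
    weight_sum = 0.0
    for weight, token in weighted_tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue
        if total is None:
            total = [0.0] * len(vec)
        for i, value in enumerate(vec):
            total[i] += weight * value
        weight_sum += weight
    if total is None:
        return None
    return [value / weight_sum for value in total]
```

Other readings are equally plausible, for example repeating each text segment as many times as its weight before vectorization.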
Following the information stream locating algorithm, the locating results for the Zhejiang University, Sina Finance, and Hupu Football mobile-end web pages are shown in FIGS. 6, 7, and 8, which output the repeated HTML tags, the sequence number of the located position, and the text at the located position. List_all stores the list of all HTML tags; the sequence number of the located position is the index in List_all of the first HTML tag of the repeating structure.
Results and analysis of the experiments
To verify the effectiveness of the method, mobile-end web page data were collected for experiments. Following the category labels of Amazon's Alexa website, 6000 mobile-end web pages were collected from well-known websites such as Tencent, NetEase, Sohu, Sina, Taobao, JD.com, and Hupu to form a data set.
The classification accuracy ACC is:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
In formula (1), TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FP is the number of negative samples classified as positive, and FN is the number of positive samples classified as negative.
The method of extracting the text information in the key HTML tags <Hn>, <a>, <b>, and <p> can be applied to both desktop-end and mobile-end web pages, and in the experiments it is compared with the method designed for the features of mobile-end web pages. The results are shown in Table 1: classification based on text extracted from the key tags achieves an accuracy of 95.0%, while classification based on information extracted according to mobile-end web page features is more effective, with an accuracy of 97.2%.
TABLE 1 Mobile end Web experiment accuracy
The two methods are further compared on three indexes: precision, recall, and F-measure. Precision is the proportion of samples classified as positive that are actually positive; it evaluates the correctness of the classification result for a given class. Recall is the proportion of actual positive samples that are classified as positive; it measures whether a given class is retrieved completely. The F-measure is a combined index of precision and recall. The precision, recall, and F-measure of the key-tag text extraction method are 94.7%, 94.3%, and 94.5%, respectively; those of the method based on mobile-end web page features are 96.9%, 97.5%, and 97.2%, respectively, so the latter is better on all three indexes.
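The four indexes can be computed from the confusion counts as sketched below. The text does not give the F-measure formula, so the sketch assumes the usual harmonic mean of precision and recall; the reported values are consistent with that assumption. The function names are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F-measure) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f_measure(precision, recall)
```

Applying `f_measure` to the reported precision and recall pairs (94.7%, 94.3%) and (96.9%, 97.5%) gives 94.5% and 97.2% after rounding, which matches the F-measure values reported for the two methods.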
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (6)
1. A classification method for mobile terminal web pages is characterized by comprising the following steps:
Step (1): extracting subject information of a web page, the subject information being located in the web page's <title>, <meta> description, and <meta> keywords, with its weight set to w1;
Step (2): locating an information stream, wherein the information stream satisfies the conditions that the same HTML tag structure is consecutively repeated at least m times and the text content in the structure is longer than n characters; wherein m is an integer of 3 or more and 5 or less, and n is an integer of 6 or more;
Step (3): if the locating succeeds, taking the position where the repeated HTML tag structure starts as the located position, extracting all text information before the located position of the information stream, defining it as header information, and setting its weight to w2; the content in the repeated structures of the information stream being information related to the subject, extracting the information in the first m structures of the information stream, defining it as information stream information, and setting its weight to w3; the content after the first m structures of the information stream being similar, defining it as noise information and discarding it; if the locating fails, directly extracting the text information in the HTML tags <Hn>, <a>, <b>, and <p>; the weights w1, w2, and w3 being used to measure the importance of the information;
Step (4): converting the extracted information into vectors, inputting the vectors into a classifier to train a classification model, and then classifying.
2. The method for classifying mobile-end web pages according to claim 1, wherein the step (1) is specifically as follows:
<title> is the title of the web page, <meta> keywords gives the keywords of the web page, and <meta> description describes the content of the web page; all three are provided by the website designer, and the subject information of the web page is extracted from them.
3. The method for classifying mobile-end web pages according to claim 2, wherein the step (2) is specifically as follows:
traversing the HTML tags of the web page, and if the text length in a tag is greater than n, adding the tag to a list List_all; traversing List_all and checking whether HTML tags with the same structure exist; if they do, checking whether the structure repeats m times, and if so, the locating succeeds; if the traversal of List_all finishes without the same HTML tag structure repeating m times, the locating fails.
4. The method for classifying mobile web pages according to claim 1, wherein m is preferably 3, and n is preferably 8.
5. The method for classifying mobile-end web pages according to claim 3, wherein the step (3) is specifically as follows:
in the HTML document, the text before the located position of the information stream is defined as header information, the text in the first m same-structure segments starting from the located position is defined as information stream information, and the content after those m segments is discarded.
6. The method for classifying mobile-end web pages according to claim 5, wherein the step (4) is specifically as follows:
converting the text information into vectors using word2vec, training a classification model using a support vector machine (SVM), and then classifying.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910554829.9A CN110377810B (en) | 2019-06-25 | 2019-06-25 | Classification method of mobile terminal web pages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910554829.9A CN110377810B (en) | 2019-06-25 | 2019-06-25 | Classification method of mobile terminal web pages |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110377810A CN110377810A (en) | 2019-10-25 |
CN110377810B (en) | 2022-04-08 |
Family
ID=68250621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910554829.9A Active CN110377810B (en) | 2019-06-25 | 2019-06-25 | Classification method of mobile terminal web pages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377810B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103020129A (en) * | 2012-11-20 | 2013-04-03 | 中兴通讯股份有限公司 | Text content extraction method and text content extraction device |
EP3035210A1 (en) * | 2013-09-04 | 2016-06-22 | ZTE Corporation | Method and device for obtaining web page category standards, and method and device for categorizing web page categories |
CN108694192A (en) * | 2017-04-07 | 2018-10-23 | 北京国双科技有限公司 | The judgment method and device of type of webpage |
CN108984706A (en) * | 2018-07-06 | 2018-12-11 | 浙江大学 | A kind of Web page classification method based on deep learning fusing text and structure feature |
Non-Patent Citations (1)
Title |
---|
Research on web page classification based on page tags; Chen Xiaozhu et al.; 《商业视角》; 2009-12-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN110377810A (en) | 2019-10-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||