CN110377810B - Classification method of mobile terminal web pages - Google Patents

Classification method of mobile terminal web pages

Info

Publication number
CN110377810B
Authority
CN
China
Prior art keywords
information
same structure
positioning
html
web pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910554829.9A
Other languages
Chinese (zh)
Other versions
CN110377810A (en)
Inventor
沈继忠
邓立
杜歆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201910554829.9A
Publication of CN110377810A
Application granted
Publication of CN110377810B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a classification method for mobile terminal web pages. Mobile-side web pages have a simple list-type structure, their content mostly appears in the form of an information flow, and important information appears first. Based on these features, the subject information is extracted first; it is located in the web page's <title>, <meta> description, and <meta> keywords. An information flow is then positioned; an information flow satisfies the conditions that the same HTML tag structure is repeated at least m times and that the text content in the structure is more than n characters. If positioning succeeds, the information before the positioning position of the information flow is extracted and defined as header information; the content in the repeated structures of the information flow is related to the subject, so the information in the first m repeated structures is extracted and defined as information-flow information; the content after the first m repeated structures is similar to it, so it is defined as noise information and discarded. If positioning fails, the information in the HTML tags <Hn>, <a>, <b>, and <p> is extracted directly. The extracted information is converted into vectors, the vectors are input to a classifier to train a classification model, and classification is then performed.

Description

Classification method of mobile terminal web pages
Technical Field
The invention relates to the field of webpage classification, in particular to a mobile terminal webpage classification method.
Background
The network holds abundant information resources, and the amount of information on it has grown explosively over time. Web page classification facilitates web page information retrieval and management, such as developing and maintaining web page catalogs, improving search engine quality, and filtering web page content. According to the China Internet Development Report 2018 published by the China Internet Association, the number of Chinese internet users reached 772 million by 2017, of whom 753 million were mobile phone users. The share of internet users who go online through mobile phones is as high as 97.5%, while the shares for desktop computers and notebook computers are 53.0% and 35.8%, respectively; mobile internet users have reached the same level as personal computer users and even surpass them in number. We refer to the personal computer as the desktop side, as opposed to the mobile side. In the mobile internet era, people increasingly browse web pages on mobile terminal devices such as mobile phones and tablet computers, so a classification method for mobile-side web pages has great application value.
At present, most mobile-side web pages differ markedly from desktop-side web pages, and their structure differs as well. Because the screen of a mobile terminal is smaller than that of a desktop terminal, the content of a mobile-side web page is more prominent and its structure is simpler. FIG. 1 is the desktop-side web page of Sina Finance and FIG. 2 is the mobile-side web page of Sina Finance. A comparison shows that the desktop-side web page has a complex, multi-level nested structure, including top-and-bottom, two-column, and three-column layouts, whereas the mobile-side web page has a simple list-type structure in which content appears line by line like a list; this simple structure allows the content to be presented clearly on the small screen of a mobile terminal. FIG. 3 shows the mobile-side web pages of Zhejiang University, Sina Finance, and Hupu Football; the content in the boxes repeats in the same structure, so the information appears in the form of a "stream". This repeated appearance of the same structure on the mobile side is called an information flow. The screen of a mobile terminal is vertical and the web page is very long, and important information appears first, because web page designers tend to put important content at the front to attract users and keep them interested in browsing further.
The technologies related to mobile-side web pages mainly include mobile-side web page design, web page reorganization, and zoom-based interaction. Mobile-side web page design uses various mobile markup languages and interactive tools to help a web designer manually create and optimize a mobile-side website. Web page reorganization intelligently adapts a desktop-side web page to the mobile side so that it can be browsed conveniently on a mobile terminal. Zoom-based interaction displays a web page on the mobile screen as a summary and lets the user zoom in on a particular section to read it in detail. These technologies mainly concern browsing mobile-side web pages; at present there is no classification method that targets mobile-side web page features.
Disclosure of Invention
In view of the current lack of a classification method for mobile-side web pages, the invention provides a classification method based on mobile-side web page features, which improves the accuracy of mobile-side web page classification and solves the problem that desktop-side web page classification methods are not suitable for the mobile side.
To achieve this purpose, the technical scheme adopted by the invention is as follows: a classification method for mobile terminal web pages, comprising the following steps:
Step (1): extract the subject information of the web page; the subject information is located in the web page's <title>, <meta> description, and <meta> keywords, and its weight is set to w1;
Step (2): position an information flow, where an information flow satisfies the conditions that the same HTML tag structure is repeated at least m times and that the text content in the structure is more than n characters; m is an integer from 3 to 5 inclusive, and n is an integer greater than or equal to 6;
Step (3): if positioning succeeds, extract the information before the positioning position of the information flow, define it as header information, and set its weight to w2; the content in the repeated structures of the information flow is related to the subject, so extract the information in the first m repeated structures, define it as information-flow information, and set its weight to w3; the content after the first m repeated structures is similar to it, so it is defined as noise information and discarded. If positioning fails, directly extract the information in the HTML tags <Hn>, <a>, <b>, and <p>. The weights w1, w2, and w3 measure the importance of the information;
Step (4): convert the extracted information into vectors, input the vectors to a classifier to train a classification model, and then perform classification.
Further, the step (1) is specifically as follows:
<title> is the title of the web page, <meta> keywords gives the keywords of the web page, and <meta> description describes the content of the web page. All three are provided by the website designer, and the subject information of the web page is extracted from them.
Further, the step (2) is specifically as follows:
Traverse the HTML tags of the web page and add a tag to the list List_all if the length of the text in the tag is greater than n. Then traverse List_all and check whether HTML tags with the same structure occur; if they do, check whether the structure is repeated m times, and if so, positioning succeeds. If the traversal of List_all finishes without the same HTML tag structure having been repeated m times, positioning fails.
Further, m is preferably 3, and n is preferably 8.
Further, the step (3) is specifically as follows:
In the HTML document, the text before the information flow positioning position is defined as header information; the text in the m segments of same-structure content starting from the positioning position is defined as information-flow information; and the content after those m segments is discarded.
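The three-way split can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `split_by_positioning` is hypothetical, the input is assumed to be a list of text segments in document order together with the index of the positioning position, and the default weights follow the values given in the detailed description (w2 = 2, w3 = 1).

```python
def split_by_positioning(texts, position, m=3, w2=2, w3=1):
    """Split text segments (in document order) around the positioning position.

    Assumption: `position` is the index of the first segment of the
    repeated structure, and each repeated structure is one segment.
    """
    # Everything before the positioning position is header information.
    header = [(text, w2) for text in texts[:position]]
    # The first m repeated segments are information-flow information.
    flow = [(text, w3) for text in texts[position:position + m]]
    # Later segments are treated as noise and are not used as features.
    noise = texts[position + m:]
    return header, flow, noise
```

Called with six segments and `position=1`, the function returns one header segment, three information-flow segments, and two noise segments.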
Further, the step (4) is specifically as follows:
converting text information into vectors by adopting word2vec, training a classification model by adopting a Support Vector Machine (SVM), and then classifying.
The beneficial effects of the invention are as follows: the method extracts information from mobile-side web pages according to their specific features, which improves mobile-side web page classification. It solves the problem that desktop-side web page classification methods are not suitable for the mobile side, and it achieves higher classification accuracy than applying a desktop-side classification method to mobile-side web pages.
Drawings
FIG. 1 is the desktop-side web page of Sina Finance;
FIG. 2 is the mobile-side web page of Sina Finance;
FIG. 3 shows the mobile-side web pages of Zhejiang University, Sina Finance, and Hupu Football;
FIG. 4 illustrates the classification method for mobile-side web pages;
FIG. 5 is a flow chart of information flow positioning;
FIG. 6 shows the information flow positioning for the Zhejiang University mobile-side web page;
FIG. 7 shows the information flow positioning for the Sina Finance mobile-side web page;
FIG. 8 shows the information flow positioning for the Hupu Football mobile-side web page.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. In the following description and in the drawings, the same numbers in different drawings identify the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of methods consistent with certain aspects of the invention, as detailed in the appended claims. Various embodiments of the present description are described in an incremental manner.
As shown in fig. 4, the present invention provides a method for classifying a mobile-end web page, which includes the following steps:
(1) Extracting subject information
The subject information is located in the web page's <title>, <meta> description, and <meta> keywords. It is provided by the web page designer, expresses what the designer subjectively wants to convey, and is relatively more accurate than other information in the web page. The weight w1 is set to 3 based on empirical values and experiments.
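Subject information extraction can be sketched with Python's standard `html.parser` module. The class and function names below are hypothetical, and the sketch assumes that the description and keywords appear as `<meta name="description" content="...">` and `<meta name="keywords" content="...">`; each extracted text is paired with the weight w1 = 3.

```python
from html.parser import HTMLParser


class SubjectInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> contents."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = ""
        self.keywords = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "description":
                self.description = content
            elif name == "keywords":
                self.keywords = content

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_subject_info(html, w1=3):
    """Return non-empty subject texts, each paired with the weight w1."""
    parser = SubjectInfoParser()
    parser.feed(html)
    candidates = (parser.title, parser.description, parser.keywords)
    return [(text.strip(), w1) for text in candidates if text.strip()]
```

Pages that omit one of the three fields simply yield fewer entries; how a missing field should be handled is not specified in the source.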
(2) Positioning information flow
An information flow is positioned; an information flow satisfies the conditions that the same HTML tag structure is repeated 3 times and that the text content in the structure is more than 8 characters. As shown in FIG. 5, the HTML tags of the web page are traversed, and a tag is added to the list List_all if the length of the text in the tag is greater than 8. List_all is then traversed to check whether HTML tags with the same structure occur; if they do, it is checked whether the structure is repeated 3 times, and if so, positioning succeeds. If the traversal of List_all finishes without the same HTML tag structure having been repeated 3 times, positioning fails.
The information flow positioning algorithm is as follows:
Algorithm 1: Information flow positioning
Input: web page HTML document
Output: HTML tag list; sequence number of the positioning position; text at the positioning position
Parameters: List_all is the list of all HTML tags; List is the list of tags in the repeated structure
[Figure BDA0002106599830000041: pseudocode of Algorithm 1, provided as an image in the original]
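Because the pseudocode of Algorithm 1 survives only as an image reference, the following Python sketch reconstructs the procedure from the prose description alone and should not be read as the authors' exact algorithm. Two points are assumptions: "same structure" is interpreted as an identical path of tag names from the document root, and "text in the tag" is interpreted as the text directly inside the element. All names (`TagTextCollector`, `locate_information_flow`) are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}


class TagTextCollector(HTMLParser):
    """Records (open order, tag path, direct text) for tags with enough text."""

    def __init__(self, n):
        super().__init__()
        self.n = n
        self.stack = []     # open elements as [tag name, text chunks, open order]
        self.records = []
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return          # void elements never hold text
        self.stack.append([tag, [], self.counter])
        self.counter += 1

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return          # stray closing tag
        while self.stack:
            path = tuple(entry[0] for entry in self.stack)
            name, chunks, order = self.stack.pop()
            text = "".join(chunks).strip()
            if name not in SKIPPED_TAGS and len(text) > self.n:
                self.records.append((order, path, text))
            if name == tag:
                break


def locate_information_flow(html, m=3, n=8):
    """Return (List_all, positioning index), or (List_all, None) on failure."""
    collector = TagTextCollector(n)
    collector.feed(html)
    collector.close()
    # List_all in document order: (structure signature, text) pairs.
    list_all = [(path, text) for _, path, text in sorted(collector.records)]
    run_start = 0
    for i, (path, _) in enumerate(list_all):
        if path != list_all[run_start][0]:
            run_start = i   # structure changed, start a new run
        if i - run_start + 1 >= m:
            return list_all, run_start
    return list_all, None
```

For a page with one heading followed by a list of four links, the function returns index 1 (the first link) as the positioning position; `list_all[index][1]` is then the text at the positioning position. Elements left unclosed at the end of the document are ignored in this sketch.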
(3) Extracting other information
If positioning succeeds, the information before the positioning position of the information flow is extracted and defined as header information, and its weight w2 is set to 2. The content in the repeated structures of the information flow is related to the subject, so the information in the first 3 repeated structures is extracted and defined as information-flow information, and its weight w3 is set to 1. The content after the first 3 repeated structures is similar to it, so it is defined as noise information and discarded. If positioning fails, the information in the HTML tags <Hn>, <a>, <b>, and <p> is extracted directly.
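The fallback branch, used when positioning fails, can be sketched with the standard `html.parser` module. The names below are hypothetical, and `<Hn>` is assumed to mean the heading tags `<h1>` through `<h6>`.

```python
import re
from html.parser import HTMLParser

# Key tags named in the text: <Hn> (assumed h1..h6), <a>, <b>, <p>.
KEY_TAG = re.compile(r"^(h[1-6]|a|b|p)$")


class KeyTagTextParser(HTMLParser):
    """Collects text that appears inside any of the key tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open key tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if KEY_TAG.match(tag):
            self.depth += 1

    def handle_endtag(self, tag):
        if KEY_TAG.match(tag) and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_key_tag_text(html):
    """Return the text fragments found inside the key tags, in document order."""
    parser = KeyTagTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The sketch assumes that key tags are properly closed; an unclosed `<p>` would cause the text that follows it to be collected as well.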
(4) Training classification models to classify
Converting text information into vectors by adopting word2vec, training a classification model by adopting a Support Vector Machine (SVM), and then classifying.
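The source does not state how the weights w1, w2, and w3 enter the vector representation. One plausible reading, shown here purely as an assumption, is a weighted average of word vectors in which each text segment contributes in proportion to its weight. The embedding table below is a hypothetical toy stand-in for trained word2vec vectors, whitespace splitting stands in for proper word segmentation, and the SVM training step is not shown.

```python
def weighted_page_vector(weighted_texts, embeddings, dim):
    """Average word vectors, scaling each segment's words by its weight.

    weighted_texts: list of (text, weight) pairs, e.g. subject info with
    weight 3, header info with weight 2, information-flow info with weight 1.
    embeddings: dict mapping a word to a list of `dim` floats (hypothetical).
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for text, weight in weighted_texts:
        for word in text.lower().split():
            vector = embeddings.get(word)
            if vector is None:
                continue    # out-of-vocabulary words are skipped
            for k in range(dim):
                total[k] += weight * vector[k]
            weight_sum += weight
    if weight_sum == 0:
        return total        # no known words: return the zero vector
    return [value / weight_sum for value in total]
```

The resulting fixed-length vector is the kind of input a classifier such as an SVM expects; other weighting schemes, such as repeating a segment according to its weight before training, would fit the description equally well.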
Following the information flow positioning algorithm, the positioning results for the Zhejiang University, Sina Finance, and Hupu Football mobile-side web pages are shown in FIGS. 6, 7, and 8, which output the repeated HTML tags, the sequence number of the positioning position, and the text at the positioning position. List_all stores the list of all HTML tags; the sequence number of the positioning position is the index in List_all of the first HTML tag of the repeated structure.
Results and analysis of the experiments
To verify the effectiveness of the method, mobile-side web page data were collected for experiments. A data set of 6000 mobile-side web pages was collected from well-known websites such as Tencent, NetEase, Sohu, Sina, Taobao, JD.com, and Hupu, labeled according to the classification labels of Amazon's Alexa website.
The classification accuracy ACC is:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
In formula (1), TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FP is the number of negative samples classified as positive, and FN is the number of positive samples classified as negative.
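As a small worked example, the accuracy can be computed directly from the four counts TP, TN, FP, and FN as the share of correctly classified samples. The counts used in the usage note are hypothetical and are not taken from the experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Classification accuracy: correctly classified samples over all samples."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with 45 true positives, 40 true negatives, 10 false positives, and 5 false negatives, `accuracy(45, 40, 10, 5)` gives 85 correct out of 100 samples, that is 0.85.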
The method of extracting the text in the key HTML tags <Hn>, <a>, <b>, and <p> can be applied to both desktop-side and mobile-side web pages, and it is compared experimentally with the proposed method, which targets mobile-side web page features. The results are shown in Table 1: the method that classifies using text extracted from key tags achieves an accuracy of 95.0%, while the method that classifies using information extracted according to mobile-side web page features is more effective, with an accuracy of 97.2%.
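TABLE 1 Accuracy in the mobile-side web page experiment
Method                                                          Accuracy
Extracting text in key HTML tags                                95.0%
Extracting information based on mobile-side web page features   97.2%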
The two methods are further compared on three indexes: precision, recall, and F-measure. Precision is the proportion of samples classified as positive that are actually positive; it evaluates the correctness of the classification result for a given class. Recall is the proportion of actual positive samples that are classified as positive; it measures whether the retrieval of a given class is complete. The F-measure is a comprehensive evaluation index that combines precision and recall. The precision, recall, and F-measure of the method that classifies using text extracted from key tags are 94.7%, 94.3%, and 94.5%, respectively; those of the method that classifies using information extracted according to mobile-side web page features are 96.9%, 97.5%, and 97.2%, respectively. The proposed method is better on all three indexes.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
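The three indexes can be checked numerically. The sketch below assumes the standard definitions, with the F-measure taken as the harmonic mean of precision and recall; the original formulas survive only as image references, so this form is an assumption, although it reproduces the reported F-measures from the reported precision and recall values. The function names are hypothetical.

```python
def precision(tp, fp):
    """Share of samples classified as positive that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positive samples that are classified as positive."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision p and recall r (assumed form of the F-measure)."""
    return 2 * p * r / (p + r)
```

With the reported values, `f_measure(0.947, 0.943)` is approximately 0.945 and `f_measure(0.969, 0.975)` is approximately 0.972, which agrees with the 94.5% and 97.2% stated in the text.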
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A classification method for mobile terminal web pages is characterized by comprising the following steps:
step (1): extracting subject information of the web page, the subject information being located in the web page's <title>, <meta> description, and <meta> keywords, and setting its weight to w1;
step (2): positioning an information flow, wherein the information flow satisfies the conditions that the same HTML tag structure is continuously repeated at least m times and that the text content in the structure is more than n characters; wherein m is an integer of 3 or more and 5 or less, and n is an integer of 6 or more;
step (3): if the positioning is successful, taking the position where the repeated HTML tag structure starts as the positioning position, extracting all text information before the positioning position of the information flow, defining it as header information, and setting its weight to w2; the content in the repeated structures of the information flow being information related to the subject, extracting the information in the first m repeated structures of the information flow, defining it as information-flow information, and setting its weight to w3; the content after the first m repeated structures of the information flow being similar, defining it as noise information and discarding it; if the positioning fails, directly extracting the text information in the HTML tags <Hn>, <a>, <b>, and <p>; the weights w1, w2, and w3 being used to measure the importance of the information;
and (4) converting the extracted information into vectors, inputting the vectors into a classifier to train a classification model, and then classifying.
2. The method for classifying mobile-end web pages according to claim 1, wherein the step (1) is specifically as follows:
<title> is the title of the web page, <meta> keywords gives the keywords of the web page, and <meta> description describes the content of the web page; all three are provided by the website designer, and the subject information of the web page is extracted from them.
3. The method for classifying mobile-end web pages according to claim 2, wherein the step (2) is specifically as follows:
traversing the HTML tags of the web page, and adding a tag to the list List_all if the length of the text in the tag is greater than n; traversing List_all and judging whether HTML tags with the same structure occur; if they occur, judging whether the structure is repeated m times, and if so, the positioning is successful; if the traversal of List_all is finished and the same HTML tag structure has not been repeated m times, the positioning fails.
4. The method for classifying mobile web pages according to claim 1, wherein m is preferably 3, and n is preferably 8.
5. The method for classifying mobile-end web pages according to claim 3, wherein the step (3) is specifically as follows:
in the HTML document, the text before the information flow positioning position is defined as header information, the text in the m segments of same-structure content starting from the positioning position is defined as information-flow information, and the content after those m segments is discarded.
6. The method for classifying mobile-end web pages according to claim 5, wherein the step (4) is specifically as follows:
converting text information into vectors by adopting word2vec, training a classification model by adopting a Support Vector Machine (SVM), and then classifying.
CN201910554829.9A 2019-06-25 2019-06-25 Classification method of mobile terminal web pages Active CN110377810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910554829.9A CN110377810B (en) 2019-06-25 2019-06-25 Classification method of mobile terminal web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910554829.9A CN110377810B (en) 2019-06-25 2019-06-25 Classification method of mobile terminal web pages

Publications (2)

Publication Number Publication Date
CN110377810A CN110377810A (en) 2019-10-25
CN110377810B CN110377810B (en) 2022-04-08

Family

ID=68250621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910554829.9A Active CN110377810B (en) 2019-06-25 2019-06-25 Classification method of mobile terminal web pages

Country Status (1)

Country Link
CN (1) CN110377810B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020129A (en) * 2012-11-20 2013-04-03 中兴通讯股份有限公司 Text content extraction method and text content extraction device
EP3035210A1 (en) * 2013-09-04 2016-06-22 ZTE Corporation Method and device for obtaining web page category standards, and method and device for categorizing web page categories
CN108694192A (en) * 2017-04-07 2018-10-23 北京国双科技有限公司 The judgment method and device of type of webpage
CN108984706A (en) * 2018-07-06 2018-12-11 浙江大学 A kind of Web page classification method based on deep learning fusing text and structure feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on web page classification based on page tags; 陈笑筑 et al.; 《商业视角》; 20091231; full text *

Also Published As

Publication number Publication date
CN110377810A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
Sun et al. Dom based content extraction via text density
CN106682192B (en) Method and device for training answer intention classification model based on search keywords
TWI695277B (en) Automatic website data collection method
CA2774278C (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
WO2016058267A1 (en) Chinese website classification method and system based on characteristic analysis of website homepage
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN103838798B (en) Page classifications system and page classifications method
CN103874994A (en) Method and apparatus for automatically summarizing the contents of electronic documents
Paranjpe Learning document aboutness from implicit user feedback and document structure
CN103678310A (en) Method and device for classifying webpage topics
CN108038173B (en) Webpage classification method and system and webpage classification equipment
Wu et al. News filtering and summarization on the web
CN110555154B (en) Theme-oriented information retrieval method
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN104503988A (en) Searching method and device
CN103177036A (en) Method and system for label automatic extraction
CN103678422A (en) Web page classification method and device and training method and device of web page classifier
KR100954842B1 (en) Method and System of classifying web page using category tag information and Recording medium using by the same
CN111339457B (en) Method and apparatus for extracting information from web page and storage medium
de Moura et al. Using structural information to improve search in Web collections
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
Liu et al. Main content extraction from web pages based on node characteristics
Hsu et al. Hierarchical comments-based clustering
Gali et al. Extracting representative image from web page

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant