CN102999511A - Rapid page switching method, rapid page switching device and rapid page switching system - Google Patents
- Publication number
- CN102999511A
- Authority
- CN
- China
- Prior art keywords
- label
- page
- dictionary
- word
- structural type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The application provides a rapid page switching method, device and system, and relates to the technical field of web pages. The method comprises the following steps: receiving a page request from a client; acquiring a page document according to the request and parsing the DOM (Document Object Model) tree structure of the document; filtering the tags at all levels of the DOM tree according to a tag library and a structure tag dictionary; writing the tags retained in the filtered DOM tree, together with their contents, into a display frame according to their corresponding structure; and returning the processed result to the client. With the method, device and system provided by the application, the whole page conversion process can be completed online in real time without any local storage, computation is fast, the buffering needed during data processing can be done in memory, and excessive file I/O operations and database operations are not needed.
Description
Technical field
The application relates to the field of web technologies, and in particular to a rapid page conversion method, device and system.
Background technology
With the spread of Internet-enabled mobile terminals, most users have begun to use mobile terminals to go online and browse web pages. In response to this trend, the major websites have each built their own WAP site (WAP, Wireless Application Protocol, an application protocol standard that connects mobile phones to the Internet), optimized for the experience of mobile terminal users.
In the prior art, after the server receives a page request, it needs a large amount of local storage in the background to crawl web pages and to build and train analysis templates, and it uses a few fixed templates to extract the content of a web page directly in order to generate the WAP page. The prior art therefore has the following shortcomings:
(1) Large local storage: because web pages must be crawled and analysis templates built and trained, the prior art needs a large amount of local storage.
(2) Limited applicability: faced with the endlessly varied web pages on the Internet, the prior art mostly uses fixed templates to extract page content directly, which limits its generality.
Most pages are designed to display normally or completely only on a PC (personal computer), and display poorly on mobile terminals; the prior-art solution to this often consumes a great deal of manpower and material resources.
Summary of the invention
The technical problem to be solved by the application is to provide a rapid page conversion method, device and system, so as to solve the problem of heavy resource consumption in the wireless application environment.
To solve the above problem, the application discloses a rapid page conversion method, comprising:
A request receiving step: receiving a page request from a client;
A page acquiring step: acquiring a page document according to the request, and parsing the DOM tree structure of the document;
A tag filtering step: filtering the tags at each level of the DOM tree according to a tag library and a structure tag dictionary;
A page collating step: writing the tags retained in the filtered DOM tree, together with the content they contain, into a display frame according to their corresponding structure;
A page returning step: returning the collated result to the client.
Preferably, the tag filtering step specifically comprises performing the following steps for the tags at each level of the DOM tree:
A preliminary tag filtering step: filtering the child tags of the current level according to the tag library;
A structural tag filtering step: filtering the structural tags retained by the preliminary filtering according to the structure tag dictionary.
Preferably, the preliminary tag filtering step comprises a tag judging step:
For a retained text tag, passing the text tag and its content, together with the corresponding parent tag, to the page collating step;
For a retained image tag, passing the image tag and its content, together with the corresponding parent tag, to the page collating step when the size of the image it refers to is below a preset size threshold;
For a retained structural tag, proceeding to the structural tag filtering step.
Preferably, the tag words of the structure tag dictionary comprise tag words found in the text of the id and class attributes of tags; the tag words are selected according to their statistical frequency.
Preferably, the structural tag filtering step specifically comprises:
A lookup step: for each structural tag, looking up the tag words in the text of its id attribute and/or class attribute among the tag words of the structure tag dictionary;
A tag similarity calculating step: based on the lookup result and according to a tag rule set, calculating the tag similarity between the structural tag and the tag words in the structure tag dictionary;
A judging and filtering step: comparing the calculated tag similarity with a preset threshold, and filtering the structural tag according to the comparison result.
Preferably, the tag similarity is calculated from a tag text similarity and a tag semantic similarity.
Preferably, the tag text similarity is calculated as:
[formula not reproduced in the source text]
The tag semantic similarity is calculated as:
[formula not reproduced in the source text]
The tag similarity is calculated as:
[formula not reproduced in the source text]
where $W_1$ denotes the id and/or class attribute text of the structural tag to be filtered, $W_2$ denotes the text of a tag word in the structure dictionary, len() returns the length of a text, max() returns the larger of its two arguments, $W_s$ denotes the part that $W_1$ and $W_2$ have in common, $W_{Ki}$ denotes the set of hypernyms, hyponyms or synonyms of the tag word vocabulary, and the ∩ operation denotes synonym comparison.
Preferably, the judging and filtering step specifically comprises:
Filtering out the structural tag when the tag similarity is greater than the threshold.
Preferably, the structure tag dictionary comprises a navigation filtering dictionary and a footer filtering dictionary; the tag words that the navigation filtering dictionary filters on comprise navigation tag words and advertisement tag words, and the tag words that the footer filtering dictionary filters on comprise header tag words and footer tag words.
Correspondingly, the application also discloses a rapid page conversion device, comprising:
A request receiving module, used to receive a page request from a client;
A page acquiring module, used to acquire a page document and parse the DOM tree structure of the page document;
A tag filtering module, used to filter the tags at each level of the DOM tree according to a tag library and a structure tag dictionary;
A page collating module, used to write the tags retained in the filtered DOM tree, together with the content they contain, into a display frame according to their corresponding structure;
A page returning module, used to return the collated result to the client.
Compared with the prior art, the application has the following advantages:
Starting from the root node of the DOM tree structure obtained from the page document, the application traverses and filters the tags in the DOM tree level by level according to the tag library and the structure rule dictionary, and writes the retained tags and their content into the display frame;
First, by using tag parsing technology to recursively parse the DOM tree in memory, the structure of the web page can be obtained, and this needs no local storage at all. Filtering for the effective content is done with the structure tag dictionary and the tag rule set, which only requires the predefined structure tag dictionary to reside in local memory, so neither database nor file storage is needed;
Second, the application writes the denoised, filtered content to a network data stream or a file to be displayed to the user; that is, the complete filtered content is written into a display frame template to output the new page. This suits many scenarios and gives the approach strong generality;
In short, the whole process of the application can be completed online in real time without any local storage and with fast computation; the buffering needed during data processing can be done in memory, and excessive file I/O operations and database operations are not needed.
Description of drawings
Fig. 1 is a schematic flowchart of a rapid page conversion method according to an embodiment of the application;
Fig. 2 is a schematic flowchart of the tag filtering step, preferably starting from the root node of the DOM tree, according to an embodiment of the application;
Fig. 3 is a schematic flowchart of the preferred structural tag filtering according to an embodiment of the application;
Fig. 4 is a schematic flowchart of the tag similarity judgment of the application;
Fig. 5 is a schematic flowchart of the method of another preferred embodiment of the application;
Fig. 6 is a schematic structural diagram of a rapid page conversion device according to an embodiment of the application;
Fig. 7 is a schematic structural diagram of the tag filtering module, preferably starting from the root node of the DOM tree, according to an embodiment of the application;
Fig. 8 is a schematic structural diagram of the preferred structural tag filtering module according to an embodiment of the application;
Fig. 9 is a schematic structural diagram of a preferred rapid page conversion system of the application;
Fig. 10 is a schematic workflow diagram of a preferred rapid page conversion system of the application;
Fig. 11 is a partial screenshot of the page obtained after the application converts the page at the URL http://news.qq.com/a/20110725/000098.htm.
Embodiment
To make the above objects, features and advantages of the application clearer and easier to understand, the application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a schematic flowchart of a rapid page conversion method according to an embodiment of the application is shown.
Request receiving step 50: receive the user's page request.
When accessing a website, a mobile terminal or other client can choose to send a WAP page request to the server. The server can determine from the client type corresponding to the page request whether page conversion is to be performed: when the device's capabilities cannot meet the page's display requirements, the page needs to be converted so that it meets the client's display requirements.
Page acquiring step 100: acquire the page document, and parse the DOM tree structure of the page document.
In practice, the HTML document of the web page can first be fetched, and then the DOM (Document Object Model) tree structure of the HTML document is parsed out. For example, starting from the root node body, the root node body may have beneath it a text (document) tag, an img (image) tag, and several structural tags such as div tags, where each structural tag in turn contains next-level tags; a tree-shaped DOM structure is thus formed.
The HTML document can be read and its DOM tree structure parsed with any of many open-source tools, such as htmlparser, htmlclient or nekohtml.
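As an illustration of the parsing step, the following is a minimal sketch that builds an in-memory tag tree with Python's standard `html.parser` module. It is not one of the tools named in the description; the `Node` class, the `parse_dom` function and the list of void tags are assumptions made only for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (illustrative subset, an assumption).
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class Node:
    """Hypothetical tree node: tag name, attributes, children, direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree entirely in memory while the document is fed in."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into elements that can hold children

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `parse_dom(html)` returns a root node whose `children` can then be walked level by level, which is the shape of input the later filtering steps work on.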
Tag filtering step 200: filter the tags at each level of the DOM tree according to the tag library and the structure tag dictionary.
In practice, referring to Fig. 2, for the tags at each level of the DOM tree, the tag filtering step 200 specifically comprises the following steps:
Preliminary tag filtering step 210: filter the child tags of the current level according to the tag library.
The tag library can include the document-type tag <text>, the image-type tag <img>, and <a>, <div>, <dl>, <ul>, <ol>, <table>, <p>, <h1>-<h6>, <span>, <tr>, <td> and so on. When the child tags of a parent tag are filtered, each child tag of that parent is checked against the tag library: if it is a tag recorded in the library, it is kept; if not, it is filtered out.
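The preliminary filtering amounts to a whitelist check. A minimal sketch, assuming the tag library is held as a set of lowercase tag names; the function name `filter_children` is hypothetical:

```python
# Tag library as listed in the description; "text" stands for the document-type tag.
TAG_LIBRARY = {
    "text", "img", "a", "div", "dl", "ul", "ol", "table", "p",
    "h1", "h2", "h3", "h4", "h5", "h6", "span", "tr", "td",
}


def filter_children(child_tags, tag_library=TAG_LIBRARY):
    """Keep only the child tags recorded in the tag library, preserving order."""
    return [tag for tag in child_tags if tag.lower() in tag_library]
```

Tags such as script or style are absent from the library, so they fall away at this stage without any dictionary lookup.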
Structural tag filtering step 220: filter the structural tags retained by the preliminary filtering according to the structure tag dictionary.
After preliminary tag filtering step 210, each structural tag among the retained child tags is judged against the structure tag dictionary to decide whether it should be kept.
In the application, the structure tag dictionary can be built by an offline statistical learning method: tag words with particular text found in the id and class attributes of tags are taken as non-body-text tag words. The tag words can be selected according to their statistical frequency, and each selected tag word becomes an entry of the dictionary. The structure tag dictionary can of course also be built in other ways. To facilitate style rendering (CSS and JS control), most website developers assign a class or id attribute to the tags in an HTML page: CSS styles are mostly applied through the id and class attributes, and JS object selection is also mostly done through id. The tags that make up the general structure of a web page therefore usually carry these two attributes.
Specifically, the structure tag dictionary can comprise a navigation filtering dictionary and a footer filtering dictionary; the tag words that the navigation filtering dictionary filters on comprise navigation tag words and advertisement tag words, and the tag words that the footer filtering dictionary filters on comprise header tag words and footer tag words.
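The description does not spell out the statistical procedure, so the following is only a sketch of one way frequency-based selection could look. It assumes the input is id/class attribute text collected offline from blocks already known to be non-body content, and that `min_count` is a tunable cutoff; both are assumptions for illustration.

```python
import re
from collections import Counter


def build_structure_dictionary(attr_texts, min_count=2):
    """Count the words found in id/class texts and keep the frequent ones.

    attr_texts: id/class strings gathered offline (assumed to come from
    navigation, advertisement, header and footer blocks).
    min_count: hypothetical frequency cutoff.
    """
    counts = Counter()
    for text in attr_texts:
        # Split on anything that is not a letter: "_", "-", spaces, digits.
        counts.update(tok for tok in re.split(r"[^a-z]+", text.lower()) if tok)
    return {word for word, n in counts.items() if n >= min_count}
```

With a realistic corpus the surviving words would be ones like nav or copyright, which recur across many sites, while site-specific fragments stay below the cutoff.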
Because the tag words contained in id and class are always correlated with the content under their tag, the names used for these tag words are limited. The application uses statistics to determine which tag words are used most and what kind of tags they name, and from these statistics determines which tag content to filter. For example, after statistical calculation, tag words such as nav, menu, head, comment, bottom, top, link, foot, side, logo, ads, login and copyright can be put into the structure tag dictionary. Take one page document as an example; its URL is:
http://news.qq.com/a/20110725/000098.htm, and the main code of the original page is as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html;charset=gb2312" http-equiv="Content-Type">
<title>Beijing hit by heaviest rainstorm this year; flooding in many places, 2 people suspected killed by lightning_News_Tencent</title>
....
<body id="Article-QQ">
...
<div id="site_nav_mblogin" class="site_nav_mblogin" bossZone="mbLogin">
...
<div id="main_nav_qq" bossZone="mainNav">
...
<div class="wrapper" id="Main-Article-QQ"><div class="bd" id="MainL">
...
<h1>Beijing hit by heaviest rainstorm this year; flooding in many places, 2 people suspected killed by lightning</h1>
...
Activated the level III rainstorm emergency response. At 2 p.m., the Municipal Meteorological Bureau observed signs of approaching heavy rain and immediately issued a yellow lightning warning followed by a blue rainstorm warning.</P><P style="TEXT-INDENT:2em">After that, western areas such as Haidian were the first to see heavy rain. The Meteorological Bureau ...
</div>
...
The top of the page has two navigation sections, placed in two div blocks whose ids are "site_nav_mblogin" and "main_nav_qq" respectively, while the id of the div block holding the main body text is "Cnt-Main-Article-QQ". It can be seen that, when filtering with the structure tag dictionary, the navigation blocks contain the tag word nav while the body text block does not. By calculating the similarity between the tag words contained in a tag's id or class and the tag words in the structure tag dictionary, it can be determined that some divs need to be kept while others need to be filtered out.
Referring to Fig. 3, the structural tag filtering step 220 specifically comprises:
Lookup step S221: for each structural tag, look up the tag words in the text of its id attribute and/or class attribute among the tag words of the structure tag dictionary.
When the tag words in the id attribute and/or class attribute text are looked up among the tag words of the structure tag dictionary, the lookup can be done by longest-string matching.
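A minimal sketch of a longest-match lookup, assuming a greedy left-to-right scan in which the longest dictionary word starting at each position wins. The description does not fix the scanning strategy, so the greedy scan and the name `match_tag_words` are assumptions.

```python
STRUCTURE_WORDS = {
    "nav", "menu", "head", "comment", "bottom", "top", "link",
    "foot", "side", "logo", "ads", "login", "copyright",
}


def match_tag_words(attr_text, dictionary=STRUCTURE_WORDS):
    """Return the dictionary words found in attr_text, longest match first."""
    text = attr_text.lower()
    # Try longer words before shorter ones; skip empty entries defensively.
    words = sorted((w for w in dictionary if w), key=len, reverse=True)
    found, i = [], 0
    while i < len(text):
        hit = next((w for w in words if text.startswith(w, i)), None)
        if hit:
            found.append(hit)
            i += len(hit)  # continue after the matched word
        else:
            i += 1
    return found
```

For the id "site_nav_mblogin" this finds nav and login, while "Main-Article-QQ" yields no match, which agrees with the example discussed in the description.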
Tag similarity calculating step S222: based on the lookup result and according to the tag rule set, calculate the tag similarity between the structural tag and the tag words in the structure tag dictionary.
The tag rule set consists of the methods and rules used to judge tags.
Judging and filtering step S223: compare the calculated tag similarity with the preset threshold, and filter the structural tag according to the comparison result.
The tag similarity judgment flow is shown in Fig. 4: when the tag similarity is greater than the threshold, the corresponding structural tag is filtered out; otherwise, the corresponding structural tag is kept.
In the tag similarity calculating step S222, the tag similarity can be calculated as follows:
(1) Tag similarity: [formula not reproduced in the source text]
(2) Tag text similarity: [formula not reproduced in the source text]
(3) Tag semantic similarity: [formula not reproduced in the source text]
where $W_1$ denotes the id and/or class attribute text of the structural tag to be filtered, $W_2$ denotes the text of a tag word in the structure dictionary, len() returns the length of a text, max() returns the larger of its two arguments, $W_s$ denotes the part that $W_1$ and $W_2$ have in common, $W_{Ki}$ denotes the set of hypernyms, hyponyms or synonyms of the tag word vocabulary, the ∩ operation denotes synonym comparison, and M is an adjustment parameter.
All three formulas are used in the similarity calculation. Formula (1) uses the results of formulas (2) and (3); that is, the similarity calculation of (1) comprises two parts, the text similarity sims and the semantic similarity simw. In the application M is set to 0.01, and the threshold is set to 0.5, which is the optimal value in this application. The threshold is used only to judge the tag similarity; the text and semantic similarities are computed only as intermediate results.
For example, for the web page at the above URL (http://news.qq.com/a/20110725/000098.htm), the tag name of the div block at the top of the page, class="site_nav_mblogin", is compared for similarity with nav and login in the structure tag dictionary. With $W_1$ = site_nav_mblogin and $W_2$ = nav (from the structure tag dictionary), the text similarity is calculated with the formula as:
simw = 1/[(3/3 - 3/16) * 0.01] = 16/0.39 = 41.026, where M is set to 0.01;
and the semantic similarity is calculated with the formula as:
sims = max(3∩13/13, 3∩13/3) = max(2/13, 2/3) = 2/3 = 0.6667
Here the ∩ operation performs the synonym comparison: 3 means that $W_1$ contains 3 tag words, and 13 means that this structure tag dictionary contains 13 filtering tag words (nav, menu, head, comment, bottom, top, link, foot, side, logo, ads, login and copyright), of which two are the same, so 3∩13 = 2.
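The max(…) expression in this worked example can be reproduced with a short sketch. It assumes that $W_1$ is split into tag words on non-letter characters and that a tag word counts as "the same" when it contains a dictionary word (so mblogin matches login); the matching rule and both function names are assumptions, since the description only gives the counts.

```python
import re

STRUCTURE_WORDS = [
    "nav", "menu", "head", "comment", "bottom", "top", "link",
    "foot", "side", "logo", "ads", "login", "copyright",
]


def overlap_count(attr_text, dictionary=STRUCTURE_WORDS):
    """Split the attribute text into tag words and count those that hit the dictionary."""
    tokens = [t for t in re.split(r"[^a-z]+", attr_text.lower()) if t]
    # Assumed matching rule: a token hits if it contains any dictionary word.
    hits = sum(1 for t in tokens if any(w in t for w in dictionary))
    return tokens, hits


def overlap_ratio(attr_text, dictionary=STRUCTURE_WORDS):
    """max(hits / dictionary size, hits / number of tag words), as in the example."""
    tokens, hits = overlap_count(attr_text, dictionary)
    if not tokens or not dictionary:
        return 0.0
    return max(hits / len(dictionary), hits / len(tokens))
```

For "site_nav_mblogin" the three tag words are site, nav and mblogin, two of which hit the 13-word dictionary, giving max(2/13, 2/3) = 2/3.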
Finally, the similarity is calculated with the formula as:
sim = log(sims*simw/16) = 0.70
That is, the similarity is 70%. The calculation for $W_1$ = site_nav_mblogin and $W_2$ = login is the same as for nav. Since the similarity is greater than the set threshold of 0.5, the block is filtered out and is not displayed in the page.
For a structural tag that is kept after filtering against the structure tag dictionary, the above steps can be repeated on its next-level child tags until all tags have been filtered. That is, it is judged whether the retained structural tag contains child tags; if it does, the flow returns to step 210 to continue filtering, and so on in a loop until all tags have been filtered.
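The level-by-level loop can be sketched as a recursive function. In this sketch the structural check is a simple stand-in (a substring test against a few dictionary words) rather than the similarity-and-threshold calculation of the description; the `Node` class, the `STRUCTURAL` set and `is_noise` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


TAG_LIBRARY = {"text", "img", "a", "div", "dl", "ul", "ol", "table", "p",
               "h1", "h2", "h3", "h4", "h5", "h6", "span", "tr", "td"}
STRUCTURAL = {"div", "dl", "ul", "ol", "table", "tr", "td"}  # assumed subset
NOISE_WORDS = ("nav", "menu", "foot", "ads", "login", "copyright")


def is_noise(node):
    """Stand-in for the dictionary check; NOT the patent's similarity formula."""
    text = (node.attrs.get("id", "") + " " + node.attrs.get("class", "")).lower()
    return any(word in text for word in NOISE_WORDS)


def filter_tree(node):
    """Filter the children of node in place, then recurse into the kept ones."""
    kept = []
    for child in node.children:
        if child.tag not in TAG_LIBRARY:
            continue  # preliminary filtering (step 210)
        if child.tag in STRUCTURAL and is_noise(child):
            continue  # structural filtering (step 220)
        filter_tree(child)  # repeat on the next level
        kept.append(child)
    node.children = kept
    return node
```

Because a filtered structural tag is dropped before recursion, its whole subtree disappears without being visited, which keeps the traversal cheap on pages with large navigation blocks.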
Tags comprise document-type tags, image-type tags and structural tags; document-type and image-type tags are leaf nodes of the DOM tree, while structural tags can contain next-level tags.
After each level of child tags has been filtered against the tag library, that is, after the child tags of a given parent tag have been filtered, a retained text tag and its content can be passed, together with the corresponding parent tag, to the page collating step.
For a retained image tag, when the size of the image it refers to is below the preset size threshold, the image tag and its content are passed, together with the corresponding parent tag, to the page collating step.
For example, when the image referred to by an image tag is larger than 20mm × 30mm, the tag and its content are not output; when the image is smaller than 20mm × 30mm, the tag and its content are output. The content of an image tag may be a link, with the picture displayed via that link; in that case, it can first be judged whether the image the link points to is below the size threshold, and the result determines whether the image tag and its content are passed, together with the corresponding parent tag, to the page collating step.
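The size check can be sketched as below. The description gives the threshold as 20mm × 30mm but does not say how the two dimensions are compared or what happens at exactly the threshold; this sketch assumes both dimensions must be strictly below it, and `keep_image` is a hypothetical name.

```python
def keep_image(width_mm, height_mm, max_width_mm=20.0, max_height_mm=30.0):
    """Return True when the image is below the size threshold in both dimensions.

    Assumption: an image exactly at the threshold is treated as too large.
    """
    return width_mm < max_width_mm and height_mm < max_height_mm
```

When the image tag only carries a link, the dimensions would have to be obtained from the linked image before this check is applied.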
A retained structural tag proceeds to the structural tag filtering step, where the structure tag dictionary decides whether it is filtered out; if the structural tag is kept, the flow moves on to filtering its next-level child tags against the tag library.
Outputting each child tag jointly with its parent tag keeps the overall structure of the output page unchanged.
Page collating step 300: write the tags retained in the filtered DOM tree, together with the content they contain, into the display frame according to their corresponding structure.
Page returning step 350: return the collated result to the client.
In this process, the tags retained at each level can be written into the display frame according to their structure as each level is filtered; alternatively, after all tags of the DOM tree have been filtered, the retained tags and their content can be written into the display frame together according to their structure.
After the whole page has been collated, the server can send it to the client.
Further, the collated page can be stored temporarily on the server, for example in the server's cache, so that within a certain period the server can send it directly to clients that access the same web page, which avoids collating the same page repeatedly.
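The temporary storage described here behaves like an in-memory cache with a time limit. A minimal sketch, assuming pages are keyed by URL and that the retention period (`ttl_seconds`) is a configuration choice; the class name and the default value are assumptions.

```python
import time


class PageCache:
    """In-memory cache of converted pages, keyed by URL, with a time limit."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # hypothetical retention period
        self.clock = clock      # injectable so the cache can be tested
        self._store = {}

    def put(self, url, page):
        self._store[url] = (self.clock(), page)

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        saved_at, page = entry
        if self.clock() - saved_at > self.ttl:
            del self._store[url]  # expired: the page must be converted again
            return None
        return page
```

A request handler would call `get` first and run the conversion only on a miss, storing the result with `put`.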
Referring to Fig. 5, a schematic flowchart of the method of another preferred embodiment of the application is shown.
First, the web page is fetched and htmlparser is used to parse the DOM tree structure of its HTML document. The title extraction process (TitleExtractor) then reads the content of the page's title element and uses it as the title of the filtered page template, while the content extraction process (ContentExtractor) covers the parsing of the page content. In the content extraction process (ContentExtractor), each level of tags is filtered with the tag library, navigation tag words and advertisement tag words are filtered with the navigation filtering dictionary NavFilter, and header tag words and footer tag words are filtered with the footer filtering dictionary FootFilter. Finally, paging elements are added to the filtered content and it is paged: for example, one page shows 3500 characters and anything beyond that is paged with Pager. All the filtered elements are then inserted into the display frame, and the output is the collated page. The filtering process can filter tags by recursive looping.
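The paging stage can be sketched as fixed-size slicing of the filtered text. This assumes plain character counts with the 3500-character page size mentioned above; how the actual Pager treats markup or word boundaries is not described, and `paginate` is a hypothetical name.

```python
def paginate(text, page_size=3500):
    """Split filtered text into pages of at most page_size characters."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    pages = [text[i:i + page_size] for i in range(0, len(text), page_size)]
    return pages or [""]  # an empty document still yields one (empty) page
```

An 8000-character article would therefore be served as three pages, the last one holding the remaining 1000 characters.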
For example, for the web page at the above URL (http://news.qq.com/a/20110725/000098.htm), TitleExtractor obtains the page's title element from "<title>Beijing hit by heaviest rainstorm this year; flooding in many places, 2 people suspected killed by lightning_News_Tencent</title>" and uses it as the title of the template. ContentExtractor applies the above filtering process to <div class="wrapper" id="Main-Article-QQ">, <div id="site_nav_mblogin" class="site_nav_mblogin" bossZone="mbLogin"> and <div id="main_nav_qq" bossZone="mainNav">. The content that remains is taken from the <div class="wrapper" id="Main-Article-QQ"> block, because the <div id="site_nav_mblogin" class="site_nav_mblogin" bossZone="mbLogin"> and <div id="main_nav_qq" bossZone="mainNav"> blocks are both filtered out.
Referring to Fig. 11, it shows a partial screenshot of the new page obtained after the application converts the page at the URL http://news.qq.com/a/20110725/000098.htm. The main source code of the page obtained by converting the original HTML document is as follows:
... (top of page omitted)
<body>
<div class="out">
<div class="title">
<h2>
Beijing hit by heaviest rainstorm this year; flooding in many places, 2 people suspected killed by lightning_News_Tencent
</h2>
</div>
<div class="content">
<br/><a target="_blank"
href="html2wml?start=0&page=3500&url=http%3A%2F%2Fnews.qq.com"
bossZone="crumbLogo"><img
src="http://matl.gtimg.com/www/images/channel_logo/news_logo.png" alt="Tencent News" style=></a><br/>Yesterday, at Beiyuan Bridge in Tongzhou, rainfall caused water to accumulate on this stretch of road. At 7:30 last night, at the entrance of Sihui subway station, pedestrians stepped on bricks to cross the standing water.
....
...
<b>Related reading:</b>&#183;<a href="html2wml?start=0&page=3500&url=http%3A%2F%2Fnews.qq.com%2Fa%2F20110725%2F000097.htm" target="_blank" class="RelaLinkStyle">Beijing hit by heaviest rainstorm; heavy congestion expected in this morning's rush hour</a>
</div>
<div class="page">
<a
href="html2wml?start=3571&page=3500&url=http%3A%2F%2Fnews.qq.com%2F%2Fa%2F20110725%2F000098.htm">
[Next page]</a>
</div>
<div class="ft">
<div class="cmt">
Here, the source code from <br/><a target="_blank" href="html2wml?start=0&page=3500&url=http%3A%2F%2Fnews.qq.com" to "Beijing hit by heaviest rainstorm; heavy congestion expected in this morning's rush hour</a>" is the result of converting the original HTML document; everything else is template code of the display frame.
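The rewritten links in the converted page follow the pattern html2wml?start=…&page=…&url=…, with the target address percent-encoded. A minimal sketch of building such a link with the standard library; the meaning attributed to start (character offset) and page (page size) is inferred from the sample output rather than stated in the description, and `proxied_link` is a hypothetical name.

```python
from urllib.parse import urlencode


def proxied_link(target_url, start=0, page_size=3500, endpoint="html2wml"):
    """Build a link that routes target_url back through the converter.

    start and page_size are assumed to mean character offset and page length.
    """
    query = urlencode({"start": start, "page": page_size, "url": target_url})
    return f"{endpoint}?{query}"
```

Routing every link in the retained content through the converter keeps the user inside the converted view when following related articles or the next-page link.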
Corresponding to the above rapid page conversion method, an embodiment of the application also provides a rapid page conversion device, whose structure, as shown in Fig. 6, comprises:
Request receiving module 400, used to receive the user's page request;
Page returning module 600, used to return the collated result to the client.
Further, for the tags at each level of the DOM tree, the tag filtering module 500, as shown in Fig. 7, can specifically comprise:
Preliminary tag filtering module 510, used to first filter the child tags of the current level according to the tag library;
Structural tag filtering module 520, used to filter the structural tags retained by the preliminary filtering according to the structure tag dictionary;
For a structural tag kept after filtering against the structure tag dictionary, the operations of the above modules are repeated on its next-level child tags until all tags have been filtered.
The preliminary tag filtering module 510 specifically comprises a tag judging module, used to perform the following after each level of child tags has been filtered according to the tag library:
For a retained text tag, passing the text tag and its content, together with the corresponding parent tag, to the page collating step;
For a retained image tag, passing the image tag and its content, together with the corresponding parent tag, to the page collating step when the size of the image it refers to is below the preset size threshold;
For a retained structural tag, proceeding to the structural tag filtering step.
Further, the structural tag filtering module 520, as shown in Fig. 8, specifically comprises:
Lookup module S521, used to look up, for each structural tag, the tag words in the text of its id attribute and/or class attribute among the tag words of the structure tag dictionary;
Tag similarity obtaining module S522, used to calculate, based on the lookup result and according to the tag rule set, the tag similarity between the structural tag and the tag words in the structure tag dictionary;
Judging and filtering module S523, used to compare the calculated tag similarity with the preset threshold and filter the structural tag according to the comparison result.
With reference to Fig. 9, it shows the structural representation of the quick converting system of the preferred a kind of page of the application.
Described system comprises user side 700 and server 800; Described server 700 comprises the device of page fast conversion method;
Described user side 700 is used for submitting to request of access and receives the page;
Described server 800 is used for receiving request of access and the transmits page data arrive its page quick switching device, and the page that page quick switching device is generated sends to user side 700;
The page fast conversion device comprises:
A request receiving module, used for receiving a page request from the user side;
A page acquisition module, used for obtaining the page document and parsing the DOM tree structure of the page document;
A label filtering module, used for filtering the labels at each level of the DOM tree according to the tag library and the structure label dictionary;
A page arrangement module, used for writing the labels of the filtered DOM tree, together with their content, into the display frame according to their corresponding structure;
A page return module, used for returning the arranged result to the user side.
Preferably, for the labels at each level of the DOM tree, the label filtering module specifically comprises:
A preliminary label filtering module, used for first filtering the child labels of the current level according to the tag library;
A structural-type label filtering module, used for filtering the structural-type labels retained after the preliminary filtering according to the structure label dictionary.
In practical applications, a web page can thus be quickly converted into a WAP page.
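The two-stage, level-by-level filtering performed by the label filtering module can be illustrated with a small recursive sketch. Everything concrete in it is an assumption made for illustration: the contents of `TAG_LIBRARY`, `STRUCTURAL_TAGS` and `NOISE_WORDS`, the `Node` type, and the exact-word dictionary check (which stands in for the similarity calculation).

```python
from dataclasses import dataclass, field
from typing import Optional

TAG_LIBRARY = {"div", "p", "a", "img", "ul", "li", "h1"}  # assumed allowed tags
STRUCTURAL_TAGS = {"div", "ul"}                            # assumed structural tags
NOISE_WORDS = {"nav", "footer", "header", "advert"}        # assumed dictionary words


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def is_noise(node: Node) -> bool:
    """Stand-in dictionary check: exact word match on the id/class text."""
    text = f"{node.attrs.get('id', '')} {node.attrs.get('class', '')}"
    return any(word in NOISE_WORDS for word in text.lower().split())


def filter_level(node: Node) -> Optional[Node]:
    """Filter one node, then recurse into the next level of child labels."""
    if node.tag not in TAG_LIBRARY:                      # preliminary filtering
        return None
    if node.tag in STRUCTURAL_TAGS and is_noise(node):  # structural-type filtering
        return None
    filtered = (filter_level(child) for child in node.children)
    kept = [child for child in filtered if child is not None]
    return Node(node.tag, node.attrs, node.text, kept)


def render(node: Node) -> str:
    """Write a filtered subtree back out, preserving its nesting."""
    inner = node.text + "".join(render(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"
```

Because a rejected structural label returns `None` before its children are visited, an entire navigation or footer subtree is discarded in one step.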
Figure 10 shows a workflow schematic diagram of a page fast conversion system of the application.
The user side submits an access request to the server;
The server passes the data to the page fast conversion device;
The page fast conversion device first receives the access request, then obtains the page document according to the request, then traverses the DOM tree using the tag library, the structure label dictionary and the label rule set to filter the labels, and finally writes the result into the display frame; a new page is thereby generated in the server.
The server then returns the new page to the user side.
The new page can be returned to the user side in real time, or it can be kept temporarily in the server for a certain period; if, during that period, the server receives access requests from other users for the same page, it can send the already converted new page directly to those user sides.
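The temporary retention of a converted page can be sketched as a small in-memory cache with a time limit. The class name `ConvertedPageCache`, the 60-second default and the injectable clock are assumptions of this sketch; the text does not specify how long a page is kept or how it is keyed.

```python
import time
from typing import Callable


class ConvertedPageCache:
    """Keep converted pages in memory for a limited period."""

    def __init__(
        self,
        convert: Callable[[str], str],
        ttl_seconds: float = 60.0,                     # assumed retention period
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._convert = convert
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still within the period: reuse the converted page
        page = self._convert(url)  # expired or never seen: convert again
        self._entries[url] = (now, page)
        return page
```

Keying on the requested URL means a second user asking for the same page within the period receives the stored result without another conversion.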
With the application: first, the process uses label parsing to recursively parse the DOM tree of the HTML in memory and thereby obtain the web page structure, which requires no local storage at all; effective content is then filtered by means of the structure label dictionary and the label rule set, which only requires that the defined structure label dictionary be resident in local memory and needs no database or file storage;
Secondly, the application writes the denoised and filtered content into a network data stream or a file for display to the user, that is, the complete filtered content is written into the display frame template to output the WAP page; this suits many scenarios and is highly general;
Thirdly, the whole process of the application can handle web pages completely online and in real time.
In short, the whole process of the application can be completed online in real time without any local storage; it is fast, all core computation is done in memory, excessive file I/O and database operations are avoided, and web pages can be quickly converted into WAP pages.
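Parsing the DOM tree entirely in memory can be sketched with Python's standard `html.parser` module. This is an illustration under assumptions: the parser used by the application is not named in the text, `html.parser` is event-based, so the tree is built here with an explicit stack, and the names `TreeBuilder`, `parse_dom`, `count_nodes` and the `VOID_TAGS` set are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without an end tag


class TreeBuilder(HTMLParser):
    """Build a nested dict tree from HTML without touching disk or a database."""

    def __init__(self) -> None:
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "text": "", "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1]["text"] += data.strip()


def parse_dom(html: str) -> dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def count_nodes(node: dict) -> int:
    """Recursively walk the tree, as the label filter would during traversal."""
    return 1 + sum(count_nodes(child) for child in node["children"])
```

The whole tree lives in ordinary Python objects, so filtering and rendering can run on it directly and the result can be written straight to the response stream.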
Each embodiment in this specification is described in a progressive manner; each embodiment emphasizes its differences from the other embodiments, and the identical or similar parts of the embodiments can be referred to one another. The system embodiment is described relatively briefly because it is basically similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.
The page fast conversion method, device and system provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help in understanding the method of the application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the application, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the application.
Claims (10)
1. A page fast conversion method, characterized in that it comprises:
A request receiving step: receiving a page request from a user side;
A page obtaining step: obtaining a page document according to the request, and parsing the DOM tree structure of the document;
A label filtering step: filtering the labels at each level of the DOM tree according to a tag library and a structure label dictionary;
A page arrangement step: writing the labels in the filtered DOM tree, together with their content, into a display frame according to their corresponding structure;
A page return step: returning the arranged result to the user side.
2. the method for claim 1 is characterized in that:
Described label filtration step specifically comprises, for the labels at different levels in the dom tree, carries out following steps:
Preliminary label filtration step for the subtab of current level, filters this grade subtab according to tag library;
Structural type label filtration step, the structural type label for described reservation after filtering according to structure label dictionary, filters it.
3. The method of claim 2, characterized in that:
The preliminary label filtering step comprises a label judging step:
For a retained text label, passing the text label and its content, together with the corresponding parent label, to the page arrangement step;
For a retained image label, when the size of the image indicated by the image label is below the preset size threshold, passing the image label and its content, together with the corresponding parent label, to the page arrangement step;
For a retained structural-type label, passing the label to the structural-type label filtering step.
4. the method for claim 1 is characterized in that:
The label word of described structure label dictionary comprises the label word in the text that label id attribute and class attribute comprise; Wherein, the frequency is selected according to statistics for described label root.
5. The method of claim 4, characterized in that:
The structural-type label filtering step specifically comprises:
A search step: for each structural-type label, taking the label words in the text of its id attribute and/or class attribute and performing a matching lookup among the label words of the structure label dictionary;
A label similarity calculation step: based on the lookup result and the label rule set, calculating the label similarity between the structural-type label and the label words in the structure label dictionary;
A judging and filtering step: comparing the calculated label similarity with a preset threshold, and filtering the structural-type label according to the comparison result.
6. The method of claim 5, characterized in that:
The label similarity is calculated from a label text similarity and a label semantic similarity.
7. The method of claim 6, characterized in that:
The label text similarity is calculated as: (formula image not reproduced in this text)
The label semantic similarity is calculated as: (formula image not reproduced in this text)
The label similarity is calculated as: (formula image not reproduced in this text)
Wherein $W_1$ denotes the id and/or class attribute text of the structural-type label to be filtered, $W_2$ denotes the text of a label word of the structure dictionary, $\mathrm{len}()$ returns the text length, $\max()$ returns the larger of two values, $W_s$ denotes the similar part of $W_1$ and $W_2$, $W_{Ki}$ denotes the set of hypernyms, hyponyms or synonyms of the label word vocabulary, and the $\cap$ operation denotes synonym comparison.
8. The method of claim 5, characterized in that:
The judging and filtering step specifically comprises:
When the label similarity is greater than the threshold, filtering out the structural-type label.
9. The method of claim 5, characterized in that:
The structure label dictionary comprises a navigation filtering dictionary and a footer filtering dictionary; the label words filtered by the navigation filtering dictionary comprise navigation label words and advertisement label words, and the label words filtered by the footer filtering dictionary comprise header label words and footer label words.
10. A page fast conversion device, characterized in that it comprises:
A request receiving module, used for receiving a page request from a user side;
A page acquisition module, used for obtaining a page document and parsing the DOM tree structure of the page document;
A label filtering module, used for filtering the labels at each level of the DOM tree according to a tag library and a structure label dictionary;
A page arrangement module, used for writing the labels of the filtered DOM tree, together with their content, into a display frame according to their corresponding structure;
A page return module, used for returning the arranged result to the user side.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110270268.3A CN102999511B (en) | 2011-09-13 | 2011-09-13 | A kind of page fast conversion method, device and system |
HK13106043.9A HK1179012A1 (en) | 2011-09-13 | 2013-05-22 | Method, device and system for rapid conversion of a page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102999511A (en) | 2013-03-27 |
CN102999511B (en) | 2016-04-13 |
Family
ID=47928085
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201110270268.3A | Active | CN102999511B (en) | 2011-09-13 | 2011-09-13 | A kind of page fast conversion method, device and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102999511B (en) |
HK (1) | HK1179012A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040177327A1 (en) * | 2000-02-04 | 2004-09-09 | Robert Kieffer | System and process for delivering and rendering scalable web pages |
CN101197849A (en) * | 2007-12-21 | 2008-06-11 | 腾讯科技(深圳)有限公司 | Method and device for commuting internet page into wireless application protocol page |
CN101446983A (en) * | 2009-01-12 | 2009-06-03 | 腾讯科技(深圳)有限公司 | Method, system and equipment for realizing web page acquisition by mobile terminal |
CN101860533A (en) * | 2010-05-26 | 2010-10-13 | 卓望数码技术(深圳)有限公司 | Data transmission method based on C/S architecture browser and server |
CN102163233A (en) * | 2011-04-18 | 2011-08-24 | 北京神州数码思特奇信息技术股份有限公司 | Method and system for converting webpage markup language format |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226599A (en) * | 2013-04-23 | 2013-07-31 | 翁杰 | Method and system for accurately extracting webpage content |
CN103226599B (en) * | 2013-04-23 | 2018-09-28 | 翁杰 | A kind of method and system of accurate extraction web page contents |
CN103699665A (en) * | 2013-12-27 | 2014-04-02 | 贝壳网际(北京)安全技术有限公司 | Method and device for filtering web page advertisements |
CN105447920A (en) * | 2014-06-26 | 2016-03-30 | 阿里巴巴集团控股有限公司 | Electronic gate pass generation method, electronic gate pass generation device and electronic gate pass generation system |
CN104298721A (en) * | 2014-09-25 | 2015-01-21 | 宇威科技发展(青岛)有限公司 | Split screen layout editing method for any number of objects in on-line courseware making based on Web |
CN107193815A (en) * | 2016-03-14 | 2017-09-22 | 阿里巴巴集团控股有限公司 | A kind of processing method of page code, device and equipment |
CN107193815B (en) * | 2016-03-14 | 2021-03-12 | 阿里巴巴集团控股有限公司 | Page code processing method, device and equipment |
CN107622087A (en) * | 2017-08-17 | 2018-01-23 | 珠海云游道科技有限责任公司 | User-friendly document management apparatus and method |
CN107622087B (en) * | 2017-08-17 | 2024-03-22 | 珠海云游道科技有限责任公司 | Document management apparatus and method convenient for user operation |
CN110377884A (en) * | 2019-06-13 | 2019-10-25 | 北京百度网讯科技有限公司 | Document analytic method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN102999511B (en) | 2016-04-13 |
HK1179012A1 (en) | 2013-09-19 |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1179012; Country of ref document: HK |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: GR; Ref document number: 1179012; Country of ref document: HK |