CN103246680B

CN103246680B - Method and device for aggregating and presenting web page content in a browser

Info

Publication number: CN103246680B
Application number: CN201210031482.8A
Authority: CN
Inventors: 蒋进舟; 滕跃龙
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-02-13
Filing date: 2012-02-13
Publication date: 2016-05-18
Anticipated expiration: 2032-02-13
Also published as: CN103246680A

Abstract

The invention provides a method for aggregating and presenting web page content in a browser, comprising: generating an information source identifier according to an information source selected by the user; analyzing, by a predetermined method, the web page content corresponding to the information source identifier, extracting the corresponding web page content and saving it; and, when the user opens the information aggregation page of the browser, reading and displaying the corresponding web page content. By analyzing web page content, extracting the corresponding content, saving it, and displaying it to the user, the invention can aggregate web page content in the browser even when the corresponding website does not provide an RSS or Atom subscription, so the user does not need to visit each website.

Description

Method and device for aggregating and presenting web page content in a browser

Technical field

The invention provides a method and device for aggregating and presenting web page content in a browser, and belongs to the technical field of web page content aggregation.

Background technology

When browsing the Internet, a user often follows the content of several websites. Without web page information aggregation, a user who wants to check the information he or she follows can only browse each website in turn until finished. The whole process is shown in Fig. 1.

To address this problem, current browsers have generally introduced an aggregation function. By subscribing to the RSS (Really Simple Syndication) or Atom (an XML-based document format and HTTP-based protocol, used by websites and client tools to aggregate network content) feeds that websites provide, the information the user follows is pulled to the local machine and combined. The aggregation process is shown in Fig. 2. With this approach, however, if a website does not provide an RSS or Atom subscription, its information cannot be aggregated in the browser, and the user must visit the website to browse the corresponding content.

Summary of the invention

The invention solves a problem in the web page content aggregation technology of existing browsers: when the accessed content cannot be aggregated in the browser, the corresponding website must be visited to browse it. To that end, the invention provides a method and device for aggregating and presenting web page content in a browser.

A method for aggregating and presenting web page content in a browser, comprising:

generating an information source identifier according to the information source selected by the user;

analyzing, by a predetermined method, the web page content corresponding to the information source identifier, extracting the corresponding web page content and saving it, and, when the user opens the information aggregation page of the browser, reading and displaying the corresponding web page content.

A device for aggregating and presenting web page content in a browser, comprising:

an identifier generation module, configured to generate an information source identifier according to the information source selected by the user;

an aggregation presentation module, configured to analyze, by a predetermined method, the web page content corresponding to the information source identifier, extract the corresponding web page content and save it, and, when the user opens the information aggregation page of the browser, read and display the corresponding web page content.

As can be seen from the technical solution provided by the invention, by analyzing web page content, extracting the corresponding content, saving it, and displaying it to the user, the corresponding web page content can be aggregated in the browser even when the corresponding website does not provide an RSS or Atom subscription, and the user does not need to visit each website.

Brief description of the drawings

Fig. 1 is a schematic flow chart of a user browsing each website in turn until finished, in the prior art;

Fig. 2 is a schematic flow chart of aggregating web page content through content subscription, in the prior art;

Fig. 3 is a schematic flow chart of the method for aggregating and presenting web page content in a browser, provided by a specific embodiment of the invention;

Fig. 4 is a schematic diagram of the marks of the regions in the Tencent homepage, provided by a specific embodiment of the invention;

Fig. 5 is a schematic flow chart of generating the aggregation page after crawler analysis is added, provided by a specific embodiment of the invention;

Fig. 6 is a schematic structural diagram of the device for aggregating and presenting web page content in a browser, provided by a specific embodiment of the invention.

Detailed description of the invention

A specific embodiment of the invention provides a method for aggregating and presenting web page content in a browser, comprising: generating an information source identifier according to the information source selected by the user; analyzing, by a predetermined method, the web page content corresponding to the information source identifier, extracting the corresponding web page content and saving it; and, when the user opens the information aggregation page of the browser, reading and displaying the corresponding web page content. This embodiment is described below with reference to the drawings, taking as an example the aggregation and presentation of content from a website that does not support content subscription. As shown in Fig. 3, the method comprises:

Step 31: generate an information source identifier according to the information source selected by the user.

Because some existing websites do not provide an RSS or Atom subscription, their information cannot be aggregated in the browser, and a user who wants to check the information he or she follows can only browse each website. Take today's news on the Tencent homepage as an example: because no subscription is provided for this information, a user who wants to check it can only do so by visiting the Tencent homepage.

Specifically, most existing web pages are formed by nesting multiple regions, and each region has its own name or mark. The mark can be the id or className of a web page element, or even the ordinal number of the element within its region. Taking www.qq.com as an example, as shown in Fig. 4, every small region in the www.qq.com page has a mark, so once the user has selected a region of interest in the page, that region can be uniquely represented by its mark. Each region contains several information sources, each comprising a link or address. For example, the mark of the first row of www.qq.com is #TextNav, the mark of the search box in the second row is #SOSO, the mark of the news center in the lower left corner is #NewsInfo, and the mark of today's topics on the right is #txArea.

After the user has selected an information source on www.qq.com, for example the news center in the lower left corner, the server of www.qq.com needs to generate, according to the selected news center, an identifier that uniquely identifies this information source on the network, namely the news center mark #NewsInfo. The identifier can be expressed as a URL plus an element path, but it is not limited to this form, which is given here only as an example. For example, when the news region needs to be saved, the following correspondence can be established:

URL	Element path	Content
***.com	#txArea#NewsInfo	The HTML of this region, or the content extracted after analysis

Later, when the user opens the browser, the server of www.qq.com can use this URL and element path to fetch the page they point to again and update its content.
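The correspondence between URL, element path, and saved content can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `SourceIdentifier` and `ContentStore`, and the example URL, are hypothetical and do not come from the patent, which leaves the storage format open.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SourceIdentifier:
    """Hypothetical identifier: a page URL plus an element path in that page."""
    url: str
    element_path: str  # e.g. "#txArea#NewsInfo", the form used in the table


class ContentStore:
    """Hypothetical store mapping each identifier to the content saved for it."""

    def __init__(self) -> None:
        self._saved: Dict[SourceIdentifier, str] = {}

    def save(self, ident: SourceIdentifier, content: str) -> None:
        # Saving again under the same identifier overwrites the old content,
        # which is how a later re-fetch would update it.
        self._saved[ident] = content

    def load(self, ident: SourceIdentifier) -> Optional[str]:
        # Returns None when nothing has been saved for this identifier.
        return self._saved.get(ident)


# Usage: save the HTML of the news region, then read it back later.
store = ContentStore()
news = SourceIdentifier("http://example.com", "#txArea#NewsInfo")
store.save(news, "<ul><li>headline</li></ul>")
saved_html = store.load(news)
```

Making the identifier immutable and hashable lets it serve directly as a dictionary key, so two identifiers with the same URL and element path refer to the same saved region.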

Step 32: analyze, by a predetermined method, the web page content corresponding to the information source identifier, extract the corresponding web page content and save it, and, when the user opens the information aggregation page of the browser, read and display the corresponding web page content.

For a website that does not provide an RSS or Atom subscription, the server must actively analyze the source code of the web page in order to extract the content the user cares about and ignore the parts of the website's code the user does not care about.

Specifically, taking www.qq.com as an example, when the structure of the web page is relatively simple or fairly common, for example simple URL links, the server of www.qq.com can analyze the web page directly in the browser. After the user has selected the news center in the lower left corner in step 31, the server of www.qq.com can directly fetch the HTML code of the region represented by the information source identifier marked #NewsInfo, analyze it, and save its content locally. When the user opens the information aggregation page of the browser, the content is simply read and displayed.

Here, HTML code is descriptive text made up of HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The structure of HTML code comprises two main parts, the head (Head) and the body (Body): the head describes information the browser needs, and the body contains the specific content to be presented. Extraction starts from the URL of an initial page and obtains the URLs on that page. While pages are being extracted, new URLs are continually extracted from the current page and put into a queue until a stop condition of the system is met. From the body of the HTML code, the links and text inside it can be analyzed and then extracted by the server of www.qq.com. When the user has finished browsing one region and moves to another, the server of www.qq.com can use the same method to fetch new content, according to the information source identifier of the content the user selected, for the user to browse.

The specific analysis process includes extracting the links and text inside the HTML code. In the source code a link is enclosed by the <a></a> tag, and text not enclosed by this tag can be regarded as ordinary text. As long as the text content and links can be extracted, they can be presented. For example, if a web page in HTML format contains links or lists, they can be extracted by searching for the <a>, <ul>, <ol>, and <li> tags during analysis and extracting the information on these tags.
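The tag-based extraction described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the patent's implementation: the class name `LinkAndTextExtractor` and the helper `extract` are hypothetical, and only the `<a>` case is handled explicitly. Text inside list items is collected as ordinary text.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collects (href, text) pairs for <a> tags and plain text outside them."""

    def __init__(self):
        super().__init__()
        self.links = []        # list of (href, link text) tuples
        self.plain_text = []   # non-empty text found outside any <a> element
        self._in_a = False     # True while an <a> element is open
        self._href = None      # href of the currently open <a>
        self._buf = []         # text pieces collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._buf).strip()))
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self._buf.append(data)
        elif data.strip():
            self.plain_text.append(data.strip())


def extract(html):
    """Return (links, plain_text) for an HTML fragment."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.plain_text


links, text = extract(
    '<ul><li><a href="/n/1">First story</a></li><li>Plain item</li></ul>'
)
```

Here `links` holds the link target together with its visible text, and `text` holds the list item that is not wrapped in a link, which matches the distinction the description draws between links and ordinary text.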

Taking www.qq.com as an example, when facing more complex web pages, for example pages that use a frame structure or dynamic links, the generated information source identifier can be submitted to the server of www.qq.com, on which capture methods for various websites are configured. The server performs a dedicated analysis for the information source identifier, for example by means of a crawler, and extracts the part the user cares about most according to predetermined rules, for example according to the identifier of the information source the user selected. The flow of generating the aggregation page after crawler analysis is added is shown in Fig. 5.

Specifically, after the user has selected the frame structure containing the news center in the lower left corner, the server of www.qq.com extracts the news content the user cares about most as follows. According to a predetermined web page analysis algorithm and the web page content represented by the information source identifier, links irrelevant to the topic are filtered out, and useful links are retained and put into the queue of URLs waiting to be fetched. Then, according to a certain search strategy, the URL of the next page to fetch is selected from the queue, and the process is repeated until a stop condition of the system is reached. When the user has finished browsing one region and moves to another, the server of www.qq.com can use the same method to fetch new content, according to the information source identifier of the content the user selected, for the user to browse.
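The queue-driven crawl with link filtering can be sketched as follows. The patent does not specify the analysis algorithm, the search strategy, or the stop condition, so this sketch makes assumptions: `fetch_links` and `is_relevant` are hypothetical callables supplied by the caller, the search strategy is first in, first out, and the stop condition is a page limit `max_pages`.

```python
from collections import deque


def crawl(start_url, fetch_links, is_relevant, max_pages=10):
    """Breadth-first crawl that queues only links judged relevant.

    fetch_links(url) -> iterable of URLs found on that page (assumed helper).
    is_relevant(url) -> bool, standing in for the predetermined analysis rule.
    max_pages        -> assumed stop condition: number of pages to visit.
    """
    queue = deque([start_url])   # URLs waiting to be fetched
    seen = {start_url}           # URLs already queued, to avoid repeats
    visited = []                 # URLs fetched, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()    # search strategy: oldest queued URL first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Usage with an in-memory stand-in for the network.
pages = {
    "/news": ["/news/1", "/ads/x", "/news/2"],
    "/news/1": ["/news/2", "/news/3"],
}
order = crawl(
    "/news",
    fetch_links=lambda url: pages.get(url, []),
    is_relevant=lambda url: url.startswith("/news"),
)
```

Filtering before enqueueing, rather than after fetching, keeps irrelevant pages from ever being downloaded, which is the effect the description attributes to discarding links unrelated to the topic.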

The corresponding analysis method is similar to fetching the HTML code and analyzing it, but because the analysis method can be customized on the server of www.qq.com, it can be made relatively complex. For websites that a general-purpose algorithm cannot handle, rules can be specified manually in the back end, but what is finally extracted is still text content, main links, and the auxiliary information of those links. For example, a rule can specify that if a block in the region has the id "content", it is marked as the auxiliary information of a link, and that if a block has the id "title" and contains an <a> tag, it is a main link. When the user opens the information aggregation page, the server of www.qq.com queries the back end for the organized content of this specific region and displays it in the browser.
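The manually specified rule in the example above can be sketched as a small classifier. This is an illustration under assumptions: the function name `classify_blocks` is hypothetical, and the region is parsed with `xml.etree.ElementTree`, which only accepts well-formed markup, whereas real pages would usually need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET


def classify_blocks(region_markup):
    """Apply the example rule to one region.

    Blocks with id "title" that contain <a> yield main links;
    blocks with id "content" yield auxiliary information.
    """
    root = ET.fromstring(region_markup)
    main_links = []   # list of (href, link text) tuples
    auxiliary = []    # list of text strings attached to the links
    for block in root.iter():
        block_id = block.get("id")
        if block_id == "title":
            for anchor in block.iter("a"):
                text = "".join(anchor.itertext()).strip()
                main_links.append((anchor.get("href"), text))
        elif block_id == "content":
            auxiliary.append("".join(block.itertext()).strip())
    return main_links, auxiliary


region = (
    '<div>'
    '<h3 id="title"><a href="/n/1">Headline</a></h3>'
    '<p id="content">Summary text</p>'
    '</div>'
)
main_links, auxiliary = classify_blocks(region)
```

Keeping the rule in one function makes it straightforward to swap in a different site-specific rule for each website the general-purpose analysis cannot handle.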

Because methods that analyze a web page to extract content are generally complex, this embodiment can also simplify the analysis by downloading the whole web page and displaying only part of it. For example, the <iframe> tag of HTML can be used to embed the whole web page the user follows in the aggregation page, and absolute positioning in CSS can then be used to adjust the position and size of this <iframe>, so that everything except the content the user follows is hidden. Similarly, the structure of the web page can be actively modified: through the DOM interfaces exposed by the browser kernel, such as the IHtmlElement interface of IE, the web page elements are traversed, only the elements in the region the user cares about and their directly related parent elements are kept, and all other elements are deleted. The web page is thereby trimmed, and the trimmed web pages are finally aggregated in the browser's aggregation page.
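The trimming approach, which keeps the selected region and its ancestors and deletes everything else, can be sketched outside a browser. This sketch is an assumption-laden stand-in: it uses `xml.etree.ElementTree` on well-formed markup instead of the browser kernel's DOM interface, and the function name `prune_to_region` is hypothetical.

```python
import xml.etree.ElementTree as ET


def prune_to_region(root, region_id):
    """Delete every element that is not the target region, one of its
    ancestors, or one of its descendants. Returns True if the region exists.
    If the region is absent, all children of root are removed."""

    def keep(elem):
        if elem.get("id") == region_id:
            return True                  # keep the region with its whole subtree
        kept_any = False
        for child in list(elem):         # copy the list: we remove while looping
            if keep(child):
                kept_any = True          # elem is an ancestor of the region
            else:
                elem.remove(child)
        return kept_any

    return keep(root)


page = ET.fromstring(
    '<html><body>'
    '<div id="TextNav">nav</div>'
    '<div id="txArea">'
    '<div id="NewsInfo"><a href="/n/1">Story</a></div>'
    '<div id="other">x</div>'
    '</div>'
    '</body></html>'
)
found = prune_to_region(page, "NewsInfo")
remaining_ids = [div.get("id") for div in page.iter("div")]
```

After the call, only the chain from the root down to the selected region survives, so the trimmed tree can be placed in an aggregation page without the unrelated parts of the original page.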

With the technical solution provided by this embodiment, by analyzing web page content, extracting the corresponding content, saving it, and displaying it to the user, the corresponding web page content can be aggregated in the browser even when the corresponding website does not provide an RSS or Atom subscription, and the user does not need to visit each website.

A specific embodiment of the invention also provides a device for aggregating and presenting web page content in a browser. Each module of the device can be arranged in the server of a website in the form of a software module or a hardware entity. As shown in Fig. 6, the device comprises:

an identifier generation module 61, configured to generate an information source identifier according to the information source selected by the user;

an aggregation presentation module 62, configured to analyze, by a predetermined method, the web page content corresponding to the information source identifier, extract the corresponding web page content and save it, and, when the user opens the information aggregation page of the browser, read and display the corresponding web page content.

Optionally, in the identifier generation module 61, the information source identifier is expressed as a combination of a URL and an element path.

Optionally, the aggregation presentation module 62 can comprise at least one of the following submodules:

a first content extraction submodule, configured to extract the corresponding web page content by searching for the corresponding tags in the links or lists of the corresponding web page content in an HTML web page;

a second content extraction submodule, configured to configure a corresponding web page content capture method according to the information source identifier and to analyze the corresponding web page content by the capture method, so as to extract the corresponding web page content.

Optionally, the aggregation presentation module 62 can further comprise:

an information display submodule, configured to display the corresponding web page content, or to display the whole web page content with the web page content other than the corresponding web page content hidden or deleted.

The processing functions of the modules included in the above device have been described in the preceding method embodiment and are not repeated here.

The above is only a preferred specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention.

Claims

1. A method for aggregating and presenting web page content in a browser, characterized by comprising:

generating an information source identifier according to an information source in a web page region selected by the user in a web page, wherein the web page is formed by nesting multiple regions, each region corresponds to a unique mark, and each region comprises several information sources;

extracting the corresponding web page content by searching for the corresponding tags in the links or lists of the corresponding web page content in an HTML web page, and saving it; or configuring a corresponding web page content capture method according to the information source identifier, analyzing the corresponding web page content by the capture method, extracting the corresponding web page content, and saving it;

when the user opens the information aggregation page of the browser, reading and displaying the corresponding web page content.

2. The method according to claim 1, characterized in that the information source identifier is expressed as a combination of a URL and an element path.

3. The method according to claim 1, characterized in that reading and displaying the corresponding web page content comprises:

displaying the corresponding web page content, or displaying the whole web page content with the web page content other than the corresponding web page content hidden or deleted.

4. A device for aggregating and presenting web page content in a browser, characterized by comprising:

an identifier generation module, configured to generate an information source identifier according to an information source in a web page region selected by the user in a web page, wherein the web page is formed by nesting multiple regions, each region corresponds to a unique mark, and each region comprises several information sources;

an aggregation presentation module, configured to extract the corresponding web page content by searching for the corresponding tags in the links or lists of the corresponding web page content in an HTML web page, and save it; or to configure a corresponding web page content capture method according to the information source identifier, analyze the corresponding web page content by the capture method, extract the corresponding web page content, and save it; and further configured to read and display the corresponding web page content when the user opens the information aggregation page of the browser.

5. The device according to claim 4, characterized in that, in the identifier generation module, the information source identifier is expressed as a combination of a URL and an element path.

6. The device according to claim 4 or 5, characterized in that the aggregation presentation module comprises:

a first content extraction submodule, configured to extract the corresponding web page content by searching for the corresponding tags in the links or lists of the corresponding web page content in an HTML web page.

7. The device according to claim 4 or 5, characterized in that the aggregation presentation module comprises:

a second content extraction submodule, configured to configure a corresponding web page content capture method according to the information source identifier and to analyze the corresponding web page content by the capture method, so as to extract the corresponding web page content.

8. The device according to claim 4, characterized in that the aggregation presentation module further comprises:

an information display submodule, configured to display the corresponding web page content, or to display the whole web page content with the web page content other than the corresponding web page content hidden or deleted.