CN103064943A - Customer premises equipment - Google Patents


Publication number
CN103064943A
Authority
CN
China
Prior art keywords
coupling
web page
setting option
webpage
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105730887A
Other languages
Chinese (zh)
Other versions
CN103064943B (en
Inventor
谢洲为
潘洪学
糜裕峰
任寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201210573088.7A priority Critical patent/CN103064943B/en
Publication of CN103064943A publication Critical patent/CN103064943A/en
Application granted granted Critical
Publication of CN103064943B publication Critical patent/CN103064943B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a client device (customer premises equipment) on which a browser is installed. A device capable of extracting web page text content is arranged in the browser. The client device starts this device according to a web browsing instruction from the user and displays to the user, in the browser, the web page text content that the device extracts. The device capable of extracting web page text content comprises a matching-setting configuration unit, a download unit, a matching unit, and an extraction unit. The matching-setting configuration unit is adapted to preset at least one web page text content matching setting on the browser side; the download unit is adapted to download web page content on the browser side; the matching unit is adapted to match the web page content against the web page text content matching settings one by one until the web page content is matched successfully; and the extraction unit is adapted to extract the web page text content from the web page content using the matching setting that matched successfully.

Description

A client device
Technical field
The present invention relates to the field of network technology, and in particular to a client device.
Background
With the spread of Internet technology, the network has become one of the main channels through which people obtain information, and the text content of web pages is the main carrier of that information. However, besides text content, a web page usually also contains a large amount of unwanted material, such as advertising pictures and non-article content, which seriously degrades the user's reading experience.
In the prior-art scheme for extracting web page text content, after a web page has been downloaded in the browser, its content is split up and then located by means of a matching rule file in the browser. The required field contents are extracted and displayed, so that the user sees the page after text filtering and can read conveniently and with focus.
The existing scheme for extracting web page text content has at least the following defect:
The existing scheme sets one matching rule file for one predetermined web page structure, and that file is only suitable for extracting text content from pages with that structure. Because Internet resources are updated quickly, page structures change frequently, and the existing matching rule file then cannot extract text from the changed page. A new matching rule file has to be generated and installed in the browser again, so re-establishing the match is cumbersome, labor-intensive, and inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a client device that overcomes the above problems or at least partially solves them.
According to the present invention, an embodiment of the invention provides a client device on which a browser is installed, a device capable of extracting web page text content being arranged in the browser.
The client device starts the device capable of extracting web page text content according to a web browsing instruction from the user, and displays to the user, in the browser, the web page text content extracted by that device.
The device capable of extracting web page text content comprises:
a matching-setting configuration unit, adapted to preset at least one web page text content matching setting on the browser side;
a download unit, adapted to download web page content on the browser side;
a matching unit, adapted to match the web page content against the web page text content matching settings one by one until the web page content is matched successfully;
an extraction unit, adapted to extract the web page text content from the web page content using the web page text content matching setting that matched successfully.
The matching-setting configuration unit is adapted to create a matching-setting file and to save the at least one web page text content matching setting in that file. The matching-setting file contains at least one website node; each website node contains at least one web page node; at least some web page nodes contain two or more matching-setting description nodes; each matching-setting description node corresponds to one web page text content matching setting; and at least two web page text content matching settings contain different matching setting items for the same type of text content.
The matching unit is adapted to look up, in the matching-setting file, the website node and web page node corresponding to the web page content and, under the web page node found, to match the web page content in turn against the matching setting items in the first matching-setting description node of that web page node. For a matching setting item that matches successfully, the matching result is set to the web page text content extracted with that item. For a matching setting item that fails to match, the unit looks, in the matching-setting description nodes of that web page node other than the first one, for the matching setting item corresponding to the failed item and matches the item found against the web page content, until an item found matches successfully; the matching result is then set to the web page text content extracted according to that item.
The extraction unit is adapted to take all the web page text content extracted according to successfully matched matching setting items as the web page text content identified in the web page content.
The matching-setting configuration unit is adapted to create one website node for each type of website; under a website node, to create one web page node for each type of web page of the corresponding website; and to create, according to the content of the web page, the matching setting items in the matching-setting description nodes of each web page node. In the first matching-setting description node of a web page node, at least one matching setting item is created for each type of text content in the corresponding web page; and, for the same type of text content, the matching setting item created in the first matching-setting description node differs from the matching setting items created in the other matching-setting description nodes of that web page node.
The matching-setting configuration unit is further adapted to set a download mode attribute and an element filter attribute in the web page node. The filter types indicated by the element filter attribute include one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content. The device further comprises a loading control unit and a filter unit.
The loading control unit is adapted, before the web page content is matched in turn against the matching setting items in the first matching-setting description node under the web page node found, to judge whether the value of the download mode attribute in that web page node is a predetermined value. If so, it starts the filter unit, and the filtered web page content is then matched in turn against the matching setting items in the first matching-setting description node of that web page node; if not, the web page content is downloaded directly into the browser.
The filter unit is adapted to filter the content of the web page according to the filter types indicated by the element filter attribute.
The web page text content matching setting configured by the matching-setting configuration unit includes a web page URL matching setting item created for the uniform resource locator (URL) of the web page content.
The web page URL matching setting item includes a match attribute setting, which includes:
the web page URL begins with predetermined content; and/or the web page URL contains predetermined content, in which predetermined positions may hold any character; and/or the web page URL does not contain predetermined content, where that predetermined content may include arbitrary characters.
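The three match attributes can be illustrated with a minimal Python sketch. The function names, the use of `*` as the "any character" marker, and the URL paths under www.feiku.com are assumptions made for illustration only; the patent does not specify the syntax of the setting item.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Turn a pattern with '*' (hypothetical any-character marker) into a regex."""
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def url_matches(url, starts_with=None, contains=None, not_contains=None) -> bool:
    """Check a URL against the three match attributes; unset attributes are skipped."""
    if starts_with is not None and not url.startswith(starts_with):
        return False
    if contains is not None and not re.search(wildcard_to_regex(contains), url):
        return False
    if not_contains is not None and re.search(wildcard_to_regex(not_contains), url):
        return False
    return True


# Hypothetical URLs; only the host name appears in the patent text.
BOOK_URL = "http://www.feiku.com/book/123/index.html"
CHAPTER_URL = "http://www.feiku.com/chapter/5.html"

IS_BOOK_PAGE = url_matches(
    BOOK_URL,
    starts_with="http://www.feiku.com/",
    contains="/book/*/index",
    not_contains="/chapter/*",
)
```

Combining the attributes with a logical AND is one possible reading of the "and/or" wording; a configuration could equally apply each attribute on its own.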
The web page URL matching setting item created by the matching-setting configuration unit further includes an identifier attribute setting, an identifier extraction attribute setting, and a conversion attribute setting.
The identifier attribute setting takes the characters at a predetermined position in the URL of the web page as the identifier of the web page content.
The identifier extraction attribute setting selects, from the identifier obtained by matching according to the identifier attribute setting, the characters at a predetermined position as the identifier.
The conversion attribute setting obtains the URL of the web page by conversion from the known identifier of the web page content and the composition format of the URL.
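As a rough illustration of these three attribute settings, the sketch below extracts an identifier from a URL, narrows it to a sub-range, and rebuilds another URL from it. The regular expression, the slice positions, the `{bookid}` template placeholder, and the URL layout are all hypothetical; the patent only states that characters at predetermined positions are used.

```python
import re
from typing import Optional


def extract_identifier(url: str, pattern: str) -> Optional[str]:
    """Identifier attribute: take the characters captured at a given position in the URL."""
    match = re.search(pattern, url)
    return match.group(1) if match else None


def refine_identifier(identifier: str, start: int, end: Optional[int] = None) -> str:
    """Identifier extraction attribute: keep only a predetermined slice of the identifier."""
    return identifier[start:end]


def build_url(template: str, identifier: str) -> str:
    """Conversion attribute: rebuild a page URL from the identifier and a URL format."""
    return template.format(bookid=identifier)


# Hypothetical URL layout for illustration.
RAW_ID = extract_identifier("http://www.feiku.com/book/b12345/", r"/book/([^/]+)/")
BOOK_ID = refine_identifier(RAW_ID, 1)
CATALOG_URL = build_url("http://www.feiku.com/catalog/{bookid}.html", BOOK_ID)
```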
The web page URL matching setting item created by the matching-setting configuration unit further includes a web page title extraction attribute setting, which extracts the content before a predetermined character in the web page content as the title.
The matching-setting configuration unit is further adapted to create, in the first matching-setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content in the web page content.
The matching setting item created for an HTML element includes a primary location matching setting item, which includes at least:
a base-point lookup setting, indicating the base-point lookup mode, which includes lookup by ID, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or
an ID location setting, locating the element that matches the ID of the HTML element; and/or
a name location setting, locating the element that matches the name of the HTML element; and/or
a class-name location setting, locating the element that matches the class name of the HTML element; and/or
a content location setting, locating the element that matches the content of the HTML element; and/or
an expression location setting, locating the element that matches the expression in the HTML element;
and/or
a tag setting, indicating the type and/or attributes of the located element when the element is located using the ID, name, class-name, content, or expression location setting.
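The lookup modes listed above can be sketched with Python's standard `xml.etree.ElementTree` on a small, well-formed page fragment. This is only an illustration under assumptions: the `locate` function, its mode names, and the sample markup are hypothetical, the "expression" mode is modelled with ElementTree's limited XPath subset, and a real browser would query its own DOM instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div id="header" class="nav">Site menu</div>
<div id="content" class="article" name="main">Chapter text</div>
<p class="ad">Buy now</p>
</body></html>"""


def locate(root, mode, value, tag=None):
    """Primary location: find elements by one lookup mode, optionally limited to a tag."""
    if mode == "expression":
        found = root.findall(value)  # ElementTree's XPath subset
    else:
        found = []
        for el in root.iter():
            if mode == "id":
                hit = el.get("id") == value
            elif mode == "name":
                hit = el.get("name") == value
            elif mode == "class":
                hit = value in (el.get("class") or "").split()
            elif mode == "content":
                hit = value in (el.text or "")
            else:
                raise ValueError(f"unknown lookup mode: {mode}")
            if hit:
                found.append(el)
    if tag is not None:  # tag setting: restrict the type of the located element
        found = [el for el in found if el.tag == tag]
    return found


ROOT = ET.fromstring(PAGE)
BODY_TEXT = locate(ROOT, "id", "content")[0].text
```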
The matching setting item created for an HTML element by the matching-setting configuration unit further includes a secondary location matching setting item, which includes at least:
a parent query setting, which sets the way of looking up the parent element of the element located by the primary location matching setting item; or
a child query setting, which sets the way of looking up the child elements of the element located by the primary location matching setting item; or,
when the parent query setting and the child query setting are both present, the parent element of the element located by the primary location matching setting item is looked up first according to the parent query setting, and the child elements of that parent element are then looked up, starting from the parent element found, according to the child query setting.
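The parent-then-child order can be sketched as follows, again with `xml.etree.ElementTree`. Expressing the parent query as a number of levels to climb and the child query as an ElementTree path is an assumption made for this sketch; the patent leaves the form of both settings open. ElementTree keeps no parent pointers, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div class="box"><h1 id="title">Chapter 1</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""


def secondary_locate(root, element, parent_levels=0, child_path=None):
    """Apply the parent query first, then the child query, starting from `element`."""
    parents = {child: parent for parent in root.iter() for child in parent}
    current = element
    for _ in range(parent_levels):
        if current not in parents:  # already at the document root
            break
        current = parents[current]
    if child_path is None:
        return [current]
    return current.findall(child_path)


ROOT = ET.fromstring(PAGE)
TITLE = ROOT.find(".//h1[@id='title']")  # element found by primary location
PARAGRAPHS = [p.text for p in secondary_locate(ROOT, TITLE, parent_levels=1, child_path="p")]
```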
The matching setting item created for an HTML element by the matching-setting configuration unit further includes an element deletion matching setting item, which includes at least:
deleting predetermined content in the element located by the primary location matching setting item or the secondary location matching setting item; and/or
changing predetermined content in the element located by the primary location matching setting item or the secondary location matching setting item.
The device further comprises a matching-setting update unit, adapted, after a matching-setting file has been created, to update, according to a received update instruction, the website nodes, web page nodes, matching-setting description nodes, and/or matching setting items within the matching-setting description nodes in the matching-setting file.
The device further comprises a multi-thread control unit. The multi-thread control unit is adapted, when several downloaded web page contents exist on the browser side, to allocate one thread to each web page content and to control the matching unit so that, in each allocated thread, the corresponding web page content is matched against the web page text content matching settings one by one until the web page content is matched successfully; and/or the multi-thread control unit is adapted to allocate several threads to one web page content on the browser side and to control the matching unit so that, in the different threads, the web page content is matched against different web page text content matching settings until the web page content is matched successfully.
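Both threading modes can be sketched with Python's `concurrent.futures.ThreadPoolExecutor`. Modelling a matching setting as a callable that returns extracted text or `None` is an assumption for this sketch, as are all function names and the toy `tag_extractor` helper.

```python
from concurrent.futures import ThreadPoolExecutor


def tag_extractor(tag):
    """Build a toy matching setting that extracts the text between <tag> and </tag>."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"

    def extract(page):
        start, end = page.find(open_tag), page.find(close_tag)
        if start == -1 or end == -1:
            return None
        return page[start + len(open_tag):end]

    return extract


def match_page(page, profiles):
    """Try each matching setting in turn until one succeeds."""
    for extract in profiles:
        result = extract(page)
        if result:
            return result
    return None


def match_pages_in_parallel(pages, profiles):
    """Mode 1: one thread per downloaded page."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda page: match_page(page, profiles), pages))


def match_profiles_in_parallel(page, profiles):
    """Mode 2: several threads for one page, each trying a different matching setting."""
    with ThreadPoolExecutor(max_workers=max(1, len(profiles))) as pool:
        for result in pool.map(lambda extract: extract(page), profiles):
            if result:
                return result
    return None


PROFILES = [tag_extractor("old"), tag_extractor("new")]
RESULTS = match_pages_in_parallel(["<old>A</old>", "<new>B</new>", "none"], PROFILES)
```

In the second mode this sketch still waits for every thread to finish before returning; cancelling the remaining work after the first success would be a further design choice.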
The device further comprises an input unit and an upload unit. The input unit is adapted to receive a selection instruction, sent by the user, that selects web page text content matching settings. The matching-setting configuration unit is then further adapted to create the matching-setting file according to the selection instruction and to save the web page text content matching settings named in the selection instruction in the created matching-setting file. The upload unit is adapted to upload the matching-setting file to a server, where it is stored in the user's user data on the server side.
The device further comprises a start control unit, adapted to start the matching unit's operation of matching the web page content against the web page text content matching settings when an event indicating that the browser has finished downloading the document is detected.
The matching unit is further adapted to parse the downloaded web page content layer by layer to obtain the DOM structure of that web page content and, according to the DOM structure, to match the web page content against the web page text content matching settings one by one.
As can be seen from the above, the embodiment of the invention creates several web page text content matching settings on the browser side and matches the same web page content against those several settings. When the web page content changes, a matching setting that fits the changed page can be found among the several settings, and the web page text content can then be extracted using the setting that matched successfully. The scheme also avoids having to generate a new matching rule file and install it in the browser whenever the web page content changes, which simplifies the matching operation, reduces the workload, and improves efficiency.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Description of drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a schematic structural diagram of a device capable of extracting web page text content according to an embodiment of the invention;
Fig. 2 is a flow chart of a method for extracting web page text content according to another embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
An embodiment of the invention provides a device capable of extracting web page text content, which can offer the user a convenient, focused reading service while maintaining text extraction speed and stability. Referring to Fig. 1, the device comprises a matching-setting configuration unit 100, a download unit 101, a matching unit 102, an extraction unit 103, a loading control unit 104, a filter unit 105, a matching-setting update unit 106, a multi-thread control unit 107, an input unit 108, and an upload unit 109. Each unit is described below.
The matching-setting configuration unit 100 is adapted to preset at least one web page text content matching setting on the browser side. Specifically, the matching-setting configuration unit 100 is adapted to create a matching-setting file and to save the at least one web page text content matching setting in that file. The matching-setting file contains at least one website node; each website node contains at least one web page node; at least some web page nodes contain two or more matching-setting description nodes; and each matching-setting description node corresponds to one web page text content matching setting. A matching-setting description node may contain one or more matching setting items, and at least two web page text content matching settings contain different matching setting items for the same type of text content.
The matching-setting configuration unit 100 creates one website node for each type of website, that is, one website node corresponds to one type of website. Under a website node, it creates one web page node for each type of web page of the corresponding website, that is, one web page node corresponds to one type of web page. The matching setting items in the matching-setting description nodes of each web page node are created according to the content of the web page. Different web pages contain different content, so the matching setting items in the corresponding matching-setting description nodes differ as well.
A web page node contains several matching-setting description nodes. A web page usually holds some fixed information that seldom changes and some variable information that changes easily. Among the matching-setting description nodes under a web page node, the matching-setting configuration unit 100 designates one as the first matching-setting description node. The first matching-setting description node contains the most complete set of matching setting items: at least one item for each type of text content in the web page. The other matching-setting description nodes may contain matching setting items only for the variable information in the web page, and those items differ from the corresponding items in the first matching-setting description node.
On the one hand, this approach simplifies the structure of the web page text content matching settings and avoids duplicated parts in different matching settings, which reduces the volume of matching-setting data to be stored and so improves resource utilization. On the other hand, it avoids repeating matching operations on identical web page content, which improves matching efficiency.
The matching-setting file is described in detail below with reference to a code example.
[Code listing: example matching-setting file, reproduced in the original publication as images BDA00002646555600071, BDA00002646555600081, and BDA00002646555600091; not available as text.]
The nodes of the matching-setting file in the above code are described as follows:
1. <websites> root node: this node is the top-level parent node and corresponds to one matching-setting file. It consists of several website (<website>) nodes.
2. <website> node: each website node represents one supported website. One or more web page nodes are set in a website node; for example, a book web page node, a catalog web page node, and a chapter web page node are set under the website node www.feiku.com. A download mode (downloadmode) attribute and an element filter (elementfilter) attribute are also set in the web page node.
3. <book> web page node: describes the home page information of a novel. Two matching-setting description nodes (<profile>) are set under this web page node. The <profile> that serves as the first matching-setting description node is configured with several matching setting items: the URL (Uniform Resource Locator) matching setting item describes the matching of the relevant URL and obtains the bookid (identifier) information; the title matching setting item describes how to obtain the title of the novel's home page; the catalogurl (catalog URL) matching setting item describes the catalog URL of the novel; the lasterchapter (latest chapter) matching setting item describes the description of the latest chapter; and the lasterchapterurl (latest chapter URL) matching setting item describes the URL of the latest chapter.
4. <catalog> web page node: describes the catalog page information of a novel. Only one matching-setting description node is set under this web page node. It contains: the URL matching setting item, which describes the matching of the relevant URL and obtains the bookid information; the chapterlist matching setting item, which describes the relevant content of the catalog page; and returnbook, which describes the URL of the novel's home page.
5. <chapter> web page node: describes the chapter page information of a novel. Two <profile> nodes are set under this web page node. The <profile> that serves as the first matching-setting description node is configured with: the URL matching setting item, which describes the matching of the relevant URL and obtains the bookid information; the title matching setting item, which describes how to obtain the title of the novel's home page; the text matching setting item, which describes the body text of the novel; the next matching setting item, which describes the URL of the next chapter page; the prev matching setting item, which describes the URL of the previous chapter; the returncatalog (return to catalog) matching setting item, which describes the URL of the novel's catalog page as saved in the chapter page; and the returnbook (return to book) matching setting item, which describes the novel's home page as saved in the chapter page.
6. <profile> matching-setting description node: when several web page text content matching settings are set under a web page node, matching-setting description nodes (<profile>) can be configured, each <profile> corresponding to one web page text content matching setting. A <profile> sits under a specific web page node, for example under the book and chapter web page nodes above, and the matching setting items are placed inside the <profile>.
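Because the original listing is not available as text, the following Python sketch rebuilds only the node hierarchy that items 1 to 6 describe and walks it with `xml.etree.ElementTree`. The element names come from the description above; the `host` attribute, the attribute values, the empty setting items, and the content of the second `<profile>` nodes are assumptions, not the patent's actual listing.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton of a matching-setting file; setting items are left empty.
SETTINGS_XML = """
<websites>
  <website host="www.feiku.com">
    <book downloadmode="1" elementfilter="3">
      <profile>
        <url/><title/><catalogurl/><lasterchapter/><lasterchapterurl/>
      </profile>
      <profile>
        <title/>
      </profile>
    </book>
    <catalog downloadmode="1" elementfilter="3">
      <profile>
        <url/><chapterlist/><returnbook/>
      </profile>
    </catalog>
    <chapter downloadmode="1" elementfilter="3">
      <profile>
        <url/><title/><text/><next/><prev/><returncatalog/><returnbook/>
      </profile>
      <profile>
        <text/>
      </profile>
    </chapter>
  </website>
</websites>
"""


def summarize(xml_text):
    """Map each website to its web page nodes and the item names in each profile."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for site in root.findall("website"):
        pages = {}
        for page in site:
            pages[page.tag] = [
                [item.tag for item in profile] for profile in page.findall("profile")
            ]
        summary[site.get("host")] = pages
    return summary


SUMMARY = summarize(SETTINGS_XML)
```

The first `<profile>` under each web page node carries the full set of items, while later profiles carry only alternatives for items that are expected to change, which mirrors the first-node convention described earlier.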
When a web page access instruction from the user is received, the download unit 101 downloads the web page content on the browser side; for example, the download unit 101 establishes a connection with the server and downloads from it the web page content corresponding to the web page access instruction.
The matching unit 102 matches the downloaded web page content against the web page text content matching settings one by one until the web page content is matched successfully. Continuing with the scenario in the code above, the matching unit 102 looks up, in the matching-setting file, the website node and web page node corresponding to the web page content; from the downloaded content it finds that the corresponding website node is www.feiku.com and the corresponding web page node is the book web page node. Then, under the web page node found, it matches the web page content in turn against the matching setting items in the first matching-setting description node of that web page node. When the first matching-setting description node of the book web page node is the first <profile> under that node, the web page content is first matched against the items in that first <profile>. For a matching setting item that matches successfully, the matching result is set to the web page text content extracted with that item; the result returned is either the extracted text content itself or information indicating that the result is TRUE. For a matching setting item that fails to match, the matching result returned may be an empty string indicating that processing is not possible, or information indicating that the result is FALSE. The matching unit then looks, in the matching-setting description nodes of that web page node other than the first (for example, the second <profile> under the book web page node), for the matching setting item corresponding to the failed item and matches the item found against the web page content, until an item found matches successfully; the matching result is then set to the web page text content extracted according to that item. In other words, for web page content that fails to match under the first matching-setting description node, as long as one <profile> matches, that <profile> can be used to extract the corresponding web page content.
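The per-item fallback just described can be sketched in Python. Here each profile is modelled as a dictionary from item name to an extractor function that returns the extracted text or `None`; this representation, the regular expressions, and the sample page are assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Optional[str]]
Profile = Dict[str, Extractor]


def by_regex(pattern: str) -> Extractor:
    """Build a toy matching setting item that returns the first regex group."""
    def extract(page: str) -> Optional[str]:
        match = re.search(pattern, page, re.S)
        return match.group(1) if match else None
    return extract


def match_with_fallback(page: str, profiles: List[Profile]) -> Dict[str, Optional[str]]:
    """Match every item of the first profile; on failure, try the same item elsewhere."""
    first, others = profiles[0], profiles[1:]
    results: Dict[str, Optional[str]] = {}
    for item, extract in first.items():
        value = extract(page)
        if not value:
            for profile in others:
                fallback = profile.get(item)
                if fallback is None:
                    continue  # this profile has no alternative for the item
                value = fallback(page)
                if value:
                    break
        results[item] = value or None
    return results


# First profile covers every item; the second only re-describes the changeable body text.
PROFILES = [
    {"title": by_regex(r"<h1>(.*?)</h1>"),
     "text": by_regex(r'<div id="content">(.*?)</div>')},
    {"text": by_regex(r'<div class="article">(.*?)</div>')},
]
PAGE = '<h1>Chapter 1</h1><div class="article">Body text</div>'
RESULT = match_with_fallback(PAGE, PROFILES)
```

Here the title is found by the first profile, while the body text fails there (the page no longer uses `id="content"`) and is recovered by the second profile.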
Because web page content is generally presented as HTML (HyperText Markup Language), the matching unit 102 also has to work on the HTML elements in the web page when matching. For example, the matching unit 102 parses the downloaded web page content layer by layer to obtain its Document Object Model (DOM) structure and, according to that DOM structure, matches the web page content against the web page text content matching settings one by one, thereby extracting the web page text content.
The extraction unit 103 is adapted to extract the web page text content from the web page content using the web page text content matching setting that matched successfully. Specifically, the extraction unit 103 is adapted to take all the web page text content extracted according to successfully matched matching setting items as the web page text content identified in the web page content.
Further, in this embodiment the download of web page content can also be controlled using the download mode (downloadmode) attribute and the element filter (elementfilter) attribute that matching setting configuration unit 100 sets in the web page node. The above device also comprises loading control unit 104 and filter unit 105.
Matching setting configuration unit 100 sets at least two kinds of attribute value for the download mode attribute. For example, an attribute value of 0 indicates that the whole web page content is downloaded into the browser in the browser's existing web page download mode; an attribute value of 1 indicates that filter unit 105 filters the web page content and only the content remaining after filtering is downloaded into the browser.
Matching setting configuration unit 100 sets a plurality of attribute values for the element filter attribute, each corresponding to one filter type. For example, attribute value 1 indicates filtering pictures (img), attribute value 2 indicates filtering Cascading Style Sheets (CSS), attribute value 4 indicates filtering frames (frame), attribute value 8 indicates filtering JavaScript scripts, attribute value 16 indicates filtering objects (object), and attribute value 32 indicates filtering embedded (embed) content.
When a combination of several of the above filter types is needed, a new attribute value can be generated by a bitwise OR of the binary representations of the corresponding attribute values; the new attribute value then indicates all of those filter types.
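The bitwise OR combination can be illustrated with a short sketch. The numeric values are those given in the text; the names ElementFilter, combine, and decode are hypothetical and chosen only for this example.

```python
from enum import IntFlag


class ElementFilter(IntFlag):
    IMG = 1      # pictures
    CSS = 2      # Cascading Style Sheets
    FRAME = 4    # frames
    SCRIPT = 8   # JavaScript scripts
    OBJECT = 16  # objects
    EMBED = 32   # embedded content


def combine(*filters):
    """Bitwise-OR the given filter types into one attribute value."""
    value = ElementFilter(0)
    for f in filters:
        value |= f
    return int(value)


def decode(value):
    """List the filter types that an attribute value switches on."""
    return [f for f in ElementFilter if value & f]
```

Because each filter type occupies its own bit, any combination maps to a unique attribute value and can be decoded without ambiguity.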
Loading control unit 104 is adapted to judge, before the web page content is matched in turn against the matching setting items in the first matching setting description node of the web page node found, whether the value of the download mode attribute in that web page node is a predetermined value (such as 1). If so, filter unit 105 is started, and then, under the web page node found, the filtered web page content is matched in turn against the matching setting items in the first matching setting description node of that web page node; if not, the web page content is downloaded directly into the browser.
Filter unit 105 is adapted to filter the content of the web page according to the filter type indicated by the element filter attribute. For example, when the attribute value of the element filter attribute indicates filtering pictures, filter unit 105 filters out all pictures in the web page content; when it indicates filtering pictures and CSS, filter unit 105 filters out all pictures and all CSS in the web page content.
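The interplay of the download mode check and the filter can be sketched as below. This is an illustration only: page elements are modelled as dictionaries with a "tag" key, the mapping from filter bits to tag names is an assumption (the text names the filter types, not the tags), and prepare_content and filter_elements are hypothetical names.

```python
# Assumed mapping from elementfilter bits to the tags they remove.
FILTER_TAGS = {
    1: {"img"},
    2: {"style"},
    4: {"frame"},
    8: {"script"},
    16: {"object"},
    32: {"embed"},
}
FILTER_ON = 1  # the predetermined downloadmode value used in the text


def filter_elements(elements, elementfilter):
    """Drop every element whose tag is covered by a set filter bit."""
    blocked = set()
    for bit, tags in FILTER_TAGS.items():
        if elementfilter & bit:
            blocked |= tags
    return [e for e in elements if e["tag"] not in blocked]


def prepare_content(elements, downloadmode, elementfilter):
    """Filter only when downloadmode asks for it; otherwise keep everything."""
    if downloadmode == FILTER_ON:
        return filter_elements(elements, elementfilter)
    return list(elements)
```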
The main matching setting items that matching setting configuration unit 100 configures in the matching setting description node are described in detail below.
I. Extraction of the web page URL
The web page text content matching settings configured by matching setting configuration unit 100 include a web page URL matching setting item set up for the URL of the web page content.
In this part, with reference to the url node in the example above, the web page URL matching setting item is described from five aspects: the Match setting, the Trans setting, the Bookid setting, the Booksep setting, and the Tabtitle setting.
1) Match setting: the match attribute setting item
The web page URL matching setting item includes a match attribute setting item, which comprises:
a. The web page URL begins with predetermined content. For example, content beginning with ^ indicates that the url must begin with the content following the ^.
b. The web page URL contains predetermined content, and a predetermined position in that content may hold any characters. For example, predetermined content beginning with @ indicates that the url must contain the content following the @; the character * may be added inside the content following the @ and matches any characters.
c. The web page URL does not contain predetermined content, and that predetermined content may include any characters. For example, predetermined content beginning with ! indicates that the url must not contain the content following the !; the character * may be added inside the content following the ! and matches any characters.
When extracting the web page URL, a, b and c may all be required to hold at the same time, or only one or two of them.
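The three kinds of match rule can be sketched as follows. The text does not say how several rules are written inside one match attribute, so this sketch assumes they arrive as a list of strings; url_matches is a hypothetical name, and all listed rules must hold.

```python
import re


def _to_regex(pattern):
    """Escape the pattern literally, letting * stand for any characters."""
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def url_matches(url, rules):
    """Return True when the url satisfies every rule in the list."""
    for rule in rules:
        kind, body = rule[0], rule[1:]
        if kind == "^":    # a. url must begin with body
            ok = url.startswith(body)
        elif kind == "@":  # b. url must contain body
            ok = re.search(_to_regex(body), url) is not None
        elif kind == "!":  # c. url must not contain body
            ok = re.search(_to_regex(body), url) is None
        else:
            raise ValueError("unknown rule prefix: " + kind)
        if not ok:
            return False
    return True
```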
2) Trans setting: the conversion attribute setting item
The URL of a web page is obtained by conversion from the known identifier of the web page content and the composition format of the URL. This operation is mainly used where there is only one matching setting description node under the web page node, that is, where only one profile exists, to perform URL conversion from given identifiers such as those of a novel's home page, catalog page, or chapter page. This setting item describes the composition format of the url, so that a url can be obtained simply by filling in identifiers such as bookid or chapterid, for example: trans=http://www.qidian.com/BookReader/##s,##s.aspx^^bookid^^chapterid
The string above shows the composition format of the URL: filling bookid into the first ##s and chapterid into the second ##s yields the url of a chapter page.
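The substitution can be sketched in a few lines. The trans string is the one from the example; expand_trans is a hypothetical helper, and the identifier values used with it are made up for illustration.

```python
def expand_trans(trans, ids):
    """Fill each ##s placeholder, in order, with the identifier named after ^^."""
    template, *fields = trans.split("^^")
    for field in fields:
        # Replace only the next unfilled placeholder.
        template = template.replace("##s", str(ids[field]), 1)
    return template
```

For instance, calling it with bookid 1234 and chapterid 5678 on the trans string above produces a url ending in /BookReader/1234,5678.aspx.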
3) Bookid setting: the identifier attribute setting item
The characters at a predetermined position in the URL of the web page are taken as the identifier of the web page content.
This operation obtains an identifier, such as the bookid string of a url. For example, for bookid=http://www.readnovel.com/novel/*.html, the position of the character * is the predetermined position, and the string at that position is taken as the extracted identifier, such as the bookid string.
The identifier extracted in this operation can be used to perform web page URL conversion.
4) Booksep setting: the identifier extraction attribute setting item
From the identifier obtained by matching according to the identifier attribute setting item, the characters at a predetermined position are chosen as the identifier. This operation is mainly used where the identifier obtained is relatively complex and needs further extraction.
For example, with the extraction structure booksep="/:0", when the identifier bookid contains the "/" symbol and a purely numeric value is wanted, booksep can be used: "/" is the separating identifier, ":" is the delimiter, and "0" indicates which segment (counting from 0) to take as the identifier bookid after the target text has been split into segments by "/".
The identifier extracted in this operation can likewise be used to perform web page URL conversion.
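The Bookid and Booksep settings can be sketched together. This assumes that a bookid pattern contains exactly one * and that a booksep value has the form separator, colon, index, as in the examples; extract_bookid and apply_booksep are hypothetical names.

```python
import re


def extract_bookid(url, bookid_pattern):
    """Return the part of the url that sits where * is in the pattern."""
    if bookid_pattern.count("*") != 1:
        raise ValueError("this sketch expects exactly one * in the pattern")
    regex = "(.*)".join(re.escape(p) for p in bookid_pattern.split("*"))
    match = re.fullmatch(regex, url)
    return match.group(1) if match else None


def apply_booksep(identifier, booksep):
    """Split the identifier by the separator and take the indexed segment."""
    separator, _, index = booksep.rpartition(":")
    segments = identifier.split(separator)
    position = int(index)
    return segments[position] if position < len(segments) else None
```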
5) Tabtitle setting: the web page title extraction attribute setting item
The content before a predetermined character in the web page content is extracted as the title (Title) information. For example, with the extraction structure tabtitle="*-", everything before the first occurrence of "-" is the title. The symbol * can match any characters.
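A minimal sketch of the title extraction, assuming the tabtitle value always has the form * followed by the delimiter, as in the example; extract_tabtitle is a hypothetical name, and trimming surrounding whitespace is an added convenience rather than something the text specifies.

```python
def extract_tabtitle(page_title, tabtitle):
    """Return the text before the first occurrence of the delimiter."""
    if not tabtitle.startswith("*"):
        raise ValueError("this sketch expects a pattern of the form *<delimiter>")
    delimiter = tabtitle[1:]
    head, found, _ = page_title.partition(delimiter)
    return head.strip() if found else None
```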
II. Extraction of HTML content in the web page
Matching setting configuration unit 100 sets up, in the matching setting description node (for example, the first matching setting description node), at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content in the web page content.
The HTML elements that need to be extracted differ between types of web page. Taking the scenario in the code above as an example, the HTML elements to be processed include the <title> element indicating the title, the <catalogurl> element indicating the catalog url, the <lastchapter> element indicating the latest chapter, the <lastchapterurl> element indicating the latest chapter url, the <text> element indicating the body text, the <next> element indicating the next-page url, the <prev> element indicating the previous-page url, the <returncatalog> element indicating the return-to-catalog url, the <returnbook> element indicating the return-to-home-page url, and so on.
The matching setting items that matching setting configuration unit 100 sets up for HTML elements include primary positioning matching setting items and secondary positioning matching setting items, described in turn below.
1) Primary positioning matching setting items
The primary positioning matching setting items comprise at least:
a. Base point lookup setting item el: indicates the mode of base point lookup and can be set to values such as 1, 2, 4, 8 and 16, where 1 corresponds to lookup by identifier (id), 2 to lookup by name (name), 4 to lookup by class name (classname), 8 to lookup by content (value), and 16 to lookup by expression (regular).
b. Identifier positioning setting item id: locates the element that matches the identifier of the HTML element.
c. Name positioning setting item name: locates the element that matches the name of the HTML element.
d. Class name positioning setting item classname: locates the element that matches the class name of the HTML element; when several elements match the class name, only the first is matched.
e. Content positioning setting item value: locates the element that matches the content (innertext) of the HTML element; when several elements match, only the first is matched.
f. Expression positioning setting item regular: locates the element that matches an expression in the HTML element; for example, for the expression %CUURENTURL%, the url matching that expression is located.
g. Tag setting item tag: indicates the type and/or attribute of the element located when the identifier, name, class name, content, or expression positioning setting item is used to locate an element.
That is, tag indicates the element type and attribute for primary positioning. For example, with the structure tag="a-href", the attribute to take from the located element is href and the type of the located element is a. The tag setting item takes effect in this way only when no secondary positioning occurs; if secondary positioning occurs, tag is only responsible for checking.
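Primary positioning can be sketched on a toy element tree. The Element class is a hypothetical stand-in for a DOM node, the mapping of el values to the id, name, and class attributes is an assumption, and the expression mode (16) is left out because the text does not pin down its semantics.

```python
from dataclasses import dataclass, field

# el values from the text: 1 = id, 2 = name, 4 = classname, 8 = value.
EL_ATTRS = {1: "id", 2: "name", 4: "class"}
EL_VALUE = 8


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def walk(element):
    """Yield the element and all its descendants in document order."""
    yield element
    for child in element.children:
        yield from walk(child)


def locate(root, el, wanted):
    """Return the first element matching the base point lookup, or None."""
    for element in walk(root):
        if el in EL_ATTRS and element.attrs.get(EL_ATTRS[el]) == wanted:
            return element
        if el == EL_VALUE and element.text == wanted:
            return element
    return None


def apply_tag(element, tag):
    """Check the located element's type and read the attribute named in tag."""
    tag_name, _, attr = tag.partition("-")
    if element is None or element.tag != tag_name:
        return None
    return element.attrs.get(attr) if attr else element.text
```

Returning the first hit in document order mirrors the rule that only the first matching element is used when several elements match.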
2) Secondary positioning matching setting items
On the basis of primary positioning, secondary positioning can also be performed on the result obtained by primary positioning. The secondary positioning matching setting items comprise:
a. Parent query setting item parentselect: sets the mode of looking up the parent element of the element located according to the primary positioning matching setting items;
b. Child query setting item childrenselect: sets the mode of looking up a child element of the element located according to the primary positioning matching setting items;
c. When the parent query setting item and the child query setting item are both present, the parent element of the element located by the primary positioning matching setting items is first looked up according to the parent query setting item, and then a child element of that parent element is looked up, starting from the parent element found, according to the child query setting item.
This embodiment also sets the specific positioning mode in setting items such as parentselect, childrenselect and tag according to element name, element attribute, order, and so on. For example, when the mode is expressed as "ul:0|li:1|a-href:0", the following positioning operations are carried out starting from the currently located element:
1. Look up the 1st (0 denotes the first) <ul> tag at the next level (or the upper level, or the current level) of the current element: under parentselect, the 1st <ul> tag at the upper level of the current element; under childrenselect, the 1st <ul> tag at the next level of the current element; under tag, the 1st <ul> tag at the current level of the current element.
2. Then look up the 2nd (1 denotes the second) <li> tag at the next level (or the upper level, or the current level) of the ul element found.
3. Then look up the 1st (0 denotes the first) <a> tag at the next level (or the upper level, or the current level) of the li element found.
4. After the a element is found, if -href is set, the content of the href attribute of the a element is taken; if it is not set, the element content (innertext) of the a element is taken directly.
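The four steps above can be sketched for the childrenselect direction. The Element class is a hypothetical stand-in for a DOM node and select_children is a hypothetical name; the parentselect and tag directions would walk upward or stay on the current level instead.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def select_children(element, path):
    """Follow a path such as 'ul:0|li:1|a-href:0' down through child elements."""
    current = element
    attr = ""
    for step in path.split("|"):
        selector, _, index = step.rpartition(":")
        tag_name, _, attr = selector.partition("-")
        # Indexes count only the children that carry the requested tag.
        same_tag = [c for c in current.children if c.tag == tag_name]
        position = int(index)
        if position >= len(same_tag):
            return None
        current = same_tag[position]
    # An attribute on the last step selects it; otherwise take the inner text.
    return current.attrs.get(attr) if attr else current.text
```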
3) Filter setting
The matching setting items that matching setting configuration unit 100 sets up for HTML elements also include an element deletion matching setting item elementerase, used to erase certain child elements within the located element. The element deletion matching setting item comprises at least:
deleting predetermined content in the element located by the primary positioning matching setting items or the secondary positioning matching setting items; and/or changing predetermined content in the element located by the primary positioning matching setting items or the secondary positioning matching setting items.
For example, with the structure elementerase="font:0|FONT:0", the content between the font or FONT tags in the selected content is "erased". The manner of "erasing" depends on the meaning of the number following the ":" symbol: for example, 0 corresponds to changing the element name to div style="display:none"; 1 corresponds to changing the element name to a marker; and 2 corresponds to deleting the element.
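One possible reading of elementerase is sketched below. It covers only modes 0 and 2; mode 1 is omitted because the text does not say what the marker looks like. The Element class and the erase function are hypothetical, and tag names are compared case-sensitively, which is an assumption suggested by the example listing both font and FONT.

```python
from dataclasses import dataclass, field

HIDE, DELETE = 0, 2  # mode 1 is left out of this sketch


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)


def erase(element, elementerase):
    """Hide or delete descendants according to a 'tag:mode|tag:mode' string."""
    rules = dict(step.rsplit(":", 1) for step in elementerase.split("|"))
    kept = []
    for child in element.children:
        mode = rules.get(child.tag)
        if mode is not None and int(mode) == DELETE:
            continue  # drop the child entirely
        if mode is not None and int(mode) == HIDE:
            # Turn the child into a hidden div so it no longer shows as text.
            child.tag = "div"
            child.attrs["style"] = "display:none"
        erase(child, elementerase)
        kept.append(child)
    element.children = kept
    return element
```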
Further, the above device also comprises matching setting update unit 106, adapted to update, after the matching settings file has been set up and according to an update instruction received, the website nodes, web page nodes, and matching setting description nodes in the matching settings file and/or the matching setting items within the matching setting description nodes. For example, when a website no longer exists on the Internet, or text extraction is no longer needed for the web pages of that website, matching setting update unit 106 deletes the website node corresponding to that website, together with all related settings under it, from the matching settings file.
Further, the above device also comprises multi-thread control unit 107. Multi-thread control unit 107 is adapted, when a plurality of downloaded web page contents exist on the browser side, to allocate one thread to each web page content and to control the matching unit to match, in the allocated thread, the corresponding web page content against the web page text content matching settings in turn until the web page content is matched successfully; and/or multi-thread control unit 107 is adapted to allocate a plurality of threads to one web page content on the browser side and to control the matching unit to match, in the different threads, the web page content against different web page text content matching settings until the web page content is matched successfully. By adopting multi-threaded processing, this scheme can extract the text of one or more web page contents more quickly, shorten the time the browser takes to load a web page, and present the extracted web page text content to the user in the browser quickly.
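The first variant, one thread per downloaded web page content, can be sketched with a thread pool. This is an illustration under stated assumptions: each matching setting is modelled as a hypothetical callable that returns the extracted text or an empty value, and match_page and match_pages are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def match_page(page, settings):
    """Try each matching setting in turn until one succeeds."""
    for setting in settings:
        result = setting(page)
        if result:
            return result
    return None


def match_pages(pages, settings):
    """Match every page in its own worker thread, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda page: match_page(page, settings), pages))
```

The second variant would instead submit one task per matching setting for the same page and take the first successful result.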
Wherein, the above device also comprises input unit 108 and upload unit 109. Input unit 108 is adapted to receive a selection instruction sent by the user for selecting web page text content matching settings; matching setting configuration unit 100 is then also adapted to set up the matching settings file according to the selection instruction and to save the web page text content matching settings named in the selection instruction in that file, and can also update the matching settings file according to an update instruction from the user. Upload unit 109 is adapted to upload the matching settings file to the server and store it in the user's user data on the server side, so that when the matching settings file on the browser side is damaged or lost, the browser side can recover or update it from the matching settings file saved on the server side.
Further, the above device also comprises a start control unit, adapted, on detecting the document complete (DocumentComplete) event indicating that the browser has finished loading, to learn that the web page content extraction operation can now be performed, and then to start the matching unit to match the web page content against the web page text content matching settings in turn.
It will be appreciated that one or more of matching setting update unit 106, multi-thread control unit 107, input unit 108 and upload unit 109 may be omitted in some scenarios.
As can be seen from the above, the embodiment of the invention sets up a plurality of web page text content matching settings on the browser side and matches the same web page content against that plurality of settings, so that when the web page content changes, a matching setting that fits the changed web page can be found among them and used to extract the web page text content. This scheme also avoids having to generate a new matching rule file and install it in the browser whenever web page content changes, which simplifies the matching operation, reduces workload, and improves efficiency.
Another embodiment of the invention also provides a client device on which a browser is installed, the browser being provided with the device capable of extracting web page text content described above.
The client device starts the device capable of extracting web page text content according to the user's web browsing instruction, and displays the web page text content extracted by that device to the user in the browser.
For the specific working mode of the device capable of extracting web page text content in the client device, refer to the related device embodiments of the invention; details are not repeated here.
Another embodiment of the invention also provides a method for extracting web page text content, which can provide the user with a convenient and focused reading service while ensuring text extraction speed and stability. The method comprises:
S200: Preset at least one web page text content matching setting on the browser side.
A matching settings file is set up and the at least one web page text content matching setting is saved in it. The matching settings file contains at least one website node; each website node contains at least one web page node; at least some web page nodes are provided with two or more matching setting description nodes; each matching setting description node corresponds to one web page text content matching setting; and at least two web page text content matching settings contain different matching setting items for the same type of text content.
In this embodiment a website node is set up for each type of website; under a website node, a web page node is set up for each type of web page of the corresponding website; and the matching setting items in the matching setting description nodes of each web page node are set up according to the content of the web page. In the first matching setting description node of a web page node, at least one matching setting item is set up for each type of text content in the corresponding web page; and for the same type of text content in the web page, the matching setting items set up in the first matching setting description node differ from those set up in the other matching setting description nodes of the web page node. Thus, when the matching setting items in the first matching setting description node cannot match a given web page content, that content can be matched against the matching setting items in the other matching setting description nodes until matching succeeds.
A web page node contains a plurality of matching setting description nodes. Because a web page usually contains some fixed information that seldom changes and some variable information that changes readily, one of the matching setting description nodes under the web page node is designated the first matching setting description node. The matching setting items in this first node are the most comprehensive and include at least one matching setting item for each type of text content in the web page. In the other matching setting description nodes, matching setting items may be set up only for the variable information in the web page, and they differ from the matching setting items set up in the first matching setting description node.
This approach on the one hand simplifies the structure of the web page text content matching settings, avoids repeated parts across different matching settings, and reduces the amount of matching setting data to be stored, thereby improving resource utilization; on the other hand it avoids repeating matching operations on identical web page content, improving matching efficiency.
Further, the web page node contains a download mode attribute and an element filter attribute. The filter types indicated by the element filter attribute include one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content.
Before the step of matching, under the web page node found, the web page content in turn against the matching setting items in the first matching setting description node of that web page node, the method further comprises:
judging whether the value of the download mode attribute in the web page node found is a predetermined value; if so, filtering the content of the web page according to the filter type indicated by the element filter attribute and then, under the web page node found, matching the filtered web page content in turn against the matching setting items in the first matching setting description node of that web page node; if not, downloading the web page content directly into the browser.
Wherein, the above web page text content matching settings include a web page URL matching setting item set up for the URL of the web page content. The web page URL matching setting item includes a match attribute setting item, which comprises:
the web page URL begins with predetermined content; and/or the web page URL contains predetermined content, a predetermined position of which may hold any characters; and/or the web page URL does not contain predetermined content, which may include any characters.
Wherein, the above web page URL matching setting item also includes an identifier attribute setting item, an identifier extraction attribute setting item, and a conversion attribute setting item.
The identifier attribute setting item comprises taking the characters at a predetermined position in the URL of the web page as the identifier of the web page content; the identifier extraction attribute setting item comprises choosing, from the identifier obtained by matching according to the identifier attribute setting item, the characters at a predetermined position as the identifier; and the conversion attribute setting item comprises obtaining the URL of the web page by conversion from the known identifier of the web page content and the composition format of the URL.
Wherein, the above web page URL matching setting item also includes a web page title extraction attribute setting item, which comprises extracting the content before a predetermined character in the web page content as the title.
Wherein, setting up, in the first matching setting description node of the web page node, at least one matching setting item for each type of text content in the corresponding web page comprises:
setting up, in the first matching setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content in the web page content.
The matching setting items set up for HTML elements include primary positioning matching setting items, which comprise at least:
a base point lookup setting item indicating the mode of base point lookup, the modes including lookup by identifier, by name, by class name, by content, and by expression; and/or an identifier positioning setting item for locating the element that matches the identifier of the HTML element; and/or a name positioning setting item for locating the element that matches the name of the HTML element; and/or a class name positioning setting item for locating the element that matches the class name of the HTML element; and/or a content positioning setting item for locating the element that matches the content of the HTML element; and/or an expression positioning setting item for locating the element that matches an expression in the HTML element; and/or a tag setting item indicating the type and/or attribute of the element located when the identifier, name, class name, content, or expression positioning setting item is used to locate an element.
Wherein, the matching setting items set up for HTML elements also include secondary positioning matching setting items, which comprise at least:
a parent query setting item for setting the mode of looking up the parent element of the element located according to the primary positioning matching setting items; or a child query setting item for setting the mode of looking up a child element of the element located according to the primary positioning matching setting items; and/or, when the parent query setting item and the child query setting item are both present, first looking up the parent element of the element located by the primary positioning matching setting items according to the parent query setting item, and then looking up a child element of that parent element, starting from the parent element found, according to the child query setting item.
Wherein, the matching setting items set up for HTML elements also include an element deletion matching setting item, which comprises at least: deleting predetermined content in the element located by the primary positioning matching setting items or the secondary positioning matching setting items; and/or changing predetermined content in the element located by the primary positioning matching setting items or the secondary positioning matching setting items.
S202: Download the web page content on the browser side.
S204: Look up, in the matching settings file, the website node and web page node corresponding to the web page content.
S206: Under the web page node found, match the web page content in turn against the matching setting items in the first matching setting description node of that web page node, and perform step S208 or S210 according to the matching result.
S208: For a matching setting item that matches successfully, set the matching result to the web page text content extracted with that matching setting item.
S210: For a matching setting item that fails to match, look up the corresponding matching setting item in the matching setting description nodes of the web page node other than the first one, and match the item found against the web page content, until an item found matches the web page content successfully; then set the matching result to the web page text content extracted according to that item.
S212: Extract the web page text content from the web page content using the web page text content matching setting that matched the web page content successfully.
All the web page text content extracted according to the successfully matched matching setting items is taken as the web page text content identified in the web page content.
Wherein, after step S200, said method also comprises: according to the update instruction that receives, coupling is arranged website node, web page joint, coupling in the file coupling setting option that description node and/or coupling arrange in the description node is set upgrades.
Wherein, above-mentioned steps S206 mates web page contents respectively with the setting of webpage text content coupling, until web page contents the match is successful comprises:
When there is a plurality of web page contents that downloads in the browser side, be that each web page contents distributes a thread, in the thread that distributes, the corresponding web page content arranged with the webpage text content coupling respectively and mates, until web page contents the match is successful; And/or, for a web page contents of browser side distributes a plurality of threads, in different threads, web page contents arranged from different webpage text content coupling respectively and mates, until web page contents the match is successful.
And in step S206, because web page contents has the description form of HTML usually, present embodiment can to the web page contents layering analysis that downloads to, obtain the DOM structure of this web page contents; According to the DOM structure of web page contents, web page contents is mated with the setting of webpage text content coupling respectively.
Wherein, in step S200, also comprise: receive the instruction of choosing of choosing the setting of webpage text content coupling that the user sends; Set up coupling file is set according to choosing instruction, and will choose webpage text content coupling in the instruction and arrange and be kept at the coupling of setting up and arrange in the file; Coupling is arranged File Upload to server and be stored in server side user's the user data.
Wherein, before step S204, the above method further comprises: when an event indicating that the browser has finished loading the document is detected, starting the operation of matching the web page content against each webpage text content matching setting in turn.
For the specific way in which each step of this embodiment is carried out, see the related content in the device embodiments of the present invention.
As described above, the embodiments of the invention establish multiple webpage text content matching settings on the browser side and match the same web page content against those multiple settings. When the web page content changes, a matching setting that fits the changed web page can be found among the multiple webpage text content matching settings, and the successfully matched setting can then be used to extract the webpage text content. This scheme also avoids having to generate a new matching rule file and install it in the browser whenever the web page content changes, which simplifies the matching operation, reduces the workload, and improves efficiency.
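The fallback behaviour summarized above might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each description node is modeled as a Python dict that maps a text content type to a regular expression with one capture group, and the `extract` function name is hypothetical.

```python
import re

def extract(page_node, html):
    """Match items of the first description node; on failure, fall back to the
    corresponding item (same text content type) in the other description nodes."""
    first, fallbacks = page_node[0], page_node[1:]
    result = {}
    for text_type, pattern in first.items():
        candidates = [pattern] + [n[text_type] for n in fallbacks if text_type in n]
        for candidate in candidates:
            found = re.search(candidate, html, re.S)
            if found:
                # The matching result is the text extracted with this item.
                result[text_type] = found.group(1).strip()
                break
    return result
```

Because each text content type falls back independently, a page whose title markup is unchanged but whose body markup has changed is still extracted in full without a new rule file.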
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the client device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (17)

1. A client device on which a browser is installed, a device capable of extracting webpage text content being provided in the browser,
the client device starting the device capable of extracting webpage text content according to a user's web browsing instruction, and displaying to the user, in the browser, the webpage text content extracted by that device;
the device capable of extracting webpage text content comprising:
a matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side;
a download unit, adapted to download web page content on the browser side;
a matching unit, adapted to match the web page content against each webpage text content matching setting in turn until the web page content is matched successfully;
an extraction unit, adapted to extract the webpage text content from the web page content by using the webpage text content matching setting that successfully matched the web page content.
2. The client device according to claim 1, characterized in that the matching setting configuration unit is adapted to create a matching settings file and save the at least one webpage text content matching setting in the matching settings file; wherein the matching settings file comprises at least one website node, each website node comprises at least one web page node, at least some of the web page nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings each comprise a different matching setting item for the same type of text content.
3. The client device according to claim 2, characterized in that
the matching unit is adapted to: look up, in the matching settings file, the website node and web page node corresponding to the web page content; under the web page node found, match the web page content in turn against the matching setting items in the first matching setting description node of that web page node; for a matching setting item that matches successfully, set the matching result to the webpage text content extracted using that matching setting item; and for a matching setting item that fails to match, search the matching setting description nodes of that web page node other than the first matching setting description node for the matching setting item corresponding to the failed item, and match the matching setting item found against the web page content, until a matching setting item found matches the web page content successfully, and set the matching result to the webpage text content extracted according to that matching setting item.
4. The client device according to claim 3, characterized in that the extraction unit is adapted to take all webpage text content extracted according to the successfully matched matching setting items as the identified webpage text content in the web page content.
5. The client device according to claim 2, characterized in that the matching setting configuration unit is adapted to: create one website node for each type of website; under a website node, create one web page node for each type of web page under the website corresponding to that website node; and create the matching setting items in the matching setting description nodes of each web page node according to the content of the web page, wherein, in the first matching setting description node of a web page node, at least one matching setting item is created for each type of text content in the web page corresponding to that web page node; and, for text content of the same type in the web page, the matching setting item created in the first matching setting description node differs from the matching setting items created in the matching setting description nodes of that web page node other than the first matching setting description node.
6. The client device according to claim 3, characterized in that the matching setting configuration unit is further adapted to set a download mode attribute and an element filter attribute in the web page node, the filter types indicated by the element filter attribute comprising one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content; the device further comprises a loading control unit and a filter unit,
the loading control unit being adapted to, before the web page content is matched in turn against the matching setting items in the first matching setting description node under the web page node found, judge whether the value of the download mode attribute in the web page node found is a predetermined value; if so, start the filter unit, and then, under the web page node found, match the filtered web page content in turn against the matching setting items in the first matching setting description node of that web page node; if not, download and load the web page content directly into the browser;
the filter unit being adapted to filter the content of the web page according to the filter types indicated by the element filter attribute.
7. The client device according to claim 1, characterized in that the webpage text content matching setting configured by the matching setting configuration unit comprises a web page URL matching setting item created for the Uniform Resource Locator (URL) of the web page content,
the web page URL matching setting item comprising a match attribute setting item, the match attribute setting item comprising:
the web page URL begins with predetermined content; and/or,
the web page URL contains predetermined content, a predetermined position of the predetermined content holding any character; and/or,
the web page URL does not contain predetermined content, the predetermined content including any character.
8. The client device according to claim 7, characterized in that the web page URL matching setting item created by the matching setting configuration unit further comprises an identifier attribute setting, an identifier extraction attribute setting, and a conversion attribute setting,
the identifier attribute setting comprising: taking the characters at a predetermined position in the URL of the web page as the identifier of the web page content;
the identifier extraction attribute setting comprising: selecting, from the identifier obtained by matching according to the identifier attribute setting, the characters at a predetermined position as the identifier;
the conversion attribute setting comprising: obtaining the URL of the web page by conversion according to the known identifier of the web page content and the composition format of the URL.
9. The client device according to claim 7, characterized in that the web page URL matching setting item created by the matching setting configuration unit further comprises a web page title extraction attribute setting,
the web page title extraction attribute setting comprising: extracting the content that precedes a predetermined character in the web page content as the title.
10. The client device according to claim 5, characterized in that the matching setting configuration unit is further adapted to create, in the first matching setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element, in the web page content, of each type of text content in the web page;
the matching setting item created for the HTML element comprises a primary locating matching setting item, the primary locating matching setting item comprising at least:
a base point lookup setting item: indicating the base point lookup mode, the modes comprising lookup by ID, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or,
an ID locating setting item: locating the element that matches the ID of the HTML element; and/or,
a name locating setting item: locating the element that matches the name of the HTML element; and/or,
a class name locating setting item: locating the element that matches the class name of the HTML element; and/or,
a content locating setting item: locating the element that matches the content of the HTML element; and/or,
an expression locating setting item: locating the element that matches the expression in the HTML element;
and/or,
a tag setting item: indicating the type and/or attributes of the located element when the ID locating setting item, name locating setting item, class name locating setting item, content locating setting item, or expression locating setting item is used to locate an element.
11. The client device according to claim 10, characterized in that the matching setting item created by the matching setting configuration unit for the HTML element further comprises a secondary locating matching setting item, the secondary locating matching setting item comprising at least one of the following setting items:
a parent query setting item: setting the way to look up the parent element of the element located according to the primary locating matching setting item; or,
a child query setting item: setting the way to look up the child elements of the element located according to the primary locating matching setting item; or,
when the parent query setting item and the child query setting item are both present, first looking up, according to the parent query setting item, the parent element of the element located by the primary locating matching setting item, and then looking up, according to the child query setting item, the child elements of the parent element found.
12. The client device according to claim 10, characterized in that the matching setting item created by the matching setting configuration unit for the HTML element further comprises an element deletion matching setting item, the element deletion matching setting item comprising at least:
deleting predetermined content in the element located by the primary locating matching setting item or the secondary locating matching setting item; and/or
changing predetermined content in the element located by the primary locating matching setting item or the secondary locating matching setting item.
13. The client device according to claim 2, characterized in that the device further comprises a matching setting update unit, adapted to, after the matching settings file has been created, update, according to a received update instruction, the website nodes, web page nodes, matching setting description nodes, and/or the matching setting items in the matching setting description nodes in the matching settings file.
14. The client device according to claim 1, characterized in that the device capable of extracting webpage text content further comprises a multi-thread control unit,
the multi-thread control unit being adapted to, when multiple downloaded web page contents exist on the browser side, allocate one thread to each web page content and control the matching unit to match, in the allocated thread, the corresponding web page content against each webpage text content matching setting in turn until the web page content is matched successfully; and/or
the multi-thread control unit being adapted to allocate multiple threads to one web page content on the browser side and control the matching unit to match, in the different threads, the web page content against different webpage text content matching settings until the web page content is matched successfully.
15. The client device according to claim 2, characterized in that the device comprises an input unit and an upload unit,
the input unit being adapted to receive a selection instruction, sent by the user, that selects webpage text content matching settings;
the matching setting configuration unit being further adapted to create a matching settings file according to the selection instruction and save the webpage text content matching settings named in the selection instruction in the created matching settings file;
the upload unit being adapted to upload the matching settings file to a server, where it is stored in the user's user data on the server side.
16. The client device according to claim 1, characterized in that the device capable of extracting webpage text content further comprises a start control unit, adapted to, when an event indicating that the browser has finished loading the document is detected, start the matching unit to perform the operation of matching the web page content against each webpage text content matching setting in turn.
17. The client device according to claim 1, characterized in that
the matching unit is further adapted to parse the downloaded web page content hierarchically to obtain the Document Object Model (DOM) structure of the web page content, and to match the web page content against each webpage text content matching setting in turn according to the DOM structure of the web page content.
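
For illustration only, the three URL match attributes recited in claim 7 might be sketched in Python as follows. The `*` wildcard, which here stands for exactly one arbitrary character, and the function and parameter names are assumptions introduced for this sketch and are not part of the claims.

```python
import re

WILDCARD = "*"  # hypothetical marker for "any character" at a predetermined position

def _to_regex(content):
    """Escape the literal parts and let each wildcard match one character."""
    return ".".join(re.escape(part) for part in content.split(WILDCARD))

def url_matches(url, starts_with=None, contains=None, not_contains=None):
    """Return True when the URL satisfies every attribute that is given."""
    if starts_with is not None and not url.startswith(starts_with):
        return False
    if contains is not None and not re.search(_to_regex(contains), url):
        return False
    if not_contains is not None and re.search(_to_regex(not_contains), url):
        return False
    return True
```

Attributes that are left as `None` are simply not checked, which mirrors the "and/or" relationship between the three attributes.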
CN201210573088.7A 2012-12-25 2012-12-25 A kind of client device Expired - Fee Related CN103064943B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210573088.7A CN103064943B (en) 2012-12-25 2012-12-25 A kind of client device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210573088.7A CN103064943B (en) 2012-12-25 2012-12-25 A kind of client device

Publications (2)

Publication Number Publication Date
CN103064943A true CN103064943A (en) 2013-04-24
CN103064943B CN103064943B (en) 2016-11-23

Family

ID=48107573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210573088.7A Expired - Fee Related CN103064943B (en) 2012-12-25 2012-12-25 A kind of client device

Country Status (1)

Country Link
CN (1) CN103064943B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302742A (en) * 2014-07-04 2016-02-03 深圳市雅都软件股份有限公司 System and method for avoiding repeated loading of dynamically cached graphic data
CN106326316A (en) * 2015-07-08 2017-01-11 腾讯科技(深圳)有限公司 Web page advertisement filtering method and device
CN106547806A (en) * 2015-09-23 2017-03-29 阿里巴巴集团控股有限公司 Page loading method and device
CN108628860A (en) * 2017-03-15 2018-10-09 贵州白山云科技有限公司 A kind of method and device of automatic acquisition web data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020032735A1 (en) * 2000-08-25 2002-03-14 Daniel Burnstein Apparatus, means and methods for automatic community formation for phones and computer networks
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system
CN102708174A (en) * 2012-05-04 2012-10-03 奇智软件(北京)有限公司 Method and device for displaying rich media information in browser
CN102789484A (en) * 2012-06-28 2012-11-21 奇智软件(北京)有限公司 Method and device for webpage information processing

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302742A (en) * 2014-07-04 2016-02-03 深圳市雅都软件股份有限公司 System and method for avoiding repeated loading of dynamically cached graphic data
CN105302742B (en) * 2014-07-04 2018-07-20 深圳市雅都软件股份有限公司 Avoid the system and method for repeating to load dynamic buffering graph data
CN106326316A (en) * 2015-07-08 2017-01-11 腾讯科技(深圳)有限公司 Web page advertisement filtering method and device
CN106326316B (en) * 2015-07-08 2022-11-29 腾讯科技(深圳)有限公司 Webpage advertisement filtering method and device
CN106547806A (en) * 2015-09-23 2017-03-29 阿里巴巴集团控股有限公司 Page loading method and device
CN108628860A (en) * 2017-03-15 2018-10-09 贵州白山云科技有限公司 A kind of method and device of automatic acquisition web data

Also Published As

Publication number Publication date
CN103064943B (en) 2016-11-23

Similar Documents

Publication Publication Date Title
CN103020266A (en) Method and device for extracting webpage text content
EP3491544B1 (en) Web page display systems and methods
CN100476830C (en) Network resource searching method and system
EP2938044B1 (en) System, method, apparatus, and server for displaying network medium information
CN104077388A (en) Summary information extraction method and device based on search engine and search engine
CN104063454A (en) Search push method and device for mining user demands
CN105528452A (en) Method and system for loading page data
CN108021598B (en) Page extraction template matching method and device and server
CN108710490B (en) Method and device for editing Web page
CN102831252A (en) Method and device for updating index database and search method and system
CN103207874A (en) Updated webpage content prompting method and system
US20100318888A1 (en) System and method for providing sub-publication content in an electronic device
CN102129428A (en) Method and device for subscribing information from webpage
CN109976840A (en) The method and system of multilingual automatic adaptation are realized under a kind of separation platform based on front and back
CN103064943A (en) Customer premises equipment
CN102955850A (en) Method and device for loading sequencing website
CN102982118A (en) Searching method and device based on favorites
GB2496689A (en) Using metadata to provide embedded media on third-party webpages according to viewing settings
CN102902784A (en) Web page classification storage system and method
US20110314044A1 (en) Flexible content organization and retrieval
CN102902792B (en) list page identification system and method
CN103761231A (en) Method and device for providing media content information of page by search engine
CN106951429B (en) Method, browser and equipment for enhancing webpage comment display
CN106934036A (en) A kind of method and system of Network Learning Resource aggregate query
CN102982078A (en) Loading method of sequencing website and client with sequencing website being loaded

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161123