CN103020266B - Method and apparatus for extracting webpage text content - Google Patents

Publication number
CN103020266B
Authority
CN
China
Prior art keywords
coupling
web page
setting option
webpage
page contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210573022.8A
Other languages
Chinese (zh)
Other versions
CN103020266A (en
Inventor
谢洲为
潘洪学
糜裕峰
任寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210573022.8A
Publication of CN103020266A
Application granted
Publication of CN103020266B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and apparatus for extracting webpage text content. A method provided by an embodiment of the invention includes: presetting at least one webpage text content matching setting on the browser side; downloading web page content on the browser side; matching the web page content against the webpage text content matching settings one by one until the web page content is matched successfully; and extracting the webpage text content from the web page content using the matching setting that matched successfully.

Description

Method and apparatus for extracting webpage text content
Technical field
The present invention relates to the field of network technology, and in particular to a method and apparatus for extracting webpage text content.
Background
With the spread of Internet technology, the network has become one of the main channels through which people obtain information, and the text content of web pages is the main carrier of that information. Besides text content, however, a web page usually also contains useless material such as large numbers of advertising images and non-article content, which seriously degrades the user's reading experience.
In the prior-art scheme for extracting webpage text content, after a web page has been loaded in the browser, the content of the page is segmented, a matching rule file in the browser is used to locate the web page content, and the required field content is extracted and displayed. The user thus sees the page after its text has been filtered and can read conveniently and without distraction.
The existing scheme for extracting webpage text content has at least the following defect:
The existing scheme provides one matching rule file for one predetermined web page structure, and that file can only extract webpage text content from pages with that structure. Because Internet resources are updated quickly, web page structures change frequently, and the existing matching rule file can then no longer extract text from the changed page. A new matching rule file has to be generated and installed in the browser again, which makes re-establishing the match cumbersome, labor-intensive, and inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and apparatus for extracting webpage text content that overcome the above problems or at least partially solve them.
According to one aspect of the invention, an embodiment provides a method for extracting webpage text content, including: presetting at least one webpage text content matching setting on the browser side; downloading web page content on the browser side; matching the web page content against the webpage text content matching settings one by one until the web page content is matched successfully; and extracting the webpage text content from the web page content using the matching setting that matched successfully.
Another embodiment of the invention further provides an apparatus for extracting webpage text content, including: a matching-setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side; a download unit, adapted to download web page content on the browser side; a matching unit, adapted to match the web page content against the webpage text content matching settings one by one until the web page content is matched successfully; and an extraction unit, adapted to extract the webpage text content from the web page content using the matching setting that matched successfully.
As can be seen from the above, embodiments of the invention establish multiple webpage text content matching settings on the browser side and match the same web page content against those multiple settings. When the web page content changes, a matching setting that fits the changed page can be found among the multiple settings, and the setting that matches successfully can be used to extract the webpage text content. This scheme also avoids having to generate a new matching rule file and install it in the browser whenever the web page content changes, which simplifies the matching operation, reduces the workload, and improves efficiency.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a schematic structural diagram of an apparatus for extracting webpage text content according to an embodiment of the invention;
Fig. 2 is a flowchart of a method for extracting webpage text content according to another embodiment of the invention.
Detailed description of the invention
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
One embodiment of the invention provides an apparatus for extracting webpage text content that can offer users a convenient, focused reading service while maintaining the speed and stability of text extraction. Referring to Fig. 1, the apparatus includes a matching-setting configuration unit 100, a download unit 101, a matching unit 102, an extraction unit 103, a loading control unit 104, a filter unit 105, a matching-setting update unit 106, a multi-thread control unit 107, an input unit 108, and an upload unit 109. Each unit is described below.
The matching-setting configuration unit 100 is adapted to preset at least one webpage text content matching setting on the browser side. Specifically, the matching-setting configuration unit 100 is adapted to create a matching-setting file and to save the at least one webpage text content matching setting in that file. The matching-setting file includes at least one website node; each website node includes at least one web page node; at least some web page nodes contain two or more matching-setting description nodes; and each matching-setting description node corresponds to one webpage text content matching setting. A matching-setting description node may include one or more matching setting options, and at least two webpage text content matching settings each include a different matching setting option for the same type of text content.
The matching-setting configuration unit 100 creates one website node for each type of website, so that one website node corresponds to one type of website. Under a website node, it creates one web page node for each type of web page on the corresponding website, so that one web page node corresponds to one type of web page. The matching setting options in the matching-setting description nodes of each web page node are created according to the content of the page. Different web pages contain different content, so the matching setting options in the corresponding description nodes also differ.
A web page node may contain multiple matching-setting description nodes. A web page usually contains some fixed information that rarely changes and some variable information that changes easily. Among the description nodes under a web page node, the matching-setting configuration unit 100 therefore designates one as the first matching-setting description node. The first description node contains the most comprehensive set of matching setting options: at least one option for each type of text content in the page. In the description nodes other than the first, matching setting options may be created only for the variable information in the page, and the options created in those other description nodes differ from one another.
This approach simplifies the structure of the webpage text content matching settings. On the one hand, it avoids duplicated parts across different matching settings and reduces the amount of matching-setting data that must be stored, which improves resource utilization. On the other hand, it avoids repeating the matching operation on identical web page content, which improves matching efficiency.
The matching-setting file is described in detail below with reference to a code example.
The nodes in that example are described as follows:
1. <websites> root node: this is the top-level parent node. It corresponds to one matching-setting file and consists of several website (website) nodes.
2. <website> node: each website node represents one supported website. One or more web page nodes are set under a website node; for example, a book (book) web page node, a catalog (catalog) web page node, and a chapter (chapter) web page node are set under the website node www.feiku.com. A web page node also has a download mode (downloadmode) attribute and an element filter (elementfilter) attribute.
3. <book> web page node: describes the novel's home page information. Two matching-setting description nodes <profile> are set under this web page node. The <profile> that serves as the first matching-setting description node is configured with multiple matching setting options: the URL (Uniform/Universal Resource Locator) matching setting option describes matching of the relevant URL and obtaining the bookid (identifier) information; the title matching setting option describes how to obtain the title of the novel's home page; the catalogurl (catalog URL) matching setting option describes the catalog URL of the novel; the lasterchapter (latest chapter) matching setting option describes the latest chapter; and the lasterchapterurl (latest chapter URL) matching setting option describes the URL of the latest chapter.
4. <catalog> web page node: describes the novel's catalog page information. Only one matching-setting description node is set under this web page node. It includes: the URL matching setting option, which describes matching of the relevant URL and obtaining the bookid information; the chapterlist matching setting option, which describes the relevant content of the catalog page; and returnbook, which describes the URL of the novel's home page.
5. <chapter> web page node: describes the novel's chapter page information. Two <profile> nodes are set under this web page node. The <profile> that serves as the first matching-setting description node is configured with: the URL matching setting option, which describes matching of the relevant URL and obtaining the bookid information; the title matching setting option, which describes how to obtain the title of the novel's home page; the text matching setting option, which describes the body text of the novel; the next matching setting option, which describes the URL of the next chapter page; the prev matching setting option, which describes the URL of the previous chapter page; the returncatalog (return to catalog) matching setting option, which describes the catalog page URL stored in the chapter page; and the returnbook (return to book) matching setting option, which describes the novel home page stored in the chapter page.
6. <profile> matching-setting description node: when multiple webpage text content matching settings are set under one web page node, matching-setting description nodes <profile> can be configured, each <profile> corresponding to one webpage text content matching setting. A <profile> sits under a specific web page node, for example under the book and chapter web page nodes above, and the matching setting options are placed inside the <profile>.
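The node hierarchy listed above can be illustrated with a minimal Python sketch. It is not the patent's own example code: the `name` attribute on <website>, the attributes given to each option, the sample values, and the function `load_settings` are all assumptions made for illustration. Only the node names (websites, website, book, catalog, profile) and the downloadmode and elementfilter attributes come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical matching-setting file; attribute names and values are assumptions.
SAMPLE = """
<websites>
  <website name="www.feiku.com">
    <book downloadmode="1" elementfilter="3">
      <profile>
        <url match="^http://www.feiku.com/"/>
        <title el="1" id="booktitle"/>
      </profile>
      <profile>
        <title el="4" classname="title"/>
      </profile>
    </book>
    <catalog downloadmode="0" elementfilter="0">
      <profile>
        <chapterlist el="1" id="chapters"/>
      </profile>
    </catalog>
  </website>
</websites>
"""

def load_settings(xml_text):
    """Map website name -> web page node name -> attributes and profiles."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for site in root.findall("website"):
        pages = {}
        for page in site:
            # Each <profile> becomes a dict: option name -> option attributes.
            profiles = [
                {option.tag: dict(option.attrib) for option in profile}
                for profile in page.findall("profile")
            ]
            pages[page.tag] = {
                "downloadmode": int(page.get("downloadmode", "0")),
                "elementfilter": int(page.get("elementfilter", "0")),
                "profiles": profiles,
            }
        settings[site.get("name")] = pages
    return settings

settings = load_settings(SAMPLE)
```

In this sketch the first <profile> under a web page node plays the role of the first matching-setting description node, and later profiles carry only the options that need an alternative.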
When a web page access instruction is received from the user, the download unit 101 downloads web page content on the browser side; for example, the download unit 101 establishes a connection with a server and downloads from it the web page content corresponding to the access instruction.
The matching unit 102 matches the downloaded web page content against the webpage text content matching settings one by one until the web page content is matched successfully. Continuing with the scenario in the example code, the matching unit 102 looks up, in the matching-setting file, the website node and web page node that correspond to the web page content; from the downloaded content it finds that the website node is www.feiku.com and the web page node is the book web page node. Under the web page node found, it then matches the web page content against the matching setting options in the first matching-setting description node, in order. When the first description node of the book web page node is configured as the first <profile> under that node, the web page content is first matched against the options in that first <profile>. For an option that matches successfully, the matching result is set to the webpage text content extracted with that option; the result returned may be the extracted text content itself, or information indicating that the result is true (TRUE). For an option that fails to match, the result returned is an empty string indicating that the content could not be processed, or information indicating that the result is false (FALSE). The option corresponding to the failed option is then looked up in the description nodes other than the first one under this web page node (for example, in the second <profile> under the book web page node), and the option found is matched against the web page content, until an option found matches the web page content successfully; the matching result is then set to the webpage text content extracted with that option. In other words, for web page content that fails to match the first matching-setting description node, as long as some <profile> can match it, that <profile> can be used to extract the corresponding web page content.
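The fallback order described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: each matching setting option is modeled as a function that returns the extracted text or an empty string on failure, and the regex-based helper `by_regex`, the sample page, and the option patterns are hypothetical.

```python
import re

def by_regex(pattern):
    """Build a hypothetical option: return group 1 of the first match, else ''."""
    compiled = re.compile(pattern)
    def extract(page):
        found = compiled.search(page)
        return found.group(1) if found else ""
    return extract

def extract_with_fallback(page, profiles):
    """Try every option of the first profile; on failure, try the option with
    the same name in the remaining profiles until one of them succeeds."""
    first, others = profiles[0], profiles[1:]
    result = {}
    for name, option in first.items():
        text = option(page)
        if not text:
            for profile in others:
                fallback = profile.get(name)
                if fallback is not None:
                    text = fallback(page)
                    if text:
                        break
        result[name] = text  # stays '' if no profile could match
    return result

PAGE = '<h2 class="t">Book</h2><a href="/c/1">list</a>'
PROFILES = [
    {"title": by_regex(r"<h1>(.*?)</h1>"), "catalogurl": by_regex(r'href="(.*?)"')},
    {"title": by_regex(r'<h2 class="t">(.*?)</h2>')},
]
extracted = extract_with_fallback(PAGE, PROFILES)
```

Here the title option of the first profile fails because the page has no h1 element, so the title option of the second profile supplies the value, while catalogurl is taken from the first profile directly.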
Web page content is normally presented as HTML (Hypertext Markup Language), so the matching unit 102 also has to work on the HTML elements of the page when matching. For example, the matching unit 102 parses the downloaded web page content hierarchically to obtain its DOM (Document Object Model) structure and, based on that DOM structure, matches the web page content against the webpage text content matching settings one by one, thereby extracting the webpage text content.
The extraction unit 103 is adapted to extract the webpage text content from the web page content using the matching setting that matched successfully. Specifically, the extraction unit 103 takes all the webpage text content extracted with the successfully matched matching setting options as the webpage text content identified in the web page content.
Further, in this embodiment the download of web page content can also be controlled through the download mode (downloadmode) attribute and the element filter (elementfilter) attribute that the matching-setting configuration unit 100 sets in the web page node. The apparatus further includes a loading control unit 104 and a filter unit 105.
The matching-setting configuration unit 100 defines at least two values for the download mode attribute. For example, a value of 0 indicates that the whole web page content is downloaded into the browser in the browser's existing page download mode; a value of 1 indicates that the filter unit 105 filters the web page content and only the content remaining after filtering is downloaded into the browser.
The matching-setting configuration unit 100 defines multiple values for the element filter attribute, each corresponding to one filter type. For example, value 1 filters images (img), value 2 filters cascading style sheets (Cascading Style Sheets, CSS), value 4 filters frames (frame), value 8 filters JavaScript scripts, value 16 filters objects (object), and value 32 filters embedded (embed) content.
When a combination of several of these filter types is needed, the binary values of the corresponding attribute values can be combined with a bitwise OR to produce a new attribute value, which then indicates all of those filter types.
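The bitwise-OR combination of filter types can be illustrated with a short Python sketch. The numeric values are those given in the description; the class name `ElementFilter`, the member names, and the helper `filters_in` are illustrative assumptions.

```python
from enum import IntFlag

class ElementFilter(IntFlag):
    """Filter types and their attribute values as listed in the description."""
    IMG = 1
    CSS = 2
    FRAME = 4
    JAVASCRIPT = 8
    OBJECT = 16
    EMBED = 32

def filters_in(value):
    """Return the names of the filter types encoded in an elementfilter value."""
    return [flag.name for flag in ElementFilter if value & flag]

# Combine "filter images" and "filter CSS" into one attribute value.
combined = ElementFilter.IMG | ElementFilter.CSS
```

Because every filter type occupies its own bit, any combination maps to a unique attribute value, and the individual types can be recovered with a bitwise AND.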
The loading control unit 104 is adapted to act before the web page content is matched, under the web page node found, against the matching setting options in the first matching-setting description node of that node. It judges whether the value of the download mode attribute in the web page node found is a predetermined value (for example, 1). If so, it starts the filter unit 105, and the filtered web page content is then matched, in order, against the options in the first description node; if not, the web page content is downloaded into the browser directly.
The filter unit 105 is adapted to filter the content of the web page according to the filter type indicated by the element filter attribute. For example, when the attribute value indicates filtering images, the filter unit 105 filters out all images in the web page content; when it indicates filtering images and CSS, the filter unit 105 filters out all images and CSS.
Some of the main matching setting options that the matching-setting configuration unit 100 configures in a matching-setting description node are described below.
I. Extraction of the web page URL
The webpage text content matching settings configured by the matching-setting configuration unit 100 include a web page URL matching setting option created for the URL of the web page content.
This section uses the url node in the example to explain the web page URL matching setting option in five respects: the Match, Trans, Bookid, Booksep, and Tabtitle settings.
1) Match setting: match attribute setting option
The web page URL matching setting option contains a match attribute setting option, which includes:
a. The web page URL begins with predetermined content. For example, content beginning with ^ indicates that the url must begin with the content that follows ^.
b. The web page URL contains predetermined content, and a predetermined position within that content may hold any character. For example, when the predetermined content begins with a marker character, the url must contain the content that follows the marker; the character * may be added inside that content and matches any character.
c. The web page URL does not contain predetermined content, and that content may include any character. For example, content beginning with ! indicates that the url must not contain the content that follows !; the character * may be added inside that content and matches any character.
When a web page URL is extracted, conditions a, b, and c may all be required at once, or only one or two of them.
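The three conditions can be combined as in the following Python sketch. It makes several assumptions: the conditions are passed as separate arguments instead of being parsed from a single match string (the marker character for condition b is not recoverable from this text), the character * is taken to match any run of characters, and the function names are hypothetical.

```python
import re

def _to_regex(fragment):
    # Escape the fragment, then turn the escaped wildcard back into "any characters".
    return re.escape(fragment).replace(r"\*", ".*")

def url_matches(url, prefix=None, contains=(), excludes=()):
    """Check a URL against the 'begins with', 'contains', and 'does not contain' conditions."""
    if prefix is not None and not re.match(_to_regex(prefix), url):
        return False  # condition a: must begin with the prefix
    if any(not re.search(_to_regex(c), url) for c in contains):
        return False  # condition b: must contain every listed fragment
    if any(re.search(_to_regex(e), url) for e in excludes):
        return False  # condition c: must contain none of the excluded fragments
    return True

URL = "http://www.feiku.com/book/123.html"
```

Leaving an argument at its default skips that condition, which corresponds to requiring only one or two of a, b, and c.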
2) Trans setting: conversion attribute setting
The URL of a web page is obtained by conversion from the known identifier of the web page content and the composition format of the URL. This operation is mainly used where a web page node has only one matching-setting description node, that is, only one profile, to convert URLs for given identifiers of the novel home page, catalog page, chapter page, and so on. This setting option describes the composition format of the url; only identifiers such as bookid or chapterid need to be filled in to obtain a url, for example: trans=http://www.qidian.com/BookReader/##s,##s.aspx^^bookid^^chapterid
The string above shows the composition format of the URL. Filling bookid into the first ##s and chapterid into the second ##s yields the url of a chapter page.
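The substitution can be sketched in Python as follows. The trans string is the one quoted above; the function name `apply_trans` and the sample identifier values are assumptions, and the sketch assumes that fields listed after ^^ fill the ##s placeholders in order.

```python
def apply_trans(trans, values):
    """Fill the ##s placeholders of a trans template, in order, with the named identifiers."""
    template, *fields = trans.split("^^")
    url = template
    for field in fields:
        # Replace only the first remaining placeholder for each field.
        url = url.replace("##s", str(values[field]), 1)
    return url

TRANS = "http://www.qidian.com/BookReader/##s,##s.aspx^^bookid^^chapterid"
chapter_url = apply_trans(TRANS, {"bookid": "1234", "chapterid": "56"})
```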
3) Bookid setting: identifier attribute setting
The characters at a predetermined position in the web page URL are taken as the identifier of the web page content.
This operation obtains the identifier from the url, such as the bookid string. For example, in bookid=http://www.readnovel.com/novel/*.html, the position of the character * is the predetermined position, and the string at that position is taken as the extracted identifier, such as the bookid string.
The identifier extracted in this operation can be used to convert web page URLs.
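A minimal Python sketch of this extraction follows. The pattern is the one quoted above; the function name `extract_id`, the sample URL, and the choice to return an empty string when the URL does not fit the pattern are assumptions.

```python
import re

def extract_id(pattern, url):
    """Return the part of the URL that sits where the pattern has '*', or '' if the URL does not fit."""
    head, star, tail = pattern.partition("*")
    if not star:
        return ""  # the pattern marks no position to extract
    found = re.fullmatch(re.escape(head) + "(.+)" + re.escape(tail), url)
    return found.group(1) if found else ""

BOOKID_PATTERN = "http://www.readnovel.com/novel/*.html"
book_id = extract_id(BOOKID_PATTERN, "http://www.readnovel.com/novel/12345.html")
```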
4) Booksep setting: identifier extraction attribute setting
Characters at a predetermined position are selected, as the identifier, from the identifier obtained through the identifier attribute setting. This operation is mainly used where the identifier obtained is complex and needs further extraction.
For example, with the extraction structure booksep="/:0", when the identifier bookid contains the "/" symbol, booksep can be used to obtain a purely numeric value: "/" is the splitting symbol, ":" is the separator, and "0" indicates which segment (counting from 0) to take as the identifier bookid once "/" has split the target text into segments.
The identifier extracted in this operation can be used to convert web page URLs.
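The booksep rule can be sketched in Python as follows, assuming the setting has the form splitting symbol, ":", segment index, as in "/:0". The function name `apply_booksep`, the sample identifier, and the empty-string result for an out-of-range index are assumptions.

```python
def apply_booksep(booksep, raw_id):
    """Split raw_id on the symbol before ':' and return the segment at the index after ':'."""
    splitter, _, index = booksep.rpartition(":")
    parts = raw_id.split(splitter)
    position = int(index)
    # Segments are counted from 0; an index past the end yields ''.
    return parts[position] if 0 <= position < len(parts) else ""

clean_id = apply_booksep("/:0", "12345/67")
```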
5) Tabtitle setting: web page title extraction attribute setting
The content before a predetermined character in the web page content is extracted as the title (Title) information. For example, with the extraction structure tabtitle=2*-3, the part before the first "-" that appears is all title. The symbol * matches any character.
II. Extraction of HTML content in the web page
In a matching-setting description node (for example, the first matching-setting description node), the matching-setting configuration unit 100 creates at least one matching setting option for the HTML (Hypertext Markup Language) element in the web page content that corresponds to each type of text content in the page.
Different types of web page require different HTML elements to be extracted. In the scenario of the example code, the HTML elements to be processed include the <title> element indicating the title, the <catalogurl> element indicating the catalog url, the <lastchapter> element indicating the latest chapter, the <lastchapterurl> element indicating the url of the latest chapter, the <text> element indicating the body text, the <next> element indicating the url of the next page, the <prev> element indicating the url of the previous page, the <returncatalog> element indicating the return-to-catalog url, and the <returnbook> element indicating the return-to-home-page url.
The matching setting options that the matching-setting configuration unit 100 creates for HTML elements include primary-location matching setting options and secondary-location matching setting options, described in turn below.
1) Primary-location matching setting options
The primary-location matching setting options include at least:
a. Base-point lookup setting option el: indicates how the base point is looked up. It can be set to values such as 1, 2, 4, 8, and 16, where 1 corresponds to lookup by identifier (id), 2 to lookup by name (name), 4 to lookup by class name (classname), 8 to lookup by content (value), and 16 to lookup by expression (regular).
b. Identifier location setting option id: locates the element whose identifier matches that of the HTML element.
c. Name location setting option name: locates the element whose name matches that of the HTML element.
d. Class-name location setting option classname: locates the element whose class name matches that of the HTML element; when several elements have a matching class name, only the first is matched.
e. Content location setting option value: locates the element whose content (innertext) matches that of the HTML element; when several elements match, only the first is matched.
f. Expression location setting option regular: locates the element that matches the expression in the HTML element; for example, the expression %CUURENTURL% is located to the url that matches it.
g. Tag setting option tag: indicates the type and/or attribute of the element located when the identifier, name, class-name, content, or expression location setting option is used to locate an element.
That is, tag indicates the element type and attribute for primary location. For example, with the structure tag="a-href", the href attribute of the located element is taken, and the located element is of type a. The tag setting option takes effect only when no secondary location occurs; if secondary location occurs, tag is only responsible for checking.
2) Secondary-location matching setting options
On top of primary location, secondary location can also be performed on the result that primary location obtained. The secondary-location matching setting options include:
a. Parent query setting option parentselect: sets how the parent element of the element located by the primary-location matching setting option is looked up;
b. Child query setting option childrenselect: sets how the child elements of the element located by the primary-location matching setting option are looked up;
c. When the parent query setting option and the child query setting option both exist, the parent element of the element located by the primary-location matching setting option is looked up first according to the parent query setting option, and the child elements of that parent element are then looked up according to the child query setting option.
This embodiment also provides a specific way of locating by element name, element attribute, order, and so on, for use in setting options such as parentselect, childrenselect, and tag. For example, when the expression is "ul:0|li:1|a-href:0", the following locating operations are performed starting from the currently located element:
1. Find the 1st <ul> tag (0 denotes the first) at the next level (or the previous level, or the current level) of the current element: under parentselect, the 1st <ul> tag at the level above the current element is looked up; under childrenselect, the 1st <ul> tag at the level below the current element; under tag, the 1st <ul> tag at the current level of the current element.
2. Then find the 2nd <li> tag (1 denotes the second) at the next level (previous level, current level) of the ul element.
3. Then find the 1st <a> tag (0 denotes the first) at the next level (previous level, current level) of the li element.
4. After the a element is found, if -href is set, the content of the href attribute of the a element is taken; without that setting, the element content (innertext) of the a element is taken directly.
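The four steps above can be sketched in Python for the childrenselect case. This is an illustration under stated assumptions: the page fragment is well-formed markup parsed with the standard xml.etree.ElementTree module in place of a browser DOM, tag names contain no hyphen, and the function names and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_path(path):
    """Turn 'ul:0|li:1|a-href:0' into [(tag, attribute or None, index), ...]."""
    steps = []
    for part in path.split("|"):
        target, _, index = part.strip().partition(":")
        tag, _, attr = target.partition("-")
        steps.append((tag, attr or None, int(index)))
    return steps

def select_children(element, path):
    """Walk down one level per step; return the attribute named in the last
    step, or the element's text if no attribute is named. '' if not found."""
    attr = None
    for tag, attr, index in parse_path(path):
        matches = element.findall(tag)  # direct children with this tag only
        if index >= len(matches):
            return ""
        element = matches[index]
    return element.get(attr, "") if attr else "".join(element.itertext())

SAMPLE = ('<div><ul><li><a href="/c/1">One</a></li>'
          '<li><a href="/c/2">Two</a></li></ul></div>')
root = ET.fromstring(SAMPLE)
```

For parentselect or tag, the same parsed steps would be applied at the level above or at the current level instead of the level below.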
3) Filter settings
For HTML elements, the matching setting items established by the matching setting configuration unit 100 also include an element deletion matching setting item elementerase, used to erase certain child elements within a located element. The element deletion matching setting item at least includes:
Deleting predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
For example, when elementerase="font:0 | FONT:0" is set, the content between font or FONT tags within the selected content is "erased". The manner of erasing depends on the meaning of the value after the ":" symbol: for example, the value 0 corresponds to changing the element name to div style="display:none"; the value 1 corresponds to changing the element name to one that cannot be recognized; and the value 2 corresponds to deleting the element.
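The three erase modes of elementerase described above can be sketched as follows. This is a hedged illustration only: ElementTree stands in for the browser DOM, the function names are hypothetical, and the placeholder tag name "x-erased" used for mode 1 is an assumption, since the text does not say what the unrecognizable name is.

```python
import xml.etree.ElementTree as ET

def parse_erase(setting):
    """Turn 'font:0 | FONT:0' into {'font': 0}; tag names are compared case-insensitively."""
    rules = {}
    for part in setting.split("|"):
        tag, _, mode = part.strip().partition(":")
        rules[tag.strip().lower()] = int(mode)
    return rules

def apply_erase(root, setting):
    """Apply the erase modes to every descendant of root, in place."""
    rules = parse_erase(setting)
    for parent in list(root.iter()):
        for child in list(parent):
            mode = rules.get(child.tag.lower())
            if mode == 0:
                # Mode 0: turn the element into a hidden div.
                child.tag = "div"
                child.set("style", "display:none")
            elif mode == 1:
                # Mode 1: rename to a tag that will not be recognized (name is an assumption).
                child.tag = "x-erased"
            elif mode == 2:
                # Mode 2: delete the element (ElementTree also drops its tail text).
                parent.remove(child)
    return root

# Hypothetical sample used only for illustration.
sample = ET.fromstring("<p>Body <font>ad</font><b>keep</b><span>gone</span></p>")
apply_erase(sample, "font:0 | FONT:0 | span:2")
```

After the call, the font element has become a hidden div, the b element is untouched and the span element has been removed.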
Further, the above apparatus also includes a matching setting update unit 106, adapted to update, after a matching settings file has been established and according to a received update instruction, the website nodes, webpage nodes, matching setting description nodes and/or the matching setting items in the matching setting description nodes of the matching settings file. For example, when a website no longer exists on the Internet, or when text extraction is no longer needed for the webpages of that website, the matching setting update unit 106 deletes from the matching settings file the website node corresponding to that website together with all related settings under it.
Further, the above apparatus also includes a multi-thread control unit 107. The multi-thread control unit 107 is adapted to, when multiple downloaded webpage contents exist on the browser side, allocate one thread to each webpage content and control the matching unit to match, in each allocated thread, the corresponding webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully; and/or the multi-thread control unit 107 is adapted to allocate multiple threads to one webpage content on the browser side and control the matching unit to match the webpage content against different webpage text content matching settings in different threads until the webpage content is matched successfully. By using multi-threaded processing, this scheme can extract the text of one or more webpage contents more quickly, shorten the time the browser takes to load a webpage, and present the extracted webpage text content to the user in the browser quickly.
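The first multi-threading variant above (each downloaded page matched in its own thread) might look like the following sketch. It makes several assumptions: a matching setting is modelled as a plain Python callable that returns the extracted text or None, the helper after_marker is a hypothetical stand-in for a real setting, and a thread pool is used, so the number of threads is capped by max_workers rather than being strictly one per page.

```python
from concurrent.futures import ThreadPoolExecutor

def match_page(page, settings):
    """Try each matching setting in turn until one succeeds."""
    for setting in settings:
        result = setting(page)
        if result is not None:
            return result
    return None

def match_pages_concurrently(pages, settings, max_workers=4):
    """Match every page in a worker thread; results keep the order of pages."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: match_page(page, settings), pages))

def after_marker(marker):
    """Hypothetical setting: succeed if the marker occurs, returning the text after it."""
    def setting(page):
        _, found, rest = page.partition(marker)
        return rest if found else None
    return setting

# Hypothetical inputs used only for illustration.
pages = ["TEXT:hello", "BODY:world", "nothing here"]
settings = [after_marker("TEXT:"), after_marker("BODY:")]
results = match_pages_concurrently(pages, settings)
```

The second variant (several threads racing different settings against one page) could reuse the same pool by submitting one task per setting instead of one task per page.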
The above apparatus also includes an input unit 108 and an uploading unit 109. The input unit 108 is adapted to receive a selection instruction, sent by the user, that selects webpage text content matching settings. The matching setting configuration unit 100 is then further adapted to establish a matching settings file according to the selection instruction and to save the webpage text content matching settings named in the selection instruction in the established matching settings file; the matching setting configuration unit 100 may also update the matching settings file according to an update instruction from the user. The uploading unit 109 is adapted to upload the matching settings file to a server, where it is stored in the user data of that user on the server side, so that when the matching settings file on the browser side is damaged or lost, the browser side can use the matching settings file saved on the server side to restore or update it.
Further, the above apparatus also includes a start control unit, adapted to determine, upon detecting the DocumentComplete event indicating that the browser has finished loading the document, that the extraction operation on the webpage content can now be performed, and then to start the matching unit to match the webpage content against the webpage text content matching settings one by one.
It will be appreciated that one or more of the matching setting update unit 106, the multi-thread control unit 107, the input unit 108 and the uploading unit 109 may be omitted in some scenarios.
In summary, the embodiments of the present invention establish multiple webpage text content matching settings on the browser side and match the same webpage content against those multiple settings. When the webpage content changes, a webpage text content matching setting that matches the changed webpage can be found among the multiple settings, so that the successfully matched setting can be used to extract the webpage text content. Furthermore, this scheme avoids having to generate a new matching rule file and configure it in the browser whenever the webpage content changes, which simplifies the matching operation, reduces workload and improves efficiency.
Another embodiment of the present invention also provides a client device. A browser is installed on the client device, and the apparatus for extracting webpage text content described above is provided in the browser.
The client device starts the apparatus for extracting webpage text content according to a webpage browsing instruction from the user, and presents the webpage text content extracted by the apparatus to the user in the browser.
For the specific way in which the client device operates the apparatus for extracting webpage text content, refer to the related apparatus embodiments of the present invention; details are not repeated here.
Another embodiment of the present invention also provides a method for extracting webpage text content, which can provide the user with a convenient and focused reading service while ensuring the speed and stability of text extraction. The method includes:
S200: Preset at least one webpage text content matching setting on the browser side.
A matching settings file is established and the at least one webpage text content matching setting is saved in it. The matching settings file includes at least one website node; each website node includes at least one webpage node; at least some of the webpage nodes are provided with two or more matching setting description nodes; each matching setting description node corresponds to one webpage text content matching setting; and at least two of the webpage text content matching settings include different matching setting items for the same type of text content.
In this embodiment, a website node is established for each type of website. Under a website node, a webpage node is established for each type of webpage of the website corresponding to that website node. The matching setting items in the matching setting description nodes of each webpage node are established according to the content of the webpage: in the first matching setting description node of a webpage node, at least one matching setting item is established for each type of text content in the webpage corresponding to that webpage node; and for the same type of text content in the webpage, the matching setting item established in the first matching setting description node differs from the matching setting items established in the other matching setting description nodes of that webpage node. Thus, for a given webpage content, when a matching setting item in the first matching setting description node fails to match it, the webpage content can be matched against the matching setting items in the other matching setting description nodes until the match succeeds.
A webpage node includes multiple matching setting description nodes. Because a webpage usually contains some fixed information that rarely changes and some variable information that changes easily, one of the matching setting description nodes under the webpage node is designated as the first matching setting description node. The matching setting items in this first node are the most comprehensive and include at least one matching setting item for each type of text content in the webpage. In the other matching setting description nodes, matching setting items may be established only for the variable information in the webpage, and the matching setting items established in each of these other nodes differ from one another.
This approach simplifies the structure of the webpage text content matching settings. On the one hand, it avoids duplicated parts across different matching settings and reduces the amount of matching setting data that must be stored, thereby improving resource utilization. On the other hand, it avoids repeating the matching operation on identical webpage content, improving matching efficiency.
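One way to picture the storage layout described above is the following sketch, in which the first description node of a webpage node is comprehensive and later nodes carry only alternatives for the variable content. The class names, field names and the string form of the setting items ("id=content" and so on) are all hypothetical; the text does not specify a file format.

```python
from dataclasses import dataclass, field

@dataclass
class DescriptionNode:
    # Maps a content type (e.g. "title", "body") to one matching setting item.
    items: dict

@dataclass
class PageNode:
    name: str
    # descriptions[0] is the comprehensive first node; the rest are fallbacks.
    descriptions: list = field(default_factory=list)

def candidate_items(page_node, content_type):
    """List the items to try for one content type, first node first."""
    return [d.items[content_type]
            for d in page_node.descriptions
            if content_type in d.items]

# Hypothetical webpage node: the title is treated as fixed, the body as variable.
chapter = PageNode("chapter", [
    DescriptionNode({"title": "id=title", "body": "id=content"}),
    DescriptionNode({"body": "class=article"}),
])
```

Because the fallback node stores only the "body" item, the fixed "title" item is kept once rather than being repeated in every description node.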
Further, the webpage node includes a download mode attribute and an element filter attribute. The filter types indicated by the element filter attribute include one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects and filtering embedded content.
Before the step of matching, under the webpage node found, the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node, the above method further includes:
Judging whether the value of the download mode attribute in the webpage node found is a predetermined value. If so, the content of the webpage is filtered according to the filter types indicated by the element filter attribute, and then, under the webpage node found, the filtered webpage content is matched in turn against the matching setting items in the first matching setting description node of that webpage node. If not, the webpage content is downloaded directly in the browser.
The above webpage text content matching setting includes a webpage URL matching setting item established for the URL of the webpage content. The webpage URL matching setting item contains a match attribute setting item, which includes:
The webpage URL begins with predetermined content; and/or the webpage URL contains predetermined content, where predetermined positions in that content may hold any character; and/or the webpage URL does not contain predetermined content, where that predetermined content may include any character.
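The three URL match attributes above (begins with, contains, does not contain, with positions that accept any character) can be sketched with shell-style wildcards. This is an assumption-laden illustration: the parameter names are hypothetical, and using "*" as the "any character" placeholder via the standard-library fnmatch module is a choice made here, not something the text specifies.

```python
from fnmatch import fnmatchcase

def url_matches(url, starts_with=None, contains=None, not_contains=None):
    """Return True if the URL satisfies every match attribute that is given."""
    if starts_with is not None and not url.startswith(starts_with):
        return False
    # '*' inside the pattern stands for any run of characters.
    if contains is not None and not fnmatchcase(url, f"*{contains}*"):
        return False
    if not_contains is not None and fnmatchcase(url, f"*{not_contains}*"):
        return False
    return True

# Hypothetical URL used only for illustration.
chapter_url = "http://example.com/book/123/chapter/7.html"
is_chapter = url_matches(
    chapter_url,
    starts_with="http://example.com/",
    contains="/book/*/chapter/",
    not_contains="/index",
)
```

Note that fnmatch also treats "?" and "[...]" as special, so a real configuration would need to escape those characters if they can occur in URLs.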
The above webpage URL matching setting item also includes a banner attribute setting, a banner extraction attribute setting and a conversion attribute setting.
The banner attribute setting includes taking the characters at a predetermined position in the URL of the webpage as the banner of the webpage content. The banner extraction attribute setting includes selecting, as the banner, the characters at a predetermined position in the banner obtained by matching according to the banner attribute setting. The conversion attribute setting includes deriving the URL of the webpage from the known banner of the webpage content and the composition format of the URL.
The above webpage URL matching setting item also includes a webpage title extraction attribute setting, which includes extracting the content before a predetermined character in the webpage content as the title.
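The title rule above (take the content that precedes a predetermined character) reduces to a single split. In the minimal sketch below, the function name is hypothetical, and returning the whole text when the delimiter is absent is an assumption, since the text does not cover that case.

```python
def extract_title(text, delimiter):
    """Return the text before the first occurrence of delimiter."""
    head, found, _ = text.partition(delimiter)
    # Assumption: fall back to the whole text if the delimiter never occurs.
    return head.strip() if found else text.strip()

# Hypothetical page title used only for illustration.
title = extract_title("Chapter 12 The Return - Example Novels", " - ")
```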
Establishing, in the first matching setting description node of a webpage node, at least one matching setting item for each type of text content in the webpage corresponding to that webpage node includes:
In the first matching setting description node, establishing at least one matching setting item for the HyperText Markup Language (HTML) element, in the webpage content, of each type of text content in the webpage.
The above matching setting items established for HTML elements include a primary positioning matching setting item, which at least includes:
A base-point lookup setting item, which indicates the manner of base-point lookup, the manner including lookup by ID, lookup by name, lookup by class name, lookup by content and lookup by expression; and/or an ID positioning setting item, which locates the element matching the ID of an HTML element; and/or a name positioning setting item, which locates the element matching the name of an HTML element; and/or a class-name positioning setting item, which locates the element matching the class name of an HTML element; and/or a content positioning setting item, which locates the element matching the content of an HTML element; and/or an expression positioning setting item, which locates the element matching an expression in an HTML element; and/or a tag setting item, which indicates the type and/or attribute of the located element when the ID, name, class-name, content or expression positioning setting item is used to locate an element.
The above matching setting items established for HTML elements also include a secondary positioning matching setting item, which at least includes:
A parent query setting item, which specifies how to find the parent element of the element located by the primary positioning matching setting item; or a child query setting item, which specifies how to find the child elements of the element located by the primary positioning matching setting item; or, when the parent query setting item and the child query setting item are both present, first finding the parent element of the element located by the primary positioning matching setting item according to the parent query setting item, then finding the child elements of that parent element according to the child query setting item.
The above matching setting items established for HTML elements also include an element deletion matching setting item, which at least includes: deleting predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
S202: Download webpage content on the browser side.
S204: Look up, in the matching settings file, the website node and webpage node corresponding to the webpage content.
S206: Under the webpage node found, match the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node, and perform step S208 or S210 according to each matching result.
S208: For a matching setting item that matches successfully, set the matching result to the webpage text content extracted with that matching setting item.
S210: For a matching setting item that fails to match, look up, in the matching setting description nodes of the webpage node other than the first one, the matching setting items corresponding to the failed item, and match the items found against the webpage content until one of them matches successfully; then set the matching result to the webpage text content extracted according to that matching setting item.
S212: Use the webpage text content matching setting that matched the webpage content successfully to extract the webpage text content from the webpage content.
All the webpage text content extracted according to the successfully matched matching setting items is taken as the webpage text content identified in the webpage content.
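Steps S206 to S212 above can be sketched as a loop with per-item fallback. In this simplified sketch each matching setting item is represented by a regular expression with one capture group, which is purely an illustrative stand-in for the DOM-based positioning items of the embodiment; the function name and the sample patterns are hypothetical.

```python
import re

def extract(page, description_nodes):
    """description_nodes: list of dicts mapping content type -> regex with one group.

    The first dict plays the role of the first matching setting description node.
    """
    first, fallbacks = description_nodes[0], description_nodes[1:]
    result = {}
    for content_type, pattern in first.items():
        # S206: try the item from the first node; S210: then the corresponding
        # items from the other description nodes, in order.
        candidates = [pattern] + [n[content_type] for n in fallbacks if content_type in n]
        for candidate in candidates:
            match = re.search(candidate, page, re.S)
            if match:
                # S208: the result is the text extracted by the item that matched.
                result[content_type] = match.group(1).strip()
                break
    # S212: everything extracted by successful items is the page's text content.
    return result

# Hypothetical settings and page: the site changed its body container from
# id="content" to class="article", so only the fallback item matches the body.
nodes = [
    {"title": r"<h1>(.*?)</h1>", "body": r'<div id="content">(.*?)</div>'},
    {"body": r'<div class="article">(.*?)</div>'},
]
html = '<h1>Chapter 1</h1><div class="article">It was a dark night.</div>'
extracted = extract(html, nodes)
```

Content types whose items all fail are simply absent from the result, which leaves the caller free to decide how to handle a partial match.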
After step S200, the above method also includes: according to a received update instruction, updating the website nodes, webpage nodes, matching setting description nodes and/or the matching setting items in the matching setting description nodes of the matching settings file.
In the above step S206, matching the webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully includes:
When multiple downloaded webpage contents exist on the browser side, allocating one thread to each webpage content and, in each allocated thread, matching the corresponding webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully; and/or allocating multiple threads to one webpage content on the browser side and matching the webpage content against different webpage text content matching settings in different threads until the webpage content is matched successfully.
Further, in step S206, because webpage content is usually described in HTML, this embodiment can parse the downloaded webpage content layer by layer to obtain the DOM structure of the webpage content, and then match the webpage content against the webpage text content matching settings one by one according to that DOM structure.
Step S200 also includes: receiving a selection instruction, sent by the user, that selects webpage text content matching settings; establishing a matching settings file according to the selection instruction and saving the webpage text content matching settings named in the selection instruction in the established file; and uploading the matching settings file to a server, where it is stored in the user data of that user on the server side.
Before step S204, the above method also includes: upon detecting the document-complete event indicating that the browser has finished loading the document, starting the operation of matching the webpage content against the webpage text content matching settings one by one.
For the specific way each step of this embodiment is performed, refer to the related content of the apparatus embodiments of the present invention.
In summary, the embodiments of the present invention establish multiple webpage text content matching settings on the browser side and match the same webpage content against those multiple settings. When the webpage content changes, a webpage text content matching setting that matches the changed webpage can be found among the multiple settings, so that the successfully matched setting can be used to extract the webpage text content. Furthermore, this scheme avoids having to generate a new matching rule file and configure it in the browser whenever the webpage content changes, which simplifies the matching operation, reduces workload and improves efficiency.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Moreover, the present invention is not directed at any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the method of this disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for extracting webpage text content according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Disclosed herein are the following:
A1. A method for extracting webpage text content, including: presetting at least one webpage text content matching setting on the browser side; downloading webpage content on the browser side; matching the webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully; and using the webpage text content matching setting that matched the webpage content successfully to extract the webpage text content from the webpage content.
A2. The method according to A1, wherein presetting at least one webpage text content matching setting on the browser side includes: establishing a matching settings file and saving the at least one webpage text content matching setting in the matching settings file; wherein the matching settings file includes at least one website node, each website node includes at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings include different matching setting items for the same type of text content.
A3. The method according to A2, wherein matching the webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully includes: looking up, in the matching settings file, the website node and webpage node corresponding to the webpage content; under the webpage node found, matching the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node; for a matching setting item that matches successfully, setting the matching result to the webpage text content extracted with that matching setting item; for a matching setting item that fails to match, looking up, in the matching setting description nodes of the webpage node other than the first one, the matching setting items corresponding to the failed item, and matching the items found against the webpage content until one of them matches successfully, then setting the matching result to the webpage text content extracted according to that matching setting item.
A4. The method according to A3, wherein using the webpage text content matching setting that matched the webpage content successfully to extract the webpage text content from the webpage content includes: taking all the webpage text content extracted according to the successfully matched matching setting items as the webpage text content identified in the webpage content.
A5. The method according to A2, wherein establishing a matching settings file and saving the at least one webpage text content matching setting in the matching settings file includes: establishing a website node for each type of website; under a website node, establishing a webpage node for each type of webpage of the website corresponding to that website node; and establishing the matching setting items in the matching setting description nodes of each webpage node according to the content of the webpage, wherein in the first matching setting description node of a webpage node at least one matching setting item is established for each type of text content in the webpage corresponding to that webpage node, and, for the same type of text content in the webpage, the matching setting item established in the first matching setting description node differs from the matching setting items established in the other matching setting description nodes of that webpage node.
A6. The method according to A3, wherein a download mode attribute and an element filter attribute are set in the webpage node, the filter types indicated by the element filter attribute including one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects and filtering embedded content; and wherein, before the step of matching, under the webpage node found, the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node, the method further includes: judging whether the value of the download mode attribute in the webpage node found is a predetermined value; if so, filtering the content of the webpage according to the filter types indicated by the element filter attribute, and then, under the webpage node found, matching the filtered webpage content in turn against the matching setting items in the first matching setting description node of that webpage node; if not, downloading the webpage content directly in the browser.
A7. The method according to A1, wherein the webpage text content matching setting includes a webpage URL matching setting item established for the uniform resource locator (URL) of the webpage content, the webpage URL matching setting item containing a match attribute setting item, and the match attribute setting item includes: the webpage URL begins with predetermined content; and/or the webpage URL contains predetermined content, where predetermined positions in that content may hold any character; and/or the webpage URL does not contain predetermined content, where that predetermined content may include any character.
A8. The method according to A7, wherein the webpage URL matching setting item also includes a banner attribute setting, a banner extraction attribute setting and a conversion attribute setting; the banner attribute setting includes: taking the characters at a predetermined position in the URL of the webpage as the banner of the webpage content; the banner extraction attribute setting includes: selecting, as the banner, the characters at a predetermined position in the banner obtained by matching according to the banner attribute setting; and the conversion attribute setting includes: deriving the URL of the webpage from the known banner of the webpage content and the composition format of the URL.
A9. The method according to A7, wherein the webpage URL matching setting item also includes a webpage title extraction attribute setting, which includes: extracting the content before a predetermined character in the webpage content as the title.
A10. The method according to A5, wherein establishing, in the first matching setting description node of a webpage node, at least one matching setting item for each type of text content in the webpage corresponding to that webpage node includes: in the first matching setting description node, establishing at least one matching setting item for the HyperText Markup Language (HTML) element, in the webpage content, of each type of text content in the webpage; the matching setting items established for HTML elements include a primary positioning matching setting item, which at least includes: a base-point lookup setting item, which indicates the manner of base-point lookup, the manner including lookup by ID, lookup by name, lookup by class name, lookup by content and lookup by expression; and/or an ID positioning setting item, which locates the element matching the ID of an HTML element; and/or a name positioning setting item, which locates the element matching the name of an HTML element; and/or a class-name positioning setting item, which locates the element matching the class name of an HTML element; and/or a content positioning setting item, which locates the element matching the content of an HTML element; and/or an expression positioning setting item, which locates the element matching an expression in an HTML element; and/or a tag setting item, which indicates the type and/or attribute of the located element when the ID, name, class-name, content or expression positioning setting item is used to locate an element.
A11. The method according to A10, wherein the matching setting items established for HTML elements also include a secondary positioning matching setting item, which at least includes: a parent query setting item, which specifies how to find the parent element of the element located by the primary positioning matching setting item; or a child query setting item, which specifies how to find the child elements of the element located by the primary positioning matching setting item; or, when the parent query setting item and the child query setting item are both present, first finding the parent element of the element located by the primary positioning matching setting item according to the parent query setting item, then finding the child elements of that parent element according to the child query setting item.
A12. The method according to A10, wherein the matching setting items established for HTML elements also include an element deletion matching setting item, which at least includes: deleting predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
A13. The method according to A2, wherein, after establishing the matching settings file, the method also includes: according to a received update instruction, updating the website nodes, webpage nodes, matching setting description nodes and/or the matching setting items in the matching setting description nodes of the matching settings file.
A14. The method according to A1, wherein matching the webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully includes: when multiple downloaded webpage contents exist on the browser side, allocating one thread to each webpage content and, in each allocated thread, matching the corresponding webpage content against the webpage text content matching settings one by one until the webpage content is matched successfully; and/or allocating multiple threads to one webpage content on the browser side and matching the webpage content against different webpage text content matching settings in different threads until the webpage content is matched successfully.
A15. The method according to A2, wherein establishing a matching settings file and saving the at least one webpage text content matching setting in the matching settings file includes: webpage text content coupling arranges chooses instruction for 
choosing of receiving that user sends;Choose instruction and set up coupling according to described file is set, and the described webpage text content chosen in instruction coupling is arranged be saved in the coupling set up and arrange in file;Described coupling arranges files passe to server and be stored in the user data of user described in server side.A16, method according to A1, it is characterized in that, described web page contents mated respectively before setting mates with described webpage text content described, described method also includes: when the file monitoring instruction browser loaded completes event, starts and described described web page contents mate with described webpage text content setting respectively carry out the operation mated.A17, method according to A1, it is characterised in that described described web page contents is mated with described webpage text content respectively setting carry out coupling and include: to the web page contents layering analysis downloaded to, obtain the DOM Document Object Model DOM structure of this web page contents;According to the DOM structure of described web page contents, web page contents is mated setting with described webpage text content respectively and mates.
Also disclosed herein is B18, a device for extracting web page text content, comprising: a matching setting configuration unit, adapted to preset at least one web page text content matching setting on the browser side; a download unit, adapted to download web page content on the browser side; a matching unit, adapted to match the web page content against each of the web page text content matching settings in turn until the web page content is matched successfully; and an extraction unit, adapted to extract the web page text content from the web page content by using the web page text content matching setting that successfully matched the web page content.
B19. The device according to B18, wherein the matching setting configuration unit is adapted to create a matching setting file and save the at least one web page text content matching setting in the matching setting file; wherein the matching setting file comprises at least one website node, each website node comprises at least one web page node, at least some of the web page nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one web page text content matching setting, and at least two of the web page text content matching settings respectively comprise different matching setting items for the same type of text content.
B20. The device according to B19, wherein the matching unit is adapted to: look up, in the matching setting file, the website node and the web page node corresponding to the web page content; under the web page node found, match the web page content in sequence against the matching setting items in the first matching setting description node of that web page node; for each matching setting item that matches successfully, set the matching result to the web page text content extracted with that matching setting item; and, for each matching setting item that fails to match, look up, in the matching setting description nodes of that web page node other than the first matching setting description node, the matching setting item corresponding to the failed matching setting item, and match the matching setting item found against the web page content until a matching setting item found matches the web page content successfully, then set the matching result to the web page text content extracted according to that matching setting item.
B21. The device according to B20, wherein the extraction unit is adapted to take all of the web page text content extracted according to the successfully matched matching setting items as the identified web page text content of the web page content.
B22. The device according to B19, wherein the matching setting configuration unit is adapted to: create a website node for each type of website; under a website node, create a web page node for each type of web page of the website corresponding to that website node; and create, according to the content of the web page, the matching setting items in the matching setting description nodes of each web page node, wherein, in the first matching setting description node of a web page node, at least one matching setting item is created for each type of text content in the web page corresponding to that web page node; and, for text content of the same type in the web page, the matching setting item created in the first matching setting description node differs from the matching setting items created in the other matching setting description nodes of that web page node.
B23. The device according to B20, wherein the matching setting configuration unit is further adapted to provide the web page node with a download mode attribute and an element filter attribute, the filter types indicated by the element filter attribute comprising one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content; the device further comprises a loading control unit and a filter unit; the loading control unit is adapted to determine, before the web page content is matched in sequence, under the web page node found, against the matching setting items in the first matching setting description node of that web page node, whether the value of the download mode attribute in the web page node found is a predetermined value; if so, to start the filter unit, after which, under the web page node found, the filtered web page content is matched in sequence against the matching setting items in the first matching setting description node of that web page node; if not, to download the web page content directly in the browser; and the filter unit is adapted to filter the content of the web page according to the filter types indicated by the element filter attribute.
B24. The device according to B18, wherein the web page text content matching setting configured by the matching setting configuration unit comprises a web page URL matching setting item established for the uniform resource locator (URL) of the web page content, the web page URL matching setting item comprises a match attribute setting item, and the match attribute setting item comprises: the web page URL begins with predetermined content; and/or the web page URL contains predetermined content, a predetermined position of which holds an arbitrary character; and/or the web page URL does not contain predetermined content, the predetermined content including an arbitrary character.
B25. The device according to B24, wherein the web page URL matching setting item created by the matching setting configuration unit further comprises a page identifier attribute setting, a page identifier extraction attribute setting, and a conversion attribute setting; the page identifier attribute setting comprises: taking the characters at a predetermined position in the URL of the web page as the page identifier of the web page content; the page identifier extraction attribute setting comprises: selecting, as the page identifier, the characters at a predetermined position in the page identifier obtained by matching according to the page identifier attribute setting; the conversion attribute setting comprises: obtaining the URL of the web page by conversion from the known page identifier of the web page content and the composition format of the URL.
B26. The device according to B24, wherein the web page URL matching setting item created by the matching setting configuration unit further comprises a web page title extraction attribute setting, and the web page title extraction attribute setting comprises: extracting the content that precedes a predetermined character in the web page content as the title.
B27. The device according to B22, wherein the matching setting configuration unit is further adapted to create, in the first matching setting description node, at least one matching setting item for the Hypertext Markup Language (HTML) element, in the web page content, of each type of text content in the web page; the matching setting items created for HTML elements comprise a primary positioning matching setting item, and the primary positioning matching setting item comprises at least: a base point lookup setting item, indicating the manner of base point lookup, the manner comprising lookup by identifier, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or an identifier positioning setting item, locating the element that matches the identifier of the HTML element; and/or a name positioning setting item, locating the element that matches the name of the HTML element; and/or a class name positioning setting item, locating the element that matches the class name of the HTML element; and/or a content positioning setting item, locating the element that matches the content of the HTML element; and/or an expression positioning setting item, locating the element that matches the expression in the HTML element; and/or a tag setting item, indicating the type and/or attribute of the element to be located when the identifier positioning setting item, name positioning setting item, class name positioning setting item, content positioning setting item, or expression positioning setting item is used to locate an element.
B28. The device according to B27, wherein the matching setting items created for HTML elements by the matching setting configuration unit further comprise a secondary positioning matching setting item, and the secondary positioning matching setting item comprises at least one of the following: a parent query setting item, setting the manner of finding the parent element of the element located by the primary positioning matching setting item; or a child query setting item, setting the manner of finding a child element of the element located by the primary positioning matching setting item; or, when the parent query setting item and the child query setting item are both present, first finding, according to the parent query setting item, the parent element of the element located by the primary positioning matching setting item, and then finding, according to the child query setting item, a child element of that parent element.
B29. The device according to B27, wherein the matching setting items created for HTML elements by the matching setting configuration unit further comprise an element deletion matching setting item, and the element deletion matching setting item comprises at least: deleting predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
B30. The device according to B19, wherein the device further comprises a matching setting update unit, adapted to update, after the matching setting file is created and according to a received update instruction, the website nodes, web page nodes, matching setting description nodes, and/or matching setting items in the matching setting description nodes of the matching setting file.
B31. The device according to B18, further comprising a multi-thread control unit, wherein the multi-thread control unit is adapted to, when a plurality of downloaded web page contents exist on the browser side, allocate one thread to each web page content and control the matching unit to match, in the allocated thread, the corresponding web page content against each of the web page text content matching settings in turn until the web page content is matched successfully; and/or the multi-thread control unit is adapted to allocate a plurality of threads to one web page content on the browser side and control the matching unit to match, in different threads, the web page content against different web page text content matching settings until the web page content is matched successfully.
B32. The device according to B19, wherein the device comprises an input unit and an upload unit; the input unit is adapted to receive a selection instruction, sent by a user, for selecting web page text content matching settings; the matching setting configuration unit is further adapted to create the matching setting file according to the selection instruction and to save the web page text content matching settings named in the selection instruction in the created matching setting file; and the upload unit is adapted to upload the matching setting file to a server, where it is stored in the user data of the user on the server side.
B33. The device according to B18, wherein the device further comprises a start control unit, adapted to start, upon detecting an event indicating that the browser has finished loading the document, the matching unit to perform the operation of matching the web page content against each of the web page text content matching settings.
B34. The device according to B18, wherein the matching unit is further adapted to parse the downloaded web page content hierarchically to obtain the Document Object Model (DOM) structure of the web page content, and to match the web page content against each of the web page text content matching settings according to the DOM structure of the web page content.

Claims (32)

1. A method for extracting web page text content, comprising:
Presetting at least one web page text content matching setting on the browser side, each web page text content matching setting comprising one or more matching setting items established according to the text content of a web page;
Downloading web page content on the browser side;
Matching the web page content against each of the web page text content matching settings in turn until the web page content is matched successfully;
Extracting the web page text content from the web page content by using the web page text content matching setting that successfully matched the web page content;
Wherein presetting at least one web page text content matching setting on the browser side comprises:
Creating a matching setting file and saving the at least one web page text content matching setting in the matching setting file;
Wherein the matching setting file comprises at least one website node, each website node comprises at least one web page node, at least some of the web page nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one web page text content matching setting, and at least two of the web page text content matching settings respectively comprise different matching setting items for the same type of text content.
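The node hierarchy recited in claim 1 can be pictured with a few Python dataclasses. This is only an illustrative sketch: the class names, field names, `locator` strings, and the `news.example.com` site are hypothetical, since the claim does not fix a file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SettingItem:
    text_type: str  # kind of text targeted, e.g. "title" (hypothetical label)
    locator: str    # opaque description of how to find it (hypothetical)


@dataclass
class DescriptionNode:
    # One description node corresponds to one text content matching setting.
    items: List[SettingItem] = field(default_factory=list)


@dataclass
class PageNode:
    url_prefix: str
    descriptions: List[DescriptionNode] = field(default_factory=list)


@dataclass
class SiteNode:
    host: str
    pages: List[PageNode] = field(default_factory=list)


# A page node with two description nodes that hold different setting
# items for the same type of text ("title").
config = [
    SiteNode("news.example.com", [
        PageNode("http://news.example.com/article/", [
            DescriptionNode([SettingItem("title", "id:headline")]),
            DescriptionNode([SettingItem("title", "class:post-title")]),
        ]),
    ]),
]


def alternatives(page: PageNode, text_type: str) -> List[str]:
    """List every locator configured for one text type, in node order."""
    return [item.locator
            for node in page.descriptions
            for item in node.items
            if item.text_type == text_type]
```

Keeping the alternatives in separate description nodes, rather than in one flat list, preserves the distinction the claim draws between the first description node and the remaining ones.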
2. The method according to claim 1, wherein matching the web page content against each of the web page text content matching settings in turn until the web page content is matched successfully comprises:
Looking up, in the matching setting file, the website node and the web page node corresponding to the web page content;
Under the web page node found, matching the web page content in sequence against the matching setting items in the first matching setting description node of that web page node;
For each matching setting item that matches successfully, setting the matching result to the web page text content extracted with that matching setting item;
For each matching setting item that fails to match, looking up, in the matching setting description nodes of that web page node other than the first matching setting description node, the matching setting item corresponding to the failed matching setting item, and matching the matching setting item found against the web page content until a matching setting item found matches the web page content successfully, then setting the matching result to the web page text content extracted according to that matching setting item.
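The fallback order in claim 2 can be sketched as follows. The sketch assumes each matching setting item is a callable that returns the extracted text or `None` on failure, and that a description node is a dictionary keyed by text type; both representations, and the regex-based helper `by_regex`, are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Optional[str]]
DescriptionNode = Dict[str, Extractor]  # text type -> matching setting item


def by_regex(pattern: str) -> Extractor:
    """Build a setting item that extracts the first capture group."""
    def extract(content: str) -> Optional[str]:
        found = re.search(pattern, content)
        return found.group(1) if found else None
    return extract


def match_page(content: str, nodes: List[DescriptionNode]) -> Dict[str, str]:
    first, fallbacks = nodes[0], nodes[1:]
    result: Dict[str, str] = {}
    for text_type, item in first.items():
        extracted = item(content)
        if extracted is None:
            # The item in the first node failed: try the corresponding
            # item in each of the other description nodes, in order.
            for node in fallbacks:
                alternative = node.get(text_type)
                if alternative is not None:
                    extracted = alternative(content)
                    if extracted is not None:
                        break
        if extracted is not None:
            result[text_type] = extracted
    return result


nodes = [
    {"title": by_regex(r'<h1 id="headline">(.*?)</h1>'),
     "author": by_regex(r'<span class="author">(.*?)</span>')},
    {"title": by_regex(r'<h2 class="post-title">(.*?)</h2>')},
]
page = '<h2 class="post-title">Hello</h2><span class="author">Ann</span>'
```

Here the title item of the first node fails on `page`, so the title is taken from the second node, while the author item succeeds directly.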
3. The method according to claim 2, wherein extracting the web page text content from the web page content by using the web page text content matching setting that successfully matched the web page content comprises:
Taking all of the web page text content extracted according to the successfully matched matching setting items as the identified web page text content of the web page content.
4. The method according to claim 1, wherein creating a matching setting file and saving the at least one web page text content matching setting in the matching setting file comprises:
Creating a website node for each type of website;
Under a website node, creating a web page node for each type of web page of the website corresponding to that website node;
Creating, according to the content of the web page, the matching setting items in the matching setting description nodes of each web page node, wherein, in the first matching setting description node of a web page node, at least one matching setting item is created for each type of text content in the web page corresponding to that web page node; and
For text content of the same type in the web page, the matching setting item created in the first matching setting description node differs from the matching setting items created in the other matching setting description nodes of that web page node.
5. The method according to claim 2, wherein the web page node is provided with a download mode attribute and an element filter attribute, and the filter types indicated by the element filter attribute comprise one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content,
Before the step of matching, under the web page node found, the web page content in sequence against the matching setting items in the first matching setting description node of that web page node, the method further comprises:
Determining whether the value of the download mode attribute in the web page node found is a predetermined value; if so, filtering the content of the web page according to the filter types indicated by the element filter attribute, and then, under the web page node found, matching the filtered web page content in sequence against the matching setting items in the first matching setting description node of that web page node; if not, downloading the web page content directly in the browser.
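The download mode check and element filtering of claim 5 might look roughly like the sketch below. It assumes the page node is a dictionary with hypothetical keys `download_mode` and `element_filter`, uses `"filtered"` as the hypothetical predetermined value, and strips markup with regular expressions, which is a crude stand-in for filtering inside a real browser engine.

```python
import re

# Hypothetical mapping from filter type to the markup it removes.
FILTERS = {
    "image": r"<img\b[^>]*>",
    "css": r"<style\b.*?</style>|<link\b[^>]*>",  # drops every <link>, not only stylesheets
    "script": r"<script\b.*?</script>",
    "frame": r"<iframe\b.*?</iframe>",
    "object": r"<object\b.*?</object>",
    "embed": r"<embed\b[^>]*>",
}


def prepare(content: str, page_node: dict) -> str:
    """Return the content that matching should run against."""
    if page_node.get("download_mode") != "filtered":
        # Not the predetermined value: use the page as downloaded.
        return content
    for kind in page_node.get("element_filter", []):
        content = re.sub(FILTERS[kind], "", content,
                         flags=re.IGNORECASE | re.DOTALL)
    return content
```

Filtering before matching keeps scripts, images, and similar elements out of the content that the matching setting items are run against.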
6. The method according to claim 1, wherein the web page text content matching setting comprises a web page URL matching setting item established for the uniform resource locator (URL) of the web page content,
The web page URL matching setting item comprises a match attribute setting item, and the match attribute setting item comprises:
The web page URL begins with predetermined content; and/or,
The web page URL contains predetermined content, a predetermined position of which holds an arbitrary character; and/or,
The web page URL does not contain predetermined content, the predetermined content including an arbitrary character.
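The three URL conditions of claim 6 can be combined in a small predicate. In this sketch the character `?` is assumed to mark a position that accepts any single character, which means a literal `?` cannot be expressed; the function and parameter names are hypothetical.

```python
import re
from typing import Optional


def to_regex(pattern: str) -> str:
    """Translate a pattern where '?' means 'any one character'."""
    return "".join("." if ch == "?" else re.escape(ch) for ch in pattern)


def url_matches(url: str,
                starts_with: Optional[str] = None,
                contains: Optional[str] = None,
                excludes: Optional[str] = None) -> bool:
    if starts_with is not None and not url.startswith(starts_with):
        return False
    if contains is not None and not re.search(to_regex(contains), url):
        return False
    if excludes is not None and re.search(to_regex(excludes), url):
        return False
    return True
```

For example, `url_matches(url, starts_with="http://example.com/", contains="/item/?????.html")` accepts item pages whose file name is exactly five characters long.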
7. The method according to claim 6, wherein the web page URL matching setting item further comprises: a page identifier attribute setting, a page identifier extraction attribute setting, and a conversion attribute setting,
The page identifier attribute setting comprises: taking the characters at a predetermined position in the URL of the web page as the page identifier of the web page content;
The page identifier extraction attribute setting comprises: selecting, as the page identifier, the characters at a predetermined position in the page identifier obtained by matching according to the page identifier attribute setting;
The conversion attribute setting comprises: obtaining the URL of the web page by conversion from the known page identifier of the web page content and the composition format of the URL.
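A minimal sketch of the three settings in claim 7, assuming that "predetermined position" means a character range and that the URL composition format is a template with an `{id}` placeholder. Both assumptions, and all names below, are hypothetical.

```python
def page_identifier(url: str, start: int, end: int) -> str:
    """Page identifier attribute: characters at a fixed position in the URL."""
    return url[start:end]


def refine_identifier(identifier: str, start: int, end: int) -> str:
    """Identifier extraction attribute: keep part of the raw identifier."""
    return identifier[start:end]


def identifier_to_url(identifier: str, url_format: str) -> str:
    """Conversion attribute: rebuild the URL from identifier and format."""
    return url_format.format(id=identifier)


url = "http://example.com/item/12345.html"
prefix = "http://example.com/item/"
raw = page_identifier(url, len(prefix), len(url))   # characters after the prefix
short = refine_identifier(raw, 0, 5)                # first five characters
rebuilt = identifier_to_url(short, "http://example.com/item/{id}.html")
```

The conversion step is the inverse of the first two: given an identifier and the known layout of the site's URLs, it recovers the address of the page.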
8. The method according to claim 6, wherein the web page URL matching setting item further comprises: a web page title extraction attribute setting,
The web page title extraction attribute setting comprises: extracting the content that precedes a predetermined character in the web page content as the title.
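As a small illustration of the title extraction in claim 8, the sketch below cuts a string at the first occurrence of a separator. The function name and the choice of `_` as the predetermined character are hypothetical.

```python
def extract_title(text: str, separator: str) -> str:
    """Keep only the content before the first occurrence of the separator."""
    # str.partition returns the whole string as the first part when the
    # separator is absent, so such text is returned unchanged.
    return text.partition(separator)[0]
```

Applied to a string such as `"Breaking news_Example Site"` with `"_"` as the separator, this keeps the part before the site name.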
9. The method according to claim 4, wherein creating, in the first matching setting description node of a web page node, at least one matching setting item for each type of text content in the web page corresponding to that web page node comprises:
In the first matching setting description node, creating at least one matching setting item for the Hypertext Markup Language (HTML) element, in the web page content, of each type of text content in the web page;
The matching setting items created for HTML elements comprise a primary positioning matching setting item, and the primary positioning matching setting item comprises at least:
A base point lookup setting item: indicating the manner of base point lookup, the manner comprising lookup by identifier, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or,
An identifier positioning setting item: locating the element that matches the identifier of the HTML element; and/or,
A name positioning setting item: locating the element that matches the name of the HTML element; and/or,
A class name positioning setting item: locating the element that matches the class name of the HTML element; and/or,
A content positioning setting item: locating the element that matches the content of the HTML element; and/or,
An expression positioning setting item: locating the element that matches the expression in the HTML element; and/or,
A tag setting item: indicating the type and/or attribute of the element to be located when the identifier positioning setting item, name positioning setting item, class name positioning setting item, content positioning setting item, or expression positioning setting item is used to locate an element.
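The five lookup manners of claim 9 can be illustrated with Python's `xml.etree.ElementTree`, under the assumptions that the page is well-formed XHTML and that an "expression" is one of the limited XPath expressions ElementTree supports. The `locate` function, its `mode` strings, and the optional `tag` filter (standing in for the tag setting item) are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def locate(root: ET.Element, mode: str, value: str,
           tag: Optional[str] = None) -> List[ET.Element]:
    """Find elements by identifier, name, class name, content, or expression."""
    if mode == "expression":
        return root.findall(value)
    tests = {
        "id": lambda e: e.get("id") == value,
        "name": lambda e: e.get("name") == value,
        "class": lambda e: value in e.get("class", "").split(),
        "content": lambda e: value in (e.text or ""),
    }
    test = tests[mode]
    # root.iter(tag) restricts the search to one element type when tag is given.
    return [e for e in root.iter(tag) if test(e)]


doc = ET.fromstring(
    '<html><body>'
    '<h1 id="headline">Hello</h1>'
    '<div class="post content"><p name="lead">First paragraph</p></div>'
    '</body></html>'
)
```

For instance, `locate(doc, "class", "content")` returns the `div`, and `locate(doc, "name", "lead", tag="p")` returns the lead paragraph.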
10. The method according to claim 9, wherein the matching setting items created for HTML elements further comprise: a secondary positioning matching setting item, and the secondary positioning matching setting item comprises at least:
A parent query setting item: setting the manner of finding the parent element of the element located by the primary positioning matching setting item; or,
A child query setting item: setting the manner of finding a child element of the element located by the primary positioning matching setting item; or,
When the parent query setting item and the child query setting item are both present, first finding, according to the parent query setting item, the parent element of the element located by the primary positioning matching setting item, and then finding, according to the child query setting item, a child element of that parent element.
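The parent-then-child order of claim 10 can be sketched with `xml.etree.ElementTree`, which has no parent pointers, so a parent map is built first. The sketch assumes that the parent query is a number of levels to climb and that the child query is an ElementTree path; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def parent_map(root: ET.Element) -> Dict[ET.Element, ET.Element]:
    """Map every element to its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def secondary_locate(root: ET.Element, element: ET.Element,
                     parent_levels: int = 0,
                     child_path: Optional[str] = None) -> Optional[ET.Element]:
    parents = parent_map(root)
    # Parent query first: climb from the element found by primary positioning.
    for _ in range(parent_levels):
        element = parents[element]  # raises KeyError above the root
    # Child query second: search downward from the parent just reached.
    if child_path is not None:
        return element.find(child_path)
    return element


doc = ET.fromstring(
    '<div><h1 id="headline">Title</h1><p class="text">Body text</p></div>'
)
anchor = doc.find(".//h1[@id='headline']")
```

This pattern is useful when the wanted text has no stable identifier of its own but sits next to an element that does: locate the stable element, step up to the shared parent, then step down to the sibling.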
11. The method according to claim 9, wherein the matching setting items created for HTML elements further comprise: an element deletion matching setting item, and the element deletion matching setting item comprises at least:
Deleting predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or
Changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
12. The method according to claim 1, wherein, after the matching setting file is created, the method further comprises:
Updating, according to a received update instruction, the website nodes, web page nodes, matching setting description nodes, and/or matching setting items in the matching setting description nodes of the matching setting file.
13. The method according to claim 1, wherein matching the web page content against each of the web page text content matching settings in turn until the web page content is matched successfully comprises:
When a plurality of downloaded web page contents exist on the browser side, allocating one thread to each web page content, and matching, in the allocated thread, the corresponding web page content against each of the web page text content matching settings in turn until the web page content is matched successfully; and/or
Allocating a plurality of threads to one web page content on the browser side, and matching, in different threads, the web page content against different web page text content matching settings until the web page content is matched successfully.
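Both threading arrangements in claim 13 can be sketched with `concurrent.futures.ThreadPoolExecutor`. For brevity a matching setting is reduced here to a hypothetical `(name, keyword)` pair that matches when the keyword occurs in the page; a real setting would carry the matching setting items described in the other claims.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

Setting = Tuple[str, str]  # (setting name, keyword that must occur in the page)


def match_first(content: str, settings: List[Setting]) -> Optional[str]:
    """Try the settings in order; return the name of the first that matches."""
    for name, keyword in settings:
        if keyword in content:
            return name
    return None


def match_all_pages(pages: List[str], settings: List[Setting]) -> List[Optional[str]]:
    """One worker thread per downloaded page."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda page: match_first(page, settings), pages))


def match_one_page(content: str, settings: List[Setting]) -> Optional[str]:
    """Several worker threads for one page, one per setting."""
    with ThreadPoolExecutor(max_workers=max(1, len(settings))) as pool:
        hits = list(pool.map(
            lambda s: s[0] if s[1] in content else None, settings))
    # This sketch waits for every thread, then keeps the first hit in
    # configuration order; it does not cancel the remaining work early.
    return next((hit for hit in hits if hit is not None), None)
```

`pool.map` returns results in input order, so the outcome does not depend on which thread happens to finish first.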
14. The method according to claim 1, wherein creating a matching setting file and saving the at least one web page text content matching setting in the matching setting file comprises:
Receiving a selection instruction, sent by a user, for selecting web page text content matching settings;
Creating the matching setting file according to the selection instruction, and saving the web page text content matching settings named in the selection instruction in the created matching setting file;
Uploading the matching setting file to a server, where it is stored in the user data of the user on the server side.
15. The method according to claim 1, wherein, before matching the web page content against each of the web page text content matching settings, the method further comprises:
Upon detecting an event indicating that the browser has finished loading the document, starting the operation of matching the web page content against each of the web page text content matching settings.
16. The method according to claim 1, wherein matching the web page content against each of the web page text content matching settings comprises:
Parsing the downloaded web page content hierarchically to obtain the Document Object Model (DOM) structure of the web page content;
Matching the web page content against each of the web page text content matching settings according to the DOM structure of the web page content.
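The hierarchical parsing step of claim 16 can be approximated with the standard library's `html.parser`. The sketch below builds a simplified tree of dictionaries rather than a full DOM; the node layout (`tag`, `attrs`, `children`, `text`) and the short list of void elements are hypothetical simplifications.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without end tags


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": [], "text": ""}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": [], "text": ""}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text is accumulated on the innermost open element.
        self.stack[-1]["text"] += data


def parse(html: str) -> dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Once the page is available as a tree, the positioning setting items can walk it level by level instead of searching raw markup.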
17. A device for extracting webpage text content, comprising:
a matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side, each webpage text content matching setting including one or more matching setting items established according to the text content of a webpage;
a download unit, adapted to download web page content on the browser side;
a matching unit, adapted to match the web page content against each of the webpage text content matching settings in turn, until the web page content is matched successfully;
an extraction unit, adapted to extract the webpage text content from the web page content by using the webpage text content matching setting that matched the web page content successfully;
wherein the matching setting configuration unit is adapted to establish a matching setting file and save the at least one webpage text content matching setting in the matching setting file; the matching setting file includes at least one website node, each website node includes at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings include different matching setting items for the same type of text content.
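The nesting of website node, webpage node and matching setting description node can be pictured with a small XML file. The claims do not fix a file format, so the element and attribute names below (`site`, `page`, `rule-set`, `item`, `by`, `value`) are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical matching setting file: one site, one page type, two description nodes.
CONFIG = """
<settings>
  <site name="example-news">
    <page name="article">
      <rule-set>
        <item type="title" by="id" value="headline"/>
        <item type="body" by="class" value="content"/>
      </rule-set>
      <rule-set>
        <item type="title" by="tag" value="h1"/>
      </rule-set>
    </page>
  </site>
</settings>
"""

def load_settings(xml_text):
    """Return {site: {page: [rule_set, ...]}}; a rule set maps content type -> (by, value)."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for site in root.findall("site"):
        pages = result.setdefault(site.get("name"), {})
        for page in site.findall("page"):
            pages[page.get("name")] = [
                {i.get("type"): (i.get("by"), i.get("value")) for i in rs.findall("item")}
                for rs in page.findall("rule-set")
            ]
    return result
```

Here both description nodes carry a different item for the same content type (`title`), which is the property the last clause of the claim requires.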
18. The device according to claim 17, characterized in that
the matching unit is adapted to: look up, in the matching setting file, the website node and webpage node corresponding to the web page content; under the webpage node found, match the web page content against the matching setting items in the first matching setting description node of that webpage node in order; for a matching setting item that matches successfully, set the matching result to the webpage text content extracted with that matching setting item; for a matching setting item that fails to match, look up, in the matching setting description nodes of that webpage node other than the first, the matching setting item corresponding to the failed item, and match the item found against the web page content, until an item found matches the web page content successfully, and set the matching result to the webpage text content extracted according to that matching setting item.
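The fallback logic of this claim can be sketched as below. Matching setting items are modelled as regular expressions with one capture group, which is an assumption chosen for brevity; the names `FIRST`, `FALLBACKS`, `apply_item` and `extract` are hypothetical.

```python
import re

# Hypothetical items for one webpage node: content type -> regex with one capture group.
FIRST = {
    "title": r'<h1 id="headline">(.*?)</h1>',
    "body": r'<div class="content">(.*?)</div>',
}
# Items in the other description nodes, tried only when the first one fails.
FALLBACKS = [{"title": r"<title>(.*?)</title>"}, {"title": r"<h1>(.*?)</h1>"}]

def apply_item(pattern, html):
    """Return the extracted text if the item matches, else None."""
    m = re.search(pattern, html, re.S)
    return m.group(1).strip() if m else None

def extract(html, first=FIRST, fallbacks=FALLBACKS):
    result = {}
    for content_type, pattern in first.items():
        value = apply_item(pattern, html)
        if value is None:
            # The first item failed: try the corresponding item in each other node.
            for rule_set in fallbacks:
                alt = rule_set.get(content_type)
                if alt is not None:
                    value = apply_item(alt, html)
                    if value is not None:
                        break
        result[content_type] = value
    return result
```

Only the failed content type falls back; items that already matched in the first description node keep their results.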
19. The device according to claim 18, characterized in that the extraction unit is adapted to take all the webpage text content extracted according to the successfully matched matching setting items as the webpage text content identified in the web page content.
20. The device according to claim 17, characterized in that the matching setting configuration unit is adapted to: establish one website node for each type of website; under a website node, establish one webpage node for each type of webpage of the website corresponding to that website node; and establish, according to the content of the webpage, the matching setting items in the matching setting description nodes of each webpage node, wherein in the first matching setting description node of a webpage node at least one matching setting item is established for each type of text content in the webpage corresponding to that webpage node; and, for the same type of text content in a webpage, the matching setting item established in the first matching setting description node differs from the matching setting items established in the other matching setting description nodes of that webpage node.
21. The device according to claim 18, characterized in that the matching setting configuration unit is further adapted to set a download mode attribute and an element filter attribute in the webpage node, the filter types indicated by the element filter attribute including one or more of: filtering pictures, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects and filtering embedded content; the device further comprises a loading control unit and a filter unit;
the loading control unit is adapted to judge, before the web page content is matched in order against the matching setting items in the first matching setting description node of the webpage node found, whether the value of the download mode attribute in that webpage node is a predetermined value; if so, to start the filter unit, after which the filtered web page content is matched in order against the matching setting items in the first matching setting description node of the webpage node found; if not, to download the web page content directly in the browser;
the filter unit is adapted to filter the content of the webpage according to the filter types indicated by the element filter attribute.
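A minimal sketch of the loading control and filter units follows, assuming that filtering means stripping the corresponding markup from the HTML text. The filter names, the regular expressions and the `"filtered"` mode value are all hypothetical; a browser-side implementation would more likely block the resource requests themselves.

```python
import re

# Hypothetical mapping from filter type to the markup it removes.
FILTERS = {
    "image": r"<img\b[^>]*>",
    "css": r"<style\b.*?</style>|<link\b[^>]*stylesheet[^>]*>",
    "script": r"<script\b.*?</script>",
    "frame": r"<iframe\b.*?</iframe>|<frame\b[^>]*>",
    "object": r"<object\b.*?</object>",
    "embed": r"<embed\b[^>]*>",
}

def filter_page(html, download_mode, filter_types, filtered_mode="filtered"):
    """Apply the listed filters only when the download mode has the predetermined value."""
    if download_mode != filtered_mode:
        return html  # normal mode: the page is loaded unchanged
    for name in filter_types:
        html = re.sub(FILTERS[name], "", html, flags=re.S | re.I)
    return html
```

Skipping images, styles and scripts reduces the amount of content that has to be loaded before matching starts.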
22. The device according to claim 17, characterized in that the webpage text content matching setting configured by the matching setting configuration unit includes a webpage URL matching setting item established for the Uniform Resource Locator (URL) of the web page content,
the webpage URL matching setting item comprising a matching attribute setting item, the matching attribute setting item including:
the webpage URL begins with predetermined content; and/or
the webpage URL contains predetermined content, where a predetermined position in that predetermined content may hold any character; and/or
the webpage URL does not contain predetermined content, where that predetermined content may include any character.
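The three matching attributes can be sketched as one predicate. The claim does not say how "any character" is written, so this sketch assumes a `*` wildcard; `wildcard_to_regex` and `url_matches` are hypothetical names.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern with '*' wildcards into a regex; all other characters are literal."""
    return ".*".join(re.escape(part) for part in pattern.split("*"))

def url_matches(url, starts_with=None, contains=None, not_contains=None):
    """Check a URL against the begins-with, contains and does-not-contain attributes."""
    if starts_with is not None and not url.startswith(starts_with):
        return False
    if contains is not None and not re.search(wildcard_to_regex(contains), url):
        return False
    if not_contains is not None and re.search(wildcard_to_regex(not_contains), url):
        return False
    return True
```

For example, `url_matches(url, starts_with="http://news.example.com/", contains="/a/*.html")` would select article pages of one hypothetical site while ignoring its index pages.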
23. The device according to claim 22, characterized in that the webpage URL matching setting item established by the matching setting configuration unit further includes a banner attribute setting item, a banner extraction attribute setting item and a conversion attribute setting item,
the banner attribute setting item including: taking the characters at a predetermined position in the URL of the webpage as the banner of the web page content;
the banner extraction attribute setting item including: selecting, as the banner, the characters at a predetermined position in the banner obtained by matching according to the banner attribute setting item;
the conversion attribute setting item including: obtaining the URL of the webpage by conversion from the known banner of the web page content and the composition format of the URL.
24. The device according to claim 22, characterized in that the webpage URL matching setting item established by the matching setting configuration unit further includes a webpage title extraction attribute setting item,
the webpage title extraction attribute setting item including: extracting the content that precedes a predetermined character in the web page content as the title.
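The title extraction rule amounts to cutting a string at a predetermined character. A minimal sketch, assuming the rule is applied to the page's title string and that the whole string is kept when the character is absent (neither detail is stated in the claim):

```python
def extract_title(page_title, delimiter):
    """Return the text before the first occurrence of delimiter, or the whole string."""
    head, sep, _ = page_title.partition(delimiter)
    return head.strip() if sep else page_title.strip()
```

With a hypothetical title such as `Chapter 12_My Novel_Example Site` and the delimiter `_`, the extracted title would be `Chapter 12`.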
25. The device according to claim 20, characterized in that the matching setting configuration unit is further adapted to establish, in the first matching setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content of the webpage in the web page content;
the matching setting items established for HTML elements include a primary locating matching setting item, the primary locating matching setting item including at least:
a base point lookup setting item: indicating the mode of base point lookup, the modes including lookup by ID, lookup by name, lookup by class name, lookup by content and lookup by expression; and/or
an ID locating setting item: locating the element that matches the ID of the HTML element; and/or
a name locating setting item: locating the element that matches the name of the HTML element; and/or
a class name locating setting item: locating the element that matches the class name of the HTML element; and/or
a content locating setting item: locating the element that matches the content of the HTML element; and/or
an expression locating setting item: locating the element that matches the expression for the HTML element;
and/or
a tag setting item: indicating the type and/or attributes of the element to be located when the ID locating setting item, name locating setting item, class name locating setting item, content locating setting item or expression locating setting item is used to locate an element.
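The five lookup modes and the tag restriction can be sketched with `xml.etree.ElementTree` on a well-formed snippet. The `locate` function and its `mode`, `value` and `tag` parameters are hypothetical, and the "expression" mode is assumed here to be the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

def locate(root, mode, value, tag=None):
    """Return the elements matched by one locating item; tag optionally restricts the type."""
    if mode == "expression":
        found = root.findall(value)  # assumption: ElementTree's XPath subset
    else:
        tests = {
            "id": lambda e: e.get("id") == value,
            "name": lambda e: e.get("name") == value,
            "class": lambda e: value in (e.get("class") or "").split(),
            "content": lambda e: value in (e.text or ""),
        }
        found = [e for e in root.iter() if tests[mode](e)]
    return [e for e in found if tag is None or e.tag == tag]
```

The `tag` argument plays the role of the tag setting item: it narrows the result to elements of one type after the base lookup has run.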
26. The device according to claim 25, characterized in that the matching setting items established by the matching setting configuration unit for HTML elements further include a secondary locating matching setting item, the secondary locating matching setting item including at least one of the following setting items:
a parent query setting item: setting the manner of looking up the parent element of the element located by the primary locating matching setting item; or
a child query setting item: setting the manner of looking up the child elements of the element located by the primary locating matching setting item; or,
when the parent query setting item and the child query setting item are both present, first looking up, according to the parent query setting item, the parent element of the element located by the primary locating matching setting item, and then looking up, according to the child query setting item, the child elements of the parent element found.
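The parent-then-child order described above can be sketched as follows. ElementTree elements carry no parent pointer, so the sketch builds a parent map first; `parent_levels` and `child_path` are hypothetical stand-ins for the parent query and child query setting items.

```python
import xml.etree.ElementTree as ET

def parent_of(root, element, levels=1):
    """Climb the given number of levels from element; return None if the tree is exhausted."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for _ in range(levels):
        element = parents.get(element)
        if element is None:
            return None
    return element

def secondary_locate(root, element, parent_levels=0, child_path=None):
    """Apply the parent query first, then the child query, as the claim orders them."""
    if parent_levels:
        element = parent_of(root, element, parent_levels)
    if element is not None and child_path:
        element = element.find(child_path)
    return element
```

This is useful when the wanted text has no stable ID of its own but sits near an element that does, such as the body paragraph next to a headline.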
27. The device according to claim 25, characterized in that the matching setting items established by the matching setting configuration unit for HTML elements further include an element deletion matching setting item, the element deletion matching setting item including at least:
deleting predetermined content from the element located by the primary locating matching setting item or the secondary locating matching setting item; and/or
changing predetermined content in the element located by the primary locating matching setting item or the secondary locating matching setting item.
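Deleting and changing predetermined content can be sketched as a post-processing step on the text of the located element. The `clean_text` helper, the use of regular expressions for deletion and of literal replacement for changes are assumptions for illustration only.

```python
import re

def clean_text(text, delete=(), replace=None):
    """Remove every pattern in delete, then apply the literal replacements in replace."""
    for pattern in delete:
        text = re.sub(pattern, "", text)
    for old, new in (replace or {}).items():
        text = text.replace(old, new)
    return text.strip()
```

A typical use would be stripping an inline advertisement marker and normalizing entities, for example `clean_text(raw, delete=[r"\[ad\]\s*"], replace={"&nbsp;": " "})`.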
28. The device according to claim 17, characterized in that the device further comprises a matching setting updating unit, adapted to update, after the matching setting file has been established and according to a received update instruction, the website nodes, webpage nodes, matching setting description nodes and/or matching setting items in the matching setting description nodes of the matching setting file.
29. The device according to claim 17, characterized in that it further comprises a multi-thread control unit,
the multi-thread control unit being adapted, when multiple web page contents have been downloaded on the browser side, to allocate one thread to each web page content and to control the matching unit to match, in the allocated thread, the corresponding web page content against each of the webpage text content matching settings in turn, until the web page content is matched successfully; and/or
the multi-thread control unit being adapted to allocate multiple threads to one web page content on the browser side and to control the matching unit to match, in the different threads, the web page content against different webpage text content matching settings respectively, until the web page content is matched successfully.
30. The device according to claim 17, characterized in that the device comprises an input unit and an uploading unit,
the input unit being adapted to receive a selection instruction, sent by a user, that selects webpage text content matching settings;
the matching setting configuration unit being further adapted to establish the matching setting file according to the selection instruction and to save the webpage text content matching settings named in the selection instruction in the established matching setting file;
the uploading unit being adapted to upload the matching setting file to a server, where it is stored in the user data of that user on the server side.
31. The device according to claim 17, characterized in that the device further comprises a start control unit, adapted to start the matching unit, when a document-complete event indicating that the browser has finished loading is detected, to perform the operation of matching the web page content against each of the webpage text content matching settings.
32. The device according to claim 17, characterized in that
the matching unit is further adapted to parse the downloaded web page content hierarchically to obtain the Document Object Model (DOM) structure of the web page content, and to match the web page content against each of the webpage text content matching settings according to the DOM structure of the web page content.
CN201210573022.8A 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted Active CN103020266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210573022.8A CN103020266B (en) 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted

Publications (2)

Publication Number Publication Date
CN103020266A CN103020266A (en) 2013-04-03
CN103020266B true CN103020266B (en) 2016-06-29

Family

ID=47968869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210573022.8A Active CN103020266B (en) 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted

Country Status (1)

Country Link
CN (1) CN103020266B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399759B (en) * 2013-06-29 2017-02-08 广州市动景计算机科技有限公司 Network content downloading method and device
CN103530336B (en) * 2013-09-30 2017-09-15 北京奇虎科技有限公司 The identification equipment and method of Invalid parameter in uniform resource position mark URL
CN103577566B (en) * 2013-10-25 2017-07-28 北京奇虎科技有限公司 A kind of web page browing content loading method and device
CN106980700B (en) * 2013-11-08 2021-04-09 北京奇虎科技有限公司 Method for searching network on browser side and browser
CN104700031B (en) * 2013-12-06 2019-12-13 腾讯科技(深圳)有限公司 Method, device and system for preventing remote code from being executed in application operation
WO2015165245A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage data processing method and device
CN104008131B (en) * 2014-04-30 2018-07-13 广州市动景计算机科技有限公司 A kind of web data processing method and processing device
CN104021172B (en) * 2014-05-30 2017-07-28 北京搜狗科技发展有限公司 Advertisement filter method and advertisement filter device
CN104317883B (en) * 2014-10-21 2017-11-21 北京国双科技有限公司 Network text processing method and processing device
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN106855859B (en) * 2015-12-08 2020-11-10 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN108009171B (en) * 2016-10-27 2020-06-30 腾讯科技(北京)有限公司 Method and device for extracting content data
CN108241680B (en) * 2016-12-26 2020-10-13 北京国双科技有限公司 Method and device for acquiring reading amount of webpage
CN107402953A (en) * 2017-05-22 2017-11-28 阿里巴巴集团控股有限公司 A kind of method for page jump and device
CN113254751B (en) * 2021-06-24 2021-09-21 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system

Also Published As

Publication number Publication date
CN103020266A (en) 2013-04-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.