CN103020266A - Method and device for extracting webpage text content - Google Patents

Method and device for extracting webpage text content

Info

Publication number
CN103020266A
Authority
CN
China
Prior art keywords
coupling
web page
setting option
webpage
page contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105730228A
Other languages
Chinese (zh)
Other versions
CN103020266B (en
Inventor
谢洲为
潘洪学
糜裕峰
任寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210573022.8A
Publication of CN103020266A
Application granted
Publication of CN103020266B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for extracting webpage text content. The method provided by the embodiment of the invention comprises the following steps: presetting at least one webpage text content matching setting at the browser side; downloading webpage content at the browser side; matching the webpage content against the webpage text content matching settings in turn until the webpage content is successfully matched; and extracting the webpage text content from the webpage content by using the webpage text content matching setting that was successfully matched with the webpage content.

Description

Method and device for extracting webpage text content
Technical field
The present invention relates to the field of network technology, and in particular to a method and a device for extracting webpage text content.
Background art
With the spread of Internet technology, the network has become one of the main channels through which people obtain information, and the text content of webpages is the main carrier of that information. However, besides text content, webpages generally also contain a large amount of useless information such as advertising pictures and non-article content, which seriously degrades the user's reading experience.
In the prior-art scheme for extracting webpage text content, after a webpage has been loaded in the browser, the content of the webpage is split, the webpage content is then located by means of a matching rule file in the browser, and the required field contents are extracted and displayed. The user thus sees the webpage after text screening and can read conveniently and with concentration.
The existing scheme for extracting webpage text content has at least the following defect:
The existing scheme sets one matching rule file for one predetermined webpage structure, and that matching rule file is only applicable to extracting webpage text content under the predetermined structure. Because Internet resources are updated quickly, webpage structures change often. The existing matching rule file then cannot extract text from the changed webpage, so a new matching rule file has to be generated and set in the browser again. Re-establishing the matching is therefore cumbersome, labor-intensive and inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a method and a device for extracting webpage text content that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an embodiment of the invention provides a method for extracting webpage text content, comprising: presetting at least one webpage text content matching setting at the browser side; downloading webpage content at the browser side; matching the webpage content against the webpage text content matching settings in turn until the webpage content is successfully matched; and extracting the webpage text content from the webpage content by using the webpage text content matching setting that was successfully matched with the webpage content.
Another embodiment of the present invention provides a device for extracting webpage text content, comprising: a matching setting configuration unit, adapted to preset at least one webpage text content matching setting at the browser side; a download unit, adapted to download webpage content at the browser side; a matching unit, adapted to match the webpage content against the webpage text content matching settings in turn until the webpage content is successfully matched; and an extraction unit, adapted to extract the webpage text content from the webpage content by using the webpage text content matching setting that was successfully matched with the webpage content.
As can be seen from the above, the embodiment of the invention creates several webpage text content matching settings at the browser side and matches the same webpage content against all of them. When the webpage content changes, a matching setting that fits the changed webpage can be found among the several webpage text content matching settings, and the successfully matched setting can then be used to extract the webpage text content. The scheme also avoids having to generate a new matching rule file and set it in the browser whenever the webpage content changes, which simplifies the matching operation, reduces the workload and improves efficiency.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Description of drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a schematic structural diagram of a device for extracting webpage text content according to one embodiment of the present invention;
Fig. 2 shows a flowchart of a method for extracting webpage text content according to another embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure is understood more thoroughly and its scope is conveyed completely to those skilled in the art.
One embodiment of the invention provides a device for extracting webpage text content, which can offer the user a convenient and focused reading service while maintaining the speed and stability of text extraction. Referring to Fig. 1, the device comprises a matching setting configuration unit 100, a download unit 101, a matching unit 102, an extraction unit 103, a loading control unit 104, a filter unit 105, a matching setting update unit 106, a multi-thread control unit 107, an input unit 108 and an upload unit 109. Each unit is described below.
The matching setting configuration unit 100 is adapted to preset at least one webpage text content matching setting at the browser side. Specifically, it is adapted to create a matching-setting file and to save the at least one webpage text content matching setting in that file. The matching-setting file contains at least one website node; each website node contains at least one webpage node; at least some webpage nodes are provided with two or more matching-setting description nodes; and each matching-setting description node corresponds to one webpage text content matching setting. A matching-setting description node may contain one or more matching setting items, and at least two webpage text content matching settings contain different matching setting items for the same type of text content.
The matching setting configuration unit 100 creates one website node for each type of website, that is, one website node corresponds to one type of website. Under a website node, it creates one webpage node for each type of webpage of the corresponding website, that is, one webpage node corresponds to one type of webpage. The matching setting items in the matching-setting description nodes of each webpage node are created according to the content of the webpage. Different webpages contain different content, so the matching setting items in the corresponding matching-setting description nodes differ as well.
A webpage node may contain several matching-setting description nodes. Because a webpage usually contains some fixed information that rarely changes and some variable information that changes easily, the matching setting configuration unit 100 designates one of the matching-setting description nodes under a webpage node as the first matching-setting description node. The first matching-setting description node contains the most complete set of matching setting items: at least one matching setting item is created for each type of text content in the webpage. In the other matching-setting description nodes, matching setting items may be created only for the variable information in the webpage, and the matching setting items created in the different description nodes of the same webpage node differ from one another.
On the one hand, this approach simplifies the structure of the webpage text content matching settings: it avoids repeated parts across different matching settings and reduces the amount of matching-setting data to be stored, which improves resource utilization. On the other hand, it avoids repeating the matching operation on identical webpage content, which improves matching efficiency.
The matching-setting file is described in detail below with reference to an example code listing.
[Code listing: example matching-setting file, reproduced only as images in the original publication (Figures BDA00002648246000041 and BDA00002648246000051)]
The nodes of the matching-setting file in the above code are described as follows:
1. <websites> root node: this node is the top-level parent node and corresponds to one matching-setting file. It consists of several website (<website>) nodes.
2. <website> node: each website node represents one supported website. One or more webpage nodes are set in a website node; for example, a book webpage node, a catalog webpage node and a chapter webpage node are set under the website node www.feiku.com. A webpage node also carries a download mode (downloadmode) attribute and an element filter (elementfilter) attribute.
3. <book> webpage node: describes the home page of a novel. Two matching-setting description nodes <profile> are set under this webpage node. The <profile> that serves as the first matching-setting description node is configured with several matching setting items: the URL (Uniform Resource Locator) matching setting item describes the URL matching and how the bookid (identifier) information is obtained; the title matching setting item describes how the title of the novel home page is obtained; the catalogurl (catalog URL) matching setting item describes the catalog URL of the novel; the lasterchapter (latest chapter) matching setting item describes the latest chapter; and the lasterchapterurl (latest chapter URL) matching setting item describes the URL of the latest chapter.
4. <catalog> webpage node: describes the catalog page of a novel. Only one matching-setting description node is set under this webpage node. It contains: the URL matching setting item, which describes the URL matching and how the bookid information is obtained; the chapterlist matching setting item, which describes the content of the catalog page; and the returnbook matching setting item, which describes the URL of the novel home page.
5. <chapter> webpage node: describes the chapter page of a novel. Two <profile> nodes are set under this webpage node. The <profile> that serves as the first matching-setting description node is configured with: the URL matching setting item, which describes the URL matching and how the bookid information is obtained; the title matching setting item, which describes how the title of the novel home page is obtained; the text matching setting item, which describes the body text of the novel; the next matching setting item, which describes the URL of the next chapter page; the prev matching setting item, which describes the URL of the previous chapter; the returncatalog (return to catalog) matching setting item, which describes the catalog page URL stored in the chapter page; and the returnbook (return to book) matching setting item, which describes the novel home page stored in the chapter page.
6. <profile> matching-setting description node: when several webpage text content matching settings are set under one webpage node, matching-setting description nodes <profile> can be configured, each <profile> corresponding to one webpage text content matching setting. A <profile> sits under a concrete webpage node, for example under the book webpage node and the chapter webpage node above, and the matching setting items are placed inside the <profile>.
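Because the original listing survives only as images, the node hierarchy described above can be illustrated with a small sketch. The XML below is hypothetical: the element names follow the node descriptions, but the `name` attribute on `<website>`, the attribute values and the item attributes are assumptions made for illustration, not the patent's actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: node names follow the description in the text,
# every attribute value below is invented for illustration.
SETTINGS_XML = """
<websites>
  <website name="www.feiku.com">
    <book downloadmode="1" elementfilter="3">
      <profile>
        <url match="^http://www.feiku.com/"/>
        <title el="1" id="booktitle"/>
      </profile>
      <profile>
        <title el="4" classname="title"/>
      </profile>
    </book>
    <catalog downloadmode="0" elementfilter="0">
      <profile>
        <chapterlist el="1" id="chapters"/>
      </profile>
    </catalog>
  </website>
</websites>
"""


def load_settings(xml_text):
    """Return {website: {page type: [profile, ...]}}.

    Each profile is a dict mapping a matching setting item name to its attributes.
    """
    root = ET.fromstring(xml_text)
    settings = {}
    for website in root.findall("website"):
        pages = {}
        for page in website:  # webpage nodes such as <book> or <catalog>
            pages[page.tag] = [
                {item.tag: dict(item.attrib) for item in profile}
                for profile in page.findall("profile")
            ]
        settings[website.get("name")] = pages
    return settings


settings = load_settings(SETTINGS_XML)
print(len(settings["www.feiku.com"]["book"]))  # number of profiles under <book>
```

In this sketch the second `<profile>` under `<book>` carries only the `title` item, mirroring the idea that later description nodes need to cover only the variable parts of a page.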
On receiving a webpage access instruction from the user, the download unit 101 downloads webpage content at the browser side; for example, the download unit 101 connects to the server and downloads from it the webpage content corresponding to the webpage access instruction.
The matching unit 102 matches the downloaded webpage content against the webpage text content matching settings in turn until the webpage content is successfully matched. Continuing with the scenario in the above code, the matching unit 102 looks up, in the matching-setting file, the website node and the webpage node that correspond to the webpage content; based on the downloaded webpage content it finds that the corresponding website node is www.feiku.com and that the corresponding webpage node is the book webpage node. Under the webpage node found, the webpage content is then matched in turn against the matching setting items in the first matching-setting description node of that webpage node. When the first matching-setting description node of the book webpage node is the first <profile> configured under the book webpage node, the webpage content is first matched against the matching setting items in that first <profile>. For a matching setting item that matches successfully, the matching result is set to the webpage text content extracted with that item; the result returned is then the extracted text content itself, or information indicating that the result is TRUE. For a matching setting item that fails to match, the result returned may be an empty string indicating that processing was not possible, or information indicating that the result is FALSE. In that case the matching setting item corresponding to the failed item is looked up in the other matching-setting description nodes of the webpage node (for example, in the second <profile> under the book webpage node), and the item found is matched against the webpage content, until an item found matches the webpage content successfully; the matching result is then set to the webpage text content extracted according to that item. In other words, for webpage content that fails to match the first matching-setting description node, as long as one <profile> matches, that <profile> can be used to extract the corresponding webpage content.
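The fallback order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a profile is modelled as a dict of item name to matcher function, a matcher returns an empty string on failure, and the `between` helper is a hypothetical stand-in for a real matching setting item.

```python
import re
from typing import Callable, Dict, List

# A matcher takes the page content and returns the extracted text, or "" on failure.
Matcher = Callable[[str], str]
Profile = Dict[str, Matcher]


def extract_with_fallback(page: str, profiles: List[Profile]) -> Dict[str, str]:
    """Try each item of the first profile; on failure, try the same-named item in later profiles."""
    results = {}
    first, others = profiles[0], profiles[1:]
    for name, matcher in first.items():
        text = matcher(page)
        if not text:
            for profile in others:
                fallback = profile.get(name)
                if fallback is None:
                    continue  # this profile has no item for the failed field
                text = fallback(page)
                if text:
                    break
        results[name] = text
    return results


def between(start: str, end: str) -> Matcher:
    """Hypothetical matcher: the text between two literal markers."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.S)

    def match(page: str) -> str:
        found = pattern.search(page)
        return found.group(1).strip() if found else ""

    return match


profiles = [
    {"title": between('<h1 id="title">', "</h1>"),
     "text": between('<div id="content">', "</div>")},
    {"title": between('<h2 class="name">', "</h2>")},  # covers only the variable part
]
page = '<h2 class="name">Chapter 1</h2><div id="content">Once upon a time.</div>'
result = extract_with_fallback(page, profiles)
```

Here the page layout has changed so that the title no longer sits in an `<h1>`; the first profile fails on `title`, the second profile succeeds, and the `text` item is never re-matched.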
Because webpage content is generally presented as HTML (Hypertext Markup Language), the matching unit 102 also has to work on the HTML elements of the webpage when matching. For example, the matching unit 102 parses the downloaded webpage content hierarchically to obtain its Document Object Model (DOM) structure, and matches the webpage content against the webpage text content matching settings according to that DOM structure, thereby extracting the webpage text content.
The extraction unit 103 is adapted to extract the webpage text content from the webpage content by using the webpage text content matching setting that was successfully matched with the webpage content. Specifically, the extraction unit 103 is adapted to take all the webpage text content extracted according to the successfully matched matching setting items as the webpage text content identified in the webpage content.
Further, in this embodiment the download of webpage content can also be controlled by the download mode (downloadmode) attribute and the element filter (elementfilter) attribute that the matching setting configuration unit 100 sets in a webpage node. To this end the device also comprises the loading control unit 104 and the filter unit 105.
The matching setting configuration unit 100 defines at least two values for the download mode attribute. For example, the value 0 indicates that the whole webpage content is downloaded into the browser in the browser's existing webpage download mode; the value 1 indicates that the webpage content is filtered by the filter unit 105 and only the content remaining after filtering is downloaded into the browser.
The matching setting configuration unit 100 defines several values for the element filter attribute, each corresponding to one filter type. For example, the value 1 means filtering pictures (img), the value 2 means filtering Cascading Style Sheets (CSS), the value 4 means filtering frames (frame), the value 8 means filtering JavaScript scripts, the value 16 means filtering objects (object), and the value 32 means filtering embedded (embed) content.
When a combination of several filter types is needed, a new attribute value can be generated by a bitwise OR of the binary representations of the above values; the new value then indicates all of the combined filter types.
The loading control unit 104 is adapted to judge, before the webpage content is matched in turn against the matching setting items in the first matching-setting description node of the webpage node found, whether the value of the download mode attribute in that webpage node is a predetermined value (for example 1). If so, it starts the filter unit 105, and the filtered webpage content is then matched in turn against the matching setting items in the first matching-setting description node of the webpage node; if not, the webpage content is downloaded directly into the browser.
The filter unit 105 is adapted to filter the content of the webpage according to the filter type indicated by the element filter attribute. For example, when the value of the element filter attribute indicates filtering pictures, the filter unit 105 filters out all pictures in the webpage content; when the value indicates filtering pictures and CSS, the filter unit 105 filters out all pictures and all CSS in the webpage content.
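The attribute values listed above behave like bit flags. A minimal sketch, assuming the numeric values given in the text (the class and function names are illustrative, not taken from the patent):

```python
from enum import IntFlag


class ElementFilter(IntFlag):
    """Filter types with the values given in the description."""
    IMG = 1
    CSS = 2
    FRAME = 4
    SCRIPT = 8
    OBJECT = 16
    EMBED = 32


def should_filter(downloadmode: int, elementfilter: int, kind: ElementFilter) -> bool:
    """True if content of the given kind is to be dropped before loading."""
    if downloadmode != 1:  # 0 means: load the whole page unfiltered
        return False
    return bool(elementfilter & kind)


# Combining filter types with a bitwise OR yields the new attribute value.
combined = ElementFilter.IMG | ElementFilter.CSS
print(int(combined))  # 3
```

With `downloadmode="1"` and `elementfilter="3"`, pictures and CSS would be filtered while scripts are still loaded.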
The main matching setting items that the matching setting configuration unit 100 configures in a matching-setting description node are described in detail below.
One. Extraction of the webpage URL
The webpage text content matching setting configured by the matching setting configuration unit 100 includes a webpage URL matching setting item created for the URL of the webpage content.
In this part, taking the url node in the above example, the webpage URL matching setting item is described in five respects: the Match setting, the Trans setting, the Bookid setting, the Booksep setting and the Tabtitle setting.
1) Match setting: the match attribute setting item
The webpage URL matching setting item contains a match attribute setting item, which may specify that:
a. The webpage URL begins with predetermined content. For example, a rule beginning with ^ indicates that the url must begin with the content following the ^.
b. The webpage URL contains predetermined content, in which a predetermined position may hold any characters. For example, a rule beginning with @ indicates that the url must contain the content following the @; that content may include the character *, which matches any characters.
c. The webpage URL does not contain predetermined content, which may likewise include a wildcard. For example, a rule beginning with ! indicates that the url must not contain the content following the !; that content may include the character *, which matches any characters.
When extracting the webpage URL, a, b and c may be required to hold simultaneously, or only one or two of them.
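The three rule kinds can be checked with a few lines of code. The sketch below rests on assumptions: rules are passed as a list of separate strings (the text does not say how several rules are written inside one match attribute), `*` is taken to match any run of characters, and the example URLs are invented.

```python
import re


def _wildcard_to_regex(pattern: str) -> str:
    # Escape the literal parts, then let "*" stand for any run of characters.
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def rule_holds(rule: str, url: str) -> bool:
    """Evaluate one rule: "^" prefix, "@" must contain, "!" must not contain."""
    prefix, body = rule[0], rule[1:]
    if prefix == "^":
        return url.startswith(body)
    if prefix == "@":
        return re.search(_wildcard_to_regex(body), url) is not None
    if prefix == "!":
        return re.search(_wildcard_to_regex(body), url) is None
    raise ValueError(f"unknown rule prefix: {prefix!r}")


def url_matches(rules, url: str) -> bool:
    """Require every rule in the list to hold for the URL."""
    return all(rule_holds(rule, url) for rule in rules)


rules = ["^http://www.example.com/", "@/book/*.html", "!/forum/"]
ok = url_matches(rules, "http://www.example.com/book/123.html")
```

Passing only one or two rules in the list corresponds to requiring only some of the conditions a, b and c.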
2) Trans setting: the transform attribute setting item
The URL of the webpage is obtained by conversion from the known identifier of the webpage content and the composition format of the URL. This operation is mainly used where a webpage node has only one matching-setting description node, that is, where only one profile exists; URL conversion is performed from the given identifiers of the novel home page, catalog page, chapter page and so on. The setting item describes the composition format of the url, so that a url is obtained simply by filling in identifiers such as bookid or chapterid, for example: trans=http://www.qidian.com/BookReader/##s,##s.aspx^^bookid^^chapterid
The above string shows the composition format of the URL: filling bookid into the first ##s and chapterid into the second ##s yields the url of a chapter page.
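The substitution can be sketched as below, reading the trans string with its spacing normalized. The function name and the identifier values are hypothetical; the only behaviour taken from the text is that the n-th `##s` receives the n-th field named after `^^`.

```python
def build_url(trans: str, ids: dict) -> str:
    """Fill each "##s" placeholder with the field named at the same position after "^^"."""
    template, *fields = trans.split("^^")
    parts = template.split("##s")
    if len(parts) - 1 != len(fields):
        raise ValueError("placeholder count does not match field count")
    url = parts[0]
    for field, tail in zip(fields, parts[1:]):
        url += str(ids[field]) + tail
    return url


TRANS = "http://www.qidian.com/BookReader/##s,##s.aspx^^bookid^^chapterid"
# The identifier values are invented for illustration.
chapter_url = build_url(TRANS, {"bookid": "1234", "chapterid": "5678"})
print(chapter_url)  # http://www.qidian.com/BookReader/1234,5678.aspx
```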
3) Bookid setting: the identifier attribute setting item
The characters at a predetermined position in the URL of the webpage are taken as the identifier of the webpage content.
This operation obtains an identifier, such as the bookid string, from the url. For example, for bookid=http://www.readnovel.com/novel/*.html, the position of the character * is the predetermined position, and the string at that position is taken as the extracted identifier, such as the bookid string.
The identifier extracted in this operation can be used for the webpage URL conversion.
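A minimal sketch of this extraction, assuming the pattern contains exactly one `*` and must cover the whole URL; the function name and the sample URLs are illustrative.

```python
import re


def extract_id(pattern: str, url: str):
    """Return the part of the URL at the "*" position, or None if the URL does not fit."""
    head, _, tail = pattern.partition("*")
    found = re.fullmatch(re.escape(head) + "(.+)" + re.escape(tail), url)
    return found.group(1) if found else None


BOOKID = "http://www.readnovel.com/novel/*.html"
book_id = extract_id(BOOKID, "http://www.readnovel.com/novel/12345.html")
print(book_id)  # 12345
```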
4) Booksep setting: the identifier extraction attribute setting item
From the identifier obtained by matching according to the identifier attribute setting item, the characters at a predetermined position are chosen as the identifier. This operation is mainly used when the identifier obtained is relatively complex and needs further extraction.
For example, with the extraction structure booksep="/:0", when the identifier bookid contains the "/" symbol, booksep can be used to obtain the purely numeric part. Here "/" is the character on which the identifier is split, ":" is the delimiter between the two fields of the setting, and "0" indicates which segment (counting from 0) is taken as the identifier bookid after the target text has been split into segments by "/".
The identifier extracted in this operation can be used for the webpage URL conversion.
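The segment selection amounts to a split followed by an index. The sketch assumes the setting has the form separator, colon, index (for example "/:0"); the function name and the sample identifier are hypothetical.

```python
def apply_booksep(booksep: str, raw_id: str) -> str:
    """Split raw_id on the separator and return the segment at the given index (from 0)."""
    separator, _, index = booksep.rpartition(":")
    return raw_id.split(separator)[int(index)]


clean_id = apply_booksep("/:0", "12345/6")
print(clean_id)  # 12345
```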
5) Tabtitle setting: the webpage title extraction attribute setting item
The content before a predetermined character in the webpage content is extracted as the title (Title) information. For example, with the extraction structure tabtitle="*-", the part before the first occurrence of "-" is taken as the title. The symbol * matches any characters.
Two. Extraction of HTML content in the webpage
In a matching-setting description node (for example, in the first matching-setting description node), the matching setting configuration unit 100 creates at least one matching setting item for each type of text content in the webpage, targeting the corresponding HTML (Hypertext Markup Language) element in the webpage content.
The HTML elements to be extracted differ between types of webpage. Taking the scenario in the above code as an example, the HTML elements to be processed include the <title> element indicating the title, the <catalogurl> element indicating the catalog url, the <lastchapter> element indicating the latest chapter, the <lastchapterurl> element indicating the latest chapter url, the <text> element indicating the body text, the <next> element indicating the next page url, the <prev> element indicating the previous page url, the <returncatalog> element indicating the return-to-catalog url, and the <returnbook> element indicating the return-to-home-page url.
The matching setting items that the matching setting configuration unit 100 creates for HTML elements include primary positioning matching setting items and secondary positioning matching setting items, described in turn below.
1) Primary positioning matching setting items
The primary positioning matching setting items include at least:
a. Base point lookup setting item el: indicates how the base point is looked up, and can be set to values such as 1, 2, 4, 8 and 16, where 1 corresponds to looking up by identifier id, 2 to looking up by name, 4 to looking up by class name classname, 8 to looking up by content value, and 16 to looking up by expression regular.
b. Identifier positioning setting item id: locates the element whose id matches.
c. Name positioning setting item name: locates the element whose name matches.
d. Class name positioning setting item classname: locates the element whose class name matches; when several elements have a matching class name, only the first is matched.
e. Content positioning setting item value: locates the element whose content (innertext) matches; when several elements match, only the first is matched.
f. Expression positioning setting item regular: locates the element that matches the expression; for example, for the expression %CUURENTURL%, the url matching this expression is located.
g. Tag setting item tag: indicates the type and/or attribute of the element located when an element is located by the identifier, name, class name, content or expression positioning setting item.
That is, tag indicates the element type and attribute for primary positioning. For example, with the structure tag="a-href", the attribute taken from the located element is href and the type of the located element is a. The tag setting item takes effect when no secondary positioning occurs; if secondary positioning does occur, tag is only responsible for checking.
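Primary positioning can be sketched on a small, well-formed document. Several points here are assumptions rather than statements from the text: the `el` values are treated as mutually exclusive modes, the regular mode (16) is left out, the `tag` setting both restricts the element type and names the attribute to return, and `xml.etree.ElementTree` stands in for the browser DOM.

```python
import xml.etree.ElementTree as ET


def locate(root, item: dict):
    """Return the attribute value or text of the first element matching the el lookup mode."""
    el = int(item["el"])
    tag, _, attr = item.get("tag", "").partition("-")
    tests = {
        1: lambda e: e.get("id") == item["id"],
        2: lambda e: e.get("name") == item["name"],
        4: lambda e: item["classname"] in (e.get("class") or "").split(),
        8: lambda e: "".join(e.itertext()).strip() == item["value"],
    }
    test = tests[el]  # mode 16 (regular) is not sketched here
    for element in root.iter(tag or None):  # first match in document order
        if test(element):
            return element.get(attr) if attr else "".join(element.itertext())
    return None


doc = ET.fromstring(
    '<div><a id="next" href="/c/2.html">Next</a>'
    '<span class="t big">Title</span></div>'
)
next_url = locate(doc, {"el": "1", "id": "next", "tag": "a-href"})
title = locate(doc, {"el": "4", "classname": "t"})
```

In this sketch `next_url` is the `href` of the element with id `next`, and `title` is the text of the first element whose class list contains `t`.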
2) secondary position matching setting option
On the basis of carrying out one-time positioning, to the result that one-time positioning obtains, can also carry out the secondary location.This secondary position matching setting option comprises:
A. the father inquires about setting option parentselect: the element that navigates to according to one-time positioning coupling setting option is set, the mode of searching father's element of this element;
B. subquery setting option childrenselect: the element that navigates to according to one-time positioning coupling setting option is set, searches the mode of the daughter element of this element;
C. when inquiring about setting option and subquery setting option, the father puts when existing simultaneously, first inquire about father's element that setting option is searched the element that one-time positioning coupling setting option navigates to according to the father, then according to the subquery setting option, from this father's element that finds, the daughter element of searching this father's element.
This embodiment also defines, in terms of element name, element attribute, and order, the concrete positioning syntax used in setting options such as parentselect, childrenselect, and tag. For example, when the value is "ul:0|li:1|a-href:0", the following positioning operations are carried out starting from the currently located element:
1. Search for the 1st (index 0 denotes the first) <ul> tag at the next level (or upper level, or current level) of the current element: under parentselect, the 1st <ul> tag at the upper level of the current element is searched for; under childrenselect, the 1st <ul> tag at the next level of the current element; under tag, the 1st <ul> tag at the current level of the current element.
2. Then search for the 2nd (index 1 denotes the second) <li> tag at the next level (or upper level, or current level) of the ul element found.
3. Then search for the 1st (index 0 denotes the first) <a> tag at the next level (or upper level, or current level) of the li element found.
4. After the a element is found, if -href is specified, the content of the href attribute of the a element is taken; if it is not specified, the element content (innertext) of the a element is taken directly.
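As a non-limiting illustration, the positioning steps above can be sketched for the childrenselect direction only. The function names parse_path and select_children are hypothetical, the sketch uses Python's xml.etree.ElementTree on well-formed markup as a stand-in for the browser's DOM, and it assumes that "next level" means direct children.

```python
import xml.etree.ElementTree as ET

def parse_path(expr):
    """Split 'ul:0|li:1|a-href:0' into (tag, attribute or None, index) steps."""
    steps = []
    for part in expr.split("|"):
        name, _, index = part.partition(":")
        tag, _, attr = name.partition("-")
        steps.append((tag, attr or None, int(index)))
    return steps

def select_children(element, expr):
    """Walk down direct children step by step (childrenselect direction only)."""
    attr = None
    for tag, attr, index in parse_path(expr):
        matches = [child for child in element if child.tag == tag]
        if index >= len(matches):
            return None  # positioning failed at this step
        element = matches[index]
    # With '-href' on the last step take that attribute; otherwise the inner text.
    return element.get(attr) if attr else "".join(element.itertext())

# Hypothetical sample markup, not taken from the specification.
page = ET.fromstring(
    "<div><ul><li><a href='/first'>First</a></li>"
    "<li><a href='/second'>Second</a></li></ul></div>"
)
result = select_children(page, "ul:0|li:1|a-href:0")
```

Here result holds the href of the link inside the second li element; dropping the -href suffix would return the link text instead.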
3) Filter settings
The matching setting options that the matching setting configuration unit 100 establishes for HTML elements also include an element deletion matching setting option, elementerase, used to erase certain child elements within the located element. The element deletion matching setting option at least includes:
Deleting predetermined content in the element located by the primary positioning matching setting options or the secondary positioning matching setting options; and/or changing predetermined content in the element located by the primary positioning matching setting options or the secondary positioning matching setting options.
For example, when elementerase="font:0|FONT:0" is set, the content between font or FONT tags within the selected content is "erased". How the content is "erased" depends on the meaning of the value after the ":" symbol: for example, the value 0 corresponds to changing the element name to div style="display:none"; the value 1 corresponds to changing the element name to an identifier; and the value 2 corresponds to deleting the element.
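As a non-limiting illustration, an elementerase rule could be applied as sketched below. The function name erase_elements is hypothetical, xml.etree.ElementTree on well-formed markup stands in for the browser's DOM, and only modes 0 and 2 are shown because the description does not detail what mode 1 produces.

```python
import xml.etree.ElementTree as ET

def erase_elements(root, expr):
    """Apply an elementerase-style rule such as 'font:0|FONT:0' in place."""
    modes = {}
    for part in expr.split("|"):
        tag, _, mode = part.partition(":")
        modes[tag] = int(mode)
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in list(root.iter()):
        mode = modes.get(element.tag)
        if mode == 0:
            # Mode 0: keep the node but rename it to a hidden div.
            element.tag = "div"
            element.set("style", "display:none")
        elif mode == 2 and element in parents:
            # Mode 2: delete the node, keeping the text that follows it.
            parent = parents[element]
            position = list(parent).index(element)
            if element.tail:
                if position > 0:
                    previous = parent[position - 1]
                    previous.tail = (previous.tail or "") + element.tail
                else:
                    parent.text = (parent.text or "") + element.tail
            parent.remove(element)
    return root

# Hypothetical sample markup, not taken from the specification.
doc = ET.fromstring("<p>keep <font>ad</font> this<b>x</b><FONT>y</FONT></p>")
erase_elements(doc, "font:2|FONT:0")
```

Tag names are compared case-sensitively in this sketch, which is why the example rule in the description lists both font and FONT.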
Further, the above apparatus also includes a matching setting update unit 106, adapted to update, after a matching settings file has been established and according to a received update instruction, the website nodes, webpage nodes, matching setting description nodes, and/or the matching setting options within the matching setting description nodes in the matching settings file. For example, when a website no longer exists on the Internet, or text extraction is no longer needed for the webpages of that website, the matching setting update unit 106 is used to delete the website node corresponding to that website, together with all related settings under that website node, from the matching settings file.
Further, the above apparatus also includes a multi-thread control unit 107. The multi-thread control unit 107 is adapted to, when a plurality of downloaded webpage contents exist on the browser side, allocate one thread to each webpage content and control the matching unit to match, in each allocated thread, the corresponding webpage content against the webpage text content matching settings until the webpage content is successfully matched; and/or the multi-thread control unit 107 is adapted to allocate a plurality of threads to one webpage content on the browser side and control the matching unit to match, in different threads, the webpage content against different webpage text content matching settings until the webpage content is successfully matched. By adopting multi-threaded processing, this solution can extract the text of one or more webpage contents more quickly, shorten the time the browser takes to load a webpage, and present the extracted webpage text content to the user in the browser quickly.
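As a non-limiting illustration of the first mode (one task per downloaded webpage content), the sketch below uses Python's concurrent.futures.ThreadPoolExecutor. The names first_successful and extract_all are hypothetical, each matcher is assumed to be a callable that returns the extracted text or None, and a thread pool reuses worker threads rather than strictly creating one thread per page.

```python
from concurrent.futures import ThreadPoolExecutor

def first_successful(content, matchers):
    """Try each matching setting in turn until one succeeds."""
    for matcher in matchers:
        result = matcher(content)
        if result is not None:
            return result
    return None

def extract_all(pages, matchers, max_workers=4):
    """Match every downloaded page in its own pool task; returns {url: text or None}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            url: pool.submit(first_successful, content, matchers)
            for url, content in pages.items()
        }
        return {url: future.result() for url, future in futures.items()}

# Hypothetical matchers and pages, for illustration only.
matchers = [
    lambda content: content[3:] if content.startswith("v1:") else None,
    lambda content: content[3:] if content.startswith("v2:") else None,
]
pages = {"a": "v1:alpha", "b": "v2:beta", "c": "other"}
results = extract_all(pages, matchers)
```

The second mode, several threads trying different matching settings on one page, could reuse the same pool by submitting one task per matcher instead of one task per page.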
The above apparatus also includes an input unit 108 and an upload unit 109. The input unit 108 is adapted to receive a selection instruction, sent by the user, that selects webpage text content matching settings; the matching setting configuration unit 100 is then further adapted to establish a matching settings file according to the selection instruction and to save the webpage text content matching settings named in the selection instruction in the established matching settings file, and the matching setting configuration unit 100 can also update the matching settings file according to an update instruction from the user. The upload unit 109 is adapted to upload the matching settings file to a server, where it is stored in the user's user data on the server side; when the matching settings file on the browser side is damaged or lost, the browser side can use the matching settings file saved on the server side to restore or update it.
Further, the above apparatus also includes a start control unit, adapted to determine, upon detecting the document-complete (DocumentComplete) event indicating that the browser has finished loading, that the extraction operation on the webpage content can now be performed, and then to start the matching unit to match the webpage content against the webpage text content matching settings.
It can be understood that one or more of the matching setting update unit 106, the multi-thread control unit 107, the input unit 108, and the upload unit 109 may be omitted in some scenarios.
As can be seen from the above, the embodiment of the present invention establishes a plurality of webpage text content matching settings on the browser side and matches the same webpage content against the plurality of webpage text content matching settings. When the webpage content changes, a webpage text content matching setting that matches the changed webpage can be found among the plurality of webpage text content matching settings, so that the successfully matched webpage text content matching setting can be used to extract the webpage text content. This solution also avoids having to generate a new matching rule file and install it in the browser whenever webpage content changes, which simplifies the matching operation, reduces the workload, and improves efficiency.
Another embodiment of the present invention also provides a client device on which a browser is installed, the browser being provided with the apparatus for extracting webpage text content described above.
The client device starts the apparatus for extracting webpage text content according to the user's webpage browsing instruction, and displays the webpage text content extracted by the apparatus to the user in the browser.
For the specific working manner of the apparatus for extracting webpage text content in the client device, refer to the relevant apparatus embodiments of the present invention; details are not repeated here.
Another embodiment of the present invention also provides a method for extracting webpage text content, which can provide the user with a convenient and focused reading service while ensuring the speed and stability of text extraction. The method includes:
S200: Preset at least one webpage text content matching setting on the browser side.
A matching settings file is established and the at least one webpage text content matching setting is saved in the matching settings file, wherein the matching settings file includes at least one website node, each website node includes at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings respectively include different matching setting options for the same type of text content.
In this embodiment, one website node is established for each type of website; under a website node, one webpage node is established for each type of webpage of the website corresponding to that website node; and the matching setting options in the matching setting description nodes of each webpage node are established according to the content of the webpage. In the first matching setting description node of a webpage node, at least one matching setting option is established for each type of text content in the webpage corresponding to that webpage node; and, for the same type of text content in the webpage, the matching setting option established in the first matching setting description node differs from the matching setting options established in the other matching setting description nodes of that webpage node. Thus, for a given webpage content, when a matching setting option in the first matching setting description node cannot match it, the webpage content can be matched against the matching setting options in the other matching setting description nodes until the match succeeds.
A webpage node includes a plurality of matching setting description nodes. Because a webpage usually contains some fixed information that seldom changes and some variable information that changes easily, one of the matching setting description nodes under the webpage node is designated as the first matching setting description node. The matching setting options in this first matching setting description node are the most comprehensive, including at least one matching setting option established for each type of text content in the webpage. In the matching setting description nodes other than the first, matching setting options may be established only for the variable information in the webpage, and these differ from the matching setting options established in the first matching setting description node.
On the one hand, this approach simplifies the structure of the webpage text content matching settings and avoids repeated parts across different matching settings, which reduces the amount of matching setting data that must be stored and thereby improves resource utilization; on the other hand, it avoids repeating matching operations on identical webpage content, which improves matching efficiency.
Further, the webpage node includes a download mode attribute and an element filter attribute. The filter types indicated by the element filter attribute include one or more of: filtering images, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content.
Before the step of matching, under the webpage node found, the webpage content in turn against the matching setting options in the first matching setting description node of that webpage node, the above method further includes:
Determining whether the value of the download mode attribute in the webpage node found is a predetermined value. If so, the content of the webpage is filtered according to the filter types indicated by the element filter attribute, and then, under the webpage node found, the filtered webpage content is matched in turn against the matching setting options in the first matching setting description node of that webpage node. If not, the webpage content is downloaded and loaded directly into the browser.
Wherein, the above webpage text content matching setting includes a webpage URL matching setting option established for the URL of the webpage content. The webpage URL matching setting option includes a match attribute setting option, which includes:
The webpage URL begins with predetermined content; and/or the webpage URL contains predetermined content, where a predetermined position within that predetermined content may hold arbitrary characters; and/or the webpage URL does not contain predetermined content, where that predetermined content may include arbitrary characters.
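As a non-limiting illustration, the three match attributes above could be checked as sketched below. The function names and the use of "*" to mark a position that may hold arbitrary characters are assumptions made for this sketch; the description does not specify a wildcard symbol.

```python
import re

def wildcard_to_regex(pattern):
    """Turn 'a*b' into a regex where '*' stands for any run of characters."""
    return ".*".join(re.escape(piece) for piece in pattern.split("*"))

def url_matches(url, begins_with=None, contains=None, excludes=None):
    """Return True only if every attribute that is supplied is satisfied."""
    if begins_with is not None and not url.startswith(begins_with):
        return False
    if contains is not None and not re.search(wildcard_to_regex(contains), url):
        return False
    if excludes is not None and re.search(wildcard_to_regex(excludes), url):
        return False
    return True

# Hypothetical URL, for illustration only.
url = "http://example.com/book/123/chapter/7.html"
matched = url_matches(
    url,
    begins_with="http://example.com/",
    contains="/book/*/chapter/",
    excludes="/index*.html",
)
```

Escaping each literal piece keeps characters such as "." and "?" in a URL from being read as regular-expression operators.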
Wherein, the above webpage URL matching setting option also includes a page identifier attribute setting, a page identifier extraction attribute setting, and a conversion attribute setting.
The page identifier attribute setting includes: using the characters at a predetermined position in the URL of the webpage as the page identifier of the webpage content. The page identifier extraction attribute setting includes: selecting, within the page identifier obtained by matching according to the page identifier attribute setting, the characters at a predetermined position as the page identifier. The conversion attribute setting includes: obtaining the URL of the webpage by conversion according to the known page identifier of the webpage content and the composition format of the URL.
Wherein, the above webpage URL matching setting option also includes a webpage title extraction attribute setting. The webpage title extraction attribute setting includes: extracting the content that precedes a predetermined character in the webpage content as the title.
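As a non-limiting illustration, the identifier, conversion, and title extraction settings could be sketched as follows. The function names, the use of a regular-expression capturing group to mark the "predetermined position", and the "{id}" placeholder in the URL format are all assumptions made for this sketch.

```python
import re

def page_identifier(url, pattern):
    """Take the characters captured by the first group of pattern as the identifier."""
    found = re.search(pattern, url)
    return found.group(1) if found else None

def url_from_identifier(identifier, url_format):
    """Rebuild a URL from a known identifier and a URL composition format."""
    return url_format.format(id=identifier)

def title_before(text, separator):
    """Keep only the content that precedes the first occurrence of separator."""
    return text.split(separator, 1)[0].strip()

# Hypothetical values, for illustration only.
identifier = page_identifier("http://example.com/book/123/7.html", r"/book/(\d+)/")
index_url = url_from_identifier(identifier, "http://example.com/book/{id}/index.html")
title = title_before("Chapter 7 - Example Novel - Example Site", " - ")
```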
Wherein, establishing, in the first matching setting description node of a webpage node, at least one matching setting option for each type of text content in the webpage corresponding to that webpage node includes:
Establishing, in the first matching setting description node, at least one matching setting option for the Hypertext Markup Language (HTML) element that carries each type of text content of the webpage in the webpage content;
The matching setting options established for HTML elements include primary positioning matching setting options, which at least include:
A base-point search setting option, indicating how the base point is searched for, the ways including search by ID, search by name, search by class name, search by content, and search by expression; and/or an ID positioning setting option, for locating the element that matches the ID of an HTML element; and/or a name positioning setting option, for locating the element that matches the name of an HTML element; and/or a class-name positioning setting option, for locating the element that matches the class name of an HTML element; and/or a content positioning setting option, for locating the element that matches the content of an HTML element; and/or an expression positioning setting option, for locating the element that matches an expression in an HTML element; and/or a tag setting option, for indicating the type and/or attribute of the located element when an element is located using the ID positioning setting option, name positioning setting option, class-name positioning setting option, content positioning setting option, or expression positioning setting option.
Wherein, the matching setting options established for HTML elements also include secondary positioning matching setting options, which at least include:
A parent query setting option, for setting how to look up the parent element of the element located by the primary positioning matching setting options; or a child query setting option, for setting how to look up the child elements of the element located by the primary positioning matching setting options; or, when the parent query setting option and the child query setting option are both present, first looking up the parent element of the element located by the primary positioning matching setting options according to the parent query setting option, and then looking up the child elements of that parent element according to the child query setting option.
Wherein, the matching setting options established for HTML elements also include an element deletion matching setting option, which at least includes: deleting predetermined content in the element located by the primary positioning matching setting options or the secondary positioning matching setting options; and/or changing predetermined content in the element located by the primary positioning matching setting options or the secondary positioning matching setting options.
S202: Download webpage content on the browser side.
S204: Look up the website node and webpage node corresponding to the webpage content in the matching settings file.
S206: Under the webpage node found, match the webpage content in turn against the matching setting options in the first matching setting description node of that webpage node, and perform step S208 or S210 respectively according to the matching result.
S208: For a matching setting option that matches successfully, set the matching result to the webpage text content extracted using that matching setting option;
S210: For a matching setting option that fails to match, look up, in the matching setting description nodes of that webpage node other than the first matching setting description node, the matching setting option corresponding to the failed one, and match the matching setting option found against the webpage content until a matching setting option found matches the webpage content successfully; then set the matching result to the webpage text content extracted according to that matching setting option.
S212: Extract the webpage text content from the webpage content using the webpage text content matching setting that successfully matched the webpage content.
The webpage text content extracted according to all of the successfully matched matching setting options is taken as the identified webpage text content of the webpage content.
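As a non-limiting illustration, the fallback flow of steps S206 to S212 can be sketched as follows. The representation is an assumption made for this sketch: a webpage node is a list of description nodes, each description node is a dict mapping a text type to a matcher callable, and the hypothetical helper regex_matcher builds matchers that return the extracted text or None.

```python
import re

def regex_matcher(pattern):
    """Build a matcher that returns the first captured group, or None on failure."""
    compiled = re.compile(pattern, re.S)
    def match(content):
        found = compiled.search(content)
        return found.group(1) if found else None
    return match

def extract_text(page_node, content):
    """Try the first description node, falling back to the others per text type."""
    first, *others = page_node
    extracted = {}
    for text_type, matcher in first.items():
        result = matcher(content)                   # S206
        if result is None:                          # S210: look for a fallback option
            for node in others:
                fallback = node.get(text_type)
                if fallback is not None:
                    result = fallback(content)
                    if result is not None:
                        break
        if result is not None:                      # S208 / S212: keep what matched
            extracted[text_type] = result
    return extracted

# Hypothetical settings: the first node covers every text type, the second only
# the part of the page assumed to change.
page_node = [
    {"title": regex_matcher(r"<h1>(.*?)</h1>"),
     "body": regex_matcher(r'<div id="content">(.*?)</div>')},
    {"body": regex_matcher(r'<div class="article">(.*?)</div>')},
]
html = '<h1>Chapter 1</h1><div class="article">Once upon a time</div>'
extracted = extract_text(page_node, html)
```

Because the second description node lists only the body option, the title is matched once and never retried, which mirrors the storage and efficiency argument made for the first matching setting description node.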
Wherein, after step S200, the above method also includes: according to a received update instruction, updating the website nodes, webpage nodes, matching setting description nodes, and/or the matching setting options within the matching setting description nodes in the matching settings file.
Wherein, in the above step S206, matching the webpage content against the webpage text content matching settings until the webpage content is successfully matched includes:
When a plurality of downloaded webpage contents exist on the browser side, allocating one thread to each webpage content and matching, in each allocated thread, the corresponding webpage content against the webpage text content matching settings until the webpage content is successfully matched; and/or allocating a plurality of threads to one webpage content on the browser side and matching, in different threads, the webpage content against different webpage text content matching settings until the webpage content is successfully matched.
Also in step S206, because webpage content is usually described in HTML, this embodiment can parse the downloaded webpage content hierarchically to obtain the DOM structure of the webpage content, and then match the webpage content against the webpage text content matching settings according to that DOM structure.
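As a non-limiting illustration, hierarchical parsing of downloaded HTML into a simple tree can be sketched with Python's html.parser.HTMLParser. The Node and TreeBuilder classes, the parse_dom function, and the short list of void tags are assumptions made for this sketch; a browser's own DOM is far more complete.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, for the sketch only).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend one level

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data

def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

# Hypothetical page, for illustration only.
root = parse_dom('<html><body><div id="content"><p>Hello</p><br><p>World</p></div></body></html>')
content_div = root.children[0].children[0].children[0]
```

Matching settings such as the ID positioning option could then be evaluated by walking this tree instead of scanning the raw HTML text.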
Wherein, step S200 also includes: receiving a selection instruction, sent by the user, that selects webpage text content matching settings; establishing a matching settings file according to the selection instruction and saving the webpage text content matching settings named in the selection instruction in the established matching settings file; and uploading the matching settings file to a server, where it is stored in the user's user data on the server side.
Wherein, before step S204, the above method also includes: upon detecting the document-complete event indicating that the browser has finished loading, starting the operation of matching the webpage content against the webpage text content matching settings.
For the specific manner in which each step of this embodiment is carried out, refer to the related content in the apparatus embodiments of the present invention.
As can be seen from the above, the embodiment of the present invention establishes a plurality of webpage text content matching settings on the browser side and matches the same webpage content against the plurality of webpage text content matching settings. When the webpage content changes, a webpage text content matching setting that matches the changed webpage can be found among the plurality of webpage text content matching settings, so that the successfully matched webpage text content matching setting can be used to extract the webpage text content. This solution also avoids having to generate a new matching rule file and install it in the browser whenever webpage content changes, which simplifies the matching operation, reduces the workload, and improves efficiency.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description above of a specific language is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for extracting webpage text content according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Disclosed herein are the following. A1. A method for extracting webpage text content, comprising: presetting at least one webpage text content matching setting on the browser side; downloading webpage content on the browser side; matching the webpage content against the webpage text content matching settings until the webpage content is successfully matched; and extracting the webpage text content from the webpage content using the webpage text content matching setting that successfully matched the webpage content.
A2. The method according to A1, wherein presetting at least one webpage text content matching setting on the browser side comprises: establishing a matching settings file and saving the at least one webpage text content matching setting in the matching settings file; wherein the matching settings file includes at least one website node, each website node includes at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings respectively include different matching setting options for the same type of text content.
A3. The method according to A2, wherein matching the webpage content against the webpage text content matching settings until the webpage content is successfully matched comprises: looking up the website node and webpage node corresponding to the webpage content in the matching settings file; under the webpage node found, matching the webpage content in turn against the matching setting options in the first matching setting description node of that webpage node; for a matching setting option that matches successfully, setting the matching result to the webpage text content extracted using that matching setting option; and for a matching setting option that fails to match, looking up, in the matching setting description nodes of that webpage node other than the first matching setting description node, the matching setting option corresponding to the failed one, matching the matching setting option found against the webpage content until a matching setting option found matches the webpage content successfully, and setting the matching result to the webpage text content extracted according to that matching setting option.
A4. The method according to A3, wherein extracting the webpage text content from the webpage content using the webpage text content matching setting that successfully matched the webpage content comprises: taking the webpage text content extracted according to all of the successfully matched matching setting options as the identified webpage text content of the webpage content.
A5. The method according to A2, wherein establishing a matching settings file and saving the at least one webpage text content matching setting in the matching settings file comprises: establishing one website node for each type of website; under a website node, establishing one webpage node for each type of webpage of the website corresponding to that website node; and establishing the matching setting options in the matching setting description nodes of each webpage node according to the content of the webpage, wherein, in the first matching setting description node of a webpage node, at least one matching setting option is established for each type of text content in the webpage corresponding to that webpage node, and, for the same type of text content in the webpage, the matching setting option established in the first matching setting description node differs from the matching setting options established in the other matching setting description nodes of that webpage node.
A6. The method according to A3, wherein a download mode attribute and an element filter attribute are set in the webpage node, the filter types indicated by the element filter attribute including one or more of: filtering images, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content; and wherein, before the step of matching, under the webpage node found, the webpage content in turn against the matching setting options in the first matching setting description node of that webpage node, the method further comprises: determining whether the value of the download mode attribute in the webpage node found is a predetermined value; if so, filtering the content of the webpage according to the filter types indicated by the element filter attribute, and then, under the webpage node found, matching the filtered webpage content in turn against the matching setting options in the first matching setting description node of that webpage node; and if not, downloading and loading the webpage content directly into the browser.
A7. The method according to A1, wherein the webpage text content matching setting includes a webpage URL matching setting option established for the uniform resource locator (URL) of the webpage content, the webpage URL matching setting option including a match attribute setting option, which includes: the webpage URL begins with predetermined content; and/or the webpage URL contains predetermined content, where a predetermined position within that predetermined content may hold arbitrary characters; and/or the webpage URL does not contain predetermined content, where that predetermined content may include arbitrary characters.
A8. The method according to A7, wherein the webpage URL matching setting option also includes a page identifier attribute setting, a page identifier extraction attribute setting, and a conversion attribute setting; the page identifier attribute setting includes: using the characters at a predetermined position in the URL of the webpage as the page identifier of the webpage content; the page identifier extraction attribute setting includes: selecting, within the page identifier obtained by matching according to the page identifier attribute setting, the characters at a predetermined position as the page identifier; and the conversion attribute setting includes: obtaining the URL of the webpage by conversion according to the known page identifier of the webpage content and the composition format of the URL.
A9. The method according to A7, wherein the webpage URL matching setting option also includes a webpage title extraction attribute setting, which includes: extracting the content that precedes a predetermined character in the webpage content as the title.
A10. The method according to A5, wherein establishing, in the first matching setting description node of a webpage node, at least one matching setting option for each type of text content in the webpage corresponding to that webpage node comprises: establishing, in the first matching setting description node, at least one matching setting option for the Hypertext Markup Language (HTML) element that carries each type of text content of the webpage in the webpage content; the matching setting options established for HTML elements include primary positioning matching setting options, which at least include: a base-point search setting option: indicating how the base point is searched for, the ways including search by ID, search by name, search by class name, search by content, and search by expression; and/or an ID positioning setting option: locating the element that matches the ID of an HTML element; and/or a name positioning setting option: locating the element that matches the name of an HTML element; and/or a class-name positioning setting option: locating the element that matches the class name of an HTML element; and/or a content positioning setting option: locating the element that matches the content of an HTML element; and/or an expression positioning setting option: locating the element that matches an expression in an HTML element; and/or a tag setting option: indicating the type and/or attribute of the located element when an element is located using the ID positioning setting option, name positioning setting option, class-name positioning setting option, content positioning setting option, or expression positioning setting option.
A11. The method according to A10, wherein the matching setting options established for HTML elements also include secondary positioning matching setting options, which at least include: a parent query setting option: setting how to look up the parent element of the element located by the primary positioning matching setting options; or a child query setting option: setting, for the element located according to the primary positioning
coupling setting option is set, searches the mode of the daughter element of this element; Perhaps, put when existing simultaneously when the father inquires about setting option and subquery setting option, inquire about father's element that setting option is searched the element that one-time positioning coupling setting option navigates to according to the father first, then according to the subquery setting option, from this father's element that finds, the daughter element of searching this father's element.A12, according to the described method of A10, it is characterized in that, described coupling setting option for the foundation of html element element also comprises: element deletion coupling setting option, and described element deletion coupling setting option comprises at least: delete the predetermined content in the element of being oriented by one-time positioning coupling setting option or secondary position matching setting option; And/or change predetermined content in the element of being oriented by one-time positioning coupling setting option or secondary position matching setting option.A13, according to the described method of A2, it is characterized in that, described set up one the coupling file is set after, described method also comprises: according to the update instruction that receives, described coupling is arranged website node, web page joint, coupling in the file coupling setting option that description node and/or coupling arrange in the description node is set upgrades.A14, according to the described method of A1, it is characterized in that, described described web page contents is arranged with described webpage text content coupling respectively mated, until described web page contents the match is successful comprises: when there is a plurality of web page contents that downloads in the browser side, for each web page contents distributes a thread, in the thread that distributes, the corresponding web page content arranged 
with described webpage text content coupling respectively and mates, until described web page contents the match is successful; And/or for a web page contents of browser side distributes a plurality of threads, in different threads, described web page contents arranged from different webpage text content coupling respectively and mates, until described web page contents the match is successful.A15, according to the described method of A2, it is characterized in that the described coupling of setting up arranges file and will be described at least one webpage text content coupling and arranges and be kept at described coupling and arrange in the file and comprise: receive the instruction of choosing of choosing the setting of webpage text content coupling that the user sends; Choose instruction and set up coupling file is set according to described, and will describedly choose webpage text content coupling setting in the instruction and be kept at the coupling of setting up and arrange in the file; Described coupling is arranged File Upload to server and be stored in the described user's of server side the user data.A16, according to the described method of A1, it is characterized in that, described with described web page contents respectively with described webpage text content coupling arrange mate before, described method also comprises: when the file that monitors indication browser loaded is finished event, starts the described operation that described web page contents is mated with the setting of described webpage text content coupling respectively.A17, according to the described method of A1, it is characterized in that, described described web page contents is arranged to mate with described webpage text content coupling respectively comprise: to the web page contents layering analysis that downloads to, obtain the DOM Document Object Model DOM structure of this web page contents; According to the DOM structure of described web page contents, web page 
contents is mated with the setting of described webpage text content coupling respectively.
Also disclosed herein is B18, a device for extracting webpage text content, comprising: a matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side; a download unit, adapted to download webpage content on the browser side; a matching unit, adapted to match the webpage content against each of the webpage text content matching settings in turn until the webpage content is matched successfully; and an extraction unit, adapted to extract the webpage text content from the webpage content by using the webpage text content matching setting that matched the webpage content successfully.
B19. The device according to B18, characterized in that the matching setting configuration unit is adapted to create a matching settings file and save the at least one webpage text content matching setting in the matching settings file; wherein the matching settings file contains at least one website node, each website node contains at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings each contain a different matching setting item for the same type of text content.
B20. The device according to B19, characterized in that the matching unit is adapted to look up, in the matching settings file, the website node and the webpage node corresponding to the webpage content; under the webpage node found, to match the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node; for each matching setting item that matches successfully, to set the matching result to the webpage text content extracted with that matching setting item; and, for each matching setting item that fails to match, to look up, in the matching setting description nodes of that webpage node other than the first matching setting description node, the matching setting items corresponding to the failed matching setting item, and to match each matching setting item found against the webpage content until one of them matches successfully, then to set the matching result to the webpage text content extracted according to that matching setting item.
B21. The device according to B20, characterized in that the extraction unit is adapted to take all of the webpage text content extracted according to the successfully matched matching setting items as the identified webpage text content of the webpage content.
B22. The device according to B19, characterized in that the matching setting configuration unit is adapted to create one website node for each type of website; under each website node, to create one webpage node for each type of webpage of the website corresponding to that website node; and to create, according to the content of the webpage, the matching setting items in the matching setting description nodes of each webpage node, wherein in the first matching setting description node of a webpage node at least one matching setting item is created for each type of text content in the webpage corresponding to that webpage node; and, for text content of the same type in a webpage, the matching setting item created in the first matching setting description node differs from the matching setting items created in the other matching setting description nodes of that webpage node.
B23. The device according to B20, characterized in that the matching setting configuration unit is further adapted to set a download mode attribute and an element filter attribute in the webpage node, and the filter types indicated by the element filter attribute comprise one or more of: filtering images, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content; the device further comprises a loading control unit and a filter unit; the loading control unit is adapted, before the webpage content is matched in turn, under the webpage node found, against the matching setting items in the first matching setting description node of that webpage node, to determine whether the value of the download mode attribute in the webpage node found is a predetermined value; if so, to start the filter unit, after which, under the webpage node found, the filtered webpage content is matched in turn against the matching setting items in the first matching setting description node of that webpage node; if not, to download the webpage content directly into the browser; the filter unit is adapted to filter the content of the webpage according to the filter types indicated by the element filter attribute.
B24. The device according to B18, characterized in that the webpage text content matching setting configured by the matching setting configuration unit comprises a webpage URL matching setting item created for the Uniform Resource Locator (URL) of the webpage content, the webpage URL matching setting item comprises a match attribute setting item, and the match attribute setting item comprises: the webpage URL begins with predetermined content; and/or the webpage URL contains predetermined content, where a predetermined position of that predetermined content may hold any character; and/or the webpage URL does not contain predetermined content, where that predetermined content may include any character.
B25. The device according to B24, characterized in that the webpage URL matching setting item created by the matching setting configuration unit further comprises an identifier attribute setting item, an identifier extraction attribute setting item, and a conversion attribute setting item; the identifier attribute setting item comprises: taking the characters at a predetermined position in the URL of the webpage as the identifier of the webpage content; the identifier extraction attribute setting item comprises: selecting, from the identifier obtained by matching according to the identifier attribute setting item, the characters at a predetermined position as the identifier; the conversion attribute setting item comprises: obtaining the URL of the webpage by conversion from the known identifier of the webpage content and the composition format of the URL.
B26. The device according to B24, characterized in that the webpage URL matching setting item created by the matching setting configuration unit further comprises a webpage title extraction attribute setting item, and the webpage title extraction attribute setting item comprises: extracting the content preceding a predetermined character in the webpage content as the title.
B27. The device according to B22, characterized in that the matching setting configuration unit is further adapted to create, in the first matching setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content of the webpage within the webpage content; the matching setting item created for the HTML element comprises a primary positioning matching setting item, and the primary positioning matching setting item comprises at least: a base point lookup setting item, which indicates the base point lookup mode, the modes comprising lookup by ID, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or an ID positioning setting item, which locates the element that matches the ID of the HTML element; and/or a name positioning setting item, which locates the element that matches the name of the HTML element; and/or a class name positioning setting item, which locates the element that matches the class name of the HTML element; and/or a content positioning setting item, which locates the element that matches the content of the HTML element; and/or an expression positioning setting item, which locates the element that matches the expression in the HTML element; and/or a tag setting item, which indicates the type and/or attributes of the element being located when the ID positioning setting item, name positioning setting item, class name positioning setting item, content positioning setting item, or expression positioning setting item is used to locate an element.
B28. The device according to B27, characterized in that the matching setting item created by the matching setting configuration unit for the HTML element further comprises a secondary positioning matching setting item, and the secondary positioning matching setting item comprises at least one of the following setting items: a parent query setting item, which sets the manner of looking up the parent element of the element located by the primary positioning matching setting item; or a child query setting item, which sets the manner of looking up the child elements of the element located by the primary positioning matching setting item; or, when the parent query setting item and the child query setting item are both present, first looking up, according to the parent query setting item, the parent element of the element located by the primary positioning matching setting item, and then looking up, according to the child query setting item, the child elements of the parent element found.
B29. The device according to B27, characterized in that the matching setting item created by the matching setting configuration unit for the HTML element further comprises an element deletion matching setting item, and the element deletion matching setting item comprises at least: deleting predetermined content from the element located by the primary positioning matching setting item or the secondary positioning matching setting item; and/or changing predetermined content in the element located by the primary positioning matching setting item or the secondary positioning matching setting item.
B30. The device according to B19, characterized in that the device further comprises a matching setting update unit, adapted, after the matching settings file has been created, to update, according to a received update instruction, the website nodes, webpage nodes, matching setting description nodes, and/or matching setting items in the matching setting description nodes in the matching settings file.
B31. The device according to B18, characterized in that it further comprises a multi-thread control unit; the multi-thread control unit is adapted, when multiple downloaded webpage contents exist on the browser side, to allocate one thread to each webpage content and to control the matching unit to match, in the allocated thread, the corresponding webpage content against each of the webpage text content matching settings in turn until that webpage content is matched successfully; and/or the multi-thread control unit is adapted to allocate multiple threads to one webpage content on the browser side and to control the matching unit to match, in different threads, the webpage content against different webpage text content matching settings until the webpage content is matched successfully.
B32. The device according to B19, characterized in that the device comprises an input unit and an upload unit; the input unit is adapted to receive a selection instruction, sent by a user, that selects webpage text content matching settings; the matching setting configuration unit is further adapted to create a matching settings file according to the selection instruction and to save the webpage text content matching settings selected in the instruction in the created matching settings file; the upload unit is adapted to upload the matching settings file to a server, where it is stored in that user's user data on the server side.
B33. The device according to B18, characterized in that the device further comprises a start control unit, adapted, when a document-complete event indicating that the browser has finished loading is detected, to start the matching unit to perform the operation of matching the webpage content against each of the webpage text content matching settings.
B34. The device according to B18, characterized in that the matching unit is further adapted to parse the downloaded webpage content hierarchically to obtain the Document Object Model (DOM) structure of the webpage content, and to match the webpage content against each of the webpage text content matching settings according to the DOM structure of the webpage content.

Claims (20)

1. A method for extracting webpage text content, comprising:
Presetting at least one webpage text content matching setting on the browser side;
Downloading webpage content on the browser side;
Matching the webpage content against each of the webpage text content matching settings in turn until the webpage content is matched successfully;
Extracting the webpage text content from the webpage content by using the webpage text content matching setting that matched the webpage content successfully.
2. The method according to claim 1, characterized in that presetting at least one webpage text content matching setting on the browser side comprises:
Creating a matching settings file and saving the at least one webpage text content matching setting in the matching settings file;
Wherein the matching settings file contains at least one website node, each website node contains at least one webpage node, at least some of the webpage nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings each contain a different matching setting item for the same type of text content.
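As an illustration only, the node hierarchy described in claim 2 might be modelled as nested records. The sketch below is in Python; the class names, the field names, and the string form of a matching setting item (for example `id=headline`) are assumptions made for this example and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptionNode:
    # Hypothetical: maps a text-content type ("title", "body") to one matching setting item.
    items: dict[str, str] = field(default_factory=dict)


@dataclass
class PageNode:
    page_type: str
    # Two or more description nodes give alternative items for the same content type.
    descriptions: list[DescriptionNode] = field(default_factory=list)


@dataclass
class SiteNode:
    site_type: str
    pages: list[PageNode] = field(default_factory=list)


# A hypothetical settings file: one website node, one webpage node, two description nodes.
settings_file = [
    SiteNode("news", [
        PageNode("article", [
            DescriptionNode({"title": "id=headline", "body": "class=article-body"}),
            DescriptionNode({"title": "tag=h1", "body": "id=content"}),
        ])
    ])
]


def alternatives(page: PageNode, content_type: str) -> list[str]:
    """Return every matching setting item defined for one content type, in node order."""
    return [d.items[content_type] for d in page.descriptions if content_type in d.items]
```

Here the two description nodes hold different items for the same content type, which is the property the last clause of the claim requires.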
3. The method according to claim 2, characterized in that matching the webpage content against each of the webpage text content matching settings in turn until the webpage content is matched successfully comprises:
Looking up, in the matching settings file, the website node and the webpage node corresponding to the webpage content;
Under the webpage node found, matching the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node;
For each matching setting item that matches successfully, setting the matching result to the webpage text content extracted with that matching setting item;
For each matching setting item that fails to match, looking up, in the matching setting description nodes of that webpage node other than the first matching setting description node, the matching setting items corresponding to the failed matching setting item, and matching each matching setting item found against the webpage content until one of them matches successfully, then setting the matching result to the webpage text content extracted according to that matching setting item.
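The fallback order in claim 3 can be sketched as follows. This is a minimal Python sketch under stated assumptions: each matching setting item is represented as a regular expression with one capture group, and each description node as a plain dictionary. Neither representation is specified by the patent, and the function name is hypothetical.

```python
import re


def extract_with_fallback(page_html, description_nodes):
    """Try the first description node's item for each content type, then the
    corresponding items in the remaining nodes until one matches.

    description_nodes: list of dicts mapping content type -> regex (assumed form).
    """
    first, rest = description_nodes[0], description_nodes[1:]
    results = {}
    for content_type, pattern in first.items():
        # Items for the same content type in the other description nodes are the fallbacks.
        candidates = [pattern] + [n[content_type] for n in rest if content_type in n]
        for candidate in candidates:
            match = re.search(candidate, page_html, re.S)
            if match:
                results[content_type] = match.group(1).strip()
                break  # stop at the first item that matches
    return results
```

A content type whose items all fail is simply absent from the result in this sketch; the patent text does not say what should happen in that case.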
4. The method according to claim 3, characterized in that extracting the webpage text content from the webpage content by using the webpage text content matching setting that matched the webpage content successfully comprises:
Taking all of the webpage text content extracted according to the successfully matched matching setting items as the identified webpage text content of the webpage content.
5. The method according to claim 2, characterized in that creating a matching settings file and saving the at least one webpage text content matching setting in the matching settings file comprises:
Creating one website node for each type of website;
Under each website node, creating one webpage node for each type of webpage of the website corresponding to that website node;
Creating, according to the content of the webpage, the matching setting items in the matching setting description nodes of each webpage node, wherein in the first matching setting description node of a webpage node at least one matching setting item is created for each type of text content in the webpage corresponding to that webpage node; and
For text content of the same type in a webpage, the matching setting item created in the first matching setting description node differs from the matching setting items created in the other matching setting description nodes of that webpage node.
6. The method according to claim 3, characterized in that a download mode attribute and an element filter attribute are set in the webpage node, and the filter types indicated by the element filter attribute comprise one or more of: filtering images, filtering Cascading Style Sheets (CSS), filtering JavaScript scripts, filtering frames, filtering objects, and filtering embedded content,
Before the step of matching, under the webpage node found, the webpage content in turn against the matching setting items in the first matching setting description node of that webpage node, the method further comprises:
Determining whether the value of the download mode attribute in the webpage node found is a predetermined value; if so, filtering the content of the webpage according to the filter types indicated by the element filter attribute, and then, under the webpage node found, matching the filtered webpage content in turn against the matching setting items in the first matching setting description node of that webpage node; if not, downloading the webpage content directly into the browser.
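The pre-filtering step of claim 6 might look like the Python sketch below. The filter names, the `"text-only"` mode value, and the regular expressions are all assumptions for illustration; a real browser would filter at the resource-loading or DOM level rather than with regular expressions over markup.

```python
import re

# Hypothetical mapping from filter type to a pattern that removes the matching markup.
FILTERS = {
    "image": r"<img\b[^>]*>",
    "css": r"<style\b.*?</style>|<link\b[^>]*rel=[\"']?stylesheet[^>]*>",
    "script": r"<script\b.*?</script>",
    "frame": r"<iframe\b.*?</iframe>|<frame\b[^>]*>",
    "object": r"<object\b.*?</object>",
    "embed": r"<embed\b[^>]*>",
}


def prepare(html, download_mode, element_filter, filtered_mode="text-only"):
    """Filter the page only when the download mode equals the predetermined value."""
    if download_mode != filtered_mode:
        return html  # otherwise the page is passed on unchanged
    for kind in element_filter:
        html = re.sub(FILTERS[kind], "", html, flags=re.S | re.I)
    return html
```

The filtered string would then be handed to the matching step; the unfiltered branch corresponds to loading the page into the browser as usual.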
7. The method according to claim 1, characterized in that the webpage text content matching setting comprises a webpage URL matching setting item created for the Uniform Resource Locator (URL) of the webpage content,
The webpage URL matching setting item comprises a match attribute setting item, and the match attribute setting item comprises:
The webpage URL begins with predetermined content; and/or,
The webpage URL contains predetermined content, where a predetermined position of that predetermined content may hold any character; and/or,
The webpage URL does not contain predetermined content, where that predetermined content may include any character.
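The three match attributes of claim 7 can be illustrated with a short Python sketch. It assumes that "any character" is written as `*` in the predetermined content; that wildcard syntax, the function names, and the parameter names are hypothetical.

```python
import re


def _to_regex(fragment):
    # Treat "*" in the fragment as "any characters"; everything else is literal.
    return ".*".join(re.escape(part) for part in fragment.split("*"))


def url_matches(url, starts_with=None, contains=None, excludes=None):
    """Apply the begins-with, contains, and does-not-contain conditions that are set."""
    if starts_with is not None and not url.startswith(starts_with):
        return False
    if contains is not None and not re.search(_to_regex(contains), url):
        return False
    if excludes is not None and re.search(_to_regex(excludes), url):
        return False
    return True
```

Because the claim joins the conditions with "and/or", each one is optional here and only the conditions actually supplied are checked.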
8. The method according to claim 7, characterized in that the webpage URL matching setting item further comprises an identifier attribute setting item, an identifier extraction attribute setting item, and a conversion attribute setting item,
The identifier attribute setting item comprises: taking the characters at a predetermined position in the URL of the webpage as the identifier of the webpage content;
The identifier extraction attribute setting item comprises: selecting, from the identifier obtained by matching according to the identifier attribute setting item, the characters at a predetermined position as the identifier;
The conversion attribute setting item comprises: obtaining the URL of the webpage by conversion from the known identifier of the webpage content and the composition format of the URL.
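One possible reading of the three setting items in claim 8, sketched in Python: a pattern picks the identifier out of the URL, a slice keeps the characters at a predetermined position, and a template rebuilds a URL from a known identifier. The regular-expression form, the slice, the `{id}` template placeholder, and the example URLs are assumptions for illustration only.

```python
import re


def page_identifier(url, id_pattern, keep=slice(None)):
    """Capture the identifier from the URL, then keep only the chosen character range."""
    match = re.search(id_pattern, url)
    if match is None:
        return None
    return match.group(1)[keep]


def convert_url(identifier, url_template):
    """Rebuild a page URL from a known identifier and the URL's composition format."""
    return url_template.format(id=identifier)
```

For instance, an identifier taken from a reading-page URL could be substituted into the template of the corresponding text-page URL.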
9. The method according to claim 7, characterized in that the webpage URL matching setting item further comprises a webpage title extraction attribute setting item,
The webpage title extraction attribute setting item comprises: extracting the content preceding a predetermined character in the webpage content as the title.
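The title rule of claim 9 amounts to cutting a string at a predetermined character. A minimal Python sketch follows; returning the whole text when the character is absent is an assumption, since the claim does not cover that case.

```python
def extract_title(text, stop_char):
    """Return the content before the first occurrence of stop_char."""
    head, sep, _ = text.partition(stop_char)
    # Assumed behaviour: without the separator, the whole text is treated as the title.
    return head.strip() if sep else text.strip()
```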
10. The method according to claim 5, characterized in that creating, in the first matching setting description node of a webpage node, at least one matching setting item for each type of text content in the webpage corresponding to that webpage node comprises:
Creating, in the first matching setting description node, at least one matching setting item for the HyperText Markup Language (HTML) element that holds each type of text content of the webpage within the webpage content;
The matching setting item created for the HTML element comprises a primary positioning matching setting item, and the primary positioning matching setting item comprises at least:
Base point lookup setting item: indicates the base point lookup mode, the modes comprising lookup by ID, lookup by name, lookup by class name, lookup by content, and lookup by expression; and/or,
ID positioning setting item: locates the element that matches the ID of the HTML element; and/or,
Name positioning setting item: locates the element that matches the name of the HTML element; and/or,
Class name positioning setting item: locates the element that matches the class name of the HTML element; and/or,
Content positioning setting item: locates the element that matches the content of the HTML element; and/or,
Expression positioning setting item: locates the element that matches the expression in the HTML element; and/or,
Tag setting item: indicates the type and/or attributes of the element being located when the ID positioning setting item, name positioning setting item, class name positioning setting item, content positioning setting item, or expression positioning setting item is used to locate an element.
11. The method according to claim 10, characterized in that the matching setting items established for HTML elements further comprise secondary-positioning matching setting items, and the secondary-positioning matching setting items comprise at least:
A parent query setting item: sets the manner of finding the parent element of the element located by the primary-positioning matching setting items; or,
A child query setting item: sets the manner of finding a child element of the element located by the primary-positioning matching setting items; or,
When the parent query setting item and the child query setting item are both present, the parent element of the element located by the primary-positioning matching setting items is first found according to the parent query setting item, and a child element of that parent element is then found according to the child query setting item.
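The parent-then-child order described in claim 11 can be sketched as follows. This is a minimal illustration under assumptions: the parent query is modelled as a number of levels to walk up and the child query as an ElementTree path, neither of which is specified by the patent, and `secondary_locate` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the title is easy to locate, the body is its sibling.
PAGE = ('<div id="outer"><h1 id="title">Chapter 1</h1>'
        '<p class="body">Text of the chapter.</p></div>')

def secondary_locate(root, located, parent_levels=0, child_path=None):
    """Move from `located` up to an ancestor, then optionally down to a child.

    parent_levels: how many levels to walk up (assumed form of the parent query).
    child_path:    ElementTree path evaluated on the resulting element
                   (assumed form of the child query).
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    elem = located
    for _ in range(parent_levels):
        elem = parents.get(elem)
        if elem is None:          # walked past the root
            return None
    if child_path is not None:
        elem = elem.find(child_path)
    return elem

root = ET.fromstring(PAGE)
title = root.find(".//h1[@id='title']")   # result of primary positioning
body = secondary_locate(root, title, parent_levels=1,
                        child_path="p[@class='body']")
```

Here the primary step finds the heading, the parent query moves to the enclosing `div`, and the child query selects the paragraph inside it.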
12. The method according to claim 10, characterized in that the matching setting items established for HTML elements further comprise element-deletion matching setting items, and the element-deletion matching setting items provide at least for:
Deleting predetermined content from the element located by the primary-positioning or secondary-positioning matching setting items; and/or
Changing predetermined content in the element located by the primary-positioning or secondary-positioning matching setting items.
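One possible reading of the deletion and change items in claim 12 is a post-processing pass over the text of the located element. The sketch below assumes the predetermined content is described by regular expressions; the function name, the patterns, and the sample string are hypothetical.

```python
import re

def clean_extracted(text, delete_patterns=(), replacements=()):
    """Delete, then change, predetermined content in the extracted text."""
    for pattern in delete_patterns:
        text = re.sub(pattern, "", text)        # deletion item
    for pattern, new in replacements:
        text = re.sub(pattern, new, text)       # change item
    return text.strip()

raw = "Chapter 1  [AD: click here]  The story begins."
cleaned = clean_extracted(
    raw,
    delete_patterns=[r"\[AD:[^\]]*\]"],         # drop the inline advertisement
    replacements=[(r"\s{2,}", " ")],            # collapse leftover whitespace
)
```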
13. The method according to claim 2, characterized in that, after the matching setting file is established, the method further comprises:
Updating, according to a received update instruction, the site nodes, the web page nodes, the matching setting description nodes, and/or the matching setting items in the matching setting description nodes of the matching setting file.
14. The method according to claim 1, characterized in that matching the web page content against each webpage text content matching setting in turn until the web page content is successfully matched comprises:
When multiple downloaded web page contents exist on the browser side, allocating one thread to each web page content and, in the allocated thread, matching the corresponding web page content against each webpage text content matching setting in turn until that web page content is successfully matched; and/or
Allocating multiple threads to one web page content on the browser side and, in the different threads, matching the web page content against different webpage text content matching settings until the web page content is successfully matched.
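The two threading variants of claim 14 might be sketched with Python's `concurrent.futures.ThreadPoolExecutor`. Everything here is an assumption made for illustration: a matching setting is modelled as a callable that returns the extracted text or `None`, and `by_marker` is a hypothetical stand-in for a real matching setting.

```python
from concurrent.futures import ThreadPoolExecutor

def by_marker(start, end):
    """Build a toy matching setting that extracts the text between two markers."""
    def setting(page):
        i = page.find(start)
        if i == -1:
            return None
        j = page.find(end, i + len(start))
        return page[i + len(start):j] if j != -1 else None
    return setting

def match_page(page, settings):
    """Try each matching setting in turn and stop at the first success."""
    for setting in settings:
        result = setting(page)
        if result is not None:
            return result
    return None

def match_pages_in_parallel(pages, settings):
    """First variant: one worker thread per downloaded page."""
    with ThreadPoolExecutor(max_workers=max(1, len(pages))) as pool:
        return list(pool.map(lambda page: match_page(page, settings), pages))

def match_page_with_parallel_settings(page, settings):
    """Second variant: several threads for one page, one setting per thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(settings))) as pool:
        # Results come back in setting order; the first success wins.
        for result in pool.map(lambda setting: setting(page), settings):
            if result is not None:
                return result
    return None

settings = [by_marker("<article>", "</article>"), by_marker("<main>", "</main>")]
pages = ["<main>first</main>", "<article>second</article>", "<p>none</p>"]
```

In the second variant the pool still waits for the remaining workers when the `with` block exits; a production version would probably want to cancel them once a match is found.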
15. The method according to claim 2, characterized in that establishing the matching setting file and saving the at least one webpage text content matching setting in the matching setting file comprises:
Receiving a selection instruction, sent by a user, that selects webpage text content matching settings;
Establishing a matching setting file according to the selection instruction, and saving the webpage text content matching settings selected by the instruction in the established matching setting file;
Uploading the matching setting file to a server and storing it in the user's user data on the server side.
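The three steps of claim 15 could look roughly like this. The patent does not state a file format or an upload protocol, so the JSON layout, the `upload` callback, and the in-memory `server_user_data` store are all hypothetical stand-ins.

```python
import json

def build_match_file(selected_settings):
    """Serialize the settings chosen by the user into a matching setting file."""
    return json.dumps({"version": 1, "settings": selected_settings},
                      ensure_ascii=False, sort_keys=True)

def save_and_upload(user_id, selected_settings, upload):
    """Create the file from the selection, then hand it to the uploader."""
    payload = build_match_file(selected_settings)
    upload(user_id, payload)
    return payload

# In-memory stand-in for the per-user data kept on the server side.
server_user_data = {}

def fake_upload(user_id, payload):
    server_user_data[user_id] = payload

selection = [{"field": "title", "mode": "id", "value": "chapter-title"}]
payload = save_and_upload("alice", selection, fake_upload)
```

Passing the uploader in as a parameter keeps the sketch free of any particular network library.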
16. The method according to claim 1, characterized in that, before the web page content is matched against each webpage text content matching setting, the method further comprises:
Upon detecting an event indicating that the browser has finished loading the document, starting the operation of matching the web page content against each webpage text content matching setting.
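The trigger in claim 16 amounts to registering the matching step as a handler for a load-complete event. The sketch below uses a toy event dispatcher; the class, the event names, and the handler are hypothetical and do not correspond to any real browser API.

```python
class PageEvents:
    """Minimal event dispatcher standing in for the browser's event source."""

    def __init__(self):
        self._handlers = {}

    def on(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def emit(self, name, *args):
        for handler in self._handlers.get(name, []):
            handler(*args)

matched = []

def start_matching(url, html):
    # A real handler would run the matching settings against `html` here.
    matched.append(url)

events = PageEvents()
events.on("document_complete", start_matching)

# Matching must not start while the page is still loading...
events.emit("navigate", "http://example.com/a", "")
# ...only once the load-complete event is observed.
events.emit("document_complete", "http://example.com/a", "<html></html>")
```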
17. The method according to claim 1, characterized in that matching the web page content against each webpage text content matching setting comprises:
Parsing the downloaded web page content layer by layer to obtain the Document Object Model (DOM) structure of the web page content;
Matching the web page content against each webpage text content matching setting according to the DOM structure of the web page content.
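To illustrate claim 17, the sketch below parses markup into a simple tree with Python's standard `html.parser.HTMLParser` and then looks up an element in that tree. It is a simplified model rather than a full DOM: the `Node` class, the short `VOID_TAGS` list, and `find_by_id` are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (abbreviated list for the sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    """Build a nested Node tree, one level per open element."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node            # descend one layer

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data

def parse_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def find_by_id(node, element_id):
    """Depth-first search for the element with the given id attribute."""
    if node.attrs.get("id") == element_id:
        return node
    for child in node.children:
        found = find_by_id(child, element_id)
        if found is not None:
            return found
    return None

dom = parse_dom('<html><body><div id="c">Hello<br>world</div><p>x</p></body></html>')
```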
18. A device capable of extracting webpage text content, comprising:
A matching setting configuration unit, adapted to preset at least one webpage text content matching setting on the browser side;
A download unit, adapted to download web page content on the browser side;
A matching unit, adapted to match the web page content against each webpage text content matching setting in turn until the web page content is successfully matched;
An extraction unit, adapted to extract the webpage text content from the web page content using the webpage text content matching setting that was successfully matched with the web page content.
19. The device according to claim 18, characterized in that the matching setting configuration unit is adapted to establish a matching setting file and save the at least one webpage text content matching setting in the matching setting file; wherein the matching setting file comprises at least one site node, each site node comprises at least one web page node, at least some of the web page nodes are provided with two or more matching setting description nodes, each matching setting description node corresponds to one webpage text content matching setting, and at least two of the webpage text content matching settings respectively comprise different matching setting items for the same type of text content.
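The node hierarchy of claim 19 (site node, web page node, matching setting description node) can be pictured as a nested structure. The layout below is a guess for illustration: the key names, the URL pattern, and the `find_page_node` helper are hypothetical, and the patent does not say how the file is actually encoded.

```python
import re
from urllib.parse import urlparse

MATCH_FILE = {
    "sites": [                                   # site nodes
        {
            "host": "novel.example.com",
            "pages": [                           # web page nodes
                {
                    "url_pattern": r"^/book/\d+/\d+\.html$",
                    "descriptions": [            # matching setting description nodes
                        {"title": {"mode": "id", "value": "chapter-title"},
                         "body": {"mode": "id", "value": "content"}},
                        # Different items for the same types of text content.
                        {"title": {"mode": "class", "value": "title"},
                         "body": {"mode": "class", "value": "article"}},
                    ],
                }
            ],
        }
    ]
}

def find_page_node(match_file, url):
    """Return the web page node whose site and URL pattern fit `url`."""
    parts = urlparse(url)
    for site in match_file["sites"]:
        if site["host"] != parts.hostname:
            continue
        for page in site["pages"]:
            if re.search(page["url_pattern"], parts.path):
                return page
    return None

page_node = find_page_node(MATCH_FILE, "http://novel.example.com/book/12/345.html")
```

Keeping two description nodes under one web page node gives the matcher an alternative to try when a site changes its markup.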
20. The device according to claim 19, characterized in that
The matching unit is adapted to: look up, in the matching setting file, the site node and web page node corresponding to the web page content; under the web page node found, match the web page content in turn against the matching setting items in the first matching setting description node of that web page node; for a matching setting item that matches successfully, set the matching result to the webpage text content extracted with that matching setting item; for a matching setting item that fails to match, look up, in the matching setting description nodes of that web page node other than the first, the matching setting item corresponding to the failed one, and match the item found against the web page content until an item found matches the web page content successfully, then set the matching result to the webpage text content extracted according to that matching setting item.
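The per-item fallback described in claim 20 can be sketched as follows. To keep the example self-contained, the page is reduced to a lookup table from `(mode, value)` pairs to text; that table, the `lookup` callable, and the field names are assumptions, not the patent's data model.

```python
def extract_with_fallback(lookup, descriptions):
    """Match items of the first description node, falling back per item.

    lookup:       callable taking a matching setting item and returning the
                  extracted text, or None when the item fails to match.
    descriptions: list of description nodes, each mapping a content type
                  (e.g. 'title') to a matching setting item.
    """
    if not descriptions:
        return {}
    first, *others = descriptions
    results = {}
    for field, item in first.items():
        text = lookup(item)
        if text is None:
            # Failed: try the corresponding item in the other description nodes.
            for node in others:
                if field in node:
                    text = lookup(node[field])
                    if text is not None:
                        break
        results[field] = text
    return results

# Toy page: the title is reachable by id, the body only by class name.
page_index = {
    ("id", "chapter-title"): "Chapter 1",
    ("class", "article"): "Body text",
}

def lookup(item):
    return page_index.get((item["mode"], item["value"]))

descriptions = [
    {"title": {"mode": "id", "value": "chapter-title"},
     "body": {"mode": "id", "value": "content"}},
    {"title": {"mode": "class", "value": "title"},
     "body": {"mode": "class", "value": "article"}},
]
result = extract_with_fallback(lookup, descriptions)
```

Only the failed `body` item falls back to the second description node; the `title` item that already matched is not retried.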
CN201210573022.8A 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted Active CN103020266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210573022.8A CN103020266B (en) 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210573022.8A CN103020266B (en) 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted

Publications (2)

Publication Number Publication Date
CN103020266A true CN103020266A (en) 2013-04-03
CN103020266B CN103020266B (en) 2016-06-29

Family

ID=47968869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210573022.8A Active CN103020266B (en) 2012-12-25 2012-12-25 The method and apparatus that webpage text content is extracted

Country Status (1)

Country Link
CN (1) CN103020266B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399759A (en) * 2013-06-29 2013-11-20 广州市动景计算机科技有限公司 Network content downloading method and device
CN103530336A (en) * 2013-09-30 2014-01-22 北京奇虎科技有限公司 Equipment and method for identifying invalid parameters in URLs
CN103577566A (en) * 2013-10-25 2014-02-12 北京奇虎科技有限公司 Web reading content loading method and device
CN104008131A (en) * 2014-04-30 2014-08-27 广州市动景计算机科技有限公司 Processing method and device for web page data
CN104021172A (en) * 2014-05-30 2014-09-03 北京搜狗科技发展有限公司 Advertisement filtering method and advertisement filtering device
CN104317883A (en) * 2014-10-21 2015-01-28 北京国双科技有限公司 Web text processing method and web text processing device
CN104700031A (en) * 2013-12-06 2015-06-10 腾讯科技(深圳)有限公司 Method, device and system for preventing remote code execution during application operation
WO2015165245A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage data processing method and device
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN106855859A (en) * 2015-12-08 2017-06-16 北京搜狗科技发展有限公司 A kind of webpage context extraction method and device
CN106980700A (en) * 2013-11-08 2017-07-25 北京奇虎科技有限公司 The method and browser of web search are carried out in browser side
CN107402953A (en) * 2017-05-22 2017-11-28 阿里巴巴集团控股有限公司 A kind of method for page jump and device
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 A kind of method and apparatus for extracting content-data
CN108241680A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 The method and apparatus for obtaining the amount of reading of webpage
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101944094A (en) * 2009-07-06 2011-01-12 富士通株式会社 Webpage information extraction method and device thereof
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399759A (en) * 2013-06-29 2013-11-20 广州市动景计算机科技有限公司 Network content downloading method and device
CN103530336B (en) * 2013-09-30 2017-09-15 北京奇虎科技有限公司 The identification equipment and method of Invalid parameter in uniform resource position mark URL
CN103530336A (en) * 2013-09-30 2014-01-22 北京奇虎科技有限公司 Equipment and method for identifying invalid parameters in URLs
CN103577566A (en) * 2013-10-25 2014-02-12 北京奇虎科技有限公司 Web reading content loading method and device
CN106980700B (en) * 2013-11-08 2021-04-09 北京奇虎科技有限公司 Method for searching network on browser side and browser
CN106980700A (en) * 2013-11-08 2017-07-25 北京奇虎科技有限公司 The method and browser of web search are carried out in browser side
CN104700031A (en) * 2013-12-06 2015-06-10 腾讯科技(深圳)有限公司 Method, device and system for preventing remote code execution during application operation
CN104700031B (en) * 2013-12-06 2019-12-13 腾讯科技(深圳)有限公司 Method, device and system for preventing remote code from being executed in application operation
CN104008131A (en) * 2014-04-30 2014-08-27 广州市动景计算机科技有限公司 Processing method and device for web page data
WO2015165245A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage data processing method and device
CN104021172A (en) * 2014-05-30 2014-09-03 北京搜狗科技发展有限公司 Advertisement filtering method and advertisement filtering device
CN104021172B (en) * 2014-05-30 2017-07-28 北京搜狗科技发展有限公司 Advertisement filter method and advertisement filter device
CN104317883A (en) * 2014-10-21 2015-01-28 北京国双科技有限公司 Web text processing method and web text processing device
CN104317883B (en) * 2014-10-21 2017-11-21 北京国双科技有限公司 Network text processing method and processing device
CN105468730A (en) * 2015-11-20 2016-04-06 广州华多网络科技有限公司 Webpage information extraction method and equipment
CN106855859A (en) * 2015-12-08 2017-06-16 北京搜狗科技发展有限公司 A kind of webpage context extraction method and device
CN108009171A (en) * 2016-10-27 2018-05-08 腾讯科技(北京)有限公司 A kind of method and apparatus for extracting content-data
CN108241680A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 The method and apparatus for obtaining the amount of reading of webpage
CN108241680B (en) * 2016-12-26 2020-10-13 北京国双科技有限公司 Method and device for acquiring reading amount of webpage
CN107402953A (en) * 2017-05-22 2017-11-28 阿里巴巴集团控股有限公司 A kind of method for page jump and device
CN113254751A (en) * 2021-06-24 2021-08-13 北森云计算有限公司 Method, equipment and storage medium for accurately extracting complex webpage structured information

Also Published As

Publication number Publication date
CN103020266B (en) 2016-06-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.
