CN106951505A - Web page information obtaining method and system - Google Patents

Web page information obtaining method and system

Info

Publication number
CN106951505A
Authority
CN
China
Prior art keywords
node
web page
source code
text
navigation bar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710157301.9A
Other languages
Chinese (zh)
Other versions
CN106951505B (en)
Inventor
李天与
刘海龙
郗家贞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Sohu New Media Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sohu New Media Information Technology Co Ltd
Priority to CN201710157301.9A
Publication of CN106951505A
Application granted
Publication of CN106951505B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F16/954: Navigation, e.g. using categorised browsing
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a web page information obtaining method and device. A web page keyword input by a user is obtained; the web page keyword is retrieved through a search engine to obtain the network links of multiple web pages corresponding to the keyword; the web page source code of the multiple web pages is obtained through the network links; and the web page source code is analyzed and processed to obtain, from the source code, the network link of a body-text web page list page (the list page, such as a news list page, to which a body-text web page belongs). The invention thus obtains body-text web page list pages automatically, and can obtain the body-text web page list pages of a large number of websites quickly and accurately.

Description

Web page information obtaining method and system
Technical field
The present invention relates to the technical field of information processing, and in particular to a web page information obtaining method and system.
Background art
News is the main carrier of Internet information. After news is obtained from news producers such as major portals and media outlets, it can be studied to identify features and trends of Internet information.
To obtain a large amount of news, news list pages need to be monitored. In the prior art, the news list pages of major portals and media outlets have to be found manually, so the news list pages of a large number of websites cannot be obtained quickly and accurately.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a web page information obtaining method and system, so as to obtain the news list pages of a large number of websites quickly and accurately. The specific technical solution is as follows:
A web page information obtaining method, including:
obtaining a web page keyword input by a user;
retrieving the web page keyword through a search engine to obtain the network links of multiple web pages corresponding to the web page keyword;
obtaining the web page source code of the multiple web pages through the network links;
analyzing and processing the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
Optionally, before the web page source code of the multiple web pages is obtained through the network links, the method further includes:
selecting the network links of body-text web pages from the network links;
the obtaining of the web page source code of the multiple web pages through the network links includes: obtaining the web page source code of the body-text web pages through the network links of the body-text web pages.
Optionally, the analyzing and processing of the web page source code to obtain, from the web page source code, the network link of a body-text web page list page includes:
analyzing and processing the web page source code to obtain the HTML Document Object Model (HTML DOM) of the web page;
traversing all nodes in the HTML DOM to obtain the navigation bar label list of the web page, where the navigation bar labels are arranged in order in the navigation bar label list;
determining the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
Optionally, the traversing of all nodes in the HTML DOM to obtain the navigation bar label list of the web page includes:
judging whether the body node in the HTML DOM has child nodes and, if so, determining one child node of the body node as the current node;
judging whether the text of the current node contains a navigation bar separator; if it does, judging whether the current node has an untraversed sibling node; if it does not, determining another child node of the body node as the current node and returning to the step of judging whether the text of the current node contains a navigation bar separator;
if an untraversed sibling node exists, determining the untraversed sibling node of the current node as the current node and judging whether the current node has a preset label feature; if it has a preset label feature, putting the current node into the navigation bar label list of the web page and returning to the step of judging whether the current node has an untraversed sibling node; if it does not have a preset label feature, returning to the step of judging whether the current node has an untraversed sibling node;
if no untraversed sibling node exists, judging whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, determining that the nodes put into the navigation bar label list of the web page are all the nodes in the navigation bar label list of the web page; if not all child nodes of the body node have been traversed, determining another child node of the body node as the current node and returning to the step of judging whether the text of the current node contains a navigation bar separator.
Optionally, the preset label feature includes any of the following features:
being located above the title;
having an identifier between navigation links;
having label form.
A web page information obtaining device, including: a keyword obtaining unit, a retrieval unit, a source code obtaining unit and a source code analysis unit,
the keyword obtaining unit being configured to obtain a web page keyword input by a user;
the retrieval unit being configured to retrieve the web page keyword through a search engine to obtain the network links of multiple web pages corresponding to the web page keyword;
the source code obtaining unit being configured to obtain the web page source code of the multiple web pages through the network links;
the source code analysis unit being configured to analyze and process the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
Optionally, the device further includes: a link selection unit, configured to select the network links of body-text web pages from the network links before the source code obtaining unit obtains the web page source code;
the source code obtaining unit being specifically configured to obtain the web page source code of the body-text web pages through the network links of the body-text web pages.
Optionally, the source code analysis unit includes: a model obtaining subunit, a node traversal subunit and a link determination subunit,
the model obtaining subunit being configured to analyze and process the web page source code to obtain the HTML Document Object Model (HTML DOM) of the web page;
the node traversal subunit being configured to traverse all nodes in the HTML DOM to obtain the navigation bar label list of the web page, where the navigation bar labels are arranged in order in the navigation bar label list;
the link determination subunit being configured to determine the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
Optionally, the node traversal subunit includes: a child node judgment subunit, a first current node determination subunit, a second current node determination subunit, a separator judgment subunit, a sibling node judgment subunit, a label feature judgment subunit and a node insertion subunit,
the child node judgment subunit being configured to judge whether the body node in the HTML DOM has child nodes and, if so, to trigger the first current node determination subunit;
the first current node determination subunit being configured to determine one child node of the body node as the current node;
the separator judgment subunit being configured to judge whether the text of the current node contains a navigation bar separator; if it does, to trigger the sibling node judgment subunit; if it does not, to trigger the second current node determination subunit;
the sibling node judgment subunit being configured to judge whether the current node has an untraversed sibling node; if an untraversed sibling node exists, to trigger the label feature judgment subunit; if no untraversed sibling node exists, to trigger the traversal completion determination subunit;
the second current node determination subunit being configured to determine another child node of the body node as the current node and to trigger the separator judgment subunit;
the label feature judgment subunit being configured to determine the untraversed sibling node of the current node as the current node and to judge whether the current node has a preset label feature; if it has a preset label feature, to trigger the node insertion subunit; if it does not, to trigger the sibling node judgment subunit;
the node insertion subunit being configured to put the current node into the navigation bar label list of the web page and to trigger the sibling node judgment subunit;
the traversal completion determination subunit being configured to judge whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, to trigger the list determination subunit; if not, to trigger the second current node determination subunit;
the list determination subunit being configured to determine that the nodes put into the navigation bar label list of the web page are all the nodes in the navigation bar label list of the web page.
Optionally, the preset label feature includes any of the following features:
being located above the title;
having an identifier between navigation links;
having label form.
With the web page information obtaining method and device provided by the embodiments of the present invention, a web page keyword input by a user is obtained; the web page keyword is retrieved through a search engine to obtain the network links of multiple web pages corresponding to the keyword; the web page source code of the multiple web pages is obtained through the network links; and the web page source code is analyzed and processed to obtain, from the source code, the network link of a body-text web page list page. The invention thus obtains body-text web page list pages automatically, and can obtain the body-text web page list pages of a large number of websites quickly and accurately.
Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a web page information obtaining method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another web page information obtaining method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of step S400 provided by an embodiment of the present invention;
Fig. 4 is a flow chart of step S420 provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a web page information obtaining device provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the source code analysis unit in a web page information obtaining device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, a web page information obtaining method provided by an embodiment of the present invention may include:
S100: obtaining a web page keyword input by a user;
S200: retrieving the web page keyword through a search engine to obtain the network links of multiple web pages corresponding to the web page keyword;
Step S200 may use metasearch engine technology to retrieve with several search engines.
Specifically, after the web page keyword is retrieved through the search engine, the retrieval results may first be processed before the network links of the multiple web pages corresponding to the web page keyword are obtained. The processing can take several forms, such as deduplication, which removes identical retrieval results, or screening, in which only the retrieval results on the first N pages are kept and the remaining results are discarded.
S300: obtaining the web page source code of the multiple web pages through the network links;
Specifically, the present invention may obtain the web page source code through the download module of a browser, such as the Firefox download module. The Firefox download module integrates a JavaScript executor and a CSS renderer, so web page source code that contains Ajax execution results and CSS rendering results can be downloaded through Firefox.
S400: analyzing and processing the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
With the web page information obtaining method provided by the embodiment of the present invention, a web page keyword input by a user is obtained; the web page keyword is retrieved through a search engine to obtain the network links of multiple web pages corresponding to the keyword; the web page source code of the multiple web pages is obtained through the network links; and the web page source code is analyzed and processed to obtain, from the source code, the network link of a body-text web page list page. The invention thus obtains body-text web page list pages automatically, and can obtain the body-text web page list pages of a large number of websites quickly and accurately.
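The post-processing of retrieval results described for step S200 (deduplication, then keeping only the first N result pages) could be sketched as follows. This is only an illustration under stated assumptions: the function name `postprocess_results`, the `(result_page_number, url)` input shape and the `max_page` parameter are hypothetical and do not come from the patent.

```python
from typing import Iterable, List, Tuple


def postprocess_results(results: Iterable[Tuple[int, str]], max_page: int) -> List[str]:
    """Deduplicate retrieval results and keep only those from the first pages.

    `results` is assumed to be (result_page_number, url) pairs in retrieval
    order, possibly merged from several search engines.
    """
    seen = set()
    kept = []
    for page_no, url in results:
        if page_no > max_page:   # screening: discard results beyond the first N pages
            continue
        if url in seen:          # deduplication: drop identical retrieval results
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

For example, with `max_page=2`, a link that appears on result pages 1 and 2 is kept once, and a link that appears only on result page 3 is dropped.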
As shown in Fig. 2, in other embodiments of the present invention, the method shown in Fig. 1 may further include, before step S300:
S210: selecting the network links of body-text web pages from the network links;
It can be understood that the present invention can select the network links of body-text web pages from the network links based on how the network link of a body-text web page differs from the network link of a website home page and from the network link of a body-text web page list page. For example, the tail of the network link of a body-text web page usually includes a string of digits, whereas the network links of a website home page and of a body-text web page list page do not. The network links of body-text web pages can be selected according to this feature.
Step S300 may specifically include: obtaining the web page source code of the body-text web pages through the network links of the body-text web pages.
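A minimal sketch of the selection in step S210, assuming the "string of digits at the tail" feature is tested on the URL path. The regular expression, the minimum of four digits and the optional file extension are illustrative assumptions; the patent does not specify them.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: the path ends with a run of at least four digits,
# optionally followed by a file extension such as ".html" or ".shtml".
_TRAILING_DIGITS = re.compile(r"\d{4,}(?:\.[A-Za-z]+)?$")


def looks_like_body_page(url: str) -> bool:
    """Return True if the tail of the link contains a string of digits."""
    path = urlparse(url).path.rstrip("/")
    return bool(_TRAILING_DIGITS.search(path))


def select_body_page_links(urls):
    """Keep only the links that look like body-text web pages."""
    return [u for u in urls if looks_like_body_page(u)]
```

Under these assumptions a home page such as `http://news.example.com/` and a list page such as `http://news.example.com/china/` are rejected, while `http://news.example.com/a/20170316/123456789.shtml` is kept.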
As shown in Fig. 3, step S400 in the embodiments of Fig. 1 and Fig. 2 may include:
S410: analyzing and processing the web page source code to obtain the HTML Document Object Model (HTML DOM) of the web page;
The HTML DOM presents an HTML document as a node tree with elements, attributes and text.
S420: traversing all nodes in the HTML DOM to obtain the navigation bar label list of the web page, where the navigation bar labels are arranged in order in the navigation bar label list;
S430: determining the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
The last navigation bar label is the label corresponding to the body-text web page itself, so the penultimate navigation bar label is the label corresponding to the body-text web page list page.
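Steps S410 and S430 can be illustrated with Python's standard `html.parser` module. The sketch below is not the patent's implementation: the `Node` and `TreeBuilder` classes and the functions `parse_html` and `list_page_link` are hypothetical names, and the tree it builds is a simplified stand-in for a real HTML DOM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the simplified node tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from web page source code (S410)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def list_page_link(labels):
    """S430: return the link of the penultimate (text, href) label, or None."""
    if len(labels) < 2:
        return None
    return labels[-2][1]
```

Given the ordered labels `[("Home", "/"), ("News", "/news/"), ("Story", "/news/1.html")]`, `list_page_link` returns `"/news/"`, since the last label stands for the body-text page itself.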
As shown in Fig. 4, step S420 in the embodiment shown in Fig. 3 may include:
S421: judging whether the body node in the HTML DOM has child nodes and, if so, performing step S422;
The body node is the most important root node in HTML; the visual elements of a web page are generally all located within the body node.
The child nodes of the body node are, in the DOM tree, all nodes below the body node.
The visual elements of a web page are generally located only in the body node, while the other root nodes alongside body contain only initialization information of the web page and need no processing; therefore only the body node and its child nodes are processed.
S422: determining one child node of the body node as the current node;
S423: judging whether the text of the current node contains a navigation bar separator; if it does, performing step S424; if it does not, performing step S425;
The separator can take various forms, such as ">" and "/".
S424: judging whether the current node has an untraversed sibling node; if an untraversed sibling node exists, performing step S426; if no untraversed sibling node exists, performing step S428;
S425: determining another child node of the body node as the current node, and returning to step S423;
S426: determining the untraversed sibling node of the current node as the current node and judging whether the current node has a preset label feature; if it has a preset label feature, performing step S427; if it does not, returning to step S424;
Specifically, step S426 may determine each node as the current node in turn, for example in depth-first or breadth-first order.
The preset label feature may include any of the following features:
being located above the title;
having an identifier between navigation links;
having label form.
When the text of the current node carries a link and the number of words in the text does not exceed a preset threshold, the current node can be determined to have label form.
S427: putting the current node into the navigation bar label list of the web page, and returning to step S424;
S428: judging whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, performing step S429; if not, returning to step S425.
S429: determining that the nodes put into the navigation bar label list of the web page are all the nodes in the navigation bar label list of the web page.
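One possible reading of steps S421 to S429 is sketched below. It is an interpretation, not the patent's code: it assumes each separator sits in its own element between the navigation links, checks only the "label form" feature, scans each sibling group once, and uses a hypothetical threshold `MAX_LABEL_CHARS` that counts characters rather than words. The `Node` class is a hand-built stand-in for an HTML DOM node.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEPARATORS = (">", "/")   # the two example separators named in the text
MAX_LABEL_CHARS = 10      # hypothetical preset threshold


@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""                      # the node's own text
    href: Optional[str] = None          # link carried by the node, if any
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def has_separator(node: Node) -> bool:
    """S423: does the node's text contain a navigation bar separator?"""
    return any(sep in node.text for sep in SEPARATORS)


def has_label_form(node: Node) -> bool:
    """Label form: the text carries a link and is short enough."""
    text = node.text.strip()
    return node.href is not None and 0 < len(text) <= MAX_LABEL_CHARS


def descendants(node: Node):
    """All nodes below `node`, depth-first in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def navigation_labels(body: Node) -> List[Node]:
    labels = []
    scanned_groups = set()                   # sibling groups already walked
    for current in descendants(body):        # S422 / S425: next node below body
        if not has_separator(current):       # S423
            continue
        group = current.parent
        if id(group) in scanned_groups:
            continue
        scanned_groups.add(id(group))
        for sibling in group.children:       # S424 / S426: walk the siblings
            if sibling is not current and has_label_form(sibling):
                labels.append(sibling)       # S427
    return labels                            # S429
```

With a navigation bar made of the links "Home", "News" and "Story" separated by elements whose text is ">", this returns the three link nodes in document order, and the penultimate one ("News") carries the link of the list page.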
Corresponding to the above method embodiment, the present invention also provides a web page information obtaining device.
As shown in Fig. 5, a web page information obtaining device provided by an embodiment of the present invention may include: a keyword obtaining unit 100, a retrieval unit 200, a source code obtaining unit 300 and a source code analysis unit 400,
the keyword obtaining unit 100 being configured to obtain a web page keyword input by a user;
the retrieval unit 200 being configured to retrieve the web page keyword through a search engine to obtain the network links of multiple web pages corresponding to the web page keyword;
Specifically, after the web page keyword is retrieved through the search engine, the retrieval results may first be processed before the network links of the multiple web pages corresponding to the web page keyword are obtained. The processing can take several forms, such as deduplication, which removes identical retrieval results, or screening, in which only the retrieval results on the first N pages are kept and the remaining results are discarded.
the source code obtaining unit 300 being configured to obtain the web page source code of the multiple web pages through the network links;
Specifically, the present invention may obtain the web page source code through the download module of a browser, such as the Firefox download module. The Firefox download module integrates a JavaScript executor and a CSS renderer, so web page source code that contains Ajax execution results and CSS rendering results can be downloaded through Firefox.
the source code analysis unit 400 being configured to analyze and process the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
With the web page information obtaining device provided by the embodiment of the present invention, a web page keyword input by a user is obtained; the web page keyword is retrieved through a search engine to obtain the network links of multiple web pages corresponding to the keyword; the web page source code of the multiple web pages is obtained through the network links; and the web page source code is analyzed and processed to obtain, from the source code, the network link of a body-text web page list page. The invention thus obtains body-text web page list pages automatically, and can obtain the body-text web page list pages of a large number of websites quickly and accurately.
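How the four units of Fig. 5 hand data to one another can be sketched as a small pipeline in which each unit is a plain callable. The class name `WebInfoDevice`, the field names and the removal of duplicate list pages at the end are assumptions made for illustration; the patent does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class WebInfoDevice:
    obtain_keyword: Callable[[], str]         # keyword obtaining unit 100
    retrieve: Callable[[str], List[str]]      # retrieval unit 200
    obtain_source: Callable[[str], str]       # source code obtaining unit 300
    analyze: Callable[[str], Optional[str]]   # source code analysis unit 400

    def run(self) -> List[str]:
        """Return the list page links found for the user's keyword."""
        keyword = self.obtain_keyword()
        links = self.retrieve(keyword)
        list_pages: List[str] = []
        for link in links:
            found = self.analyze(self.obtain_source(link))
            # Assumption: several body-text pages may share one list page,
            # so each list page link is reported only once.
            if found is not None and found not in list_pages:
                list_pages.append(found)
        return list_pages
```

Each unit can then be replaced independently, for instance by swapping a stub `retrieve` function for one that queries real search engines.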
In other embodiments of the present invention, the device shown in Fig. 5 may further include: a link selection unit, configured to select the network links of body-text web pages from the network links before the source code obtaining unit 300 obtains the web page source code;
the source code obtaining unit 300 being specifically configured to obtain the web page source code of the body-text web pages through the network links of the body-text web pages.
It can be understood that the present invention can select the network links of body-text web pages from the network links based on how the network link of a body-text web page differs from the network link of a website home page and from the network link of a body-text web page list page. For example, the tail of the network link of a body-text web page usually includes a string of digits, whereas the network links of a website home page and of a body-text web page list page do not. The network links of body-text web pages can be selected according to this feature.
As shown in Fig. 6, the source code analysis unit 400 may include: a model obtaining subunit 410, a node traversal subunit 420 and a link determination subunit 430,
the model obtaining subunit 410 being configured to analyze and process the web page source code to obtain the HTML Document Object Model (HTML DOM) of the web page;
the node traversal subunit 420 being configured to traverse all nodes in the HTML DOM to obtain the navigation bar label list of the web page, where the navigation bar labels are arranged in order in the navigation bar label list;
the link determination subunit 430 being configured to determine the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
The last navigation bar label is the label corresponding to the body-text web page itself, so the penultimate navigation bar label is the label corresponding to the body-text web page list page.
Specifically, the node traversal subunit 420 may include: a child node judgment subunit, a first current node determination subunit, a second current node determination subunit, a separator judgment subunit, a sibling node judgment subunit, a label feature judgment subunit and a node insertion subunit,
the child node judgment subunit being configured to judge whether the body node in the HTML DOM has child nodes and, if so, to trigger the first current node determination subunit;
the first current node determination subunit being configured to determine one child node of the body node as the current node;
the separator judgment subunit being configured to judge whether the text of the current node contains a navigation bar separator; if it does, to trigger the sibling node judgment subunit; if it does not, to trigger the second current node determination subunit;
the sibling node judgment subunit being configured to judge whether the current node has an untraversed sibling node; if an untraversed sibling node exists, to trigger the label feature judgment subunit; if no untraversed sibling node exists, to trigger the traversal completion determination subunit;
the second current node determination subunit being configured to determine another child node of the body node as the current node and to trigger the separator judgment subunit;
the label feature judgment subunit being configured to determine the untraversed sibling node of the current node as the current node and to judge whether the current node has a preset label feature; if it has a preset label feature, to trigger the node insertion subunit; if it does not, to trigger the sibling node judgment subunit;
the node insertion subunit being configured to put the current node into the navigation bar label list of the web page and to trigger the sibling node judgment subunit;
the traversal completion determination subunit being configured to judge whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, to trigger the list determination subunit; if not, to trigger the second current node determination subunit;
the list determination subunit being configured to determine that the nodes put into the navigation bar label list of the web page are all the nodes in the navigation bar label list of the web page.
The preset label feature may include any of the following features:
being located above the title;
having an identifier between navigation links;
having label form.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for related parts, refer to the description of the method embodiment.
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A web page information obtaining method, characterized in that the method comprises:
obtaining a web page keyword input by a user;
searching for the web page keyword through a search engine to obtain network links of a plurality of web pages corresponding to the web page keyword;
obtaining the web page source code of the plurality of web pages through the network links;
analyzing and processing the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
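The four steps of claim 1 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function name `obtain_list_page_links` and the three injected helpers (`search`, `fetch`, `analyze`) are hypothetical and do not appear in the patent, which names no search engine API, HTTP client, or parser.

```python
from typing import Callable, Iterable, List, Optional


def obtain_list_page_links(
    keyword: str,
    search: Callable[[str], Iterable[str]],
    fetch: Callable[[str], str],
    analyze: Callable[[str], Optional[str]],
) -> List[str]:
    """Run the claimed steps with caller-supplied (hypothetical) helpers.

    search:  keyword -> network links of matching web pages
    fetch:   network link -> web page source code
    analyze: source code -> list-page link, or None if none was found
    """
    list_page_links: List[str] = []
    for url in search(keyword):        # step 2: search engine lookup
        source = fetch(url)            # step 3: obtain the page source code
        link = analyze(source)         # step 4: analyze the source code
        # Keep each list-page link once, in the order it was discovered.
        if link is not None and link not in list_page_links:
            list_page_links.append(link)
    return list_page_links
```

Passing the helpers in as parameters keeps the sketch independent of any particular search engine or download library; the keyword itself (step 1) is simply the first argument.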
2. The method according to claim 1, characterized in that, before the web page source code of the plurality of web pages is obtained through the network links, the method further comprises:
selecting the network links of body-text web pages from the network links;
wherein obtaining the web page source code of the plurality of web pages through the network links comprises: obtaining the web page source code of the body-text web pages through the network links of the body-text web pages.
3. The method according to claim 1 or 2, characterized in that analyzing and processing the web page source code to obtain, from the web page source code, the network link of the body-text web page list page comprises:
analyzing and processing the web page source code to obtain the document object model (HTML DOM) of the web page;
traversing all nodes in the HTML DOM to obtain a navigation bar label list of the web page, wherein the navigation bar labels in the navigation bar label list are arranged in order;
determining the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
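The final step of claim 3 reduces to picking the second-to-last entry of an ordered list. A minimal sketch, assuming each navigation bar label is represented as a `(text, href)` pair; that representation and the name `list_page_link` are illustrative choices, not taken from the patent.

```python
from typing import List, Optional, Tuple


def list_page_link(nav_labels: List[Tuple[str, str]]) -> Optional[str]:
    """Return the link of the penultimate navigation bar label.

    nav_labels is assumed to hold (text, href) pairs in the order in
    which the labels appear in the navigation bar.
    """
    if len(nav_labels) < 2:
        # With fewer than two labels there is no penultimate entry.
        return None
    _text, href = nav_labels[-2]
    return href
```

For a navigation bar such as Home, News, Sports, followed by the article itself, the function returns the link attached to "Sports".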
4. The method according to claim 3, characterized in that traversing all nodes in the HTML DOM to obtain the navigation bar label list of the web page comprises:
judging whether the body node in the HTML DOM has child nodes and, if so, determining one child node of the body node as the current node;
judging whether the text of the current node contains a navigation bar separator; if it contains a navigation bar separator, judging whether the current node has an untraversed sibling node; if it does not contain a navigation bar separator, determining another child node of the body node as the current node and returning to the step of judging whether the text of the current node contains a navigation bar separator;
if an untraversed sibling node exists, determining the untraversed sibling node of the current node as the current node and judging whether the current node has a preset label feature; if it has a preset label feature, placing the current node in the navigation bar label list of the web page and returning to the step of judging whether the current node has an untraversed sibling node; if it does not have a preset label feature, returning to the step of judging whether the current node has an untraversed sibling node;
if no untraversed sibling node exists, judging whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, determining that the nodes placed in the navigation bar label list of the web page are all of the nodes in the navigation bar label list of the web page; if not all child nodes of the body node have been traversed, determining another child node of the body node as the current node and returning to the step of judging whether the text of the current node contains a navigation bar separator.
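The traversal in claim 4 can be approximated with Python's standard `html.parser` module. The sketch below is a simplified reading, not the patented implementation, and rests on several assumptions that the claim does not state: it visits every descendant of the body node rather than only its direct children, it treats a text node containing ">" or "»" as the navigation bar separator, and it uses "is an `<a>` element with an `href`" as the preset label feature. The names `Node`, `TreeBuilder`, `has_label_feature`, and `navigation_labels` are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """Minimal DOM node; tag is '#text' for text nodes."""
    def __init__(self, tag, attrs=None, text=""):
        self.tag, self.attrs, self.text = tag, dict(attrs or []), text
        self.parent, self.children, self.visited = None, [], False


class TreeBuilder(HTMLParser):
    """Builds a small tree of Node objects from HTML source."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def _append(self, node):
        node.parent = self.current
        self.current.children.append(node)

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._append(node)
        if tag not in self.VOID:          # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current               # climb to the matching open element
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():                  # ignore whitespace-only text
            self._append(Node("#text", text=data))


def find(node, tag):
    """Depth-first search for the first element named tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        found = find(child, tag)
        if found is not None:
            return found
    return None


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def node_text(node):
    if node.tag == "#text":
        return node.text
    return "".join(node_text(child) for child in node.children)


def has_label_feature(node):
    # Assumption: a navigation label is a link element with an href.
    return node.tag == "a" and node.attrs.get("href") is not None


def navigation_labels(html, separators=(">", "»")):
    """Collect (text, href) pairs for siblings of separator text nodes."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    body = find(builder.root, "body")
    labels = []
    if body is None:
        return labels
    for current in descendants(body):
        is_separator = current.tag == "#text" and any(
            s in current.text for s in separators)
        if not is_separator:
            continue                      # move on to another node
        current.visited = True
        for sibling in current.parent.children:
            if sibling.visited:
                continue                  # only untraversed siblings
            sibling.visited = True
            if has_label_feature(sibling):
                labels.append((node_text(sibling).strip(), sibling.attrs["href"]))
    return labels
```

Marking nodes as visited mirrors the claim's "untraversed sibling node" check and keeps a label from being added twice when the navigation bar contains several separators. A ">" that appears in ordinary body text has no link siblings in this sketch, so it contributes nothing to the list.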
5. The method according to claim 4, characterized in that the preset label feature comprises any one of the following features:
being located above the title;
having an identifier provided between navigation links;
having a label form.
6. A web page information obtaining device, characterized in that the device comprises: a keyword obtaining unit, a retrieval unit, a source code obtaining unit, and a source code analysis unit,
the keyword obtaining unit being configured to obtain a web page keyword input by a user;
the retrieval unit being configured to search for the web page keyword through a search engine to obtain network links of a plurality of web pages corresponding to the web page keyword;
the source code obtaining unit being configured to obtain the web page source code of the plurality of web pages through the network links;
the source code analysis unit being configured to analyze and process the web page source code to obtain, from the web page source code, the network link of a body-text web page list page.
7. The device according to claim 6, characterized in that the device further comprises: a link selection module, configured to select the network links of body-text web pages from the network links before the source code obtaining unit obtains the web page source code;
the source code obtaining unit being specifically configured to obtain the web page source code of the body-text web pages through the network links of the body-text web pages.
8. The device according to claim 6 or 7, characterized in that the source code analysis unit comprises: a model obtaining subunit, a node traversal subunit, and a link determination subunit,
the model obtaining subunit being configured to analyze and process the web page source code to obtain the document object model (HTML DOM) of the web page;
the node traversal subunit being configured to traverse all nodes in the HTML DOM to obtain a navigation bar label list of the web page, wherein the navigation bar labels in the navigation bar label list are arranged in order;
the link determination subunit being configured to determine the network link corresponding to the penultimate navigation bar label in the navigation bar label list as the network link of the body-text web page list page corresponding to the web page.
9. The device according to claim 8, characterized in that the node traversal subunit comprises: a child node judgment subunit, a first current node determination subunit, a second current node determination subunit, a separator judgment subunit, a sibling node judgment subunit, a label feature judgment subunit, and a node placement subunit,
the child node judgment subunit being configured to judge whether the body node in the HTML DOM has child nodes and, if so, to trigger the first current node determination subunit;
the first current node determination subunit being configured to determine one child node of the body node as the current node;
the separator judgment subunit being configured to judge whether the text of the current node contains a navigation bar separator; if it contains a navigation bar separator, to trigger the sibling node judgment subunit; if it does not contain a navigation bar separator, to trigger the second current node determination subunit;
the sibling node judgment subunit being configured to judge whether the current node has an untraversed sibling node; if an untraversed sibling node exists, to trigger the label feature judgment subunit; if no untraversed sibling node exists, to trigger the traversal completion determination subunit;
the second current node determination subunit being configured to determine another child node of the body node as the current node and to trigger the separator judgment subunit;
the label feature judgment subunit being configured to determine the untraversed sibling node of the current node as the current node and to judge whether the current node has a preset label feature; if it has a preset label feature, to trigger the node placement subunit; if it does not have a preset label feature, to trigger the sibling node judgment subunit;
the node placement subunit being configured to place the current node in the navigation bar label list of the web page and to trigger the sibling node judgment subunit;
the traversal completion determination subunit being configured to judge whether all child nodes of the body node have been traversed; if all child nodes of the body node have been traversed, to trigger the list determination subunit; if not all child nodes of the body node have been traversed, to trigger the second current node determination subunit;
the list determination subunit being configured to determine that the nodes placed in the navigation bar label list of the web page are all of the nodes in the navigation bar label list of the web page.
10. The device according to claim 9, characterized in that the preset label feature comprises any one of the following features:
being located above the title;
having an identifier provided between navigation links;
having a label form.
CN201710157301.9A 2017-03-16 2017-03-16 Webpage information obtaining method and system Active CN106951505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710157301.9A CN106951505B (en) 2017-03-16 2017-03-16 Webpage information obtaining method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710157301.9A CN106951505B (en) 2017-03-16 2017-03-16 Webpage information obtaining method and system

Publications (2)

Publication Number Publication Date
CN106951505A (en) 2017-07-14
CN106951505B (en) 2021-02-02

Family

ID=59472623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710157301.9A Active CN106951505B (en) 2017-03-16 2017-03-16 Webpage information obtaining method and system

Country Status (1)

Country Link
CN (1) CN106951505B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107894974A (en) * 2017-11-02 2018-04-10 华南农业大学 Webpage context extraction method based on tag path and text punctuate than Fusion Features
CN112602078A (en) * 2018-06-21 2021-04-02 株式会社Tsunagu.AI Automatic generation system for webpage content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101237465A (en) * 2007-01-30 2008-08-06 中国科学院声学研究所 A webpage context extraction method based on quick Fourier conversion
CN102236867A (en) * 2011-08-15 2011-11-09 悠易互通(北京)广告有限公司 Cloud computing-based audience behavioral analysis advertisement targeting system
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device

Also Published As

Publication number Publication date
CN106951505B (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN101097578A (en) Network resource searching method and system
CN107066576A (en) A kind of big data web crawlers paging system of selection and system
CN102760151B (en) Implementation method of open source software acquisition and searching system
CN102779169A (en) Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label
CN104331438B (en) To novel web page contents selectivity abstracting method and device
CN104391978B (en) Web page storage processing method and processing device for browser
CN101630330A (en) Method for webpage classification
CN107016102A (en) A kind of big data web crawlers paging collocation method
CN102915361A (en) Webpage text extracting method based on character distribution characteristic
CN111897914A (en) Entity information extraction and knowledge graph construction method for field of comprehensive pipe gallery
CN110222251A (en) A kind of Service encapsulating method based on Web-page segmentation and searching algorithm
CN101310277B (en) Method of obtaining a representation of a text and system
CN106547749A (en) The method and apparatus of collecting webpage data
CN106874502A (en) A kind of method of video search, device and terminal
CN106951505A (en) Info web preparation method and system
CN106547803A (en) The method and apparatus for crawling website incremental resource
CN106933864A (en) A kind of search engine system and its searching method
CN106611029A (en) Method and device for improving site search efficiency in website
CN103617225A (en) Associated webpage searching method and system
JPH11110384A (en) Method and device for retrieving and displaying structured document
US8887037B1 (en) Scroll-free user interface and applications
CN104965902A (en) Enriched URL (uniform resource locator) recognition method and apparatus
CN112232075A (en) Article release time identification method based on time format and webpage element characteristics
CN106934036A (en) A kind of method and system of Network Learning Resource aggregate query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant