CN111625748A - Website navigation bar information extraction method and device, electronic equipment and storage medium


Info

Publication number
CN111625748A
Authority
CN
China
Prior art keywords
navigation bar
node
bar information
dom tree
information set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010484954.XA
Other languages
Chinese (zh)
Other versions
CN111625748B (en)
Inventor
祁俊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xiaoman Technology Co ltd
Original Assignee
Shenzhen Xiaoman Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xiaoman Technology Co ltd filed Critical Shenzhen Xiaoman Technology Co ltd
Priority to CN202010484954.XA priority Critical patent/CN111625748B/en
Publication of CN111625748A publication Critical patent/CN111625748A/en
Application granted granted Critical
Publication of CN111625748B publication Critical patent/CN111625748B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/954 Navigation, e.g. using categorised browsing

Abstract

The invention relates to the technical field of text extraction and provides a method and a device for extracting navigation bar information of a website, electronic equipment and a storage medium. The method comprises the following steps: downloading the main page source code and any sub-page source code of the website domain name of the enterprise to be extracted; acquiring a first HTML code and parsing it into a first node DOM tree, and acquiring a second HTML code and parsing it into a second node DOM tree; removing the external links from the first node DOM tree and the second node DOM tree to obtain a third node DOM tree and a fourth node DOM tree; extracting navigation bar information by using a NAV label method, an A label density method, a maximum public area method and a keyword link block method; performing duplicate removal and filtering; and calculating the node score of each node and outputting the navigation bar information of the enterprise to be extracted. Because the navigation bar information is extracted by these four methods together, the accuracy and efficiency of extracting navigation bar information from a page are improved.

Description

Website navigation bar information extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of text extraction, in particular to a method and a device for extracting navigation bar information of a website, electronic equipment and a storage medium.
Background
To display information such as enterprise culture, products, a brief introduction and contact information, an enterprise official website usually presents links to key information in a navigation bar at the top or on the left side of the page. To build an accurate content index of the enterprise official website, the navigation bar information must be extracted accurately. However, extraction is difficult because HTML (HyperText Markup Language), in which web pages are written, is permissive and page code is often written irregularly.
The prior art uses the NAV label method, but this method extracts navigation bar node information accurately only if the page uses HTML5 and the developers strictly follow the specification of the development manual. Therefore, for pages that do not use HTML5 or whose code is irregularly written, the extracted navigation bar node information is inaccurate or cannot be extracted at all.
Disclosure of Invention
In view of the above, there is a need for a method, an apparatus, an electronic device and a storage medium for extracting navigation bar information of a website, which can accurately and quickly extract navigation bar node information in any page.
The invention provides a method for extracting navigation bar information of a website, which comprises the following steps:
downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code in the main page source code, analyzing the first HTML code into a first node DOM tree, acquiring a second HTML code in the any sub-page source code, and analyzing the second HTML code into a second node DOM tree;
according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree, eliminating an external link in the first node DOM tree to obtain a third node DOM tree, and according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree, eliminating an external link in the second node DOM tree to obtain a fourth node DOM tree;
extracting navigation bar information of the third-node DOM tree by using a NAV label method to obtain a first navigation bar information set, extracting navigation bar information of the third-node DOM tree by using an A label density method to obtain a second navigation bar information set, extracting navigation bar information of the third-node DOM tree and the fourth-node DOM tree by using a maximum public area method to obtain a third navigation bar information set, and extracting navigation bar information of the third-node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set;
collecting the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
the navigation bar information of each node in the fifth navigation bar information set is subjected to duplication removal and filtering to obtain a sixth navigation bar information set;
calculating a node score for each node in the sixth navigation bar information set using a node scoring algorithm;
and sequencing all the nodes in the sixth navigation bar information set according to the node score of each node and outputting the navigation bar information of the enterprise to be extracted.
Preferably, the extracting the navigation bar information of the third node DOM tree by using the NAV label method to obtain the first navigation bar information set includes:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree; judging whether a NAV label exists in each PATH; when a NAV label exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attributes in the navigation bar information of all nodes in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; and when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set.
Preferably, the extracting the navigation bar information of the third node DOM tree by using the A label density method to obtain a second navigation bar information set includes:
extracting the href attributes in the navigation bar information of all nodes;
calculating the string length of the href attribute of each node, and combining the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets at the zero elements, wherein each subset comprises a plurality of consecutive non-zero string lengths;
accumulating the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the longest total string length;
and acquiring the navigation bar information of the nodes corresponding to the longest common prefix of all the PATHs, and merging it to obtain the second navigation bar information set.
Preferably, the extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using the maximum public area method to obtain a third navigation bar information set includes:
extracting the maximum continuous public subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting navigation bar information of all nodes in the maximum continuous public subset;
and merging the navigation bar information with the preset A label in the navigation bar information of all the nodes to obtain a third navigation bar information set.
Preferably, the extracting the navigation bar information of the third node DOM tree by using the keyword link block method to obtain a fourth navigation bar information set includes:
extracting the anchor texts in the navigation bar information of all nodes in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor texts belong to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes corresponding to the longest common prefix of all the PATHs, and merging it to obtain the fourth navigation bar information set.
Preferably, the removing duplication and filtering of the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set includes:
extracting anchor text and href attributes of each node in the fifth navigation bar information set;
when the anchor text and the href attribute of any two nodes are completely consistent, retaining the node appearing first, and deleting the node appearing later to obtain a target fifth navigation bar information set;
matching the anchor text and the href attribute of any node in the target fifth navigation bar information set with the text information in any one of a plurality of preset rule sets by utilizing regular matching;
and acquiring the navigation bar information which is unsuccessfully matched with the text information in any rule set in the target fifth navigation bar information set, and merging the acquired navigation bar information to obtain a sixth navigation bar information set.
Preferably, the calculating the node score of each node in the sixth navigation bar information set by using a node scoring algorithm includes:
calculating the PATH score of each node in the sixth navigation bar information set by adopting the following formula:
Figure BDA0002518730660000041
wherein α represents the PATH weight, a_node represents the node list of the sixth navigation bar information set, len(a_node) represents the number of nodes in the node list, index_node represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
calculating a score for the anchor text for each node in the sixth navigation bar information set using the following formula:
Figure BDA0002518730660000042
wherein β represents an anchor text weight, text represents an anchor text of each node, keywords represents a preset multilingual keyword set, and other represents that the anchor text of each node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
calculating the score of the href attribute of each node in the sixth navigation bar information set by adopting the following formula:
Figure BDA0002518730660000043
wherein γ represents the href attribute weight, layer represents the number of link layers in the href attribute of each node, and α + β + γ = 100;
and accumulating the score of the PATH, the score of the anchor text and the score of the href attribute of each node in the sixth navigation bar information set to obtain the node score of each node.
A second aspect of the present invention provides a navigation bar information extraction apparatus for a website, the apparatus comprising:
the analysis module is used for downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code in the main page source code, analyzing the first HTML code into a first node DOM tree, acquiring a second HTML code in the any sub-page source code, and analyzing the second HTML code into a second node DOM tree;
the removing module is used for removing the external link in the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the external link in the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
the extraction module is used for extracting navigation bar information of the third-node DOM tree by using a NAV label method to obtain a first navigation bar information set, extracting navigation bar information of the third-node DOM tree by using an A label density method to obtain a second navigation bar information set, extracting navigation bar information of the third-node DOM tree and the fourth-node DOM tree by using a maximum public area method to obtain a third navigation bar information set, and extracting navigation bar information of the third-node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set;
a merging module, configured to aggregate the first navigation bar information set, the second navigation bar information set, the third navigation bar information set, and the fourth navigation bar information set to obtain a fifth navigation bar information set;
the duplication removal filtering module is used for carrying out duplication removal and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
the calculation module is used for calculating the node score of each node in the sixth navigation bar information set by using a node scoring algorithm;
and the output module is used for sequencing all the nodes in the sixth navigation bar information set according to the node scores of all the nodes and outputting the navigation bar information of the enterprise to be extracted.
A third aspect of the present invention provides an electronic device, which includes a processor configured to implement the method for extracting navigation bar information of a website when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for extracting navigation bar information of a website.
In summary, the method, the apparatus, the electronic device and the storage medium for extracting navigation bar information of a website described in the present invention have the following advantages. On the one hand, navigation bar node information is extracted by using a NAV label method, an A label density method, a maximum public area method and a keyword link block method to obtain different navigation bar information sets, which avoids the poor extraction effect and inflexibility caused by web pages using different HTML versions or irregularly written web page code, and improves the accuracy and efficiency of extracting navigation bar information. On the other hand, the nodes in the fifth navigation bar information set are deduplicated and filtered, which reduces the number of nodes, improves the extraction efficiency, and filters out nodes that match the preset rule sets, so that navigation bar node information in any page is extracted accurately and quickly.
Drawings
Fig. 1 is a flowchart of a method for extracting navigation bar information of a website according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a navigation bar information extraction device of a website according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of a method for extracting navigation bar information of a website according to an embodiment of the present invention.
In this embodiment, the method for extracting navigation bar information of a website may be applied to an electronic device. For an electronic device that needs to extract navigation bar information of a website, the extraction function provided by the method of the present invention may be integrated directly on the electronic device, or may run on the electronic device in the form of a Software Development Kit (SDK).
As shown in fig. 1, the method for extracting information of a navigation bar of a website specifically includes the following steps, and according to different requirements, the order of the steps in the flowchart may be changed, and some of the steps may be omitted.
S11: downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code in the main page source code, analyzing the first HTML code into a first node DOM tree, acquiring a second HTML code in the any sub-page source code, and analyzing the second HTML code into a second node DOM tree.
In this embodiment, an enterprise that displays enterprise information such as enterprise culture, products, an introduction and contact information establishes a website and thereby obtains a website domain name. The main page source code and any sub-page source code of the website domain name are downloaded, where the link of the sub-page is an internal link of the website domain name. The JavaScript and CSS code in the main page source code and the sub-page source code is removed to obtain the HTML code, and the HTML code is parsed into a node DOM tree.
S12: according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree, eliminating the external links in the first node DOM tree to obtain a third node DOM tree, and according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree, eliminating the external links in the second node DOM tree to obtain a fourth node DOM tree.
In this embodiment, an external link refers to a link that connects the website of the enterprise to be extracted with an external website; it is identified by comparing the href attribute with the website domain name. Each node DOM tree is obtained by parsing the HTML code and contains the navigation bar information of each node, which comprises an href attribute, an anchor text, a PATH and a CLASS attribute.
In this embodiment, the external links are removed from the first node DOM tree and the second node DOM tree so that only the internal link data of the enterprise to be extracted remains, which improves the accuracy of the extracted navigation bar information of the enterprise to be extracted.
S13: extracting the navigation bar information of the third node DOM tree by using a NAV label method to obtain a first navigation bar information set, extracting the navigation bar information of the third node DOM tree by using an A label density method to obtain a second navigation bar information set, extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum public area method to obtain a third navigation bar information set, and extracting the navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set.
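As a non-limiting illustration of step S11, the following Python sketch parses HTML into a flat list of link records using only the standard library. The class name `LinkCollector`, the function `parse_links` and the record fields (`href`, `text`, `path`, `class`) are assumptions introduced for illustration; the source does not specify a parser or a data layout. Script and style content is not recorded because the sketch only keeps text found inside A labels.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the PATH stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkCollector(HTMLParser):
    """Collects one record per <a> element: href, anchor text, PATH, class."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root down to the current node
        self.records = []    # finished <a> records, in document order
        self.current = None  # the <a> record currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            attr = dict(attrs)
            self.current = {
                "href": attr.get("href") or "",
                "text": "",
                "path": "/" + "/".join(self.stack),
                "class": attr.get("class") or "",
            }

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag; ignore it
        # Pop down to the matching open tag to tolerate unclosed elements.
        while self.stack:
            popped = self.stack.pop()
            if popped == "a" and self.current is not None:
                self.current["text"] = " ".join(self.current["text"].split())
                self.records.append(self.current)
                self.current = None
            if popped == tag:
                break

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data


def parse_links(html):
    """Return the list of <a> records found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.records
```

Each record stands in for the "navigation bar information" of one node; a real implementation might instead keep a full DOM tree.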
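As a non-limiting illustration of step S12, the sketch below keeps only the nodes whose href resolves to the website domain name or one of its subdomains. The function name `remove_external_links`, the `base_url` parameter and the decision to drop non-HTTP links such as `mailto:` are assumptions made for illustration. Nodes with an empty href are kept, since the A label density method later relies on zero-length entries.

```python
from urllib.parse import urljoin, urlparse


def remove_external_links(nodes, domain, base_url):
    """Keep nodes whose href resolves to `domain` or one of its subdomains."""
    kept = []
    for node in nodes:
        href = node.get("href", "")
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(base_url, href)).hostname or ""
        if host == domain or host.endswith("." + domain):
            kept.append(node)
    return kept
```

For example, with `domain="example.com"` a link to `https://www.example.com/products` is kept, while a link to a social media site is removed.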
In this embodiment, different methods, such as a NAV label method, an a label density method, a maximum public area method, and a keyword link block method, are used to extract navigation bar node information to obtain different navigation bar information sets.
Preferably, the extracting the navigation bar information of the third node DOM tree by using the NAV label method to obtain the first navigation bar information set includes:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree; judging whether a NAV label exists in each PATH; when a NAV label exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attributes in the navigation bar information of all nodes in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; and when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set.
In this embodiment, the preset NAV keyword may be "NAV" or another similar NAV keyword. The NAV label defines a group of links used for page navigation, in which the link elements link to other pages or to other parts of the current page. Not all links need to be placed in a NAV label; only key links, such as the navigation bar of a page, are placed in it.
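As a non-limiting illustration, the NAV label method can be sketched as follows. The source presents the PATH test and the CLASS test as alternatives; this sketch accepts a node if either test succeeds, which is an assumption. The function name `nav_label_method`, the record fields and the default keyword tuple are likewise hypothetical.

```python
def nav_label_method(nodes, nav_keywords=("nav",)):
    """Select nodes that sit under a <nav> tag or whose class names a NAV keyword."""
    selected = []
    for node in nodes:
        # Compare whole tag names so that e.g. <canvas> is not mistaken for <nav>.
        path_tags = node["path"].lower().strip("/").split("/")
        class_value = node.get("class", "").lower()
        in_nav_tag = "nav" in path_tags
        has_nav_class = any(keyword in class_value for keyword in nav_keywords)
        if in_nav_tag or has_nav_class:
            selected.append(node)
    return selected
```

The CLASS test is what lets the method work on pages that do not use the HTML5 NAV label but still name their navigation containers conventionally.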
Preferably, the extracting the navigation bar information of the third node DOM tree by using the A label density method to obtain a second navigation bar information set includes:
extracting the href attributes in the navigation bar information of all nodes;
calculating the string length of the href attribute of each node, and combining the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets at the zero elements, wherein each subset comprises a plurality of consecutive non-zero string lengths;
accumulating the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the longest total string length;
and acquiring the navigation bar information of the nodes corresponding to the longest common prefix of all the PATHs, and merging it to obtain the second navigation bar information set.
In this embodiment, the A label defines a hyperlink, which links from one page to another page. The A label density method calculates the string length of the href attribute in the navigation bar information of all nodes in the third node DOM tree, deletes the nodes whose href attribute is empty, and groups the nodes corresponding to consecutive non-empty href attributes into subsets. For the subset with the longest total string length, the common prefix of the PATHs of its nodes is calculated, whether each common prefix satisfies the longest common prefix is judged, and the navigation bar information of all nodes that satisfy the requirement is extracted according to the judgment result to obtain the second navigation bar information set. Deleting the nodes whose href attribute is empty improves the efficiency of extracting the navigation bar information of the enterprise to be extracted.
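As a non-limiting illustration, the A label density method can be sketched as below. The node list is assumed to be in document order, with a zero href length wherever a node has no href. The final selection step follows one reading of the source, namely that every linked node whose PATH begins with the longest common prefix of the densest subset is returned; the helper names are hypothetical.

```python
def split_path(path):
    """Split a PATH such as '/html/body/nav/a' into its tag names."""
    return path.strip("/").split("/")


def common_prefix(paths):
    """Longest common prefix of several PATHs, compared tag by tag."""
    prefix = []
    for parts in zip(*(split_path(p) for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def nonzero_runs(lengths):
    """Split a length list at zero elements into runs of consecutive indices."""
    runs, current = [], []
    for index, length in enumerate(lengths):
        if length:
            current.append(index)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs


def a_label_density_method(nodes):
    """Return the linked nodes under the common prefix of the densest href run."""
    lengths = [len(node.get("href", "")) for node in nodes]
    runs = nonzero_runs(lengths)
    if not runs:
        return []
    # The densest run is the one with the largest total href string length.
    densest = max(runs, key=lambda run: sum(lengths[i] for i in run))
    prefix = common_prefix([nodes[i]["path"] for i in densest])
    return [
        node for node in nodes
        if node.get("href") and split_path(node["path"])[:len(prefix)] == prefix
    ]
```

The intuition is that a navigation bar appears as a long uninterrupted run of links, so the run with the largest accumulated href length is a strong candidate.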
Preferably, the extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using the maximum public area method to obtain a third navigation bar information set includes:
extracting the maximum continuous public subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting navigation bar information of all nodes in the maximum continuous public subset;
and merging the navigation bar information with the preset A label in the navigation bar information of all the nodes to obtain a third navigation bar information set.
In the embodiment, the navigation bar information of each node in the maximum continuous public subset in the main page and any sub-page is extracted by using the maximum public area method, whether the preset A label exists in the navigation bar information or not is judged, the navigation bar information without the preset A label is deleted quickly, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved.
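As a non-limiting illustration, the maximum public area method can be sketched with `difflib.SequenceMatcher` from the Python standard library, which finds the longest contiguous block shared by two sequences. Comparing nodes by the pair (anchor text, href) and treating a PATH that ends in `a` as "having the A label" are assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def max_common_area_method(main_nodes, sub_nodes):
    """Return the A-label nodes in the longest block shared by both pages."""
    def key(node):
        return (node.get("text", ""), node.get("href", ""))

    a = [key(node) for node in main_nodes]
    b = [key(node) for node in sub_nodes]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    block = main_nodes[match.a:match.a + match.size]
    # Keep only records whose PATH ends in an <a> tag.
    return [node for node in block if node["path"].rsplit("/", 1)[-1] == "a"]
```

The method relies on the observation that a navigation bar is usually repeated unchanged on the main page and on the sub-pages, while page-specific content differs.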
Preferably, the extracting the navigation bar information of the third node DOM tree by using the keyword link block method to obtain a fourth navigation bar information set includes:
extracting the anchor texts in the navigation bar information of all nodes in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor texts belong to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes corresponding to the longest common prefix of all the PATHs, and merging it to obtain the fourth navigation bar information set.
In this embodiment, the PATHs of the nodes whose anchor texts belong to the preset multilingual keyword set are extracted from the navigation bar information of all nodes; the navigation bar information of the nodes whose PATHs satisfy the longest common prefix of all these PATHs is acquired; and the navigation bar information of all acquired nodes is merged to obtain the fourth navigation bar information set.
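As a non-limiting illustration, the keyword link block method can be sketched as follows. The keyword set passed in is hypothetical, and returning every node whose PATH begins with the longest common prefix of the keyword hits is one reading of the source.

```python
def split_path(path):
    """Split a PATH such as '/html/body/nav/a' into its tag names."""
    return path.strip("/").split("/")


def common_prefix(paths):
    """Longest common prefix of several PATHs, compared tag by tag."""
    prefix = []
    for parts in zip(*(split_path(p) for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def keyword_link_block_method(nodes, keywords):
    """Return the nodes that share the common PATH prefix of the keyword hits."""
    hits = [n for n in nodes if n.get("text", "").strip().lower() in keywords]
    if not hits:
        return []
    prefix = common_prefix([n["path"] for n in hits])
    return [n for n in nodes if split_path(n["path"])[:len(prefix)] == prefix]
```

Because selection is by shared PATH prefix, a navigation entry whose anchor text is not in the keyword set is still recovered when its neighbours are.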
In the embodiment, different navigation bar information sets are obtained by extracting the navigation bar node information by using a NAV label method, an A label density method, a maximum public area method and a keyword link block method, so that the phenomena of poor extraction effect, inflexibility and the like caused by different HTML (hypertext markup language) versions used by webpages or irregular webpage code compiling are avoided, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved.
S14: collecting the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set.
In this embodiment, the different navigation bar information sets are merged to obtain the fifth navigation bar information set.
S15: carrying out duplication removal and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set.
In this embodiment, because different navigation bar information sets are merged, identical nodes may appear. Nodes whose anchor text and href attribute are identical are deleted first, and the remaining nodes in the fifth navigation bar information set are then filtered to obtain the sixth navigation bar information set.
Preferably, the removing duplication and filtering of the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set includes:
extracting anchor text and href attributes of each node in the fifth navigation bar information set;
when the anchor text and the href attribute of any two nodes are completely consistent, retaining the node appearing first, and deleting the node appearing later to obtain a target fifth navigation bar information set;
matching the anchor text and the href attribute of any node in the target fifth navigation bar information set with the text information in any one of a plurality of preset rule sets by utilizing regular matching;
and acquiring the navigation bar information which is unsuccessfully matched with the text information in any rule set in the target fifth navigation bar information set, and merging the acquired navigation bar information to obtain a sixth navigation bar information set.
In this embodiment, the preset rule set may include: the system comprises a mailbox address rule set, a telephone number rule set, a multilingual stop word set and the like, wherein the mailbox address rule set is a global unified mailbox format rule set, the telephone number rule set is a telephone number format rule set by each country, and the multilingual stop word set comprises stop word sets of common languages, including but not limited to Chinese, English, French, German, Russian, Japanese, Korean, Italian, Vietnamese, Thai and the like. After deleting the same node, matching the anchor text and the href attribute in the reserved node with any preset rule set in a plurality of preset rule sets, if the anchor text and the href attribute corresponding to any node are matched in any rule set, determining that the node is a company in the existing rule set, deleting the node, if the anchor text and the href attribute corresponding to any node are not matched in any rule set, reserving the node, extracting navigation bar information corresponding to any node, and combining the navigation bar information of all extracted nodes to obtain a sixth navigation bar information set, so that the data processing quantity is reduced.
In the embodiment, the number of the nodes is reduced by performing duplicate removal and filtering on the nodes in the fifth navigation bar information set, the navigation bar information extraction efficiency is improved, enterprises appearing in the preset rule set are filtered, and the accuracy of the extracted navigation bar information is improved.
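The deduplication and rule-set filtering described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the node dictionaries with "text" and "href" keys, the function names, the simplified e-mail and telephone patterns, and the tiny stop word set are all hypothetical stand-ins for the preset rule sets, which the text does not reproduce.

```python
import re

# Hypothetical, simplified stand-ins for the preset rule sets.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
STOP_WORDS = {"more", "click here", "read more"}


def matches_any_rule(node):
    """Return True if the anchor text or href matches any rule set."""
    for value in (node["text"], node["href"]):
        if EMAIL_RE.search(value) or PHONE_RE.search(value):
            return True
    return node["text"].strip().lower() in STOP_WORDS


def dedup_and_filter(nodes):
    """Keep the first node per (text, href) pair, then drop rule matches."""
    seen = set()
    unique = []
    for node in nodes:
        key = (node["text"], node["href"])
        if key not in seen:  # retain the node appearing first
            seen.add(key)
            unique.append(node)
    return [n for n in unique if not matches_any_rule(n)]
```

The simplified telephone pattern would also match long digit runs inside an href, so a production rule set would need to be considerably stricter.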
S16: calculating a node score for each node in the sixth navigation bar information set using a node scoring algorithm.
Preferably, the calculating the node score of each node in the sixth navigation bar information set by using a node scoring algorithm includes:
70) calculating the PATH score of each node in the sixth navigation bar information set by adopting formula (1):
[Formula (1): shown as image BDA0002518730660000111 in the original publication]
wherein α represents the PATH weight, a_node represents the node list of the sixth navigation bar information set, len(a_node) represents the number of nodes in the node list, index_node represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
71) calculating the score of the anchor text of each node in the sixth navigation bar information set by adopting a formula (2):
[Formula (2): shown as image BDA0002518730660000121 in the original publication]
wherein β represents an anchor text weight, text represents an anchor text of each node, keywords represents a preset multilingual keyword set, and other represents that the anchor text of each node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
72) calculating the score of the href attribute of each node in the sixth navigation bar information set by adopting a formula (3):
[Formula (3): shown as image BDA0002518730660000122 in the original publication]
wherein γ represents the href attribute weight, layer represents the number of link layers in the href attribute of each node, and α + β + γ = 100;
73) and accumulating the score of the PATH, the score of the anchor text and the score of the href attribute of each node in the sixth navigation bar information set to obtain the node score of each node.
In this embodiment, weights are set for the PATH, the anchor text, and the href attribute of each node, a score of the PATH, a score of the anchor text, and a score of the href attribute of each node are calculated to obtain a node score of each node, and the importance of each node in the navigation bar information is determined according to the node score of each node.
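The weighted accumulation of the three component scores can be illustrated with the Python sketch below. Formulas (1) to (3) are not reproduced in this text, so the component forms used here are illustrative assumptions only and are not the formulas of the original publication: the PATH score is assumed to fall with the position index, the anchor text score is assumed to be β for a keyword and 0 otherwise, and the href score is assumed to fall with the number of link layers. The weight values, the keyword set, and all function names are likewise hypothetical.

```python
from urllib.parse import urlparse

ALPHA, BETA, GAMMA = 40.0, 30.0, 30.0  # hypothetical weights; they sum to 100
KEYWORDS = {"home", "products", "about us", "contact"}  # hypothetical set


def fit(x):
    """Hypothetical fitting function f(); clamps its input to [0, 1]."""
    return max(0.0, min(1.0, x))


def href_layers(href):
    """Count the non-empty path segments of an href (at least 1)."""
    segments = [seg for seg in urlparse(href).path.split("/") if seg]
    return max(1, len(segments))


def node_score(index, total, text, href):
    """Sum the assumed PATH, anchor text, and href component scores."""
    # Assumed form of formula (1): earlier nodes score higher.
    path_score = ALPHA * fit(1.0 - index / total)
    # Assumed form of formula (2): full weight for a keyword, otherwise 0.
    text_score = BETA if text.strip().lower() in KEYWORDS else 0.0
    # Assumed form of formula (3): shallower links score higher.
    href_score = GAMMA / href_layers(href)
    return path_score + text_score + href_score
```

Because the three weights sum to 100, a node score under these assumptions lies between 0 and 100, which makes scores comparable across websites.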
S17: sorting all the nodes in the sixth navigation bar information set according to the node score of each node, and outputting the navigation bar information of the enterprise to be extracted.
In this embodiment, the nodes extracted from the enterprise to be extracted are ranked according to their node scores, and the navigation bar information of the enterprise to be extracted is sequentially output according to the importance degree of each node.
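The ranking step can be sketched in a few lines of Python. The sketch assumes that a higher node score means a more important node, so nodes are output from the highest score to the lowest; the text does not state the sort direction explicitly, and the function name is hypothetical.

```python
def rank_nodes(nodes, scores):
    """Return nodes ordered by score, highest first; ties keep input order."""
    order = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)
    return [nodes[i] for i in order]
```

Python's sort is stable, so nodes with equal scores keep the order in which they were extracted.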
In summary, according to the method for extracting navigation bar information of a website described in this embodiment, on one hand, different navigation bar information sets are obtained by extracting navigation bar node information by using a NAV label method, an a label density method, a maximum public area method, and a keyword link block method, so that the phenomena of poor extraction effect, inflexibility, and the like caused by different HTML versions used by webpages or irregular webpage code compiling are avoided, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved, and on the other hand, by performing deduplication filtering on nodes in a fifth navigation bar information set, the number of nodes is reduced, the navigation bar information extraction efficiency is improved, and enterprises appearing in a preset rule set are filtered, and the accuracy of the extracted navigation bar information is improved.
Example two
Fig. 2 is a structural diagram of a navigation bar information extraction device of a website according to a second embodiment of the present invention.
In some embodiments, the navigation bar information extraction device 20 of the website may include a plurality of functional modules composed of program code segments. The program codes of the various program segments in the navigation bar information extraction device 20 of the website may be stored in a memory of the electronic device and executed by the at least one processor to perform the extraction of the navigation bar information of the website (see Fig. 1 for details).
In this embodiment, the navigation bar information extraction device 20 of the website may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include an analysis module 201, a rejection module 202, an extraction module 203, a merging module 204, a deduplication filtering module 205, a calculation module 206 and an output module 207. A module referred to herein is a series of computer program segments that is stored in the memory, can be executed by at least one processor, and performs a fixed function. The functions of the modules are described in detail below.
The analysis module 201: is used for downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code in the main page source code, parsing the first HTML code into a first node DOM tree, acquiring a second HTML code in the sub-page source code, and parsing the second HTML code into a second node DOM tree.
In this embodiment, an enterprise that wishes to display enterprise information such as its culture, products, introduction, and contact information establishes a website and obtains a website domain name. The main page source code and any sub-page source code of the website domain name are downloaded, where the link of the sub-page is an internal link of the website domain name. The JavaScript and CSS codes in the main page source code and the sub-page source code are removed to obtain the HTML codes, and each HTML code is parsed into a node DOM tree.
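The parsing step can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not the parser used in the original publication: downloading is omitted, only A label nodes are recorded rather than a full DOM tree, and the record layout ("path", "href", "class", "text") and the names `AnchorCollector` and `parse_anchors` are hypothetical. Script and style content is ignored rather than stripped beforehand.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}  # JavaScript and CSS text is ignored


class AnchorCollector(HTMLParser):
    """Collect one record per <a> element: PATH, href, CLASS, anchor text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root down to the current tag
        self.nodes = []      # collected anchor records
        self.current = None  # anchor record being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:  # void elements never have a closing tag
            return
        self.stack.append(tag)
        if tag == "a":
            attr = dict(attrs)
            self.current = {
                "path": "/".join(self.stack),
                "href": attr.get("href") or "",
                "class": attr.get("class") or "",
                "text": "",
            }

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates sloppy HTML).
            while self.stack.pop() != tag:
                pass
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.nodes.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and not SKIPPED_TAGS & set(self.stack):
            self.current["text"] += data


def parse_anchors(html):
    """Parse an HTML string and return the collected anchor records."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes
```

Calling `parse_anchors` once on the main page and once on a sub-page would give the two node lists that the later steps compare.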
The rejection module 202: is used for removing the external links in the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the external links in the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree.
In this embodiment, the external link refers to a link in which an external website is linked to the website of the enterprise to be extracted, the node DOM tree is obtained by analyzing the HTML code, each node DOM tree includes navigation bar information of each node, and the navigation bar information of each node includes an href attribute, an anchor text, a PATH, and a CLASS attribute.
In this embodiment, the external links are removed from the first node DOM tree and the second node DOM tree, so that only the internal link data of the enterprise to be extracted are retained, and the accuracy of the extracted navigation bar information of the enterprise to be extracted is improved.
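External link removal can be sketched with Python's standard `urllib.parse` module. The sketch assumes that each node is a dictionary with an "href" key and that subdomains of the website domain name count as internal links; both are assumptions, as are the function names and the `base_url` parameter used to resolve relative links.

```python
from urllib.parse import urljoin, urlparse


def is_internal(href, domain, base_url):
    """Return True if href resolves to the domain or one of its subdomains."""
    host = urlparse(urljoin(base_url, href)).hostname or ""
    return host == domain or host.endswith("." + domain)


def remove_external_links(nodes, domain, base_url):
    """Keep only the nodes whose href is an internal link."""
    return [n for n in nodes if is_internal(n["href"], domain, base_url)]
```

Resolving each href against the page URL first means that relative links such as "/about" are treated as internal without special handling.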
The extraction module 203: is used for extracting navigation bar information of the third node DOM tree by using a NAV label method to obtain a first navigation bar information set, extracting navigation bar information of the third node DOM tree by using an A label density method to obtain a second navigation bar information set, extracting navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum public area method to obtain a third navigation bar information set, and extracting navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set.
In this embodiment, different methods, such as a NAV label method, an a label density method, a maximum public area method, and a keyword link block method, are used to extract navigation bar node information to obtain different navigation bar information sets.
Preferably, the extraction module 203 extracting the navigation bar information of the third node DOM tree by using the NAV label method to obtain the first navigation bar information set includes:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree; judging whether a NAV label exists in each PATH; when the NAV label exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set; or
Extracting CLASS attributes in the navigation bar information of all nodes in the DOM tree of the third node; judging whether each CLASS attribute contains a preset NAV keyword; and when any CLASS attribute contains the preset NAV keyword, extracting navigation bar information of a node corresponding to the any CLASS attribute, and combining the navigation bar information of all the extracted nodes to obtain a first navigation bar information set.
In this embodiment, the preset NAV keyword may be, for example, "NAV" or other NAV keywords. The NAV label defines a group of links used for page navigation, in which the link elements point to other pages or to other parts of the current page. Not all links need to be placed in a NAV label; only key links, such as the navigation bar of a page, are placed in it.
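The NAV label method can be sketched in Python as shown below. The sketch applies both alternatives at once, keeping a node when either its PATH contains a nav element or its CLASS attribute contains a preset NAV keyword, whereas the text presents them as alternatives. The node layout, the single keyword "nav", and the function name are hypothetical.

```python
NAV_KEYWORDS = ("nav",)  # hypothetical preset NAV keywords


def nav_label_method(nodes):
    """Keep nodes with a nav element in the PATH or a NAV keyword in CLASS."""
    selected = []
    for node in nodes:
        in_path = "nav" in node["path"].lower().split("/")
        in_class = any(k in node["class"].lower() for k in NAV_KEYWORDS)
        if in_path or in_class:
            selected.append(node)
    return selected
```

The CLASS check covers pages that predate the HTML5 nav element and mark their navigation bar with a class name such as "top-navbar" instead.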
Preferably, the extraction module 203 extracting the navigation bar information of the third node DOM tree by using the A label density method to obtain the second navigation bar information set includes:
extracting href attributes in navigation bar information of all nodes;
calculating the character string length of the href attribute of each node, and combining to obtain a href attribute length list;
dividing the href attribute length list into a plurality of subsets by a zero element, wherein each subset comprises a plurality of continuous non-zero character string lengths;
accumulating the character string length of the href attribute of each node in each subset to obtain the total character string length of each subset;
extracting a PATH of each node in a subset with the longest total character string length;
and acquiring the navigation bar information of the nodes whose PATHs satisfy the longest common prefix of all the PATHs, and merging to obtain a second navigation bar information set.
In this embodiment, the A label defines a hyperlink used to link from one page to another page. The A label density method calculates the character string length of the href attribute in the navigation bar information of all nodes in the third node DOM tree, deletes the nodes whose href attribute is empty, and groups the nodes corresponding to consecutive non-empty href attributes into subsets. The common prefix of the PATHs of the nodes in the subset with the longest total character string length is then calculated, and the navigation bar information of all nodes whose PATH satisfies the longest common prefix is extracted to obtain the second navigation bar information set. Deleting the nodes whose href attribute is empty improves the efficiency of extracting the navigation bar information of the enterprise to be extracted.
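The A label density steps can be sketched in Python as follows. The sketch assumes nodes are listed in document order as dictionaries with "path" and "href" keys, compares PATHs segment by segment, and reads the final step as keeping every linked node whose PATH starts with the longest common prefix; that reading and all function names are assumptions.

```python
def common_path_prefix(paths):
    """Longest common prefix of '/'-separated PATHs, compared per segment."""
    prefix = []
    for segments in zip(*(p.split("/") for p in paths)):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return "/".join(prefix)


def densest_run(nodes):
    """Split nodes at empty hrefs; return the run with the largest total length."""
    runs, current = [], []
    for node in nodes:
        if len(node["href"]) == 0:  # a zero element closes the current subset
            if current:
                runs.append(current)
            current = []
        else:
            current.append(node)
    if current:
        runs.append(current)
    return max(runs, key=lambda r: sum(len(n["href"]) for n in r), default=[])


def a_label_density_method(nodes):
    """Return the linked nodes under the common PATH prefix of the densest run."""
    run = densest_run(nodes)
    if not run:
        return []
    prefix = common_path_prefix([n["path"] for n in run])
    parts = prefix.split("/") if prefix else []
    # Assumed reading: keep every linked node whose PATH starts with the prefix.
    return [n for n in nodes
            if n["href"] and n["path"].split("/")[:len(parts)] == parts]
```

Comparing whole PATH segments rather than characters avoids treating "html/body/nav" and "html/body/navbar" as sharing the prefix "html/body/nav".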
Preferably, the extraction module 203 extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using the maximum public area method to obtain the third navigation bar information set includes:
extracting the maximum continuous public subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting navigation bar information of all nodes in the maximum continuous public subset;
and merging the navigation bar information that contains the preset A label among the navigation bar information of all the nodes to obtain a third navigation bar information set.
In the embodiment, the navigation bar information of each node in the maximum continuous public subset in the main page and any sub-page is extracted by using the maximum public area method, whether the preset A label exists in the navigation bar information or not is judged, the navigation bar information without the preset A label is deleted quickly, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved.
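The maximum public area method can be sketched with `difflib.SequenceMatcher` from the Python standard library, which finds the longest contiguous run shared by two sequences. Using `difflib` is a substitution chosen for this sketch and is not named in the text. The node layout with "tag", "text", and "href" keys, the choice of those three fields as the comparison key, and the function name are assumptions.

```python
from difflib import SequenceMatcher


def max_common_area(main_nodes, sub_nodes):
    """Return the <a> nodes in the longest contiguous run shared by both pages."""
    def key(node):
        return (node["tag"], node["text"], node["href"])

    a = [key(n) for n in main_nodes]
    b = [key(n) for n in sub_nodes]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    shared = main_nodes[match.a:match.a + match.size]
    # Keep only the navigation bar information that carries the A label.
    return [n for n in shared if n["tag"] == "a"]
```

The idea behind comparing the main page with a sub-page is that a navigation bar usually repeats unchanged on every page, while page-specific content does not.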
Preferably, the extraction module 203 extracting the navigation bar information of the third node DOM tree by using the keyword link block method to obtain the fourth navigation bar information set includes:
extracting anchor texts in the navigation bar information of all nodes in the DOM tree of the third node;
acquiring the PATHs of the nodes whose anchor texts in the navigation bar information of all the nodes belong to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes whose PATHs satisfy the longest common prefix of all the PATHs, and merging to obtain a fourth navigation bar information set.
In this embodiment, the PATHs of the nodes whose anchor texts belong to the preset multilingual keyword set are extracted from the navigation bar information of all the nodes, the longest common prefix of these PATHs is determined, and the navigation bar information of the nodes whose PATHs satisfy the longest common prefix is acquired. The navigation bar information of all the acquired nodes is merged to obtain the fourth navigation bar information set.
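The keyword link block method can be sketched in Python as follows. The sketch assumes nodes are dictionaries with "path" and "text" keys and reads the final step as keeping every node whose PATH starts with the longest common prefix of the keyword nodes, so that non-keyword links in the same block are also collected. The keyword set shown is a hypothetical English-only stand-in for the preset multilingual keyword set, and the function names are assumptions.

```python
KEYWORDS = {"home", "products", "about us", "contact us"}  # hypothetical set


def common_path_prefix(paths):
    """Longest common prefix of '/'-separated PATHs, compared per segment."""
    prefix = []
    for segments in zip(*(p.split("/") for p in paths)):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return "/".join(prefix)


def keyword_link_block_method(nodes):
    """Return the nodes in the block that contains the keyword links."""
    hits = [n for n in nodes if n["text"].strip().lower() in KEYWORDS]
    if not hits:
        return []
    prefix = common_path_prefix([n["path"] for n in hits])
    parts = prefix.split("/") if prefix else []
    # Assumed reading: keep every node whose PATH starts with the prefix.
    return [n for n in nodes if n["path"].split("/")[:len(parts)] == parts]
```

Under this reading, a link such as "Blog" that sits beside "Home" and "Contact us" in the same list is extracted even though it is not itself a keyword.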
In the embodiment, different navigation bar information sets are obtained by extracting the navigation bar node information by using a NAV label method, an A label density method, a maximum public area method and a keyword link block method, so that the phenomena of poor extraction effect, inflexibility and the like caused by different HTML (hypertext markup language) versions used by webpages or irregular webpage code compiling are avoided, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved.
The merging module 204: is used for merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set.
In this embodiment, different sets of navigation bar information are added to obtain a fifth set of navigation bar information.
The deduplication filtering module 205: is used for deduplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set.
In this embodiment, because different navigation bar information sets are merged, the same node may appear more than once. Duplicate nodes, that is, nodes with the same anchor text and href attribute, are deleted first, and the remaining nodes in the fifth navigation bar information set are then filtered to obtain the sixth navigation bar information set.
Preferably, the deduplication filtering module 205 deduplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain the sixth navigation bar information set includes:
extracting anchor text and href attributes of each node in the fifth navigation bar information set;
when the anchor text and the href attribute of any two nodes are completely consistent, retaining the node appearing first, and deleting the node appearing later to obtain a target fifth navigation bar information set;
matching the anchor text and the href attribute of each node in the target fifth navigation bar information set against the text information in each of a plurality of preset rule sets by regular expression matching;
and acquiring the navigation bar information in the target fifth navigation bar information set that fails to match the text information in every rule set, and merging the acquired navigation bar information to obtain a sixth navigation bar information set.
In this embodiment, the preset rule sets may include a mailbox address rule set, a telephone number rule set, a multilingual stop word set, and the like. The mailbox address rule set is a rule set for the globally unified mailbox address format, the telephone number rule set is a rule set for the telephone number formats of each country, and the multilingual stop word set comprises stop word sets of common languages, including but not limited to Chinese, English, French, German, Russian, Japanese, Korean, Italian, Vietnamese, and Thai. After identical nodes are deleted, the anchor text and the href attribute of each retained node are matched against each of the plurality of preset rule sets. If the anchor text and the href attribute of a node match any rule set, the node is determined to match an existing rule set and is deleted. If the anchor text and the href attribute of a node match none of the rule sets, the node is retained and its navigation bar information is extracted. The navigation bar information of all extracted nodes is merged to obtain the sixth navigation bar information set, which reduces the amount of data to be processed.
In the embodiment, the number of the nodes is reduced by performing duplicate removal and filtering on the nodes in the fifth navigation bar information set, the navigation bar information extraction efficiency is improved, enterprises appearing in the preset rule set are filtered, and the accuracy of the extracted navigation bar information is improved.
The calculation module 206: is used for calculating a node score for each node in the sixth navigation bar information set using a node scoring algorithm.
Preferably, the calculation module 206 calculating the node score of each node in the sixth navigation bar information set by using the node scoring algorithm includes:
70) calculating the PATH score of each node in the sixth navigation bar information set by adopting formula (1):
[Formula (1): shown as image BDA0002518730660000181 in the original publication]
wherein α represents the PATH weight, a_node represents the node list of the sixth navigation bar information set, len(a_node) represents the number of nodes in the node list, index_node represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
71) calculating the score of the anchor text of each node in the sixth navigation bar information set by adopting a formula (2):
[Formula (2): shown as image BDA0002518730660000182 in the original publication]
wherein β represents an anchor text weight, text represents an anchor text of each node, keywords represents a preset multilingual keyword set, and other represents that the anchor text of each node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
72) calculating the score of the href attribute of each node in the sixth navigation bar information set by adopting a formula (3):
[Formula (3): shown as image BDA0002518730660000183 in the original publication]
wherein γ represents the href attribute weight, layer represents the number of link layers in the href attribute of each node, and α + β + γ = 100;
73) and accumulating the score of the PATH, the score of the anchor text and the score of the href attribute of each node in the sixth navigation bar information set to obtain the node score of each node.
In this embodiment, weights are set for the PATH, the anchor text, and the href attribute of each node, a score of the PATH, a score of the anchor text, and a score of the href attribute of each node are calculated to obtain a node score of each node, and the importance of each node in the navigation bar information is determined according to the node score of each node.
The output module 207: is used for sorting all the nodes in the sixth navigation bar information set according to the node score of each node and outputting the navigation bar information of the enterprise to be extracted.
In this embodiment, the nodes extracted from the enterprise to be extracted are ranked according to their node scores, and the navigation bar information of the enterprise to be extracted is sequentially output according to the importance degree of each node.
In summary, the navigation bar information extraction apparatus for a website according to this embodiment extracts navigation bar node information by using a NAV label method, an a label density method, a maximum public area method, and a keyword link block method to obtain different navigation bar information sets, so as to avoid poor extraction effect, inflexibility, and the like caused by different HTML versions used by webpages or irregular webpage code compiling, and improve the accuracy and extraction efficiency of the extracted navigation bar information.
Example three
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not constitute a limitation of the embodiment of the present invention, and may be a bus-type configuration or a star-type configuration, and the electronic device 3 may include more or less other hardware or software than those shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 3 is only an example, and other existing or future electronic products, such as those that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 31 is used for storing program codes and various data, such as the navigation bar information extraction device 20 of a website installed in the electronic device 3, and realizes high-speed and automatic access to programs or data during the operation of the electronic device 3. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
In some embodiments, the at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is a Control Unit (Control Unit) of the electronic device 3, connects various components of the electronic device 3 by using various interfaces and lines, and executes various functions of the electronic device 3 and processes data, such as a navigation bar information extraction function of a website, by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, in conjunction with Fig. 2, the at least one processor 32 may execute an operating system of the electronic device 3 and various installed application programs (e.g. the navigation bar information extraction device 20 of the website), program codes, and the like, for example, the above modules.
The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules described in fig. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of extracting navigation bar information of a website.
In one embodiment of the present invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement a navigation bar information extraction function for a website.
Specifically, the at least one processor 32 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, and details are not repeated here.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for extracting navigation bar information of a website, characterized by comprising the following steps:
downloading the main page source code and the source code of any one sub-page corresponding to the website domain name of an enterprise to be extracted, acquiring a first HTML code from the main page source code, parsing the first HTML code into a first node DOM tree, acquiring a second HTML code from the sub-page source code, and parsing the second HTML code into a second node DOM tree;
removing external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
extracting navigation bar information of the third node DOM tree using a NAV tag method to obtain a first navigation bar information set, extracting navigation bar information of the third node DOM tree using an A tag density method to obtain a second navigation bar information set, extracting navigation bar information of the third node DOM tree and the fourth node DOM tree using a maximum common area method to obtain a third navigation bar information set, and extracting navigation bar information of the third node DOM tree using a keyword link block method to obtain a fourth navigation bar information set;
merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
performing deduplication and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
calculating a node score for each node in the sixth navigation bar information set using a node scoring algorithm;
and sorting all the nodes in the sixth navigation bar information set according to the node score of each node, and outputting the navigation bar information of the enterprise to be extracted.
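The external-link removal step of claim 1 can be illustrated with a minimal sketch. It assumes each node is a plain dict with an `href` key and that a link counts as internal when it is relative or when its host equals the site domain or is a subdomain of it; the names `is_internal` and `drop_external` are hypothetical and do not come from the patent.

```python
from urllib.parse import urlparse


def is_internal(href: str, domain: str) -> bool:
    """Return True if href stays on the given site (assumed rule, see lead-in)."""
    host = urlparse(href).hostname
    if host is None:
        # No host part: a relative link such as "/about". Scheme-only links
        # (mailto:, javascript:) also land here and are left to later filtering.
        return True
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def drop_external(nodes, domain):
    """Keep only the nodes whose href points inside the site domain."""
    return [n for n in nodes if is_internal(n["href"], domain)]


nodes = [
    {"href": "/about"},
    {"href": "https://www.example.com/products"},
    {"href": "https://other.org/partner"},
]
internal = drop_external(nodes, "example.com")
```

Matching on the parsed host rather than on a substring of the URL avoids treating a look-alike host such as `notexample.com` as internal.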
2. The method for extracting navigation bar information of a website according to claim 1, wherein extracting navigation bar information of the third node DOM tree using the NAV tag method to obtain the first navigation bar information set comprises:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree; judging whether a NAV tag exists in each PATH; when a NAV tag exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attributes in the navigation bar information of all nodes in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; and when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all extracted nodes to obtain the first navigation bar information set.
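The two alternatives of claim 2 can be sketched as follows. This is only an illustration: the node layout (dicts with `path` and `cls` keys), the XPath-like path format, and the keyword tuple `NAV_KEYWORDS` are assumptions, and the sketch applies both tests together although the claim presents them as alternatives.

```python
import re

# Hypothetical keyword list; the patent only says "preset NAV keyword".
NAV_KEYWORDS = ("nav", "menu")


def path_has_nav(path: str) -> bool:
    """True if any segment of an XPath-like path is a <nav> element."""
    segments = [re.sub(r"\[\d+\]$", "", s).lower() for s in path.split("/") if s]
    return "nav" in segments


def class_has_keyword(cls: str, keywords=NAV_KEYWORDS) -> bool:
    """True if the class attribute contains one of the preset keywords."""
    cls = cls.lower()
    return any(k in cls for k in keywords)


def nav_tag_method(nodes):
    """Collect nodes that pass either the PATH test or the CLASS test."""
    return [
        n for n in nodes
        if path_has_nav(n["path"]) or class_has_keyword(n.get("cls", ""))
    ]


nodes = [
    {"path": "/html/body/nav/ul/li[1]/a", "cls": ""},
    {"path": "/html/body/div[2]/a", "cls": "main-menu"},
    {"path": "/html/body/div[3]/p/a", "cls": "article-link"},
]
first_set = nav_tag_method(nodes)
```

Plain substring matching on the class attribute is simple but can over-match (for example a class named `canvas` contains `nav`), so a real keyword list would need tuning.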
3. The method for extracting navigation bar information of a website according to claim 1, wherein extracting navigation bar information of the third node DOM tree using the A tag density method to obtain the second navigation bar information set comprises:
extracting the href attributes from the navigation bar information of all nodes;
calculating the string length of the href attribute of each node, and combining the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets using zero elements as separators, wherein each subset comprises a plurality of consecutive non-zero string lengths;
accumulating the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the largest total string length;
and acquiring the navigation bar information of the nodes that match the longest common prefix of all the extracted PATHs, and merging it to obtain the second navigation bar information set.
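The steps of claim 3 can be sketched in a few lines. The sketch assumes the nodes are listed in document order as dicts with `path` and `href` keys, that non-link nodes carry an empty `href`, and that the common prefix is taken segment by segment; all function names are hypothetical.

```python
from itertools import groupby


def common_path_prefix(paths):
    """Longest common prefix of XPath-like paths, compared segment by segment."""
    split = [p.strip("/").split("/") for p in paths]
    prefix = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/" + "/".join(prefix)


def under(path, prefix):
    """True if path equals prefix or lies below it."""
    base = prefix.rstrip("/")
    return path == base or path.startswith(base + "/")


def densest_run(nodes):
    """Return the run of consecutive link nodes with the largest total href length."""
    lengths = [len(n.get("href") or "") for n in nodes]
    best, best_total = [], 0
    # Zero lengths act as separators between runs of non-zero lengths.
    for is_link, group in groupby(range(len(nodes)), key=lambda i: lengths[i] > 0):
        idx = list(group)
        if is_link and sum(lengths[i] for i in idx) > best_total:
            best, best_total = idx, sum(lengths[i] for i in idx)
    return [nodes[i] for i in best]


def a_tag_density_method(nodes):
    run = densest_run(nodes)
    if not run:
        return []
    prefix = common_path_prefix([n["path"] for n in run])
    return [n for n in nodes if n.get("href") and under(n["path"], prefix)]


nodes = [
    {"path": "/html/body/div/p", "href": ""},
    {"path": "/html/body/ul/li[1]/a", "href": "/about"},
    {"path": "/html/body/ul/li[2]/a", "href": "/products"},
    {"path": "/html/body/div/span", "href": ""},
    {"path": "/html/body/footer/a", "href": "/x"},
]
second_set = a_tag_density_method(nodes)
```

In this example the two list links form the densest run, their common prefix is `/html/body/ul`, and the isolated footer link is left out.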
4. The method for extracting navigation bar information of a website according to claim 1, wherein extracting navigation bar information of the third node DOM tree and the fourth node DOM tree using the maximum common area method to obtain the third navigation bar information set comprises:
extracting the maximum contiguous common subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting the navigation bar information of all nodes in the maximum contiguous common subset;
and merging, from the navigation bar information of all those nodes, the navigation bar information that has the preset A tag, to obtain the third navigation bar information set.
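One way to picture claim 4 is as a longest contiguous match between the node sequences of the main page and the sub-page. The sketch below uses Python's `difflib.SequenceMatcher` for that match, which is a substitute technique chosen for brevity and not necessarily what the patent uses; the node layout (`tag`, `text`, `href`) and the function names are assumptions.

```python
from difflib import SequenceMatcher


def node_key(node):
    """Hashable fingerprint used to compare nodes across the two pages."""
    return (node.get("tag"), node.get("text"), node.get("href"))


def max_common_area_method(home_nodes, sub_nodes):
    """Return the <a> nodes inside the longest block shared by both pages."""
    a = [node_key(n) for n in home_nodes]
    b = [node_key(n) for n in sub_nodes]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    block = home_nodes[m.a:m.a + m.size]
    # Keep only anchor elements, mirroring the "preset A tag" condition.
    return [n for n in block if n.get("tag") == "a"]


home = [
    {"tag": "h1", "text": "Welcome", "href": ""},
    {"tag": "a", "text": "Home", "href": "/"},
    {"tag": "a", "text": "About", "href": "/about"},
    {"tag": "p", "text": "Intro", "href": ""},
]
sub = [
    {"tag": "h1", "text": "About us", "href": ""},
    {"tag": "a", "text": "Home", "href": "/"},
    {"tag": "a", "text": "About", "href": "/about"},
    {"tag": "p", "text": "Story", "href": ""},
]
third_set = max_common_area_method(home, sub)
```

The intuition is that a navigation bar is repeated unchanged on every page, so the longest shared block is a good candidate for it.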
5. The method for extracting navigation bar information of a website according to claim 1, wherein extracting navigation bar information of the third node DOM tree using the keyword link block method to obtain the fourth navigation bar information set comprises:
extracting the anchor texts in the navigation bar information of all nodes in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor texts belong to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes that match the longest common prefix of all the acquired PATHs, and merging it to obtain the fourth navigation bar information set.
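A minimal sketch of claim 5 follows. The keyword set `KEYWORDS`, the node layout (`path`, `text`, `href`), and the segment-wise prefix rule are assumptions made for illustration only.

```python
# Hypothetical multilingual keyword set; the patent does not list its contents.
KEYWORDS = {"home", "about", "about us", "products", "contact", "contact us",
            "首页", "关于我们", "联系我们"}


def common_path_prefix(paths):
    """Longest common prefix of XPath-like paths, compared segment by segment."""
    split = [p.strip("/").split("/") for p in paths]
    prefix = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/" + "/".join(prefix)


def keyword_link_block_method(nodes, keywords=KEYWORDS):
    """Find the block that contains all keyword links and return its links."""
    hits = [n["path"] for n in nodes
            if (n.get("text") or "").strip().lower() in keywords]
    if not hits:
        return []
    base = common_path_prefix(hits).rstrip("/")
    return [n for n in nodes
            if n.get("href") and (n["path"] == base or n["path"].startswith(base + "/"))]


nodes = [
    {"path": "/html/body/header/ul/li[1]/a", "text": "Home", "href": "/"},
    {"path": "/html/body/header/ul/li[2]/a", "text": "Blog", "href": "/blog"},
    {"path": "/html/body/header/ul/li[3]/a", "text": "Contact", "href": "/contact"},
    {"path": "/html/body/footer/a", "text": "Privacy", "href": "/privacy"},
]
fourth_set = keyword_link_block_method(nodes)
```

Here "Blog" is not a keyword, but it sits in the same block as "Home" and "Contact", so the common-prefix step picks it up as well.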
6. The method for extracting navigation bar information of a website according to claim 1, wherein performing deduplication and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain the sixth navigation bar information set comprises:
extracting the anchor text and the href attribute of each node in the fifth navigation bar information set;
when the anchor texts and the href attributes of any two nodes are identical, retaining the node that appears first and deleting the node that appears later, to obtain a target fifth navigation bar information set;
matching the anchor text and the href attribute of each node in the target fifth navigation bar information set against the text information in each of a plurality of preset rule sets by regular expression matching;
and acquiring the navigation bar information in the target fifth navigation bar information set that does not match the text information in any of the rule sets, and merging the acquired navigation bar information to obtain the sixth navigation bar information set.
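The deduplication and filtering of claim 6 can be sketched as below. The patterns in `RULES` are invented examples standing in for the "preset rule sets", and the node layout (`text`, `href`) is an assumption.

```python
import re

# Hypothetical rule set: pseudo-links and account-related entries are dropped.
RULES = [
    re.compile(r"^(javascript:|mailto:|tel:|#)", re.IGNORECASE),
    re.compile(r"(login|sign ?in|register)", re.IGNORECASE),
]


def dedupe(nodes):
    """Keep the first node for each (anchor text, href) pair."""
    seen, out = set(), []
    for n in nodes:
        key = (n["text"], n["href"])
        if key not in seen:
            seen.add(key)
            out.append(n)
    return out


def filter_nodes(nodes, rules=RULES):
    """Drop nodes whose anchor text or href matches any rule."""
    return [
        n for n in nodes
        if not any(r.search(n["text"]) or r.search(n["href"]) for r in rules)
    ]


fifth_set = [
    {"text": "Home", "href": "/"},
    {"text": "Home", "href": "/"},
    {"text": "Login", "href": "/login"},
    {"text": "Call us", "href": "tel:123"},
    {"text": "About", "href": "/about"},
]
sixth_set = filter_nodes(dedupe(fifth_set))
```

Deduplicating before filtering keeps the first occurrence in document order, which matters later because the ordering of the list is an input to the scoring step.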
7. The method for extracting navigation bar information of a website according to claim 1, wherein calculating the node score of each node in the sixth navigation bar information set using the node scoring algorithm comprises:
calculating the PATH score of each node in the sixth navigation bar information set by adopting the following formula:
Figure FDA0002518730650000031
wherein α represents the PATH weight, a_node represents the node list of the sixth navigation bar information set, len(a_node) represents the number of nodes in the node list, index_node represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
calculating the score of the anchor text of each node in the sixth navigation bar information set by adopting the following formula:
Figure FDA0002518730650000041
wherein β represents the anchor text weight, text represents the anchor text of each node, keywords represents the preset multilingual keyword set, and other denotes the case in which the anchor text of a node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
calculating the score of the href attribute of each node in the sixth navigation bar information set by adopting the following formula:
Figure FDA0002518730650000042
wherein γ represents the href attribute weight, layer represents the number of link levels in the href attribute of each node, and α + β + γ = 100;
and accumulating the score of the PATH, the score of the anchor text and the score of the href attribute of each node in the sixth navigation bar information set to obtain the node score of each node.
8. An apparatus for extracting navigation bar information of a website, comprising:
a parsing module, configured to download the main page source code and the source code of any one sub-page corresponding to the website domain name of an enterprise to be extracted, acquire a first HTML code from the main page source code, parse the first HTML code into a first node DOM tree, acquire a second HTML code from the sub-page source code, and parse the second HTML code into a second node DOM tree;
a removing module, configured to remove external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and to remove external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
an extraction module, configured to extract navigation bar information of the third node DOM tree using a NAV tag method to obtain a first navigation bar information set, extract navigation bar information of the third node DOM tree using an A tag density method to obtain a second navigation bar information set, extract navigation bar information of the third node DOM tree and the fourth node DOM tree using a maximum common area method to obtain a third navigation bar information set, and extract navigation bar information of the third node DOM tree using a keyword link block method to obtain a fourth navigation bar information set;
a merging module, configured to merge the first navigation bar information set, the second navigation bar information set, the third navigation bar information set, and the fourth navigation bar information set to obtain a fifth navigation bar information set;
a deduplication and filtering module, configured to perform deduplication and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
a calculation module, configured to calculate the node score of each node in the sixth navigation bar information set using a node scoring algorithm;
and an output module, configured to sort all the nodes in the sixth navigation bar information set according to the node score of each node and output the navigation bar information of the enterprise to be extracted.
9. An electronic device, characterized in that the electronic device comprises a processor for implementing the navigation bar information extraction method of a website according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a navigation bar information extraction method for a website according to any one of claims 1 to 7.
CN202010484954.XA 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium Active CN111625748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484954.XA CN111625748B (en) 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010484954.XA CN111625748B (en) 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111625748A CN111625748A (en) 2020-09-04
CN111625748B CN111625748B (en) 2024-01-09

Family

ID=72272649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484954.XA Active CN111625748B (en) 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111625748B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112230989A (en) * 2020-12-14 2021-01-15 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN112528117A (en) * 2020-12-11 2021-03-19 杭州安恒信息技术股份有限公司 Recognition method and related device for government affair website primary catalog
CN114610985A (en) * 2022-05-10 2022-06-10 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114817811A (en) * 2022-05-07 2022-07-29 盐城金堤科技有限公司 Website analysis method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN103778104A (en) * 2012-10-22 2014-05-07 富士通株式会社 Information processing device, information processing method and electronic device
CN103823824A (en) * 2013-11-12 2014-05-28 哈尔滨工业大学深圳研究生院 Method and system for automatically constructing text classification corpus by aid of internet
CN105069107A (en) * 2015-08-07 2015-11-18 北京百度网讯科技有限公司 Method and device for monitoring website
CN106294722A (en) * 2016-08-09 2017-01-04 上海资誉网络科技有限公司 A kind of web page contents extraction method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528117A (en) * 2020-12-11 2021-03-19 杭州安恒信息技术股份有限公司 Recognition method and related device for government affair website primary catalog
CN112528117B (en) * 2020-12-11 2023-03-14 杭州安恒信息技术股份有限公司 Recognition method and related device for government affair website primary catalog
CN112230989A (en) * 2020-12-14 2021-01-15 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN114817811A (en) * 2022-05-07 2022-07-29 盐城金堤科技有限公司 Website analysis method and device
CN114817811B (en) * 2022-05-07 2024-03-19 盐城天眼察微科技有限公司 Website analysis method and device
CN114610985A (en) * 2022-05-10 2022-06-10 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium
CN114610985B (en) * 2022-05-10 2022-08-19 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111625748B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN111625748B (en) Navigation bar information extraction method and device of website, electronic equipment and storage medium
CN103294781B (en) A kind of method and apparatus for processing page data
CN109145215A (en) Internet public opinion analysis method, apparatus and storage medium
CN111241389A (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN103942211B (en) A kind of recognition methods of text page and device
JPWO2019224891A1 (en) Classification device, classification method, generation method, classification program and generation program
CN112818200A (en) Data crawling and event analyzing method and system based on static website
JP4724158B2 (en) Method and apparatus for automatic form filling in mobile devices
CN113886204A (en) User behavior data collection method and device, electronic equipment and readable storage medium
CN112232075A (en) Article release time identification method based on time format and webpage element characteristics
CN111949849B (en) Fish information acquisition method and device, electronic equipment and readable storage medium
CN113468288A (en) Content extraction method of text courseware based on artificial intelligence and related equipment
CN109033370A (en) A kind of method and device that searching similar shop, the method and device of shop access
CN112667208A (en) Translation error recognition method and device, computer equipment and readable storage medium
CN112069824A (en) Region identification method, device and medium based on context probability and citation
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
CN112230989B (en) Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN112667874A (en) Webpage data extraction method and device, electronic equipment and storage medium
CN111026942A (en) Hot word extraction method, device, terminal and medium based on web crawler
CN113204962A (en) Word sense disambiguation method, device, equipment and medium based on graph expansion structure
CN113806311A (en) Deep learning-based file classification method and device, electronic equipment and medium
CN111950037A (en) Detection method, detection device, electronic equipment and storage medium
CN111581950A (en) Method for determining synonym and method for establishing synonym knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant