CN111625748B - Navigation bar information extraction method and device of website, electronic equipment and storage medium - Google Patents

Info

Publication number
CN111625748B
Authority
CN
China
Prior art keywords
navigation bar
bar information
node
dom tree
information set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010484954.XA
Other languages
Chinese (zh)
Other versions
CN111625748A (en
Inventor
祁俊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xiaoman Technology Co ltd
Original Assignee
Shenzhen Xiaoman Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xiaoman Technology Co ltd
Priority to CN202010484954.XA
Publication of CN111625748A
Application granted
Publication of CN111625748B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/954 Navigation, e.g. using categorised browsing

Abstract

The invention relates to the technical field of text extraction and provides a method, a device, electronic equipment and a storage medium for extracting navigation bar information of a website. The method comprises the following steps: downloading the source code of the main page and of any sub-page under the website domain name of the enterprise to be extracted, acquiring a first HTML code and parsing it into a first node DOM tree, and acquiring a second HTML code and parsing it into a second node DOM tree; removing the external links from the first node DOM tree and the second node DOM tree to obtain a third node DOM tree and a fourth node DOM tree; extracting navigation bar information by using a NAV tag method, an A tag density method, a maximum common area method and a keyword link block method; and then de-duplicating and filtering the results, calculating a node score for each node, and outputting the navigation bar information of the enterprise to be extracted. Because the navigation bar information is extracted by the NAV tag method, the A tag density method, the maximum common area method and the keyword link block method together, the accuracy and efficiency of extracting navigation bar information from a page are improved.

Description

Navigation bar information extraction method and device of website, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of text extraction, in particular to a navigation bar information extraction method and device of a website, electronic equipment and a storage medium.
Background
To present information such as corporate culture, products, company introduction and contact details, an enterprise's official website usually displays links to key information in a navigation bar at the top or left side of the page. Building an accurate content index of the official website therefore requires the navigation bar information to be extracted accurately. However, because HTML is a permissive language and web page code is often written irregularly, navigation bar extraction is difficult to implement.
The prior art uses the NAV tag method, but this method extracts navigation bar node information accurately only when the page uses HTML5 and the developer strictly follows the development specification. For pages that do not use HTML5, or whose code is not written to specification, the extracted navigation bar node information is inaccurate, or cannot be extracted at all.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a method, an apparatus, an electronic device, and a storage medium for extracting navigation bar information of a website, which can accurately and rapidly extract navigation bar node information in any page.
The first aspect of the invention provides a navigation bar information extraction method of a website, which comprises the following steps:
downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code from the main page source code, parsing the first HTML code into a first node DOM tree, acquiring a second HTML code from the sub-page source code, and parsing the second HTML code into a second node DOM tree;
removing the external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
extracting navigation bar information of the third node DOM tree by using a NAV tag method to obtain a first navigation bar information set, extracting navigation bar information of the third node DOM tree by using an A tag density method to obtain a second navigation bar information set, extracting navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum common area method to obtain a third navigation bar information set, and extracting navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set;
merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
de-duplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
calculating a node score of each node in the sixth navigation bar information set by using a node scoring algorithm;
and sorting all the nodes in the sixth navigation bar information set according to the node score of each node, and outputting the navigation bar information of the enterprise to be extracted.
Preferably, extracting the navigation bar information of the third node DOM tree by using the NAV tag method to obtain the first navigation bar information set includes:
extracting the PATH in the navigation bar information of every node in the third node DOM tree; judging whether a NAV tag exists in each PATH; when a NAV tag exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attribute in the navigation bar information of every node in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set.
Preferably, extracting the navigation bar information of the third node DOM tree by using the A tag density method to obtain the second navigation bar information set includes:
extracting the href attribute in the navigation bar information of every node;
calculating the string length of the href attribute of each node, and merging the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets at its zero elements, wherein each subset contains a plurality of consecutive non-zero string lengths;
summing the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the largest total string length;
and acquiring the navigation bar information of the nodes whose PATH common prefix is the longest common prefix among all those PATHs, and merging that navigation bar information to obtain the second navigation bar information set.
Preferably, extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using the maximum common area method to obtain the third navigation bar information set includes:
extracting the maximum continuous common subset of the navigation bar information of the nodes in the third node DOM tree and the navigation bar information of the nodes in the fourth node DOM tree;
extracting the navigation bar information of all nodes in the maximum continuous common subset;
and, from the navigation bar information of all those nodes, merging the navigation bar information that has the preset A tag to obtain the third navigation bar information set.
Preferably, extracting the navigation bar information of the third node DOM tree by using the keyword link block method to obtain the fourth navigation bar information set includes:
extracting the anchor text in the navigation bar information of every node in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor text belongs to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes whose PATH common prefix is the longest common prefix among all those PATHs, and merging that navigation bar information to obtain the fourth navigation bar information set.
Preferably, de-duplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain the sixth navigation bar information set includes:
extracting the anchor text and href attribute of each node in the fifth navigation bar information set;
when the anchor text and href attribute of any two nodes are completely identical, keeping the node that appears first and deleting the node that appears later, to obtain a target fifth navigation bar information set;
matching, by regular-expression matching, the anchor text and href attribute of any node in the target fifth navigation bar information set against the text information in any one of a plurality of preset rule sets;
and acquiring the navigation bar information in the target fifth navigation bar information set that fails to match the text information in any rule set, and merging the acquired navigation bar information to obtain the sixth navigation bar information set.
Preferably, calculating the node score of each node in the sixth navigation bar information set by using a node scoring algorithm includes:
calculating the PATH score of each node in the sixth navigation bar information set by adopting the following formula:
wherein α represents the PATH weight, a_note represents the node list of the sixth navigation bar information set, len(a_note) represents the number of nodes in the node list, index_note represents the position subscript of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
calculating the anchor text score of each node in the sixth navigation bar information set by adopting the following formula:
wherein β represents the anchor text weight, text represents the anchor text of each node, keyword represents a preset multilingual keyword set, and other represents that the anchor text of a node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
calculating the href attribute score of each node in the sixth navigation bar information set by adopting the following formula:
wherein γ represents the href attribute weight, layer represents the number of link levels in the href attribute of each node, and α+β+γ=100;
and summing the PATH score, the anchor text score and the href attribute score of each node in the sixth navigation bar information set to obtain the node score of each node.
A second aspect of the present invention provides a navigation bar information extraction apparatus of a website, the apparatus comprising:
the parsing module is used for downloading a main page source code and any sub-page source code corresponding to the website domain name of an enterprise to be extracted, acquiring a first HTML code from the main page source code, parsing the first HTML code into a first node DOM tree, acquiring a second HTML code from the sub-page source code, and parsing the second HTML code into a second node DOM tree;
the removal module is used for removing the external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
the extraction module is used for extracting navigation bar information of the third node DOM tree by using a NAV tag method to obtain a first navigation bar information set, extracting navigation bar information of the third node DOM tree by using an A tag density method to obtain a second navigation bar information set, extracting navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum common area method to obtain a third navigation bar information set, and extracting navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set;
the merging module is used for merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
the de-duplication and filtering module is used for de-duplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
the calculation module is used for calculating the node score of each node in the sixth navigation bar information set by using a node scoring algorithm;
and the output module is used for sorting all the nodes in the sixth navigation bar information set according to the node score of each node and outputting the navigation bar information of the enterprise to be extracted.
A third aspect of the present invention provides an electronic device, including a processor configured to implement the navigation bar information extraction method of the website when executing a computer program stored in a memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the navigation bar information extraction method of a website.
In summary, the method, device, electronic equipment and storage medium for extracting navigation bar information of a website provided by the invention have the following effects. On the one hand, different navigation bar information sets are obtained by extracting navigation bar node information with the NAV tag method, the A tag density method, the maximum common area method and the keyword link block method, which avoids the poor extraction results and inflexibility caused by web pages using different HTML versions or by irregularly written web page code, and improves the accuracy and efficiency of extraction. On the other hand, de-duplication and filtering reduce the number of nodes in the fifth navigation bar information set, which improves extraction efficiency, and filtering out the nodes that match a preset rule set allows the navigation bar node information in any page to be extracted accurately and rapidly.
Drawings
Fig. 1 is a flowchart of a method for extracting navigation bar information of a website according to an embodiment of the present invention.
Fig. 2 is a block diagram of a navigation bar information extraction device of a website according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
The invention will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example 1
Fig. 1 is a flowchart of a method for extracting navigation bar information of a website according to an embodiment of the present invention.
In this embodiment, the method for extracting navigation bar information of a website may be applied to an electronic device. For an electronic device that needs to extract navigation bar information of a website, the extraction function provided by the method of the present invention may be integrated directly on the electronic device, or may run on the electronic device in the form of a software development kit (Software Development Kit, SDK).
As shown in Fig. 1, the method for extracting navigation bar information of a website specifically includes the following steps. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S11: downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code from the main page source code, parsing the first HTML code into a first node DOM tree, acquiring a second HTML code from the sub-page source code, and parsing the second HTML code into a second node DOM tree.
In this embodiment, an enterprise that wishes to display enterprise information such as corporate culture, products, company introduction and contact details establishes a website and obtains a website domain name. The main page source code and any sub-page source code under the website domain name are downloaded, where the link of the sub-page is an internal link of the website domain name. The JavaScript and CSS code is removed from the main page source code and the sub-page source code to obtain the HTML code, and the HTML code is parsed into a node DOM tree.
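As a minimal sketch of step S11, the following Python uses only the standard library's `html.parser` to skip JavaScript and CSS content and to record, for each link, the four fields that this description later calls navigation bar information (href attribute, anchor text, PATH and CLASS attribute). The names `NavNode`, `AnchorCollector` and `parse_nodes` are hypothetical, the page is assumed to have been downloaded already, and representing the DOM tree as a flat list of link nodes is a simplification made for illustration rather than the structure the patent prescribes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept out of the PATH stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class NavNode:  # hypothetical container for one node's navigation bar information
    href: str
    text: str
    path: str
    css_class: str


class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # open tags from the root down to the current element
        self.skip_depth = 0    # greater than 0 while inside <script> or <style>
        self.nodes = []
        self._current = None   # the <a> element currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            a = dict(attrs)
            self._current = NavNode(a.get("href") or "", "",
                                    "/".join(self.stack), a.get("class") or "")

    def handle_data(self, data):
        # Text inside <script>/<style> is ignored; text inside <a> is anchor text.
        if self._current is not None and self.skip_depth == 0:
            self._current.text += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        if tag == "a" and self._current is not None:
            self._current.text = self._current.text.strip()
            self.nodes.append(self._current)
            self._current = None
        if tag in self.stack:  # pop back to the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass


def parse_nodes(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes
```

Calling `parse_nodes` once on the main page and once on the sub-page would give the two node lists that the later steps compare. A production parser would need to be more tolerant of malformed markup than this sketch.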
S12: removing the external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree.
In this embodiment, an external link refers to a link through which an external website links to the website of the enterprise to be extracted. Parsing the HTML code yields a node DOM tree, and each node DOM tree contains the navigation bar information of each node, where the navigation bar information of each node includes the href attribute, the anchor text, the PATH and the CLASS attribute.
In this embodiment, after the external links are removed from the first node DOM tree and the second node DOM tree, all remaining data are internal links of the enterprise to be extracted, which improves the accuracy of the extracted navigation bar information of the enterprise to be extracted.
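One way to picture step S12 is a filter that resolves each href against the website domain name and keeps only links whose host is that domain or one of its subdomains. The sketch below assumes nodes are plain dictionaries with an `href` key, and the function names `is_internal` and `drop_external` are hypothetical. Treating subdomains as internal and keeping empty href values are choices made here, not rules stated in the text.

```python
from urllib.parse import urljoin, urlparse


def is_internal(href, site_domain):
    """Return True when href resolves to site_domain or one of its subdomains."""
    # Relative links such as "/about" or "#top" resolve onto the site itself.
    absolute = urljoin(f"https://{site_domain}/", href)
    host = (urlparse(absolute).hostname or "").lower()
    site = site_domain.lower()
    return host == site or host.endswith("." + site)


def drop_external(nodes, site_domain):
    """Keep only the nodes whose href stays inside the enterprise's own domain."""
    return [n for n in nodes if is_internal(n["href"], site_domain)]
```

Links with schemes such as `mailto:` have no host, so this filter drops them as well. Nodes with an empty href are kept, since the A tag density method relies on zero-length href values to separate link blocks.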
S13: extracting navigation bar information of the third node DOM tree by using a NAV tag method to obtain a first navigation bar information set, extracting navigation bar information of the third node DOM tree by using an A tag density method to obtain a second navigation bar information set, extracting navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum common area method to obtain a third navigation bar information set, and extracting navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set.
In this embodiment, different navigation bar information sets are obtained by extracting navigation bar node information with different methods, namely the NAV tag method, the A tag density method, the maximum common area method and the keyword link block method.
Preferably, extracting the navigation bar information of the third node DOM tree by using the NAV tag method to obtain the first navigation bar information set includes:
extracting the PATH in the navigation bar information of every node in the third node DOM tree; judging whether a NAV tag exists in each PATH; when a NAV tag exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attribute in the navigation bar information of every node in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set.
In this embodiment, the preset NAV keyword may be "NAV" or another NAV keyword. The NAV tag defines a group of links used for page navigation. Not all links on a page, whether to other pages or to other parts of the current page, are placed in a NAV tag; only key links, such as the navigation bar of the page, are placed there. Whether a NAV tag exists is judged from the PATH or the CLASS attribute of every node in the third node DOM tree, and navigation bar information is extracted based on the result, which improves the accuracy and efficiency of extracting the navigation bar information of the enterprise to be extracted.
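The two checks of the NAV tag method can be sketched as follows. The text presents the PATH check and the CLASS check as alternatives; this illustration simply accepts a node when either one succeeds. Nodes are assumed to be dictionaries with `path` and `css_class` keys, and `nav_tag_method` and `NAV_KEYWORDS` are hypothetical names.

```python
NAV_KEYWORDS = ("nav",)  # assumed preset keyword list; the text names only "NAV"


def nav_tag_method(nodes, keywords=NAV_KEYWORDS):
    """Select nodes that sit under a <nav> element or carry a NAV keyword in CLASS."""
    selected = []
    for node in nodes:
        # PATH check: is one of the path segments exactly the nav tag?
        in_nav_tag = "nav" in node["path"].lower().split("/")
        # CLASS check: does the class attribute contain a preset keyword?
        css_class = node["css_class"].lower()
        has_keyword = any(keyword in css_class for keyword in keywords)
        if in_nav_tag or has_keyword:
            selected.append(node)
    return selected
```

The CLASS check is what lets the method pick up navigation bars on pages that do not use the HTML5 `nav` element, for example a link whose class is `top-nav-item`.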
Preferably, extracting the navigation bar information of the third node DOM tree by using the A tag density method to obtain the second navigation bar information set includes:
extracting the href attribute in the navigation bar information of every node;
calculating the string length of the href attribute of each node, and merging the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets at its zero elements, wherein each subset contains a plurality of consecutive non-zero string lengths;
summing the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the largest total string length;
and acquiring the navigation bar information of the nodes whose PATH common prefix is the longest common prefix among all those PATHs, and merging that navigation bar information to obtain the second navigation bar information set.
In this embodiment, the A tag defines a hyperlink used to link from one page to another. The A tag density method calculates the string length of the href attribute in the navigation bar information of every node in the third node DOM tree, deletes the nodes whose href attribute has a string length of zero, and groups each run of consecutive nodes with non-empty href attributes into a subset. It then calculates the common prefix of the PATHs of the nodes in the subset with the largest total string length, determines whether each common prefix is the longest common prefix, and extracts the navigation bar information of all nodes that meet this requirement to obtain the second navigation bar information set. Deleting the nodes whose href attribute is empty improves the efficiency of extracting the navigation bar information of the enterprise to be extracted.
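The A tag density steps can be sketched as below, under one reading of the longest-common-prefix step: the common prefix is computed over the PATHs of the densest run, compared segment by segment, and every node whose PATH starts with that prefix is returned. That reading, the dictionary node shape (`href` and `path` keys) and the function names are assumptions of this sketch.

```python
def nonzero_runs(lengths):
    """Return (start, end) index pairs for each run of consecutive non-zero values."""
    runs, start = [], None
    for i, length in enumerate(lengths + [0]):  # the trailing 0 closes the last run
        if length and start is None:
            start = i
        elif not length and start is not None:
            runs.append((start, i))
            start = None
    return runs


def common_path_prefix(paths):
    """Longest common prefix of slash-separated paths, compared segment by segment."""
    prefix = []
    for segments in zip(*(p.split("/") for p in paths)):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return prefix


def a_tag_density_method(nodes):
    lengths = [len(n["href"]) for n in nodes]        # href attribute length list
    runs = nonzero_runs(lengths)                     # subsets split at zero elements
    if not runs:
        return []
    # The subset with the largest total href length is the densest link block.
    start, end = max(runs, key=lambda r: sum(lengths[r[0]:r[1]]))
    block = nodes[start:end]
    prefix = common_path_prefix([n["path"] for n in block])
    if not prefix:
        return block
    return [n for n in nodes if n["path"].split("/")[:len(prefix)] == prefix]
```

For href lengths `[0, 1, 6, 9, 0, 2]` the runs are indices 1 to 3 (total 16) and index 5 (total 2), so the first run is taken as the candidate navigation block.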
Preferably, extracting the navigation bar information of the third node DOM tree and the fourth node DOM tree by using the maximum common area method to obtain the third navigation bar information set includes:
extracting the maximum continuous common subset of the navigation bar information of the nodes in the third node DOM tree and the navigation bar information of the nodes in the fourth node DOM tree;
extracting the navigation bar information of all nodes in the maximum continuous common subset;
and, from the navigation bar information of all those nodes, merging the navigation bar information that has the preset A tag to obtain the third navigation bar information set.
In this embodiment, the maximum common area method extracts the navigation bar information of each node in the maximum continuous common subset of the main page and the sub-page, and judges whether the preset A tag exists in that navigation bar information. Navigation bar information without the preset A tag is quickly deleted, which improves the accuracy and efficiency of the extraction.
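A possible sketch of the maximum common area method treats the two pages as sequences of nodes and finds their longest run of consecutive matching nodes with a standard dynamic-programming table. Comparing nodes by the pair (href, anchor text) and reading "has the preset A tag" as "the last PATH segment is `a`" are both assumptions, as are the names `longest_common_run` and `max_common_area_method`.

```python
def node_key(node):
    # Two nodes are treated as the same entry when href and anchor text agree.
    return (node["href"], node["text"])


def longest_common_run(a, b):
    """Longest run of consecutive nodes that appears in both sequences."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if node_key(a[i - 1]) == node_key(b[j - 1]):
                cur[j] = prev[j - 1] + 1       # extend the run ending at (i-1, j-1)
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def max_common_area_method(home_nodes, sub_nodes):
    common = longest_common_run(home_nodes, sub_nodes)
    # Keep only entries that are <a> elements, dropping shared non-link nodes.
    return [n for n in common if n["path"].split("/")[-1] == "a"]
```

The idea behind the method is that a navigation bar is usually repeated unchanged on the main page and on a sub-page, whereas the page body differs, so the longest shared run is a good candidate for the navigation block.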
Preferably, extracting the navigation bar information of the third node DOM tree by using the keyword link block method to obtain the fourth navigation bar information set includes:
extracting the anchor text in the navigation bar information of every node in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor text belongs to a preset multilingual keyword set;
and acquiring the navigation bar information of the nodes whose PATH common prefix is the longest common prefix among all those PATHs, and merging that navigation bar information to obtain the fourth navigation bar information set.
In this embodiment, the PATHs of the nodes whose anchor text belongs to the preset multilingual keyword set are extracted from the navigation bar information of all nodes; the navigation bar information of the nodes whose PATH common prefix is the longest common prefix is acquired; and the navigation bar information of all acquired nodes is merged to obtain the fourth navigation bar information set.
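The keyword link block method can be illustrated in the same style: anchor texts that hit the keyword set locate the block, and the common PATH prefix of those hits then pulls in the other links of the same block. The keyword set below is a small invented sample, not the preset multilingual set of the patent, and the prefix interpretation and function names are assumptions.

```python
# Hypothetical sample of a multilingual keyword set.
KEYWORDS = {"home", "about us", "products", "contact", "首页", "关于我们"}


def common_path_prefix(paths):
    """Longest common prefix of slash-separated paths, compared segment by segment."""
    prefix = []
    for segments in zip(*(p.split("/") for p in paths)):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return prefix


def keyword_link_block_method(nodes, keywords=KEYWORDS):
    # PATHs of the nodes whose anchor text is a known navigation keyword.
    hit_paths = [n["path"] for n in nodes if n["text"].strip().lower() in keywords]
    if not hit_paths:
        return []
    prefix = common_path_prefix(hit_paths)
    # Every node under the shared prefix is taken to belong to the same link block.
    return [n for n in nodes if n["path"].split("/")[:len(prefix)] == prefix]
```

Note that a single keyword hit outside the navigation bar (for example a "Contact" link in the footer) shortens the shared prefix and widens the result, which is one reason the method is combined with the other three.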
In this embodiment, different navigation bar information sets are obtained by extracting navigation bar node information with the NAV tag method, the A tag density method, the maximum common area method and the keyword link block method. This avoids the poor extraction results and inflexibility caused by web pages using different HTML versions or by irregularly written web page code, and improves the accuracy and efficiency of the extraction.
S14: merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set.
In this embodiment, the fifth navigation bar information set is obtained by combining the different navigation bar information sets.
S15: de-duplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set.
In this embodiment, because the different navigation bar information sets are merged, the same node may appear more than once. Nodes whose anchor text and href attribute are identical are treated as duplicates and deleted, and the remaining nodes in the fifth navigation bar information set are then filtered to obtain the sixth navigation bar information set.
Preferably, the performing deduplication and filtering on the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set includes:
extracting anchor text and href attribute of each node in the fifth navigation bar information set;
when the anchor text and href attribute of any two nodes are completely consistent, the node which appears first is reserved, and the node which appears after deletion is obtained to obtain a target fifth navigation bar information set;
Matching the anchor text and href attribute of any node in the target fifth navigation bar information set with the text information in any one of a plurality of preset rule sets by utilizing regular matching;
and acquiring navigation bar information which fails to be matched with the text information in any rule set in the fifth navigation bar information set of the target, and merging the acquired navigation bar information to obtain a sixth navigation bar information set.
In this embodiment, the preset rule set may include: mailbox address rule sets, phone number rule sets, multilingual stop word sets and the like, wherein the mailbox address rule sets are global unified mailbox format rule sets, the phone number rule sets are phone number format rule sets set by each country, and the multilingual stop word sets comprise stop word sets of common languages including but not limited to Chinese, english, french, german, russian, japanese, korean, italian, vietnam, thai and the like. After deleting the same node, matching the anchor text and href attribute in the reserved node with any preset rule set in a plurality of preset rule sets, determining that the node is a company in the existing rule set if the anchor text and href attribute corresponding to any node is matched in any rule set, deleting the node, reserving the node if the anchor text and href attribute corresponding to any node is not matched in any rule set, extracting navigation bar information corresponding to any node, combining the navigation bar information of all extracted nodes to obtain a sixth navigation bar information set, and reducing the number of data processing.
In this embodiment, the number of nodes is reduced by performing deduplication and filtering on the nodes in the fifth navigation bar information set, so that the extraction efficiency of the navigation bar information is improved, and the enterprises appearing in the preset rule set are filtered, so that the accuracy of the extracted navigation bar information is improved.
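As an illustrative sketch only, the deduplication and rule-set filtering described above could be expressed in Python as follows. The regular expressions, the stop word list, the dictionary keys `text` and `href`, and all function names are assumptions made for this example; the specification does not disclose its actual patterns, and patterns this loose would need tightening before real use (for instance, the phone pattern would also match a date inside a link).

```python
import re

# Hypothetical rule sets; the specification does not list its actual patterns.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
STOP_WORDS = {"more", "here", "click"}  # stand-in for the multilingual stop word sets


def dedupe(nodes):
    """Keep only the first node for each (anchor text, href) pair."""
    seen = set()
    kept = []
    for node in nodes:
        key = (node["text"], node["href"])
        if key not in seen:
            seen.add(key)
            kept.append(node)
    return kept


def matches_rule_set(node):
    """True if the anchor text or href matches any of the assumed rule sets."""
    text, href = node["text"].strip(), node["href"]
    if text.lower() in STOP_WORDS:
        return True
    return any(rx.search(text) or rx.search(href) for rx in (EMAIL_RE, PHONE_RE))


def dedupe_and_filter(nodes):
    """Fifth set -> target fifth set (deduplicated) -> sixth set (unmatched nodes)."""
    return [n for n in dedupe(nodes) if not matches_rule_set(n)]


fifth_set = [
    {"text": "Products", "href": "/products"},
    {"text": "Products", "href": "/products"},
    {"text": "sales@example.com", "href": "mailto:sales@example.com"},
    {"text": "+86 755 1234 5678", "href": "tel:+8675512345678"},
    {"text": "About Us", "href": "/about"},
]
sixth_set = dedupe_and_filter(fifth_set)
```

In this example the duplicate "Products" entry and the two contact entries are dropped, so only the "Products" and "About Us" nodes remain in `sixth_set`.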
S16: a node score for each node in the sixth navigation bar information set is calculated using a node scoring algorithm.
Preferably, the calculating the node score of each node in the sixth navigation bar information set using a node scoring algorithm includes:
70 Calculating the PATH score of each node in the sixth navigation bar information set by using formula (1):
wherein α represents the PATH weight, a_note represents the node list of the sixth navigation bar information set, len(a_note) represents the number of nodes in the node list, index_note represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
71 Calculating the score of the anchor text of each node in the sixth navigation bar information set by using the formula (2):
wherein, beta represents the weight of the anchor text, text represents the anchor text of each node, keywords represents a preset multilingual keyword set, and other represents that the anchor text of each node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
72 Calculating a score of href attribute of each node in the sixth navigation bar information set using formula (3):
where γ represents the href attribute weight, layer represents the number of link levels in the href attribute of each node, and α+β+γ=100;
73 A node score for each node is obtained by accumulating the score of the PATH, the score of the anchor text and the score of href attribute of each node in the sixth navigation bar information set.
In this embodiment, by setting weights for the PATH, the anchor text, and the href attribute of each node, the score of the PATH, the score of the anchor text, and the score of the href attribute of each node are calculated to obtain the node score of each node, and the importance of each node in the navigation bar information is determined according to the node score of each node.
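The following Python sketch illustrates how three weighted partial scores could be accumulated and used for ranking. Formulas (1) to (3) are not reproduced in this text, so every functional form below is a placeholder assumed for illustration and should not be read as the formulas of the invention: the clamp used for f(), the position-based PATH score, the all-or-nothing anchor text score, the division by link depth, the weight split 30/40/30 and the keyword set are all hypothetical. Only the constraint α+β+γ=100 and the final accumulation come from the description.

```python
from urllib.parse import urlparse

ALPHA, BETA, GAMMA = 30, 40, 30  # assumed split; the text only requires the sum to be 100
KEYWORDS = {"home", "products", "about us", "contact"}  # stand-in keyword set


def f(x):
    """Assumed fitting function: clamp the input to [0, 1]."""
    return max(0.0, min(1.0, x))


def path_score(index, total):
    # Assumed form: nodes earlier in the node list score higher.
    return ALPHA * f(1 - index / total)


def text_score(text):
    # Assumed form: full weight for a keyword hit, zero in the "other" case.
    return BETA if text.strip().lower() in KEYWORDS else 0


def href_score(href):
    # Assumed form: weight divided by the number of path levels in the link.
    layers = max(1, len([p for p in urlparse(href).path.split("/") if p]))
    return GAMMA / layers


def node_scores(nodes):
    """Accumulate the three partial scores for every node."""
    total = len(nodes)
    return [
        (path_score(i, total) + text_score(n["text"]) + href_score(n["href"]), n)
        for i, n in enumerate(nodes)
    ]


sixth_set = [
    {"text": "Blog post", "href": "/news/2020/post"},
    {"text": "Products", "href": "/products"},
]
# Sort by score, highest first, as in step S17.
ranked = [n for _, n in sorted(node_scores(sixth_set), key=lambda p: p[0], reverse=True)]
```

Under these assumed forms the "Products" node scores 15 + 40 + 30 = 85 and the "Blog post" node scores 30 + 0 + 10 = 40, so "Products" is output first.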
S17: sorting all the nodes in the sixth navigation bar information set according to the node score of each node, and outputting the navigation bar information of the enterprise to be extracted.
In this embodiment, the node scores of each node extracted from the enterprise to be extracted are ranked, and the navigation bar information of the enterprise to be extracted is sequentially output according to the importance degree of each node.
In summary, according to the method for extracting navigation bar information of the website in this embodiment, on one hand, by extracting navigation bar node information by using the NAV tag method, the a tag density method, the maximum public area method and the keyword link block method, different navigation bar information sets are obtained, so that the phenomena of poor extraction effect, inflexibility and the like caused by different HTML versions used by webpages or irregular writing of webpage codes are avoided, the accuracy and extraction efficiency of the extracted navigation bar information are improved, and on the other hand, by performing de-duplication filtering on nodes in the fifth navigation bar information set, the number of nodes is reduced, the extraction efficiency of the navigation bar information is improved, enterprises appearing in the preset rule set are filtered, and the accuracy of the extracted navigation bar information is improved.
Example two
Fig. 2 is a block diagram of a navigation bar information extraction device of a website according to a second embodiment of the present invention.
In some embodiments, the navigation bar information extraction device 20 of the website may include a plurality of functional modules composed of program code segments. Program code for each program segment in the navigation bar information extraction apparatus 20 of the website may be stored in a memory of the electronic device and executed by the at least one processor to perform (see fig. 1 for details) extraction of navigation bar information of the website.
In this embodiment, the navigation bar information extraction device 20 of the website may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a parsing module 201, a rejection module 202, an extraction module 203, a merging module 204, a deduplication filtering module 205, a calculation module 206 and an output module 207. A module, as referred to in the present invention, is a series of computer program segments that are stored in a memory, can be executed by at least one processor and perform a fixed function. The functions of the respective modules are described in detail below.
The parsing module 201: configured to download the main page source code and the source code of any sub-page corresponding to the website domain name of an enterprise to be extracted, obtain a first HTML code from the main page source code and parse it into a first node DOM tree, and obtain a second HTML code from the sub-page source code and parse it into a second node DOM tree.
In this embodiment, an enterprise that wants to present information such as its corporate culture, products, introduction and contact details establishes a website and thereby obtains a website domain name. The main page source code and the source code of any sub-page of the website domain name are downloaded, where the link of the sub-page is an internal link of the website domain name. The JavaScript and CSS code in the main page source code and the sub-page source code is removed to obtain the HTML code, and the HTML code is parsed into a node DOM tree.
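As a minimal sketch of the parsing step, the following Python code uses the standard library `html.parser` module to walk an HTML document, ignore the contents of script and style elements, and record for each hyperlink its PATH, CLASS attribute, href attribute and anchor text. The class name `LinkCollector`, the record layout and the representation of PATH as a slash-separated chain of tag names are assumptions made for this example; downloading the page is omitted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkCollector(HTMLParser):
    """Collects one record per <a> element: PATH, CLASS, href and anchor text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags from the root to the current element
        self.skip = 0        # greater than zero while inside <script> or <style>
        self.nodes = []
        self.current = None  # record of the <a> element being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        self.stack.append(tag)
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            a = dict(attrs)
            self.current = {
                "path": "/".join(self.stack),
                "class": a.get("class") or "",
                "href": a.get("href") or "",
                "text": "",
            }

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
        if tag == "a" and self.current is not None:
            self.current["text"] = self.current["text"].strip()
            self.nodes.append(self.current)
            self.current = None
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.skip and self.current is not None:
            self.current["text"] += data


html = """<html><head><style>a{color:red}</style><script>var x = 1;</script></head>
<body><nav><ul><li><a class="nav-item" href="/products">Products</a></li>
<li><a href="/about">About <b>Us</b></a></li></ul></nav></body></html>"""
parser = LinkCollector()
parser.feed(html)
nodes = parser.nodes
```

Each entry of `nodes` then carries the four fields that the description calls navigation bar information, for example the PATH `html/body/nav/ul/li/a` for the first link.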
Rejection module 202: configured to remove the external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and to remove the external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree.
In this embodiment, the external link refers to a link where an external website links to a website of the enterprise to be extracted, a node DOM tree is obtained by parsing the HTML code, and each node DOM tree includes navigation bar information of each node, where the navigation bar information of each node includes href attribute, anchor text, PATH and CLASS attribute.
In this embodiment, by removing the external links from the first node DOM tree and the second node DOM tree, all of the remaining data are internal links of the enterprise to be extracted, which improves the accuracy of the extracted navigation bar information of the enterprise to be extracted.
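One possible way to decide, from the website domain name and the href attribute alone, whether a link should be kept is sketched below in Python. Treating relative links and subdomains of the site as internal is an assumption of this example, as are the function names and the sample domain `example.com`.

```python
from urllib.parse import urljoin, urlparse


def is_internal(href, site_domain):
    """True if href resolves to the site's own domain or one of its subdomains."""
    # Relative links are resolved against the site root before comparison.
    host = urlparse(urljoin(f"https://{site_domain}/", href)).hostname or ""
    return host == site_domain or host.endswith("." + site_domain)


def remove_external_links(nodes, site_domain):
    """Keep only the nodes whose href points inside the site."""
    return [n for n in nodes if is_internal(n["href"], site_domain)]


nodes = [
    {"href": "/products", "text": "Products"},
    {"href": "https://www.example.com/about", "text": "About"},
    {"href": "https://twitter.com/example", "text": "Twitter"},
]
internal = remove_external_links(nodes, "example.com")
```

Here the link to `twitter.com` is removed, while the relative link and the link to `www.example.com` are retained.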
Extraction module 203: configured to extract the navigation bar information of the third node DOM tree by using a NAV tag method to obtain a first navigation bar information set, to extract the navigation bar information of the third node DOM tree by using an A tag density method to obtain a second navigation bar information set, to extract the navigation bar information of the third node DOM tree and of the fourth node DOM tree by using a maximum common area method to obtain a third navigation bar information set, and to extract the navigation bar information of the third node DOM tree by using a keyword link block method to obtain a fourth navigation bar information set.
In this embodiment, different navigation bar information sets are obtained by extracting navigation bar node information by using different methods, such as a NAV tag method, an a tag density method, a maximum public area method, and a keyword link block method.
Preferably, the extracting module 203 extracts the navigation bar information of the DOM tree of the third node by using a NAV tag method to obtain the first navigation bar information set includes:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree; judging whether a NAV tag exists in each PATH; when a NAV tag exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all the extracted nodes to obtain a first navigation bar information set; or
extracting the CLASS attributes in the navigation bar information of all nodes in the third node DOM tree; judging whether each CLASS attribute contains a preset NAV keyword; when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all the extracted nodes to obtain a first navigation bar information set.
In this embodiment, the preset NAV keyword may be "NAV" or another NAV-related keyword. The NAV tag denotes a group of links used for page navigation. Not every link to another page or to another part of the current page is placed in a NAV tag; only key links, such as the navigation bar of the page, are placed there. Whether a NAV tag exists is judged according to the PATH or the CLASS attribute of each node in the third node DOM tree, and the navigation bar information is extracted based on the judgment result, which improves the accuracy and efficiency of extracting the navigation bar information of the enterprise to be extracted.
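A minimal Python sketch of the NAV tag method is given below. It assumes that each node is a dictionary whose `path` field is a slash-separated chain of tag names and whose `class` field holds the CLASS attribute. The description presents the PATH check and the CLASS check as alternatives; the sketch combines them with a logical OR purely for brevity, and the keyword tuple is a stand-in for the preset NAV keywords.

```python
NAV_KEYWORDS = ("nav",)  # stand-in for the preset NAV keywords


def nav_tag_method(nodes):
    """Select nodes whose PATH contains a nav tag or whose CLASS contains a NAV keyword."""
    selected = []
    for node in nodes:
        in_nav = "nav" in node["path"].lower().split("/")
        has_keyword = any(k in node["class"].lower() for k in NAV_KEYWORDS)
        if in_nav or has_keyword:
            selected.append(node)
    return selected


nodes = [
    {"path": "html/body/nav/ul/li/a", "class": "", "href": "/products", "text": "Products"},
    {"path": "html/body/div/a", "class": "top-nav-link", "href": "/about", "text": "About"},
    {"path": "html/body/footer/a", "class": "legal", "href": "/privacy", "text": "Privacy"},
]
first_set = nav_tag_method(nodes)
```

The first node is selected because its PATH passes through a nav element, the second because its CLASS attribute contains the keyword, and the footer link is not selected.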
Preferably, the extracting module 203 extracts the navigation bar information of the third node DOM tree by using an a-tag density method to obtain the second navigation bar information set includes:
extracting href attribute in navigation bar information of all nodes;
calculating the string length of the href attribute of each node, and merging to obtain a href attribute length list;
dividing the href attribute length list into a plurality of subsets by zero elements, wherein each subset contains a plurality of consecutive non-zero string lengths;
Accumulating the string length of the href attribute of each node in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the longest total string length;
and obtaining the navigation bar information of the nodes whose PATH common prefix satisfies the longest common prefix, and merging the navigation bar information to obtain a second navigation bar information set.
In this embodiment, the A tag defines a hyperlink used to link from one page to another. With the A tag density method, the string length of the href attribute in the navigation bar information of every node in the third node DOM tree is calculated, and nodes whose href attribute string length is zero are deleted. Each run of consecutive nodes with non-empty href attributes forms one subset. The common prefix of the PATHs of the nodes in the subset with the longest total string length is then calculated, and it is determined whether each common prefix satisfies the longest common prefix. The navigation bar information of all nodes that satisfy this requirement is extracted to obtain the second navigation bar information set. Deleting the nodes whose href attribute string length is zero improves the efficiency of extracting the navigation bar information of the enterprise to be extracted.
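The steps of the A tag density method can be sketched in Python as follows. The description leaves open exactly how the longest common prefix is applied; this sketch takes one possible reading, namely that the longest common PATH prefix of the densest subset is computed segment by segment and every node with a non-empty href under that prefix is kept. The function names and the node layout are assumptions made for the example.

```python
def common_prefix(paths):
    """Longest common prefix of '/'-separated paths, as a list of segments."""
    prefix = []
    for parts in zip(*(p.split("/") for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def a_tag_density_method(nodes):
    # Split the node list into runs of consecutive nodes with non-empty href.
    runs, run = [], []
    for node in nodes:
        if len(node["href"]) == 0:
            if run:
                runs.append(run)
            run = []
        else:
            run.append(node)
    if run:
        runs.append(run)
    if not runs:
        return []
    # The densest subset is the run with the largest total href string length.
    densest = max(runs, key=lambda r: sum(len(n["href"]) for n in r))
    prefix = common_prefix([n["path"] for n in densest])
    # Assumed reading: keep linked nodes whose PATH starts with that prefix.
    return [n for n in nodes
            if n["href"] and n["path"].split("/")[:len(prefix)] == prefix]


nodes = [
    {"path": "html/body/div/a", "href": "/login"},
    {"path": "html/body/div/a", "href": ""},
    {"path": "html/body/nav/ul/li/a", "href": "/products"},
    {"path": "html/body/nav/ul/li/a", "href": "/about-us"},
    {"path": "html/body/nav/ul/li/a", "href": "/contact"},
    {"path": "html/body/footer/a", "href": ""},
    {"path": "html/body/footer/a", "href": "/privacy"},
]
second_set = a_tag_density_method(nodes)
```

In this example the three consecutive links inside the nav element form the subset with the longest total string length (26 characters), so they make up `second_set`.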
Preferably, the extracting module 203 extracts the navigation bar information of the third node DOM tree and the fourth node DOM tree by using a maximum common area method to obtain a third navigation bar information set, including:
extracting the maximum continuous public subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting navigation bar information of all nodes in the maximum continuous public subset;
and merging, from the navigation bar information of all the nodes, the navigation bar information in which the preset A tag exists, to obtain a third navigation bar information set.
In this embodiment, the navigation bar information of each node in the maximum continuous public subsets in the main page and any sub-page is extracted by using the maximum public area method, whether the preset a tag exists in the navigation bar information is judged, the navigation bar information without the preset a tag is deleted rapidly, and the accuracy and the extraction efficiency of the extracted navigation bar information are improved.
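As a sketch under stated assumptions, the maximum continuous common subset can be computed with the classic longest-common-substring dynamic program, applied to the node lists of the main page and the sub-page. Comparing nodes by their (anchor text, href) pair and reading "the preset A tag exists" as "the PATH ends in an a element" are both assumptions of this example.

```python
def node_key(node):
    """Assumed identity of a node when comparing two pages."""
    return (node["text"], node["href"])


def longest_common_run(main_nodes, sub_nodes):
    """Longest contiguous run of nodes that appears, in order, on both pages."""
    a = [node_key(n) for n in main_nodes]
    b = [node_key(n) for n in sub_nodes]
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return main_nodes[best_end - best_len:best_end]


def max_common_area_method(main_nodes, sub_nodes):
    run = longest_common_run(main_nodes, sub_nodes)
    # Assumed reading of the preset A tag check: the PATH must end in <a>.
    return [n for n in run if n["path"].split("/")[-1] == "a"]


def link(text, href):
    return {"path": "html/body/nav/a", "text": text, "href": href}


main_page = [link("Home", "/"), link("Products", "/products"),
             link("About", "/about"), link("Summer sale", "/sale")]
sub_page = [link("Back", "/products"), link("Home", "/"),
            link("Products", "/products"), link("About", "/about"),
            link("Spec sheet", "/spec.pdf")]
third_set = max_common_area_method(main_page, sub_page)
```

The links "Home", "Products" and "About" appear consecutively on both pages and therefore form `third_set`, while the page-specific links are left out.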
Preferably, the extracting module 203 extracts the navigation bar information of the DOM tree of the third node by using a keyword link block method to obtain a fourth navigation bar information set includes:
extracting anchor texts in navigation bar information of all nodes in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor text, in the navigation bar information of all the nodes, belongs to a preset multilingual keyword set;
and obtaining the navigation bar information of the nodes whose PATH common prefix satisfies the longest common prefix, and merging the navigation bar information to obtain a fourth navigation bar information set.
In this embodiment, the PATHs of the nodes whose anchor text belongs to the preset multilingual keyword set are extracted from the navigation bar information of all the nodes; the navigation bar information of the nodes whose PATH common prefix satisfies the longest common prefix is then obtained, and the navigation bar information of all the obtained nodes is merged to obtain the fourth navigation bar information set.
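The keyword link block method can be illustrated with the following Python sketch. It assumes that the longest common prefix is taken over the PATHs of the keyword-matching nodes and that every node under that prefix belongs to the same link block; the keyword set, function names and node layout are hypothetical. If keyword hits are spread over unrelated parts of the page, the common prefix becomes short and the selection correspondingly broad.

```python
KEYWORDS = {"home", "products", "about us", "contact"}  # stand-in for the multilingual keyword set


def common_prefix(paths):
    """Longest common prefix of '/'-separated paths, as a list of segments."""
    prefix = []
    for parts in zip(*(p.split("/") for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def keyword_link_block_method(nodes):
    # Nodes whose anchor text belongs to the keyword set.
    hits = [n for n in nodes if n["text"].strip().lower() in KEYWORDS]
    if not hits:
        return []
    prefix = common_prefix([n["path"] for n in hits])
    # Assumed reading: all nodes under the shared prefix form the link block.
    return [n for n in nodes if n["path"].split("/")[:len(prefix)] == prefix]


nodes = [
    {"path": "html/body/nav/ul/li/a", "text": "Home", "href": "/"},
    {"path": "html/body/nav/ul/li/a", "text": "Products", "href": "/products"},
    {"path": "html/body/nav/ul/li/a", "text": "Careers", "href": "/careers"},
    {"path": "html/body/footer/a", "text": "Privacy", "href": "/privacy"},
]
fourth_set = keyword_link_block_method(nodes)
```

Although "Careers" is not in the keyword set, it shares the PATH prefix of the keyword hits and is therefore included, whereas the footer link is not.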
In this embodiment, different navigation bar information sets are obtained by extracting navigation bar node information by using a NAV tag method, an a tag density method, a maximum public area method and a keyword link block method, so that phenomena of poor extraction effect, inflexibility and the like caused by different HTML versions used by web pages or irregular writing of web page codes are avoided, and accuracy and extraction efficiency of the extracted navigation bar information are improved.
The merging module 204: configured to merge the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set.
In this embodiment, the fifth navigation bar information set is obtained by adding the different navigation bar information sets.
The deduplication filtering module 205: configured to deduplicate and filter the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set.
In this embodiment, because the fifth navigation bar information set is obtained by adding different navigation bar information sets together, identical nodes may appear in it. Nodes with identical anchor text and href attribute are deleted, and the remaining nodes are then filtered against the preset rule sets to obtain the sixth navigation bar information set.
Preferably, the deduplication filtering module 205 deduplicates and filters the navigation bar information of each node in the fifth navigation bar information set, and the obtaining the sixth navigation bar information set includes:
extracting anchor text and href attribute of each node in the fifth navigation bar information set;
when the anchor text and href attribute of any two nodes are completely identical, retaining the node that appears first and deleting the node that appears later, to obtain a target fifth navigation bar information set;
matching, by regular-expression matching, the anchor text and href attribute of each node in the target fifth navigation bar information set against the text information in each of a plurality of preset rule sets;
and acquiring the navigation bar information in the target fifth navigation bar information set that fails to match the text information in every rule set, and merging the acquired navigation bar information to obtain a sixth navigation bar information set.
In this embodiment, the preset rule sets may include a mailbox address rule set, a phone number rule set, a multilingual stop word set and the like, where the mailbox address rule set is a globally uniform mailbox format rule set, the phone number rule set contains the phone number formats defined by each country, and the multilingual stop word set comprises stop word sets of common languages including, but not limited to, Chinese, English, French, German, Russian, Japanese, Korean, Italian, Vietnamese and Thai. After duplicate nodes are deleted, the anchor text and href attribute of each retained node are matched against each of the preset rule sets. If the anchor text and href attribute of a node match any rule set, the node is determined to belong to an existing rule set and is deleted; if they match none of the rule sets, the node is retained and its navigation bar information is extracted. The navigation bar information of all extracted nodes is merged to obtain the sixth navigation bar information set, which reduces the amount of data to be processed.
In this embodiment, the number of nodes is reduced by performing deduplication and filtering on the nodes in the fifth navigation bar information set, so that the extraction efficiency of the navigation bar information is improved, and the enterprises appearing in the preset rule set are filtered, so that the accuracy of the extracted navigation bar information is improved.
The calculation module 206: configured to calculate a node score for each node in the sixth navigation bar information set using a node scoring algorithm.
Preferably, the calculating module 206 calculates the node score of each node in the sixth navigation bar information set using a node scoring algorithm includes:
70 Calculating the PATH score of each node in the sixth navigation bar information set by using formula (1):
wherein α represents the PATH weight, a_note represents the node list of the sixth navigation bar information set, len(a_note) represents the number of nodes in the node list, index_note represents the position index of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
71 Calculating the score of the anchor text of each node in the sixth navigation bar information set by using the formula (2):
wherein, beta represents the weight of the anchor text, text represents the anchor text of each node, keywords represents a preset multilingual keyword set, and other represents that the anchor text of each node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
72 Calculating a score of href attribute of each node in the sixth navigation bar information set using formula (3):
where γ represents the href attribute weight, layer represents the number of link levels in the href attribute of each node, and α+β+γ=100;
73 A node score for each node is obtained by accumulating the score of the PATH, the score of the anchor text and the score of href attribute of each node in the sixth navigation bar information set.
In this embodiment, by setting weights for the PATH, the anchor text and the href attribute of each node, the score of the PATH, the score of the anchor text and the attribute score of href of each node are calculated to obtain the node score of each node, and the importance of each node in the navigation bar information is determined according to the node score of each node.
The output module 207: configured to sort all the nodes in the sixth navigation bar information set according to the node score of each node and to output the navigation bar information of the enterprise to be extracted.
In this embodiment, the node scores of each node extracted from the enterprise to be extracted are ranked, and the navigation bar information of the enterprise to be extracted is sequentially output according to the importance degree of each node.
In summary, according to the navigation bar information extraction device of the website in this embodiment, on one hand, by extracting navigation bar node information by using the NAV tag method, the a tag density method, the maximum public area method and the keyword link block method, different navigation bar information sets are obtained, so that the phenomena of poor extraction effect, inflexibility and the like caused by different HTML versions used by webpages or irregular writing of webpage codes are avoided, the accuracy and extraction efficiency of the extracted navigation bar information are improved, and on the other hand, by performing de-duplication filtering on nodes in the fifth navigation bar information set, the number of nodes is reduced, the extraction efficiency of the navigation bar information is improved, enterprises appearing in the preset rule set are filtered, and the accuracy of the extracted navigation bar information is improved.
Example three
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 is not limiting of the embodiments of the present invention, and that either a bus-type configuration or a star-type configuration is possible, and that the electronic device 3 may also include more or less other hardware or software than that shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may further include a client device, where the client device includes, but is not limited to, any electronic product that can interact with a client by way of a keyboard, a mouse, a remote control, a touch pad, or a voice control device, such as a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that can be adapted to the present invention also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 is used to store program code and various data, such as the navigation bar information extraction device 20 of the website installed in the electronic device 3, and to provide high-speed, automatic access to programs or data during operation of the electronic device 3. The memory 31 includes Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 may be comprised of an integrated circuit, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The at least one processor 32 is a Control Unit (Control Unit) of the electronic device 3, connects the respective components of the entire electronic device 3 using various interfaces and lines, and executes various functions of the electronic device 3 and processes data, such as a navigation bar information extraction function of a website, by running or executing programs or modules stored in the memory 31, and calling data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connected communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further comprise a power source (such as a battery) for powering the various components, which may preferably be logically connected to the at least one processor 32 via a power management device, such that functions of managing charging, discharging, and power consumption are performed by the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 3 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited to this configuration.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the invention.
In a further embodiment, in conjunction with fig. 2, the at least one processor 32 may execute the operating system of the electronic device 3 and various installed applications (such as the navigation bar information extraction device 20 of the website), program code and the like, for example the modules described above.
The memory 31 has program code stored therein, and the at least one processor 32 can invoke the program code stored in the memory 31 to perform related functions. For example, each of the modules depicted in fig. 2 is a program code stored in the memory 31 and executed by the at least one processor 32 to perform the functions of each of the modules for navigation bar information extraction of a web site.
In one embodiment of the present invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement the navigation bar information extraction function of the website.
Specifically, the specific implementation method of the above instruction by the at least one processor 32 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it will be obvious that the term "comprising" does not exclude other elements or that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. The method for extracting the navigation bar information of the website is characterized by comprising the following steps of:
downloading a main page source code and any sub-page source code corresponding to a website domain name of an enterprise to be extracted, acquiring a first HTML code in the main page source code, analyzing the first HTML code into a first node DOM tree, acquiring a second HTML code in the arbitrary sub-page source code, and analyzing the second HTML code into a second node DOM tree;
removing the outer chain in the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and removing the outer chain in the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
extracting navigation bar information of a node with a NAV label in a node of the third node DOM tree by using a NAV label method, merging the extracted navigation bar information to obtain a first navigation bar information set, extracting a third navigation bar information set with preset A labels in which the string length of href attribute in the node of the third node DOM tree belongs to a subset with the longest total string length in a plurality of subsets obtained by dividing the string length of the node heref attribute, and the public prefix of the PATH PATH meets the node with the longest public prefix, merging the extracted navigation bar information to obtain a second navigation bar information set, extracting the navigation bar information of the node in the maximum continuous public subset of the third node DOM tree and the fourth node DOM tree by using a maximum public area method, merging the extracted navigation bar information to obtain a third navigation bar information set with preset A labels, and extracting the navigation bar information in the node of the third node DOM tree by using a keyword link block method, wherein the public prefix of the PATH PATH meets the longest public prefix and the public prefix of the navigation bar in the anchor information belongs to a fourth navigation bar information set;
merging the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
deduplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
accumulating the PATH score, the anchor text score and the href attribute score of each node in the sixth navigation bar information set to obtain the node score of each node;
and sorting all the nodes in the sixth navigation bar information set according to the node score of each node, and outputting the navigation bar information of the enterprise.
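The external-link removal step of claim 1 can be illustrated with a minimal sketch. It assumes each DOM node has been flattened to a dictionary with an `href` key, that relative links resolve against `https://<domain>/`, and that subdomains count as internal; none of these choices is stated in the claims, and the names `is_internal` and `drop_external_links` are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def is_internal(href: str, domain: str) -> bool:
    """Return True when href resolves to the site domain or one of its subdomains."""
    base = f"https://{domain}/"  # assumed base URL for resolving relative links
    host = urlparse(urljoin(base, href)).hostname or ""
    return host == domain or host.endswith("." + domain)


def drop_external_links(nodes: list[dict], domain: str) -> list[dict]:
    """Keep only nodes whose href stays on the site (nodes without href are kept)."""
    return [n for n in nodes if is_internal(n.get("href", ""), domain)]
```

Applied to the first and second node DOM trees, a filter of this kind would yield the third and fourth node DOM trees referred to above.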
2. The method for extracting navigation bar information of a website according to claim 1, wherein extracting the navigation bar information of nodes of the third node DOM tree that have a NAV tag by using the NAV tag method, and merging the extracted navigation bar information to obtain the first navigation bar information set, comprises:
extracting the PATHs in the navigation bar information of all nodes in the third node DOM tree;
determining whether a NAV tag exists in each PATH; when a NAV tag exists in any PATH, extracting the navigation bar information of the node corresponding to that PATH, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set; or
extracting the CLASS attributes in the navigation bar information of all nodes in the third node DOM tree; determining whether each CLASS attribute contains a preset NAV keyword; when any CLASS attribute contains the preset NAV keyword, extracting the navigation bar information of the node corresponding to that CLASS attribute, and merging the navigation bar information of all the extracted nodes to obtain the first navigation bar information set.
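A minimal sketch of the NAV tag method follows. It assumes each node carries a slash-separated `path` of tag names and an optional `cls` string for its CLASS attribute; the keyword tuple `NAV_KEYWORDS` is hypothetical, and checking both alternatives in one pass is a simplification, since the claim presents them as either/or.

```python
NAV_KEYWORDS = ("nav", "menu")  # hypothetical preset NAV keywords


def nav_tag_candidates(nodes: list[dict]) -> list[dict]:
    """Select nodes with a nav tag in their PATH or a NAV keyword in their class."""
    selected = []
    for node in nodes:
        tags = node["path"].lower().split("/")
        cls = node.get("cls", "").lower()
        # Substring matching on the class is deliberately loose; a real rule set
        # would need to avoid false hits such as "canvas".
        if "nav" in tags or any(keyword in cls for keyword in NAV_KEYWORDS):
            selected.append(node)
    return selected
```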
3. The method for extracting navigation bar information of a website according to claim 1, wherein extracting, by using the A-tag density method, the navigation bar information of nodes of the third node DOM tree whose href attribute string length belongs to the subset with the longest total string length among the plurality of subsets obtained by dividing the href attribute string lengths of the nodes, and whose PATH common prefix satisfies the longest common prefix, and merging the extracted navigation bar information to obtain the second navigation bar information set, comprises:
extracting the href attribute in the navigation bar information of all nodes;
calculating the string length of the href attribute of each node, and merging the lengths to obtain an href attribute length list;
dividing the href attribute length list into a plurality of subsets at its zero elements, wherein each subset contains a plurality of consecutive non-zero string lengths;
accumulating the string lengths of the href attributes of the nodes in each subset to obtain the total string length of each subset;
extracting the PATH of each node in the subset with the longest total string length;
and obtaining the navigation bar information of the nodes whose PATH common prefix satisfies the longest common prefix, and merging the navigation bar information to obtain the second navigation bar information set.
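The segmentation at the heart of the A-tag density method can be sketched as below. The sketch assumes the nodes arrive in document order, that a node without a link has an empty `href` (length zero), and that the PATH is a slash-separated tag string. The claim does not spell out how the longest-common-prefix criterion selects nodes, so `common_path_prefix` only computes the prefix; both function names are hypothetical.

```python
from itertools import groupby


def densest_href_run(nodes: list[dict]) -> list[dict]:
    """Split nodes at zero-length hrefs and return the run with the largest total length."""
    runs = [
        list(group)
        for has_href, group in groupby(nodes, key=lambda n: len(n.get("href", "")) > 0)
        if has_href
    ]
    return max(runs, key=lambda run: sum(len(n["href"]) for n in run), default=[])


def common_path_prefix(paths: list[str]) -> str:
    """Longest common prefix of slash-separated PATHs, compared tag by tag."""
    prefix = []
    for parts in zip(*(p.split("/") for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/".join(prefix)
```

Comparing whole tags rather than characters avoids treating `ul` and `u` as sharing a prefix.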
4. The method for extracting navigation bar information of a website according to claim 1, wherein extracting, by using the maximum common area method, the navigation bar information of the nodes in the maximum continuous common subset of the third node DOM tree and the fourth node DOM tree, and merging the extracted navigation bar information that has a preset A tag to obtain the third navigation bar information set, comprises:
extracting the maximum continuous common subset of the navigation bar information of each node in the third node DOM tree and the navigation bar information of each node in the fourth node DOM tree;
extracting the navigation bar information of all nodes in the maximum continuous common subset;
and merging, from the navigation bar information of all those nodes, the navigation bar information that has the preset A tag to obtain the third navigation bar information set.
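One way to read the "maximum continuous common subset" is as the longest contiguous run of identical nodes shared by the main page and the sub-page. The sketch below takes that reading and uses the standard-library `difflib.SequenceMatcher`; the node signature `(tag, text, href)` and the function name `max_common_block` are assumptions, not details given in the claims.

```python
from difflib import SequenceMatcher


def max_common_block(main_nodes: list[dict], sub_nodes: list[dict]) -> list[dict]:
    """Return the A-tag nodes inside the longest contiguous run shared by both pages."""

    def signature(node: dict) -> tuple:
        return (node["tag"], node.get("text", ""), node.get("href", ""))

    a = [signature(n) for n in main_nodes]
    b = [signature(n) for n in sub_nodes]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    block = main_nodes[match.a : match.a + match.size]
    # Keep only anchor nodes, mirroring the "preset A tag" condition.
    return [n for n in block if n["tag"] == "a"]
```

The intuition is that a navigation bar is repeated verbatim across pages of the same site, while page-specific content is not.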
5. The method for extracting navigation bar information of a website according to claim 1, wherein extracting, by using the keyword link block method, the navigation bar information of nodes of the third node DOM tree whose PATH common prefix satisfies the longest common prefix and whose anchor text belongs to a preset multilingual keyword set, and merging the extracted navigation bar information to obtain the fourth navigation bar information set, comprises:
extracting the anchor texts in the navigation bar information of all nodes in the third node DOM tree;
acquiring the PATHs of the nodes whose anchor text belongs to the preset multilingual keyword set;
and obtaining the navigation bar information of the nodes whose PATH common prefix satisfies the longest common prefix, and merging the navigation bar information to obtain the fourth navigation bar information set.
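The keyword link block method can be sketched under one possible reading: keyword hits locate the link block, and every node sharing the hits' common PATH prefix is then taken as part of that block. This reading, the keyword set `KEYWORDS`, and the function names are all assumptions made for illustration.

```python
KEYWORDS = {"home", "products", "about us", "contact", "首页", "产品"}  # hypothetical


def _common_prefix(paths: list[list[str]]) -> list[str]:
    """Longest common prefix of tag lists, compared tag by tag."""
    prefix = []
    for parts in zip(*paths):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return prefix


def keyword_link_block(nodes: list[dict], keywords: set = KEYWORDS) -> list[dict]:
    """Return the nodes that share the common PATH prefix of the keyword-matching nodes."""
    hits = [n for n in nodes if n["text"].strip().lower() in keywords]
    if not hits:
        return []
    prefix = _common_prefix([n["path"].split("/") for n in hits])
    return [n for n in nodes if n["path"].split("/")[: len(prefix)] == prefix]
```

When keyword hits are scattered across unrelated regions of the page, the prefix degenerates toward `html/body` and the selection becomes too broad, which is one reason the results of this method are merged with, rather than substituted for, the other three.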
6. The method for extracting navigation bar information of a website according to claim 1, wherein deduplicating and filtering the navigation bar information of each node in the fifth navigation bar information set to obtain the sixth navigation bar information set comprises:
extracting the anchor text and href attribute of each node in the fifth navigation bar information set; when the anchor text and href attribute of any two nodes are completely identical, retaining the node that appears first and deleting the node that appears later, to obtain a target fifth navigation bar information set;
matching the anchor text and href attribute of any node in the target fifth navigation bar information set against the text information in any one of a plurality of preset rule sets by using regular-expression matching;
and acquiring, from the target fifth navigation bar information set, the navigation bar information that fails to match the text information in any rule set, and merging the acquired navigation bar information to obtain the sixth navigation bar information set.
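The deduplication and filtering of claim 6 can be sketched as follows. The two regular expressions in `RULES` are hypothetical stand-ins for the preset rule sets, whose contents the claims do not list.

```python
import re

# Hypothetical rule sets: non-navigational link schemes and account-related labels.
RULES = [
    re.compile(r"^(javascript:|mailto:|tel:|#)", re.IGNORECASE),
    re.compile(r"\b(login|sign in|cart)\b", re.IGNORECASE),
]


def dedupe_and_filter(nodes: list[dict], rules: list = RULES) -> list[dict]:
    """Keep the first node per (anchor text, href) pair, then drop rule matches."""
    seen = set()
    kept = []
    for node in nodes:
        key = (node["text"], node["href"])
        if key in seen:
            continue  # a later duplicate; the first occurrence was already handled
        seen.add(key)
        if any(rule.search(node["text"]) or rule.search(node["href"]) for rule in rules):
            continue  # matched a filter rule, so it is excluded from the result
        kept.append(node)
    return kept
```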
7. The method for extracting navigation bar information of a website according to claim 1, wherein the accumulating the PATH score, the anchor text score, and the href attribute score of each node in the sixth navigation bar information set to obtain the node score of each node comprises:
the PATH score of each node in the sixth navigation bar information set is calculated using the following formula:
wherein α represents the PATH weight, α_note represents a node list of the sixth navigation bar information set, len(α_note) represents the number of nodes in the node list, index_note represents the position subscript of each node in the node list, and f() represents a fitting function with a value range of [0, 1];
calculating the score of the anchor text of each node in the sixth navigation bar information set by adopting the following formula:
wherein β represents the anchor text weight, text represents the anchor text of each node, keywords represents the preset multilingual keyword set, and other denotes the case in which the anchor text of a node in the sixth navigation bar information set does not belong to the preset multilingual keyword set;
calculating the score of the href attribute of each node in the sixth navigation bar information set by adopting the following formula:
wherein γ represents the href attribute weight, layer represents the number of link layers in the href attribute of each node, and α + β + γ = 100;
and accumulating the score of the PATH, the score of the anchor text and the score of the href attribute of each node in the sixth navigation bar information set to obtain the node score of each node.
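The three scoring formulas are not reproduced in this text, so the sketch below uses placeholder component functions purely to show how weighted scores could be accumulated and used for ranking. Everything specific is an assumption: the weights 40/40/20, the identity function for f(), a zero score for anchor texts outside the keyword set, and an href score that shrinks as the number of link layers grows.

```python
from urllib.parse import urlparse

ALPHA, BETA, GAMMA = 40, 40, 20  # hypothetical weights satisfying alpha + beta + gamma = 100
KEYWORDS = {"home", "products", "about us", "contact"}  # hypothetical keyword set


def href_layers(href: str) -> int:
    """Count non-empty path segments in href, treating the site root as one layer."""
    return len([seg for seg in urlparse(href).path.split("/") if seg]) or 1


def node_score(node: dict, index: int, total: int, f=lambda x: x) -> float:
    """Accumulate placeholder PATH, anchor text and href scores for one node."""
    path_score = ALPHA * f((total - index) / total)  # assumed: earlier position scores higher
    text_score = BETA if node["text"].strip().lower() in KEYWORDS else 0  # assumed "other" value
    href_score = GAMMA / href_layers(node["href"])  # assumed: shallower links score higher
    return path_score + text_score + href_score


def rank(nodes: list[dict]) -> list[dict]:
    """Sort nodes by descending accumulated score."""
    total = len(nodes)
    scored = [(node_score(n, i, total), n) for i, n in enumerate(nodes)]
    return [n for _, n in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```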
8. A navigation bar information extraction device of a website, the navigation bar information extraction device of the website comprising:
a parsing module, configured to download the source code of a main page and of any one sub-page corresponding to the website domain name of an enterprise whose navigation bar information is to be extracted, acquire a first HTML code from the main page source code, parse the first HTML code into a first node DOM tree, acquire a second HTML code from the sub-page source code, and parse the second HTML code into a second node DOM tree;
a removing module, configured to remove external links from the first node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the first node DOM tree to obtain a third node DOM tree, and to remove external links from the second node DOM tree according to the website domain name and the href attribute in the navigation bar information of each node in the second node DOM tree to obtain a fourth node DOM tree;
an extracting module, configured to: extract, by using a NAV tag method, the navigation bar information of nodes of the third node DOM tree that have a NAV tag, and merge the extracted navigation bar information to obtain a first navigation bar information set; extract, by using an A-tag density method, the navigation bar information of nodes of the third node DOM tree whose href attribute string length belongs to the subset with the longest total string length among the plurality of subsets obtained by dividing the href attribute string lengths of the nodes, and whose PATH common prefix satisfies the longest common prefix, and merge the extracted navigation bar information to obtain a second navigation bar information set; extract, by using a maximum common area method, the navigation bar information of the nodes in the maximum continuous common subset of the third node DOM tree and the fourth node DOM tree, and merge the extracted navigation bar information that has a preset A tag to obtain a third navigation bar information set; and extract, by using a keyword link block method, the navigation bar information of nodes of the third node DOM tree whose PATH common prefix satisfies the longest common prefix and whose anchor text belongs to a preset multilingual keyword set, and merge the extracted navigation bar information to obtain a fourth navigation bar information set;
a merging module, configured to merge the first navigation bar information set, the second navigation bar information set, the third navigation bar information set and the fourth navigation bar information set to obtain a fifth navigation bar information set;
a deduplication and filtering module, configured to deduplicate and filter the navigation bar information of each node in the fifth navigation bar information set to obtain a sixth navigation bar information set;
a calculation module, configured to accumulate the PATH score, the anchor text score and the href attribute score of each node in the sixth navigation bar information set to obtain the node score of each node;
and an output module, configured to sort all the nodes in the sixth navigation bar information set according to the node score of each node and output the navigation bar information of the enterprise.
9. An electronic device comprising a processor for implementing the navigation bar information extraction method of the website according to any one of claims 1 to 7 when executing a computer program stored in a memory.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the navigation bar information extraction method of a website according to any one of claims 1 to 7.
CN202010484954.XA 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium Active CN111625748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484954.XA CN111625748B (en) 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111625748A CN111625748A (en) 2020-09-04
CN111625748B CN111625748B (en) 2024-01-09

Family

ID=72272649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484954.XA Active CN111625748B (en) 2020-06-01 2020-06-01 Navigation bar information extraction method and device of website, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111625748B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528117B (en) * 2020-12-11 2023-03-14 杭州安恒信息技术股份有限公司 Recognition method and related device for government affair website primary catalog
CN112230989B (en) * 2020-12-14 2021-03-12 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN114817811B (en) * 2022-05-07 2024-03-19 盐城天眼察微科技有限公司 Website analysis method and device
CN114610985B (en) * 2022-05-10 2022-08-19 北京百炼智能科技有限公司 Information extraction method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN103778104A (en) * 2012-10-22 2014-05-07 富士通株式会社 Information processing device, information processing method and electronic device
CN103823824A (en) * 2013-11-12 2014-05-28 哈尔滨工业大学深圳研究生院 Method and system for automatically constructing text classification corpus by aid of internet
CN105069107A (en) * 2015-08-07 2015-11-18 北京百度网讯科技有限公司 Method and device for monitoring website
CN106294722A (en) * 2016-08-09 2017-01-04 上海资誉网络科技有限公司 A kind of web page contents extraction method and device

Also Published As

Publication number Publication date
CN111625748A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111625748B (en) Navigation bar information extraction method and device of website, electronic equipment and storage medium
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN111782907B (en) News classification method and device and electronic equipment
CN110020312B (en) Method and device for extracting webpage text
CN105550359B (en) Webpage sorting method and device based on vertical search and server
JPWO2019224891A1 (en) Classification device, classification method, generation method, classification program and generation program
CN111930962A (en) Document data value evaluation method and device, electronic equipment and storage medium
KR20220064016A (en) Method for extracting construction safety accident based data mining using big data
CN112149409A (en) Medical word cloud generation method and device, computer equipment and storage medium
CN112667802A (en) Service information input method, device, server and storage medium
CN110781669A (en) Text key information extraction method and device, electronic equipment and storage medium
US20150205769A1 (en) System and method for recognizing non-body text in webpage
CN106372232B (en) Information mining method and device based on artificial intelligence
CN106611029A (en) Method and device for improving site search efficiency in website
JP4724158B2 (en) Method and apparatus for automatic form filling in mobile devices
CN112069808A (en) Financing wind control method and device, computer equipment and storage medium
CN114462383B (en) Method, system, storage medium and equipment for obtaining design specification of building drawing
Eldirdiery et al. Detecting and removing noisy data on web document using text density approach
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
US20170220557A1 (en) Method, device, and computer program for providing a definition or a translation of a word belonging to a sentence as a function of neighbouring words and of databases
CN112230989B (en) Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN111026942A (en) Hot word extraction method, device, terminal and medium based on web crawler
CN113806311A (en) Deep learning-based file classification method and device, electronic equipment and medium
CN113204962A (en) Word sense disambiguation method, device, equipment and medium based on graph expansion structure
CN112667874A (en) Webpage data extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant