CN111625749B - Method, device, equipment and medium for extracting website detail page information of participant company - Google Patents

Publication number
CN111625749B
Authority
CN
China
Prior art keywords
node
dom tree
participant
target
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010485912.8A
Other languages
Chinese (zh)
Other versions
CN111625749A (en)
Inventor
祁俊辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xiaoman Technology Co ltd
Original Assignee
Shenzhen Xiaoman Technology Co ltd
Application filed by Shenzhen Xiaoman Technology Co ltd
Priority to CN202010485912.8A
Publication of CN111625749A
Application granted
Publication of CN111625749B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of text extraction and provides a method, a device, equipment and a medium for extracting website detail page information of a participant company. The method comprises the following steps: acquiring a detail page link set of a plurality of participant company websites and downloading source codes; acquiring the HTML code in each source code and parsing it into a first node DOM tree; extracting the main body part of each first node DOM tree according to the first node DOM trees of the plurality of participant companies to obtain a second node DOM tree; extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of the second node DOM tree according to a plurality of preset rules; and parsing the detail page information of each participant company according to the PATHs and CLASS attributes of the plurality of target nodes in the main body part of the second node DOM tree. Because the detail page information of each participant company is parsed through a plurality of preset rules, the efficiency and accuracy of extracting the detail page information are improved.

Description

Method, device, equipment and medium for extracting website detail page information of participant company
Technical Field
The invention relates to the technical field of text extraction, in particular to a method, a device, equipment and a medium for extracting website detail page information of a participant company.
Background
To increase exposure and popularity, enterprises choose to participate in exhibitions to show their brands and products to the public. As sponsors, exhibition organizers publish the information of the participating enterprises on their official websites in advance; this information is public and remains on the official websites even after the exhibitions end.
Because this information is public, researchers can extract it for research or data mining analysis. In the prior art, crawling most often relies on manually matching nodes to parse web pages, or on "foolproof" visual crawlers. Different node paths must be entered manually for different websites, and for unfamiliar foreign-language web pages engineers must consult a dictionary to locate the required information. As a result, page information extraction is inefficient and inaccurate, and cannot be performed flexibly.
Disclosure of Invention
In view of the above, it is necessary to provide a method, a device and a medium for extracting website detail page information of a participant company, which obtain and parse the detail page information of each participant company through a plurality of preset rules, in particular a node text density algorithm, so as to improve the efficiency and accuracy of extracting the detail page information and to make the extraction of structured website information more intelligent and flexible.
The first aspect of the invention provides a method for extracting website detail page information of a participant company, wherein a node text density algorithm comprises the following steps:
acquiring the HTML code of a participant company website, and parsing the HTML code into a node DOM tree;
calculating the text length of each node in the node DOM tree;
accumulating the text lengths of all nodes in the node DOM tree to obtain a total length;
calculating, according to the total length, the proportion (duty ratio) of the text length of each node;
setting every proportion that is less than or equal to a preset first proportion threshold to zero, and obtaining a text length proportion set from the updated proportions;
dividing the text length proportion set into a plurality of subsets at the zero elements, wherein each subset comprises a run of consecutive non-zero proportions;
calculating the proportion sum of each subset from the non-zero proportions in that subset;
acquiring each target subset whose proportion sum is greater than or equal to a preset second proportion threshold, and acquiring the PATHs and CLASS attributes of the target nodes corresponding to all non-zero proportions in the target subset;
counting the occurrences of each PATH and of each CLASS attribute, taking the most frequent PATH as the target PATH of the participant company, and taking the most frequent CLASS attribute as the target CLASS attribute of the participant company;
and parsing the detail page information of the participant company according to the target PATH and the target CLASS attribute.
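The node text density steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `Node` tuple, the slash-separated path strings, the function name and both threshold values are hypothetical and are not specified by the source.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    path: str                  # hypothetical path string, e.g. "html/body/div/p"
    css_class: Optional[str]   # value of the CLASS attribute, if any
    text: str                  # text carried by the node


def target_path_and_class(
    nodes: List[Node],
    first_threshold: float = 0.05,   # assumed value, not from the source
    second_threshold: float = 0.5,   # assumed value, not from the source
) -> Tuple[Optional[str], Optional[str]]:
    lengths = [len(n.text.strip()) for n in nodes]
    total = sum(lengths)
    if total == 0:
        return None, None

    # Proportion of each node's text; small proportions are reset to zero.
    ratios = [length / total for length in lengths]
    ratios = [r if r > first_threshold else 0.0 for r in ratios]

    # Split at the zero elements into runs of consecutive non-zero proportions.
    runs: List[List[int]] = []
    current: List[int] = []
    for index, ratio in enumerate(ratios):
        if ratio > 0:
            current.append(index)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)

    # Vote on PATH and CLASS over the nodes of every sufficiently dense run.
    paths: Counter = Counter()
    classes: Counter = Counter()
    for run in runs:
        if sum(ratios[i] for i in run) >= second_threshold:
            for i in run:
                paths[nodes[i].path] += 1
                if nodes[i].css_class:
                    classes[nodes[i].css_class] += 1

    best_path = paths.most_common(1)[0][0] if paths else None
    best_class = classes.most_common(1)[0][0] if classes else None
    return best_path, best_class


page = [
    Node("html/body/div/h1", "title", "Acme"),
    Node("html/body/div/p", "profile", "A" * 60),
    Node("html/body/div/p", "profile", "B" * 50),
    Node("html/body/div/span", None, "ok"),
    Node("html/body/footer/p", "footer", "C" * 30),
]
result = target_path_and_class(page)
```

In this made-up page the two long paragraphs form the only run whose proportion sum reaches the second threshold, so their shared path and class win the vote.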
A second aspect of the present invention provides a method for extracting website detail page information of a participating company, the method comprising:
acquiring a detail page link set of a plurality of participant websites, and downloading source codes according to each link in the detail page link set;
acquiring an HTML code in each source code, and analyzing the HTML code into a first node DOM tree;
extracting a main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies to obtain a second node DOM tree of each participant company;
extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules;
and parsing the detail page information of each participant company according to the PATHs and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
Preferably, the extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules includes:
extracting the PATHs and CLASS attributes of company name nodes in the main body part of each second node DOM tree according to a preset company name suffix set;
extracting the PATHs and CLASS attributes of mailbox address nodes in the main body part of each second node DOM tree according to preset mailbox address rules;
extracting the PATHs and CLASS attributes of telephone number nodes in the main body part of each second node DOM tree according to preset telephone number rules;
extracting the PATHs and CLASS attributes of company address nodes in the main body part of each second node DOM tree according to a multi-language address recognition algorithm;
and extracting the PATHs and CLASS attributes of company profile nodes in the main body part of each second node DOM tree according to a node text density algorithm.
Preferably, the parsing the detail page information of each participant company according to the PATHs and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree includes:
judging whether each target node in the main body part of each second node DOM tree contains a CLASS attribute;
when the target node contains a CLASS attribute, parsing the detail page information of the participant company by using the CLASS attribute of the target node;
and when the target node does not contain a CLASS attribute, parsing the detail page information of the participant company by using the PATH of the target node.
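The CLASS-first, PATH-fallback selection described above might look like the following sketch. The `Node` tuple and the `extract_field` helper are hypothetical names introduced only for illustration.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    path: str                  # hypothetical path string, e.g. "html/body/div/p"
    css_class: Optional[str]   # value of the CLASS attribute, if any
    text: str


def extract_field(page: List[Node], target_path: str,
                  target_class: Optional[str]) -> List[str]:
    """Select node texts by CLASS when one is known, otherwise by PATH."""
    if target_class:
        return [n.text for n in page if n.css_class == target_class]
    return [n.text for n in page if n.path == target_path]


page = [
    Node("html/body/div/h1", "company-name", "Acme Ltd"),
    Node("html/body/div/p", None, "Booth 12, Hall A"),
]
by_class = extract_field(page, "html/body/div/h1", "company-name")
by_path = extract_field(page, "html/body/div/p", None)
```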
Preferably, the extracting the main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies, to obtain the second node DOM tree of each participant company, includes:
for each participant company, taking the first node DOM tree of that participant company as a target first node DOM tree, and taking the first node DOM tree of any other participant company as a candidate first node DOM tree;
sequentially acquiring each first node from the head of the target first node DOM tree, acquiring from each candidate first node DOM tree the second node at the same position as the first node, and matching the information of the first node against the information of the second node until the two no longer match, whereupon the index of that first node is recorded as a starting subscript;
sequentially acquiring each third node from the tail of the target first node DOM tree, acquiring from each candidate first node DOM tree the fourth node at the same position as the third node, and matching the information of the third node against the information of the fourth node until the two no longer match, whereupon the index of that third node is recorded as an ending subscript;
acquiring the number of times each starting subscript and each ending subscript is recorded;
and determining the second node DOM tree of the participant company according to the recording counts.
Preferably, the determining the second node DOM tree of the participant company according to the recording counts includes:
taking the starting subscript with the most records as a target starting subscript, and taking the ending subscript with the most records as a target ending subscript;
and intercepting, from the target first node DOM tree, the main body part from the target starting subscript to the target ending subscript, and determining the main body part to be the second node DOM tree of the participant company.
Preferably, the method for extracting website detail page information of a participant company further comprises the following step:
when the number of acquired participant company website detail pages is greater than the preset number of training samples, randomly selecting N participant company website detail pages from the acquired detail pages for training.
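A minimal sketch of the sampling step, assuming that N equals the preset number of training samples (the source does not state this explicitly); the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def select_training_pages(pages: Sequence[str], n_samples: int,
                          seed: Optional[int] = None) -> List[str]:
    """Return at most n_samples pages, sampled without replacement."""
    if len(pages) > n_samples:
        # A local Random instance keeps the sampling reproducible when seeded.
        return random.Random(seed).sample(list(pages), n_samples)
    return list(pages)


pages = [f"page{i}" for i in range(10)]
chosen = select_training_pages(pages, 3, seed=0)
```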
A third aspect of the present invention provides a participating company website detail page information extraction apparatus, the apparatus comprising:
the acquisition module is used for acquiring detail page link sets of a plurality of participant websites and downloading source codes according to each link in the detail page link sets;
the first parsing module is used for acquiring the HTML code in each source code and parsing the HTML code into a first node DOM tree;
the first extraction module is used for extracting the main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies to obtain a second node DOM tree of each participant company;
the second extraction module is used for extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules;
and the second parsing module is used for parsing the detail page information of each participant company according to the PATHs and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
A fourth aspect of the present invention provides an electronic device comprising a processor, the processor being configured to implement the method for extracting website detail page information of a participant company when executing a computer program stored in a memory.
A fifth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for extracting website detail page information of a participant company.
In summary, according to the method, the device, the equipment and the medium for extracting website detail page information of a participant company disclosed by the invention, the main body part of each participant company web page is extracted by using the double-end weight judging method and the unnecessary parts of the page are deleted, which improves page extraction efficiency; the detail page information of each participant company is parsed through a plurality of preset rules, which improves the efficiency and accuracy of extracting the detail page information and makes the extraction of structured website information more intelligent and flexible.
Drawings
Fig. 1 is a flowchart of a method for extracting website detail page information of a participating company according to an embodiment of the present invention.
Fig. 2 is a block diagram of a website detail page information extraction device of a participating company according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
The invention will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example 1
Fig. 1 is a flowchart of a method for extracting website detail page information of a participating company according to an embodiment of the present invention.
In this embodiment, the method for extracting website detail page information of a participant company may be applied to an electronic device. For an electronic device that needs to extract website detail page information of participant companies, the extraction function provided by the method of the present invention may be integrated directly on the electronic device, or may run on the electronic device in the form of a software development kit (SDK).
As shown in FIG. 1, the method for extracting website detail page information of a participant company specifically comprises the following steps. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S11: acquiring a detail page link set of a plurality of participant company websites, and downloading source code according to each link in the detail page link set.
In this embodiment, the links of a plurality of participant companies are obtained from different websites. For example, the links of all participant companies are obtained from the website of an exhibition, and the source code of each participant company's detail page is downloaded according to those links.
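Step S11 could be sketched as below using only the Python standard library. The function names, the injectable `fetch` parameter and the decision to skip links that fail to download are assumptions made for illustration, not details given by the source.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it with the charset the server declares."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download_detail_pages(
    links: Iterable[str],
    fetch: Callable[[str], str] = fetch_url,
) -> Dict[str, str]:
    """Map each detail page link to its source code, skipping failed links."""
    sources: Dict[str, str] = {}
    for link in links:
        try:
            sources[link] = fetch(link)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses.
            continue
    return sources
```

Passing a custom `fetch` callable makes it possible to exercise the loop without network access, for example with a stub that returns canned HTML.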
S12: acquiring the HTML code in each source code, and parsing the HTML code into a first node DOM tree.
In this embodiment, the JavaScript and CSS code is deleted from the source code of each participant company and the HTML code is retained; an HTML parser is then used to parse the HTML code corresponding to each participant company into a first node DOM tree.
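As a rough illustration of step S12, the sketch below uses Python's standard `html.parser` module to skip `<script>` and `<style>` content and to flatten the remaining HTML into text nodes in document order. The flat `Node` list, the slash-separated path format and all names are assumptions; the source only says that an HTML parser produces a node DOM tree.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    path: str                  # hypothetical path string, e.g. "html/body/div/p"
    css_class: Optional[str]   # CLASS attribute of the enclosing element
    text: str


VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}   # JavaScript and CSS are discarded


class NodeCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.stack: List[Tuple[str, Optional[str]]] = []  # open (tag, class) pairs
        self.nodes: List[Node] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                closed = self.stack[i:]
                del self.stack[i:]
                self.skip_depth -= sum(1 for t, _ in closed if t in SKIPPED_TAGS)
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.skip_depth == 0:
            path = "/".join(t for t, _ in self.stack)
            self.nodes.append(Node(path, self.stack[-1][1], text))


def parse_nodes(source: str) -> List[Node]:
    collector = NodeCollector()
    collector.feed(source)
    collector.close()
    return collector.nodes


sample = (
    "<html><head><style>p{color:red}</style><script>var a = 1;</script></head>"
    "<body><div class='info'><p>Acme Ltd</p><br>"
    "<span class='mail'>a@example.com</span></div></body></html>"
)
nodes = parse_nodes(sample)
```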
S13: extracting the main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies, to obtain a second node DOM tree of each participant company.
In this embodiment, the main body part refers to the central text portion of each participant company's page and excludes text such as that of the navigation bar. The second node DOM tree of each participant company is obtained by using the double-end weight judging method, which compares each page with the other pages from both the head and the tail.
Preferably, the extracting the main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies, to obtain the second node DOM tree of each participant company, includes:
for each participant company, taking the first node DOM tree of that participant company as a target first node DOM tree, and taking the first node DOM tree of any other participant company as a candidate first node DOM tree;
sequentially acquiring each first node from the head of the target first node DOM tree, acquiring from each candidate first node DOM tree the second node at the same position as the first node, and matching the information of the first node against the information of the second node until the two no longer match, whereupon the index of that first node is recorded as a starting subscript;
sequentially acquiring each third node from the tail of the target first node DOM tree, acquiring from each candidate first node DOM tree the fourth node at the same position as the third node, and matching the information of the third node against the information of the fourth node until the two no longer match, whereupon the index of that third node is recorded as an ending subscript;
acquiring the number of times each starting subscript and each ending subscript is recorded;
and determining the second node DOM tree of the participant company according to the recording counts.
Further, the determining the second node DOM tree of the participant company according to the recording counts includes:
taking the starting subscript with the most records as a target starting subscript, and taking the ending subscript with the most records as a target ending subscript;
and intercepting, from the target first node DOM tree, the main body part from the target starting subscript to the target ending subscript, and determining the main body part to be the second node DOM tree of the participant company.
Illustratively, there are four participant companies, and four first node DOM trees are obtained through parsing: html0, html2, html3 and html4. The target first node DOM tree is determined to be html0, and the candidate first node DOM trees are html2, html3 and html4, wherein html0 includes the information of the first nodes: html0 = [A1, A2, A3, A4]; html2 includes the information of the second nodes: html2 = [B1, B2, B3, B4]; html3 includes the information of the second nodes: html3 = [C1, C2, C3, C4]; and html4 includes the information of the second nodes: html4 = [D1, D2, D3, D4].
Each first node is acquired sequentially from the head of html0. Candidate first node DOM tree html2 is selected and A1 is text-matched against B1; if A1 and B1 match completely, A2 is text-matched against B2; if A2 and B2 do not match, 2 is recorded as a starting subscript. Then A4 is text-matched against B4; if A4 and B4 match completely, A3 is text-matched against B3; if A3 and B3 do not match, 3 is recorded as an ending subscript. Candidate first node DOM tree html3 is selected next and A1 is text-matched against C1; A1 and C1 do not match, so 1 is recorded as a starting subscript. A4 is then text-matched against C4; if A4 and C4 match completely, A3 is text-matched against C3; if A3 and C3 do not match, 3 is recorded as an ending subscript. Finally, the remaining candidate first node DOM tree html4 is selected and A1 is text-matched against D1; A1 and D1 do not match, so 1 is recorded as a starting subscript. A4 is then text-matched against D4; if A4 and D4 match completely, A3 is text-matched against D3; if A3 and D3 do not match, 3 is recorded as an ending subscript.
The starting subscript 1 is recorded 2 times, the starting subscript 2 is recorded 1 time, and the ending subscript 3 is recorded 3 times. The target starting subscript is therefore 1 and the target ending subscript is 3, so the main body part of the target first node DOM tree is [A1, A2, A3], and [A1, A2, A3] is taken as the second node DOM tree.
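The worked example can be reproduced with the short sketch below. It assumes each page has already been flattened into a list of node text strings and uses 1-based subscripts as in the example; the function names and the sample strings are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def first_mismatch_from_head(target: Sequence[str],
                             candidate: Sequence[str]) -> Optional[int]:
    """1-based index of the first target node that differs, scanning forward."""
    for i, (a, b) in enumerate(zip(target, candidate)):
        if a != b:
            return i + 1
    return None


def first_mismatch_from_tail(target: Sequence[str],
                             candidate: Sequence[str]) -> Optional[int]:
    """1-based index of the first target node that differs, scanning backward."""
    pairs = zip(reversed(target), reversed(candidate))
    for k, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return len(target) - k + 1
    return None


def extract_body(target: Sequence[str],
                 candidates: Sequence[Sequence[str]]) -> List[str]:
    starts: Counter = Counter()
    ends: Counter = Counter()
    for candidate in candidates:
        start = first_mismatch_from_head(target, candidate)
        end = first_mismatch_from_tail(target, candidate)
        if start is not None:
            starts[start] += 1
        if end is not None:
            ends[end] += 1
    if not starts or not ends:
        return list(target)  # nothing to trim against
    # The most frequently recorded subscripts delimit the main body part.
    best_start = starts.most_common(1)[0][0]
    best_end = ends.most_common(1)[0][0]
    return list(target[best_start - 1:best_end])


html0 = ["nav-v1", "Acme", "Acme profile", "footer"]
html2 = ["nav-v1", "Beta", "Beta profile", "footer"]
html3 = ["nav-v2", "Gamma", "Gamma profile", "footer"]
html4 = ["nav-v3", "Delta", "Delta profile", "footer"]
body = extract_body(html0, [html2, html3, html4])
```

As in the example, starting subscript 1 is recorded twice, starting subscript 2 once and ending subscript 3 three times, so the first three nodes of `html0` are kept.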
In the embodiment, the double-end weight judging method is used for extracting the main body part of the website of the participant, and unnecessary parts in the page are deleted, so that the efficiency of extracting the detailed page information of the participant is improved.
S14: extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules.
In this embodiment, the plurality of preset rules include company name node rules, mailbox address node rules, telephone number node rules, company address node rules, company detail page node rules and the like. Other rules may also be used, and the target nodes may be nodes corresponding to any structured text information; the invention is not limited herein.
Preferably, the extracting the PATHs and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules includes:
extracting the PATHs and CLASS attributes of company name nodes in the main body part of each second node DOM tree according to a preset company name suffix set;
extracting the PATHs and CLASS attributes of mailbox address nodes in the main body part of each second node DOM tree according to preset mailbox address rules;
extracting the PATHs and CLASS attributes of telephone number nodes in the main body part of each second node DOM tree according to preset telephone number rules;
extracting the PATHs and CLASS attributes of company address nodes in the main body part of each second node DOM tree according to a multi-language address recognition algorithm;
and extracting the PATHs and CLASS attributes of company profile nodes in the main body part of each second node DOM tree according to a node text density algorithm.
Further, the extracting the PATHs and CLASS attributes of company name nodes in the main body part of each second node DOM tree according to a preset company name suffix set includes:
extracting the text information in the node information of each node in the second node DOM tree of each participant company;
judging, by regular-expression matching, whether the text information of each node contains any company name suffix in the preset company name suffix set;
when the text information of any node is determined to contain any company name suffix in the preset company name suffix set, recording the PATH and CLASS attribute corresponding to that node;
calculating the number of times each PATH and each CLASS attribute is recorded;
and taking the PATH with the most records as the PATH of the company name node of each participant company, and taking the CLASS attribute with the most records as the CLASS attribute of the company name node of each participant company.
In this embodiment, the preset company name suffix set is a set of common company name suffixes in the languages of various countries. The text information of each node in the second node DOM tree is matched against the preset company name suffix set by regular-expression matching. If the text information of any node matches any company name suffix in the set, the PATH and CLASS attribute of that node are recorded. After every node has been matched, the PATH with the most records is taken as the PATH of the company name node of each participant company, and the CLASS attribute with the most records is taken as the CLASS attribute of the company name node of each participant company.
In this embodiment, matching the text information of each node in the second node DOM tree against the preset company name suffix set and selecting the PATH and CLASS attribute with the most records improves the accuracy of extracting the company name from the page information.
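The suffix matching and voting just described might be sketched as follows. The suffix list is a small illustrative sample rather than the preset set used by the invention, and the `Node` tuple, the trailing-boundary check and the function name are assumptions.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    path: str
    css_class: Optional[str]
    text: str


# Illustrative sample only; a real suffix set would cover many more languages.
COMPANY_SUFFIXES = ["Ltd", "Limited", "Inc", "GmbH", "LLC", "有限公司"]
SUFFIX_RE = re.compile(
    "(?:" + "|".join(re.escape(s) for s in COMPANY_SUFFIXES) + r")(?!\w)"
)


def vote_company_name_node(
    pages: List[List[Node]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return the most frequently recorded PATH and CLASS of suffix matches."""
    paths: Counter = Counter()
    classes: Counter = Counter()
    for page in pages:
        for node in page:
            if SUFFIX_RE.search(node.text):
                paths[node.path] += 1
                if node.css_class:
                    classes[node.css_class] += 1
    best_path = paths.most_common(1)[0][0] if paths else None
    best_class = classes.most_common(1)[0][0] if classes else None
    return best_path, best_class


pages = [
    [Node("html/body/div/h1", "company-name", "Acme Trading Co., Ltd."),
     Node("html/body/div/p", "profile", "We make widgets.")],
    [Node("html/body/div/h1", "company-name", "Beta Tools GmbH"),
     Node("html/body/div/p", "profile", "Partner of Acme Ltd since 2001")],
]
name_path, name_class = vote_company_name_node(pages)
```

Voting across several pages is what lets the heading node win even though one profile paragraph also happens to mention a company suffix.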
Further, the extracting the PATHs and CLASS attributes of mailbox address nodes in the main body part of each second node DOM tree according to preset mailbox address rules includes:
extracting the text information in the node information of each node in the second node DOM tree of each participant company;
judging, by regular-expression matching, whether the text information of each node matches any mailbox address rule in a preset mailbox address rule set;
when the text information of any node is determined to match any mailbox address rule in the preset mailbox address rule set, recording the PATH and CLASS attribute corresponding to that node;
calculating the number of times each PATH and each CLASS attribute is recorded;
and taking the PATH with the most records as the PATH of the mailbox address node of each participant company, and taking the CLASS attribute with the most records as the CLASS attribute of the mailbox address node of each participant company.
In this embodiment, the preset mailbox address rule set is a set of globally uniform mailbox (e-mail) address format rules. The text information of each node in the second node DOM tree is matched against the preset mailbox address rule set by regular-expression matching. If the text information of any node matches any mailbox format rule in the preset mailbox address rule set, the PATH and CLASS attribute of that node are recorded. After every node has been matched, the PATH with the most records is taken as the PATH of the mailbox address node of each participant company, and the CLASS attribute with the most records is taken as the CLASS attribute of the mailbox address node of each participant company.
In this embodiment, matching the text information of each node in the second node DOM tree against the preset mailbox address rule set and selecting the PATH and CLASS attribute with the most records improves the accuracy of extracting the mailbox address from the participant company website detail page information.
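For the mailbox address rule, a deliberately simplified pattern is enough to show the matching and recording step. The regular expression below is an assumption for illustration and is far looser than a full address grammar; the `Node` tuple and function name are likewise hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    path: str
    css_class: Optional[str]
    text: str


# Simplified pattern: local part, "@", domain labels, alphabetic top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_nodes(page: List[Node]) -> List[Tuple[str, Optional[str], str]]:
    """Record (PATH, CLASS, address) for every node whose text holds an address."""
    records = []
    for node in page:
        match = EMAIL_RE.search(node.text)
        if match:
            records.append((node.path, node.css_class, match.group(0)))
    return records


page = [
    Node("html/body/div/span", "mail", "Contact: sales@example.com"),
    Node("html/body/div/p", None, "Visit us at booth 12"),
]
email_records = find_email_nodes(page)
```

The recorded paths and classes would then be counted across pages in the same way as for company names.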
Further, the extracting the PATHs and CLASS attributes of telephone number nodes in the main body part of each second node DOM tree according to preset telephone number rules includes:
extracting the text information in the node information of each node in the second node DOM tree of each participant company;
judging, by regular-expression matching, whether the text information of each node matches any telephone number rule in a preset telephone number rule set, and when the text information of any node matches, recording the PATH and CLASS attribute corresponding to that node;
calculating the number of times each PATH and each CLASS attribute is recorded;
and taking the PATH with the most records as the PATH of the telephone number node of each participant company, and taking the CLASS attribute with the most records as the CLASS attribute of the telephone number node of each participant company.
In this embodiment, the preset telephone number rule set consists of the telephone number format rules of each country. The text information of each node in the second node DOM tree is matched against the preset telephone number rule set by regular-expression matching. If the text information of any node matches any telephone number rule in the set, the PATH and CLASS attribute of that node are recorded. After every node has been matched, the PATH with the most records is taken as the PATH of the telephone number node of each participant company, and the CLASS attribute with the most records is taken as the CLASS attribute of the telephone number node of each participant company.
In this embodiment, matching the text information of each node in the second node DOM tree against the preset telephone number rule set and selecting the PATH and CLASS attribute with the most records improves the accuracy of extracting the telephone number from the participant company website detail page information.
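A per-country telephone rule set can be modelled as a mapping from a country code to a compiled pattern. The two patterns below (a mainland China mobile number and a US number written with separators) are rough assumptions for illustration, not the rules used by the invention.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical rule set: one simplified pattern per country.
PHONE_RULES: Dict[str, Pattern[str]] = {
    # Optional +86 prefix followed by an 11-digit mobile number.
    "CN": re.compile(r"(?<!\d)(?:\+?86[- ]?)?1[3-9]\d{9}(?!\d)"),
    # Three groups of digits joined by separators, e.g. (415) 555-0132.
    "US": re.compile(r"(?<!\d)\(?\d{3}\)?[- .]\d{3}[- .]\d{4}(?!\d)"),
}


def match_phone_rules(text: str) -> List[str]:
    """Return the country codes whose telephone rule matches the text."""
    return [code for code, rule in PHONE_RULES.items() if rule.search(text)]


cn_hit = match_phone_rules("Tel: +86 13800138000")
us_hit = match_phone_rules("Call (415) 555-0132")
no_hit = match_phone_rules("Booth 12, Hall A")
```

A node whose text matches any rule would have its PATH and CLASS attribute recorded for the vote.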
Further, the extracting of the PATH paths and CLASS attributes of the company address nodes in the main body part of each second node DOM tree according to the multi-language address recognition algorithm includes:
acquiring the node information of each node in the second node DOM tree;
when address information is obtained by analyzing the node information of any node with the multi-language address recognition algorithm, recording the PATH path and the CLASS attribute corresponding to that node; counting the number of times each PATH path and each CLASS attribute is recorded;
and taking the PATH path with the largest record count as the PATH path of the address information node of the participating company, and taking the CLASS attribute with the largest record count as the CLASS attribute of the address information node of the participating company.
In this embodiment, the multi-language address recognition algorithm is the libpostal algorithm. When address information is obtained by analyzing the node information of any node with libpostal, the PATH path and CLASS attribute of that node are recorded, and the PATH path and CLASS attribute recorded most often are selected, which improves the accuracy of address extraction from the website detail page information of the participating company.
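The address step can be sketched with the recognizer passed in as a function. The text does not say how "address information is obtained" is decided, so the acceptance rule below (a road label and a house number label both present) is an assumption, and `toy_recognizer` is only a stand-in for a real multi-language parser such as libpostal; it is not that library's API.

```python
from collections import Counter


def toy_recognizer(text):
    """Stand-in for a multi-language address parser (NOT libpostal itself).

    Returns (token, label) pairs for tokens that look like address parts.
    """
    labels = []
    for token in text.replace(",", " ").split():
        if token.lower() in {"road", "street", "avenue", "rd", "st"}:
            labels.append((token, "road"))
        elif token.isdigit():
            labels.append((token, "house_number"))
    return labels


def vote_address_node(nodes, recognize=toy_recognizer):
    """Return the most frequently recorded (PATH, CLASS) among address-like nodes."""
    path_counts, class_counts = Counter(), Counter()
    for node in nodes:
        found = {label for _, label in recognize(node["text"])}
        # Assumed acceptance rule: both a road and a house number were recognized.
        if {"road", "house_number"} <= found:
            path_counts[node["path"]] += 1
            if node.get("class"):
                class_counts[node["class"]] += 1
    best_path = path_counts.most_common(1)[0][0] if path_counts else None
    best_class = class_counts.most_common(1)[0][0] if class_counts else None
    return best_path, best_class


address_nodes = [
    {"path": "/html/body/div/p[2]", "class": "addr", "text": "88 Nanshan Road, Shenzhen"},
    {"path": "/html/body/div/p[2]", "class": "addr", "text": "12 Baker Street, London"},
    {"path": "/html/body/div/p[5]", "class": "tel", "text": "Tel 12345"},
]
print(vote_address_node(address_nodes))
```

Keeping the recognizer behind a plain callable means the voting logic stays the same whichever address parser is plugged in.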
Wherein the node text density algorithm comprises:
acquiring the HTML code of the website of a participating company, and parsing the HTML code into a node DOM tree;
calculating the text length of each node in the node DOM tree;
accumulating the text lengths of all nodes in the node DOM tree to obtain the total length;
calculating the duty ratio of the text length of each node relative to the total length;
updating each duty ratio that is smaller than or equal to a preset first duty ratio threshold to zero, and obtaining a text length duty ratio set from the updated duty ratios;
dividing the text length duty ratio set into a plurality of subsets at the zero elements, wherein each subset comprises a run of consecutive non-zero duty ratios;
calculating the duty ratio sum of each subset from the non-zero duty ratios in that subset;
acquiring each target subset whose duty ratio sum is larger than or equal to a preset second duty ratio threshold, and acquiring the PATH paths and CLASS attributes of the target nodes corresponding to all non-zero duty ratios in the target subset;
counting the occurrences of each PATH path and of each CLASS attribute, taking the PATH path that occurs most often as the target PATH path of the participating company, and taking the CLASS attribute that occurs most often as the target CLASS attribute of the participating company;
and parsing the detail page information of the participating company according to the target PATH path and the target CLASS attribute.
In this embodiment, the node DOM tree is obtained as follows: the HTML code of each participating company's website is acquired and parsed into a first node DOM tree, the main body part of the first node DOM tree is extracted to obtain a second node DOM tree, and the second node DOM tree is used as the node DOM tree.
For example, suppose the extracted second node DOM tree is [F1, F2, F3, F4, F5, F6, F7, F8]. The text lengths of F1 through F8 are calculated as $L_1, L_2, \dots, L_8$, and accumulating the text lengths of all nodes gives the total length

$$L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6 + L_7 + L_8.$$

The single-node text duty ratio of each node $F_i$ is then

$$P_i = \frac{L_i}{L}, \qquad i = 1, 2, \dots, 8.$$

The single-node text duty ratio of each node is compared with the preset single-node text duty ratio threshold $\delta$. Here $P_2$ and $P_6$ are smaller than or equal to $\delta$, so $P_2$ and $P_6$ are reassigned to zero. The zero elements are then used to cut the reassigned second node DOM tree [F1, F2, F3, F4, F5, F6, F7, F8] into two subsets: the first subset [F3, F4, F5] and the second subset [F7, F8]. Accumulating the single-node text duty ratios of the nodes in each subset gives the node block text duty ratio of the first subset, $setpro_1 = P_3 + P_4 + P_5$, and of the second subset, $setpro_2 = P_7 + P_8$. Because $setpro_1$ is larger than the preset node block text duty ratio threshold, the PATH paths and CLASS attributes corresponding to F3, F4 and F5 in the first subset are recorded.
In this embodiment, the node text density algorithm reassigns the single-node text duty ratios of the second node DOM tree so that nodes carrying little or no text are set to zero, and each run of consecutive non-zero nodes is kept as a subset. The PATH path and CLASS attribute recorded most often among the nodes of the target subset are taken as the PATH path and CLASS attribute of the company detail page node, which improves both the accuracy and the efficiency of extracting the company details from the website detail page information of the participating company.
In this embodiment, the detail page information of each participating company is obtained and parsed through a plurality of preset rules, which improves the efficiency and accuracy of detail page information extraction as well as the intelligence and flexibility of website structured information extraction.
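The node text density steps can be sketched as below. This is a rough illustration under stated assumptions: nodes are a flat, ordered list of dicts with `path`, `class` and `text` keys, and the names `dense_blocks` and `profile_target`, the thresholds and the sample text lengths are all invented for the example.

```python
from collections import Counter


def dense_blocks(nodes, node_threshold, block_threshold):
    """Return runs of consecutive text-heavy nodes whose summed ratio is large enough."""
    lengths = [len(node["text"]) for node in nodes]
    total = sum(lengths)
    if total == 0:
        return []
    # Per-node share of the total text; shares at or below the threshold become zero.
    ratios = [length / total for length in lengths]
    ratios = [r if r > node_threshold else 0.0 for r in ratios]

    # Split the sequence at zero elements into runs of consecutive non-zero shares.
    blocks, current = [], []
    for node, ratio in zip(nodes, ratios):
        if ratio > 0:
            current.append((node, ratio))
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)

    # Keep only the runs whose summed share reaches the block threshold.
    return [[node for node, _ in block]
            for block in blocks
            if sum(ratio for _, ratio in block) >= block_threshold]


def profile_target(nodes, node_threshold, block_threshold):
    """Vote for the most frequent PATH and CLASS among nodes in the kept runs."""
    kept = [n for block in dense_blocks(nodes, node_threshold, block_threshold) for n in block]
    paths = Counter(n["path"] for n in kept)
    classes = Counter(n["class"] for n in kept if n["class"])
    best_path = paths.most_common(1)[0][0] if paths else None
    best_class = classes.most_common(1)[0][0] if classes else None
    return best_path, best_class


# Eight nodes F1..F8 with invented text lengths that sum to 100 characters.
spec = [("/html/body/h1", "title", 6), ("/html/body/hr", "", 1),
        ("/html/body/div/p", "profile", 30), ("/html/body/div/p", "profile", 25),
        ("/html/body/div/p", "profile", 20), ("/html/body/span", "", 1),
        ("/html/body/ul/li", "link", 10), ("/html/body/ul/li", "link", 7)]
density_nodes = [{"path": p, "class": c, "text": "x" * n} for p, c, n in spec]
print(profile_target(density_nodes, node_threshold=0.05, block_threshold=0.5))
```

With these sample lengths only the run F3 to F5 reaches the block threshold, so its shared PATH and CLASS win the vote.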
S15: parsing the detail page information of each participating company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
In this embodiment, the PATH paths and CLASS attributes of the different target nodes are used to parse the second node DOM tree of each participating company's website, so as to obtain the website information of each participating company.
Preferably, the parsing of the detail page information of each participating company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree includes:
judging whether each target node in the main body part of each second node DOM tree contains a CLASS attribute;
when the target node contains a CLASS attribute, parsing the detail page information of the participating company by using the CLASS attribute of the target node;
and when the target node does not contain a CLASS attribute, parsing the detail page information of the participating company by using the PATH path of the target node.
In this embodiment, the second node DOM tree is parsed by preferentially using the CLASS attributes of the target nodes of each participating company's website, falling back to the PATH path only when a target node has no CLASS attribute, so as to obtain the website information of each participating company, which improves the efficiency of obtaining the detail page information of each participating company.
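The CLASS-first lookup with a PATH fallback might look like the sketch below. It assumes a well-formed page so that the standard-library `xml.etree.ElementTree` can parse it (real HTML usually needs a more tolerant parser), and the `target` dictionary layout and the name `extract_field` are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_field(root, target):
    """Return the text of nodes selected by CLASS if present, otherwise by PATH.

    `target` is an assumed dict: {'class': str or None, 'path': ElementTree path}.
    """
    cls = target.get("class")
    if cls:
        # Preferred route: select every element carrying the recorded CLASS value.
        hits = root.findall(f".//*[@class='{cls}']")
    else:
        # Fallback route: follow the recorded PATH from the root element.
        hits = root.findall(target["path"])
    return ["".join(el.itertext()).strip() for el in hits]


page = ET.fromstring(
    "<html><body><div>"
    "<p class='name'>Acme Co., Ltd.</p>"
    "<p>+86 13800138000</p>"
    "</div></body></html>"
)
print(extract_field(page, {"class": "name", "path": "./body/div/p[1]"}))
print(extract_field(page, {"class": None, "path": "./body/div/p[2]"}))
```

One plausible reason to prefer CLASS is that a class name tends to survive small layout differences between pages, while a positional path breaks as soon as a sibling element is added or removed.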
In summary, in the method for extracting website detail page information of a participating company according to this embodiment, on the one hand the double-end weight determination method is used to extract the main body part of the participating company's website and to delete unnecessary parts of the page, which improves the efficiency of page extraction; on the other hand, the detail page information of each participating company is obtained and parsed through a plurality of preset rules, which improves the efficiency and accuracy of detail page information extraction as well as the intelligence and flexibility of website structured information extraction.
Example two
Fig. 2 is a block diagram of a device for extracting website detail page information of a participating company according to a second embodiment of the present invention.
In some embodiments, the device 20 for extracting website detail page information of a participating company may comprise a plurality of functional modules composed of program code segments. The program code of each segment in the device 20 may be stored in a memory of the electronic device and executed by the at least one processor to extract the company page information of the website (see the description of Fig. 1 for details).
In this embodiment, the device 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 201, a first parsing module 202, a first extraction module 203, a second extraction module 204 and a second parsing module 205. A module in the present invention refers to a series of computer program segments that are stored in a memory, can be executed by at least one processor and perform a fixed function. The functions of the respective modules are described in detail below.
The acquisition module 201 is configured to acquire a detail page link set of a plurality of participating company websites and to download source code according to each link in the detail page link set.
In this embodiment, the links of a plurality of participating companies are obtained from different websites. For example, the information and links of all participating companies are obtained from an exhibition website, and the source code of each participating company's website is downloaded according to those links.
The first parsing module 202 is configured to acquire the HTML code in each source code and to parse the HTML code into a first node DOM tree.
In this embodiment, the JavaScript and CSS code is deleted from the source code of each participating company and the HTML code is retained; an HTML parser is then used to parse the HTML code of each participating company into a first node DOM tree.
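As a rough sketch of this step, the standard-library `html.parser` can drop script and style content while collecting text nodes together with their tag path and CLASS attribute. The flat node list, the simplified path (tag names only, without sibling indices) and the names `NodeCollector` and `collect_nodes` are assumptions made for illustration; the text does not specify which HTML parser or tree representation is used.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collect text nodes with a simplified tag path and the CLASS of their parent tag."""

    SKIPPED = {"script", "style"}                          # JavaScript and CSS are discarded
    VOID = {"br", "img", "meta", "link", "input", "hr"}    # tags without a closing tag

    def __init__(self):
        super().__init__()
        self.stack = []        # open tags from the root down: (tag, class)
        self.nodes = []        # collected text nodes
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if tag in self.SKIPPED:
            self.skip_depth += 1
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth and self.stack:
            path = "/" + "/".join(tag for tag, _ in self.stack)
            self.nodes.append({"path": path, "class": self.stack[-1][1], "text": text})


def collect_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    return collector.nodes


sample_html = (
    "<html><head><style>p{color:red}</style><script>var a=1;</script></head>"
    "<body><div class='info'><p class='name'>Acme Co., Ltd.</p></div></body></html>"
)
print(collect_nodes(sample_html))
```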
The first extraction module 203 is configured to extract, according to the first node DOM trees of the plurality of participating companies, the main body part of each first node DOM tree by using the double-end weight determination method, so as to obtain the second node DOM tree of each participating company.
In this embodiment, the main body part refers to the central text portion of each participating company's page and does not include the text information of the navigation bar portion; the second node DOM tree of each participating company is obtained by using the double-end weight determination method.
Preferably, the first extraction module 203 extracting, according to the first node DOM trees of the plurality of participating companies, the main body part of each first node DOM tree by using the double-end weight determination method to obtain the second node DOM tree of each participating company includes:
aiming at each participant, taking a first node DOM tree of the participant as a target first node DOM tree, and taking the first node DOM tree of any other participant as a candidate first node DOM tree;
sequentially acquiring each first node from the head of the target first node DOM tree, acquiring from each candidate first node DOM tree the second node at the same position as the first node, and matching the information of the first node against the information of the second node until the information of the first node does not match the information of the second node, and recording the position of that first node as a starting subscript;
sequentially acquiring each third node from the tail of the target first node DOM tree, acquiring from each candidate first node DOM tree the fourth node at the same position as the third node, and matching the information of the third node against the information of the fourth node until the information of the third node does not match the information of the fourth node, and recording the position of that third node as an ending subscript;
acquiring the recording times of each starting subscript and each ending subscript;
and determining a second node DOM tree of the participant according to the record times.
Further, the determining the second node DOM tree of the participant according to the record times includes:
taking the starting subscript with the largest recording times as a target starting subscript, and taking the ending subscript with the largest recording times as a target ending subscript;
and a main body part from the target starting subscript to the target ending subscript is intercepted from the target first node DOM tree, and the main body part is determined to be the second node DOM tree of the participant.
For example, suppose there are four participating companies and four first node DOM trees are obtained through parsing: html0, html2, html3 and html4. The target first node DOM tree is determined to be html0, and the candidate first node DOM trees are html2, html3 and html4. html0 contains the information of the first nodes, html0 = [A1, A2, A3, A4]; html2, html3 and html4 contain the information of the second nodes, html2 = [B1, B2, B3, B4], html3 = [C1, C2, C3, C4] and html4 = [D1, D2, D3, D4].
Each first node is acquired in turn from the head of html0. The candidate first node DOM tree html2 is selected and A1 is text-matched against B1; if A1 and B1 match completely, A2 is matched against B2, and if A2 and B2 do not match, 2 is recorded as a starting subscript. A4 is then matched against B4; if A4 and B4 match completely, A3 is matched against B3, and if A3 and B3 do not match, 3 is recorded as an ending subscript. Next, the candidate first node DOM tree html3 is selected and A1 is matched against C1; A1 and C1 do not match, so 1 is recorded as a starting subscript. A4 is matched against C4; if A4 and C4 match completely, A3 is matched against C3, and if A3 and C3 do not match, 3 is recorded as an ending subscript. Finally, the remaining candidate first node DOM tree html4 is selected and A1 is matched against D1; A1 and D1 do not match, so 1 is recorded as a starting subscript. A4 is matched against D4; if A4 and D4 match completely, A3 is matched against D3, and if A3 and D3 do not match, 3 is recorded as an ending subscript.
The starting subscript 1 is thus recorded 2 times, the starting subscript 2 is recorded 1 time, and the ending subscript 3 is recorded 3 times. The target starting subscript is therefore 1 and the target ending subscript is 3, so the main body part of the target first node DOM tree is [A1, A2, A3], and [A1, A2, A3] is used as the second node DOM tree.
In this embodiment, the double-end weight determination method is used to extract the main body part of the participating company's website and to delete the unnecessary parts of the page, which improves the efficiency of page extraction.
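The double-end comparison in the worked example can be sketched as follows, under the assumption that each first node DOM tree is flattened into a list of comparable node texts. The function name `body_span` and the sample strings are invented for illustration; subscripts are 1-based to match the example.

```python
from collections import Counter


def body_span(target, candidates):
    """Return the most frequently recorded 1-based (start, end) subscripts.

    `target` and each candidate are flat lists of node texts. Nodes shared with a
    candidate at the head or tail (navigation bar, footer) fall outside the span.
    """
    if not candidates:
        return 1, len(target)
    starts, ends = Counter(), Counter()
    for cand in candidates:
        limit = min(len(target), len(cand))
        # Walk from the head until the first mismatch.
        head = 0
        while head < limit and target[head] == cand[head]:
            head += 1
        # Walk from the tail until the first mismatch, without crossing the head.
        tail = 0
        while tail < limit - head and target[-1 - tail] == cand[-1 - tail]:
            tail += 1
        starts[head + 1] += 1
        ends[len(target) - tail] += 1
    return starts.most_common(1)[0][0], ends.most_common(1)[0][0]


html0 = ["a1", "a2", "a3", "foot"]
html2 = ["a1", "b2", "b3", "foot"]   # shares the first and last node with html0
html3 = ["c1", "c2", "c3", "foot"]   # shares only the last node
html4 = ["d1", "d2", "d3", "foot"]   # shares only the last node

start, end = body_span(html0, [html2, html3, html4])
print(start, end, html0[start - 1:end])
```

As in the example, starting subscript 1 is recorded twice and ending subscript 3 three times, so the slice keeps the first three nodes of the target tree.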
The second extraction module 204 is configured to extract the PATH paths and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules.
In this embodiment, the plurality of preset rules include company name node rules, mailbox address node rules, telephone number node rules, company address node rules, company detail page node rules and the like. Other rules may also be used, and a target node may be the node corresponding to any structured text information; the invention is not limited herein.
Preferably, the second extraction module 204 extracting the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree according to the plurality of preset rules includes:
extracting the PATH paths and CLASS attributes of the company name nodes in the main body part of each second node DOM tree according to a preset company name suffix set;
extracting the PATH paths and CLASS attributes of the mailbox address nodes in the main body part of each second node DOM tree according to preset mailbox address rules;
extracting the PATH paths and CLASS attributes of the telephone number nodes in the main body part of each second node DOM tree according to preset telephone number rules;
extracting the PATH paths and CLASS attributes of the company address nodes in the main body part of each second node DOM tree according to a multi-language address recognition algorithm;
and extracting the PATH paths and CLASS attributes of the company profile nodes in the main body part of each second node DOM tree according to a node text density algorithm.
Further, the extracting of the PATH paths and CLASS attributes of the company name nodes in the main body part of each second node DOM tree according to the preset company name suffix set includes:
extracting the text information in the node information of each node in the second node DOM tree of each participating company;
judging, by regular-expression matching, whether the text information of each node contains any company name suffix in the preset company name suffix set;
when the text information of any node is determined to contain any company name suffix in the preset company name suffix set, recording the PATH path and the CLASS attribute corresponding to that node;
counting the number of times each PATH path and each CLASS attribute is recorded;
and taking the PATH path with the largest record count as the PATH path of the company name node of each participating company, and taking the CLASS attribute with the largest record count as the CLASS attribute of the company name node of each participating company.
In this embodiment, the preset company name suffix set is a set of company name suffixes commonly used in the languages of various countries. The text information of each node in the second node DOM tree is matched by regular expression against the preset company name suffix set. If the text information of any node matches any company name suffix in the set, the PATH path and CLASS attribute of that node are recorded. After all nodes have been matched, the PATH path with the largest record count is used as the PATH path of the company name node of each participating company, and the CLASS attribute with the largest record count is used as the CLASS attribute of the company name node of each participating company.
In this embodiment, selecting the PATH path and CLASS attribute recorded most often after regular-expression matching against the preset company name suffix set improves the accuracy of company name extraction from the page information.
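A minimal sketch of the suffix matching follows. The suffix list is a small hypothetical sample rather than the preset set referred to in the text, and the node dictionary shape and the name `vote_company_name_node` are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical suffix set; a real one would cover many more languages.
COMPANY_SUFFIXES = ["Co., Ltd.", "Inc.", "GmbH", "S.A.", "有限公司"]

# One alternation built from the escaped suffixes, longest first so that the
# most specific suffix is tried before shorter ones.
SUFFIX_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(COMPANY_SUFFIXES, key=len, reverse=True)),
    re.IGNORECASE,
)


def vote_company_name_node(nodes):
    """Return the most frequently recorded (PATH, CLASS) among nodes with a suffix."""
    path_counts, class_counts = Counter(), Counter()
    for node in nodes:
        if SUFFIX_RE.search(node["text"]):
            path_counts[node["path"]] += 1
            if node.get("class"):
                class_counts[node["class"]] += 1
    best_path = path_counts.most_common(1)[0][0] if path_counts else None
    best_class = class_counts.most_common(1)[0][0] if class_counts else None
    return best_path, best_class


name_nodes = [
    {"path": "/html/body/div/p[1]", "class": "company", "text": "Acme Trading Co., Ltd."},
    {"path": "/html/body/div/p[1]", "class": "company", "text": "Beta GmbH"},
    {"path": "/html/body/div/h1", "class": "title", "text": "Gamma Inc. booth 12"},
    {"path": "/html/body/div/p[6]", "class": "note", "text": "Contact us"},
]
print(vote_company_name_node(name_nodes))
```

Escaping each suffix matters because entries such as "Co., Ltd." contain dots that would otherwise act as regular-expression wildcards.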
Further, the extracting of the PATH paths and CLASS attributes of the mailbox address nodes in the main body part of each second node DOM tree according to the preset mailbox address rules includes:
extracting the text information in the node information of each node in the second node DOM tree of each participating company;
judging, by regular-expression matching, whether the text information of each node matches any mailbox address rule in the preset mailbox address rule set;
when the text information of any node is determined to match any mailbox address rule in the preset mailbox address rule set, recording the PATH path and the CLASS attribute corresponding to that node;
counting the number of times each PATH path and each CLASS attribute is recorded;
and taking the PATH path with the largest record count as the PATH path of the mailbox address node of each participating company, and taking the CLASS attribute with the largest record count as the CLASS attribute of the mailbox address node of each participating company.
In this embodiment, the preset mailbox address rule set is a set of globally uniform mailbox format rules. The text information of each node in the second node DOM tree is matched by regular expression against the preset mailbox address rule set. If the text information of any node matches any mailbox format rule in the preset mailbox address rule set, the PATH path and CLASS attribute of that node are recorded. After all nodes have been matched, the PATH path with the largest record count is used as the PATH path of the mailbox address node of each participating company, and the CLASS attribute with the largest record count is used as the CLASS attribute of the mailbox address node of each participating company.
In this embodiment, selecting the PATH path and CLASS attribute recorded most often after regular-expression matching against the preset mailbox address rule set improves the accuracy of mailbox address extraction from the website detail page information of the participating company.
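For the mailbox address rule, a single simplified pattern is often enough to illustrate the idea. The pattern below is a common approximation, not a full address grammar, and it is not taken from the text; the record layout (a counter keyed by kind and value) and the function names are also hypothetical.

```python
import re
from collections import Counter

# Assumed rule set: one simplified mailbox pattern (local part, "@", domain, TLD).
MAILBOX_RULES = [re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")]


def record_mailbox_nodes(nodes):
    """Count how often each PATH and each CLASS is recorded for matching nodes."""
    records = Counter()
    for node in nodes:
        if any(rule.search(node["text"]) for rule in MAILBOX_RULES):
            records[("PATH", node["path"])] += 1
            if node.get("class"):
                records[("CLASS", node["class"])] += 1
    return records


def most_recorded(records, kind):
    """Return the value of the given kind ('PATH' or 'CLASS') with the largest count."""
    candidates = {value: count for (k, value), count in records.items() if k == kind}
    return max(candidates, key=candidates.get) if candidates else None


mail_nodes = [
    {"path": "/html/body/div/a", "class": "mail", "text": "Email: sales@example.com"},
    {"path": "/html/body/div/a", "class": "mail", "text": "support@example.org"},
    {"path": "/html/body/footer/p", "class": "footer-mail", "text": "webmaster@example.net"},
    {"path": "/html/body/div/p", "class": "note", "text": "info at example dot com"},
]
mail_records = record_mailbox_nodes(mail_nodes)
print(most_recorded(mail_records, "PATH"), most_recorded(mail_records, "CLASS"))
```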
Further, the extracting of the PATH paths and CLASS attributes of the telephone number nodes in the main body part of each second node DOM tree according to the preset telephone number rules includes:
extracting the text information in the node information of each node in the second node DOM tree of each participating company;
when the text information of any node matches any telephone number rule in the preset telephone number rule set, recording the PATH path and the CLASS attribute corresponding to that node;
counting the number of times each PATH path and each CLASS attribute is recorded;
and taking the PATH path with the largest record count as the PATH path of the telephone number node of each participating company, and taking the CLASS attribute with the largest record count as the CLASS attribute of the telephone number node of each participating company.
In this embodiment, the preset telephone number rule set contains the telephone number format rules of each country. The text information of each node in the second node DOM tree is matched by regular expression against the preset telephone number rule set. If the text information of any node matches any telephone number rule in the set, the PATH path and CLASS attribute of that node are recorded. After all nodes have been matched, the PATH path with the largest record count is used as the PATH path of the telephone number node of each participating company, and the CLASS attribute with the largest record count is used as the CLASS attribute of the telephone number node of each participating company.
In this embodiment, selecting the PATH path and CLASS attribute recorded most often after regular-expression matching against the preset telephone number rule set improves the accuracy of telephone number extraction from the website detail page information of the participating company.
Further, the extracting of the PATH paths and CLASS attributes of the company address nodes in the main body part of each second node DOM tree according to the multi-language address recognition algorithm includes:
acquiring the node information of each node in the second node DOM tree;
when address information is obtained by analyzing the node information of any node with the multi-language address recognition algorithm, recording the PATH path and the CLASS attribute corresponding to that node; counting the number of times each PATH path and each CLASS attribute is recorded;
and taking the PATH path with the largest record count as the PATH path of the address information node of the participating company, and taking the CLASS attribute with the largest record count as the CLASS attribute of the address information node of the participating company.
In this embodiment, the multi-language address recognition algorithm is the libpostal algorithm. When address information is obtained by analyzing the node information of any node with libpostal, the PATH path and CLASS attribute of that node are recorded, and the PATH path and CLASS attribute recorded most often are selected, which improves the accuracy of address extraction from the website detail page information of the participating company.
Wherein the node text density algorithm comprises:
acquiring the HTML code of the website of a participating company, and parsing the HTML code into a node DOM tree;
calculating the text length of each node in the node DOM tree;
accumulating the text lengths of all nodes in the node DOM tree to obtain the total length;
calculating the duty ratio of the text length of each node relative to the total length;
updating each duty ratio that is smaller than or equal to a preset first duty ratio threshold to zero, and obtaining a text length duty ratio set from the updated duty ratios;
dividing the text length duty ratio set into a plurality of subsets at the zero elements, wherein each subset comprises a run of consecutive non-zero duty ratios;
calculating the duty ratio sum of each subset from the non-zero duty ratios in that subset;
acquiring each target subset whose duty ratio sum is larger than or equal to a preset second duty ratio threshold, and acquiring the PATH paths and CLASS attributes of the target nodes corresponding to all non-zero duty ratios in the target subset;
counting the occurrences of each PATH path and of each CLASS attribute, taking the PATH path that occurs most often as the target PATH path of the participating company, and taking the CLASS attribute that occurs most often as the target CLASS attribute of the participating company;
and parsing the detail page information of the participating company according to the target PATH path and the target CLASS attribute.
In this embodiment, the node DOM tree is obtained as follows: the HTML code of each participating company's website is acquired and parsed into a first node DOM tree, the main body part of the first node DOM tree is extracted to obtain a second node DOM tree, and the second node DOM tree is used as the node DOM tree.
For example, suppose the extracted second node DOM tree is [F1, F2, F3, F4, F5, F6, F7, F8]. The text lengths of F1 through F8 are calculated as $L_1, L_2, \dots, L_8$, and accumulating the text lengths of all nodes gives the total length

$$L = L_1 + L_2 + L_3 + L_4 + L_5 + L_6 + L_7 + L_8.$$

The single-node text duty ratio of each node $F_i$ is then

$$P_i = \frac{L_i}{L}, \qquad i = 1, 2, \dots, 8.$$

The single-node text duty ratio of each node is compared with the preset single-node text duty ratio threshold $\delta$. Here $P_2$ and $P_6$ are smaller than or equal to $\delta$, so $P_2$ and $P_6$ are reassigned to zero. The zero elements are then used to cut the reassigned second node DOM tree [F1, F2, F3, F4, F5, F6, F7, F8] into two subsets: the first subset [F3, F4, F5] and the second subset [F7, F8]. Accumulating the single-node text duty ratios of the nodes in each subset gives the node block text duty ratio of the first subset, $setpro_1 = P_3 + P_4 + P_5$, and of the second subset, $setpro_2 = P_7 + P_8$. Because $setpro_1$ is larger than the preset node block text duty ratio threshold, the PATH paths and CLASS attributes corresponding to F3, F4 and F5 in the first subset are recorded.
In this embodiment, the node text density algorithm reassigns the single-node text duty ratios of the second node DOM tree so that nodes carrying little or no text are set to zero, and each run of consecutive non-zero nodes is kept as a subset. The PATH path and CLASS attribute recorded most often among the nodes of the target subset are taken as the PATH path and CLASS attribute of the company detail page node, which improves both the accuracy and the efficiency of extracting the company details from the website detail page information of the participating company.
In this embodiment, the detail page information of each participating company is obtained and parsed through a plurality of preset rules, which improves the efficiency and accuracy of detail page information extraction as well as the intelligence and flexibility of website structured information extraction.
The second parsing module 205 is configured to parse the detail page information of each participating company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
In this embodiment, the PATH paths and CLASS attributes of the different target nodes are used to parse the second node DOM tree of each participating company's website, so as to obtain the website information of each participating company.
Preferably, the second parsing module 205 parsing the detail page information of each participating company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree includes:
judging whether each target node in the main body part of each second node DOM tree contains a CLASS attribute;
when the target node contains a CLASS attribute, parsing the detail page information of the participating company by using the CLASS attribute of the target node;
and when the target node does not contain a CLASS attribute, parsing the detail page information of the participating company by using the PATH path of the target node.
In this embodiment, the second node DOM tree is parsed by preferentially using the CLASS attributes of the target nodes of each participating company's website, falling back to the PATH path only when a target node has no CLASS attribute, so as to obtain the website information of each participating company, which improves the efficiency of obtaining the detail page information of each participating company.
In summary, in the device for extracting website detail page information of a participating company according to this embodiment, on the one hand the double-end weight determination method is used to extract the main body part of the participating company's website and to delete unnecessary parts of the page, which improves the efficiency of page extraction; on the other hand, the detail page information of each participating company is obtained and parsed through a plurality of preset rules, which improves the efficiency and accuracy of detail page information extraction as well as the intelligence and flexibility of website structured information extraction.
Example three
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in Fig. 3 does not limit the embodiments of the present invention. The configuration may be either bus-type or star-type, and the electronic device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may further include a client device, where the client device includes, but is not limited to, any electronic product that can interact with a client by way of a keyboard, a mouse, a remote control, a touch pad, or a voice control device, such as a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that can be adapted to the present invention are also included in the scope of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 is used to store program code and various data, such as the participant website detail page information extraction device 20 installed in the electronic device 3, and to provide high-speed, automatic access to programs or data during operation of the electronic device 3. The memory 31 includes Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 may consist of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The at least one processor 32 is the control unit of the electronic device 3: it connects the components of the entire electronic device 3 through various interfaces and lines, and it executes the functions of the electronic device 3 and processes data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connected communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further comprise a power source (such as a battery) for powering the various components. The power source may preferably be logically connected to the at least one processor 32 via a power management device, so that charging, discharging, and power consumption are managed by the power management device. The power source may also include one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device 3 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described herein.
It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited to this configuration.
The integrated units described above, when implemented in the form of software functional modules, may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, etc.) or a processor to perform portions of the methods described in the various embodiments of the invention.
In a further embodiment, in connection with fig. 2, the at least one processor 32 may execute the operating means of the electronic device 3 as well as various installed applications (such as the participant website detail page information extraction means 20), program code, etc., such as the various modules described above.
The memory 31 has program code stored therein, and the at least one processor 32 can invoke the program code stored in the memory 31 to perform related functions. For example, each of the modules depicted in Fig. 2 is program code stored in the memory 31 and executed by the at least one processor 32 to perform the functions of the respective modules, so as to extract the participant company website detail page information.
In one embodiment of the invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement the participant company website detail page information extraction function.
For the specific way in which the at least one processor 32 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the modules is only a logical function division, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it will be obvious that the term "comprising" does not exclude other elements or that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (9)

1. A method for extracting website detail page information of a participant company, characterized in that the method comprises the following steps:
acquiring a detail page link set of a plurality of participant websites, and downloading source codes according to each link in the detail page link set;
acquiring an HTML code in each source code, and analyzing the HTML code into a first node DOM tree;
extracting a main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies to obtain a second node DOM tree of each participant company; the double-end weight judging method is to sequentially acquire each node from the head part and the tail part of a target node DOM tree, acquire the same node as the node acquired from the target node DOM tree from each candidate node DOM tree, match node information acquired through the target node DOM tree and the candidate node DOM tree, and record the node acquired from the target node DOM tree as a starting subscript when the node information is not matched;
extracting PATH paths and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules;
and parsing the detail page information of each participant company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
2. The method for extracting website detail page information of a participant company according to claim 1, wherein the extracting PATH paths and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules comprises:
extracting PATH paths and CLASS attributes of company name nodes in the main body part of each second node DOM tree according to a preset company name suffix set;
extracting PATH paths and CLASS attributes of mailbox address nodes in the main body part of each second node DOM tree according to preset mailbox address rules;
extracting PATH paths and CLASS attributes of telephone number nodes in the main body part of each second node DOM tree according to preset telephone number rules;
extracting PATH paths and CLASS attributes of company address nodes in the main body part of each second node DOM tree according to a multi-language address recognition algorithm;
and extracting PATH paths and CLASS attributes of company profile nodes in the main body part of each second node DOM tree according to a node text density algorithm.
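The rule-based extraction of claim 2 can be illustrated for three of the five rules (company name suffix, mailbox address, telephone number) with the following minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the suffix set, both regular expressions, the PATH notation, and the names classify and extract_targets are all hypothetical, and the multi-language address recognition and node text density algorithms are not covered.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical preset rules; a real suffix set and real patterns would be larger.
COMPANY_SUFFIXES = ("Co., Ltd.", "Inc.", "GmbH", "LLC")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def classify(text: str) -> Optional[str]:
    """Return the name of the first preset rule that the node text satisfies."""
    if EMAIL_RE.search(text):
        return "email"
    if PHONE_RE.search(text):
        return "phone"
    if text.endswith(COMPANY_SUFFIXES):
        return "company_name"
    return None


def extract_targets(root: ET.Element) -> Dict[str, Dict[str, Optional[str]]]:
    """Record the PATH and CLASS of the first node matching each rule."""
    found: Dict[str, Dict[str, Optional[str]]] = {}

    def walk(node: ET.Element, path: str) -> None:
        text = (node.text or "").strip()
        field = classify(text) if text else None
        if field and field not in found:
            found[field] = {"path": path, "class": node.get("class")}
        counts: Dict[str, int] = {}
        for child in node:
            # Number same-tag siblings so that every PATH identifies one node.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return found


body = ET.fromstring(
    "<div>"
    "<h1 class='name'>Acme Trading Co., Ltd.</h1>"
    "<ul><li>Tel: +86 755 1234 5678</li><li>sales@acme.example</li></ul>"
    "</div>"
)
targets = extract_targets(body)
print(targets)
```

Each entry pairs a PATH with a CLASS value that may be absent, which is the information the parsing step of claim 3 chooses between.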
3. The method for extracting website detail page information of a participant company according to claim 1, wherein said parsing the detail page information of each participant company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree comprises:
judging whether each target node in the main body part of each second node DOM tree contains a CLASS attribute;
when the target node contains the CLASS attribute, parsing the detail page information of the participant company by using the CLASS attribute of the target node;
and when the target node does not contain the CLASS attribute, parsing the detail page information of the participant company by using the PATH of the target node.
4. The method for extracting website detail page information of a participant company according to claim 1, wherein said extracting a main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies, to obtain a second node DOM tree of each participant company, comprises:
for each participant company, taking the first node DOM tree of the participant company as a target first node DOM tree, and taking the first node DOM tree of any other participant company as a candidate first node DOM tree;
sequentially acquiring each first node from the head of the target first node DOM tree, acquiring a second node identical to the first node from each candidate first node DOM tree, and matching the information of the first node with the information of the second node until the information of the first node does not match the information of the second node, and recording the first node as a starting subscript;
sequentially acquiring each third node from the tail of the target first node DOM tree, acquiring a fourth node identical to the third node from each candidate first node DOM tree, and matching the information of the third node with the information of the fourth node until the information of the third node does not match the information of the fourth node, and recording the third node as an ending subscript;
acquiring the number of times each starting subscript and each ending subscript is recorded;
and determining the second node DOM tree of the participant company according to the numbers of recording times.
5. The method for extracting website detail page information of a participant company according to claim 4, wherein said determining the second node DOM tree of the participant company according to the numbers of recording times comprises:
taking the starting subscript with the largest number of recording times as a target starting subscript, and taking the ending subscript with the largest number of recording times as a target ending subscript;
and intercepting, from the target first node DOM tree, the main body part from the target starting subscript to the target ending subscript, and determining the main body part as the second node DOM tree of the participant company.
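The head scan, tail scan, and vote over recorded subscripts in claims 4 and 5 can be sketched in Python as follows. This is a minimal illustration and not the claimed implementation. It assumes that each first node DOM tree has already been flattened into a list of node-information strings in document order, and that "the same node" in a candidate tree means the node at the same offset from the head or from the tail. The function names and sample pages are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def start_subscript(target: Sequence[str], candidate: Sequence[str]) -> int:
    """Index of the first target node, scanning from the head, that differs."""
    limit = min(len(target), len(candidate))
    i = 0
    while i < limit and target[i] == candidate[i]:
        i += 1
    return i


def end_subscript(target: Sequence[str], candidate: Sequence[str]) -> int:
    """Index of the first target node, scanning from the tail, that differs."""
    limit = min(len(target), len(candidate))
    k = 0
    while k < limit and target[-1 - k] == candidate[-1 - k]:
        k += 1
    return len(target) - 1 - k


def extract_body(target: Sequence[str], candidates: Sequence[Sequence[str]]) -> List[str]:
    """Keep the span between the most frequently recorded start and end subscripts."""
    if not candidates:
        return list(target)  # nothing to compare against
    starts = Counter(start_subscript(target, c) for c in candidates)
    ends = Counter(end_subscript(target, c) for c in candidates)
    start = starts.most_common(1)[0][0]
    end = ends.most_common(1)[0][0]
    return list(target[start:end + 1])


# Hypothetical flattened pages that share a site header and footer.
header = ["html", "nav:Home", "nav:Exhibitors"]
footer = ["footer:Contact organizer", "/html"]
page_a = header + ["h1:Acme Co., Ltd.", "p:Booth A12"] + footer
page_b = header + ["h1:Beta GmbH", "p:Booth B07", "p:beta@example.com"] + footer
page_c = header + ["h1:Gamma Inc.", "p:Booth C03"] + footer

body_a = extract_body(page_a, [page_b, page_c])
print(body_a)
```

Voting over the subscripts recorded against several candidate pages makes the chosen boundaries less sensitive to a single candidate that happens to share extra nodes with the target page.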
6. The method for extracting website detail page information of a participant company according to claim 1, wherein the method further comprises:
when the number of obtained participant company website detail pages is greater than a preset number of training samples, randomly selecting N participant company website detail pages from the obtained detail pages for training.
7. A participant website detail page information extraction device, characterized in that the participant website detail page information extraction device comprises:
the acquisition module is used for acquiring detail page link sets of a plurality of participant websites and downloading source codes according to each link in the detail page link sets;
the first analysis module is used for acquiring the HTML codes in each source code and analyzing the HTML codes into a first node DOM tree;
the first extraction module is used for extracting the main body part of each first node DOM tree by using a double-end weight judging method according to the first node DOM trees of the plurality of participant companies to obtain a second node DOM tree of each participant company; the double-end weight judging method is to sequentially acquire each node from the head part and the tail part of a target node DOM tree, acquire the same node as the node acquired from the target node DOM tree from each candidate node DOM tree, match node information acquired through the target node DOM tree and the candidate node DOM tree, and record the node acquired from the target node DOM tree as a starting subscript when the node information is not matched;
the second extraction module is used for extracting PATH paths and CLASS attributes of a plurality of target nodes in the main body part of each second node DOM tree according to a plurality of preset rules;
and the second parsing module is used for parsing the detail page information of each participant company according to the PATH paths and CLASS attributes of the plurality of target nodes in the main body part of each second node DOM tree.
8. An electronic device comprising a processor for implementing the method for extracting website detail page information of a participant company according to any one of claims 1 to 6 when executing a computer program stored in a memory.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for extracting website detail page information of a participant company according to any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010485912.8A CN111625749B (en) 2020-06-01 2020-06-01 Method, device, equipment and medium for extracting website detail page information of participant company

Publications (2)

Publication Number Publication Date
CN111625749A CN111625749A (en) 2020-09-04
CN111625749B true CN111625749B (en) 2023-08-11

Family

ID=72271254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010485912.8A Active CN111625749B (en) 2020-06-01 2020-06-01 Method, device, equipment and medium for extracting website detail page information of participant company

Country Status (1)

Country Link
CN (1) CN111625749B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density
CN103838823A (en) * 2014-01-22 2014-06-04 浙江大学 Website content accessible detection method based on web page templates
CN103853760A (en) * 2012-12-03 2014-06-11 中国移动通信集团公司 Method and device for extracting contents of bodies of web pages
CN105095466A (en) * 2015-07-31 2015-11-25 山东大学 Web text information extraction method
CN106528509A (en) * 2016-11-11 2017-03-22 政和科技股份有限公司 Webpage information extracting method and apparatus
CN106557565A (en) * 2016-11-22 2017-04-05 福州大学 A kind of text message extracting method based on website construction
CN107590219A (en) * 2017-09-04 2018-01-16 电子科技大学 Webpage personage subject correlation message extracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
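Xie Fangli (谢方立). Research on Web Page Topic Information Extraction Technology Based on Node Type Annotation. China Master's Theses Full-text Database (Information Science and Technology), 2017, Chapters 3 and 4. *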

Also Published As

Publication number Publication date
CN111625749A (en) 2020-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant