CN109165373B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN109165373B
Authority
CN
China
Prior art keywords
page
path information
node path
information list
pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811073868.9A
Other languages
Chinese (zh)
Other versions
CN109165373A (en)
Inventor
杨帆
戴超男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201811073868.9A
Publication of CN109165373A
Application granted
Publication of CN109165373B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and a device, wherein the method comprises the following steps: analyzing at least one page to respectively obtain a node path information list corresponding to each page; comparing the structures of the pages in the at least one page based on the node path information list to obtain similar pages; setting a label for the node path information list of the similar page; and generating a page extraction template matched with the label based on the node path information list with the same label.

Description

Data processing method and device
Technical Field
The present application relates to the field of page extraction technologies, and in particular, to a data processing method and apparatus.
Background
At present, when structured information is extracted from the same type of website, a template construction method is usually adopted to extract webpage information.
However, the existing extraction template configuration cannot be applied to information extraction of different website webpages, thereby reducing the general applicability of information extraction.
Disclosure of Invention
In view of this, the present application provides a data processing method and apparatus, so as to solve the technical problem that the information extraction applicability is low because the page extraction template in the prior art cannot extract information from the webpages of different websites.
The application provides a data processing method, which comprises the following steps:
analyzing at least one page to respectively obtain a node path information list corresponding to each page;
comparing the structures of the pages in the at least one page based on the node path information list to obtain similar pages;
setting a label for the node path information list of the similar page;
and generating a page extraction template matched with the label based on the node path information list with the same label.
Preferably, in the above method, comparing the structures of the pages in the at least one page based on the node path information list includes:
performing the following operations on the node path information lists of two pages of the at least one page:
respectively obtaining tree structure root nodes and corresponding sub-trees of a first page and a second page in the two pages based on the node path information lists of the two pages;
when the tree structure root nodes of the two pages are judged to be the same, determining, among the subtrees of the first page, the subtree with the highest similarity to each subtree of the second page, so as to form subtree pairs;
obtaining similarity values of two subtrees in the subtree pair, and obtaining preset weights of the subtrees belonging to the first page in the subtree pair;
obtaining a total similarity value between the first page and the second page based on the similarity value of the subtree pair and the preset weight;
and determining that the first page and the second page are similar pages based on the judgment that the total similarity value is higher than a preset threshold value.
Preferably, the above method further comprises:
obtaining page content in the at least one page;
and comparing the categories and structures of the page contents among the pages in the at least one page to obtain similar pages.
In the above method, preferably, the setting of the label to the node path information list of the similar page includes:
determining target content to be extracted according to the node path information list of the similar page;
and setting the label of the node path information list of the similar page based on the target content.
Preferably, the above method further comprises:
for the node path information lists with the same label, extracting from the node path information lists the strong feature words that have an association relationship with the label, together with their feature attributes;
generating a feature dictionary of the page extraction template based on the strong feature words and the feature attributes thereof;
analyzing the word meaning of the strong characteristic word to obtain a synonym of the strong characteristic word;
and adding the synonym into a feature dictionary of the page extraction template.
Preferably, in the above method, generating a page extraction template matched with the label based on the node path information list with the same label includes:
merging the node path information lists with the same label;
generating a page extraction template matched with the label based on the merged node path information list; the generated page extraction template comprises a plurality of pieces of node path information, the node path information is used for information extraction, and the priority of the node path information containing the strong feature words is higher than the priority of the other node path information.
Preferably, in the above method, merging the node path information lists with the same label includes:
among the node path information lists with the same label, setting a preset mark symbol for each node path information list from which strong feature words are extracted;
comparing the node path information lists one by one according to the node order in the node path information lists to obtain comparison results;
and, based on the comparison results, merging node path information lists whose nodes are all the same into one node path information list, and merging node path information lists that contain a differing node into one node path information list in which the differing node is replaced with the mark symbol.
Preferably, before the node path information lists are compared one by one according to the node order in the node path information lists, the method further includes:
simplifying the node path information list by using the mark symbol;
which specifically comprises the following steps:
for the node path information of a node path information list provided with the mark symbol, retaining at least the tree structure node names and the information used for filling in the strong feature words;
and for the node path information of a node path information list without the mark symbol, retaining the tree structure node names in the node path information.
Preferably, the above method further comprises:
in response to a received page extraction request, obtaining a target page to be extracted and an extraction tag of the target page;
extracting page data of the target page by preferentially using the node path information containing strong feature words in the target page extraction template corresponding to the extraction tag, to obtain an extraction result;
and, when it is judged that the corresponding data has not been extracted in the extraction result, obtaining a node path information list of the target page and generating a corresponding page extraction template.
The present application also provides a data processing apparatus, including:
the page analyzing unit is used for analyzing at least one page so as to respectively obtain a node path information list corresponding to each page;
the similarity comparison unit is used for carrying out structure comparison on the pages in the at least one page based on the node path information list to obtain similar pages;
the label setting unit is used for setting labels for the node path information lists of the similar pages;
and the template generating unit is used for generating a page extraction template matched with the label based on the node path information list with the same label.
According to the technical scheme, after the node path information lists of various pages are analyzed, similarity between different pages is classified based on the node path information lists, so that the similar pages are subjected to label setting to generate the page extraction template under the label, and information extraction is conveniently performed on the corresponding pages. Therefore, in the application, after the webpages of different websites are classified according to the similarity, the webpages with similar structural contents are obtained, and then the corresponding page extraction templates are generated, so that the data extraction of the webpages of different websites is realized.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 2 is a partial flowchart of a first embodiment of the present application;
FIGS. 3, 4 and 5 are exemplary diagrams of embodiments of the present application;
FIGS. 6, 7, 8 and 9 are further partial flowcharts of the first embodiment of the present application;
FIG. 10 is a schematic structural diagram of a data processing apparatus according to a second embodiment of the present application;
FIGS. 11 and 12 are diagrams illustrating another example of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a flowchart of an implementation of a data processing method provided in an embodiment of the present application is shown, where the method is suitable for constructing a page extraction template for pages of different websites, that is, websites with different structures or contents, and further extracting page information. The method in this embodiment may be run in a computer or server having computing capabilities.
Specifically, the method in this embodiment may include the following steps:
Step 101: analyzing at least one page to respectively obtain a node path information list corresponding to each page.
The pages in this embodiment may include pages on one website, or may include pages on multiple websites, and the pages on different websites may be the same (or similar) or different in structure and content. For example, pages on shopping websites, news websites, and advertising websites all differ in structure and content.
It should be noted that in this embodiment the pages must be obtained before they are parsed, for example by reading the pages from a database or by crawling the pages on a website in real time using a tool such as a web crawler.
In this embodiment, the node path information list obtained by analyzing a page may be understood as an XPath (XML Path Language) list of the page, where the XPath list represents the page structure and the structure content by using path expressions.
Specifically, in this embodiment, the XPath list may be generated by constructing a tree structure of the page and then parsing the tree structure. For example, in this embodiment, a third-party library (e.g., lxml) is used to parse HTML (HyperText Markup Language) pages on one or more websites and construct a Document Object Model (DOM) tree, and the DOM tree is then parsed to form a complete XPath list of the corresponding page.
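The parsing step can be illustrated with a minimal sketch. It is not the implementation of this application: to stay self-contained it uses the Python standard-library html.parser instead of lxml, it assumes well-nested markup, and the names PathCollector, node_path_list and VOID_TAGS are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, non-exhaustive assumption).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects one XPath-like path per element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []      # path steps of the currently open elements
        self.counts = [{}]   # per-depth counters of child tag names
        self.paths = []

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        step = f"{tag}[{seen[tag]}]"
        self.paths.append("/" + "/".join(self.stack + [step]))
        if tag not in VOID_TAGS:
            self.stack.append(step)
            self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only pop when the tag matches the top.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counts.pop()


def node_path_list(html):
    """Returns the node path list of one page."""
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


page = ("<html><body><div><p>a</p><p>b</p></div>"
        "<div><span>c</span></div></body></html>")
paths = node_path_list(page)
```

Each entry of paths is a positional path such as /html[1]/body[1]/div[1]/p[2]; a real node path information list may carry attributes and content in addition to this structure.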
Step 102: comparing the structures of the pages in the at least one page based on the node path information list to obtain similar pages.
In this embodiment, structure comparison may be performed on any two pages in the at least one page, so as to determine which pages are similar pages and which are not.
It should be noted that the similar pages in this embodiment may be understood as follows: pages whose similarity value is larger than a certain threshold are similar pages, and the similarity value between pages may be a structural similarity value and/or a content similarity value; that is, two pages being similar means that the two pages are similar in structure and/or content.
Step 103: setting a label for the node path information list of the similar pages.
The label in this embodiment may be extracted from characters in the page content, for example, a keyword in the page content is used as the label; or the label may consist of characters associated with characters in the page content, for example, words that approximate keywords in the page content are used as the label, and so forth.
Setting the label means at least that: the pages corresponding to the node path information lists under the same label are similar pages, and the pages corresponding to the node path information lists under different labels are not similar pages.
Specifically, in this embodiment, the node path information list of the similar pages may be subjected to refinement classification according to the information content name and its type, a label is then generated based on the result of the refinement classification, and the label is set in the node path information list of the similar pages.
Step 104: generating a page extraction template matched with the label based on the node path information list with the same label.
In this embodiment, the page extraction template may be generated by processing the node path information lists with the same label, for example, selecting one node path information list with the highest similarity to other node path information lists in the node path information lists with the same label as the page extraction template; or combining or integrating the node path information lists based on the same label to generate a page extraction template matched with the label, and the like.
According to the above scheme, after the node path information lists of various pages are analyzed, similarity between different pages is classified based on the node path information lists, so that tag setting is performed on similar pages to generate a page extraction template under the tag, and information extraction is performed on corresponding pages. Therefore, in the application, after the webpages of different websites are classified according to the similarity, the webpages with similar structural contents are obtained, and then the corresponding page extraction templates are generated, so that the data extraction of the webpages of different websites is realized.
In an implementation manner, in step 102 in fig. 1, the operations shown in fig. 2 may be performed on the node path information lists of any two pages in the at least one page to determine whether the two pages are similar pages, so as to identify the similar pages among all pages that need to be subjected to information extraction:
Step 201: respectively obtaining the tree structure root nodes and the corresponding subtrees of the first page and the second page in the two pages based on the node path information lists of the two pages.
As shown in fig. 3, the tree structures represented by the node path information lists of the first page and the second page are constructed to obtain a tree structure of the first page and a tree structure of the second page, and further obtain respective tree structure root nodes and corresponding subtrees of the first page and the second page.
It should be noted that fig. 3 only shows an example of a tree structure of two pages, that is, only each node and node position in the tree structure are shown, and other information in the node path information list is not shown, but does not represent that the node path information list does not contain other information except the tree structure.
Step 202: comparing whether the tree structure root nodes of the two pages are the same, and if so, executing step 203.
In this embodiment, it may be compared whether the tree structure root nodes of the two pages are completely consistent or approximately consistent, for example, whether they are characterized by the same root directory folder name in the node path information lists. If the tree structure root nodes of the two pages are completely consistent or approximately consistent, step 203 is executed.
Step 203: determining, among the subtrees of the first page, the subtree with the highest similarity to each subtree in the second page, so as to form subtree pairs.
As shown in fig. 4, among the subtrees a1, a2 and a3 of the first page, the subtree with the highest similarity to each of the subtrees b1, b2 and b3 of the second page is determined to form a subtree pair. For example, if the subtree with the highest similarity to b1 is a2, to b2 is a1, and to b3 is a3, then b1 and a2 form a subtree pair, b2 and a1 form a subtree pair, and b3 and a3 form a subtree pair. It should be noted that different subtrees of the second page may be matched to the same subtree or to different subtrees of the first page. For example, if the subtree with the highest similarity to b1 is a2, to b2 is a2, and to b3 is a1, then b1 and a2 form a subtree pair, b2 and a2 form a subtree pair, and b3 and a1 form a subtree pair.
Step 204: obtaining the similarity values of the two subtrees in each subtree pair, and obtaining the preset weight of the subtree belonging to the first page in each subtree pair.
In this embodiment, an iterative (recursive) scheme that re-enters step 202 may be adopted to calculate the similarity of two subtrees. Specifically, when the similarity value of the two subtrees in a subtree pair is obtained, the method may re-enter step 202 and continue to calculate the total similarity value of the two subtrees according to the scheme in this embodiment until the leaf nodes of the tree structure are reached. The similarity of the leaf nodes, such as their content similarity, is compared to obtain the similarity value between the leaf nodes; the method then returns to the parent-node subtrees one layer above the leaf nodes, calculates the total similarity value between those subtrees, and continues returning layer by layer until the similarity value of the two subtrees in the subtree pair is obtained, as shown in fig. 5.
Wherein, the preset weight of the subtree of the first page can be understood as: the importance degree of each position structure in the page to the user belongs to the inherent attribute of the page, the preset weights of different page position structures may be different, for example, the preset weight of the sub-tree corresponding to the page text is higher than the preset weight of the sub-tree corresponding to the page side bar, and the like. In this embodiment, the preset weight may be obtained in the node path information list or the corresponding page information.
Step 205: obtaining a total similarity value between the first page and the second page based on the similarity values of the subtree pairs and the preset weights.
Specifically, in this embodiment, the similarity value of each subtree pair may be multiplied by the preset weight of the subtree belonging to the first page in that subtree pair, and the products are summed to obtain the total similarity value between the first page and the second page.
For example, the preset weights of the subtrees a1, a2 and a3 in the first page are 0.3, 0.2 and 0.1, respectively. Accordingly, if b1 and a2 form a subtree pair c1, b2 and a1 form a subtree pair c2, and b3 and a3 form a subtree pair c3, then the total similarity value between the first page and the second page is: (similarity value of c1 × 0.2) + (similarity value of c2 × 0.3) + (similarity value of c3 × 0.1). If b1 and a2 form a subtree pair d1, b2 and a2 form a subtree pair d2, and b3 and a1 form a subtree pair d3, then the total similarity value between the first page and the second page is: (similarity value of d1 × 0.2) + (similarity value of d2 × 0.2) + (similarity value of d3 × 0.3).
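The two weighted sums in this example can be worked through numerically. The sketch below uses the preset weights given in the text; the similarity values of the individual subtree pairs are hypothetical numbers chosen only for illustration, and the names WEIGHTS, total_similarity, case_c and case_d are not from the original.

```python
# Preset weights of the first page's subtrees, as given in the example.
WEIGHTS = {"a1": 0.3, "a2": 0.2, "a3": 0.1}


def total_similarity(pairs, weights=WEIGHTS):
    """Sums pair similarity times the weight of the first-page subtree.

    pairs: list of (first_page_subtree, similarity_value) tuples,
           one tuple per subtree of the second page.
    """
    return sum(sim * weights[first] for first, sim in pairs)


# Pairs c1=(b1,a2), c2=(b2,a1), c3=(b3,a3) with hypothetical similarities.
case_c = [("a2", 0.9), ("a1", 0.8), ("a3", 0.5)]
# Pairs d1=(b1,a2), d2=(b2,a2), d3=(b3,a1) with hypothetical similarities.
case_d = [("a2", 0.9), ("a2", 0.6), ("a1", 0.7)]

total_c = total_similarity(case_c)  # 0.9*0.2 + 0.8*0.3 + 0.5*0.1
total_d = total_similarity(case_d)  # 0.9*0.2 + 0.6*0.2 + 0.7*0.3
```

In the second case the weight of a2 is counted twice because two subtrees of the second page are matched to it, which follows directly from the rule stated in the text.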
Step 206: judging whether the total similarity value is higher than a preset threshold value; if so, executing step 207, otherwise executing step 208.
The preset threshold may be set according to requirements such as the accuracy or the efficiency of information extraction. For example, a higher preset threshold gives higher extraction accuracy, while a lower preset threshold gives higher extraction efficiency. The user can set the preset threshold freely according to his or her own needs, which provides a more flexible information extraction service.
Step 207: determining that the first page and the second page are similar pages.
As can be seen, in this embodiment, by performing similarity comparison on the tree structures of the first page and the second page, when the total similarity value is higher than the preset threshold, it can be determined that the two pages are similar pages.
Step 208: it is determined that the first page and the second page are not similar pages.
It should be noted that, in step 202, if the tree structure root nodes of the two pages are found not to be the same, step 208 may be executed directly, as shown in fig. 2. That is, in this embodiment, it is first determined whether the tree structure root nodes of the two pages are the same. If they are the same, the total similarity value is calculated from the subtrees to determine whether the two pages are similar pages; if they are not the same, the two pages may be directly determined not to be similar pages.
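Steps 201 to 208 can be summarized in a short recursive sketch. This is one possible reading of the flow under stated assumptions, not the implementation of this application: the Node class, the function names and the threshold of 0.8 are hypothetical, leaf nodes with equal names are assumed to have similarity 1.0 (content comparison is omitted), and a leaf compared with a non-leaf is assumed to have similarity 0.0.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    weight: float = 1.0  # preset weight of this subtree within its parent


def similarity(a, b):
    """Similarity of tree a (first page) and tree b (second page)."""
    if a.name != b.name:                    # step 202: root nodes differ
        return 0.0
    if not a.children and not b.children:   # two leaves with the same name
        return 1.0
    if not a.children or not b.children:    # assumed: leaf vs. non-leaf
        return 0.0
    total = 0.0
    for sub_b in b.children:
        # Step 203: best-matching subtree of the first page for sub_b.
        # Step 204: its similarity (computed recursively) and preset weight.
        best_sim, best_weight = max(
            ((similarity(sub_a, sub_b), sub_a.weight) for sub_a in a.children),
            key=lambda pair: pair[0],
        )
        total += best_sim * best_weight     # step 205: weighted sum
    return total


def is_similar(a, b, threshold=0.8):
    """Steps 206 to 208: compare the total similarity with a threshold."""
    return similarity(a, b) > threshold


page1 = Node("html", [Node("div", [Node("p")], weight=0.6),
                      Node("ul", [Node("li")], weight=0.4)])
page2 = Node("html", [Node("ul", [Node("li")]), Node("div", [Node("p")])])
page3 = Node("html", [Node("ul", [Node("li")])])
page4 = Node("body", [Node("div", [Node("p")])])
```

Here page2 contains the same subtrees as page1 in a different order and is judged similar, page3 matches only the lower-weighted subtree, and page4 is rejected at the root comparison.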
In an implementation manner, the similarity comparison between two pages shown in fig. 2 compares the page structures only, and in this embodiment the page contents may also be compared. Accordingly, while the structures of the first page and the second page are compared, the categories to which the page contents of the first page and the second page belong may also be compared. To this end, the page content of each page in the at least one page is obtained, so that the content category and the structure are compared between any two pages in the at least one page to obtain similar pages. Similar pages in this case are pages whose structure and content are the same or whose similarity value is higher than a certain threshold.
Specifically, in this embodiment, the scheme shown in fig. 2 may be adopted to compare the structures of the pages, and the category to which the page content of each page belongs may be determined by analyzing the content such as the page title or the hidden topic, so as to implement content comparison and obtain the similarity value on the category to which the content belongs. Further, in this embodiment, after comparing the category and the structure to which the content belongs on the page, the page similarity value between the pages may be calculated again according to the weight occupied by the category and the structure to which the content belongs, for example, the weight of the category to which the content belongs is 0.5, and the weight of the structure is 0.5 (or the weight of the category to which the content belongs is 0.4, and the weight of the structure is 0.6, etc.), and the similarity value on the category to which the content belongs and the similarity value on the structure are multiplied by the respective weights respectively and then summed to obtain the page similarity value, so as to determine whether the pages are similar or not.
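The combination of the two similarity values can be written as a one-line weighted sum. In the sketch below the weights 0.5/0.5 and 0.4/0.6 come from the text, while the input similarity values and the name page_similarity are hypothetical.

```python
def page_similarity(category_sim, structure_sim, category_weight=0.5):
    """Weighted sum of content-category and structure similarity values."""
    structure_weight = 1.0 - category_weight
    return category_sim * category_weight + structure_sim * structure_weight


equal = page_similarity(0.9, 0.7)                        # weights 0.5 / 0.5
skewed = page_similarity(0.9, 0.7, category_weight=0.4)  # weights 0.4 / 0.6
```

The sketch assumes that the two weights sum to 1, which holds for both weightings mentioned in the text.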
In an implementation manner, setting the label for the node path information list of the similar pages in step 103 of this embodiment may be implemented as follows, as shown in fig. 6:
Step 601: determining the target content to be extracted according to the node path information list of the similar pages.
Specifically, in this embodiment, the information that may need to be extracted may be determined as the target content by identifying content such as the folder name or file name, the file attribute, and the file type in the node path information list (XPath list).
Step 602: setting the label of the node path information list of the similar pages based on the target content.
Specifically, in this embodiment, after the target content is refined and classified, a suitable label is determined according to the classification result and set in the node path information list. For example, for the node path information list China/Henan/Zhengzhou/High-tech Zone and the node path information list China/Henan/Zhengzhou/Convention and Exhibition Center, which belong to similar pages, various information in the node path information lists, such as folder names, file attributes and file types, is identified and determined as target content; the target content is then refined and classified, and "Zhengzhou" may be set as the label of the node path information lists of these two similar pages.
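One simple heuristic that reproduces the "Zhengzhou" example is to take the deepest node shared by all paths as the label. The text does not specify how the refinement classification is performed, so the sketch below is only an assumed stand-in for it; the function name common_prefix_label and the English renderings of the path components are illustrative.

```python
def common_prefix_label(paths):
    """Returns the deepest path component shared by all paths, or None."""
    split_paths = [path.split("/") for path in paths]
    label = None
    for steps in zip(*split_paths):
        if len(set(steps)) != 1:   # first level at which the paths diverge
            break
        label = steps[0]
    return label


similar_pages = [
    "China/Henan/Zhengzhou/High-tech Zone",
    "China/Henan/Zhengzhou/Convention and Exhibition Center",
]
label = common_prefix_label(similar_pages)
```

A production system would more likely combine such structural cues with the content names and types mentioned in the text.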
In an implementation manner, when the page extraction template is generated in step 104 in this embodiment, a feature dictionary of the page extraction template under the label may be generated for the node path information lists with the same label. The feature dictionary may include strong feature words associated with the label and their feature attributes, and may also include synonyms of the strong feature words. Specifically, in this embodiment, the feature dictionary may be obtained in the following manner, as shown in fig. 7:
Step 701: for the node path information lists with the same label, extracting from the node path information lists the strong feature words that have an association relationship with the label, together with their feature attributes.
A strong feature word may be a word of the label of the node path information that contains the strong feature word, a word whose similarity to the label reaches a certain threshold, or a word that is associated with the label in content, concept, or meaning. Correspondingly, the feature attributes of a strong feature word may be the file name or file type attributes corresponding to the folder name, such as a class attribute or a css attribute. In this embodiment, the strong feature words and their feature attributes are extracted by analyzing the content in the node path information list.
Step 702: generating a feature dictionary of the page extraction template based on the strong feature words and their feature attributes.
In this embodiment, the strong feature words and the feature attributes thereof may be classified and integrated to obtain a word set, which is used as a feature dictionary of a subsequently generated page extraction template.
Step 703: analyzing the word senses of the strong feature words to obtain synonyms of the strong feature words, and adding the synonyms to the feature dictionary.
For example, for the strong feature word "typhoon", its synonyms such as "strong typhoon" and "tropical storm" are added to the feature dictionary, and the feature dictionary records the correspondence between the strong feature word and its synonyms.
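Steps 701 to 703 can be sketched as building a small mapping. The data layout, the function name build_feature_dictionary and the SYNONYMS lookup table are assumptions made for illustration; the text does not say how word senses are analyzed, so a fixed table stands in for that step.

```python
# Hypothetical synonym source; a real system might use a thesaurus or a
# word-embedding model instead of a fixed table.
SYNONYMS = {"typhoon": ["strong typhoon", "tropical storm"]}


def build_feature_dictionary(strong_words):
    """Builds a feature dictionary from (word, attribute) pairs.

    Each entry keeps the feature attributes seen for the strong feature
    word and the synonyms associated with it.
    """
    dictionary = {}
    for word, attribute in strong_words:
        entry = dictionary.setdefault(
            word, {"attributes": set(), "synonyms": []}
        )
        entry["attributes"].add(attribute)
        entry["synonyms"] = list(SYNONYMS.get(word, []))
    return dictionary


features = build_feature_dictionary([
    ("typhoon", "class"),
    ("typhoon", "id"),
    ("rainfall", "class"),
])
```

Keeping the synonyms inside the entry of the strong feature word preserves the correspondence between the word and its synonyms that the text requires.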
Based on the above implementation, step 104 in this embodiment may be implemented in the following manner, as shown in fig. 8:
Step 801: merging the node path information lists with the same label.
Specifically, in this embodiment, a preset mark symbol, such as a wildcard character, may be first set in the node path information list with the same label for the node path information list from which the strong feature word is extracted, so as to mark that the strong feature word is extracted from the node path information list, and no mark symbol is set in the other node path information lists from which the strong feature word is not extracted;
then, comparing the node path information lists one by one according to the node sequence in the node path information lists to obtain a comparison result, wherein the comparison result can represent whether each node in the node path information lists corresponds to the same node, which nodes are different, and the like;
then, based on the comparison result, merging node path information lists whose nodes are all the same into one node path information list, for example, by retaining one of the node path information lists and deleting the other; node path information lists that differ in one node are also merged into one node path information list, and the differing node is replaced by the mark symbol, for example, by deleting one of the node path information lists and replacing, in the other, the node that differs from the deleted list with the mark symbol; node path information lists that differ in two or more nodes are regarded as different lists and are not merged.
For example, in this embodiment, the node path information lists of two pages that are compared node by node should be as similar or identical as possible at the node level, for example both having 5 layers or 3 layers of nodes, so that the nodes separated by "/" in the XPaths can be compared one by one. If all nodes are identical, the two XPaths are merged into one; if only one node differs, the two XPaths are also merged into one, and the differing node in the merged XPath is replaced with a mark symbol such as a wildcard character; if several nodes differ, the XPaths are not regarded as mergeable and are left unprocessed.
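The merge rule described here can be sketched as follows. The use of "*" (the XPath wildcard) as the mark symbol is an assumption, as are the function names merge_paths and merge_list and the greedy order in which paths are folded into the result.

```python
def merge_paths(first, second, mark="*"):
    """Merges two paths that differ in at most one node, else returns None."""
    a = first.strip("/").split("/")
    b = second.strip("/").split("/")
    if len(a) != len(b):               # different depth: not comparable
        return None
    differing = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(differing) > 1:             # two or more nodes differ: no merge
        return None
    if differing:
        a[differing[0]] = mark         # replace the differing node
    return "/" + "/".join(a)


def merge_list(paths):
    """Greedily folds each path into the first entry it can merge with."""
    merged = []
    for path in paths:
        for index, existing in enumerate(merged):
            combined = merge_paths(existing, path)
            if combined is not None:
                merged[index] = combined
                break
        else:
            merged.append(path)
    return merged


result = merge_list([
    "/html/body/div/p",
    "/html/body/section/p",
    "/html/body/nav/p",
    "/html/head/title",
])
```

The greedy pass is order dependent; a different merge order could yield a different but equally valid set of merged paths.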
In addition, to reduce the amount of computation, when merging node path information lists with the same label in this embodiment, the node path information lists may first be simplified, for example by using the mark symbol, before they are compared one by one according to the node order.
Specifically, the node path information list may be simplified in this embodiment as follows:
for node path information in which the mark symbol has been set, at least the tree structure node name and the information used for filling in the strong feature word are retained;
for node path information in which no mark symbol has been set, only the tree structure node name may be retained.
It should be noted that, if the tree structure node corresponding to a piece of node path information contains certain target content, such as table content, the node name and the table sequence number of that node path information may be retained; if the tree structure node does not contain the target content, only the node name is retained.
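The simplification rules above can be sketched as follows. This is an illustration under stated assumptions only: a marked node is assumed to be written as a step beginning with the wildcard "*" followed by a predicate that holds the slot for the strong feature word, table content is assumed to be identified by the tag names in `TABLE_TAGS`, and predicates are assumed not to contain "/". The names `simplify_step` and `simplify_xpath` are hypothetical.

```python
import re

WILDCARD = "*"  # assumed mark symbol
TABLE_TAGS = {"table", "tbody", "tr", "td", "th"}  # assumed "table content" nodes

# A step is a node name followed by zero or more [...] predicates.
STEP = re.compile(r"^(?P<name>[^\[\]]+)(?P<preds>(\[[^\]]*\])*)$")
INDEX = re.compile(r"\[(\d+)\]")


def simplify_step(step: str) -> str:
    """Reduce one xpath step according to the rules in the description."""
    match = STEP.match(step)
    if match is None:
        return step  # unrecognised syntax is left untouched
    name, preds = match.group("name"), match.group("preds")
    if name == WILDCARD:
        # Marked node: keep the name and the part used to fill in the feature word.
        return step
    if name in TABLE_TAGS:
        index = INDEX.search(preds)
        if index:
            # Table node: keep the node name and its sequence number.
            return f"{name}[{index.group(1)}]"
    # Any other node: keep only the node name.
    return name


def simplify_xpath(xpath: str) -> str:
    steps = xpath.strip("/").split("/")
    return "/" + "/".join(simplify_step(step) for step in steps)
```

For example, "/html/body/div[@id='main'][2]/table[3]/tr[2]/td[1]" is reduced to "/html/body/div/table[3]/tr[2]/td[1]", so fewer distinct strings have to be compared during merging.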
Step 802: generating a page extraction template matched with the label based on the merged node path information list.
In this embodiment, the merged node path information list may be used directly as the page extraction template. The page extraction template thus obtained may include multiple pieces of node path information, which can be used for subsequent information extraction from pages. During information extraction, node path information containing strong feature words has a higher priority than node path information that does not: when information is subsequently extracted from a page, the node path information containing strong feature words is used first, which can be understood as preferring the strong feature words in the feature dictionary for information extraction.
In an implementation manner, after obtaining the page extraction template in this embodiment, the following steps may be further included, as shown in fig. 9:
Step 901: in response to a received page extraction request, obtaining a target page to be extracted and an extraction tag of the target page.
The page extraction request can include a page identifier or a page address of a target page to be extracted and the like so as to represent the target page to be extracted; in this embodiment, the extraction tag of the target page may be obtained by analyzing information such as the page identifier, the page content, or the subject content of the target page.
Step 902: in the target page extraction template corresponding to the extraction tag, preferentially using the node path information containing strong feature words to extract page data from the target page, to obtain an extraction result.
For example, in this embodiment, the extraction tag is used to find, among the various page extraction templates, the target page extraction template whose label is the same as the extraction tag, and that template is then used to extract page data from the target page. Specifically, the node path information containing strong feature words is used first; if no suitable information is extracted, all or some of the strong feature words in the feature dictionary of the target page extraction template are used; if still no suitable information is extracted, synonyms of the strong feature words in the feature dictionary may be used; and if still no suitable information is extracted, the node path information without strong feature words is used, finally yielding the extraction result.
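The fallback order described above can be sketched as a generator of candidate paths. Everything in this sketch is an assumption made for illustration: the `Template` structure, the "{word}" slot used to fill a strong feature word into a path, and the `lookup` callable that stands in for evaluating a path against the target page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class Template:
    # (path pattern with a "{word}" slot, strong feature word recorded with it)
    feature_paths: List[Tuple[str, str]]
    # feature dictionary: strong feature word -> its synonyms
    dictionary: Dict[str, List[str]] = field(default_factory=dict)
    # node path information that contains no strong feature word
    plain_paths: List[str] = field(default_factory=list)


def candidate_paths(template: Template) -> Iterator[str]:
    """Yield concrete paths in the priority order given in the description."""
    # 1. Paths containing strong feature words, filled with their own word.
    for pattern, word in template.feature_paths:
        yield pattern.replace("{word}", word)
    # 2. The same paths filled with the other strong feature words.
    for pattern, word in template.feature_paths:
        for other in template.dictionary:
            if other != word:
                yield pattern.replace("{word}", other)
    # 3. The same paths filled with synonyms from the feature dictionary.
    for pattern, _ in template.feature_paths:
        for synonyms in template.dictionary.values():
            for synonym in synonyms:
                yield pattern.replace("{word}", synonym)
    # 4. Paths without strong feature words.
    yield from template.plain_paths


def extract_with_fallback(
    template: Template, lookup: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the first non-empty value found, or None if nothing matches."""
    for path in candidate_paths(template):
        value = lookup(path)
        if value:
            return value
    return None
```

A result of None corresponds to the case in which no suitable information is extracted and the page is marked as default.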
In addition, when page data is extracted using the strong feature words and no suitable information can be obtained, the feature attributes of the strong feature words may be used for learning and training, for example training on information of a certain document attribute, such as a document name or a folder name under a class attribute, so as to extract the corresponding information from the page.
If no suitable information is extracted in the end, in this embodiment the target page and its extraction tag may be marked as default, and a corresponding page extraction template may be regenerated from the node path information list of the target page and merged into the already generated page extraction template, so that information can be extracted from the target page or from further pages.
It should be noted that, after the extraction result is obtained in this embodiment, it may contain redundant information as a result of data alignment. The extraction result may therefore be further cleaned, for example by deleting repeated data through redundancy processing, to obtain a more accurate extraction result.
Referring to fig. 10, a schematic structural diagram of a data processing apparatus according to a second embodiment of the present application is provided, where the apparatus is adapted to construct a page extraction template for pages of different websites, that is, websites with different structures or contents, and further extract page information. The apparatus in this embodiment may be run in a computer or server with computing power.
Specifically, the apparatus in this embodiment may include the following structure:
The page parsing unit 1001 is configured to parse at least one page to obtain a node path information list corresponding to each page.
The pages in this embodiment may include pages from one website or pages from multiple websites, and pages from different websites may be the same (or similar) or different in structure and content. For example, pages on shopping websites, news websites, and advertising websites all differ in structure and content.
It should be noted that, before the pages are parsed in this embodiment, they must first be obtained, for example by reading them from a database or by crawling them from a website in real time with a tool such as a web crawler.
In this embodiment, the node path information list obtained by parsing a page may be understood as the XPath (XML Path Language) list of the page, which represents the page structure and the structured content by means of path expressions.
Specifically, in this embodiment, the xpath list may be generated by constructing a tree structure for the page and then parsing that tree structure. For example, a third-party library (e.g., lxml) is used to parse HyperText Markup Language (HTML) pages on one or more websites and construct a Document Object Model (DOM) tree, and the DOM tree is then parsed to form a complete xpath list of the corresponding page.
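As an illustration of how an xpath list can be derived from a parsed tree, the following sketch walks a tree and emits one path per element. It deliberately uses Python's standard-library `xml.etree.ElementTree` rather than lxml so that it has no dependencies; this is a substitution made for the example, and it only accepts well-formed markup, whereas real HTML pages generally need a tolerant parser such as the one named above. The function name `xpath_list` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List


def xpath_list(markup: str) -> List[str]:
    """Return one absolute path per element, in document order."""
    root = ET.fromstring(markup)
    paths: List[str] = []

    def walk(node: ET.Element, path: str) -> None:
        paths.append(path)
        totals = Counter(child.tag for child in node)  # siblings per tag name
        seen: Counter = Counter()
        for child in node:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                # Add a 1-based position only when the tag name is ambiguous.
                step += f"[{seen[child.tag]}]"
            walk(child, f"{path}/{step}")

    walk(root, "/" + root.tag)
    return paths
```

Each entry of the returned list corresponds to one node of the tree, which is the form on which the later comparison, labeling, and merging steps operate.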
The similarity comparison unit 1002 is configured to perform structure comparison between pages in at least one page based on the node path information list to obtain a similar page.
In this embodiment, structure comparison may be performed on any two pages in at least one page, so as to determine which pages belong to similar pages and which pages are not similar pages.
It should be noted that similar pages in this embodiment may be understood as follows: pages whose similarity value is greater than a certain threshold are similar pages, where the similarity value between pages may be a structural similarity value and/or a content similarity value. That is, two pages being similar means that the two pages are similar in structure and/or content.
A tag setting unit 1003, configured to set a tag for the node path information list of the similar page.
The tag in this embodiment may be a tag extracted from characters in the page content, for example, a keyword in the page content is used as the tag; or the tags may be characters associated with characters located in the page content, such as words that approximate keywords in the page content, as tags, and so forth.
Setting the label means at least that: the pages corresponding to the node path information lists under the same label are similar pages, and the pages corresponding to the node path information lists under different labels are not similar pages.
Specifically, in this embodiment, the node path information lists of the similar pages may be classified at a finer granularity according to the names and types of the information content, a label is then generated based on the classification result, and the label is set on the node path information lists of the similar pages.
The template generating unit 1004 is configured to generate a page extraction template matching the label based on the node path information list having the same label.
In this embodiment, the page extraction template may be generated by processing the node path information lists with the same label, for example, selecting one node path information list with the highest similarity to other node path information lists in the node path information lists with the same label as the page extraction template; or combining or integrating the node path information lists based on the same label to generate a page extraction template matched with the label, and the like.
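The first option above, selecting the list that is most similar to the other lists under the same label, can be sketched as follows. The description does not specify how the similarity between two node path information lists is measured; Jaccard similarity over the sets of paths is assumed here purely for illustration, and the names `jaccard` and `pick_representative` are hypothetical.

```python
from typing import List, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed similarity measure: shared paths divided by all distinct paths."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pick_representative(lists: List[List[str]]) -> List[str]:
    """Return the list with the highest average similarity to the others."""
    if not lists:
        raise ValueError("at least one node path information list is required")
    if len(lists) == 1:
        return lists[0]

    def average_similarity(i: int) -> float:
        others = [jaccard(lists[i], lists[j]) for j in range(len(lists)) if j != i]
        return sum(others) / len(others)

    best = max(range(len(lists)), key=average_similarity)
    return lists[best]
```

The second option, merging the lists, keeps more paths than this selection and is the one detailed in the method embodiment.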
According to the above scheme, after the node path information lists of various pages are analyzed, similarity between different pages is classified based on the node path information lists, so that tag setting is performed on similar pages to generate a page extraction template under the tag, and information extraction is performed on corresponding pages. It can be seen that, in this embodiment, after the webpages of different websites are classified by the similarity, the webpages with similar structural contents are obtained, and then the corresponding page extraction templates are generated, so as to implement data extraction of the webpages of different websites.
Based on the above implementation, the following example illustrates how this embodiment constructs a template for structured data extraction and performs the extraction, as shown in fig. 11:
Step 1101: performing structural similarity comparison and content comparison on the DOM trees generated from the web pages, and grouping web pages of the same type with similar structures.
Specifically, in this embodiment, a third-party library (e.g., lxml) may be used to parse the HTML page and build a DOM tree, from which a complete XPath list of the page is formed. The resulting tree-like XPath list is shown, for example, in fig. 12, where the strong feature words and feature attributes are not shown and only the nodes and node positions are shown. In this embodiment, pages are first classified by content according to the page title or hidden subject, to determine pages that are similar in content. To judge the structural similarity of two or more web pages, the root nodes of the two web page trees are first compared: if the root nodes differ, the similarity is 0 and the calculation stops; if they are the same, the calculation continues. For each subtree, the subtree with the greatest similarity in the other tree's subtree set is selected as its matching object, and that similarity serves as the subtree's similarity reference value. Taking the nodes of the subtrees as weights, the reference values of all subtrees are combined to obtain the overall similarity of the two trees. If the overall similarity meets the similarity threshold, the two web pages are judged to be structurally similar. This judgment scheme is suitable for web pages of different websites; web pages of the same website can be judged using regular-expression matching.
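The structural comparison described above can be sketched recursively. The sketch makes several assumptions that the description leaves open: a tree is represented as a `(tag, children)` tuple, the weight of a subtree is its node count, the result is normalised by the size of the larger tree, and the threshold of 0.8 is an arbitrary example value. The names `size`, `similarity`, and `is_similar` are hypothetical.

```python
from typing import Tuple

Tree = Tuple[str, list]  # (tag name, list of child trees)


def size(tree: Tree) -> int:
    """Number of nodes in the tree, used here as the subtree weight."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a: Tree, b: Tree) -> float:
    """Similarity of a to b in [0, 1]; 0 as soon as the roots differ."""
    if a[0] != b[0]:
        return 0.0
    matched = 1.0  # the two matching roots count as one matched node
    for child in a[1]:
        # Best-matching subtree of b gives this child's reference value.
        best = max((similarity(child, other) for other in b[1]), default=0.0)
        matched += size(child) * best
    return matched / max(size(a), size(b))


def is_similar(a: Tree, b: Tree, threshold: float = 0.8) -> bool:
    # The one-directional matching is not symmetric, so both directions
    # are checked here; this is a choice of the sketch, not of the source.
    return min(similarity(a, b), similarity(b, a)) >= threshold
```

Because each subtree is matched only against its best counterpart, `similarity(a, b)` and `similarity(b, a)` can differ, which is why the sketch takes the smaller of the two values before applying the threshold.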
Step 1102: parsing the web pages that are of the same content type and similar in structure, sorting the parsed XPaths according to the names and types of the data to be extracted, and labeling them.
For example, in this embodiment, at least some HTML pages with the same content type and similar structure are taken as page samples for a field, the information that may need to be extracted is determined from the page content, and the XPath paths corresponding to that content are parsed out. The data to be extracted is then classified at a finer granularity by name and type and labeled, and the corresponding XPath paths are grouped under the same label. Under a given label, part of the original marked data or data features may be retained for comparison with the extraction results.
Step 1103: for the XPath paths under the same label, extracting the strong feature words and feature attributes in the paths that are close to the label, weakening the nodes where these features are located, and replacing those nodes with wildcards.
In this embodiment, when the XPath paths are parsed in step 1102, for tables or other content on the page that has a corresponding name or type, an XPath path that can be located by that name may be searched for first. The XPath path under the label is then traversed node by node; when text close to the label, or an attribute representing the type of the object to be extracted, is found, the current node and the position of the feature word are matched with a wildcard, and the corresponding text or attribute is recorded as the strong feature word and its feature attribute.
Step 1104: sorting and summarizing the strong feature words and feature attributes to form a feature dictionary, analyzing their meanings and parts of speech to obtain other possible synonyms, and adding the synonyms to the dictionary for later use.
In this embodiment, the strong feature words obtained by traversal are sorted and summarized to obtain a preliminary feature dictionary. Because the pages currently being extracted are mainly Chinese or English web pages, a Chinese or English synonym toolkit is used to find synonyms with high similarity, which are added to the dictionary as alternatives. When a strong feature word is a phrase or word combination, it is first segmented, the nouns among the resulting words are selected, synonyms with high similarity are found for them, and these are added to the dictionary as alternatives. If a synonym that is found already exists in the feature dictionary, it is not added again; if it only appears inside some phrase in the dictionary, it is still added.
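The rule for adding synonyms can be sketched as follows. The synonym toolkit and the word segmentation are outside the scope of the sketch: the candidate synonyms are simply passed in as a list. The dictionary layout (strong feature word mapped to a list of synonyms) and the name `add_synonyms` are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def add_synonyms(
    dictionary: Dict[str, List[str]], word: str, candidates: Iterable[str]
) -> Dict[str, List[str]]:
    """Add candidate synonyms of `word`, skipping whole entries that already exist."""
    entry = dictionary.setdefault(word, [])
    # Every strong feature word and synonym already stored, as whole entries.
    existing = set(dictionary) | {s for syns in dictionary.values() for s in syns}
    for candidate in candidates:
        # Exact match only: a candidate that merely occurs inside a stored
        # phrase (e.g. "price" inside "unit price") is still added.
        if candidate in existing:
            continue
        entry.append(candidate)
        existing.add(candidate)
    return dictionary
```

Comparing whole entries rather than substrings is what lets a word that only appears inside a stored phrase still enter the dictionary, as the paragraph above requires.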
Step 1105: based on the processing in step 1103, simplifying and merging the labeled XPath paths according to the designed comparison table, so that paths that may contain strong feature words and feature attributes are selected preferentially and the other alternative paths are given a lower priority, and storing the result as the template under the label.
In this embodiment, the XPath paths are first simplified. Specifically, for a node that has not been replaced with a wildcard, only the node name is retained, except that for table-related nodes the serial number is also retained; for a node that has been replaced with a wildcard, the node name is retained together with the part that can be used to fill in the strong feature word. After all nodes have been reduced, a preliminary simplified XPath is obtained.
Then, the XPath paths with the same label are merged. Specifically, the paths under the same label are compared node by node in order, and completely identical entries are merged; if two or more XPath expressions differ in only one node, they are treated as the same expression, and a wildcard replaces the single differing node.
Step 1106: selecting the template under the label to be extracted in order to extract web pages of the same type, preferentially using paths containing strong feature words and, if a strong feature word does not match, substituting synonyms from the feature dictionary; once all synonyms in the dictionary have been tried, starting to use XPaths without feature attributes, and comparing the extracted text with the marked data.
When information needs to be extracted from a page, in this embodiment, for a target web page that is of a similar type and contains the information to be extracted, the label to be extracted is first determined for the target web page, and the page extraction template under that label is then selected as required. The paths in the template that contain strong feature words, together with the corresponding feature dictionary, are used first; if the strong feature words do not match, that is, no suitable information is extracted, they are replaced with synonyms from the feature dictionary and matching is attempted again. If no result can be extracted after all synonyms in the dictionary have been tried, XPaths without feature attributes are then used for matching, the extracted text is compared with the original marked data, and reasonable information is kept as the extraction result.
It should be noted that the page extraction template in this embodiment can be used to extract pages in batches. During batch extraction, if a certain XPath successfully extracts the required content from web pages of the same website, the XPath corresponding to the current label is recorded and is used preferentially in subsequent extraction. It should also be noted that the same label may yield multiple pieces of data within one page, and data alignment may cause redundant information to be extracted, so the extraction result may need further cleaning.
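The two points above, reusing paths that have already succeeded on a website and cleaning redundant results, can be sketched as follows. The class `PathCache`, its methods, and the function `dedupe` are hypothetical names introduced for this illustration, and keying the record by website and label is an assumption.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple


class PathCache:
    """Remembers, per (website, label), the paths that extracted data successfully."""

    def __init__(self) -> None:
        self._hits: DefaultDict[Tuple[str, str], List[str]] = defaultdict(list)

    def record(self, site: str, label: str, path: str) -> None:
        hits = self._hits[(site, label)]
        if path not in hits:
            hits.append(path)

    def ordered(self, site: str, label: str, template_paths: List[str]) -> List[str]:
        """Previously successful paths first, then the rest in template order."""
        hits = self._hits[(site, label)]
        return hits + [path for path in template_paths if path not in hits]


def dedupe(values: Iterable[str]) -> List[str]:
    """Drop repeated values while keeping the first occurrence of each."""
    return list(dict.fromkeys(values))
```

Trying the recorded paths first avoids walking through the whole fallback sequence for every page of the same website during batch extraction.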
Step 1107: if the data to be extracted is not matched, marking it as default, parsing the whole web page, and merging it into the template.
Specifically, during batch extraction, if only a small number of labels fail to match the data to be extracted, those labels are marked as default and extraction continues. The unmatched labels can be completed using data extracted from websites of the same type, or corresponding extractable XPath paths can be added manually. If the labels to be extracted fail to match the corresponding information, the whole web page needs to be parsed as a seed, and the template generation steps are repeated to obtain the corresponding XPaths.
As can be seen, in this embodiment, web pages of different websites can be classified by calculating similarity, yielding web pages with similar structure and content. Given sufficient analysis samples, for most similar web pages the differing parts can be handled with small adjustments using conditional expressions, so the template is highly general and widely applicable. Meanwhile, user-defined labels are used to classify the feature words and feature attributes, which makes it easier to classify fields that have the same meaning but are expressed in different ways. In addition, the original words in the paths and their synonyms form the feature dictionary, so the approach generalizes to the same data on different web pages. Finally, only the templates for the data to be extracted are saved, and useless data does not need to be sorted.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the apparatus disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is brief, and the relevant points can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A method of data processing, comprising:
parsing pages to respectively obtain a node path information list corresponding to each page;
performing structure comparison among the pages based on the node path information lists to obtain similar pages;
setting a label for the node path information list of the similar page;
generating a page extraction template matched with the label based on the node path information list with the same label;
setting labels for the node path information list of the similar page, including:
determining target content to be extracted according to the node path information list of the similar page;
setting a label for the node path information list of the similar page based on the target content, which comprises: performing fine-grained classification of the target content, and setting the label of the node path information list of the similar page according to the classification result;
the generating of the page extraction template matched with the label based on the node path information list with the same label comprises:
extracting, for the node path information lists with the same label, the strong feature words that are associated with the label in the node path information lists, together with their feature attributes;
generating a feature dictionary of the page extraction template based on the strong feature words and the feature attributes thereof;
analyzing the word meaning of the strong characteristic word to obtain a synonym of the strong characteristic word;
adding the synonym into a feature dictionary of the page extraction template;
merging the node path information lists with the same label;
generating a page extraction template matched with the label based on the combined node path information list; the generated page extraction template comprises a plurality of node path information, the node path information is used for information extraction, and the priority of the node path information containing the strong characteristic words is higher than the priority of the other node path information.
2. The method of claim 1, wherein comparing the structures of the pages based on the node path information list comprises:
performing the following operations on the node path information lists of two of the pages:
respectively obtaining tree structure root nodes and corresponding sub-trees of a first page and a second page in the two pages based on the node path information lists of the two pages;
determining subtrees with the highest similarity to the subtrees in the second page respectively in the subtrees of the first page based on the judgment that the tree structure root node comparisons of the two pages are the same so as to form a subtree pair;
obtaining similarity values of two subtrees in the subtree pair, and obtaining preset weights of the subtrees belonging to the first page in the subtree pair;
obtaining a total similarity value between the first page and the second page based on the similarity value of the subtree pair and the preset weight;
and determining that the first page and the second page are similar pages based on the judgment that the total similarity value is higher than a preset threshold value.
3. The method of claim 1 or 2, further comprising:
obtaining page content in the page;
and comparing the categories and structures of the page contents among the pages to obtain similar pages.
4. The method of claim 1, wherein merging the node path information lists with the same label comprises:
setting a preset mark symbol for the node path information list from which the strong characteristic words are extracted in the node path information list with the same label;
comparing the node path information lists one by one according to the node order in the node path information lists to obtain comparison results;
and merging the node path information lists with the same node comparison into a node path information list based on the comparison result, merging the node path information lists with different nodes into a node path information list, and replacing the different nodes with the mark symbols.
5. The method according to claim 4, wherein before performing the one-to-one comparison between the node path information lists in sequence according to the node orders in the node path information lists, the method further comprises:
simplifying the node path information list by using the mark symbol;
wherein the simplifying specifically comprises:
at least reserving tree structure node names and information used for filling the strong characteristic words in the node path information for the node path information of the node path information list provided with the mark symbol;
and reserving tree structure node names in the node path information for the node path information of the node path information list without the set mark symbol.
6. The method of claim 1 or 2, further comprising:
responding to a received page extraction request, and obtaining a target page to be extracted and an extraction tag of the target page;
preferentially using node path information containing strong feature words to extract page data of the target page in a target page extraction template corresponding to the extraction tag to obtain an extraction result;
and obtaining a node path information list of the target page and generating a corresponding page extraction template based on the judgment that corresponding data are not extracted from the extraction result.
7. A data processing apparatus comprising:
the page analyzing unit is used for analyzing the obtained pages so as to respectively obtain a node path information list corresponding to each page;
the similarity comparison unit is used for performing structure comparison among the pages based on the node path information lists to obtain similar pages;
the label setting unit is used for setting labels for the node path information lists of the similar pages;
the template generating unit is used for generating a page extraction template matched with the label based on the node path information list with the same label;
setting labels for the node path information list of the similar page, including:
determining target content to be extracted according to the node path information list of the similar page;
setting a label for the node path information list of the similar page based on the target content, which comprises: performing fine-grained classification of the target content, and setting the label of the node path information list of the similar page according to the classification result;
the generating of the page extraction template matched with the label based on the node path information list with the same label comprises:
extracting, for the node path information lists with the same label, the strong feature words that are associated with the label in the node path information lists, together with their feature attributes;
generating a feature dictionary of the page extraction template based on the strong feature words and the feature attributes thereof;
analyzing the word meaning of the strong characteristic word to obtain a synonym of the strong characteristic word;
adding the synonym into a feature dictionary of the page extraction template;
merging the node path information lists with the same label;
generating a page extraction template matched with the label based on the combined node path information list; the generated page extraction template comprises a plurality of node path information, the node path information is used for information extraction, and the priority of the node path information containing the strong characteristic words is higher than the priority of the other node path information.
CN201811073868.9A 2018-09-14 2018-09-14 Data processing method and device Active CN109165373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811073868.9A CN109165373B (en) 2018-09-14 2018-09-14 Data processing method and device

Publications (2)

Publication Number Publication Date
CN109165373A CN109165373A (en) 2019-01-08
CN109165373B true CN109165373B (en) 2022-04-22

Family

ID=64879429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811073868.9A Active CN109165373B (en) 2018-09-14 2018-09-14 Data processing method and device

Country Status (1)

Country Link
CN (1) CN109165373B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826002B (en) * 2019-10-30 2024-06-25 腾讯科技(深圳)有限公司 Information sharing method, device, terminal and storage medium
CN111522606B (en) * 2020-04-26 2023-08-04 广东优特云科技有限公司 Data processing method, device, equipment and storage medium
CN113626028B (en) * 2020-05-07 2024-06-14 腾讯科技(深圳)有限公司 Page element mapping method and device
CN111966930B (en) * 2020-08-17 2021-05-04 山东亿云信息技术有限公司 Webpage list analyzing method and system based on XPath sequence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN102890681A (en) * 2011-07-20 2013-01-23 阿里巴巴集团控股有限公司 Method and system for generating webpage structure template
CN103136358A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Method for automatically extracting BBS (bulletin board system) data
CN104572934A (en) * 2014-12-29 2015-04-29 西安交通大学 Webpage key content extracting method based on DOM

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720868B2 (en) * 2006-11-13 2010-05-18 Microsoft Corporation Providing assistance with the creation of an XPath expression
CN105117397B (en) * 2015-06-18 2018-08-28 浙江大学 Ontology-based semantic association retrieval method for medical documents
CN105512245B (en) * 2015-11-30 2018-08-21 青岛智能产业技术研究院 Method for building an enterprise profile based on a regression model

Also Published As

Publication number Publication date
CN109165373A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN109165373B (en) Data processing method and device
Sun et al. DOM based content extraction via text density
CN107229668B (en) Text extraction method based on keyword matching
CN102254014B (en) Adaptive information extraction method for webpage characteristics
US8868556B2 (en) Method and device for tagging a document
US9268749B2 (en) Incremental computation of repeats
US20090248707A1 (en) Site-specific information-type detection methods and systems
JP2006004417A (en) Method and device for recognizing specific type of information file
CN109033282B (en) Webpage text extraction method and device based on extraction template
JP2005063432A (en) Multimedia object retrieval apparatus and multimedia object retrieval method
Cardoso et al. An efficient language-independent method to extract content from news webpages
CN107145591B (en) Title-based webpage effective metadata content extraction method
Kosala et al. Information extraction from structured documents using k-testable tree automaton inference
CN106372232B (en) Information mining method and device based on artificial intelligence
CN111966940B (en) Target data positioning method and device based on user request sequence
CN116881595B (en) Customizable webpage data crawling method
CN111339457A (en) Method and apparatus for extracting information from web page and storage medium
CN108874870A (en) Data extraction method, device and computer-readable storage medium
Nanba et al. Bilingual PRESRI-Integration of Multiple Research Paper Databases.
JP4143085B2 (en) Synonym acquisition method and apparatus, program, and computer-readable recording medium
CN110232160B (en) Method and device for detecting interest point transition event and storage medium
CN114706948A (en) News processing method and device, storage medium and electronic equipment
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
Gkotsis et al. Self-supervised automated wrapper generation for weblog data extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant