CN109165373A - Data processing method and device - Google Patents


Info

Publication number
CN109165373A
Authority
CN
China
Prior art keywords
page
path information
node path
information list
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811073868.9A
Other languages
Chinese (zh)
Other versions
CN109165373B (en)
Inventor
杨帆
戴超男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd
Priority to CN201811073868.9A
Publication of CN109165373A
Application granted
Publication of CN109165373B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data processing method and device. The method includes: parsing at least one page to obtain a node path information list for each page; performing, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages; setting a label for the node path information lists of the similar pages; and generating, based on the node path information lists that have the same label, a page extraction template matching that label.

Description

Data processing method and device
Technical field
This application relates to the technical field of page extraction, and in particular to a data processing method and device.
Background
At present, when structured information is extracted from websites of the same type, web page information is generally extracted by building templates.
However, existing extraction template configurations cannot be applied to extracting information from the web pages of different websites, which reduces the general applicability of information extraction.
Summary of the invention
In view of this, the present application provides a data processing method and device to solve the technical problem in the prior art that a page extraction template cannot extract information from the web pages of different websites, which lowers the applicability of information extraction.
The present application provides a data processing method, comprising:
parsing at least one page to obtain a node path information list corresponding to each page;
performing, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages;
setting a label for the node path information lists of the similar pages; and
generating, based on the node path information lists having the same label, a page extraction template matching the label.
Preferably, in the above method, performing structure comparison between the pages of the at least one page based on the node path information lists comprises:
performing the following operations on the node path information lists of two pages among the at least one page:
obtaining, based on the node path information lists of the two pages, the tree-structure root node and the corresponding subtrees of a first page and of a second page of the two pages;
based on a determination that the tree-structure root nodes of the two pages are identical, determining, among the subtrees of the first page, the subtree with the highest similarity to each subtree of the second page, to form subtree pairs;
obtaining the similarity value of the two subtrees in each subtree pair, and obtaining the preset weight of the subtree in the pair that belongs to the first page;
obtaining, based on the similarity values and the preset weights of the subtree pairs, a total similarity value between the first page and the second page; and
based on a determination that the total similarity value is higher than a preset threshold, determining that the first page and the second page are similar pages.
Preferably, the above method further comprises:
obtaining the page content of the at least one page; and
performing content category comparison and structure comparison between the pages of the at least one page to obtain similar pages.
Preferably, in the above method, setting a label for the node path information lists of the similar pages comprises:
determining, according to the node path information lists of the similar pages, the target content to be extracted; and
setting, based on the target content, the label of the node path information lists of the similar pages.
Preferably, the above method further comprises:
for the node path information lists having the same label, extracting from those lists the strong feature words that are associated with the label, together with their feature attributes;
generating, based on the strong feature words and their feature attributes, a feature lexicon of the page extraction template;
parsing the meaning of the strong feature words to obtain synonyms of the strong feature words; and
adding the synonyms to the feature lexicon of the page extraction template.
Preferably, in the above method, generating, based on the node path information lists having the same label, a page extraction template matching the label comprises:
merging the node path information lists having the same label; and
generating, based on the merged node path information list, the page extraction template matching the label; wherein the generated page extraction template contains multiple pieces of node path information used for information extraction, and node path information containing strong feature words has a higher priority than other node path information.
Preferably, in the above method, merging the node path information lists having the same label comprises:
among the node path information lists having the same label, setting a preset marker symbol for each node path information list from which the strong feature words were extracted;
comparing the node path information lists node by node, in the order of the nodes in the lists, to obtain a comparison result; and
based on the comparison result, merging node path information lists whose nodes are all identical into one node path information list, and merging node path information lists that differ in exactly one node into one node path information list in which the differing node is replaced by the marker symbol.
Preferably, in the above method, before the node path information lists are compared node by node in the order of the nodes in the lists, the method further comprises:
simplifying the node path information lists using the marker symbol;
specifically:
for node path information in a node path information list for which the marker symbol is set, retaining at least the tree-structure node name in the node path information and the information carrying the strong feature words; and
for node path information in a node path information list for which no marker symbol is set, retaining the tree-structure node name in the node path information.
Preferably, the above method further comprises:
in response to a received page extraction request, obtaining the target page to be extracted and the extraction label of the target page;
extracting page data from the target page, preferentially using the node path information that contains strong feature words in the target page extraction template corresponding to the extraction label, to obtain an extraction result; and
based on a determination that the extraction result contains no corresponding data, obtaining the node path information list of the target page and generating a corresponding page extraction template.
The present application also provides a data processing device, comprising:
a page parsing unit, configured to parse at least one page to obtain a node path information list corresponding to each page;
a similarity comparison unit, configured to perform, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages;
a label setting unit, configured to set a label for the node path information lists of the similar pages; and
a template generation unit, configured to generate, based on the node path information lists having the same label, a page extraction template matching the label.
It can be seen from the above technical solutions that, in the data processing method and device disclosed in the present application, after the node path information lists of the pages are parsed, the different pages are classified by similarity based on those lists, labels are set for the similar pages, and a page extraction template is generated under each label so that information can be extracted from the corresponding pages. Thus, in the present application the web pages of different websites are classified by similarity, and after web pages with similar structure and content are obtained, the corresponding page extraction templates are generated, which enables data extraction from the web pages of different websites. This information extraction scheme can be applied to extracting page information from different websites and is not limited to website pages of a particular structure or content, which improves the general applicability of information extraction.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data processing method provided in Embodiment 1 of the present application;
Fig. 2 is a partial flowchart of Embodiment 1 of the present application;
Fig. 3, Fig. 4 and Fig. 5 are example diagrams of the embodiments of the present application;
Fig. 6, Fig. 7, Fig. 8 and Fig. 9 are further partial flowcharts of Embodiment 1 of the present application;
Fig. 10 is a schematic structural diagram of a data processing device provided in Embodiment 2 of the present application;
Fig. 11 and Fig. 12 are further example diagrams of the embodiments of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Referring to Fig. 1, which is an implementation flowchart of a data processing method provided in Embodiment 1 of the present application, the method is suitable for building page extraction templates for the pages of different websites, that is, websites that differ in structure or content, so that page information can then be extracted. The method in this embodiment may run on a computer or server with computing capability.
Specifically, the method in this embodiment may include the following steps:
Step 101: Parse at least one page to obtain the node path information list corresponding to each page.
The pages in this embodiment may include pages on a single website or pages on multiple websites, and pages on different websites may be identical (or similar) or different in structure and content. For example, the pages on a shopping website, a news website and an advertising website differ from one another in structure and content.
It should be noted that these pages must of course be acquired before they are parsed, for example by reading them from a database or by crawling them from websites in real time with tools such as web crawlers.
In this embodiment, the node path information list obtained by parsing a page can be understood as the xpath (XML Path Language) list of the page, where the xpath list uses path expressions to characterize the page structure and the structured content.
Specifically, in this embodiment a tree structure can be built for the page and then parsed to generate the xpath list. For example, a third-party library (such as lxml) is used to parse the HTML (HyperText Markup Language) pages on one or more websites and build a DOM (Document Object Model) tree, and the DOM tree is then parsed to form the complete xpath list of the corresponding page.
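As a minimal sketch of step 101, the following Python code derives a positional xpath list from an HTML string. The embodiment names lxml; to stay self-contained, this sketch swaps in the standard-library html.parser module instead, and the names XPathCollector and xpath_list are hypothetical. It assumes well-formed markup and is not the method prescribed by the application.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Collects one positional xpath per element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []           # path steps of the currently open elements
        self.child_counts = [{}]  # per open element: tag -> children seen so far
        self.paths = []

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        step = f"{tag}[{counts[tag]}]"
        self.paths.append("/" + "/".join(self.stack + [step]))
        if tag not in VOID_TAGS:
            self.stack.append(step)
            self.child_counts.append({})

    def handle_endtag(self, tag):
        # Assumption: the end tag closes the innermost open element.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.child_counts.pop()


def xpath_list(html_text):
    """Return the xpath list (node path information list) of one page."""
    collector = XPathCollector()
    collector.feed(html_text)
    collector.close()
    return collector.paths


page = "<html><body><div><p>a</p><p>b</p></div></body></html>"
for path in xpath_list(page):
    print(path)
```

With lxml available, the same list could be produced from the parsed tree; the stack-based version above only illustrates how each path mirrors a node's position in the DOM tree.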
Step 102: Based on the node path information lists, perform structure comparison between the pages of the at least one page to obtain similar pages.
In this embodiment, structure comparison can be performed on every pair of pages among the at least one page to determine which pages are similar pages and which are not.
It should be noted that similar pages in this embodiment can be understood as pages whose similarity value exceeds a certain threshold. The similarity value between pages may be a structural similarity value and/or a content similarity value; that is, two pages being similar pages means that they are similar in structure and/or page content.
Step 103: Set a label for the node path information lists of the similar pages.
The label in this embodiment may be characters extracted from the page content, such as a keyword in the page content; or the label may be characters associated with characters in the page content, such as a near-synonym of a keyword in the page content.
Setting labels means at least the following: pages whose node path information lists have the same label are similar pages, and pages whose node path information lists have different labels are not similar pages.
Specifically, in this embodiment the node path information lists of similar pages can be classified by the names and types of the information they contain, a label is generated from the classification result, and the label is set on the node path information lists of the similar pages.
Step 104: Based on the node path information lists having the same label, generate the page extraction template matching the label.
In this embodiment, the page extraction template can be generated by processing the node path information lists having the same label. For example, among the lists with the same label, the list with the highest similarity to the other lists is selected as the page extraction template; or the lists with the same label are combined or integrated to generate the page extraction template matching that label.
It can be seen from the above scheme that, in the data processing method provided in Embodiment 1 of the present application, after the node path information lists of the pages are parsed, the different pages are classified by similarity based on those lists, labels are set for the similar pages, and a page extraction template is generated under each label so that information can be extracted from the corresponding pages. Thus, in the present application the web pages of different websites are classified by similarity, and after web pages with similar structure and content are obtained, the corresponding page extraction templates are generated, which enables data extraction from the web pages of different websites. This information extraction scheme can be applied to extracting page information from different websites and is not limited to website pages of a particular structure or content, which improves the general applicability of information extraction.
In one implementation, in step 102 of Fig. 1 the operations shown in Fig. 2 can be performed on the node path information lists of any two pages among the at least one page to determine whether the two pages are similar pages, so that the similar pages among all pages requiring information extraction are identified. As shown in Fig. 2:
Step 201: Based on the node path information lists of the two pages, obtain the tree-structure root node and the corresponding subtrees of the first page and of the second page.
As shown in Fig. 3, the tree structures represented by the node path information lists of the first page and the second page are built to obtain a first page tree structure and a second page tree structure, from which the tree-structure root node and corresponding subtrees of each page are obtained.
It should be noted that Fig. 3 shows only example tree structures of the two pages, that is, only the nodes of the tree structures and their positions. Other information in the node path information lists is not shown, but this does not mean that the lists contain nothing besides the tree structure.
Step 202: Compare whether the tree-structure root nodes of the two pages are identical; if so, execute step 203.
In this embodiment, the comparison can check whether the root nodes of the two pages are exactly or approximately identical, for example whether they are represented by the same root directory folder name in the node path information lists. If the root nodes of the two pages are exactly or approximately identical, step 203 is executed.
Step 203: Among the subtrees of the first page, determine for each subtree of the second page the subtree with the highest similarity, to form subtree pairs.
As shown in Fig. 4, for subtrees a1, a2 and a3 of the first page, the subtree most similar to each of subtrees b1, b2 and b3 of the second page is determined to form subtree pairs. For example, if the subtree most similar to b1 is a2, the one most similar to b2 is a1, and the one most similar to b3 is a3, then b1 and a2 form a subtree pair, b2 and a1 form a subtree pair, and b3 and a3 form a subtree pair. It should be noted that the most similar first-page subtrees for different second-page subtrees may be the same or different. For example, if the subtree most similar to b1 is a2, the one most similar to b2 is also a2, and the one most similar to b3 is a1, then b1 and a2 form a subtree pair, b2 and a2 form a subtree pair, and b3 and a1 form a subtree pair.
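The pairing in step 203 can be sketched as follows. The function name pair_subtrees and the score table are hypothetical, and the similarity values are made up purely to reproduce the first example above; the application does not fix how ties are broken, so this sketch simply keeps the first best match.

```python
def pair_subtrees(first_subtrees, second_subtrees, similarity):
    """For each second-page subtree, pick the most similar first-page subtree."""
    pairs = []
    for b in second_subtrees:
        best = max(first_subtrees, key=lambda a: similarity(a, b))
        pairs.append((b, best))
    return pairs


# Hypothetical similarity scores between first-page and second-page subtrees.
SCORES = {
    ("a1", "b1"): 0.2, ("a2", "b1"): 0.9, ("a3", "b1"): 0.1,
    ("a1", "b2"): 0.8, ("a2", "b2"): 0.3, ("a3", "b2"): 0.4,
    ("a1", "b3"): 0.1, ("a2", "b3"): 0.2, ("a3", "b3"): 0.7,
}

pairs = pair_subtrees(["a1", "a2", "a3"], ["b1", "b2", "b3"],
                      lambda a, b: SCORES[(a, b)])
print(pairs)
```

Because each second-page subtree is matched independently, the same first-page subtree may appear in more than one pair, which matches the second example in the text.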
Step 204: Obtain the similarity value of the two subtrees in each subtree pair, and obtain the preset weight of the subtree in the pair that belongs to the first page.
In this embodiment, the similarity of two subtrees can be computed by recursively re-entering step 202. Specifically, when the similarity value of the two subtrees in a subtree pair is to be obtained, the process re-enters step 202 and computes the total similarity value of the two subtrees with the same scheme, until the leaf nodes of the tree structure are reached. The leaf nodes are compared, for example by content similarity, to obtain the similarity value between leaf nodes. The process then returns to the parent subtrees one level above the leaf nodes, computes the total similarity value between those parent subtrees, and continues returning level by level until the similarity value of the two subtrees in the subtree pair is obtained, as shown in Fig. 5.
The preset weight of a subtree of the first page can be understood as the degree of importance to the user of each positional structure in the page. It is an inherent attribute of the page, and different positional structures may have different preset weights; for example, the preset weight of the subtree corresponding to the page body is higher than that of the subtree corresponding to the page sidebar. In this embodiment, the preset weight can be obtained from the node path information list or from its corresponding page information.
Step 205: Based on the similarity values and preset weights of the subtree pairs, obtain the total similarity value between the first page and the second page.
Specifically, in this embodiment the similarity value of each subtree pair can be multiplied by the preset weight of the subtree in that pair belonging to the first page, and the products summed, to obtain the total similarity value of the first page and the second page.
For example, suppose the preset weights of subtrees a1, a2 and a3 of the first page are 0.3, 0.2 and 0.1 respectively. If b1 and a2 form subtree pair c1, b2 and a1 form subtree pair c2, and b3 and a3 form subtree pair c3, the total similarity value is: similarity value of c1 * 0.2 + similarity value of c2 * 0.3 + similarity value of c3 * 0.1. If b1 and a2 form subtree pair d1, b2 and a2 form subtree pair d2, and b3 and a1 form subtree pair d3, the total similarity value is: similarity value of d1 * 0.2 + similarity value of d2 * 0.2 + similarity value of d3 * 0.3.
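The weighted sum of step 205 can be checked with a short calculation. The preset weights come from the example above; the pair similarity values and the function name total_similarity are hypothetical and chosen only for illustration.

```python
def total_similarity(pairs, weights):
    """pairs: (first-page subtree, pair similarity); weights: subtree -> preset weight."""
    return sum(similarity * weights[subtree] for subtree, similarity in pairs)


weights = {"a1": 0.3, "a2": 0.2, "a3": 0.1}

# First case: c1 = (b1, a2), c2 = (b2, a1), c3 = (b3, a3); similarities are made up.
first_case = [("a2", 0.9), ("a1", 0.8), ("a3", 0.5)]
# Second case: d1 = (b1, a2), d2 = (b2, a2), d3 = (b3, a1); similarities are made up.
second_case = [("a2", 0.9), ("a2", 0.6), ("a1", 0.4)]

print(round(total_similarity(first_case, weights), 2))
print(round(total_similarity(second_case, weights), 2))
```

In the second case the weight of a2 is applied twice, once for each pair in which a2 is the first-page subtree.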
Step 206: Determine whether the total similarity value is higher than the preset threshold; if so, execute step 207; otherwise, execute step 208.
The preset threshold can be set according to requirements such as the accuracy or efficiency required of information extraction. For example, the higher the preset threshold, the higher the precision of information extraction, and the lower the preset threshold, the higher the efficiency of information extraction. Users can set the preset threshold freely according to their own needs, which gives them a more flexible information extraction service.
Step 207: Determine that the first page and the second page are similar pages.
It can be seen that in this embodiment similarity comparison is performed on the tree structures of the first page and the second page, so that when the total similarity value is higher than the preset threshold the two pages can be determined to be similar pages.
Step 208: Determine that the first page and the second page are not similar pages.
It should be noted that if the comparison in step 202 finds that the tree-structure root nodes of the two pages are not identical, step 208 can be executed directly, as shown in Fig. 2. That is, this embodiment first determines whether the root nodes of the two pages are identical. If they are, the total similarity value computed from the subtrees is used to determine whether the pages are similar pages; if they are not, the two pages can be directly determined not to be similar pages.
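The flow of steps 202 to 208 can be put together in a small recursive sketch. The Node class, the function names and the sample trees are hypothetical. Two simplifications are assumed: root and leaf comparison use plain name equality (the embodiment also allows approximate matching and content similarity at the leaves), and preset weights are stored directly on the first-page nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: float = 1.0  # preset weight of the subtree rooted at this node
    children: list = field(default_factory=list)


def similarity(a, b):
    """Recursive similarity between first-page subtree a and second-page subtree b."""
    if a.name != b.name:  # step 202: different roots, not similar
        return 0.0
    if not a.children or not b.children:
        # A leaf was reached; only two leaves count as fully similar here.
        return 1.0 if not a.children and not b.children else 0.0
    total = 0.0
    for b_child in b.children:  # step 203: best first-page match per subtree
        scored = [(similarity(a_child, b_child), a_child) for a_child in a.children]
        best_score, best = max(scored, key=lambda item: item[0])
        total += best_score * best.weight  # steps 204 and 205
    return total


def are_similar(first, second, threshold):
    """Steps 206 to 208: compare the total similarity with the preset threshold."""
    return similarity(first, second) > threshold


first = Node("html", children=[
    Node("body", 0.7, [Node("div", 0.5), Node("p", 0.5)]),
    Node("head", 0.3, [Node("title", 1.0)]),
])
second = Node("html", children=[
    Node("head", children=[Node("title")]),
    Node("body", children=[Node("div"), Node("span")]),
])
print(round(similarity(first, second), 2), are_similar(first, second, 0.6))
```

Raising the threshold makes the similarity decision stricter, which corresponds to the precision and efficiency trade-off described for the preset threshold.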
In one implementation, the comparison of two pages shown in Fig. 2 is a comparison of their similarity in page structure, and this embodiment also needs to compare page content. Accordingly, while structure comparison is performed on the first page and the second page, the content categories of the two pages can also be compared. To this end, the page content of each page among the at least one page is obtained, and content category comparison and structure comparison are performed between any two pages to obtain similar pages. Similar pages in this case are two pages that are identical, or have a similarity value above a certain threshold, in both structure and content.
Specifically, in this embodiment the structure of the pages can be compared using the scheme shown in Fig. 2, and the content category of each page can be determined by analysing content such as the page title or hidden topics, which enables content comparison and yields a similarity value for the content category. Further, after content category and structure have been compared, the page similarity value between pages can be computed from the weights assigned to content category and to structure. For example, with a content category weight of 0.5 and a structure weight of 0.5 (or 0.4 and 0.6, and so on), the similarity value for content category and the similarity value for structure are each multiplied by their weight and the products are summed to obtain the page similarity value, which determines whether the pages are similar pages.
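The weighted combination of content and structure similarity can be sketched as below. The function name page_similarity and the input similarity values are hypothetical; only the weight pairs 0.5/0.5 and 0.4/0.6 come from the text.

```python
def page_similarity(content_sim, structure_sim,
                    content_weight=0.5, structure_weight=0.5):
    """Weighted sum of content-category similarity and structural similarity."""
    return content_sim * content_weight + structure_sim * structure_weight


# Made-up similarity values for one pair of pages.
equal_weights = page_similarity(0.8, 0.6)
structure_heavy = page_similarity(0.8, 0.6, content_weight=0.4, structure_weight=0.6)
print(round(equal_weights, 2), round(structure_heavy, 2))
```

The resulting page similarity value would then be compared with a threshold in the same way as the structural total similarity value.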
In one implementation, when step 103 sets a label for the node path information lists of similar pages, it can be implemented as follows, as shown in Fig. 6:
Step 601: According to the node path information lists of the similar pages, determine the target content to be extracted.
Specifically, in this embodiment content such as folder names, file names, file attributes and file types in the node path information list (xpath) can be identified to determine the information that may need to be extracted as the target content.
Step 602: Based on the target content, set the label of the node path information lists of the similar pages.
Specifically, in this embodiment the target content can be classified, and a suitable label determined from the classification result and set for the node path information lists. For example, for the node path information list China/Henan/Zhengzhou/High-tech Zone and the node path information list China/Henan/Zhengzhou/Convention and Exhibition Center, which belong to similar pages, the various kinds of information in the lists, such as folder names, file names, file attributes and file types, are identified and determined as target content. The target content is then divided into content categories, and "China Zhengzhou" can be set as the label of the node path information lists of the two similar pages.
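The application does not spell out how the classification result is turned into a label. One simple possibility, shown here purely as an assumption, is to collect the leading nodes that the paths of the similar pages share and choose the label from them. The function name shared_leading_nodes is hypothetical.

```python
def shared_leading_nodes(paths):
    """Return the leading nodes that all "/"-separated paths have in common."""
    split_paths = [path.strip("/").split("/") for path in paths]
    shared = []
    for steps in zip(*split_paths):
        if len(set(steps)) != 1:
            break
        shared.append(steps[0])
    return shared


paths = [
    "China/Henan/Zhengzhou/High-tech Zone",
    "China/Henan/Zhengzhou/Convention and Exhibition Center",
]
print(shared_leading_nodes(paths))
```

For the two example paths the shared nodes are China, Henan and Zhengzhou, from which a label such as "China Zhengzhou" could be chosen.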
In one implementation, when step 104 generates the page extraction template, a feature lexicon of the page extraction template under a label can first be generated from the node path information lists having that label. The feature lexicon may contain the strong feature words associated with the label and their feature attributes, and may also contain synonyms of the strong feature words. Specifically, the feature lexicon can be obtained as follows, as shown in Fig. 7:
Step 701: For the node path information lists having the same label, extract from those lists the strong feature words that are associated with the label, together with their feature attributes.
A strong feature word may be a word such as a file name or folder name in the node path information list. It may be a word that contains the label of the node path information in which it appears, a word whose similarity to the label reaches a certain threshold, or a word associated with the label in content, concept or meaning. Accordingly, the feature attribute of a strong feature word may be the type attribute corresponding to the file name or folder name, such as a class attribute or a CSS attribute. In this embodiment, the strong feature words and their feature attributes are extracted by parsing the content of the node path information lists.
Step 702: Based on the strong feature words and their feature attributes, generate the feature lexicon of the page extraction template.
In this embodiment, the strong feature words and their feature attributes can be classified and integrated into a word set, which serves as the feature lexicon of the page extraction template generated subsequently.
Step 703: Parse the meaning of the strong feature words to obtain their synonyms, and add the synonyms to the feature lexicon.
For example, for the strong feature word "typhoon", synonyms such as "violent typhoon" and "tropical storm" are added to the feature lexicon, in which each strong feature word is mapped to its synonyms.
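A feature lexicon that maps each strong feature word to its feature attribute and synonyms could look like the sketch below. The data layout and the function names build_lexicon and canonical_word are hypothetical, and the synonym table is hard-coded here, whereas the embodiment obtains synonyms by parsing word meaning.

```python
def build_lexicon(strong_words, synonyms):
    """strong_words: word -> feature attribute; synonyms: word -> list of synonyms."""
    lexicon = {}
    for word, attribute in strong_words.items():
        lexicon[word] = {
            "attribute": attribute,
            "synonyms": list(synonyms.get(word, [])),
        }
    return lexicon


def canonical_word(lexicon, word):
    """Map a word or one of its synonyms back to the strong feature word, else None."""
    if word in lexicon:
        return word
    for strong_word, entry in lexicon.items():
        if word in entry["synonyms"]:
            return strong_word
    return None


lexicon = build_lexicon(
    {"typhoon": "class"},  # hypothetical feature attribute
    {"typhoon": ["violent typhoon", "tropical storm"]},
)
print(canonical_word(lexicon, "tropical storm"))
```

Keeping the synonyms under their strong feature word preserves the correspondence between the two that the text describes.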
Based on implementation above, step 104 can be specifically accomplished by the following way in the present embodiment, as shown in Figure 8:
Step 801: Merge the node path information lists that have the same label.
Specifically, in this embodiment, among the node path information lists with the same label, a preset label symbol, such as the wildcard "*", may first be set for each node path information list from which a strong feature word has been extracted, to mark that a strong feature word has been extracted from that list. No label symbol is set for the other node path information lists, from which no strong feature word has been extracted.
Next, the node path information lists are compared with one another node by node, following the node order in each list, to obtain a comparison result. The comparison result indicates whether the corresponding nodes in the lists are identical, which nodes differ, and so on.
Then, based on the comparison result, node path information lists whose nodes are all identical are merged into a single node path information list, for example by keeping one of them and deleting the other. Node path information lists that differ in exactly one node are likewise merged into a single node path information list, with the differing node replaced by the label symbol; for example, one list is deleted and, in the remaining list, the node that differs from the deleted list is replaced by the label symbol. Node path information lists that differ in two or more nodes are considered different and are not merged as similar lists.
For example, in this embodiment, the node path information lists of two pages that are compared node by node should have node hierarchies that are as similar or identical as possible, for example both 5 levels deep or both 3 levels deep. The nodes separated by "/" in the xpaths are then compared one by one. If all nodes are identical, the two xpaths are merged into one. If only one node differs, the two xpaths are also merged into one, and the differing node in the merged xpath is replaced by a label symbol such as the wildcard "*". If more than one node differs, the xpaths are not considered mergeable and are left unprocessed.
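The merge rule of step 801 can be illustrated with a minimal Python sketch. It is not the implementation of the present application: the function names `split_path`, `merge_pair` and `merge_paths` are hypothetical, each xpath is assumed to be a plain string whose nodes are separated by "/", and the greedy first-match strategy is an assumption made only for illustration.

```python
from typing import List, Optional

WILDCARD = "*"  # the label symbol used to replace the single differing node


def split_path(xpath: str) -> List[str]:
    """Split an xpath string into its nodes, ignoring empty segments."""
    return [node for node in xpath.split("/") if node]


def merge_pair(a: str, b: str) -> Optional[str]:
    """Merge two xpaths that have the same depth and differ in at most one node.

    Returns the merged xpath, or None when the xpaths cannot be merged.
    """
    nodes_a, nodes_b = split_path(a), split_path(b)
    if len(nodes_a) != len(nodes_b):
        return None
    differing = [i for i, (x, y) in enumerate(zip(nodes_a, nodes_b)) if x != y]
    if len(differing) > 1:
        return None  # two or more different nodes: treated as different paths
    merged = list(nodes_a)
    for i in differing:
        merged[i] = WILDCARD
    return "/" + "/".join(merged)


def merge_paths(paths: List[str]) -> List[str]:
    """Greedily merge each xpath into the first already kept xpath it fits."""
    kept: List[str] = []
    for path in paths:
        for i, existing in enumerate(kept):
            candidate = merge_pair(existing, path)
            if candidate is not None:
                kept[i] = candidate
                break
        else:
            kept.append(path)
    return kept
```

In this sketch, an xpath that already contains the wildcard can absorb a further xpath only if the two differ at the wildcard position alone, so a merged path never accumulates more than one wildcard.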
In addition, to reduce the amount of computation, when the node path information lists with the same label are merged in this embodiment, the node path information lists may first be simplified before they are compared node by node; for example, the label symbol may be used to simplify the node path information lists.
Specifically, in this embodiment, the node path information lists may be simplified as follows:
For the node path information of a node path information list in which the label symbol is set, at least the tree structure node names in the node path information and the information used to fill in the strong feature words are retained;
For the node path information of a node path information list in which no label symbol is set, only the tree structure node names in the node path information may be retained.
It should be noted that if the tree structure node corresponding to the node path information contains target content, such as table content, the node name and the table serial number information of the node path information may be retained; if the tree structure node does not contain target content, only the node name of the node path information is retained.
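The simplification rules above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the method of the present application: the set `TABLE_NODES`, the function names, and the representation of a wildcard node as `*[...]` whose predicate holds the slot for the strong feature word are all hypothetical, and predicates are assumed not to contain "/".

```python
import re

# Assumed set of table-related node names whose serial numbers are kept.
TABLE_NODES = {"table", "thead", "tbody", "tr", "td", "th"}

# A node is a name optionally followed by one bracketed predicate.
_NODE = re.compile(r"^(?P<name>[^\[\]]+)(?P<pred>\[.*\])?$")


def simplify_node(node: str) -> str:
    """Reduce one xpath node according to the simplification rules."""
    match = _NODE.match(node)
    if match is None:
        return node  # unknown shape: leave untouched
    name = match.group("name")
    pred = match.group("pred") or ""
    if name == "*":
        return node  # wildcard node: keep the slot for the strong feature word
    if name in TABLE_NODES and re.fullmatch(r"\[\d+\]", pred):
        return node  # table-related node: keep the serial number
    return name      # any other node: keep the node name only


def simplify_path(xpath: str) -> str:
    """Simplify every node of an xpath whose nodes are separated by '/'."""
    return "/" + "/".join(simplify_node(n) for n in xpath.split("/") if n)
```

For instance, under these assumptions `/html/body/div[2]/table[1]/tr[3]/td[2]` keeps its table serial numbers but loses the index on `div`.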
Step 802: Based on the merged node path information lists, generate the page extraction template that matches the label.
In this embodiment, the merged node path information list may be used directly as the page extraction template. The resulting page extraction template may contain multiple pieces of node path information, which can subsequently be used to extract information from pages. Among them, node path information containing strong feature words has a higher priority in information extraction than node path information without strong feature words. For example, when page information is subsequently extracted, node path information containing strong feature words is used first, which can be understood as preferentially using the strong feature words in the feature lexicon for information extraction.
In one implementation, after the page extraction template is obtained, this embodiment may further include the following steps, as shown in Figure 9:
Step 901: In response to a received page extraction request, obtain the target page to be extracted and the extraction label of the target page.
The page extraction request may include the page identifier or page address of the target page to be extracted, so as to identify the target page. In this embodiment, the extraction label of the target page may be obtained by parsing information such as the page identifier, page content, or subject matter of the target page.
Step 902: In the target page extraction template corresponding to the extraction label, preferentially use the node path information containing strong feature words to extract page data from the target page, so as to obtain an extraction result.
For example, in this embodiment, the extraction label is used to find, among the various page extraction templates, the target page extraction template whose label is the same as the extraction label, and that template is then used to extract page data from the target page. Specifically, page data may first be extracted using the node path information containing strong feature words. If no suitable information is extracted, all or some of the strong feature words in the feature lexicon of the target page extraction template are then used. If still no suitable information is extracted, the synonyms of the strong feature words in the feature lexicon may be used. If suitable information still cannot be extracted, the node path information without strong feature words is finally used for page information extraction, and the extraction result is obtained.
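The priority order of step 902 can be sketched as a simple fallback chain. The sketch below is an assumption-laden illustration, not the extraction engine of the present application: the template layout (`strong_words`, `synonyms`, `feature_paths`, `plain_paths`), the `{word}` placeholder, and the `run_xpath` callback are hypothetical, and the first two stages described above (paths already containing strong feature words, then further strong feature words from the lexicon) are collapsed into one stage for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple


def extract_with_fallback(
    template: Dict,
    run_xpath: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Try extraction paths in priority order; return (value, stage name)."""
    strong_words: List[str] = template["strong_words"]
    synonyms: Dict[str, List[str]] = template["synonyms"]
    feature_paths: List[str] = template["feature_paths"]

    # Stage 1: paths filled with the strong feature words themselves.
    for word in strong_words:
        for pattern in feature_paths:
            value = run_xpath(pattern.format(word=word))
            if value:
                return value, "strong_word"

    # Stage 2: the same paths filled with synonyms from the feature lexicon.
    for word in strong_words:
        for synonym in synonyms.get(word, []):
            for pattern in feature_paths:
                value = run_xpath(pattern.format(word=synonym))
                if value:
                    return value, "synonym"

    # Stage 3: paths that contain no strong feature word.
    for path in template["plain_paths"]:
        value = run_xpath(path)
        if value:
            return value, "plain_path"

    # Nothing matched: the caller may mark the page and label as default.
    return None, "default"
```

Passing the evaluator in as `run_xpath` keeps the sketch independent of any particular XPath library; in practice it would wrap whatever parser the page was loaded with.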
In addition, when no suitable information can be obtained by extracting page data with the strong feature words, the characteristic attributes of the strong feature words may also be used for learning and training. For example, extraction training may be performed on information found under a given attribute such as the class attribute, for example file names and folder names, so that the corresponding information can be extracted from the page.
If no suitable information is ultimately extracted, the target page and its extraction label may be marked as default in this embodiment, and the node path information list of the target page is obtained in order to regenerate a corresponding page extraction template or to merge it into the already generated page extraction template, so that information can be extracted from the target page or from further pages.
It should be noted that, after the extraction result is obtained in this embodiment, the result may contain redundant information because of data alignment. Therefore, the extraction result may be further cleaned after it is obtained, for example by data redundancy processing that deletes duplicate data, so as to obtain a more accurate extraction result.
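The cleaning of redundant data can be sketched as an order-preserving removal of duplicates. The function name `clean_results` is hypothetical, and treating values that differ only in whitespace as duplicates is an assumption of this sketch, not something stated in the present application.

```python
from typing import Iterable, List


def clean_results(values: Iterable[str]) -> List[str]:
    """Drop empty and duplicate values while keeping first-seen order."""
    seen = set()
    cleaned: List[str] = []
    for value in values:
        normalized = " ".join(value.split())  # collapse runs of whitespace
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```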
Referring to Figure 10, which is a schematic structural diagram of a data processing device provided in Embodiment 2 of the present application, the device is suitable for building page extraction templates for the pages of different websites, that is, websites with different structures or content, and then for extracting page information. The device in this embodiment may run on a computer or server with computing capability.
Specifically, the device in this embodiment may include the following structure:
A page parsing unit 1001, configured to parse at least one page so as to obtain the node path information list corresponding to each page.
The pages in this embodiment may include pages on a single website or pages on multiple websites, and pages on different websites may be identical (or similar) or different in structure and content. For example, the pages on a shopping website, a news website, and an advertising website differ in structure and content.
It should be noted that these pages must of course be acquired before they are parsed in this embodiment, for example by reading them from a database or by crawling them from websites in real time with tools such as web crawlers.
In this embodiment, the node path information list obtained by parsing a page can be understood as the xpath (XML Path Language) list of the page, where the xpath list uses path expressions to characterize the page structure and the structure content.
Specifically, in this embodiment, a tree structure of the page may be constructed from the page and then parsed to generate the xpath list. For example, a third-party library (such as lxml) is used to parse the HTML (HyperText Markup Language) pages on one or more websites and build a Document Object Model (DOM) tree, and the DOM tree is then parsed to form the complete xpath list of each page.
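To make the notion of an xpath list concrete, the following sketch collects one positional path per element. The embodiment mentions a third-party library such as lxml; this sketch instead uses only Python's standard `html.parser` so that it has no dependencies, which is a substitution made for illustration. The class and function names are hypothetical, and the input is assumed to be well nested, since this parser does not repair broken markup the way a full DOM builder would.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Collect a positional path such as /html[1]/body[1]/div[2] per element."""

    def __init__(self) -> None:
        super().__init__()
        self.steps: List[str] = []         # currently open elements
        self.sibling_counts = [Counter()]  # per open level: tag -> count so far
        self.paths: List[str] = []

    def _record(self, tag: str) -> str:
        self.sibling_counts[-1][tag] += 1
        step = f"{tag}[{self.sibling_counts[-1][tag]}]"
        self.paths.append("/" + "/".join(self.steps + [step]))
        return step

    def handle_starttag(self, tag, attrs):
        step = self._record(tag)
        if tag not in VOID_TAGS:           # void elements have no children
            self.steps.append(step)
            self.sibling_counts.append(Counter())

    def handle_startendtag(self, tag, attrs):
        self._record(tag)                  # self-closing tag such as <br/>

    def handle_endtag(self, tag):
        # Close the element only if it is the innermost open one.
        if self.steps and self.steps[-1].split("[")[0] == tag:
            self.steps.pop()
            self.sibling_counts.pop()


def xpath_list(html: str) -> List[str]:
    """Return the positional paths of all elements in document order."""
    collector = XPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths
```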
A similarity comparison unit 1002, configured to perform structure alignment between the pages in the at least one page based on the node path information lists, so as to obtain similar pages.
In this embodiment, structure alignment may be performed on any two pages of the at least one page, so as to determine which pages are similar pages and which are not.
It should be noted that similar pages in this embodiment can be understood as pages whose mutual similarity value is greater than a certain threshold. The similarity value between pages may be a structural similarity value and/or a content similarity value; that is, two pages being similar pages means that their structure and/or content are similar.
A label setting unit 1003, configured to set a label for the node path information lists of the similar pages.
The label in this embodiment may be characters extracted from the page content, for example a keyword in the page content; or the label may be characters associated with characters in the page content, for example a word close in meaning to a keyword in the page content, and so on.
The significance of setting labels is at least the following: the pages corresponding to the node path information lists under the same label are similar pages, and the pages corresponding to the node path information lists under different labels are not similar pages.
Specifically, in this embodiment, the node path information lists of the similar pages may be classified according to the names and types of their information content, labels are then generated from the classification result, and the labels are set on the node path information lists of the similar pages.
A template generation unit 1004, configured to generate, based on the node path information lists with the same label, a page extraction template that matches the label.
In this embodiment, the page extraction template may be generated by processing the node path information lists with the same label. For example, among the node path information lists with the same label, the list with the highest similarity to the other lists is selected as the page extraction template; or the node path information lists with the same label are combined or integrated to generate the page extraction template that matches the label, and so on.
As can be seen from the above solution, the data processing device provided in Embodiment 2 of the present application parses the node path information lists of various pages, classifies these different pages based on the similarity between their node path information lists, sets labels for similar pages, and then generates the page extraction template under each label, so that information can be extracted from the corresponding pages. In this embodiment, webpages of different websites are thus classified by similarity to obtain webpages with similar structure and content, and the corresponding page extraction templates are then generated, enabling data extraction from the webpages of different websites. This extraction scheme is suitable for extracting page information from different websites and is not limited to website pages of one particular structure or content, which improves the general applicability of information extraction.
Based on the above implementation, an example is given below in which a template is built for structured data extraction and then used for a concrete extraction, as shown in Figure 11:
Step 1101: Compare the structural similarity and the content of the DOM trees generated from the HTML pages, and classify webpages that are of the same type and similar in structure.
Specifically, in this embodiment, a third-party library (such as lxml) may be used to parse the HTML page and build a DOM tree, and the DOM tree is parsed to form the complete XPath list of the page. An example of the resulting tree-shaped XPath list is shown in Figure 12, which does not show strong feature words or characteristic attributes and shows only the nodes and node positions. In this embodiment, pages are first classified by content according to the page title or hidden topic, so as to determine pages that are similar in content. To judge the structural similarity of two or more webpages, the root nodes of the two webpage trees are compared: if they differ, the similarity is 0 and the calculation stops; if they are identical, the calculation continues. For each subtree, the subtree with the greatest similarity is chosen from the other tree's set of subtrees as its matching object, and that similarity serves as the similarity reference value. Taking the nodes of each subtree as its weight, the overall reference value of all subtrees is calculated to obtain the overall similarity of the two trees. If the similarity threshold is met, the two webpages are determined to be structurally similar. The determination scheme in this embodiment is suitable for webpages of different websites; webpages of the same website can be judged using regular expression matching.
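The structural comparison described above can be sketched recursively. The following is a minimal sketch under stated assumptions, not the exact calculation of the present application: a tree is assumed to be a `(tag, children)` tuple, the weight of a subtree is assumed to be its node count, the root contributes a weight of 1, and the default threshold of 0.8 is an arbitrary illustrative value.

```python
from typing import List, Tuple

Tree = Tuple[str, List["Tree"]]  # (node name, child subtrees)


def size(tree: Tree) -> int:
    """Number of nodes in the tree, used here as the subtree weight."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a: Tree, b: Tree) -> float:
    """Weighted structural similarity of tree a measured against tree b."""
    if a[0] != b[0]:
        return 0.0  # different root nodes: stop immediately
    score = 1.0         # the matching root itself
    total_weight = 1.0
    for child in a[1]:
        weight = size(child)
        # Best match for this subtree among the subtrees of the other tree.
        best = max((similarity(child, other) for other in b[1]), default=0.0)
        score += weight * best
        total_weight += weight
    return score / total_weight


def is_similar(a: Tree, b: Tree, threshold: float = 0.8) -> bool:
    """Decide structural similarity against an assumed threshold."""
    return similarity(a, b) >= threshold
```

Note that this measure is directional, because only the subtrees of the first tree are matched and weighted; a symmetric variant could, for example, take the smaller of the two directions.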
Step 1102: Parse the content of the webpages that are of the same type and similar in structure, organize the parsed XPath paths according to the names and types of the data to be extracted, and label them.
For example, in this embodiment, HTML pages that are at least partly of the same content type and similar in structure are used as the page samples of one field. Based on the page content, the information that may need to be extracted is determined and the corresponding XPath paths are parsed; the data to be extracted is then classified and labeled according to name and type, and the corresponding XPath paths are unified under the same label. So that extraction results can be compared against them, part of the originally labeled data or data characteristics that were extracted may be retained under the same label.
Step 1103: For the XPath paths with the same label, extract from the paths the strong feature words and characteristic attributes that are close to the label, weaken the nodes where these features are located, and replace them with wildcards.
In this embodiment, when the XPath paths are parsed in the preceding step 1102, for table content or content that has a corresponding title or type on the page, the XPath paths that can be reached through that title are searched for first. The XPath paths under the label are then traversed node by node; when a node contains text close to the label, or an attribute representing the type of the object to be extracted, the current node and the feature word position are matched with the wildcard '*', and the corresponding text or attribute is recorded as the strong feature word and its characteristic attribute.
Step 1104: Organize and summarize the strong feature words and characteristic attributes to form a feature lexicon, analyze word meaning and part of speech, obtain other possible synonyms, and add them to the lexicon as alternatives.
In this embodiment, the strong feature words obtained by traversal are summarized to obtain a preliminary feature lexicon. Because the current extraction objects are mainly Chinese or English webpages, a Chinese or English near-synonym toolkit is used to find near-synonyms with high similarity, which are added to the lexicon as alternatives. When a strong feature word is a phrase or word combination, it must first be segmented; the words whose part of speech is noun after segmentation are selected, near-synonyms with high similarity are found for them, and these are added to the lexicon as alternatives. If a synonym that is found already exists in the feature lexicon, it is not added again; if it only appears as part of some phrase in the lexicon, it is still added.
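The rule for adding synonyms to the feature lexicon can be sketched as follows. The lexicon layout (a dictionary mapping each strong feature word to a list of synonyms) and the function name `add_synonyms` are assumptions of this sketch; the candidate synonyms are passed in directly, since the near-synonym toolkit is not named in the present application.

```python
from typing import Dict, List


def add_synonyms(lexicon: Dict[str, List[str]],
                 strong_word: str,
                 candidates: List[str]) -> None:
    """Add candidate synonyms for strong_word, skipping existing entries.

    A candidate counts as existing only if it is itself an entry of the
    lexicon; a candidate that merely occurs inside a longer phrase in the
    lexicon is still added.
    """
    known = set(lexicon) | {s for syns in lexicon.values() for s in syns}
    entry = lexicon.setdefault(strong_word, [])
    for candidate in candidates:
        if candidate not in known:
            entry.append(candidate)
            known.add(candidate)
```

With the lexicon `{"typhoon": ["violent typhoon"]}`, the candidate "violent typhoon" is skipped as a duplicate, whereas "storm" is still added even after "tropical storm" has been, because it only occurs inside that phrase.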
Step 1105: Based on the processing in step 1103, simplify and merge the XPath paths with the same label according to the designed comparison table, so that paths containing strong feature words and characteristic attributes are preferred and the other alternative paths have a lower priority, and save the result as the template under the label.
First, the xpath paths are simplified in this embodiment. Specifically, for a node that has not been replaced by a wildcard, only the node name is retained, except that some table-related nodes are retained up to the serial number part; for a node replaced by a wildcard, the node name and the part that can be used to fill in strong feature words are retained. After all nodes have been reduced, the preliminarily simplified XPath paths are obtained.
Then, the xpath paths with the same label are merged in this embodiment. Specifically, the paths under the same label are compared node by node in order, and identical entries are merged; if two or more XPath expressions differ in only one node, they are likewise treated as the same expression, and the single differing node is replaced by a wildcard.
Step 1106: Select the template under the label to be extracted and extract from similar webpages, preferentially selecting paths containing strong feature words; if nothing matches, substitute the synonyms in the feature lexicon. After all synonyms in the lexicon have been tried, start selecting the XPath paths without characteristic attributes, and compare the extracted text with the labeled data.
In this embodiment, when information needs to be extracted from a page, for a target webpage that is of a similar type and contains the information to be extracted, the label to be extracted is first determined for the target webpage as required, and the page extraction template under that label is then selected. Paths in the template that contain strong feature words, together with the corresponding feature lexicon, are preferentially selected for extraction. If nothing matches or no suitable information is extracted, the synonyms in the feature lexicon are substituted and matching is attempted again. If no result can be extracted even after all synonyms in the lexicon have been tried, the XPath paths without characteristic attributes are selected for matching, the extracted text is compared with the originally labeled data, and the reasonable information is retained as the extraction result.
It should be noted that the page extraction template in this embodiment can be used to extract from pages in batches. During batch extraction, for webpages of the same website, if the required content is successfully extracted using a certain XPath path, the XPath path corresponding to the current label is recorded, and these paths are used preferentially in subsequent extraction. It should also be noted that the same page may have multiple pieces of data that can be extracted with the same label; because of data alignment, redundant information may then be extracted, and the extraction result needs to be further cleaned.
Step 1107: If the data to be extracted is not matched, mark it as default, analyze the webpage as a whole, and merge it into the template.
Specifically, in batch extraction in this embodiment, if a small number of labels fail to match the data to be extracted, they are marked as default and extraction continues. Labels that are not matched can be supplemented from the extracted data of websites of the same type, or the corresponding extractable XPath paths can be added manually. If most of the labels to be extracted are not matched to corresponding information, the webpage needs to be analyzed as a whole and the template generation steps repeated, with the resulting XPath paths used as seeds.
As can be seen, in this embodiment the webpages of different websites can be classified by calculating similarity, yielding webpages with similar structure and content. When enough samples have been parsed, the differing parts of most similar webpages can be fine-tuned using conditional expressions, so the template is more versatile and has a relatively wide range of application. Meanwhile, this embodiment uses custom labels to classify feature words and characteristic attributes, which helps group fields that have the same meaning but several ways of being expressed. In addition, this embodiment forms the feature lexicon from the original words in the paths and their synonyms, which is more universal for data of the same kind on different webpages. Furthermore, only the templates for the data to be extracted are saved, and useless data does not need to be organized.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method, comprising:
parsing at least one page to obtain a node path information list corresponding to each page;
performing, based on the node path information lists, structure alignment between the pages in the at least one page to obtain similar pages;
setting a label for the node path information lists of the similar pages;
generating, based on the node path information lists with the same label, a page extraction template that matches the label.
2. The method according to claim 1, wherein performing structure alignment between the pages in the at least one page based on the node path information lists comprises:
performing the following operations on the node path information lists of two pages in the at least one page:
obtaining, based on the node path information lists of the two pages, the tree structure root node and the corresponding subtrees of each of a first page and a second page of the two pages;
based on a determination that the tree structure root nodes of the two pages are identical, determining, among the subtrees of the first page, the subtree having the highest similarity with each subtree of the second page, so as to form subtree pairs;
obtaining the similarity value of the two subtrees in each subtree pair, and obtaining a preset weight of the subtree in the subtree pair that belongs to the first page;
obtaining a total similarity value between the first page and the second page based on the similarity values of the subtree pairs and the preset weights;
based on a determination that the total similarity value is higher than a preset threshold, determining that the first page and the second page are similar pages.
3. The method according to claim 1 or 2, further comprising:
obtaining the page content of the at least one page;
performing structure alignment between those pages in the at least one page whose page content belongs to the same category, to obtain similar pages.
4. The method according to claim 1 or 2, wherein setting a label for the node path information lists of the similar pages comprises:
determining target content to be extracted according to the node path information lists of the similar pages;
setting, based on the target content, the label of the node path information lists of the similar pages.
5. The method according to claim 1 or 2, further comprising:
for the node path information lists with the same label, extracting from the node path information lists the strong feature words that have an association relationship with the label, and their characteristic attributes;
generating a feature lexicon of the page extraction template based on the strong feature words and their characteristic attributes;
parsing the meaning of the strong feature words to obtain synonyms of the strong feature words;
adding the synonyms to the feature lexicon of the page extraction template.
6. The method according to claim 5, wherein generating, based on the node path information lists with the same label, the page extraction template that matches the label comprises:
merging the node path information lists with the same label;
generating, based on the merged node path information lists, the page extraction template that matches the label; wherein the generated page extraction template includes multiple pieces of node path information, the node path information is used for information extraction, and node path information containing strong feature words has a higher priority than other node path information.
7. The method according to claim 6, wherein merging the node path information lists with the same label comprises:
among the node path information lists with the same label, setting a preset label symbol for each node path information list from which a strong feature word has been extracted;
comparing the node path information lists with one another node by node, following the node order in the node path information lists, to obtain a comparison result;
based on the comparison result, merging node path information lists whose nodes are all identical into one node path information list, merging node path information lists that differ in one node into one node path information list, and replacing the differing node with the label symbol.
8. The method according to claim 7, wherein before the node path information lists are compared with one another node by node following the node order in the node path information lists, the method further comprises:
simplifying the node path information lists using the label symbol;
specifically:
for the node path information of a node path information list in which the label symbol is set, retaining at least the tree structure node names in the node path information and the information used to fill in the strong feature words;
for the node path information of a node path information list in which no label symbol is set, retaining the tree structure node names in the node path information.
9. The method according to claim 1 or 2, further comprising:
in response to a received page extraction request, obtaining a target page to be extracted and an extraction label of the target page;
in a target page extraction template corresponding to the extraction label, preferentially using node path information containing strong feature words to extract page data from the target page, to obtain an extraction result;
based on a determination that the corresponding data has not been extracted in the extraction result, obtaining the node path information list of the target page and generating a corresponding page extraction template.
10. A data processing device, comprising:
a page parsing unit, configured to parse at least one page to obtain a node path information list corresponding to each page;
a similarity comparison unit, configured to perform, based on the node path information lists, structure alignment between the pages in the at least one page to obtain similar pages;
a label setting unit, configured to set a label for the node path information lists of the similar pages;
a template generation unit, configured to generate, based on the node path information lists with the same label, a page extraction template that matches the label.
CN201811073868.9A 2018-09-14 2018-09-14 Data processing method and device Active CN109165373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811073868.9A CN109165373B (en) 2018-09-14 2018-09-14 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811073868.9A CN109165373B (en) 2018-09-14 2018-09-14 Data processing method and device

Publications (2)

Publication Number Publication Date
CN109165373A true CN109165373A (en) 2019-01-08
CN109165373B CN109165373B (en) 2022-04-22

Family

ID=64879429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811073868.9A Active CN109165373B (en) 2018-09-14 2018-09-14 Data processing method and device

Country Status (1)

Country Link
CN (1) CN109165373B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826002A (en) * 2019-10-30 2020-02-21 腾讯科技(深圳)有限公司 Information sharing method and device, terminal and storage medium
CN111522606A (en) * 2020-04-26 2020-08-11 广东优特云科技有限公司 Data processing method, device, equipment and storage medium
CN111966930A (en) * 2020-08-17 2020-11-20 山东亿云信息技术有限公司 Webpage list analyzing method and system based on XPath sequence
CN113626028A (en) * 2020-05-07 2021-11-09 腾讯科技(深圳)有限公司 Page element mapping method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720868B2 (en) * 2006-11-13 2010-05-18 Microsoft Corporation Providing assistance with the creation of an XPath expression
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN102890681A (en) * 2011-07-20 2013-01-23 阿里巴巴集团控股有限公司 Method and system for generating webpage structure template
CN103136358A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Method for automatically extracting BBS (bulletin board system) data
CN104572934A (en) * 2014-12-29 2015-04-29 西安交通大学 Webpage key content extracting method based on DOM
CN105117397A (en) * 2015-06-18 2015-12-02 浙江大学 Method for searching semantic association of medical documents based on ontology
CN105512245A (en) * 2015-11-30 2016-04-20 青岛智能产业技术研究院 Enterprise figure building method based on regression model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
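朴勇: "Research on XML-Based Text Structure Information Extraction and Clustering", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology *
王海涛: "Research on Key Technologies for Similarity Detection Based on Large-Scale Text Data Sets", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology *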

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826002A (en) * 2019-10-30 2020-02-21 腾讯科技(深圳)有限公司 Information sharing method and device, terminal and storage medium
CN111522606A (en) * 2020-04-26 2020-08-11 广东优特云科技有限公司 Data processing method, device, equipment and storage medium
CN111522606B (en) * 2020-04-26 2023-08-04 广东优特云科技有限公司 Data processing method, device, equipment and storage medium
CN113626028A (en) * 2020-05-07 2021-11-09 腾讯科技(深圳)有限公司 Page element mapping method and device
CN111966930A (en) * 2020-08-17 2020-11-20 山东亿云信息技术有限公司 Webpage list analyzing method and system based on XPath sequence

Also Published As

Publication number Publication date
CN109165373B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
US10482384B1 (en) System for extracting semantic triples for building a knowledge base
Embley et al. Record-boundary discovery in Web documents
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN102254014B (en) Adaptive information extraction method for webpage characteristics
US7519621B2 (en) Extracting information from Web pages
CN109165373A (en) Data processing method and device
US7558792B2 (en) Automatic extraction of human-readable lists from structured documents
US7516397B2 (en) Methods, apparatus and computer programs for characterizing web resources
CN104268148B (en) Method and system for automatically extracting forum page information based on time strings
CN109325201A (en) Method, device, equipment and storage medium for generating entity relationship data
US7099870B2 (en) Personalized web page
KR20110009098A (en) Search results ranking using editing distance and document information
JP2006004417A (en) Method and device for recognizing specific type of information file
JP2010506247A (en) Network-based method and apparatus for filtering junk information
CN111079043A (en) Key content positioning method
CN110377884A (en) Document parsing method, device, computer equipment and storage medium
Khan et al. Audio structuring and personalized retrieval using ontologies
US20120005207A1 (en) Method and system for web extraction
CN111966940B (en) Target data positioning method and device based on user request sequence
CN108874870A (en) Data extraction method, equipment and computer storage medium
CN110209765A (en) Method and apparatus for searching keywords by semantics
US20120150899A1 (en) System and method for selectively generating tabular data from semi-structured content
Liu et al. An automated algorithm for extracting website skeleton
Changuel et al. A general learning method for automatic title extraction from html pages
CN113157857B (en) Hot topic detection method, device and equipment for news

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant