CN109165373A - Data processing method and device - Google Patents
- Publication number
- CN109165373A CN201811073868.9A
- Authority
- CN
- China
- Prior art keywords
- page
- path information
- node path
- information list
- label
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses a data processing method and device. The method includes: parsing at least one page to obtain a node path information list corresponding to each page; performing, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages; setting a label for the node path information lists of the similar pages; and generating, based on the node path information lists with the same label, a page extraction template that matches the label.
Description
Technical field
This application relates to the technical field of page extraction, and in particular to a data processing method and device.
Background
At present, when structured information is extracted from websites of the same type, web page information is generally extracted by building templates.
However, existing extraction template configurations are not suitable for extracting information from the web pages of different websites, which reduces the general applicability of information extraction.
Summary of the invention
In view of this, the present application provides a data processing method and device to solve the technical problem in the prior art that page extraction templates cannot extract information from the web pages of different websites, which makes information extraction less widely applicable.
The present application provides a data processing method, comprising:
parsing at least one page to obtain a node path information list corresponding to each page;
performing, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages;
setting a label for the node path information lists of the similar pages;
generating, based on the node path information lists with the same label, a page extraction template that matches the label.
Preferably, in the above method, performing structure comparison between the pages of the at least one page based on the node path information lists comprises:
performing the following operations on the node path information lists of two pages of the at least one page:
obtaining, based on the node path information lists of the two pages, the tree structure root node and the corresponding subtrees of each of the first page and the second page of the two pages;
upon determining that the tree structure root nodes of the two pages are the same, determining, among the subtrees of the first page, the subtree with the highest similarity to each subtree of the second page, to form subtree pairs;
obtaining the similarity value of the two subtrees in each subtree pair, and obtaining the preset weight of the subtree in the subtree pair that belongs to the first page;
obtaining the total similarity value between the first page and the second page based on the similarity values of the subtree pairs and the preset weights;
upon determining that the total similarity value is higher than a preset threshold, determining that the first page and the second page are similar pages.
Preferably, the above method further comprises:
obtaining the page content of the at least one page;
comparing both the page content category and the structure between the pages of the at least one page to obtain similar pages.
Preferably, in the above method, setting a label for the node path information lists of the similar pages comprises:
determining the target content to be extracted according to the node path information lists of the similar pages;
setting the label of the node path information lists of the similar pages based on the target content.
Preferably, the above method further comprises:
for the node path information lists with the same label, extracting from those lists the strong feature words that are associated with the label, together with their feature attributes;
generating a feature lexicon for the page extraction template based on the strong feature words and their feature attributes;
parsing the meaning of the strong feature words to obtain synonyms of the strong feature words;
adding the synonyms to the feature lexicon of the page extraction template.
Preferably, in the above method, generating a page extraction template that matches the label based on the node path information lists with the same label comprises:
merging the node path information lists with the same label;
generating, based on the merged node path information list, the page extraction template that matches the label; wherein the generated page extraction template contains multiple pieces of node path information, the node path information is used for information extraction, and the priority of node path information containing strong feature words is higher than the priority of other node path information.
Preferably, in the above method, merging the node path information lists with the same label comprises:
among the node path information lists with the same label, setting a preset marker symbol for the node path information lists from which the strong feature words were extracted;
comparing the node path information lists with one another, node by node in the order of the nodes in the lists, to obtain a comparison result;
based on the comparison result, merging node path information lists whose nodes are all identical into one node path information list, and merging node path information lists that differ in exactly one node into one node path information list, with the differing node replaced by the marker symbol.
Preferably, in the above method, before the node path information lists are compared with one another node by node in the order of the nodes in the lists, the method further comprises:
simplifying the node path information lists using the marker symbol;
specifically:
for the node path information of a node path information list on which the marker symbol is set, retaining at least the tree structure node names in the node path information and the information carrying the strong feature words;
for the node path information of a node path information list on which the marker symbol is not set, retaining the tree structure node names in the node path information.
Preferably, the above method further comprises:
in response to a received page extraction request, obtaining the target page to be extracted and the extraction label of the target page;
extracting page data from the target page, preferentially using the node path information containing strong feature words in the target page extraction template corresponding to the extraction label, to obtain an extraction result;
upon determining that the corresponding data was not extracted in the extraction result, obtaining the node path information list of the target page and generating a corresponding page extraction template.
The present application also provides a data processing device, comprising:
a page parsing unit, configured to parse at least one page to obtain a node path information list corresponding to each page;
a similarity comparison unit, configured to perform, based on the node path information lists, structure comparison between the pages of the at least one page to obtain similar pages;
a label setting unit, configured to set a label for the node path information lists of the similar pages;
a template generation unit, configured to generate, based on the node path information lists with the same label, a page extraction template that matches the label.
As can be seen from the above technical solution, in the data processing method and device disclosed in the present application, after the node path information lists of the various pages are parsed out, the different pages are classified by similarity based on those lists, a label is set for the similar pages, and a page extraction template is then generated under that label so that information can be extracted from the corresponding pages. In other words, the present application classifies the web pages of different websites by similarity, obtains web pages that are similar in structure and content, and only then generates the corresponding page extraction template. This enables data extraction from the web pages of different websites: the scheme can extract page information from different websites and is not limited to website pages of one particular structure or content, which improves the general applicability of information extraction.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data processing method provided by Embodiment 1 of the present application;
Fig. 2 is a partial flowchart of Embodiment 1 of the present application;
Fig. 3, Fig. 4 and Fig. 5 are each example diagrams of an embodiment of the present application;
Fig. 6, Fig. 7, Fig. 8 and Fig. 9 are each further partial flowcharts of Embodiment 1 of the present application;
Fig. 10 is a schematic structural diagram of a data processing device provided by Embodiment 2 of the present application;
Fig. 11 and Fig. 12 are each further example diagrams of an embodiment of the present application.
Detailed description of embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the protection scope of this application.
Referring to Fig. 1, which is an implementation flowchart of a data processing method provided by Embodiment 1 of the present application, the method is suitable for building page extraction templates for the pages of different websites, that is, websites of different structure or content, and then extracting page information. The method in this embodiment can run on a computer or server with computing capability.
Specifically, the method in this embodiment may include the following steps:
Step 101: parse at least one page to obtain a node path information list corresponding to each page.
The pages in this embodiment may include pages on a single website or pages on multiple websites, and pages on different websites may be the same (or similar) or different in structure and content. For example, the pages of a shopping website, a news website and an advertising website differ from one another in structure and content.
It should be noted that these pages naturally have to be obtained before they are parsed, for example by reading them from a database or by crawling them from websites in real time with tools such as web crawlers.
In this embodiment, the node path information list obtained by parsing a page can be understood as the page's xpath (XML Path Language) list, where the xpath list uses path expressions to characterize the page structure and the structured content.
Specifically, in this embodiment a tree structure can be built for the page and then parsed to generate the xpath list. For example, a third-party library (such as lxml) is used to parse the HTML (HyperText Markup Language) pages of one or more websites and build a Document Object Model (DOM) tree, and the DOM tree is then parsed to form the complete xpath list of each page.
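The parsing in step 101 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it swaps the third-party library mentioned above for Python's standard `html.parser` so that it runs without extra dependencies, and the names `PathCollector`, `VOID_TAGS` and `node_path_list` are hypothetical. It records only tag names, without attributes or positional indexes.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag in HTML, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class PathCollector(HTMLParser):
    """Collects one slash-separated path per element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tag names from the root down to the current element
        self.paths = []   # the resulting node path information list

    def handle_starttag(self, tag, attrs):
        self.paths.append("/" + "/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def node_path_list(html_text):
    """Return the list of element paths for one page (hypothetical helper)."""
    collector = PathCollector()
    collector.feed(html_text)
    collector.close()
    return collector.paths


paths = node_path_list("<html><body><div><p>a</p><br></div><p>b</p></body></html>")
```

A real xpath list would normally also carry attributes and sibling positions; those are left out here to keep the sketch short.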
Step 102: based on the node path information lists, perform structure comparison between the pages of the at least one page to obtain similar pages.
In this embodiment, structure comparison can be performed on any two pages of the at least one page, to determine which pages are similar pages and which are not.
It should be noted that similar pages in this embodiment can be understood as pages whose mutual similarity value is greater than a certain threshold. The similarity value between pages can be a structural similarity value and/or a content similarity value; that is, two pages being similar pages means that the two pages are similar in structure and/or page content.
Step 103: set a label for the node path information lists of the similar pages.
The label in this embodiment can consist of characters extracted from the page content, such as a keyword in the page content; or it can consist of characters associated with characters in the page content, such as a near-synonym of a keyword in the page content.
The label setting means at least the following: the pages corresponding to the node path information lists under the same label are similar pages, and the pages corresponding to the node path information lists under different labels are not similar pages.
Specifically, in this embodiment the node path information lists of the similar pages can be classified according to the names and types of their information content, a label is generated from the classification result, and the label is set on the node path information lists of the similar pages.
Step 104: based on the node path information lists with the same label, generate a page extraction template that matches the label.
In this embodiment, the page extraction template can be generated by processing the node path information lists with the same label. For example, among the node path information lists with the same label, the list with the highest similarity to the other lists is selected as the page extraction template; or the node path information lists with the same label are combined or integrated to generate the page extraction template that matches the label.
As can be seen from the above scheme, in the data processing method provided by Embodiment 1 of the present application, after the node path information lists of the various pages are parsed out, the different pages are classified by similarity based on those lists, a label is set for the similar pages, and a page extraction template is then generated under that label so that information can be extracted from the corresponding pages. In other words, the present application classifies the web pages of different websites by similarity, obtains web pages that are similar in structure and content, and only then generates the corresponding page extraction template. This enables data extraction from the web pages of different websites: the scheme can extract page information from different websites and is not limited to website pages of one particular structure or content, which improves the general applicability of information extraction.
In one implementation, in step 102 of Fig. 1, the following operations can be performed on the node path information lists of any two pages of the at least one page, to determine whether the two pages are similar pages and thereby identify the similar pages among all pages that require information extraction, as shown in Fig. 2:
Step 201: based on the node path information lists of the two pages, obtain the tree structure root node and the corresponding subtrees of each of the first page and the second page.
As shown in Fig. 3, the tree structures characterized by the node path information lists of the first page and the second page are built to obtain the first page tree structure and the second page tree structure, and the tree structure root node and corresponding subtrees of each page are then obtained.
It should be noted that Fig. 3 shows only an example of the tree structures of the two pages, that is, only the nodes in the tree structures and their positions. The other information in the node path information lists is not shown, but this does not mean that the node path information lists contain nothing besides the tree structure.
Step 202: compare whether the tree structure root nodes of the two pages are the same; if they are, execute step 203.
In this embodiment, the comparison can check whether the tree structure root nodes of the two pages are exactly or approximately consistent, for example whether they are characterized by the same root directory folder name in the node path information lists. If the tree structure root nodes of the two pages are exactly or approximately consistent, step 203 is executed.
Step 203: among the subtrees of the first page, determine the subtree with the highest similarity to each subtree of the second page, to form subtree pairs.
As shown in Fig. 4, among subtrees a1, a2 and a3 of the first page, the subtree with the highest similarity to each of subtrees b1, b2 and b3 of the second page is determined, to form subtree pairs. For example, if the subtree most similar to b1 is a2, the subtree most similar to b2 is a1, and the subtree most similar to b3 is a3, then b1 and a2 form a subtree pair, b2 and a1 form a subtree pair, and b3 and a3 form a subtree pair. It should be noted that different subtrees of the second page may have the same or different most-similar subtrees in the first page. For example, if the subtree most similar to b1 is a2, the subtree most similar to b2 is also a2, and the subtree most similar to b3 is a1, then b1 and a2 form a subtree pair, b2 and a2 form a subtree pair, and b3 and a1 form a subtree pair.
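The pairing in step 203 can be sketched as follows. This is a minimal illustration under stated assumptions: each subtree is reduced to a list of node names, and Jaccard overlap stands in for the subtree similarity measure, which the text does not fix at this point. The names `jaccard` and `pair_subtrees` and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Assumed stand-in similarity: overlap of the node-name sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def pair_subtrees(first, second, similarity=jaccard):
    """For each subtree of the second page, pick the most similar subtree
    of the first page; several second-page subtrees may share one partner."""
    pairs = []
    for b_name, b_nodes in second.items():
        a_name = max(first, key=lambda name: similarity(first[name], b_nodes))
        pairs.append((a_name, b_name))
    return pairs


first_page = {"a1": ["ul", "li", "a"], "a2": ["div", "h1", "p"], "a3": ["div", "span"]}
second_page = {"b1": ["div", "h1", "p", "p"], "b2": ["ul", "li"], "b3": ["span", "div", "em"]}
pairs = pair_subtrees(first_page, second_page)
```

With this sample data the pairing reproduces the first case described above: b1 with a2, b2 with a1, and b3 with a3.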
Step 204: obtain the similarity value of the two subtrees in each subtree pair, and obtain the preset weight of the subtree in the pair that belongs to the first page.
In this embodiment, the similarity of the two subtrees can be calculated by iteratively re-entering step 202. Specifically, when the similarity value of the two subtrees in a subtree pair is being obtained, the method can iteratively enter step 202 and continue to calculate the total similarity value of the two subtrees with the scheme of this embodiment, until the leaf nodes of the tree structure are finally reached. The similarity of the leaf nodes, such as their content similarity, is compared to obtain the similarity value between leaf nodes. The iteration then returns to the parent-node subtrees one level above the leaf nodes; after the total similarity value between the parent-node subtrees is calculated, it continues to return upward one level at a time until the similarity value of the two subtrees in the subtree pair is obtained, as shown in Fig. 5.
The preset weight of a subtree of the first page can be understood as the importance to the user of each positional structure in the page. It is an inherent attribute of the page, and the preset weights of different positional structures may differ; for example, the preset weight of the subtree corresponding to the page body is higher than that of the subtree corresponding to the page sidebar. In this embodiment, the preset weight can be obtained from the node path information list or from its corresponding page information.
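The iterative calculation in step 204 can be sketched as a recursion. This is only one plausible reading, with several assumptions labeled here and in the comments: a tree is a `(tag, children)` tuple, leaf similarity is a plain tag match rather than a content comparison, and all child subtrees are weighted equally instead of using preset weights. The name `tree_similarity` is hypothetical.

```python
def tree_similarity(first, second):
    """Recursive similarity of two (tag, children) trees, in the range 0..1."""
    tag_a, children_a = first
    tag_b, children_b = second
    # Root check, mirroring step 202: different roots mean no similarity.
    if tag_a != tag_b:
        return 0.0
    # Leaf nodes: assumed to be fully similar when the tags match.
    if not children_a and not children_b:
        return 1.0
    if not children_a or not children_b:
        return 0.0
    # For each child of the second tree, take its best match in the first tree.
    best = [max(tree_similarity(a, b) for a in children_a) for b in children_b]
    # Assumption: equal weights, normalized by the larger child count.
    return sum(best) / max(len(children_a), len(children_b))


t1 = ("div", [("p", []), ("span", [])])
t2 = ("div", [("p", []), ("a", [])])
score = tree_similarity(t1, t2)
```

The recursion bottoms out at leaf nodes and the results propagate back up one level at a time, which is the shape of the iteration described for Fig. 5.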
Step 205: based on the similarity values of the subtree pairs and the preset weights, obtain the total similarity value between the first page and the second page.
Specifically, in this embodiment the similarity value of each subtree pair can be multiplied by the preset weight of the subtree in that pair that belongs to the first page, and the products summed, to obtain the total similarity value of the first page and the second page.
For example, suppose the preset weights of subtrees a1, a2 and a3 of the first page are 0.3, 0.2 and 0.1 respectively. If b1 and a2 form subtree pair c1, b2 and a1 form subtree pair c2, and b3 and a3 form subtree pair c3, then the total similarity value is: similarity value of c1 * 0.2 + similarity value of c2 * 0.3 + similarity value of c3 * 0.1. If b1 and a2 form subtree pair d1, b2 and a2 form subtree pair d2, and b3 and a1 form subtree pair d3, then the total similarity value is: similarity value of d1 * 0.2 + similarity value of d2 * 0.2 + similarity value of d3 * 0.3.
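The weighted sum in step 205 can be worked through in a few lines. The preset weights below are the ones from the example (0.3, 0.2 and 0.1 for a1, a2 and a3); the pair similarity values 0.9, 0.8 and 0.5 are made up for illustration, and the name `total_similarity` is hypothetical.

```python
def total_similarity(pair_similarity, preset_weight):
    """Sum of pair similarity times the weight of the first-page subtree.

    pair_similarity maps (first_page_subtree, second_page_subtree) to a value.
    """
    return sum(sim * preset_weight[a] for (a, _b), sim in pair_similarity.items())


preset_weight = {"a1": 0.3, "a2": 0.2, "a3": 0.1}

# Pairs c1 = (a2, b1), c2 = (a1, b2), c3 = (a3, b3) with made-up similarity values.
pair_similarity = {("a2", "b1"): 0.9, ("a1", "b2"): 0.8, ("a3", "b3"): 0.5}

total = total_similarity(pair_similarity, preset_weight)
# 0.9 * 0.2 + 0.8 * 0.3 + 0.5 * 0.1 = 0.47 (up to floating-point rounding)
```

Because the weight is looked up by the first-page subtree, a subtree such as a2 that appears in two pairs simply contributes its weight twice.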
Step 206: judge whether the total similarity value is higher than the preset threshold; if so, execute step 207, otherwise execute step 208.
The preset threshold can be set according to requirements such as the precision or the efficiency required of information extraction. For example, the higher the preset threshold, the higher the precision of information extraction; the lower the preset threshold, the higher the efficiency of information extraction. Users can set the preset threshold freely according to their own needs, which gives them a more flexible information extraction service.
Step 207: determine that the first page and the second page are similar pages.
As can be seen, in this embodiment similarity is compared on the tree structures of the first page and the second page, so that when the total similarity value is higher than the preset threshold the two pages can be determined to be similar pages.
Step 208: determine that the first page and the second page are not similar pages.
It should be noted that if the comparison in step 202 finds that the tree structure root nodes of the two pages are not the same, step 208 can also be executed, as shown in Fig. 2. In other words, this embodiment first judges whether the tree structure root nodes of the two pages are the same. If they are, the total similarity value calculated from the subtrees is then used to determine whether the pages are similar pages; if the root nodes are not the same, the two pages can be directly determined not to be similar pages.
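The control flow of steps 202 and 206 to 208 can be condensed into one small function. This is a sketch only: the threshold value 0.6 is an arbitrary illustrative default, and the subtree computation is passed in as a callable (hypothetical name `compute_total_similarity`) so that it is skipped entirely when the root nodes differ.

```python
def are_similar_pages(root_a, root_b, compute_total_similarity, threshold=0.6):
    """Return True when two pages count as similar pages."""
    # Step 202: different root nodes lead straight to step 208.
    if root_a != root_b:
        return False
    # Steps 203-206: only now pay for the subtree comparison.
    return compute_total_similarity() > threshold


same_root_high = are_similar_pages("html", "html", lambda: 0.8)
same_root_low = are_similar_pages("html", "html", lambda: 0.47)
```

Raising `threshold` makes the grouping stricter, and lowering it groups more pages together, which matches the precision and efficiency trade-off described for the preset threshold.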
In one implementation, the page similarity comparison shown in Fig. 2 compares the similarity of two pages in terms of page structure, and this embodiment may additionally compare page content. Accordingly, while the structure of the first page and the second page is compared, the page content categories of the two pages can also be compared. This embodiment therefore also needs to obtain the page content of each of the at least one page, and compares both page content category and structure between any two pages of the at least one page to obtain similar pages. Similar pages in this case are two pages that are identical in both structure and content, or whose similarity value in both is higher than a certain threshold.
Specifically, in this embodiment the structure of the pages can be compared with the scheme shown in Fig. 2, and the page content category of each page can be determined by analyzing content such as the page title or latent topics, so as to compare content and obtain a similarity value for the content category. Further, after the content category and the structure of the pages have been compared, the page similarity value between the pages can be calculated from the weights assigned to content category and structure. For example, the weight of the page content category is 0.5 and the weight of the structure is 0.5 (or the weight of the page content category is 0.4 and the weight of the structure is 0.6, and so on). The similarity value for the page content category and the similarity value for the structure are each multiplied by their respective weights and then summed to obtain the page similarity value, which determines whether the pages are similar pages.
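Combining the two similarity values is a plain weighted sum, sketched below. The weights 0.4 and 0.6 come from the example in the text; the input similarity values 0.8 and 0.6 are made up, and the name `page_similarity` is hypothetical.

```python
def page_similarity(content_sim, structure_sim, content_weight=0.5, structure_weight=0.5):
    """Weighted combination of content-category and structure similarity."""
    return content_sim * content_weight + structure_sim * structure_weight


# 0.8 * 0.4 + 0.6 * 0.6 = 0.68 (up to floating-point rounding)
combined = page_similarity(0.8, 0.6, content_weight=0.4, structure_weight=0.6)
```

The sketch assumes the two weights sum to 1 so that the result stays in the same range as its inputs; the text does not state this explicitly, although both of its examples satisfy it.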
In one implementation, when step 103 of this embodiment sets a label for the node path information lists of the similar pages, it can be carried out in the following way, as shown in Fig. 6:
Step 601: determine the target content to be extracted according to the node path information lists of the similar pages.
Specifically, in this embodiment content such as folder names, file names, file attributes and file types in the node path information list (xpath) can be identified, to determine the information that may need to be extracted as the target content.
Step 602: based on the target content, set the label of the node path information lists of the similar pages.
Specifically, in this embodiment the above target content can be classified, and a suitable label is then determined from the classification result and set on the node path information list. For example, take the node path information list "China/Henan/Zhengzhou/High-tech Zone" and the node path information list "China/Henan/Zhengzhou/Convention and Exhibition Center" of two similar pages. The various kinds of information in the lists, such as folder names, file names, file attributes and file types, are identified and determined as the target content; the target content is then divided into categories, and "China Zhengzhou" can be set as the label of the node path information lists of these two similar pages.
In one implementation, when step 104 generates the page extraction template, this embodiment can first generate, for the node path information lists with the same label, the feature lexicon of the page extraction template under that label. The feature lexicon can contain the strong feature words associated with the label and their feature attributes, and can also contain synonyms of the strong feature words. Specifically, in this embodiment the feature lexicon can be obtained in the following way, as shown in Fig. 7:
Step 701: for the node path information lists with the same label, extract from those lists the strong feature words that are associated with the label, together with their feature attributes.
A strong feature word can be a word such as a file name or folder name in the node path information list. It can be a word that contains the label of the node path information it appears in, a word whose similarity to the label reaches a certain threshold, or a word that is associated with the label in content, concept or meaning. Accordingly, the feature attribute of a strong feature word can be the file type attribute corresponding to the file name or folder name, such as a class attribute or a css attribute. In this embodiment, the strong feature words and their feature attributes are extracted by parsing the content of the node path information lists.
Step 702: based on the strong feature words and their feature attributes, generate the feature lexicon of the page extraction template.
In this embodiment, the strong feature words and their feature attributes can be classified and integrated into a word set, which serves as the feature lexicon of the page extraction template to be generated later.
Step 703: parse the meaning of the strong feature words to obtain their synonyms, and add the synonyms to the feature lexicon.
For example, the strong feature word "typhoon" has synonyms such as "severe typhoon" and "tropical storm". These synonyms are added to the feature lexicon, and within the feature lexicon each strong feature word is linked to its synonyms.
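One way to picture the feature lexicon of steps 701 to 703 is a mapping from each strong feature word to its feature attribute and synonyms. The structure below is an assumption for illustration, not a format given in the text; the names `build_lexicon` and `lookup` are hypothetical, and the synonym table is supplied by hand rather than derived by parsing word meaning.

```python
def build_lexicon(strong_words, synonyms):
    """strong_words: iterable of (word, feature_attribute) pairs.
    synonyms: mapping from a strong feature word to its synonyms."""
    lexicon = {}
    for word, attribute in strong_words:
        lexicon[word] = {
            "attribute": attribute,
            "synonyms": list(synonyms.get(word, [])),
        }
    return lexicon


def lookup(lexicon, term):
    """Return the strong feature word that a term maps to, or None."""
    for word, entry in lexicon.items():
        if term == word or term in entry["synonyms"]:
            return word
    return None


lexicon = build_lexicon(
    [("typhoon", "class")],
    {"typhoon": ["severe typhoon", "tropical storm"]},
)
matched = lookup(lexicon, "tropical storm")
```

Keeping the synonyms under their strong feature word preserves the correspondence between the two that the lexicon is described as holding.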
Based on implementation above, step 104 can be specifically accomplished by the following way in the present embodiment, as shown in Figure 8:
Step 801: the node path information list with same label is merged.
Specifically, can be first in the node path information list with same label, for extracting in the present embodiment
To the node path information list of strong Feature Words, preset label symbol, such as asterisk wildcard " * " are set, to mark the node
Strong Feature Words have been extracted in routing information list, and other do not extract the node path information list of strong Feature Words not
Setting flag symbol;
Later, between node path information list successively according in node path information list node order carry out one
One compare, obtain comparison result, the comparison result can characterize each node in node path information list whether correspond to it is identical,
There is which node difference, etc.;
Later, it is based on the above comparison result, node is compared into identical node path information list and merges into one
Node path information list, such as retain one of them, delete another node path information list;And for there are one not
With node node path information list between also merge into a node path information list, and by wherein that is different
Node is substituted with label symbol, for example, one of them is deleted, by that section in another node path information list with deletion
There are that different nodes to be substituted with label symbol in point routing information list;Certainly, for there are two or more differences
The node path information list of node then think to be different, cannot function as similar merging processing.
For example, in this embodiment, the node path information lists of the two pages that are compared node by node are, as far as possible, similar or identical in node depth, for example both having 5 levels of nodes or both having 3 levels. The nodes separated by "/" in each xpath are then compared one by one. If all nodes are identical, the two xpaths are merged into one. If only one node differs, the two xpaths are also merged into one, and the differing node in the merged xpath is replaced with a label symbol such as the wildcard "*". If several nodes differ, the xpaths are not considered mergeable and are left unprocessed.
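The node-by-node merge described above can be sketched as follows. This is only an illustrative sketch: the function names `merge_pair` and `merge_xpaths` are hypothetical, the paths are assumed to be absolute xpaths whose steps contain no "/" inside predicates, and paths of different depth are simply not merged.

```python
from typing import List, Optional

WILDCARD = "*"  # label symbol that replaces the single differing node


def merge_pair(a: str, b: str) -> Optional[str]:
    """Merge two xpaths of equal depth that differ in at most one node."""
    nodes_a = a.strip("/").split("/")
    nodes_b = b.strip("/").split("/")
    if len(nodes_a) != len(nodes_b):
        return None  # different depth: not merged in this sketch
    diff = [i for i, (x, y) in enumerate(zip(nodes_a, nodes_b)) if x != y]
    if len(diff) > 1:
        return None  # two or more differing nodes: treated as different paths
    merged = list(nodes_a)
    if diff:
        merged[diff[0]] = WILDCARD
    return "/" + "/".join(merged)


def merge_xpaths(paths: List[str]) -> List[str]:
    """Fold each path into the first already-kept path it can be merged with."""
    kept: List[str] = []
    for path in paths:
        for i, existing in enumerate(kept):
            merged = merge_pair(existing, path)
            if merged is not None:
                kept[i] = merged
                break
        else:
            kept.append(path)
    return kept
```

With this sketch, two identical paths collapse into one, and `/html/body/div/span` and `/html/body/p/span` collapse into `/html/body/*/span`.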
In addition, to reduce the amount of computation, when the node path information lists with the same label are merged in this embodiment, the lists may first be simplified before they are compared node by node in node order; for example, the label symbol may be used to simplify the node path information lists.
Specifically, in this embodiment, the node path information lists may be simplified as follows:
For the node path information of a node path information list in which a label symbol is set, at least the tree-structure node name in the node path information and the information used for filling in the strong feature word are retained.
For the node path information of a node path information list in which no label symbol is set, only the tree-structure node name in the node path information may be retained.
It should be noted that if the tree-structure node corresponding to a piece of node path information contains certain target content, such as table content, the node name and the table index information of the node path information may be retained; if the corresponding tree-structure node does not contain target content, only the node name of the node path information is retained.
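The simplification rules above can be sketched as follows. The set `TABLE_TAGS`, the function names, and the `{word}` slot used to represent "the information for filling in the strong feature word" are assumptions made for illustration; the source does not specify a concrete path syntax.

```python
import re

# Assumed set of table-related node names whose index is kept.
TABLE_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}


def simplify_step(step: str) -> str:
    """Simplify one xpath step according to the three retention rules."""
    m = re.match(r"([\w*-]+)(.*)", step)
    if m is None:
        return step
    name, predicate = m.group(1), m.group(2)
    if name == WILDCARD_NAME:
        return step  # wildcard node: keep the name and the feature-word slot
    if name in TABLE_TAGS:
        index = re.match(r"\[\d+\]", predicate)
        return name + (index.group(0) if index else "")  # keep table index
    return name  # ordinary node: keep the node name only


WILDCARD_NAME = "*"


def simplify_xpath(path: str) -> str:
    """Simplify an absolute xpath whose predicates contain no '/'."""
    return "/" + "/".join(simplify_step(s) for s in path.strip("/").split("/"))
```

For instance, `/html/body/div[2]/table[1]/tr[3]/td[2]/span[1]` is reduced to `/html/body/div/table[1]/tr[3]/td[2]/span` under these assumptions.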
Step 802: Based on the merged node path information lists, generate the page extraction template that matches the label.
In this embodiment, the merged node path information lists may be used directly as the page extraction template. The page extraction template obtained in this embodiment may contain multiple pieces of node path information, which can subsequently be used to extract information from pages. Among them, node path information containing strong feature words has a higher extraction priority than node path information without strong feature words. For example, when page information is extracted later, node path information containing strong feature words is used first; this can be understood as preferentially extracting information using the strong feature words in the feature lexicon.
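One way to represent such a template with its priority ordering is sketched below. The `PathEntry` class, the `build_template` function and the dictionary layout are hypothetical; the source only states that paths containing strong feature words take priority.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PathEntry:
    xpath: str
    has_strong_feature: bool  # True if the path carries a strong feature word


def build_template(label: str, merged_paths: List[PathEntry]) -> Dict[str, object]:
    """Use the merged paths as the template, strong-feature paths first."""
    # sorted() is stable, so paths keep their original order within each group.
    ordered = sorted(merged_paths, key=lambda p: not p.has_strong_feature)
    return {"label": label, "paths": ordered}
```

Extraction code can then walk `template["paths"]` from front to back and naturally try the higher-priority paths first.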
In one implementation, after the page extraction template is obtained, this embodiment may further include the following steps, as shown in Figure 9:
Step 901: In response to a received page extraction request, obtain the target page to be extracted and the extraction label of the target page.
The page extraction request may include the page identifier or the page address of the target page to be extracted, so as to identify the target page. In this embodiment, the extraction label of the target page may be obtained by parsing information such as the page identifier, the page content or the subject content of the target page.
Step 902: In the target page extraction template corresponding to the extraction label, preferentially use the node path information containing strong feature words to extract page data from the target page, to obtain an extraction result.
For example, in this embodiment, the extraction label is used to find, among the various page extraction templates, the target page extraction template whose label is the same as the extraction label, and that template is then used to extract page data from the target page. Specifically, page data may first be extracted using the node path information containing strong feature words. If no suitable information is extracted, all or some of the strong feature words in the feature lexicon of the target page extraction template are used for extraction. If still no suitable information is extracted, the synonyms of the strong feature words in the feature lexicon may be used. If that also fails, the node path information without strong feature words is used last, and the extraction result is finally obtained.
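The four-stage fallback described above can be sketched as follows. The template layout, the `{word}` slot in path templates and the `apply_path` callback (which stands in for a real xpath engine) are assumptions for illustration only.

```python
from typing import Callable, Dict, Optional, Tuple


def extract_with_fallback(
    page: Dict[str, str],
    template: Dict[str, list],
    apply_path: Callable[[Dict[str, str], str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Try each stage in priority order and return (stage, value) on success."""
    feature_paths = template["feature_paths"]  # list of (path_template, word)
    stages = (
        # 1. paths with the strong feature word they were extracted with
        ("feature_path", [t.format(word=w) for t, w in feature_paths]),
        # 2. the same paths filled with other strong feature words of the lexicon
        ("lexicon_word", [t.format(word=w) for t, _ in feature_paths
                          for w in template["strong_words"]]),
        # 3. the same paths filled with synonyms from the lexicon
        ("synonym", [t.format(word=w) for t, _ in feature_paths
                     for w in template["synonyms"]]),
        # 4. paths that carry no strong feature word
        ("plain_path", list(template["plain_paths"])),
    )
    for stage, candidates in stages:
        for xpath in candidates:
            value = apply_path(page, xpath)
            if value:
                return stage, value
    return None  # nothing suitable was extracted
```

Returning the stage name alongside the value is a design choice of this sketch; it makes it easy to see which level of the fallback produced a result.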
In addition, when extraction using strong feature words yields no suitable information, the characteristic attributes of the strong feature words may also be used for learning and training; for example, extraction training is performed on information such as file names and folder names under a certain file attribute, such as the class attribute, so that the corresponding information can be extracted from the page.
If suitable information is still not extracted in the end, the target page and its extraction label may be marked as default in this embodiment, and the node path information list of the target page is obtained to regenerate a corresponding page extraction template or to merge it into an already generated page extraction template, so that information can be extracted from this target page or from further pages.
It should be noted that, after the extraction result is obtained in this embodiment, redundant information may be present in the extraction result because of data alignment. Therefore, after the extraction result is obtained, it may be further cleaned in this embodiment, for example by redundancy processing that deletes duplicate data, to obtain a more accurate extraction result.
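A minimal sketch of the cleaning step is given below. The function name `clean_results` is hypothetical, and treating values that differ only in whitespace as duplicates is an assumption of this sketch rather than something the source specifies.

```python
from typing import Iterable, List


def clean_results(values: Iterable[str]) -> List[str]:
    """Drop empty and duplicate values while keeping first-seen order."""
    seen = set()
    cleaned: List[str] = []
    for value in values:
        key = " ".join(value.split())  # normalize runs of whitespace
        if key and key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned
```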
Referring to Figure 10, which is a structural schematic diagram of a data processing device provided in Embodiment 2 of this application, the device is suitable for building page extraction templates for pages of different websites, that is, websites with different structures or content, and in turn for extracting page information. The device in this embodiment may run on a computer or server with computing capability.
Specifically, the device in this embodiment may include the following structure:
Page parsing unit 1001, configured to parse at least one page to obtain the node path information list corresponding to each page.
The pages in this embodiment may include pages on a single website or pages on multiple websites; pages on different websites may be identical (or similar) or different in structure and content. For example, pages on shopping websites, news websites and advertising websites differ in structure and content.
It should be noted that, before the pages are parsed, they of course need to be acquired in this embodiment, for example by reading them from a database or by crawling them in real time from websites with tools such as web crawlers.
In this embodiment, the node path information list obtained by parsing a page can be understood as the xpath (XML Path Language) list of the page, where the xpath list uses path expressions to characterize the page structure and the structure content.
Specifically, in this embodiment, the tree structure of the page may be constructed and then parsed to generate the xpath list. For example, a third-party library (such as lxml) is used to parse the HyperText Markup Language (HTML) pages on one or more websites and build a Document Object Model (DOM) tree, and the DOM tree is then parsed to form the complete xpath list of each page.
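As an illustration of turning a parsed tree into an xpath list, the sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for the third-party parser mentioned above, so it assumes well-formed markup. The function name `xpath_list` and the convention of adding a positional index only when a node has same-named siblings are choices of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List


def xpath_list(markup: str) -> List[str]:
    """Return one absolute xpath per element, in document order."""
    root = ET.fromstring(markup)
    paths: List[str] = []

    def walk(node: ET.Element, prefix: str) -> None:
        paths.append(prefix)
        totals = Counter(child.tag for child in node)
        seen: Counter = Counter()
        for child in node:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:  # index only when siblings share the tag
                step += f"[{seen[child.tag]}]"
            walk(child, f"{prefix}/{step}")

    walk(root, "/" + root.tag)
    return paths
```

Real HTML is often not well-formed, which is why a tolerant HTML parser would be used in practice instead of a strict XML parser.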
Similarity comparison unit 1002, configured to perform structure alignment between the pages of the at least one page based on the node path information lists, to obtain similar pages.
In this embodiment, structure alignment may be performed on any two pages of the at least one page, to determine which pages are similar pages and which are not.
It should be noted that similar pages in this embodiment can be understood as pages whose mutual similarity value is greater than a certain threshold. The similarity value between pages may be a structural similarity value and/or a content similarity value; that is, two pages being similar means that they are similar in structure and/or content.
Label setting unit 1003, configured to set labels for the node path information lists of the similar pages.
The label in this embodiment may be a character extracted from the page content, for example a keyword in the page content; or the label may be a character associated with a character in the page content, for example a near-synonym of a keyword in the page content, and so on.
Setting labels means at least the following: the pages corresponding to node path information lists under the same label are similar pages, and the pages corresponding to node path information lists under different labels are not similar pages.
Specifically, in this embodiment, the node path information lists of the similar pages may be classified according to the name and type of their information content, labels are then generated based on the classification result, and the labels are set on the node path information lists of the similar pages.
Template generation unit 1004, configured to generate, based on the node path information lists with the same label, the page extraction template that matches the label.
In this embodiment, the page extraction template may be generated by processing the node path information lists with the same label. For example, among the node path information lists with the same label, the list with the highest similarity to the other lists is selected as the page extraction template; or the node path information lists with the same label are combined or integrated to generate the page extraction template matching the label, and so on.
As can be seen from the above scheme, the data processing device provided in Embodiment 2 of this application parses the node path information lists of various pages, classifies these different pages based on the similarity between their node path information lists, sets labels for similar pages, and then generates the page extraction template under each label, so that information can be extracted from the corresponding pages. Thus, in this embodiment, web pages of different websites are classified by similarity, and the corresponding page extraction templates are generated after web pages with similar structure and content are obtained, so as to extract data from web pages of different websites. This extraction scheme is applicable to extracting page information from different websites and is not limited to website pages of a particular structure or content, which improves the general applicability of information extraction.
Based on the above implementation, the following takes template building and extraction for structured data as an example, as shown in Figure 11:
Step 1101. Compare the structural similarity and the content of the DOM trees generated from the HTML pages, and classify web pages that are of the same type and similar in structure.
Specifically, in this embodiment, a third-party library (such as lxml) may be used to parse the HTML page and build a DOM tree, and the DOM tree is parsed to form the complete XPath list of the page. An example of the resulting tree-shaped XPath list is shown in Figure 12, in which strong feature words and characteristic attributes are not shown; only nodes and node positions are shown. In this embodiment, pages are first classified by content according to the page title or hidden subject, to determine pages that are similar in content. To judge the structural similarity of two or more web pages, the root nodes of the two web page trees are compared: if they differ, the similarity is 0 and the calculation stops; if they are the same, the calculation continues. For each subtree, the subtree with the greatest similarity is chosen from the other tree's subtree set as its matching object, and that similarity is the subtree's similarity reference value. Using the nodes of each subtree as the weight, the overall reference value over all subtrees is calculated to obtain the overall similarity of the two trees. If the similarity threshold is met, the two web pages are determined to be structurally similar. This judgment scheme is suitable for web pages of different websites; web pages of the same website may be judged by regular expression matching.
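The structural similarity calculation can be sketched as below. Several details are assumptions of this sketch rather than statements of the source: trees are represented as `(tag, children)` tuples, the weight of a subtree is taken to be its node count, the matching root itself counts as one node, and the default threshold of 0.8 is arbitrary.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (tag, list of child trees)


def size(tree: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a: Tree, b: Tree) -> float:
    """Similarity of tree a to tree b, in the range 0.0 to 1.0."""
    if a[0] != b[0]:
        return 0.0  # different root nodes: similarity is 0, stop here
    score = 1.0  # the matching root counts as one node
    for child in a[1]:
        # best-matching subtree of b gives this child's reference value
        best = max((similarity(child, other) for other in b[1]), default=0.0)
        score += size(child) * best  # weight by the child's node count
    return score / size(a)


def is_similar(a: Tree, b: Tree, threshold: float = 0.8) -> bool:
    return similarity(a, b) > threshold
```

Note that this measure is computed relative to the first tree and is therefore not symmetric; a caller that needs symmetry could, for example, take the smaller of the two directions.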
Step 1102. Parse the content of web pages that are of the same type and similar in structure, and organize and label the parsed XPath paths according to the names and types of the data to be extracted.
For example, in this embodiment, HTML pages whose content is at least partly of the same type and whose structure is similar are taken as the page samples of one domain. According to the page content, the information that may need to be extracted is determined and the corresponding XPath paths are parsed; the data to be extracted are then classified and labeled according to name and type, and the corresponding XPath paths are unified under the same label. So that extraction results can later be checked against it, a label may retain part of the original annotated data, or the data features, that were extracted.
Step 1103. For the XPath paths with the same label, extract the strong feature words and characteristic attributes in the paths that are close to the label, weaken the nodes where these features are located, and replace them with wildcards.
In this embodiment, when the XPath paths are parsed in the preceding step 1102, for table content or content that has a corresponding title or type on the page, the XPath paths that can be reached through that title are sought first. The XPath paths under the label are traversed node by node; when a node contains content, such as text, that is close to the label, or an attribute that represents the type of the extraction object, the current node and the feature word position are matched with the "*" wildcard, and the corresponding text or attribute is recorded as a strong feature word and its characteristic attribute.
Step 1104. Organize and summarize the strong feature words and characteristic attributes to form the feature lexicon, analyze word meaning and part of speech, and obtain other possible synonyms to add to the lexicon as alternatives.
In this embodiment, the strong feature words found during traversal are summarized to obtain a preliminary feature lexicon. Because the current extraction objects are mainly Chinese or English web pages, Chinese or English near-synonym toolkits are used to find near-synonyms with high similarity, which are added to the lexicon as alternatives. When a strong feature word is a phrase or word combination, it first needs to be segmented; the words whose part of speech is noun are selected after segmentation, near-synonyms with high similarity are found for them, and these are added to the lexicon as alternatives. If a synonym that is found already exists in the feature lexicon, it is not added again; if it is only contained within some phrase in the lexicon, it is still added.
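The lexicon-building rule in the last two sentences can be sketched as follows. The function name `build_lexicon` and the `find_synonyms` callback are hypothetical; the callback stands in for whatever near-synonym toolkit is used, and word segmentation and part-of-speech filtering are left out of this sketch.

```python
from typing import Callable, Iterable, List


def build_lexicon(
    strong_words: Iterable[str],
    find_synonyms: Callable[[str], Iterable[str]],
) -> List[str]:
    """Collect strong feature words, then append synonyms not yet present."""
    lexicon: List[str] = []
    for word in strong_words:
        if word not in lexicon:
            lexicon.append(word)
    for word in list(lexicon):  # iterate over the original words only
        for synonym in find_synonyms(word):
            # Only an exact existing entry blocks the addition; a synonym that
            # merely appears inside a longer phrase is still added.
            if synonym not in lexicon:
                lexicon.append(synonym)
    return lexicon
```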
Step 1105. Based on the processing in step 1103, simplify and merge the XPath paths with the same label according to the designed comparison table, give priority to the paths containing strong feature words and characteristic attributes, treat the other alternative paths as lower priority, and save the result as the template under the label.
In this embodiment, the XPath paths are simplified first. Specifically, for a node not replaced with a wildcard, only the node name is retained, while table-related nodes are retained up to the index part; for a node replaced with a wildcard, the node name and the part that can be used for filling in strong feature words are retained. After all nodes have been reduced, the preliminarily simplified XPath paths are obtained.
Then, the XPath paths with the same label are merged. Specifically, the paths under the same label are compared node by node in order and identical entries are merged; if two or more XPath expressions differ in only one node, they are likewise regarded as the same expression, and the single differing node is replaced with a wildcard.
Step 1106. Select the template under the label to be extracted to extract from similar web pages, preferring paths that contain strong feature words; if nothing matches, substitute the synonyms in the feature lexicon; after all synonyms in the lexicon have been tried, start selecting XPath paths without characteristic attributes, and compare the extracted text with the annotated data.
In this embodiment, when information needs to be extracted from pages, for target web pages that are similar in type and contain the information to be extracted, the label to be extracted is determined for the target web page as required, and the page extraction template under that label is selected. The paths containing strong feature words in the template, together with the corresponding feature lexicon, are used first. If nothing matches or no suitable information is extracted, the synonyms in the feature lexicon are substituted and matching is performed again. If no result can be extracted even after all synonyms in the lexicon have been tried, the XPath paths without characteristic attributes are selected for matching, the extracted text is compared with the original annotated data, and reasonable information is retained as the extraction result.
It should be noted that the page extraction template in this embodiment can be used to extract from pages in batches. During batch extraction, for web pages of the same website, if the required content is successfully extracted using a certain XPath path, the XPath path corresponding to the current label is recorded, and these recorded paths are used first in later extraction. It should also be noted that the same page may have multiple data items that can be extracted with the same label; because of data alignment, redundant information may then be extracted, and the extraction results need to be further cleaned.
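Recording successful paths per website and label, and trying them first in later extraction, can be sketched as follows. The `PathCache` class and its method names are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PathCache:
    """Remembers which XPath paths worked for a (site, label) pair."""

    def __init__(self) -> None:
        self._hits: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def record(self, site: str, label: str, xpath: str) -> None:
        """Store a path that successfully extracted the required content."""
        hits = self._hits[(site, label)]
        if xpath not in hits:
            hits.append(xpath)

    def order(self, site: str, label: str, template_paths: List[str]) -> List[str]:
        """Return previously successful paths first, then the remaining ones."""
        hits = self._hits.get((site, label), [])
        return hits + [p for p in template_paths if p not in hits]
```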
Step 1107. If the data to be extracted are not matched, mark them as default, analyze the web page as a whole, and merge it into the template.
Specifically, in batch extraction in this embodiment, if only a small number of labels fail to match the data to be extracted, they are marked as default and extraction continues. The unmatched labels may be complemented with data extracted from websites of the same type, or the corresponding XPath paths that can extract them may be added manually. If most of the labels to be extracted fail to match the corresponding information, the web page needs to be analyzed as a whole and the template generation steps repeated, with the XPath paths so obtained used as seeds.
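The decision between marking a few labels as default and re-analyzing the whole page can be sketched as below. The function name, the returned action strings and the interpretation of "most labels" as more than half (`majority=0.5`) are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def handle_unmatched(results: Dict[str, str], majority: float = 0.5) -> Tuple[str, List[str]]:
    """Decide what to do with labels for which nothing was extracted."""
    missing = [label for label, value in results.items() if not value]
    if not missing:
        return "ok", missing
    if len(missing) / len(results) > majority:
        # most labels failed: analyze the page as a whole and rebuild the template
        return "reanalyze", missing
    # only a few labels failed: mark them as default and continue extraction
    return "mark_default", missing
```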
It can be seen that, in this embodiment, web pages of different websites can be classified by calculating similarity, to obtain web pages with similar structure and content. When the parsed samples are sufficient, the differing parts of most similar web pages can be fine-tuned using conditional expressions, so the templates are more general and have a relatively broad range of application. Meanwhile, customized labels are used in this embodiment to classify feature words and characteristic attributes, which helps to group fields that have the same meaning but many forms of expression. In addition, the feature lexicon in this embodiment is composed of the original words in the paths together with their synonyms, which makes it more universally applicable to homogeneous data on different web pages. Moreover, only the templates of the data to be extracted are saved, and useless data need not be organized.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A data processing method, comprising:
parsing at least one page to obtain a node path information list corresponding to each page;
performing, based on the node path information lists, structure alignment between the pages of the at least one page to obtain similar pages;
setting labels for the node path information lists of the similar pages;
generating, based on the node path information lists with the same label, a page extraction template matching the label.
2. The method according to claim 1, wherein performing structure alignment between the pages of the at least one page based on the node path information lists comprises:
performing the following operations on the node path information lists of two pages of the at least one page:
obtaining, based on the node path information lists of the two pages, the tree-structure root node and the corresponding subtrees of each of a first page and a second page of the two pages;
based on a determination that the tree-structure root nodes of the two pages are identical, determining, among the subtrees of the first page, the subtree with the highest similarity to each subtree of the second page, to form subtree pairs;
obtaining the similarity value of the two subtrees in each subtree pair, and obtaining the preset weight of the subtree in the subtree pair that belongs to the first page;
obtaining a total similarity value between the first page and the second page based on the similarity values of the subtree pairs and the preset weights;
based on a determination that the total similarity value is higher than a preset threshold, determining that the first page and the second page are similar pages.
3. The method according to claim 1 or 2, further comprising:
obtaining the page content of the at least one page;
performing page content category comparison and structure alignment between the pages of the at least one page, to obtain similar pages.
4. The method according to claim 1 or 2, wherein setting labels for the node path information lists of the similar pages comprises:
determining, according to the node path information lists of the similar pages, the target content to be extracted;
setting, based on the target content, the labels of the node path information lists of the similar pages.
5. The method according to claim 1 or 2, further comprising:
for the node path information lists with the same label, extracting from the node path information lists the strong feature words associated with their label and the characteristic attributes of those words;
generating the feature lexicon of the page extraction template based on the strong feature words and their characteristic attributes;
parsing the meaning of the strong feature words to obtain synonyms of the strong feature words;
adding the synonyms to the feature lexicon of the page extraction template.
6. The method according to claim 5, wherein generating, based on the node path information lists with the same label, the page extraction template matching the label comprises:
merging the node path information lists with the same label;
generating, based on the merged node path information lists, the page extraction template matching the label; wherein the generated page extraction template contains multiple pieces of node path information, the node path information is used for information extraction, and the priority of node path information containing strong feature words is higher than that of other node path information.
7. The method according to claim 6, wherein merging the node path information lists with the same label comprises:
among the node path information lists with the same label, setting a preset label symbol for each node path information list from which the strong feature word has been extracted;
comparing the node path information lists with one another node by node, following the node order within the node path information lists, to obtain a comparison result;
based on the comparison result, merging node path information lists whose nodes are all identical into one node path information list, merging node path information lists that differ in one node into one node path information list, and replacing the differing node with the label symbol.
8. The method according to claim 7, wherein before the node path information lists are compared with one another node by node following the node order within the node path information lists, the method further comprises:
simplifying the node path information lists using the label symbol;
specifically:
for the node path information of a node path information list in which the label symbol is set, retaining at least the tree-structure node name in the node path information and the information used for filling in the strong feature word;
for the node path information of a node path information list in which the label symbol is not set, retaining the tree-structure node name in the node path information.
9. The method according to claim 1 or 2, further comprising:
in response to a received page extraction request, obtaining the target page to be extracted and the extraction label of the target page;
in the target page extraction template corresponding to the extraction label, preferentially using the node path information containing strong feature words to extract page data from the target page, to obtain an extraction result;
based on a determination that the corresponding data has not been extracted in the extraction result, obtaining the node path information list of the target page and generating a corresponding page extraction template.
10. A data processing device, comprising:
a page parsing unit, configured to parse at least one page to obtain a node path information list corresponding to each page;
a similarity comparison unit, configured to perform, based on the node path information lists, structure alignment between the pages of the at least one page to obtain similar pages;
a label setting unit, configured to set labels for the node path information lists of the similar pages;
a template generation unit, configured to generate, based on the node path information lists with the same label, a page extraction template matching the label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811073868.9A CN109165373B (en) | 2018-09-14 | 2018-09-14 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165373A (en) | 2019-01-08 |
CN109165373B (en) | 2022-04-22 |
Family
ID=64879429
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811073868.9A Active CN109165373B (en) | 2018-09-14 | 2018-09-14 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165373B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720868B2 (en) * | 2006-11-13 | 2010-05-18 | Microsoft Corporation | Providing assistance with the creation of an XPath expression |
US7765236B2 (en) * | 2007-08-31 | 2010-07-27 | Microsoft Corporation | Extracting data content items using template matching |
CN102890681A (en) * | 2011-07-20 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and system for generating webpage structure template |
CN103136358A (en) * | 2013-03-07 | 2013-06-05 | 宁波成电泰克电子信息技术发展有限公司 | Method for automatically extracting BBS (bulletin board system) data |
CN104572934A (en) * | 2014-12-29 | 2015-04-29 | 西安交通大学 | Webpage key content extracting method based on DOM |
CN105117397A (en) * | 2015-06-18 | 2015-12-02 | 浙江大学 | Method for searching semantic association of medical documents based on ontology |
CN105512245A (en) * | 2015-11-30 | 2016-04-20 | 青岛智能产业技术研究院 | Enterprise figure building method based on regression model |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720868B2 (en) * | 2006-11-13 | 2010-05-18 | Microsoft Corporation | Providing assistance with the creation of an XPath expression |
US7765236B2 (en) * | 2007-08-31 | 2010-07-27 | Microsoft Corporation | Extracting data content items using template matching |
CN102890681A (en) * | 2011-07-20 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Method and system for generating webpage structure template |
CN103136358A (en) * | 2013-03-07 | 2013-06-05 | 宁波成电泰克电子信息技术发展有限公司 | Method for automatically extracting BBS (bulletin board system) data |
CN104572934A (en) * | 2014-12-29 | 2015-04-29 | 西安交通大学 | Webpage key content extracting method based on DOM |
CN105117397A (en) * | 2015-06-18 | 2015-12-02 | 浙江大学 | Method for searching semantic association of medical documents based on ontology |
CN105512245A (en) * | 2015-11-30 | 2016-04-20 | 青岛智能产业技术研究院 | Enterprise figure building method based on regression model |
Non-Patent Citations (2)
Title |
---|
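Piao Yong (朴勇): "Research on XML-Based Text Structure Information Extraction and Clustering", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology * |
Wang Haitao (王海涛): "Research on Key Technologies for Similarity Detection Based on Large-Scale Text Datasets", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology * |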
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826002A (en) * | 2019-10-30 | 2020-02-21 | 腾讯科技(深圳)有限公司 | Information sharing method and device, terminal and storage medium |
CN111522606A (en) * | 2020-04-26 | 2020-08-11 | 广东优特云科技有限公司 | Data processing method, device, equipment and storage medium |
CN111522606B (en) * | 2020-04-26 | 2023-08-04 | 广东优特云科技有限公司 | Data processing method, device, equipment and storage medium |
CN113626028A (en) * | 2020-05-07 | 2021-11-09 | 腾讯科技(深圳)有限公司 | Page element mapping method and device |
CN111966930A (en) * | 2020-08-17 | 2020-11-20 | 山东亿云信息技术有限公司 | Webpage list analyzing method and system based on XPath sequence |
Also Published As
Publication number | Publication date |
---|---|
CN109165373B (en) | 2022-04-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10482384B1 (en) | System for extracting semantic triples for building a knowledge base | |
Embley et al. | Record-boundary discovery in Web documents | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN102254014B (en) | Adaptive information extraction method for webpage characteristics | |
US7519621B2 (en) | Extracting information from Web pages | |
CN109165373A (en) | Data processing method and device | |
US7558792B2 (en) | Automatic extraction of human-readable lists from structured documents | |
US7516397B2 (en) | Methods, apparatus and computer programs for characterizing web resources | |
CN104268148B (en) | Method and system for automatic extraction of forum page information based on time strings | |
CN109325201A (en) | Method, device, equipment and storage medium for generating entity relationship data | |
US7099870B2 (en) | Personalized web page | |
KR20110009098A (en) | Search results ranking using editing distance and document information | |
JP2006004417A (en) | Method and device for recognizing specific type of information file | |
JP2010506247A (en) | Network-based method and apparatus for filtering junk information | |
CN111079043A (en) | Key content positioning method | |
CN110377884A (en) | Document parsing method, device, computer equipment and storage medium | |
Khan et al. | Audio structuring and personalized retrieval using ontologies | |
US20120005207A1 (en) | Method and system for web extraction | |
CN111966940B (en) | Target data positioning method and device based on user request sequence | |
CN108874870A (en) | Data extraction method, device and computer-readable storage medium | |
CN110209765A (en) | Method and apparatus for semantic keyword search | |
US20120150899A1 (en) | System and method for selectively generating tabular data from semi-structured content | |
Liu et al. | An automated algorithm for extracting website skeleton | |
Changuel et al. | A general learning method for automatic title extraction from html pages | |
CN113157857B (en) | Hot topic detection method, device and equipment for news |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||