CN104317948A

CN104317948A - Page data capturing method and system

Info

Publication number: CN104317948A
Application number: CN201410635960.5A
Authority: CN
Inventors: 刘旭辉; 任继成; 高照
Original assignee: BEIJING ZHONGKE FULONG INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING ZHONGKE FULONG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-11-05
Filing date: 2014-11-05
Publication date: 2015-01-28

Abstract

The invention relates to a page data capturing method and a page data capturing system. The method comprises the steps of: S1, analyzing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing the text data in the target page according to the matching template by a capturing unit, and storing the text data as a basis for index operation. According to the page data capturing method and system, the capturing unit can quickly adapt to the pages of various websites, and specific areas and/or data in the target page can be accurately captured.

Description

Page data capturing method and system

Technical field

The present invention relates to the technical field of data processing, and in particular to a page data capturing method and a page data capturing system.

Background art

Obtaining the most useful information in the easiest way has always been people's goal. Simplicity, reliability, and stable performance are therefore the foremost requirements when programmers design information collection and retrieval systems. With the rapid development of global informatization, the Internet has produced an enormous number of web pages, and traditional search engines (Google, Baidu, etc.) emerged to address the problem of retrieving information from this mass of pages. Traditional search engines, however, emphasize breadth of retrieval and struggle to meet users' increasingly personalized and specialized search needs. Vertical retrieval and enterprise-level retrieval for particular purposes, fields, or topics have arisen as a result. A personalized search engine searches only the Internet resources that are specific or that the user finds most interesting, so it can provide a more convenient and efficient retrieval service, and it is gradually becoming an important trend in modern information retrieval.

A search engine generally has three major functions: crawling, which is mainly responsible for capturing web page data and provides the search engine with its source data for retrieval; indexing, which, to improve retrieval efficiency, segments the web page data collected by the crawler into words and builds an inverted file keyed by word; and ranking, which matches the query entered by the user against the index and returns a sequence of results ordered by certain ranking rules. The web pages collected by the crawler are thus the data source of the search engine, and their quality strongly affects retrieval results. The biggest difference between a personalized search engine and a general search engine is the crawler system. A traditional crawler meets the general search needs of a large number of users by covering as many Internet resources as possible; it captures resources in a manner similar to breadth-first traversal of a directed graph, with the emphasis on breadth of collection. A personalized search engine instead aims to capture the most valuable web information with the fewest crawler resources and to filter out as much useless information as possible, so as to give users highly accurate information. Its crawler module is referred to here as a "vertical crawler". "Vertical" is used in contrast to the horizontal search of a general search engine's crawler module, which returns a large amount of information with queries that are neither accurate nor deep enough. The main difference between a vertical crawler and a general crawler is that the vertical crawler performs structured information extraction on web pages, that is, it extracts the unstructured data of a web page into specific structured data. It is characterized by being "specialized, precise, and deep" and has an industry focus; compared with the massive, unordered information of a general search engine, a vertical crawler is more focused, specific, and in-depth.

As stated above, for a personalized search engine to be "specialized, precise, and deep", its crawler module must accurately capture specific columns of all kinds of websites or forums, or even the article lists of a certain topic, and must avoid or minimize off-topic content in the search results, which would harm the user experience. Developing a dedicated crawler for the editing and publishing system of each type of website is one possible solution. However, as the number of Internet sites keeps growing and page content keeps changing, this approach imposes a heavy workload on developers and maintainers, and retrieval efficiency cannot be guaranteed.

Summary of the invention

The technical problem to be solved by the present invention is how to enable a capturing unit to adapt quickly to the pages of various websites and to capture specific areas and/or data in a target page accurately.

To this end, the present invention proposes a page data capturing method, comprising: S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for an index operation.

Preferably, step S1 further comprises: loading the configuration information into a static memory of the capturing unit.

Preferably, step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern.

Preferably, step S2 further comprises: filtering out data in the text data that does not conform to a preset data type.

Preferably, the method further comprises, before step S1: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page by regular expressions; and when the complexity is greater than the preset value, parsing the target page by the jsoup framework.

The present invention also proposes a page data capturing system, comprising: a parsing unit, configured to parse a target page to obtain configuration information of the target page and to generate a matching template according to the configuration information; an obtaining unit, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page; and a capturing unit, configured to capture the text data in the target page according to the matching template and to store the text data as a basis for an index operation.

Preferably, the system further comprises: a loading unit, configured to load the configuration information into a static memory of the capturing unit.

Preferably, the system further comprises: a queue management unit, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern.

Preferably, the system further comprises: a filtering unit, configured to filter out data in the text data that does not conform to a preset data type.

Preferably, the system further comprises: a judging unit, configured to judge the complexity of the target page, wherein the parsing unit parses the target page by regular expressions when the complexity is less than or equal to a preset value, and parses the target page by the jsoup framework when the complexity is greater than the preset value.

With the above technical solution, the present invention parses the target page and tailors a matching template to it, so that the capturing unit can accurately capture areas and/or data in the target page according to the configuration information of the template. If the website or web page changes, only the template needs to be modified, not the capturing unit, and modifying the template takes less work than modifying the capturing unit. This both improves work efficiency and shortens the cycle in which the capturing unit captures data, so that the content the capturing unit provides for indexing is updated in time.

Brief description of the drawings

The features and advantages of the present invention can be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:

Fig. 1 shows a schematic flow diagram of a page data capturing method according to an embodiment of the present invention;

Fig. 2 shows a schematic block diagram of a page data capturing method according to an embodiment of the present invention;

Fig. 3 shows a schematic diagram of source code splitting according to an embodiment of the present invention;

Fig. 4 shows a schematic flow diagram of a page data capturing method according to another embodiment of the present invention;

Fig. 5 shows a detailed flow chart of the capture operation according to another embodiment of the present invention.

Detailed description

In order that the above objects, features, and advantages of the present invention may be understood more clearly, the invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

Many specific details are set forth in the following description to provide a full understanding of the present invention. However, the invention may also be implemented in ways other than those described here, and its scope of protection is therefore not limited by the specific embodiments disclosed below.

As shown in Fig. 1, a page data capturing method according to an embodiment of the present invention comprises: S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for an index operation.

By parsing the target page (for example, a web page), the configuration information of the target page can be obtained and stored in an XML file in the form of labels, so that the capturing unit can analyze it. In one embodiment of the present invention, the configuration information stored in the XML file is as shown in Table 1:

Table 1

The obtained configuration information comprises at least the address information of the page (for example, a URL link), text information, and structural information (for example, column links and paging link lists), and the matching template can be generated according to the structural information. The address information of the target page is then obtained from the configuration information, the target page is determined according to the address information, and the text data in the target page is obtained. Finally, the capturing unit captures the target data from the text data according to the matching template. Preferably, a queue can be set up for the capture operations of the capturing unit to provide first-in-first-out flow control for operations such as parsing, storing, and capturing, which prevents data from being omitted or stored repeatedly and keeps the data uniform.
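The generation of a matching template from XML-stored configuration information can be sketched as follows. Because the content of Table 1 is not reproduced here, the element and attribute names (`page`, `code`, `url`, `columnLink`, `pagingLink`) and the class `MatchingTemplate` are hypothetical, chosen only to mirror the address and structural information named above; this is a minimal sketch, not the layout used by the invention.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical matching template: address information plus structural information.
final class MatchingTemplate {
    final String code;        // assumed identifier of the configured page
    final String url;         // address information (URL link)
    final String columnLink;  // assumed locator of the column links
    final String pagingLink;  // assumed locator of the paging link list

    MatchingTemplate(String code, String url, String columnLink, String pagingLink) {
        this.code = code;
        this.url = url;
        this.columnLink = columnLink;
        this.pagingLink = pagingLink;
    }

    // Builds a template from one <page> element of the assumed XML layout.
    static MatchingTemplate fromXml(String xml) {
        try {
            Element page = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return new MatchingTemplate(
                    page.getAttribute("code"),
                    childText(page, "url"),
                    childText(page, "columnLink"),
                    childText(page, "pagingLink"));
        } catch (Exception e) {
            // Malformed XML or a missing element ends up here.
            throw new IllegalArgumentException("Invalid page configuration", e);
        }
    }

    // Returns the trimmed text of the first descendant element with the given tag.
    private static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

Keeping the template as plain data separates what changes when a site is redesigned (the XML file) from the capturing code, which is the maintenance benefit the description relies on.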

The text data captured by the capturing unit (for example, title, publication time, and body content) can be stored in a database (the data may be organized as rows of a database table or in local files). Because the capturing unit captures data according to the matching template, it can capture data from any page that conforms to the template, and the program in the capturing unit does not need to be modified for different pages (or even different columns). For example, the economy channel of a website may have several sub-channels whose structural information is identical (or similar). A matching template can then be generated from the structural information of one of them, and the capturing unit can capture the text data in every sub-channel according to this template without being modified for each sub-channel. One capturing unit can thus serve a whole website or even several websites, which saves the storage space the system would otherwise reserve for multiple capturing units.

Capturing with a matching template also significantly shortens the program development cycle and improves the efficiency of developers. Data captured according to the matching template better meets requirements and provides a more accurate basis for indexing. When the target page changes, only the template needs to be changed, not the capturing unit, which reduces maintenance cost.

Preferably, step S1 further comprises: loading the configuration information into the static memory of the capturing unit. The configuration information in Table 1 can be loaded into the static memory of the capturing unit, for example stored in the form Map<code, PageStruct>, rather than kept in a database. With this storage mode, the capturing unit only needs to fetch the web page configuration information from the static memory of the container when it frequently matches and captures page data, and does not need to read it from the database again. This improves capture efficiency and reduces the I/O pressure on the database.
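A static in-memory store of the Map<code, PageStruct> kind might look like the sketch below. Only the map shape comes from the description; the fields of `PageStruct`, the class name `TemplateCache`, and the use of a `ConcurrentHashMap` (so that several capture threads can read safely) are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical page-configuration record; the field set is an assumption.
final class PageStruct {
    final String url;
    final String listSelector;

    PageStruct(String url, String listSelector) {
        this.url = url;
        this.listSelector = listSelector;
    }
}

// Static store keyed by page code, mirroring Map<code, PageStruct>.
final class TemplateCache {
    private static final Map<String, PageStruct> CACHE = new ConcurrentHashMap<>();

    private TemplateCache() {}

    // Called at start-up, after the XML configuration has been parsed.
    static void load(String code, PageStruct struct) {
        CACHE.put(code, struct);
    }

    // Looked up during capture; returns null for an unknown code.
    static PageStruct get(String code) {
        return CACHE.get(code);
    }
}
```

Every lookup during matching and capture is then a map read in process memory, which is where the reduced database I/O described above would come from.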

Preferably, step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern. According to the configuration information, the link addresses of the news labels in the designated area of the page, for example the href link addresses corresponding to captureType in the label descriptions of Table 1, can be stored in a queue managed by the singleton pattern. They can be inserted from the tail of the queue, which ensures that each capture task has only one queue instance and that multiple threads can work synchronously when links are taken from the head of the queue for content parsing. If the configuration contains a paging flag, or a subsequent page number can be extracted, the corresponding address information continues to be inserted into the queue. The specific queue processing flow during capture is shown in Fig. 5, and the data flow is shown in Fig. 4.
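The singleton-managed address queue can be sketched as below. The class name `AddressQueue` and the choice of `ConcurrentLinkedQueue` are assumptions; the duplicate check is also an addition, included only because the description elsewhere names repeated storage as a problem the queue should prevent.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical singleton holding the one address queue of a capture task.
final class AddressQueue {
    private static final AddressQueue INSTANCE = new AddressQueue();

    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    // Assumption: remember every URL ever queued so it is not captured twice.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    private AddressQueue() {}

    static AddressQueue getInstance() {
        return INSTANCE;
    }

    // Inserts at the tail; returns false if the URL was queued before.
    boolean offer(String url) {
        return seen.add(url) && queue.offer(url);
    }

    // Removes from the head; returns null when the queue is empty.
    String poll() {
        return queue.poll();
    }
}
```

Worker threads would call `AddressQueue.getInstance().poll()` to take the next link and `offer` to append paging links found while parsing, so all threads share one first-in-first-out queue.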

Preferably, step S2 further comprises: filtering out data in the text data that does not conform to a preset data type. Before data capture, data in the text data that does not conform to the preset data type can first be filtered out, which reduces the amount of data the capturing unit has to analyze and improves capture efficiency.
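The description does not say what a "data type" is or how it is preset. Purely as an illustration, the sketch below assumes that each fetched fragment carries a type tag and that the preset is a set of allowed tags; `FragmentType`, `Fragment`, and `TypeFilter` are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Assumed classification of fetched fragments; not taken from the source.
enum FragmentType { TEXT, SCRIPT, STYLE, IMAGE }

final class Fragment {
    final FragmentType type;
    final String content;

    Fragment(FragmentType type, String content) {
        this.type = type;
        this.content = content;
    }
}

// Keeps only fragments whose type is in the preset set of allowed types.
final class TypeFilter {
    private final Set<FragmentType> allowed;

    TypeFilter(Set<FragmentType> allowed) {
        this.allowed = allowed;
    }

    List<Fragment> keepAllowed(List<Fragment> input) {
        List<Fragment> kept = new ArrayList<>();
        for (Fragment f : input) {
            if (allowed.contains(f.type)) {
                kept.add(f);
            }
        }
        return kept;
    }
}
```

Running such a filter before template matching means the capturing unit never has to inspect scripts, styles, or other fragments it would discard anyway.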

Preferably, the method further comprises, before step S1: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page by regular expressions; and when the complexity is greater than the preset value, parsing the target page by the jsoup framework. Parsing in the present invention can be performed in two ways. When page complexity is low, regular expressions are used: the specific content of the web page is parsed according to the HTML structure and the relative positions of the tags expressed in the web page configuration file. Because regular expressions are rather abstract and run less efficiently, they are suited to web pages with a relatively simple structure. For pages of higher complexity and difficulty, the present invention uses the jsoup web page parsing framework. jsoup is a Java HTML parser that can directly parse a URL or HTML text. It provides a very convenient API (application programming interface) for extracting and manipulating data through DOM (Document Object Model, a standard programming interface for processing markup documents), CSS (Cascading Style Sheets) selectors, and jQuery-like methods (jQuery being a JavaScript library).
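The choice between the two parsing modes can be sketched as a threshold test. The description does not define how complexity is measured, so the count of opening tags used here is an assumed stand-in, and `ParserKind` and `ParserSelector` are hypothetical names; the DOM branch would be served by jsoup in the embodiment described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The two parsing modes named in the description.
enum ParserKind { REGEX, DOM }

final class ParserSelector {
    // Matches opening tags such as <div> or <a href="...">, not closing tags.
    private static final Pattern OPEN_TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    private final int presetValue;

    ParserSelector(int presetValue) {
        this.presetValue = presetValue;
    }

    // Assumed complexity measure: the number of opening tags in the page.
    static int complexity(String html) {
        Matcher m = OPEN_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Complexity <= preset value: regular expressions; otherwise a DOM parser.
    ParserKind choose(String html) {
        return complexity(html) <= presetValue ? ParserKind.REGEX : ParserKind.DOM;
    }
}
```

Any other measure, such as nesting depth or page size, could replace `complexity` without changing the selection logic.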

When regular expressions are used in page parsing, content analysis based on the structure of the Web document serves as one embodiment of the present invention. Useless tags can be deleted first and the page analyzed afterwards; end tags are added to elements that have only a start tag, to ensure that tags are matched correctly, and the web page content is analyzed according to the distribution rules of HTML tags. Taking a simple page source structure obtained after such processing as an example, the source code can be split into the structure shown in Fig. 3. The text content of the web page is extracted mainly on the basis of the document-structure parameters in the description, and regular expression matching can then be used to obtain data such as the URL of the next page.
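Obtaining the next-page URL by regular expression matching can be sketched as follows. The markup convention assumed here (a paging link whose visible text is "Next") and the class name `NextPageFinder` are illustrative only; in the scheme described above, the actual pattern would follow from the page's configuration information.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NextPageFinder {
    // Assumed markup: <a ... href="URL" ...>Next</a>; group 1 is the URL.
    private static final Pattern NEXT_LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]+)\"[^>]*>\\s*Next\\s*</a>",
            Pattern.CASE_INSENSITIVE);

    private NextPageFinder() {}

    // Returns the href of the first "Next" link, or empty if there is none.
    static Optional<String> find(String html) {
        Matcher m = NEXT_LINK.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

A pattern like this works only while the paging markup stays simple and regular, which is consistent with the description reserving regular expressions for pages of low complexity.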

For pages with a complex structure, or target areas that are difficult to describe with regular expressions, the present invention configures the target page through a template, so that the jsoup framework can locate and read the target area and then extract the data in it. This strengthens the generality of the capturing unit and shortens the time users need to obtain information. The benefit of using a template is that a general capturing program suited to a website can be written according to the features of that website. A template also reduces the complexity of user operation: the user can use the capturing unit without providing a large amount of information.

The present invention also proposes a page data capturing system 10, comprising: a parsing unit 11, configured to parse a target page to obtain configuration information of the target page and to generate a matching template according to the configuration information; an obtaining unit 12, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page; and a capturing unit 13, configured to capture the text data in the target page according to the matching template and to store the text data as a basis for an index operation.

Preferably, the system further comprises: a loading unit 14, configured to load the configuration information into the static memory of the capturing unit 13.

Preferably, the system further comprises: a queue management unit 15, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern.

Preferably, the system further comprises: a filtering unit 16, configured to filter out data in the text data that does not conform to a preset data type.

Preferably, the system further comprises: a judging unit 17, configured to judge the complexity of the target page, wherein the parsing unit 11 parses the target page by regular expressions when the complexity is less than or equal to a preset value, and parses the target page by the jsoup framework when the complexity is greater than the preset value.

According to an embodiment of the present invention, a non-volatile machine-readable medium is also provided, which stores a program product for page data capturing, the program product comprising the two program products mentioned above.

According to an embodiment of the present invention, a machine-readable program is also provided, which causes a machine to perform the page data capturing method of any of the above technical solutions.

According to an embodiment of the present invention, a storage medium storing a machine-readable program is also provided, wherein the machine-readable program causes a machine to perform the page data capturing method of any of the above technical solutions.

The technical solution of the present invention has been described above with reference to the accompanying drawings. In the related art, the capture program has to be modified whenever data is captured from different pages, which requires a large programming effort, and data captured by the capture program alone cannot provide a very accurate basis for indexing. With the technical solution of the present application, a matching template can be tailored to the target page, so that the capturing unit can accurately capture areas and/or data in the target page according to the configuration information of the template. If the website or web page changes, only the template needs to be modified, not the capturing unit, and modifying the template takes less work than modifying the capturing unit. This both improves work efficiency and shortens the cycle in which the capturing unit captures data, so that the content the capturing unit provides for indexing is updated in time.

In the present invention, the term "multiple" means two or more, unless expressly limited otherwise.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A page data capturing method, characterized by comprising:

S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information;

S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page;

S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for an index operation.

2. The page data capturing method according to claim 1, characterized in that step S1 further comprises: loading the configuration information into a static memory of the capturing unit.

3. The page data capturing method according to claim 1, characterized in that step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern.

4. The page data capturing method according to any one of claims 1 to 3, characterized in that step S2 further comprises: filtering out data in the text data that does not conform to a preset data type.

5. The page data capturing method according to any one of claims 1 to 3, characterized by further comprising, before step S1: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page by regular expressions; and when the complexity is greater than the preset value, parsing the target page by the jsoup framework.

6. A page data capturing system, characterized by comprising:

a parsing unit, configured to parse a target page to obtain configuration information of the target page and to generate a matching template according to the configuration information;

an obtaining unit, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page;

a capturing unit, configured to capture the text data in the target page according to the matching template and to store the text data as a basis for an index operation.

7. The page data capturing system according to claim 6, characterized by further comprising:

a loading unit, configured to load the configuration information into a static memory of the capturing unit.

8. The page data capturing system according to claim 6, characterized by further comprising:

a queue management unit, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by the singleton pattern.

9. The page data capturing system according to any one of claims 6 to 8, characterized by further comprising:

a filtering unit, configured to filter out data in the text data that does not conform to a preset data type.

10. The page data capturing system according to any one of claims 6 to 8, characterized by further comprising:

a judging unit, configured to judge the complexity of the target page,

wherein the parsing unit parses the target page by regular expressions when the complexity is less than or equal to a preset value, and parses the target page by the jsoup framework when the complexity is greater than the preset value.