CN108536699A - Method, apparatus, device, and storage medium for crawling web page content - Google Patents


Info

Publication number
CN108536699A
Authority
CN
China
Prior art keywords
web page
content
crawl
webpage
page contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710120775.6A
Other languages
Chinese (zh)
Inventor
刘永
魏炎炎
阳健
张旭祥
李曙聪
牛朋涛
张莹莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710120775.6A priority Critical patent/CN108536699A/en
Publication of CN108536699A publication Critical patent/CN108536699A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present invention disclose a method, apparatus, device, and storage medium for crawling web page content. The method includes: obtaining custom configuration information input by a user, where the custom configuration information includes a crawl entry address corresponding to the content to be crawled; crawling at least one piece of web page content according to the custom configuration information; parsing the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content; and storing the obtained feature values in a corresponding storage container. The method crawls web page content through custom configuration and obtains the feature values of the crawl features by parsing the crawled content, which enables fast data acquisition, makes web crawling configurable, and reduces development cost.

Description

Method, apparatus, device, and storage medium for crawling web page content
Technical field
Embodiments of the present invention relate to data processing technology, and in particular to a method, apparatus, device, and storage medium for crawling web page content.
Background
At present, many company projects depend on big data, and much of that data comes from information published on the official websites of public platforms. Such information offers useful guidance to enterprises for tasks such as assessing product value or analyzing markets, but acquiring the data is a major bottleneck: different types of websites differ greatly in page layout and in how they present information. In the prior art, a web crawling system has to be custom-developed for each type of website, which is costly and time-consuming.
Summary of the invention
Embodiments of the present invention provide a method, apparatus, device, and storage medium for crawling web page content, so as to make web crawling configurable.
In a first aspect, an embodiment of the present invention provides a method for crawling web page content, including:
obtaining custom configuration information input by a user, where the custom configuration information includes a crawl entry address corresponding to the content to be crawled;
crawling at least one piece of web page content according to the custom configuration information;
parsing the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content;
storing the obtained feature values in a corresponding storage container.
In a second aspect, an embodiment of the present invention further provides an apparatus for crawling web page content, including:
an information acquisition module, configured to obtain custom configuration information input by a user, where the custom configuration information includes a crawl entry address corresponding to the content to be crawled;
a web page content crawling module, configured to crawl at least one piece of web page content according to the custom configuration information;
a feature value acquisition module, configured to parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content;
a feature value storage module, configured to store the obtained feature values in a corresponding storage container.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the method for crawling web page content described in any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for crawling web page content described in any embodiment of the present invention.
The method for crawling web page content provided by the embodiments of the present invention crawls web page content through custom configuration and obtains the feature values of the crawl features by parsing the crawled content, which enables fast data acquisition, makes web crawling configurable, and reduces development cost.
Brief description of the drawings
Fig. 1 is a flowchart of the method for crawling web page content provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of a method for crawling web page content provided by Embodiment two of the present invention;
Fig. 3a is a flowchart of a method for crawling web page content provided by Embodiment three of the present invention;
Fig. 3b is a schematic diagram of a crawled page in the method for crawling web page content provided by Embodiment three of the present invention;
Fig. 4a is a flowchart of a method for crawling web page content provided by Embodiment four of the present invention;
Fig. 4b is a schematic diagram of the web page under a subdirectory in the method for crawling web page content provided by Embodiment four of the present invention;
Fig. 4c is a schematic diagram of part of the content of a crawled web page in the method for crawling web page content provided by Embodiment four of the present invention;
Fig. 5 is a flowchart of a method for crawling web page content provided by Embodiment five of the present invention;
Fig. 6 is a flowchart of a method for crawling web page content provided by Embodiment six of the present invention;
Fig. 7 is a schematic structural diagram of an apparatus for crawling web page content provided by Embodiment seven of the present invention;
Fig. 8 is a schematic structural diagram of a computer device provided by Embodiment eight of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure or content.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential processing, many of the operations can be performed in parallel, concurrently, or simultaneously. In addition, the order of the operations can be rearranged. The processing may be terminated when its operations are completed, but it may also have additional steps that are not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram, and so on.
Embodiment one
Fig. 1 is a flowchart of the method for crawling web page content provided by Embodiment one of the present invention. This embodiment is applicable to crawling content in web pages. The method can be executed by an apparatus for crawling web page content; the apparatus can be implemented in software and/or hardware and can generally be integrated in a server. The method specifically includes:
S110: Obtain custom configuration information input by a user.
The custom configuration information includes a crawl entry address corresponding to the content to be crawled.
The crawl entry address can be the link to the web page to be crawled; following the link opens that page. The user configures the custom configuration information according to the crawl requirements. When a website is upgraded and the crawl entry address changes, no new crawling method has to be developed; only the configuration needs to be updated.
The content to be crawled is the content in the web page that the user actually wants to obtain, for example the basic information records of one or more published journals.
S120: Crawl at least one piece of web page content according to the custom configuration information.
The web page content can be all or part of the content of the web page corresponding to the crawl entry address; the crawl range can be adjusted through the custom configuration information.
S130: Parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content.
The crawled web page content can be parsed by means such as text recognition. There can be one crawl feature or several, and a feature value is the content corresponding to a crawl feature. For example, if the crawled web page is the Chinese-language resume of a job seeker, the crawl features associated with the page can be the job seeker's name, education, native place, and so on, and the feature value is the attribute value of each crawl feature. If the job seeker holds a master's degree, the feature value of the education feature is "master's degree".
S140: Store the obtained feature values in a corresponding storage container.
The feature values obtained by parsing are stored in the storage space of a terminal or server for subsequent use or mining.
The method for crawling web page content provided by this embodiment crawls web page content through custom configuration and obtains the feature values of the crawl features by parsing the crawled content, which enables fast data acquisition, makes web crawling configurable, and reduces development cost.
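Steps S110 to S140 can be sketched as a small configuration-driven pipeline. The following is only a minimal illustration under stated assumptions: the `CrawlConfig` field names, the use of regular expressions as the parsing means, the in-memory `pages` dictionary standing in for real HTTP requests, and the example URL are all hypothetical and do not come from the source.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlConfig:
    """User-supplied custom configuration (field names are hypothetical)."""
    entry_url: str                                 # crawl entry address
    features: dict = field(default_factory=dict)   # crawl feature -> regex with one group


def fetch(url, pages):
    """S120: stand-in for an HTTP fetch; `pages` maps URL -> page text."""
    return pages[url]


def parse(content, features):
    """S130: extract one feature value per configured crawl feature."""
    values = {}
    for name, pattern in features.items():
        match = re.search(pattern, content)
        if match:
            values[name] = match.group(1).strip()
    return values


def crawl(config, pages, store):
    """S110-S140: fetch, parse, and store feature values keyed by entry URL."""
    content = fetch(config.entry_url, pages)
    store[config.entry_url] = parse(content, config.features)
    return store


# Hypothetical resume page, mirroring the job seeker example in the text.
pages = {"http://example.com/resume/1": "Name: Zhang San\nEducation: Master's degree\n"}
config = CrawlConfig(
    entry_url="http://example.com/resume/1",
    features={"name": r"Name:\s*(.+)", "education": r"Education:\s*(.+)"},
)
store = crawl(config, pages, {})
```

If the site changes its entry address or layout, only `config` has to be edited in this sketch, which is the point the embodiment makes about configuration replacing redevelopment.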
Embodiment two
Fig. 2 is a flowchart of a method for crawling web page content provided by Embodiment two of the present invention. This embodiment is optimized on the basis of the above embodiment by adding steps before the custom configuration information input by the user is obtained. The method specifically includes:
S210: Determine a main entry address corresponding to the content to be crawled.
The main entry address can be the URL (Uniform Resource Locator) of the home page of the website to be searched. Entering the main entry address in the address bar opens that home page. For example, if the home page of the website to be searched is the Baidu academic journal channel, the URL of the home page is http://xueshu.baidu.com/usercenter/data/journal; entering this URL in the address bar opens the home page of the Baidu academic journal channel.
S220: Enter a search term associated with the content to be crawled in the search box of the home page corresponding to the main entry address, and trigger a search.
The search term must be related to the content to be crawled, and each piece of content to be crawled can correspond to several search terms. Entering a search term in the search box and triggering a search yields a search result page related to the content to be crawled. For example, if the content to be crawled includes information about the journal Automation of Electric Systems, the search term can be set to "electric system"; the keyword "electric system" is entered in the search box, and the search task is triggered by simulating a user clicking the search button or pressing the Enter key.
S230: Determine the crawl entry address corresponding to the content to be crawled according to the search results returned by the home page.
After the search task is triggered, the search results contain multiple entries, each of which corresponds to an entry address. According to the content to be crawled, at least one of these entry addresses is selected as the crawl entry address corresponding to the content to be crawled. For example, if the content to be crawled includes information about the journal Automation of Electric Systems, the search for the keyword "electric system" returns several journal results, such as "Automation of Electric Systems", "Power System Protection and Control", and "Journal of Power System and Its Automation", each with its own entry address. A selected search result can be clicked, and after the page jump the address shown in the address bar can serve as the crawl entry address corresponding to the content to be crawled.
S240: Obtain custom configuration information input by a user.
The custom configuration information includes the crawl entry address corresponding to the content to be crawled, as determined by steps S210 to S230.
S250: Crawl at least one piece of web page content according to the custom configuration information.
S260: Parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content.
S270: Store the obtained feature values in a corresponding storage container.
The method for crawling web page content provided by this embodiment first determines the main entry address of the home page before crawling and then determines the crawl entry address corresponding to the content to be crawled. This identifies the web page to be crawled more accurately and improves efficiency.
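Steps S220 and S230 can be sketched as building a search request and then selecting entry addresses from the returned results. This is a minimal sketch only: the query parameter name `query`, the `example.com` URLs, and the exact-title matching rule are assumptions made for illustration, and the search results are given as a hard-coded list instead of being fetched.

```python
from urllib.parse import urlencode


def build_search_url(main_entry, term, param="query"):
    """S220: compose the search request; the parameter name is an assumption."""
    return f"{main_entry}?{urlencode({param: term})}"


def pick_crawl_entries(results, target_title):
    """S230: keep the URLs of results whose title matches the content to be crawled."""
    return [url for title, url in results if title == target_title]


# Hypothetical search results as (title, entry address) pairs.
results = [
    ("Automation of Electric Systems", "http://example.com/journal/1"),
    ("Power System Protection and Control", "http://example.com/journal/2"),
]
search_url = build_search_url("http://example.com/journal", "electric system")
entries = pick_crawl_entries(results, "Automation of Electric Systems")
```

The selected addresses in `entries` would then be written into the custom configuration information as crawl entry addresses for S240.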
Embodiment three
Fig. 3a is a flowchart of a method for crawling web page content provided by Embodiment three of the present invention. This embodiment is optimized on the basis of the above embodiments and further refines the steps "parsing the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content" and "storing the obtained feature values in a corresponding storage container". The method specifically includes:
S310: Obtain custom configuration information input by a user.
The custom configuration information further includes at least one key crawl feature.
In general, the content the user actually needs is only part of the crawled web page content. Storing the entire web page content and extracting feature values from all of it wastes resources. The user can therefore also configure, in the custom configuration information, the key crawl features that are actually needed, for example the impact factor or search index of each journal in the crawled web page.
S320: Crawl at least one piece of web page content according to the custom configuration information.
S330: Extract, from the crawled web page content, the local page content associated with the key crawl feature, and parse, in the local page content, the key feature value corresponding to the key crawl feature.
A local page can be one of the independent parts into which div tags divide a web page in the browser; a crawled web page contains multiple local pages. The local page in which a key crawl feature is located can be determined from the feature. The local page content associated with the key crawl feature is extracted from the crawled web page content, and the key feature value corresponding to the key crawl feature is parsed from that local page content. For example, Fig. 3b is a schematic diagram of a crawled page in the method for crawling web page content provided by Embodiment three of the present invention, showing a first local page 301 and a second local page 302. If the key crawl feature is the "impact factor" of the journal Automation of Electric Systems, the content of the first local page 301, where the journal's "impact factor" is located, is extracted, and the key feature value corresponding to "impact factor", namely "2.348", is parsed.
S340: Store the obtained local page content and the key feature value in a corresponding storage container.
The key feature value is stored in the corresponding storage space together with its local page content.
The method for crawling web page content provided by this embodiment extracts, from the crawled web page content, the local page content associated with the key crawl feature and parses the key feature value corresponding to the key crawl feature. This narrows the range in which key feature values are parsed and reduces the parsing workload.
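The extraction in S330 can be sketched with the standard-library HTML parser. The sketch below assumes, for illustration only, that each top-level `<div>` is one local page, that the key feature name appears literally in the page text, and that its value is a number following the name; the HTML snippet is invented and does not reproduce Fig. 3b.

```python
import re
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of each top-level <div> as one local page."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <div> nesting depth
        self.blocks = []    # text of each top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.blocks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.blocks[-1] += data


def extract_key_value(html, key_feature):
    """Return (local page text, key feature value) for the first block naming the feature."""
    collector = DivTextCollector()
    collector.feed(html)
    for block in collector.blocks:
        match = re.search(re.escape(key_feature) + r"\s*[:：]?\s*([\d.]+)", block)
        if match:
            return block.strip(), match.group(1)
    return None, None


# Hypothetical page with two local pages.
html = (
    "<div>Automation of Electric Systems Impact factor: 2.348</div>"
    "<div>Latest issue: table of contents</div>"
)
local_page, value = extract_key_value(html, "Impact factor")
```

Only the matching block is parsed and returned, so the second local page is never examined for feature values, which is the workload reduction the embodiment describes.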
Embodiment four
Fig. 4a is a flowchart of a method for crawling web page content provided by Embodiment four of the present invention. This embodiment is optimized on the basis of the above embodiments and further refines the step "crawling at least one piece of web page content according to the custom configuration information". The method specifically includes:
S410: Obtain custom configuration information input by a user.
S420: Crawl the web page content of the target web page corresponding to the crawl entry address.
Entering the crawl entry address yields the target web page corresponding to that address, and the web page content of the target web page is crawled.
S430: Obtain a web page link in the web page content, and extract the web page description feature corresponding to the web page link.
The crawled web page can contain multiple subdirectories. Each subdirectory corresponds to a different web page link and to a different web page description feature, which characterizes the web page of that subdirectory. For example, Fig. 4b is a schematic diagram of the web page under the subdirectory "Special Planning" in the method for crawling web page content provided by Embodiment four of the present invention; region 401 shows the web page description feature corresponding to the web page link of this subdirectory, i.e.,
Optionally, the obtained web page links include explicit web page links and/or hidden web page links.
When a web page link is an explicit link, its address can be obtained directly. When a web page link is a hidden link, the hidden link can be identified by a corresponding algorithm, and the corresponding link address is then obtained.
S440: Determine whether the web page description feature and the key crawl feature satisfy a similarity condition. If so, execute S450; otherwise, execute S470.
S450: Crawl the web page content corresponding to the web page link, and execute S460.
The key crawl feature is matched against the web page description feature of each web page link to determine the web pages to be crawled. The similarity condition can be that the key crawl feature belongs to the category indicated by the web page description feature.
S460: Continue to search iteratively for new web page links in the web page content corresponding to the web page link and crawl the corresponding web page content until a preset mining depth condition is met, then execute S470.
Some web page content contains further web page links; clicking such a link opens a web page one level below the current one. For example, Fig. 4c is a schematic diagram of part of the content of a crawled web page in the method for crawling web page content provided by Embodiment four of the present invention; clicking "electricity market academic research" opens the corresponding table of contents of Volume 38, Issue 13, 2014. For the at least one key crawl feature, web pages are crawled level by level until the preset mining depth condition is met. The mining depth condition can limit the total number of levels mined, for example three levels in total, or it can require crawling to continue until all key crawl features have been crawled.
S470: Determine whether all web page links in the target web page have been processed. If so, execute S480; otherwise, return to S430.
Web page content is crawled iteratively until all web page links in the target web page have been processed. The final crawl result covers all subdirectories of the target web page, crawled to the preset mining depth for all key crawl features.
S480: Parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content.
S490: Store the obtained feature values in a corresponding storage container.
The method for crawling web page content provided by this embodiment uses the similarity between the key crawl features and the description features of the subdirectories to determine which subdirectories to crawl. This determines the crawl range more accurately and reduces the content parsing workload, while the mining depth condition ensures that the crawl goes deep enough to meet the mining requirements.
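The link-following loop of S420 to S470 can be sketched as a depth-limited traversal. This is an illustration under several assumptions that are not in the source: the site is an in-memory dictionary instead of live pages, the `categories` mapping that decides category membership is hypothetical, the mining depth condition is a fixed number of levels, and the similarity check is applied at every level, which the source does not specify.

```python
def is_similar(description, key_features, categories):
    """Similarity condition: a key feature belongs to the category named by the description."""
    return any(f in categories.get(description, ()) for f in key_features)


def crawl_links(site, url, key_features, categories, max_depth):
    """S420-S470: follow matching links level by level, up to max_depth levels."""
    collected = {url: site[url]["content"]}   # S420: target page content
    frontier = [(url, 0)]
    while frontier:
        current, depth = frontier.pop()
        if depth >= max_depth:                # mining depth condition reached
            continue
        for description, link in site[current]["links"]:          # S430
            if link in collected:
                continue
            if not is_similar(description, key_features, categories):  # S440
                continue
            collected[link] = site[link]["content"]                # S450
            frontier.append((link, depth + 1))                     # S460
    return collected


# Hypothetical site: each page has content and (description feature, link) pairs.
site = {
    "/journal": {"content": "journal home",
                 "links": [("Special Planning", "/special"), ("About us", "/about")]},
    "/special": {"content": "special planning",
                 "links": [("Special Planning", "/special/2014-13")]},
    "/special/2014-13": {"content": "issue 13",
                         "links": [("Special Planning", "/special/2014-13/paper1")]},
    "/special/2014-13/paper1": {"content": "paper", "links": []},
    "/about": {"content": "about", "links": []},
}
categories = {"Special Planning": {"electricity market academic research"}}
pages = crawl_links(site, "/journal", ["electricity market academic research"],
                    categories, max_depth=2)
```

With `max_depth=2` the traversal stops before the paper page, and the "About us" branch is skipped because its description feature does not match any key crawl feature.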
Embodiment five
Fig. 5 is a flowchart of a method for crawling web page content provided by Embodiment five of the present invention. This embodiment is optimized on the basis of the above embodiments and further refines the step "crawling at least one piece of web page content according to the custom configuration information". The method specifically includes:
S510: Obtain custom configuration information input by a user.
The custom configuration information further includes a setting that enables the verification code cracking function.
S520: If the web page corresponding to the crawl entry address and/or the web page link includes a verification code, select a recognition strategy that matches the verification code type and recognize the verification code.
The verification code type can be, for example, dragging a slider to complete a jigsaw puzzle, entering the result of an arithmetic operation shown in the verification picture, clicking a given pattern in the verification picture, or recognizing the digits in the verification picture. Different recognition strategies can be used for different verification code types.
S530: Enter the recognized verification code in the verification code input box of the web page, and trigger confirmation of the verification code.
The recognition result differs by verification code type. For example, for the type that requires dragging a slider to complete a jigsaw puzzle, the recognition result can be the correct stopping position of the slider; placing the slider at that position triggers confirmation of the verification code. For the type that requires recognizing the digits in the verification picture, the recognition result is the recognized digits; entering them in the verification code input box of the web page triggers confirmation of the verification code.
S540: Crawl the web page content that the web page returns after the verification code is confirmed.
After the verification code is confirmed, the web page corresponding to the entry address opens, and the required web page content is crawled from it.
S550: Parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content.
S560: Store the obtained feature values in a corresponding storage container.
The method for crawling web page content provided by this embodiment automatically selects a recognition strategy, recognizes the verification code, enters it, and triggers its confirmation. This automates verification code recognition and input, avoids the tedium of entering verification codes manually, and improves the efficiency of crawling web page content.
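The strategy selection in S520 amounts to a dispatch on the verification code type. The sketch below shows only that dispatch. It assumes that an earlier image recognition step, which is not shown, has already turned the verification picture into text; the type names `digits` and `arithmetic`, the `challenge` dictionary layout, and both solver functions are hypothetical.

```python
import re


def solve_digits(challenge):
    """Digit type: keep only the digits of the already-recognized text."""
    return "".join(re.findall(r"\d", challenge["text"]))


def solve_arithmetic(challenge):
    """Arithmetic type: evaluate a recognized 'a+b' or 'a-b' expression."""
    pattern = r"\s*(\d+)\s*([+-])\s*(\d+)\s*=?\s*\??\s*"
    a, op, b = re.fullmatch(pattern, challenge["text"]).groups()
    return str(int(a) + int(b) if op == "+" else int(a) - int(b))


# Recognition strategies keyed by verification code type (names are assumptions).
STRATEGIES = {"digits": solve_digits, "arithmetic": solve_arithmetic}


def recognize(challenge):
    """S520: choose the strategy that matches the verification code type."""
    try:
        strategy = STRATEGIES[challenge["type"]]
    except KeyError:
        raise ValueError(f"no strategy for verification code type {challenge['type']!r}")
    return strategy(challenge)


answer = recognize({"type": "arithmetic", "text": "3 + 4 = ?"})
```

Supporting another type, such as the slider puzzle, would mean registering one more function in `STRATEGIES` without changing `recognize`.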
Embodiment six
Fig. 6 is a flowchart of a method for crawling web page content provided by Embodiment six of the present invention. This embodiment is optimized on the basis of the above embodiments and further refines the step "crawling at least one piece of web page content according to the custom configuration information". The method specifically includes:
S610: Obtain custom configuration information input by a user.
The custom configuration information further includes a setting that enables the digital fingerprint function.
S620: Crawl at least one piece of web page content according to the custom configuration information.
S630: Parse the crawled web page content to obtain feature values corresponding to the crawl features associated with the web page content.
S640: Store the obtained feature values in a corresponding storage container.
S650: Calculate the digital fingerprint corresponding to the local page content, and store the digital fingerprint together with the local page content.
The digital fingerprint characterizes the uniqueness of the local page content and is used to determine whether the local page content has changed. The way the digital fingerprint is calculated is not limited; for example, the ASCII codes of all characters in the local page content can be summed to obtain the digital fingerprint of the local page content. The digital fingerprint is stored together with the corresponding local page content.
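The character-code sum mentioned above can be written in a few lines. The sketch below also shows a hash-based fingerprint as an alternative; the hash variant is an assumption added for comparison and is not described in the source.

```python
import hashlib


def sum_fingerprint(text):
    """Fingerprint from the example above: sum of the character codes of the content."""
    return sum(ord(ch) for ch in text)


def hash_fingerprint(text):
    """Alternative (an assumption, not from the source): SHA-256 of the UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


fp = sum_fingerprint("ab")   # 97 + 98
```

A plain sum does not depend on character order, so `"ab"` and `"ba"` share a fingerprint and some content changes would go undetected; a cryptographic hash makes such collisions very unlikely, at the cost of a longer fingerprint.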
S660: After a set time interval, reacquire the local page content corresponding to the content to be crawled as the content to be verified, and calculate the digital fingerprint to be verified that corresponds to the content to be verified.
The local page content is rechecked at regular intervals: it is reacquired and the corresponding digital fingerprint to be verified is calculated. The interval can be set according to actual needs, for example one month or one quarter.
S670: In the storage container, obtain the digital fingerprint of the local page content that matches the content to be verified as the original fingerprint.
The digital fingerprint already stored in the storage space can be read as the original fingerprint.
S680: If the original fingerprint differs from the fingerprint to be verified, parse new key feature values from the content to be verified.
If the original fingerprint differs from the fingerprint to be verified, the local page has changed and the key feature values corresponding to the crawl features may no longer be the same as those parsed originally, so the key feature values are parsed again.
S690: In the storage container, update the stored content with the content to be verified and the new key feature values.
The newly parsed key feature values are stored in the storage space together with the corresponding local page content, replacing the previous entries.
The method for crawling web page content provided by this embodiment converts local page content into a digital fingerprint and determines whether the parsed key feature values may have changed by checking whether the fingerprints are identical. Key feature values are updated only when the fingerprints differ, and nothing is done when they are identical. This reduces the workload of updating stored information: information is updated promptly when it changes, and the wasted work of refreshing unchanged information is avoided, which improves the efficiency of crawling web page content.
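The periodic check in S660 to S690 can be sketched as a single refresh function. This is a minimal sketch under stated assumptions: the storage container is a plain dictionary, the record layout (`content`, `fingerprint`, `values`) is hypothetical, the fingerprint is the character-code sum given as an example in S650, and fetching and parsing are passed in as callables so that no real network access is needed.

```python
def fingerprint(text):
    """Character-code sum, following the example calculation in S650."""
    return sum(ord(ch) for ch in text)


def refresh(store, key, fetch_local_page, parse_key_values):
    """S660-S690: re-fetch, compare fingerprints, and re-parse only on change."""
    content = fetch_local_page(key)              # S660: content to be verified
    to_verify = fingerprint(content)
    original = store[key]["fingerprint"]         # S670: original fingerprint
    if original == to_verify:
        return False                             # unchanged: do nothing
    store[key] = {                               # S680-S690: re-parse and update
        "content": content,
        "fingerprint": to_verify,
        "values": parse_key_values(content),
    }
    return True


def parse_impact_factor(content):
    """Hypothetical parser for a page of the form 'Impact factor: <number>'."""
    return {"impact factor": content.split(": ")[1]}


old = "Impact factor: 2.348"
store = {"journal-1": {"content": old, "fingerprint": fingerprint(old),
                       "values": parse_impact_factor(old)}}

unchanged = refresh(store, "journal-1", lambda key: old, parse_impact_factor)
changed = refresh(store, "journal-1", lambda key: "Impact factor: 2.501",
                  parse_impact_factor)
```

In a deployment, `refresh` would be called by a scheduler at the configured interval (for example monthly); the scheduling itself is outside this sketch.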
Embodiment seven
Fig. 7 is a schematic structural diagram of a web page content crawling device provided by Embodiment Seven of the present invention. The device includes: an information acquisition module 710, a web page content crawling module 720, a feature value acquisition module 730 and a feature value storage module 740. Wherein:
The information acquisition module 710 is configured to obtain custom configuration information input by a user, where the custom configuration information includes: a crawl entry address corresponding to the content to be crawled;
The web page content crawling module 720 is configured to crawl at least one piece of web page content according to the custom configuration information;
The feature value acquisition module 730 is configured to parse the crawled web page content and obtain the feature values corresponding to the crawl features associated with the web page content;
The feature value storage module 740 is configured to store the obtained feature values in the corresponding storage container.
The web page content crawling device provided by this embodiment of the present invention crawls web page content through custom configuration and obtains the feature values corresponding to the crawl features by parsing the crawled content. Data is obtained quickly, web page crawling becomes configurable, and development cost is reduced.
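The four modules can be read as a small pipeline: configuration, crawling, parsing, storage. The following is a minimal sketch under stated assumptions: the `name: value` page format, the injected `fetch` function and all identifiers are hypothetical, and a plain dictionary stands in for the storage container.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CrawlConfig:
    """Custom configuration information (role of module 710)."""
    entry_url: str
    features: List[str] = field(default_factory=list)


def crawl_pages(config: CrawlConfig, fetch: Callable[[str], str]) -> List[str]:
    """Role of module 720: crawl at least one page, starting at the entry address."""
    return [fetch(config.entry_url)]


def extract_values(page: str, features: List[str]) -> Dict[str, str]:
    """Role of module 730: parse feature values (assumes 'name: value' lines)."""
    values = {}
    for line in page.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in features:
            values[name.strip()] = value.strip()
    return values


def run(config: CrawlConfig, fetch: Callable[[str], str], container: dict) -> dict:
    """Role of module 740: store the values in the container, keyed by entry address."""
    for page in crawl_pages(config, fetch):
        container.setdefault(config.entry_url, {}).update(
            extract_values(page, config.features)
        )
    return container


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request so the sketch runs without a network.
    return "title: Example\nprice: 42\nunrelated text"


config = CrawlConfig("http://example.com/item", ["title", "price"])
container = run(config, fake_fetch, {})
```

Passing `fetch` in as a parameter keeps the sketch testable offline; a real crawler would replace `fake_fetch` with an HTTP client.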
Further, the device may also include an address acquisition module, which includes:
A main entry address determination unit, configured to determine the main entry address corresponding to the content to be crawled before the custom configuration information input by the user is obtained;
A search trigger unit, configured to enter a search term associated with the content to be crawled in the search box of the home page corresponding to the main entry address, and to trigger a search;
A crawl entry determination unit, configured to determine the crawl entry address corresponding to the content to be crawled according to the search results returned by the home page.
Further, the custom configuration information may also include: at least one key crawl feature;
The feature value acquisition module 730 may be specifically configured to:
Extract, from the crawled web page content, the local page content associated with the key crawl feature, and parse from the local page content the key feature value corresponding to the key crawl feature;
The feature value storage module 740 may be specifically configured to:
Store the obtained local page content and the key feature value in the corresponding storage container.
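A minimal sketch of the two-stage extraction described above: first isolate the local page content associated with a key crawl feature, then parse the key feature value from it. The `<div>`-based page layout, the `name: value` convention and the function names are assumptions for illustration only; the text does not specify how the association is determined.

```python
import re

# Assumption: each candidate block of local page content is one <div> element.
BLOCK_RE = re.compile(r"<div[^>]*>(.*?)</div>", re.S)


def extract_local_content(html: str, feature: str):
    """Return the first block whose text mentions the key crawl feature."""
    for block in BLOCK_RE.findall(html):
        if feature.lower() in block.lower():
            return block
    return None


def parse_key_value(local_content: str, feature: str):
    """Parse the value that follows 'feature:' inside the local content."""
    match = re.search(re.escape(feature) + r"\s*:\s*(.+)", local_content, re.I)
    return match.group(1).strip() if match else None


html = '<div class="nav">Home</div><div class="info">Price: 42 USD</div>'
local = extract_local_content(html, "price")
value = parse_key_value(local, "price")

# Both the local content and the key feature value are stored together.
container = {"price": {"local_content": local, "value": value}}
```

Storing the local content next to the parsed value is what later makes a change check on that content possible without re-parsing the whole page.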
Further, the web page content crawling module 720 may be specifically configured to:
Crawl the web page content of the target web page corresponding to the crawl entry address;
Obtain a web page link from the web page content, and extract the web page description feature corresponding to the web page link;
If the web page description feature and the key crawl feature satisfy a similarity condition, crawl the web page content corresponding to the web page link;
Continue to iteratively search for new web page links in the web page content corresponding to the web page link and crawl the corresponding web page content, until a preset mining depth condition is met;
Return to the operation of obtaining a web page link from the web page content and extracting the web page description feature corresponding to the web page link, until all web page links in the target web page have been processed.
Further, the obtained web page links may include: explicit web page links and/or hidden web page links.
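The link-following procedure above can be sketched as a depth-limited, similarity-filtered crawl. Several choices here are assumptions and not taken from the text: the anchor text is used as the web page description feature, the similarity condition is a word-overlap ratio with a 0.5 threshold, and a queue (breadth-first order) replaces the per-link iteration described above. The in-memory `SITE` dictionary stands in for real HTTP fetches.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def similar(description, features, threshold=0.5):
    """Assumed similarity condition: share of key features found in the description."""
    words = set(description.lower().split())
    wanted = {f.lower() for f in features}
    return bool(wanted) and len(words & wanted) / len(wanted) >= threshold


def crawl(entry, fetch, features, max_depth):
    """Crawl from the entry address, following only similar links up to max_depth."""
    captured = {entry: fetch(entry)}
    frontier = [(entry, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if depth >= max_depth:
            continue  # mining depth condition reached for this branch
        collector = LinkCollector()
        collector.feed(captured[url])
        for href, text in collector.links:
            if href in captured or not similar(text, features):
                continue
            captured[href] = fetch(href)
            frontier.append((href, depth + 1))
    return captured


SITE = {
    "/": '<a href="/phones">phone prices</a><a href="/about">about us</a>',
    "/phones": '<a href="/phones/x1">X1 phone</a>',
    "/phones/x1": '<a href="/phones/x1/specs">phone specs</a>',
    "/phones/x1/specs": "specs",
    "/about": "about",
}
pages = crawl("/", SITE.__getitem__, ["phone"], max_depth=2)
```

With `max_depth=2` the crawl stops two links away from the entry page, so `/phones/x1/specs` is never fetched, and `/about` is skipped because its description shares no word with the key crawl feature.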
Further, the custom configuration information may also include: enabling the verification code cracking function;
The web page content crawling module 720 may include:
A verification code selection unit, configured to, if the web page corresponding to the crawl entry address and/or the web page link includes a verification code, select a recognition strategy matching the type of the verification code and use it to recognize the verification code;
A verification code input unit, configured to enter the recognized verification code into the verification code input box in the web page and trigger confirmation of the verification code;
A content crawling unit, configured to crawl the web page content of the web page returned after the verification code is confirmed.
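Selecting a recognition strategy by verification code type amounts to a lookup in a strategy table. The sketch below shows only that dispatch. The two verification code types and their toy recognizers are hypothetical stand-ins (the text does not enumerate types or recognition techniques), and the already-transcribed text payload replaces real image recognition.

```python
def solve_arithmetic(payload: str) -> str:
    """Toy strategy: evaluate a verification code such as '3 + 4'."""
    left, op, right = payload.split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a - b)


def solve_text(payload: str) -> str:
    """Toy strategy: stand-in for character recognition on a text code."""
    return payload.replace(" ", "").upper()


# Hypothetical registry: verification code type -> recognition strategy.
STRATEGIES = {
    "arithmetic": solve_arithmetic,
    "text": solve_text,
}


def recognize(code_type: str, payload: str) -> str:
    """Choose the strategy that matches the verification code type and apply it."""
    try:
        strategy = STRATEGIES[code_type]
    except KeyError:
        raise ValueError(f"no recognition strategy for type {code_type!r}") from None
    return strategy(payload)


answer = recognize("arithmetic", "3 + 4")
text_answer = recognize("text", "a b 7")
```

Keeping the strategies in a table means a new verification code type can be supported by registering one more function, without touching the crawling logic.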
Further, the custom configuration information may also include: enabling the digital fingerprint function;
The device may also include a feature value update module, which includes:
A fingerprint calculation unit, configured to, after the obtained feature values are stored in the corresponding storage container, calculate the digital fingerprint corresponding to the local page content and store the digital fingerprint in correspondence with the local page content;
A time setting unit, configured to, after a set time interval, reacquire the local page content corresponding to the content to be crawled as the content to be verified, and calculate the digital fingerprint to be verified corresponding to the content to be verified;
A fingerprint acquisition unit, configured to obtain, from the storage container, the digital fingerprint of the local page content that matches the content to be verified as the original fingerprint;
A feature value parsing unit, configured to parse new key feature values from the content to be verified if the original fingerprint differs from the fingerprint to be verified;
A feature value update unit, configured to update, in the storage container, the stored content to be verified and the new key feature values.
The above product can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Embodiment eight
Fig. 8 is a schematic structural diagram of a computer device provided by Embodiment Eight of the present invention. Fig. 8 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 8 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 8, the computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer system readable media. These media may be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 8, commonly called a "hard disk drive"). Although not shown in Fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM or other optical medium) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules that are configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. The computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in the drawings, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 runs the programs stored in the system memory 28 to execute various functional applications and data processing, for example to implement the web page content crawling method provided by the embodiments of the present invention.
That is: obtain custom configuration information input by a user, where the custom configuration information includes: a crawl entry address corresponding to the content to be crawled;
Crawl at least one piece of web page content according to the custom configuration information;
Parse the crawled web page content to obtain the feature values corresponding to the crawl features associated with the web page content;
Store the obtained feature values in the corresponding storage container.
Embodiment nine
Embodiment Nine of the present invention provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the web page content crawling method provided by the embodiments of the present application.
That is: obtain custom configuration information input by a user, where the custom configuration information includes: a crawl entry address corresponding to the content to be crawled;
Crawl at least one piece of web page content according to the custom configuration information;
Parse the crawled web page content to obtain the feature values corresponding to the crawl features associated with the web page content;
Store the obtained feature values in the corresponding storage container.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; without departing from the inventive concept, it may also include other equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.

Claims (16)

1. A method for crawling web page content, characterized by comprising:
Obtaining custom configuration information input by a user, wherein the custom configuration information includes: a crawl entry address corresponding to the content to be crawled;
Crawling at least one piece of web page content according to the custom configuration information;
Parsing the crawled web page content to obtain the feature values corresponding to the crawl features associated with the web page content;
Storing the obtained feature values in the corresponding storage container.
2. The method according to claim 1, characterized in that, before obtaining the custom configuration information input by the user, the method further comprises:
Determining the main entry address corresponding to the content to be crawled;
Entering a search term associated with the content to be crawled in the search box of the home page corresponding to the main entry address, and triggering a search;
Determining the crawl entry address corresponding to the content to be crawled according to the search results returned by the home page.
3. The method according to claim 1 or 2, characterized in that the custom configuration information further includes: at least one key crawl feature;
Parsing the crawled web page content to obtain the feature values corresponding to the crawl features associated with the web page content comprises:
Extracting, from the crawled web page content, the local page content associated with the key crawl feature, and parsing from the local page content the key feature value corresponding to the key crawl feature;
Storing the obtained feature values in the corresponding storage container comprises:
Storing the obtained local page content and the key feature value in the corresponding storage container.
4. The method according to claim 3, characterized in that crawling at least one piece of web page content according to the custom configuration information comprises:
Crawling the web page content of the target web page corresponding to the crawl entry address;
Obtaining a web page link from the web page content, and extracting the web page description feature corresponding to the web page link;
If the web page description feature and the key crawl feature satisfy a similarity condition, crawling the web page content corresponding to the web page link;
Continuing to iteratively search for new web page links in the web page content corresponding to the web page link and crawl the corresponding web page content, until a preset mining depth condition is met;
Returning to the operation of obtaining a web page link from the web page content and extracting the web page description feature corresponding to the web page link, until all web page links in the target web page have been processed.
5. The method according to claim 4, characterized in that the obtained web page links include: explicit web page links and/or hidden web page links.
6. The method according to claim 1, characterized in that the custom configuration information further includes: enabling the verification code cracking function;
Crawling at least one piece of web page content according to the custom configuration information comprises:
If the web page corresponding to the crawl entry address and/or the web page link includes a verification code, selecting a recognition strategy matching the type of the verification code and using it to recognize the verification code;
Entering the recognized verification code into the verification code input box in the web page, and triggering confirmation of the verification code;
Crawling the web page content of the web page returned after the verification code is confirmed.
7. The method according to claim 3, characterized in that the custom configuration information further includes: enabling the digital fingerprint function;
After storing the obtained feature values in the corresponding storage container, the method further comprises:
Calculating the digital fingerprint corresponding to the local page content, and storing the digital fingerprint in correspondence with the local page content;
After a set time interval, reacquiring the local page content corresponding to the content to be crawled as the content to be verified, and calculating the digital fingerprint to be verified corresponding to the content to be verified;
Obtaining, from the storage container, the digital fingerprint of the local page content that matches the content to be verified as the original fingerprint;
If the original fingerprint differs from the fingerprint to be verified, parsing new key feature values from the content to be verified;
Updating, in the storage container, the stored content to be verified and the new key feature values.
8. A device for crawling web page content, characterized by comprising:
An information acquisition module, configured to obtain custom configuration information input by a user, wherein the custom configuration information includes: a crawl entry address corresponding to the content to be crawled;
A web page content crawling module, configured to crawl at least one piece of web page content according to the custom configuration information;
A feature value acquisition module, configured to parse the crawled web page content and obtain the feature values corresponding to the crawl features associated with the web page content;
A feature value storage module, configured to store the obtained feature values in the corresponding storage container.
9. The device according to claim 8, characterized by further comprising an address acquisition module, which includes:
A main entry address determination unit, configured to determine the main entry address corresponding to the content to be crawled before the custom configuration information input by the user is obtained;
A search trigger unit, configured to enter a search term associated with the content to be crawled in the search box of the home page corresponding to the main entry address, and to trigger a search;
A crawl entry determination unit, configured to determine the crawl entry address corresponding to the content to be crawled according to the search results returned by the home page.
10. The device according to claim 8 or 9, characterized in that the custom configuration information further includes: at least one key crawl feature;
The feature value acquisition module is specifically configured to:
Extract, from the crawled web page content, the local page content associated with the key crawl feature, and parse from the local page content the key feature value corresponding to the key crawl feature;
The feature value storage module is specifically configured to:
Store the obtained local page content and the key feature value in the corresponding storage container.
11. The device according to claim 10, characterized in that the web page content crawling module is specifically configured to:
Crawl the web page content of the target web page corresponding to the crawl entry address;
Obtain a web page link from the web page content, and extract the web page description feature corresponding to the web page link;
If the web page description feature and the key crawl feature satisfy a similarity condition, crawl the web page content corresponding to the web page link;
Continue to iteratively search for new web page links in the web page content corresponding to the web page link and crawl the corresponding web page content, until a preset mining depth condition is met;
Return to the operation of obtaining a web page link from the web page content and extracting the web page description feature corresponding to the web page link, until all web page links in the target web page have been processed.
12. The device according to claim 11, characterized in that the obtained web page links include: explicit web page links and/or hidden web page links.
13. The device according to claim 8, characterized in that the custom configuration information further includes: enabling the verification code cracking function;
The web page content crawling module includes:
A verification code selection unit, configured to, if the web page corresponding to the crawl entry address and/or the web page link includes a verification code, select a recognition strategy matching the type of the verification code and use it to recognize the verification code;
A verification code input unit, configured to enter the recognized verification code into the verification code input box in the web page and trigger confirmation of the verification code;
A content crawling unit, configured to crawl the web page content of the web page returned after the verification code is confirmed.
14. The device according to claim 10, characterized in that the custom configuration information further includes: enabling the digital fingerprint function;
Correspondingly, the device further comprises a feature value update module, which includes:
A fingerprint calculation unit, configured to, after the obtained feature values are stored in the corresponding storage container, calculate the digital fingerprint corresponding to the local page content and store the digital fingerprint in correspondence with the local page content;
A time setting unit, configured to, after a set time interval, reacquire the local page content corresponding to the content to be crawled as the content to be verified, and calculate the digital fingerprint to be verified corresponding to the content to be verified;
A fingerprint acquisition unit, configured to obtain, from the storage container, the digital fingerprint of the local page content that matches the content to be verified as the original fingerprint;
A feature value parsing unit, configured to parse new key feature values from the content to be verified if the original fingerprint differs from the fingerprint to be verified;
A feature value update unit, configured to update, in the storage container, the stored content to be verified and the new key feature values.
15. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-7 when executing the program.
16. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the method according to any one of claims 1-7 when executed by a processor.
CN201710120775.6A 2017-03-02 2017-03-02 Grasping means, device, equipment and the storage medium of web page contents Pending CN108536699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710120775.6A CN108536699A (en) 2017-03-02 2017-03-02 Grasping means, device, equipment and the storage medium of web page contents

Publications (1)

Publication Number Publication Date
CN108536699A true CN108536699A (en) 2018-09-14

Family

ID=63489273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710120775.6A Pending CN108536699A (en) 2017-03-02 2017-03-02 Grasping means, device, equipment and the storage medium of web page contents

Country Status (1)

Country Link
CN (1) CN108536699A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188259A (en) * 2019-05-27 2019-08-30 厦门商集网络科技有限责任公司 A kind of data grab method and device of configurableization

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106219A (en) * 2011-11-15 2013-05-15 盛乐信息技术(上海)有限公司 Method and system of web page grabbing
US20130290344A1 (en) * 2012-04-27 2013-10-31 Eric Glover Updating a search index used to facilitate application searches
CN103714140A (en) * 2013-12-23 2014-04-09 北京锐安科技有限公司 Searching method and device based on topic-focused web crawler
CN104346328A (en) * 2013-07-23 2015-02-11 同程网络科技股份有限公司 Vertical intelligent crawler data collecting method based on webpage data capture
CN104866517A (en) * 2014-12-30 2015-08-26 智慧城市信息技术有限公司 Method and device for capturing webpage content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180914)