CN107480264B - Web crawler deduplication method and computing device - Google Patents

Web crawler deduplication method and computing device

Info

Publication number
CN107480264B
Authority
CN
China
Prior art keywords
page
hyperlink
target site
value
hyperlink object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710706059.6A
Other languages
Chinese (zh)
Other versions
CN107480264A (en
Inventor
郭宝军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd
Priority to CN201710706059.6A
Publication of CN107480264A
Application granted
Publication of CN107480264B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/954 Navigation, e.g. using categorised browsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web crawler deduplication method suitable for execution in a computing device. The computing device is preconfigured with horizontal and vertical coordinate thresholds for the page visual range and with a decision threshold for each parameter-navigation website. The method comprises: downloading and rendering the resources related to a target site to obtain the visible page information of the target site, and establishing a corresponding page visual coordinate system; extracting the coordinate information of each hyperlink object from the rendered page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system; if at least one of the horizontal and vertical coordinate values of a hyperlink object is less than or equal to the corresponding coordinate threshold, determining that the hyperlink object is within the page visual range; and counting the hyperlink objects within the page visual range and, if the count is greater than or equal to the decision threshold, determining that the target site is a parameter-navigation website and omitting deduplication of the hyperlink objects within the page visual range during crawling.

Description

Web crawler deduplication method and computing device
Technical field
The present invention relates to the field of Internet technology, and in particular to a web crawler deduplication method and a computing device.
Background
With the rapid development of networks, the World Wide Web has become the carrier of a vast amount of information, and web crawlers emerged to extract and use this information efficiently. To crawl site information efficiently, a web crawler needs to deduplicate similar pages within a website. However, the structure of today's websites varies widely, which makes crawler deduplication difficult. Consider websites based on parameter navigation: such a site reaches different logical pages through the same URL, which usually contains some specific parameter, and the back-end application decides which page to jump to according to the value of that parameter.
At present, web crawlers mostly deduplicate pages with an algorithm based on a parameter count. For example, if the deduplication parameter count is configured as 4, then during extraction of site page information, pages whose URLs carry the same parameter are extracted at most 4 times, and any further URL of that kind is skipped. When this method is applied to a parameter-navigation website, however, pages are easily filtered out by the crawler's deduplication algorithm, so that their information is missed.
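The limitation described above can be illustrated with a minimal Python sketch of a count-based deduplication filter. The function name `dedup_by_param_count`, the grouping key (path plus parameter names) and the sample URLs are assumptions made for illustration; the patent does not specify how existing crawlers implement this step.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def dedup_by_param_count(urls, max_per_key=4):
    """Keep at most `max_per_key` URLs per (path, parameter names) group.

    Illustrative baseline only; the grouping key is an assumption.
    """
    seen = defaultdict(int)
    kept, dropped = [], []
    for url in urls:
        parts = urlsplit(url)
        # URLs that share a path and parameter names fall into one group,
        # whatever their parameter values are.
        names = tuple(sorted(
            name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
        ))
        key = (parts.path, names)
        if seen[key] < max_per_key:
            seen[key] += 1
            kept.append(url)
        else:
            dropped.append(url)
    return kept, dropped


# Hypothetical navigation links of a parameter-navigation website.
pages = ["inside", "about", "careers", "investor", "other", "update"]
urls = [f"/index.jsp?content={page}.html" for page in pages]
kept, dropped = dedup_by_param_count(urls)
print(len(kept), dropped)
```

With a limit of 4, the last two navigation pages land in `dropped`, which is exactly the missed-extraction problem the background describes.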
Therefore, a web crawler method is needed that can effectively prevent information from being missed during extraction.
Summary of the invention
To this end, the present invention provides a web crawler deduplication method and a computing device, in an effort to solve, or at least alleviate, the problems described above.
According to one aspect of the present invention, a web crawler deduplication method is provided that is suitable for execution in a computing device, the computing device being preconfigured with horizontal and vertical coordinate thresholds for the page visual range and with a decision threshold for each parameter-navigation website. The method comprises: downloading and rendering the resources related to a target site to obtain the visible page information of the target site, and establishing a page visual coordinate system corresponding to the visible page; extracting, from the rendered visible page information, the coordinate information of each hyperlink object presented in the page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system; for a given hyperlink object, if at least one of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, determining that the hyperlink object is within the page visual range, and otherwise that it is not; and counting the hyperlink objects within the page visual range and, if the count is greater than or equal to the decision threshold, determining that the target site is a parameter-navigation website and omitting deduplication of the hyperlink objects within the page visual range during crawling.
Optionally, the method according to the invention further comprises: if the target site is determined to be a parameter-navigation website, deduplicating the hyperlink objects that are not within the page visual range; and if the number of hyperlink objects within the page visual range of a target site is less than the decision threshold, determining that the target site is not a parameter-navigation website and deduplicating all hyperlink objects in the target site.
Optionally, in the method according to the invention, the decision threshold is a quantity threshold for hyperlink objects in the target site that have the same path and the same parameter but different parameter values.
Optionally, in the method according to the invention, the horizontal and vertical coordinate thresholds of the page visual range are each 200 pixels.
Optionally, in the method according to the invention, the step of downloading and rendering the resources related to the target site to obtain the visible page information of the target site is suitable for execution by a WebKit-kernel browser engine.
Optionally, in the method according to the invention, each hyperlink object has a text box, and the coordinate information is suitable for extraction with the geometry() method of .QtWebKit.QWebElement. In the extracted content, each key is the content of a hyperlink object, and each value includes the horizontal and vertical coordinate values of the top-left vertex of the text box in which that hyperlink object is located.
Optionally, in the method according to the invention, the coordinate information further includes the length and height of the text box in which the hyperlink object is located.
Optionally, in the method according to the invention, the page visual coordinate system takes the upper-left corner of the browser interface as its origin, with the X axis running to the right of the origin and the Y axis running downward.
According to another aspect of the present invention, a computing device is provided, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
According to yet another aspect of the present invention, a computer-readable storage medium storing one or more programs is provided. The one or more programs include instructions which, when executed by a computing device, cause the computing device to perform any of the methods described above.
According to the technical solution provided by the invention, after the visible page of a target site is opened by rendering, the number of hyperlink objects located within the page visual range is counted in order to judge whether the target site is a parameter-navigation website. When a website is judged to be a parameter-navigation website, the hyperlink objects within the page visual range are no longer deduplicated, and only the hyperlink objects outside the page visual range are deduplicated. This effectively avoids missed information extraction when crawling a parameter-navigation website.
Brief description of the drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a schematic diagram of a computing device 100 according to an embodiment of the present invention; and
Fig. 2 shows a flowchart of a web crawler deduplication method 200 according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, the computing device 100 typically comprises a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, a processor core 114 and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used together with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122 and program data 124. In some embodiments, the applications 122 may be arranged to operate with the program data 124 on the operating system. The program data 124 includes instructions; in the computing device 100 according to the invention, the program data 124 includes instructions for performing the web crawler deduplication method 200.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144 and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices such as a display or loudspeakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate, via one or more I/O ports 158, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
The network communication link may be one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media.
The computing device 100 may be implemented as a server, such as a file server, database server, application server or web server. It may also be implemented as part of a small-sized portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player device, a wireless web browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 may also be implemented as a personal computer, including both desktop and notebook configurations. In some embodiments, the computing device 100 is configured to perform the web crawler deduplication method 200 according to the invention.
As mentioned above, the parameter-value count configured for current crawler page deduplication algorithms is difficult to match to a website. Either page information is missed (the configured count is too small for the website), or the crawler fetches too many copies of the same page (the configured count is too large for the website), which reduces crawler efficiency. When such a deduplication algorithm is used in an application that crawls multiple sites, the probability of a mismatch between the configured value and a website is essentially 100%. For the first case, suppose the parameter-value count is set to 4 and a site has the six pages index.jsp?content=inside.html, index.jsp?content=about.html, index.jsp?content=careers.html, index.jsp?content=investor.html, index.jsp?content=other.html and index.jsp?content=update.html. Only 4 of these 6 pages have their information extracted; the other 2 are filtered out by the crawler's deduplication algorithm, so the information on the filtered pages is never extracted. To address this, the present invention provides a web crawler method 200 that both preserves crawler efficiency and effectively prevents missed extraction. It works mainly on visual coordinates and can effectively solve the page deduplication problem for crawlers on parameter-navigation websites.
Fig. 2 shows a flowchart of the web crawler deduplication method 200 according to an embodiment of the present invention. The method 200 is suitable for execution in a computing device (such as the computing device 100 described above). As shown in Fig. 2, the method 200 starts at step S220.
In step S220, the resources related to the target site are downloaded and rendered to obtain the visible page information of the target site, and a page visual coordinate system corresponding to the visible page is established.
Specifically, a WebKit-kernel browser engine can be used to download and render the resources related to the target URL, such as JavaScript scripts and CSS style files. The page visual coordinate system can be configured as needed. For example, the upper-left corner of the browser interface may be set as the origin, with the X axis running to the right of the origin and the Y axis running downward; or the upper-left corner of the content display area of the target site's visible page may be chosen as the origin, again with the X axis to the right and the Y axis downward; or the left end of the lower boundary line of the browser's bookmark bar may be chosen as the origin. These are only illustrative: the origin may also be placed at the lower-left corner as needed, and the directions of the X and Y axes may likewise be adjusted. The invention is not limited in this respect.
Then, in step S240, the coordinate information of each hyperlink object presented in the page is extracted from the rendered visible page information, where the coordinate information includes the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system.
The visible page of a target site contains multiple hyperlink objects. For example, the home page of the 360 browser has hyperlink objects titled "Tmall", "JD.com" and "Taobao", and clicking one of them jumps to the corresponding Tmall or JD.com home page. In the rendered visible page, each hyperlink object has a text box that defines its clickable area, and a click is effective only within that text box. The horizontal and vertical coordinate values of each hyperlink object obtained in step S240 actually refer to the coordinates of the top-left vertex of the text box in which the hyperlink object is located. In addition, the coordinate information of each hyperlink object also includes the length and height of that text box.
According to one embodiment, the geometry() method of .QtWebKit.QWebElement can be used to extract the coordinate information of each hyperlink object. In the extracted content, each key is the content of a hyperlink object, and each value holds the horizontal and vertical coordinate values of the top-left vertex of the text box in which that hyperlink object is located; it may also include the length and height of the text box. For example, extracting information with the geometry() method may yield {"./index.jsp?content=inside.html": "(5,5,80,40)"}. The key of this result dictionary is the hyperlink, and the coordinates (5,5) can be read from the value.
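As a minimal sketch of how such an extraction result could be consumed, the following Python snippet parses a value of the form "(x,y,length,height)" into a typed record. The names `LinkBox` and `parse_geometry`, the `?` separator in the sample URL, and the assumption that the values are stored as strings in exactly this form are illustrative; the rendering step itself is not reproduced here.

```python
import re
from typing import NamedTuple


class LinkBox(NamedTuple):
    """Text box of a hyperlink object in the page visual coordinate system."""
    x: int       # horizontal coordinate of the top-left vertex
    y: int       # vertical coordinate of the top-left vertex
    width: int   # length of the text box
    height: int  # height of the text box


def parse_geometry(value: str) -> LinkBox:
    """Parse a string such as "(5,5,80,40)" into a LinkBox."""
    numbers = [int(n) for n in re.findall(r"-?\d+", value)]
    if len(numbers) != 4:
        raise ValueError(f"expected 4 integers, got {value!r}")
    return LinkBox(*numbers)


# Hypothetical extraction result in the shape shown in the example above.
extracted = {"./index.jsp?content=inside.html": "(5,5,80,40)"}
boxes = {href: parse_geometry(value) for href, value in extracted.items()}
box = boxes["./index.jsp?content=inside.html"]
print((box.x, box.y))  # prints (5, 5), the top-left vertex
```

Only the top-left vertex is needed for the visual-range test in the next step; the length and height are kept because the coordinate information may include them.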
Then, in step S260, for a given hyperlink object, if at least one of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, the hyperlink object is determined to be within the page visual range; otherwise it is not.
The horizontal and vertical coordinate thresholds of the page visual range are stored in advance in the computing device 100, and they are used to judge whether a hyperlink object is within the page visual range. Both values can be configured through an algorithm initialization module. According to one embodiment, the horizontal and vertical coordinate thresholds can both be set to 200 pixels; that is, a hyperlink object is considered to be within the page visual range when at least one of its horizontal and vertical coordinate values is less than or equal to 200 pixels. The thresholds can also be configured flexibly according to the size of the browser interface; the invention is not limited in this respect.
If the horizontal and vertical coordinate thresholds are both set to 200 pixels, the hyperlink objects {"./index.jsp?content=inside.html": "(5,5)", "./index.jsp?content=about.html": "(85,5)", "./index.jsp?content=careers.html": "(165,5)", "./index.jsp?content=investor.html": "(245,5)", "./index.jsp?content=other.html": "(325,5)", "./index.jsp?content=update.html": "(405,5)"} are within the page visual range, because each has a vertical coordinate of 5. The hyperlink objects {"/ask.jsp?date=0601": "(300,400)", "/ask.jsp?date=0602": "(300,450)", "/ask.jsp?date=0603": "(300,500)", "/ask.jsp?date=0604": "(300,550)", "/ask.jsp?date=0605": "(300,600)"} are not within the page visual range, because both of their coordinates exceed 200.
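The visual-range test of step S260 can be sketched in Python as follows. The function name `in_visual_range` and the restoration of the `?` query separator in the sample links are assumptions; the coordinates and the 200-pixel thresholds are taken from the example above.

```python
X_THRESHOLD = 200  # horizontal coordinate threshold in pixels
Y_THRESHOLD = 200  # vertical coordinate threshold in pixels


def in_visual_range(x: int, y: int) -> bool:
    """A link is in range if either coordinate is within its threshold."""
    return x <= X_THRESHOLD or y <= Y_THRESHOLD


# Top-left vertices of the hyperlink text boxes from the example.
links = {
    "./index.jsp?content=inside.html": (5, 5),
    "./index.jsp?content=about.html": (85, 5),
    "./index.jsp?content=careers.html": (165, 5),
    "./index.jsp?content=investor.html": (245, 5),
    "./index.jsp?content=other.html": (325, 5),
    "./index.jsp?content=update.html": (405, 5),
    "/ask.jsp?date=0601": (300, 400),
    "/ask.jsp?date=0602": (300, 450),
    "/ask.jsp?date=0603": (300, 500),
    "/ask.jsp?date=0604": (300, 550),
    "/ask.jsp?date=0605": (300, 600),
}

visible = [href for href, (x, y) in links.items() if in_visual_range(x, y)]
outside = [href for href in links if href not in visible]
print(len(visible), len(outside))
```

The test uses a logical OR, so a link in a top navigation bar stays in range however far to the right it sits, and a link in a left-hand menu stays in range however far down it sits.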
Then, in step S280, the hyperlink objects within the page visual range are counted. If the count is greater than or equal to the decision threshold for parameter-navigation websites, the target site is determined to be a parameter-navigation website, and deduplication of the hyperlink objects within the page visual range is omitted during crawling. Conversely, if the count is less than the decision threshold, the target site is determined not to be a parameter-navigation website, and all hyperlink objects in the target site are deduplicated.
The decision threshold for each parameter-navigation website is also stored in advance in the computing device 100, in the form count(path + parameter + '=') > num, where the threshold num is a quantity threshold for hyperlink objects that have the same path and the same parameter but different parameter values. The decision threshold is set at initialization and can be configured through an external interface. Different navigation parameters may have different decision thresholds, and the decision threshold may also be adjusted according to the scale of the target site.
For example, in the example above, 6 hyperlink objects are located within the page visual range; that is, there are 6 hyperlink objects with the same path, the same parameter and different parameter values. If the decision threshold for parameter-navigation websites is initially set to 4, the website is determined to be a parameter-navigation website, and the hyperlink objects within the page visual range are then typically navigation hyperlinks. If the decision threshold is initially set to 7, the website is determined not to be a parameter-navigation website.
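The decision in step S280 can be sketched in Python as below. The helper names `count_nav_candidates` and `is_param_nav_site`, the exact key format (path, `?`, parameter name, `=`) and the sample hrefs are assumptions for illustration. The comparison uses "greater than or equal to", following the wording of step S280.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def count_nav_candidates(hrefs):
    """Count distinct parameter values per path + parameter + '=' key."""
    values = defaultdict(set)
    for href in hrefs:
        parts = urlsplit(href)
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            values[f"{parts.path}?{name}="].add(value)
    return {key: len(vals) for key, vals in values.items()}


def is_param_nav_site(visible_hrefs, decision_threshold):
    """True if any key reaches the decision threshold."""
    counts = count_nav_candidates(visible_hrefs)
    return any(n >= decision_threshold for n in counts.values())


# The six hyperlink objects found within the page visual range.
pages = ["inside", "about", "careers", "investor", "other", "update"]
visible = [f"./index.jsp?content={page}.html" for page in pages]

counts = count_nav_candidates(visible)
print(counts)
print(is_param_nav_site(visible, 4), is_param_nav_site(visible, 7))
```

With 6 distinct values for the `content` parameter, a threshold of 4 classifies the site as a parameter-navigation website and a threshold of 7 does not, matching the two cases in the text.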
If a target site is determined to be a parameter-navigation website, the crawler does not deduplicate any hyperlink object within the page visual range; that is, information is extracted from all of them, for example from the pages of all 6 hyperlink objects above. Hyperlink objects that are not within the page visual range undergo normal deduplication. For example, if the 6 hyperlink objects {"/ask.jsp?date=0601", "/ask.jsp?date=0602", "/ask.jsp?date=0603", "/ask.jsp?date=0604", "/ask.jsp?date=0605", "/ask.jsp?date=0606"} are not within the page visual range, only the information of the first 4 is extracted and the last 2 are filtered out. In addition, if a target site is determined not to be a parameter-navigation website, all hyperlink objects in that target site are likewise deduplicated.
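Putting the steps together, the following Python sketch selects which links to crawl. It is a simplified illustration under stated assumptions, not the patented implementation: `select_links` and `param_key` are hypothetical names, the ordinary deduplication is modelled as a per-key limit of 4, and the coordinate assumed for "/ask.jsp?date=0606" (300, 650) is not given in the text.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit

COORD_THRESHOLD = 200  # pixels, as in the embodiment
DEDUP_LIMIT = 4        # assumed per-key limit of the ordinary deduplication


def param_key(href):
    """Return the path plus parameter names, ignoring parameter values."""
    parts = urlsplit(href)
    names = sorted(
        name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
    )
    return parts.path + "?" + "&".join(name + "=" for name in names)


def select_links(links, nav_threshold):
    """links maps href -> (x, y); returns (is_nav_site, hrefs to crawl)."""
    visible = [
        href for href, (x, y) in links.items()
        if x <= COORD_THRESHOLD or y <= COORD_THRESHOLD
    ]
    # Group visible links that share a path and parameter names.
    groups = defaultdict(set)
    for href in visible:
        groups[param_key(href)].add(href)
    is_nav_site = any(len(group) >= nav_threshold for group in groups.values())

    # On a parameter-navigation site, visible links skip deduplication.
    exempt = set(visible) if is_nav_site else set()
    kept_per_key = defaultdict(int)
    selected = []
    for href in links:
        if href in exempt:
            selected.append(href)
            continue
        key = param_key(href)
        if kept_per_key[key] < DEDUP_LIMIT:
            kept_per_key[key] += 1
            selected.append(href)
    return is_nav_site, selected


pages = ["inside", "about", "careers", "investor", "other", "update"]
links = {
    f"/index.jsp?content={page}.html": (5 + 80 * i, 5)
    for i, page in enumerate(pages)
}
# Assumed positions for the six date links, all outside the visual range.
links.update({f"/ask.jsp?date=06{d:02d}": (300, 350 + 50 * d) for d in range(1, 7)})

is_nav, selected = select_links(links, nav_threshold=4)
not_nav, selected_all_dedup = select_links(links, nav_threshold=7)
print(is_nav, len(selected), not_nav, len(selected_all_dedup))
```

With a decision threshold of 4 the site counts as a parameter-navigation website, so all 6 navigation links plus the first 4 date links are crawled; with a threshold of 7 every link is deduplicated and only 4 links of each kind remain.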
The reasoning behind this processing is that, in a parameter-navigation website, the navigation links are generally located on the left side or at the top of the page, the hyperlink objects are positioned close together, and there are relatively many of them. Hyperlink objects within a certain region (for example, within 200px) are therefore identified from the page visual coordinates, and the number of hyperlink objects in that region is used to judge whether the target site is a parameter-navigation website. For a parameter-navigation website, the hyperlink objects within the page visual range are navigation hyperlinks and are generally very important, so all of their information can be extracted to keep the crawl results complete. Hyperlink objects outside the page visual range can undergo appropriate deduplication to improve the overall crawling efficiency for the target site. Finally, for a website that is not of the parameter-navigation type, the page as a whole has no obvious distinction between important and secondary parts, so the whole page can be deduplicated. Treating different pages differently in this way both avoids missing key website information and solves the problem of reduced crawler efficiency caused by crawling too many identical pages.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in tangible media such as floppy disks, CD-ROMs, hard disk drives or any other machine-readable storage medium, wherein, when the program is loaded into a machine such as a computer and executed by the machine, the machine becomes an apparatus for practicing the invention.
Where the program code executes on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device and at least one output device. The memory is configured to store the program code, and the processor is configured to execute the web crawler deduplication method of the invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
In the description provided here, the algorithms and displays are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the examples of the invention. From the description above, the structure required to construct such systems is apparent. Moreover, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the module of the equipment in example disclosed herein or unit or groups Part can be arranged in equipment as depicted in this embodiment, or alternatively can be positioned at and the equipment in the example In different one or more equipment.Module in aforementioned exemplary can be combined into a module or furthermore be segmented into multiple Submodule.
Those skilled in the art will understand that the modules of the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may further be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal numbers "first", "second", "third", and so on to describe an ordinary object merely indicates that different instances of similar objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (10)

1. A web crawler deduplication method, suitable for execution in a computing device, wherein horizontal and vertical coordinate thresholds of a page visual range and a decision threshold for a parameter navigation website are set in the computing device, the method comprising:
downloading and rendering resources related to a target site to obtain visible page information of the target site, and establishing a page visual coordinate system corresponding to the visible page;
extracting, from the rendered visible page information, coordinate information of each hyperlink object presented in the page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system;
for a given hyperlink object, if at least one of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, determining that the hyperlink object is within the page visual range, and otherwise that it is not within the page visual range; and
counting the number of hyperlink objects within the page visual range and, if the number is greater than or equal to the decision threshold, determining that the target site is a parameter navigation website and omitting deduplication of the hyperlink objects within the page visual range during crawling.
2. The method of claim 1, further comprising:
if the target site is determined to be a parameter navigation website, performing deduplication on the hyperlink objects that are not within the page visual range; and
if the number of hyperlink objects within the page visual range in a target site is less than the decision threshold, determining that the target site is not a parameter navigation website, and performing deduplication on all hyperlink objects in the target site.
3. The method of claim 1, wherein the decision threshold is a decision threshold for hyperlink objects in the target site that have the same path and the same parameters but different parameter values.
4. The method of claim 1, wherein the horizontal and vertical coordinate thresholds of the page visual range are 200 pixels.
5. The method of claim 1, wherein the step of downloading and rendering resources related to the target site to obtain the visible page information of the target site is suitable for execution by a browser engine of the WebKit kernel type.
6. The method of claim 1, wherein each hyperlink object has a text box, and the coordinate information is suitable for extraction using the .QtWebKit.QWebElement geometry() method, in which the key of the extracted content is the content of each hyperlink object and the value includes the horizontal and vertical coordinate values of the top-left vertex of the text box in which each hyperlink object is located.
7. The method of claim 1, wherein the coordinate information further includes the length and height of the text box in which the hyperlink object is located.
8. The method of claim 1, wherein the page visual coordinate system takes the top-left corner of the browser interface as its origin, with the X-axis pointing right from the origin and the Y-axis pointing down.
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1 to 8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods of claims 1 to 8.
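
The decision logic recited in claims 1 to 4 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the `Link` record, the function names, the `DECISION_THRESHOLD` value of 3, and the choice of (host, path, parameter names) as the deduplication key are all hypothetical. The claims specify only the 200-pixel coordinate threshold and the comparison rules. The coordinates are assumed to have been extracted already from a rendered page.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit

COORD_THRESHOLD_PX = 200  # claim 4: threshold for both axes
DECISION_THRESHOLD = 3    # hypothetical value; the claims give no number


@dataclass(frozen=True)
class Link:
    """Hypothetical record: a hyperlink and the top-left corner of its text box."""
    url: str
    x: int
    y: int


def in_visual_range(link, tx=COORD_THRESHOLD_PX, ty=COORD_THRESHOLD_PX):
    # Claim 1: in range if at least one coordinate is <= its threshold.
    return link.x <= tx or link.y <= ty


def pattern_key(url):
    # Assumed dedup key: same host, path, and parameter names,
    # regardless of parameter values.
    parts = urlsplit(url)
    names = {k for k, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return (parts.netloc, parts.path, tuple(sorted(names)))


def select_links(links, decision_threshold=DECISION_THRESHOLD):
    """Return (is_parameter_navigation_site, links kept for crawling)."""
    visible_count = sum(1 for link in links if in_visual_range(link))
    is_nav = visible_count >= decision_threshold
    kept, seen = [], set()
    for link in links:
        if is_nav and in_visual_range(link):
            kept.append(link)  # claim 1: deduplication is omitted here
            continue
        key = pattern_key(link.url)  # claim 2: deduplicate the rest
        if key not in seen:
            seen.add(key)
            kept.append(link)
    return is_nav, kept
```

Under these assumptions, a page with three links of the form /list?cat=N near the left edge meets a decision threshold of 3, so all three are kept, while same-pattern links outside the visual range are still collapsed to one.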
CN201710706059.6A 2017-08-17 2017-08-17 Web crawler deduplication method and computing device Active CN107480264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710706059.6A CN107480264B (en) 2017-08-17 2017-08-17 Web crawler deduplication method and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710706059.6A CN107480264B (en) 2017-08-17 2017-08-17 Web crawler deduplication method and computing device

Publications (2)

Publication Number Publication Date
CN107480264A CN107480264A (en) 2017-12-15
CN107480264B true CN107480264B (en) 2019-11-15

Family

ID=60599809

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710706059.6A Active CN107480264B (en) 2017-08-17 2017-08-17 Web crawler deduplication method and computing device

Country Status (1)

Country Link
CN (1) CN107480264B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062075B (en) * 2019-12-11 2022-12-23 三一筑工科技股份有限公司 Beam-column one-pen-hoop model generation method and device and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101114285A (en) * 2006-07-25 2008-01-30 腾讯科技(深圳)有限公司 Internet topics file searching method, crawler system and search engine
CN101630330A (en) * 2009-08-14 2010-01-20 苏州锐创通信有限责任公司 Method for webpage classification
CN102880607A (en) * 2011-07-15 2013-01-16 舆情(香港)有限公司 Dynamic network content grabbing method and dynamic network content crawler system
CN103279507A (en) * 2013-05-16 2013-09-04 北京尚友通达信息技术有限公司 Webpage spider operational method and system
CN106570023A (en) * 2015-10-10 2017-04-19 北京国双科技有限公司 Customized method and device for deleting repetitions of crawler system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
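Research on a URL deduplication method in web crawler systems (一种网络爬虫系统中URL去重方法研究); Cheng Gong et al.; China New Technologies and New Products (《中国新技术新产品》); 20140625; p. 23 *
A WebKit-based web crawler (基于webkit的网络爬虫); Guo Jincheng et al.; Modern Electronics Technique (《现代电子技术》); 20130915; pp. 62-64, 68 *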

Also Published As

Publication number Publication date
CN107480264A (en) 2017-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant