CN107480264B - Web crawler deduplication method and computing device - Google Patents
Web crawler deduplication method and computing device
- Publication number
- CN107480264B CN201710706059.6A
- Authority
- CN
- China
- Prior art keywords
- page
- hyperlinked
- targeted sites
- value
- hyperlinked object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/954—Navigation, e.g. using categorised browsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web crawler deduplication method suitable for execution on a computing device. The computing device is configured with horizontal and vertical coordinate thresholds for a page visual range and with a decision threshold for each parameter-navigation website. The method comprises: downloading and rendering resources related to a target site to obtain the visible page information of the target site, and establishing a corresponding page visual coordinate system; extracting the coordinate information of each hyperlink object from the rendered page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system; if either the horizontal or the vertical coordinate value of a hyperlink object is less than or equal to the corresponding coordinate threshold, determining that the hyperlink object is within the page visual range; and counting the hyperlink objects within the page visual range and, if the count is greater than or equal to the decision threshold, determining that the target site is a parameter-navigation website and skipping deduplication of the hyperlink objects within the page visual range during crawling.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a web crawler deduplication method and a computing device.
Background
With the rapid development of networks, the World Wide Web has become a carrier of vast amounts of information, and web crawlers emerged to extract and use this information efficiently. To crawl site information efficiently, a web crawler needs to deduplicate similar pages within a website. However, today's websites vary widely in structure, which makes crawler deduplication difficult. One example is the parameter-navigation website. This type of site uses a single URL to jump between logically different pages; the URL usually contains a specific parameter, and the back-end application decides which page to serve according to the value of that parameter.
At present, web crawler page deduplication mostly relies on algorithms based on a parameter count. For example, if the deduplication parameter count is configured as 4, then during site page extraction, once 4 pages whose URLs share the same parameter have been extracted, further URLs with that parameter are not extracted. When this approach is applied to a parameter-navigation website, however, pages are easily filtered out by the crawler's deduplication algorithm, so their information is never extracted.
Therefore, a web crawler method is needed that can effectively prevent such missed extraction.
Summary of the invention
To this end, the present invention provides a web crawler deduplication method and a computing device that seek to solve, or at least alleviate, the problems described above.
According to one aspect of the invention, a web crawler deduplication method is provided, suitable for execution on a computing device in which horizontal and vertical coordinate thresholds for a page visual range and a decision threshold for each parameter-navigation website are configured. The method comprises: downloading and rendering resources related to a target site to obtain the visible page information of the target site, and establishing a page visual coordinate system corresponding to the visible page; extracting, from the rendered visible page information, the coordinate information of each hyperlink object presented in the page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system; for a given hyperlink object, if either of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, determining that the hyperlink object is within the page visual range, and otherwise determining that it is not; and counting the hyperlink objects within the page visual range and, if the count is greater than or equal to the decision threshold, determining that the target site is a parameter-navigation website and skipping deduplication of the hyperlink objects within the page visual range during crawling.
Optionally, the method according to the invention further comprises: if the target site is determined to be a parameter-navigation website, deduplicating the hyperlink objects that are not within the page visual range; and if the number of hyperlink objects within the page visual range of a target site is less than the decision threshold, determining that the target site is not a parameter-navigation website and deduplicating all hyperlink objects in the target site.
Optionally, in the method according to the invention, the decision threshold is a threshold on the number of hyperlink objects in the target site that have the same path and the same parameter but different parameter values.
Optionally, in the method according to the invention, the horizontal and vertical coordinate thresholds of the page visual range are 200 pixels.
Optionally, in the method according to the invention, the step of downloading and rendering resources related to the target site to obtain the visible page information of the target site is suitable for execution by a WebKit-based browser engine.
Optionally, in the method according to the invention, each hyperlink object has a text box, and the coordinate information is suitable for extraction with the geometry() method of QtWebKit.QWebElement. In the extracted content, the key is the hyperlink object and the value contains the horizontal and vertical coordinate values of the top-left vertex of the text box containing that hyperlink object.
Optionally, in the method according to the invention, the coordinate information further includes the length (width) and height of the text box containing the hyperlink object.
Optionally, in the method according to the invention, the page visual coordinate system takes the top-left corner of the browser interface as its origin, with the X axis extending rightward from the origin and the Y axis extending downward.
According to another aspect of the invention, a computing device is provided, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the methods described above.
According to a further aspect of the invention, a computer-readable storage medium storing one or more programs is provided. The one or more programs include instructions that, when executed by a computing device, cause the computing device to perform any of the methods described above.
In the technical solution provided by the invention, after the visible page of a target site is opened by rendering, the number of hyperlink objects within the page visual range is counted to judge whether the target site is a parameter-navigation website. When a site is judged to be a parameter-navigation website, the hyperlink objects within the page visual range are no longer deduplicated, and only the hyperlink objects outside the page visual range are deduplicated. This effectively avoids missed extraction when crawling parameter-navigation websites.
Brief description of the drawings
To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout the disclosure, like reference numerals generally refer to like parts or elements.
Fig. 1 shows a schematic diagram of a computing device 100 according to an embodiment of the invention; and
Fig. 2 shows a flowchart of a web crawler deduplication method 200 according to an embodiment of the invention.
Detailed description
Exemplary embodiments of the disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 can be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 can be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-one cache 110 and a level-two cache 112, as well as a processor core 114 and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 can be used together with the processor 104, or in some implementations the memory controller 118 can be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 can be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, the applications 122 may be arranged to operate with the program data 124 on the operating system. The program data 124 includes instructions; in the computing device 100 according to the invention, the program data 124 includes instructions for performing the web crawler deduplication method 200.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which can be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which can be configured to facilitate communication via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which can be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link can be one example of a communication medium. Communication media are typically embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery medium. A "modulated data signal" can be a signal in which one or more of its characteristics are set or changed in such a way as to encode information in the signal. As a non-limiting example, communication media can include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media.
The computing device 100 can be implemented as a server, such as a file server, database server, application server, or web server. It can also be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular phone, a personal digital assistant (PDA), a personal media player, a wireless web browsing device, a personal headset, an application-specific device, or a hybrid device that includes any of the above functions. The computing device 100 can also be implemented as a personal computer, including desktop and notebook configurations. In some embodiments, the computing device 100 is configured to perform the web crawler deduplication method 200 according to the invention.
As mentioned above, the parameter-value count configured for current web crawler page deduplication algorithms is hard to match to a given website. When the configured count is smaller than the site requires, page information is missed; when it is larger than the site requires, the crawler fetches too many equivalent pages and crawl efficiency drops. When such a deduplication algorithm is applied to multi-site crawling, the probability that the configured value mismatches a site is essentially 100%. As an example of the first case, suppose the parameter-value count is set to 4 and the site has the six pages index.jsp?content=inside.html, index.jsp?content=about.html, index.jsp?content=careers.html, index.jsp?content=investor.html, index.jsp?content=other.html, and index.jsp?content=update.html. Only 4 of these 6 pages are extracted; the other 2 are filtered out by the crawler's deduplication algorithm, so their information is never extracted. To address this, the invention provides a web crawler method 200 that both preserves crawl efficiency and effectively prevents missed extraction. It works mainly from visual coordinates and can effectively solve the page deduplication problem for crawlers on parameter-navigation websites.
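The missed-extraction problem in the example above can be reproduced with a short sketch of a count-based deduplicator. This is an illustration only, not the patent's code: the function name `dedup_by_param_count` and the choice to group URLs by path plus parameter names are assumptions made here.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def dedup_by_param_count(urls, limit=4):
    """Keep at most `limit` URLs per (path, parameter names) group.

    Hypothetical baseline: the grouping key is an assumption about what
    "same parameter" means in a count-based deduplication algorithm.
    """
    seen = defaultdict(int)
    kept, dropped = [], []
    for url in urls:
        parts = urlsplit(url)
        names = tuple(name for name, _ in parse_qsl(parts.query))
        key = (parts.path, names)
        if seen[key] < limit:
            seen[key] += 1
            kept.append(url)
        else:
            dropped.append(url)
    return kept, dropped


# The six navigation pages from the example in the text.
pages = ["index.jsp?content=%s.html" % name
         for name in ("inside", "about", "careers", "investor", "other", "update")]
kept, dropped = dedup_by_param_count(pages, limit=4)
print(dropped)
```

With a limit of 4, the last two navigation pages end up in `dropped`, which is the information loss the method is meant to avoid.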
Fig. 2 shows a flowchart of the web crawler deduplication method 200 according to an embodiment of the invention. The method 200 is suitable for execution on a computing device (such as the computing device 100 described above). As shown in Fig. 2, the method 200 starts at step S220.
In step S220, resources related to the target site are downloaded and rendered to obtain the visible page information of the target site, and a page visual coordinate system corresponding to the visible page is established.
Specifically, a WebKit-based browser engine can be used to download and render the resources related to the target URL, such as JavaScript scripts and CSS style files. The page visual coordinate system can be configured as needed. For example, the top-left corner of the browser interface can be set as the origin, with the X axis extending rightward and the Y axis extending downward. Alternatively, the top-left corner of the content display area of the target site's visible page can be chosen as the origin, again with the X axis extending rightward and the Y axis extending downward. As a further alternative, the left endpoint of the lower boundary line of the browser's bookmark bar can be chosen as the origin. These are only examples: the origin can also be placed at the lower-left corner as needed, and the directions of the X and Y axes can likewise be adjusted. The invention is not limited in this respect.
Then, in step S240, the coordinate information of each hyperlink object presented in the page is extracted from the rendered visible page information, where the coordinate information includes the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system.
The visible page of a target site contains multiple hyperlink objects. For example, the home page of the 360 browser has hyperlink objects titled "Tmall", "JD.com", and "Taobao"; clicking one of them jumps to the corresponding Tmall or JD.com home page, and so on. In the rendered visible page, each hyperlink object has a text box that defines its clickable area, and clicks inside that box take effect. The horizontal and vertical coordinate values obtained for each hyperlink object in step S240 actually refer to the coordinates of the top-left vertex of the text box containing that hyperlink object. In addition, the coordinate information of each hyperlink object also includes the length (width) and height of its text box.
According to one embodiment, the geometry() method of QtWebKit.QWebElement can be used to extract the coordinate information of each hyperlink object. In the extracted content, the key is the hyperlink object and the value contains the horizontal and vertical coordinate values of the top-left vertex of the text box containing that hyperlink object; it can of course also include the length (width) and height of the text box. For example, the geometry() method may extract the information {"./index.jsp?content=inside.html": "(5,5,80,40)"}. The key in the resulting dictionary is the hyperlink, and the coordinates (5,5) can be read from the value.
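A minimal sketch of reading coordinates out of such a result dictionary follows. It deliberately does not call Qt; it only parses the "(x,y,width,height)" string form shown in the example. The names `LinkBox` and `parse_box` are hypothetical helpers introduced here, and the interpretation of the last two numbers as width and height is an assumption based on the surrounding description.

```python
from typing import NamedTuple


class LinkBox(NamedTuple):
    """Text box of a hyperlink: top-left vertex plus size (assumed field order)."""
    x: int
    y: int
    width: int
    height: int


def parse_box(value):
    """Parse a '(x,y,width,height)' string into a LinkBox."""
    numbers = [int(part) for part in value.strip("() ").split(",")]
    if len(numbers) != 4:
        raise ValueError("expected four comma-separated integers: %r" % (value,))
    return LinkBox(*numbers)


# Result dictionary in the form given in the text: hyperlink -> geometry string.
extracted = {"./index.jsp?content=inside.html": "(5,5,80,40)"}
boxes = {href: parse_box(value) for href, value in extracted.items()}

box = boxes["./index.jsp?content=inside.html"]
top_left = (box.x, box.y)  # coordinates of the text box's top-left vertex
```

Only `top_left` is needed for the visual-range test in the next step; the size fields are kept because the description says the coordinate information may include them.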
Then, in step S260, for a given hyperlink object, if either of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, the hyperlink object is determined to be within the page visual range; otherwise it is not.
Here, the horizontal and vertical coordinate thresholds of the page visual range are stored in the computing device 100 in advance, and these thresholds are used to judge whether a hyperlink object is within the page visual range. Both values can be configured by the initialization module of the algorithm. According to one embodiment, both thresholds can be set to 200 pixels, meaning that a hyperlink object is considered to be within the page visual range when either its horizontal or its vertical coordinate value is less than or equal to 200 pixels. The thresholds can of course be configured flexibly according to the size of the browser interface; the invention is not limited in this respect.
If both coordinate thresholds are set to 200 pixels, the hyperlink objects {"./index.jsp?content=inside.html": "(5,5)", "./index.jsp?content=about.html": "(85,5)", "./index.jsp?content=careers.html": "(165,5)", "./index.jsp?content=investor.html": "(245,5)", "./index.jsp?content=other.html": "(325,5)", "./index.jsp?content=update.html": "(405,5)"} are within the page visual range, because the vertical coordinate of each is 5, which does not exceed 200. The hyperlink objects {"/ask.jsp?date=0601": "(300,400)", "/ask.jsp?date=0602": "(300,450)", "/ask.jsp?date=0603": "(300,500)", "/ask.jsp?date=0604": "(300,550)", "/ask.jsp?date=0605": "(300,600)"} are not within the page visual range, because both of their coordinates exceed 200.
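The visual-range test of step S260 can be sketched in a few lines, using the coordinates from the example. The function name `in_visual_range` is a hypothetical helper; the "either coordinate" rule and the 200-pixel defaults come from the text.

```python
def in_visual_range(x, y, x_threshold=200, y_threshold=200):
    """A link is in the page visual range if EITHER coordinate is within its threshold."""
    return x <= x_threshold or y <= y_threshold


# Top-left coordinates of each hyperlink's text box, as listed in the example.
links = {
    "./index.jsp?content=inside.html": (5, 5),
    "./index.jsp?content=about.html": (85, 5),
    "./index.jsp?content=careers.html": (165, 5),
    "./index.jsp?content=investor.html": (245, 5),
    "./index.jsp?content=other.html": (325, 5),
    "./index.jsp?content=update.html": (405, 5),
    "/ask.jsp?date=0601": (300, 400),
    "/ask.jsp?date=0602": (300, 450),
    "/ask.jsp?date=0603": (300, 500),
    "/ask.jsp?date=0604": (300, 550),
    "/ask.jsp?date=0605": (300, 600),
}

visible = [href for href, (x, y) in links.items() if in_visual_range(x, y)]
hidden = [href for href in links if href not in visible]
```

Because the rule is a logical OR, a link far to the right (x = 405) still counts as visible when it sits near the top (y = 5), which is what lets a horizontal navigation bar qualify as a whole.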
Then, in step S280, the hyperlink objects within the page visual range are counted. If the count is greater than or equal to the decision threshold for parameter-navigation websites, the target site is determined to be a parameter-navigation website, and deduplication of the hyperlink objects within the page visual range is skipped during crawling. Conversely, if the count is less than the decision threshold, the target site is determined not to be a parameter-navigation website, and all hyperlink objects in the target site are deduplicated.
Here, the decision threshold num for each parameter-navigation website is also stored in the computing device 100 in advance and is applied as count(path + parameter + '=') > num. In other words, the decision threshold is a count threshold for hyperlink objects that have the same path and the same parameter but different parameter values. The decision threshold is set at initialization and can be configured through an external interface. Different navigation parameters can have different decision thresholds, and the decision threshold can also be adjusted according to the scale of the target site.
For example, in the example above, 6 hyperlink objects are within the page visual range; that is, the number of hyperlink objects with the same path, the same parameter, and different parameter values is 6. If the decision threshold for parameter-navigation websites is initially set to 4, the site can be determined to be a parameter-navigation website, and the hyperlink objects within the page visual range are then usually navigation hyperlinks. If the decision threshold is initially set to 7, the site can be determined not to be a parameter-navigation website.
If a target site is determined to be a parameter-navigation website, the crawler does not deduplicate any hyperlink objects within the page visual range; that is, information is extracted for all of them, for example the page information of the 6 hyperlink objects above. Hyperlink objects that are not within the page visual range are deduplicated as usual. For example, if the 6 hyperlink objects {"/ask.jsp?date=0601", "/ask.jsp?date=0602", "/ask.jsp?date=0603", "/ask.jsp?date=0604", "/ask.jsp?date=0605", "/ask.jsp?date=0606"} are outside the page visual range, only the information of the first 4 is extracted and the information of the last 2 is filtered out. If, on the other hand, a target site is determined not to be a parameter-navigation website, all hyperlink objects in the target site are deduplicated in the same way.
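Putting the steps together, the selective deduplication can be sketched as below. This is a simplified illustration under stated assumptions, not the patented implementation: `group_key`, `dedup`, and `select_links` are hypothetical names, the count-based deduplicator stands in for whatever "normal" deduplication the crawler uses, and the coordinate (300, 650) for the sixth ask.jsp link is invented for the example because the text gives no coordinate for it.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def group_key(href):
    """Group links by path and parameter names (assumed notion of 'similar page')."""
    parts = urlsplit(href)
    return parts.path, tuple(name for name, _ in parse_qsl(parts.query))


def dedup(hrefs, limit):
    """Count-based deduplication: keep at most `limit` links per group, in order."""
    seen = defaultdict(int)
    kept = []
    for href in hrefs:
        key = group_key(href)
        if seen[key] < limit:
            seen[key] += 1
            kept.append(href)
    return kept


def select_links(links, coord_threshold=200, decision_threshold=4, dedup_limit=4):
    """Return the links to crawl, exempting in-range links on navigation sites."""
    visible, hidden = [], []
    for href, (x, y) in links.items():
        in_range = x <= coord_threshold or y <= coord_threshold
        (visible if in_range else hidden).append(href)

    # Count in-range links per group; links without parameters are ignored.
    sizes = defaultdict(int)
    for href in visible:
        key = group_key(href)
        if key[1]:
            sizes[key] += 1

    if any(size >= decision_threshold for size in sizes.values()):
        # Parameter-navigation site: keep every in-range link, dedup the rest.
        return visible + dedup(hidden, dedup_limit)
    # Otherwise deduplicate everything.
    return dedup(visible + hidden, dedup_limit)


names = ("inside", "about", "careers", "investor", "other", "update")
links = {"./index.jsp?content=%s.html" % n: (5 + 80 * i, 5) for i, n in enumerate(names)}
# y = 650 for date=0606 is an assumed coordinate, continuing the 50px spacing.
links.update({"/ask.jsp?date=060%d" % d: (300, 350 + 50 * d) for d in range(1, 7)})

as_navigation_site = select_links(links, decision_threshold=4)
as_ordinary_site = select_links(links, decision_threshold=7)
```

With a decision threshold of 4, all six navigation links survive and only four of the six ask.jsp links are kept; with a threshold of 7, both groups are cut to four links each.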
The reasoning behind this treatment is that, on a parameter-navigation website, navigation links are generally located on the left side or at the top of the page, the hyperlink objects sit relatively close together, and there are relatively many of them. Hyperlink objects within a certain region (for example, within 200px) are therefore identified using page visual coordinates, and the number of hyperlink objects in that region is used to judge whether the target site is a parameter-navigation website. On a parameter-navigation website, the hyperlink objects within the page visual range are generally navigation hyperlinks and therefore very important, so all of their information can be extracted to keep the crawl results comprehensive. Hyperlink objects outside the page visual range can then be deduplicated appropriately to improve overall crawl efficiency for the target site. Finally, for websites that are not of the parameter-navigation type, the page as a whole has no obvious division into important and secondary parts, so the whole page can be deduplicated. This differentiated handling of different pages both avoids missing key site information and addresses the loss of crawl efficiency caused by crawling too many equivalent pages.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in tangible media such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code executes on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to perform the web crawler deduplication method of the invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Any combination of the above is also included within the scope of computer-readable media.
In the description provided here, the algorithms and displays are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with the examples of the invention. The structure required to construct such systems is apparent from the description above. In addition, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein can be arranged in a device as described in the embodiment, or alternatively can be located in one or more devices different from the device in the example. The modules in the foregoing examples can be combined into one module or further divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) can be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. A processor with the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative and not restrictive of its scope, which is defined by the appended claims.
Claims (10)
1. A web crawler deduplication method, adapted to be executed in a computing device in which horizontal and vertical coordinate thresholds of a page visual range and a decision threshold for a parameter navigation website are preset, the method comprising:
downloading and rendering resources related to a target site to obtain visible page information of the target site, and establishing a page visual coordinate system corresponding to the visible page;
extracting, from the rendered visible page information, coordinate information of each hyperlink object presented in the page, the coordinate information including the horizontal and vertical coordinate values of each hyperlink object in the page visual coordinate system;
for a given hyperlink object, determining that the hyperlink object is within the page visual range if either of its horizontal and vertical coordinate values is less than or equal to the corresponding coordinate threshold, and otherwise determining that it is not within the page visual range; and
counting the number of hyperlink objects within the page visual range and, if the number is greater than or equal to the decision threshold, determining that the target site is a parameter navigation website and omitting, during crawling, deduplication of the hyperlink objects within the page visual range.
2. The method of claim 1, further comprising:
if the target site is determined to be a parameter navigation website, performing deduplication on the hyperlink objects that are not within the page visual range; and
if the number of hyperlink objects within the page visual range of a target site is less than the decision threshold, determining that the target site is not a parameter navigation website and performing deduplication on all hyperlink objects in the target site.
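The two branches of claim 2 can be sketched as a single selection routine. This is a hedged illustration only: links are represented as hypothetical `(url, x, y)` tuples, the decision threshold of 3 is an assumed value, and `dedup_key` is a caller-supplied stand-in for whatever deduplication rule the crawler uses.

```python
from typing import Callable, Iterable

X_THRESHOLD = 200        # px, per claim 4
Y_THRESHOLD = 200        # px, per claim 4
DECISION_THRESHOLD = 3   # hypothetical value, not given in the claims


def in_visual_range(link: tuple[str, int, int]) -> bool:
    _, x, y = link
    return x <= X_THRESHOLD or y <= Y_THRESHOLD


def select_links(
    links: Iterable[tuple[str, int, int]],
    dedup_key: Callable[[str], object],
) -> list[str]:
    """Return the URLs to crawl, applying deduplication as in claims 1 and 2."""
    links = list(links)
    visible_count = sum(1 for link in links if in_visual_range(link))
    is_nav_site = visible_count >= DECISION_THRESHOLD

    seen = set()
    selected = []
    for link in links:
        url = link[0]
        if is_nav_site and in_visual_range(link):
            # Parameter navigation site: in-range links skip deduplication.
            selected.append(url)
            continue
        # Otherwise deduplicate: keep only the first URL for each key.
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            selected.append(url)
    return selected
```

For example, with `dedup_key=lambda url: url.split("?")[0]`, three in-range links that differ only in a query value are all kept on a navigation site, while out-of-range links sharing a path collapse to one.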
3. The method of claim 1, wherein the decision threshold is a threshold on the number of hyperlink objects in the target site that have the same path and the same parameters but different parameter values.
4. The method of claim 1, wherein the horizontal and vertical coordinate thresholds of the page visual range are 200 pixels.
5. The method of claim 1, wherein the step of downloading and rendering resources related to the target site to obtain the visible page information of the target site is adapted to be performed by a WebKit-based browser engine.
6. The method of claim 1, wherein each hyperlink object has a text box, and the coordinate information is adapted to be extracted using the geometry() method of QtWebKit.QWebElement, the key of the extracted content being the content of each hyperlink object and the value including the horizontal and vertical coordinate values of the top-left vertex of the text box in which that hyperlink object is located.
7. The method of claim 1, wherein the coordinate information further includes the length and height of the text box in which the hyperlink object is located.
8. The method of claim 1, wherein the page visual coordinate system takes the upper-left corner of the browser interface as its origin, with the X axis extending rightward from the origin and the Y axis extending downward.
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any one of the methods of claims 1 to 8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any one of the methods of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710706059.6A CN107480264B (en) | 2017-08-17 | 2017-08-17 | A kind of web crawlers De-weight method and calculate equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710706059.6A CN107480264B (en) | 2017-08-17 | 2017-08-17 | A kind of web crawlers De-weight method and calculate equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480264A CN107480264A (en) | 2017-12-15 |
CN107480264B true CN107480264B (en) | 2019-11-15 |
Family
ID=60599809
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710706059.6A Active CN107480264B (en) | 2017-08-17 | 2017-08-17 | A kind of web crawlers De-weight method and calculate equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480264B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111062075B (en) * | 2019-12-11 | 2022-12-23 | 三一筑工科技股份有限公司 | Beam-column one-pen-hoop model generation method and device and computing equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101114285A (en) * | 2006-07-25 | 2008-01-30 | 腾讯科技(深圳)有限公司 | Internet topics file searching method, reptile system and search engine |
CN101630330A (en) * | 2009-08-14 | 2010-01-20 | 苏州锐创通信有限责任公司 | Method for webpage classification |
CN102880607A (en) * | 2011-07-15 | 2013-01-16 | 舆情(香港)有限公司 | Dynamic network content grabbing method and dynamic network content crawler system |
CN103279507A (en) * | 2013-05-16 | 2013-09-04 | 北京尚友通达信息技术有限公司 | Webpage spider operational method and system |
CN106570023A (en) * | 2015-10-10 | 2017-04-19 | 北京国双科技有限公司 | Customized method and device for deleting repetitions of crawler system |
Non-Patent Citations (2)
Title |
---|
Research on a URL deduplication method for web crawler systems; Cheng Gong et al.; 《中国新技术新产品》; 20140625; p. 23 * |
A WebKit-based web crawler; Guo Jincheng et al.; 《现代电子技术》; 20130915; pp. 62-64, 68 * |
Also Published As
Publication number | Publication date |
---|---|
CN107480264A (en) | 2017-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106951451B (en) | A kind of webpage content extracting method, device and calculate equipment | |
CN106293365B (en) | A kind of method and device obtaining content of pages | |
US20120102392A1 (en) | Method for displaying a data set | |
CN105205080B (en) | Redundant file method for cleaning, device and system | |
US20150244737A1 (en) | Detecting malicious advertisements using source code analysis | |
CN102902661A (en) | Method for realizing hyperlinks of electronic books | |
CN103250166A (en) | Method and apparatus for providing hand detection | |
US20160216885A1 (en) | Method and device for processing touch operation of electronic apparatus | |
CN107357496A (en) | Annotation process method, electronic equipment and computer-readable storage medium | |
CN107016282A (en) | A kind of information processing method and device | |
CN106033450A (en) | Method and device for blocking advertisement, and browser | |
CN103530386B (en) | The edit methods and browser of browsing device net page | |
CN105205077A (en) | Page layout method, device and system | |
CN106095241B (en) | The window display method of Web application a kind of, device and calculate equipment | |
CN107480264B (en) | A kind of web crawlers De-weight method and calculate equipment | |
CN108228557B (en) | Sequence labeling method and device | |
CN105426524A (en) | Web interface displaying method and device | |
CN107766419B (en) | Threshold denoising-based TextRank document summarization method and device | |
CN112287264B (en) | Webpage layout method and device, electronic equipment and storage medium | |
CN109478202A (en) | Scalable vector graphics packet | |
CN107085515A (en) | Personal page generation method and device | |
Peng et al. | A proximal alternating direction method of multipliers for a minimization problem with nonconvex constraints | |
Nagar | Introduction to MATLAB: For Engineers and Scientists | |
CN102929777B (en) | Network application method of testing and test macro | |
CN105630980A (en) | Game recommending strategy obtaining method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102 Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd. Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd. |
GR01 | Patent grant | ||