CN107844527A - Web page address deduplication method, electronic device, and computer-readable storage medium - Google Patents

Web page address deduplication method, electronic device, and computer-readable storage medium

Info

Publication number
CN107844527A
Authority
CN
China
Prior art keywords
node
web page
page address
generalized list
present
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710954304.5A
Other languages
Chinese (zh)
Inventor
李芳�
王建明
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710954304.5A
Priority to PCT/CN2018/076170 (WO2019071896A1)
Publication of CN107844527A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a web page address deduplication method comprising the following steps: reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list; if the pending web page address is not found in the improved generalized list, inserting it into the improved generalized list and storing it in the to-be-crawled queue; and if the pending web page address is found in the improved generalized list, refraining from storing it in the to-be-crawled queue. The invention can improve the efficiency of web page address deduplication.

Description

Web page address deduplication method, electronic device, and computer-readable storage medium
Technical field
The present invention relates to the field of computer information technology, and in particular to a web page address deduplication method, an electronic device, and a computer-readable storage medium.
Background
At present, the URL deduplication schemes commonly used by web crawlers are database-based deduplication and deduplication based on in-memory linked lists. These schemes work well when the number of stored URLs is small. However, existing distributed crawlers generally face a very large number of stored URLs and need URL deduplication to keep running efficiently over time, whereas the conventional schemes above risk a sharp drop in efficiency, or paralysis of the task, after the crawler has been running for a long time. The URL deduplication methods of the prior art are therefore not well designed and are in urgent need of improvement.
Summary of the invention
In view of this, the present invention proposes a web page address deduplication method, an electronic device, and a computer-readable storage medium that perform URL deduplication using an improved, memory-based generalized list and are significantly better than traditional URL deduplication schemes in time efficiency.
First, to achieve the above object, the present invention proposes an electronic device comprising a memory, a processor, and a web page address deduplication system that is stored in the memory and can run on the processor. When executed by the processor, the web page address deduplication system implements the following steps:
reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list, wherein every node in the improved generalized list has the same weight, and every node in the improved generalized list includes a flag bit indicating whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing the pending web page address in the to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, refraining from storing the pending web page address in the to-be-crawled queue.
Preferably, the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
Preferably, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor; when a new node object is created for the current node without a specified parameter being passed, the first constructor sets the data field attribute of the new node object to a designated character by default.
Preferably, if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor; the second constructor assigns the specified data value passed when the new node object is created for the current node to the data field of the new node object.
Preferably, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into single letters, with each node in the improved generalized list storing one letter.
In addition, to achieve the above object, the present invention also provides a web page address deduplication method applied to an electronic device, the method comprising:
reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list, wherein every node in the improved generalized list has the same weight, and every node in the improved generalized list includes a flag bit indicating whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing the pending web page address in the to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, refraining from storing the pending web page address in the to-be-crawled queue.
Preferably, the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
Preferably, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor; when a new node object is created for the current node without a specified parameter being passed, the first constructor sets the data field attribute of the new node object to a designated character by default; and
if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor; the second constructor assigns the specified data value passed when the new node object is created for the current node to the data field of the new node object.
Preferably, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into single letters, with each node in the improved generalized list storing one letter.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a web page address deduplication system, the web page address deduplication system being executable by at least one processor to cause the at least one processor to perform the steps of the web page address deduplication method described above.
Compared with the prior art, the electronic device, web page address deduplication method, and computer-readable storage medium proposed by the present invention perform URL deduplication using an improved, memory-based generalized list and are significantly better than traditional URL deduplication schemes in time efficiency. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
Brief description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present invention;
Fig. 2 is a schematic diagram of the program modules of an embodiment of the web page address deduplication system in the electronic device of the present invention;
Fig. 3 is a flowchart of an embodiment of the web page address deduplication method of the present invention.
Reference numerals:
Electronic device 2
Memory 21
Processor 22
Network interface 23
Web page address deduplication system 20
Search module 201
Insertion module 202
Deduplication module 203
Process steps S31-S33
The realization of the objects, the functional characteristics, and the advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as not existing and as falling outside the scope of protection claimed by the present invention.
It should further be noted that, in this document, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
First, the present invention proposes an electronic device 2.
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present invention. In the present embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23 that are communicatively connected to one another via a system bus. It should be noted that Fig. 1 shows only the electronic device 2 with components 21-23; it should be understood that not all of the components shown are required, and that more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or main memory. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device 2. The memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device. In the present embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the web page address deduplication system 20. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In the present embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the web page address deduplication system 20.
The network interface 23 may include a wireless network interface or a wired network interface and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to an external data platform over a network and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
So far, the application environment of the embodiments of the present invention and the hardware structure and functions of the related devices have been described in detail. The embodiments of the present invention are presented below on the basis of this application environment and these devices.
Fig. 2 is a program module diagram of an embodiment of the web page address deduplication system 20 in the electronic device 2 of the present invention. In the present embodiment, the web page address deduplication system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in the present embodiment) to carry out the present invention. For example, in Fig. 2, the web page address deduplication system 20 may be divided into a search module 201, an insertion module 202, and a deduplication module 203. A program module, as referred to in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing the execution of the web page address deduplication system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.
The search module 201 is configured to read pending web page addresses (for example, URLs) one at a time and to search for each pending web page address in the improved generalized list. In the present embodiment, the URLs (Uniform Resource Locators) may be read one at a time from a web page log.
Preferably, in the present embodiment, URL deduplication is performed using an improved, memory-based generalized list. Every node in the improved generalized list (for example, ordinary nodes and element nodes) has the same weight (equal status), and a flag bit indicating whether the current node is the root node is added to every node in the improved generalized list. The flag bit is used to determine the starting node of each dynamic insertion operation.
In the present embodiment, every node has a flag bit (for example, an isRoot flag) indicating whether the current node is the root node. The flag bit takes a first value (for example true, representing 1) or a second value (for example false, representing 0). If the flag bit of the current node is detected to be the first value (true, representing 1), the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation. If the flag bit of the current node is detected to be the second value (false, representing 0), the current node is determined not to be the root node, and traversal continues with the next node.
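The flag check described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the `Node` class, the `is_root` attribute (standing in for the isRoot flag), and the `find_starting_node` helper are hypothetical names, and because the text does not say how candidate nodes are enumerated, the sketch simply scans an arbitrary sequence of nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    """Hypothetical node carrying only the fields needed for this sketch."""
    data: str = ""
    is_root: bool = False  # stands in for the isRoot flag (true = 1, false = 0)


def find_starting_node(nodes: Iterable[Node]) -> Optional[Node]:
    """Return the first node whose flag marks it as the root, or None."""
    for node in nodes:
        if node.is_root:
            # First value: this node is the root, so traversal and
            # dynamic insertion start from here.
            return node
        # Second value: not the root, so move on to the next node.
    return None


candidates = [Node("h"), Node("t"), Node(is_root=True), Node("p")]
start = find_starting_node(candidates)
```

Because the check is a single boolean test per node, it needs no knowledge of what kind of node is being inspected.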
It should be noted that a traditional generalized list contains nodes of different weights (that is, nodes of different status), such as ordinary nodes and element nodes, so the node type has to be determined whenever the generalized list is dynamically extended, pruned, or modified, which takes additional time. In the improved generalized list of the present embodiment all nodes have equal status, so no such extra check is needed and the list can be extended dynamically and efficiently, which suits the needs of deduplication. More specifically, by adding an isRoot flag bit to every node, the present embodiment adds only a small amount of memory per node; it is a space-for-time trade-off that yields a large improvement in time efficiency.
For example, in the present embodiment, the node data structure of the improved generalized list is set as follows.
In the present embodiment, if the current node of the improved generalized list is the root node (isRoot=true), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a first constructor (this.data=''). When a new node object (for example, a GLNode object) is created for the current node without a specified parameter being passed, the first constructor sets the data field attribute (the data attribute) of the new node object to the designated character '' by default. If the current node of the improved generalized list is a non-root node (isRoot=false), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a second constructor (this.data=data). The second constructor assigns the specified data value (for example, the data parameter) passed when the new node object (for example, a GLNode object) is created for the current node to the data field of the new node object.
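A minimal sketch of such a node follows. The names GLNode, data, head, tail, and isRoot come from the text; rendering them in Python, spelling the flag `is_root`, and folding the first and second constructors into a single constructor with a default argument are assumptions made for illustration.

```python
from typing import Optional


class GLNode:
    """Uniform node of the improved generalized list (illustrative sketch)."""

    def __init__(self, data: str = "", is_root: bool = False) -> None:
        # Pointer initialization, written head = tail = null in the text.
        self.head: Optional["GLNode"] = None
        self.tail: Optional["GLNode"] = None
        # With no argument, data defaults to the empty string '' (root case);
        # otherwise the value passed in is stored (non-root case).
        self.data = data
        self.is_root = is_root


root = GLNode(is_root=True)  # data defaults to ''
letter = GLNode("h")         # data is the value passed in
```

Every node has the same four fields, so code that walks the list never has to branch on a node type.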
The insertion module 202 is configured to, if the pending web page address is not found in the improved generalized list, insert the pending web page address into the improved generalized list and store the pending web page address in the to-be-crawled queue, where it waits for the web crawler to crawl the web page content.
Preferably, in the present embodiment, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address (for example, the URL string) into single letters, with each node in the improved generalized list storing one letter.
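One way to picture the character-by-character insertion is the sketch below. The text does not specify how head and tail link the nodes, so the layout here is an assumption: `head` is taken to point to the first node one level down and `tail` to the next node on the same level. The `END` terminator, which marks where an address stops so that an address that is a prefix of another stays distinguishable, is also an assumption, as are the helper names `find_child` and `insert`.

```python
from typing import Optional

END = "\0"  # hypothetical end-of-address marker; not part of the source text


class GLNode:
    def __init__(self, data: str = "", is_root: bool = False) -> None:
        self.data = data
        self.is_root = is_root
        self.head: Optional["GLNode"] = None  # assumed: first node one level down
        self.tail: Optional["GLNode"] = None  # assumed: next node on the same level


def find_child(parent: GLNode, ch: str) -> Optional[GLNode]:
    """Scan the level below `parent` for a node storing `ch`."""
    node = parent.head
    while node is not None:
        if node.data == ch:
            return node
        node = node.tail
    return None


def insert(root: GLNode, url: str) -> None:
    """Insert `url` one character per node, reusing shared prefixes."""
    if not root.is_root:
        raise ValueError("insertion must start from the root node")
    current = root
    for ch in url + END:
        child = find_child(current, ch)
        if child is None:
            child = GLNode(ch)
            child.tail = current.head  # link in front of the existing siblings
            current.head = child
        current = child


root = GLNode(is_root=True)
insert(root, "ab")
insert(root, "ac")
```

In this example the two addresses share the node for "a", and only the nodes for "b" and "c" (plus their terminators) are distinct, which is where the memory saving for URLs with common prefixes would come from under this layout.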
The deduplication module 203 is configured to, if the pending web page address is found in the improved generalized list, refrain from storing the pending web page address in the to-be-crawled queue, so that the web page content is not crawled again.
Through the program modules 201-203 described above, the web page address deduplication system 20 proposed by the present invention performs URL deduplication using an improved, memory-based generalized list and is significantly better than traditional URL deduplication schemes in time efficiency. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
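The cooperation of the three modules can be sketched end to end as below. This is a sketch under assumptions rather than the patented implementation: the class name `UrlDeduplicator`, its method names, the `END` terminator, and the reading of `head` as "first node one level down" and `tail` as "next node on the same level" are all hypothetical; only the search, insert, and skip behavior is taken from the text.

```python
from collections import deque
from typing import Deque, Iterable, Optional

END = "\0"  # hypothetical end-of-address marker


class GLNode:
    def __init__(self, data: str = "", is_root: bool = False) -> None:
        self.data = data
        self.is_root = is_root
        self.head: Optional["GLNode"] = None  # assumed: first node one level down
        self.tail: Optional["GLNode"] = None  # assumed: next node on the same level


class UrlDeduplicator:
    def __init__(self) -> None:
        self.root = GLNode(is_root=True)
        self.crawl_queue: Deque[str] = deque()  # the to-be-crawled queue

    @staticmethod
    def _child(parent: GLNode, ch: str) -> Optional[GLNode]:
        node = parent.head
        while node is not None and node.data != ch:
            node = node.tail
        return node

    def search(self, url: str) -> bool:
        """Search module 201: is the address already in the list?"""
        node: Optional[GLNode] = self.root
        for ch in url + END:
            node = self._child(node, ch)
            if node is None:
                return False
        return True

    def insert(self, url: str) -> None:
        """Insertion module 202: add the address, then queue it for crawling."""
        node = self.root
        for ch in url + END:
            child = self._child(node, ch)
            if child is None:
                child = GLNode(ch)
                child.tail = node.head
                node.head = child
            node = child
        self.crawl_queue.append(url)

    def process(self, urls: Iterable[str]) -> None:
        for url in urls:
            if not self.search(url):
                self.insert(url)
            # Deduplication module 203: an address that was found is skipped.


dedup = UrlDeduplicator()
dedup.process(["http://a.com/", "http://a.com/x", "http://a.com/"])
print(list(dedup.crawl_queue))
```

A lookup touches at most one level per character of the address, so its cost depends on the address length and the number of siblings per level rather than on how many addresses have been stored.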
In addition, the present invention also proposes a web page address deduplication method.
Fig. 3 is a flowchart of an embodiment of the web page address deduplication method of the present invention. In the present embodiment, the order in which the steps of the flowchart shown in Fig. 3 are executed may be changed, and some steps may be omitted, according to different requirements.
Step S31: read pending web page addresses (for example, URLs) one at a time and search for each pending web page address in the improved generalized list. In the present embodiment, the URLs (Uniform Resource Locators) may be read one at a time from a web page log.
Preferably, in the present embodiment, URL deduplication is performed using an improved, memory-based generalized list. Every node in the improved generalized list (for example, ordinary nodes and element nodes) has the same weight (equal status), and a flag bit indicating whether the current node is the root node is added to every node in the improved generalized list. The flag bit is used to determine the starting node of each dynamic insertion operation.
In the present embodiment, every node has a flag bit (for example, an isRoot flag) indicating whether the current node is the root node. The flag bit takes a first value (for example true, representing 1) or a second value (for example false, representing 0). If the flag bit of the current node is detected to be the first value (true, representing 1), the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation. If the flag bit of the current node is detected to be the second value (false, representing 0), the current node is determined not to be the root node, and traversal continues with the next node.
It should be noted that a traditional generalized list contains nodes of different weights (that is, nodes of different status), such as ordinary nodes and element nodes, so the node type has to be determined whenever the generalized list is dynamically extended, pruned, or modified, which takes additional time. In the improved generalized list of the present embodiment all nodes have equal status, so no such extra check is needed and the list can be extended dynamically and efficiently, which suits the needs of deduplication. More specifically, by adding an isRoot flag bit to every node, the present embodiment adds only a small amount of memory per node; it is a space-for-time trade-off that yields a large improvement in time efficiency.
For example, in the present embodiment, the node data structure of the improved generalized list is set as follows.
In the present embodiment, if the current node of the improved generalized list is the root node (isRoot=true), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a first constructor (this.data=''). When a new node object (for example, a GLNode object) is created for the current node without a specified parameter being passed, the first constructor sets the data field attribute (the data attribute) of the new node object to the designated character '' by default. If the current node of the improved generalized list is a non-root node (isRoot=false), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a second constructor (this.data=data). The second constructor assigns the specified data value (for example, the data parameter) passed when the new node object (for example, a GLNode object) is created for the current node to the data field of the new node object.
Step S32: if the pending web page address is not found in the improved generalized list, insert the pending web page address into the improved generalized list and store the pending web page address in the to-be-crawled queue, where it waits for the web crawler to crawl the web page content.
Preferably, in the present embodiment, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address (for example, the URL string) into single letters, with each node in the improved generalized list storing one letter.
Step S33: if the pending web page address is found in the improved generalized list, refrain from storing the pending web page address in the to-be-crawled queue, so that the web page content is not crawled again.
Through steps S31-S33 described above, the web page address deduplication method proposed by the present invention performs URL deduplication using an improved, memory-based generalized list and is significantly better than traditional URL deduplication schemes in time efficiency. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) storing a web page address deduplication system 20, the web page address deduplication system 20 being executable by at least one processor 22 to cause the at least one processor 22 to perform the steps of the web page address deduplication method described above.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The preferred embodiments of the present invention have been described above with reference to the drawings; this does not limit the scope of the rights of the present invention. The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
Those skilled in the art can implement the present invention in many variant forms without departing from its scope and essence; for example, a feature of one embodiment can be used in another embodiment to obtain a further embodiment. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims (10)

1. An electronic device, characterized in that the electronic device comprises a memory, a processor, and a web page address processing system that is stored in the memory and can run on the processor, wherein the web page address processing system, when executed by the processor, implements the following steps:
sequentially reading a pending web page address and searching for the pending web page address in an improved generalized list, wherein every node in the improved generalized list has the same weight, and each node in the improved generalized list includes a flag bit used to determine whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing the pending web page address in the to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, stopping storage of the pending web page address in the to-be-crawled queue.
2. The electronic device according to claim 1, characterized in that the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
3. The electronic device according to claim 2, characterized in that, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor, wherein, when a newly created node object of the current node is not passed a specified parameter, the first constructor sets the data field attribute of the newly created node object to a specified character by default.
4. The electronic device according to claim 2, characterized in that, if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor, wherein the second constructor assigns the specified data value passed to a newly created node object of the current node to the data field of the newly created node object.
5. The electronic device according to claim 2, characterized in that inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into individual letters, wherein each node in the improved generalized list stores one letter.
6. A web page address deduplication method, applied to an electronic device, characterized in that the method comprises:
sequentially reading a pending web page address and searching for the pending web page address in an improved generalized list, wherein every node in the improved generalized list has the same weight, and each node in the improved generalized list includes a flag bit used to determine whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing the pending web page address in the to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, stopping storage of the pending web page address in the to-be-crawled queue.
7. The web page address deduplication method according to claim 6, characterized in that the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
8. The web page address deduplication method according to claim 7, characterized in that, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor, wherein, when a newly created node object of the current node is not passed a specified parameter, the first constructor sets the data field attribute of the newly created node object to a specified character by default; and
if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor, wherein the second constructor assigns the specified data value passed to a newly created node object of the current node to the data field of the newly created node object.
9. The web page address deduplication method according to claim 7, characterized in that inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into individual letters, wherein each node in the improved generalized list stores one letter.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a web page address processing system that is executable by at least one processor, so as to cause the at least one processor to perform the steps of the web page address deduplication method according to any one of claims 6-9.
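
For illustration only, the steps recited in the method claims might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the improved generalized list is modelled here as a character tree whose child pointers are held in a dictionary, the first and second flag values are assumed to be 1 and 0, the default root character is assumed to be "#", and the `terminal` marker is an addition of this sketch (the claims do not say how the end of a stored address is recognised). All class, method, and attribute names are hypothetical.

```python
from collections import deque

ROOT_FLAG = 1      # assumed encoding of the "first value" (root node)
NON_ROOT_FLAG = 0  # assumed encoding of the "second value" (non-root node)


class Node:
    """One node of the character tree; each node stores a single character."""

    def __init__(self, data="#", flag=NON_ROOT_FLAG):
        # With no argument the data field defaults to a fixed character
        # ("#" is an assumption); otherwise it takes the value passed in.
        self.data = data
        self.flag = flag
        self.children = {}     # pointer initialisation: no children yet
        self.terminal = False  # assumption of this sketch: end-of-address marker


class UrlDeduplicator:
    """Keeps the character tree and the queue of addresses still to be crawled."""

    def __init__(self):
        self.root = Node(flag=ROOT_FLAG)
        self.crawl_queue = deque()

    def contains(self, url):
        """Search for the address one character at a time, starting at the root."""
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

    def insert(self, url):
        """Split the address into characters and store one character per node."""
        node = self.root
        for ch in url:
            if ch not in node.children:
                node.children[ch] = Node(ch)
            node = node.children[ch]
        node.terminal = True

    def process(self, url):
        """Return True if the address was new and has been queued for crawling."""
        if self.contains(url):
            return False  # duplicate: do not store it in the crawl queue
        self.insert(url)
        self.crawl_queue.append(url)
        return True
```

Calling `process` on each address as it is read leaves every distinct address in `crawl_queue` exactly once. Addresses that share a prefix also share the nodes for that prefix, so they occupy less space in the tree than they would as separate full strings.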
CN201710954304.5A 2017-10-13 2017-10-13 Web page address De-weight method, electronic equipment and computer-readable recording medium Pending CN107844527A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710954304.5A CN107844527A (en) 2017-10-13 2017-10-13 Web page address De-weight method, electronic equipment and computer-readable recording medium
PCT/CN2018/076170 WO2019071896A1 (en) 2017-10-13 2018-02-10 Website duplicate removing method, electronic device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710954304.5A CN107844527A (en) 2017-10-13 2017-10-13 Web page address De-weight method, electronic equipment and computer-readable recording medium

Publications (1)

Publication Number Publication Date
CN107844527A (en) 2018-03-27

Family

ID=61661333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710954304.5A Pending CN107844527A (en) 2017-10-13 2017-10-13 Web page address De-weight method, electronic equipment and computer-readable recording medium

Country Status (2)

Country Link
CN (1) CN107844527A (en)
WO (1) WO2019071896A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101778041A (en) * 2009-12-31 2010-07-14 福建星网锐捷网络有限公司 Method, device and network equipment for path selection
CN103631839A (en) * 2013-06-27 2014-03-12 西南科技大学 Page region weight model implementation method
US9058392B1 (en) * 2012-03-22 2015-06-16 Google Inc. Client state result de-duping
CN104809182A (en) * 2015-04-17 2015-07-29 东南大学 Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)
CN104809190A (en) * 2015-04-21 2015-07-29 浙江大学 Database access method of tree-like structure data
CN107038179A (en) * 2016-08-23 2017-08-11 平安科技(深圳)有限公司 Item of information storage method and system

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN101620608A (en) * 2008-07-04 2010-01-06 全国组织机构代码管理中心 Information collection method and system
CN101458740A (en) * 2008-12-25 2009-06-17 东华大学 DNA computer generalized list data structure design method based on three-arm DNA molecule
US8463797B2 (en) * 2010-07-20 2013-06-11 Barracuda Networks Inc. Method for measuring similarity of diverse binary objects comprising bit patterns
CN102184227B (en) * 2011-05-10 2013-05-08 北京邮电大学 General crawler engine system used for WEB service and working method thereof

Non-Patent Citations (1)

Title
吴小惠 (Wu Xiaohui): "Improvement of the URL Deduplication Strategy for Distributed Web Crawlers" (《分布式网络爬虫URL去重策略的改进》) *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN108985759A (en) * 2018-06-15 2018-12-11 杭州复杂美科技有限公司 A kind of address generating method and system, equipment and storage medium encrypting currency
CN109657118A (en) * 2018-11-21 2019-04-19 安徽云融信息技术有限公司 A kind of the URL De-weight method and its system of distributed network crawler
CN110134768A (en) * 2019-05-13 2019-08-16 腾讯科技(深圳)有限公司 Processing method, device, equipment and the storage medium of text
CN110134768B (en) * 2019-05-13 2023-05-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2019071896A1 (en) 2019-04-18

Similar Documents

Publication Publication Date Title
CN104679778A (en) Search result generating method and device
CN107844634A (en) Polynary universal model platform modeling method, electronic equipment and computer-readable recording medium
CN101739292B (en) Based on isomeric group operation self-adapting dispatching method and the system of application characteristic
CN102648468B (en) Table search device, table search method, and table search system
CN107844527A (en) Web page address De-weight method, electronic equipment and computer-readable recording medium
CN107688789A (en) Document charts abstracting method, electronic equipment and computer-readable recording medium
CN108491420A (en) Configuration method, application server and the computer readable storage medium of web page crawl
CN107688651A (en) The emotion of news direction determination process, electronic equipment and computer-readable recording medium
CN107798106A (en) A kind of URL De-weight methods in distributed reptile system
CN110674427B (en) Method, device, equipment and storage medium for responding to webpage access request
CN103561083A (en) Data processing method for Internet of things
CN104424316A (en) Data storage method, data searching method, related device and system
CN108334549A (en) A kind of device data storage method, extracting method, storage platform and extraction platform
CN116489178B (en) Method and device for distributed storage of communication information
CN107256130B (en) Data store optimization method and system based on Cuckoo Hash calculation
CN108921193A (en) Picture input method, server and computer storage medium
CN115941708B (en) Cloud big data storage management method and device, electronic equipment and storage medium
CN116366603A (en) Method and device for determining active IPv6 address
CN109918277A (en) Electronic device, the evaluation method of system log cluster analysis result and storage medium
CN107688564A (en) Subject of news Corporate Identity method, electronic equipment and computer-readable recording medium
CN107679908A (en) Sales force's topic nonproductive poll method, electronic installation and storage medium
CN104965909B (en) A kind of request processing method of dynamic web content
CN106713440A (en) Data transmission method and device
CN106487771A (en) The acquisition methods of intrusion behavior and device
CN106776654A (en) A kind of data search method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180327