CN107844527A - Web page address deduplication method, electronic device and computer-readable storage medium - Google Patents
- Publication number
- CN107844527A (application CN201710954304.5A)
- Authority
- CN
- China
- Prior art keywords
- node
- web page
- page address
- generalized list
- present
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a web page address deduplication method comprising the steps of: reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list; if the pending web page address is not found in the improved generalized list, inserting it into the improved generalized list and storing it in a to-be-crawled queue; and if the pending web page address is found in the improved generalized list, not storing it in the to-be-crawled queue. The invention can improve the efficiency of web page address deduplication.
Description
Technical field
The present invention relates to the field of computer information technology, and in particular to a web page address deduplication method, an electronic device and a computer-readable storage medium.
Background
At present, the URL deduplication schemes commonly used by web crawlers are database-based deduplication and deduplication based on in-memory linked lists. These schemes work well when the number of stored URLs is small. Existing distributed crawlers, however, generally face very large numbers of stored URLs and need URL deduplication to remain efficient over long periods, whereas with the conventional schemes above there is a risk that efficiency drops sharply or the task stalls after the crawler has been running for a long time. The URL deduplication methods of the prior art are therefore not well designed and need improvement.
Summary of the invention
In view of this, the present invention proposes a web page address deduplication method, an electronic device and a computer-readable storage medium that perform URL deduplication using an improved, memory-based generalized list, which is significantly more time-efficient than conventional URL deduplication schemes.
First, to achieve the above object, the present invention proposes an electronic device comprising a memory, a processor, and a web page address deduplication system that is stored in the memory and executable on the processor. When executed by the processor, the web page address deduplication system implements the following steps:
reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list, wherein all nodes in the improved generalized list have the same weight, and each node in the improved generalized list includes a flag bit indicating whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing it in a to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, not storing the pending web page address in the to-be-crawled queue.
Preferably, the flag bit takes either a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
Preferably, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor; when a node object is newly created for the current node without passing a specified parameter, the first constructor sets the data field attribute of the newly created node object to a designated character by default.
Preferably, if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor; the second constructor assigns the specified data value passed when the node object is newly created for the current node to the data field of the newly created node object.
Preferably, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into individual characters, with each node in the improved generalized list storing one character.
In addition, to achieve the above object, the present invention also provides a web page address deduplication method applied to an electronic device, the method comprising:
reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list, wherein all nodes in the improved generalized list have the same weight, and each node in the improved generalized list includes a flag bit indicating whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing it in a to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, not storing the pending web page address in the to-be-crawled queue.
Preferably, the flag bit takes either a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be the root node, and traversal continues with the next node.
Preferably, if the current node of the improved generalized list is the root node, the data structure of the current node includes a pointer initialization operation and a first constructor; when a node object is newly created for the current node without passing a specified parameter, the first constructor sets the data field attribute of the newly created node object to a designated character by default; and
if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor; the second constructor assigns the specified data value passed when the node object is newly created for the current node to the data field of the newly created node object.
Preferably, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into individual characters, with each node in the improved generalized list storing one character.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a web page address deduplication system, the web page address deduplication system being executable by at least one processor so that the at least one processor performs the steps of the web page address deduplication method described above.
Compared with the prior art, the electronic device, web page address deduplication method and computer-readable storage medium proposed by the present invention perform URL deduplication using an improved, memory-based generalized list, which is significantly more time-efficient than conventional URL deduplication schemes. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
Brief description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device of the present invention;
Fig. 2 is a schematic diagram of the program modules of an embodiment of the web page address deduplication system in the electronic device of the present invention;
Fig. 3 is a flowchart of an embodiment of the web page address deduplication method of the present invention.
Reference numerals:
Electronic device | 2 |
Memory | 21 |
Processor | 22 |
Network interface | 23 |
Web page address deduplication system | 20 |
Search module | 201 |
Insertion module | 202 |
Deduplication module | 203 |
Process steps | S31-S33 |
The realization of the object, the functional characteristics and the advantages of the present invention will be further described with reference to the drawings in conjunction with the embodiments.
Detailed description
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only insofar as the combination can be implemented by those of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and falls outside the scope of protection claimed by the present invention.
It should further be noted that, in this document, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
First, the present invention proposes an electronic device 2.
Fig. 1 is a schematic diagram of an optional hardware architecture of the electronic device 2 of the present invention. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22 and a network interface 23 that are communicatively connected to one another via a system bus. It should be noted that Fig. 1 shows only the electronic device 2 with components 21-23; it should be understood that not all of the components shown are required, and that more or fewer components may be implemented instead.
The electronic device 2 may be a computing device such as a rack server, a blade server, a tower server or a cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or main memory of the electronic device 2. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the electronic device 2. The memory 21 may of course include both an internal storage unit of the electronic device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used to store the operating system and the various application software installed on the electronic device 2, such as the program code of the web page address deduplication system 20. The memory 21 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the web page address deduplication system 20.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 to an external data platform via a network and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth or Wi-Fi.
The application environment of the embodiments of the present invention and the hardware structure and functions of the related devices have now been described in detail. The embodiments of the present invention are presented below on the basis of this application environment and these devices.
Fig. 2 is a program module diagram of an embodiment of the web page address deduplication system 20 in the electronic device 2 of the present invention. In this embodiment, the web page address deduplication system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to carry out the present invention. For example, in Fig. 2 the web page address deduplication system 20 may be divided into a search module 201, an insertion module 202 and a deduplication module 203. A program module in the sense of the present invention is a series of computer program instruction segments capable of performing a specific function, and is better suited than a program for describing the execution of the web page address deduplication system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.
The search module 201 is used to read pending web page addresses (e.g. URLs) one at a time and to search for each pending web page address in the improved generalized list. In this embodiment, the URLs (Uniform Resource Locators) may be read one at a time from a web page log.
Preferably, in this embodiment, URL deduplication is performed using an improved, memory-based generalized list. All nodes in the improved generalized list (e.g. ordinary nodes and element nodes) have the same weight (equal status), and each node in the improved generalized list is given an additional flag bit indicating whether the current node is the root node. The flag bit is used to determine the starting node of each dynamic insertion operation.
In this embodiment, each node has a flag bit indicating whether the current node is the root node (e.g. an isRoot flag). The flag bit takes either a first value (e.g. true, representing 1) or a second value (e.g. false, representing 0). If the flag bit of the current node is detected to be the first value (e.g. true, representing 1), the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; if the flag bit of the current node is detected to be the second value (e.g. false, representing 0), the current node is determined not to be the root node, and traversal continues with the next node.
It should be noted that a conventional generalized list contains nodes of different weights (i.e. nodes of different status), such as ordinary nodes and element nodes, so the type of a node has to be determined whenever the generalized list is dynamically extended, pruned or modified, which costs additional time. In the generalized list as improved in this embodiment, all nodes have equal status, so no such extra check is needed and the list can be extended dynamically and efficiently to meet the needs of deduplication. More specifically, by adding an isRoot flag bit to each node, this embodiment adds only a small amount of memory per node; this is a space-for-time trade-off that greatly improves time efficiency.
For example, in this embodiment, the node data structure of the improved generalized list is set up as follows.
In this embodiment, if the current node of the improved generalized list is the root node (isRoot=true), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a first constructor (this.data=''); when a node object (e.g. a GLNode object) is newly created for the current node without passing a specified parameter, the first constructor sets the data field attribute (i.e. the data attribute) of the newly created node object to the designated character '' by default. If the current node of the improved generalized list is a non-root node (isRoot=false), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a second constructor (this.data=data); the second constructor assigns the specified data value (e.g. the data parameter) passed when the node object (e.g. a GLNode object) is newly created for the current node to the data field of the newly created node object.
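The node structure just described might be sketched as follows. This is only an illustrative sketch, not the patent's actual code: the names GLNode, data, head, tail and isRoot come from the description, while the use of Python and of a single constructor with default arguments (standing in for the first and second constructors, since Python has no constructor overloading) is an assumption.

```python
class GLNode:
    """Illustrative node of the improved generalized list (assumed layout)."""

    def __init__(self, data="", is_root=False):
        # Pointer initialization described in the text: head = tail = null.
        self.head = None
        self.tail = None
        # "First constructor": no argument passed, so data defaults to ''.
        # "Second constructor": the value passed in is assigned to data.
        self.data = data
        # Flag bit (isRoot in the description): True for the root node.
        self.is_root = is_root


# Root node: created without a data argument, flagged as root.
root = GLNode(is_root=True)
# Non-root node: created with the character it stores.
node = GLNode("h")
```

Every node has the same four fields, so code that extends the list never needs to check what kind of node it is handling, which is the property the description relies on.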
The insertion module 202 is used, if the pending web page address is not found in the improved generalized list, to insert the pending web page address into the improved generalized list and to store the pending web page address in the to-be-crawled queue, where it waits for the web crawler to fetch the web page content.
Preferably, in this embodiment, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address (e.g. the URL string) into individual characters, with each node in the improved generalized list storing one character.
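A minimal sketch of such a character-by-character insertion is given below. The description does not say how the head and tail pointers link the nodes, so this sketch assumes that head points to the first node of the next character position and tail points to an alternative character at the same position. The END terminator and the function name insert are likewise hypothetical; the terminator is added here so that a URL that is a prefix of another stays distinguishable.

```python
END = "\0"  # hypothetical end-of-URL marker, not part of the description


class GLNode:
    def __init__(self, data="", is_root=False):
        self.head = None   # assumed: first node of the next character position
        self.tail = None   # assumed: next alternative at the same position
        self.data = data
        self.is_root = is_root


def insert(root, url):
    """Store url in the list, one character per node."""
    if not root.is_root:
        raise ValueError("insertion must start from the root node")
    node = root
    for ch in url + END:
        # Scan the alternatives at this position for the character.
        child, prev = node.head, None
        while child is not None and child.data != ch:
            prev, child = child, child.tail
        if child is None:
            # Character not present yet: append a new node.
            child = GLNode(ch)
            if prev is None:
                node.head = child
            else:
                prev.tail = child
        node = child


root = GLNode(is_root=True)
insert(root, "ab")
insert(root, "ac")
```

After the two calls, the shared character "a" is stored once, and "b" and "c" sit side by side below it, so URLs with a common prefix share nodes.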
The deduplication module 203 is used, if the pending web page address is found in the improved generalized list, to refrain from storing the pending web page address in the to-be-crawled queue, so that the web page content is not fetched repeatedly.
Through the program modules 201-203 above, the web page address deduplication system 20 proposed by the present invention performs URL deduplication using an improved, memory-based generalized list, which is significantly more time-efficient than conventional URL deduplication schemes. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
In addition, the present invention also proposes a web page address deduplication method.
Fig. 3 is a flowchart of an embodiment of the web page address deduplication method of the present invention. In this embodiment, depending on requirements, the order in which the steps of the flowchart shown in Fig. 3 are executed may be changed, and some steps may be omitted.
Step S31: read pending web page addresses (e.g. URLs) one at a time and search for each pending web page address in the improved generalized list. In this embodiment, the URLs (Uniform Resource Locators) may be read one at a time from a web page log.
Preferably, in this embodiment, URL deduplication is performed using an improved, memory-based generalized list. All nodes in the improved generalized list (e.g. ordinary nodes and element nodes) have the same weight (equal status), and each node in the improved generalized list is given an additional flag bit indicating whether the current node is the root node. The flag bit is used to determine the starting node of each dynamic insertion operation.
In this embodiment, each node has a flag bit indicating whether the current node is the root node (e.g. an isRoot flag). The flag bit takes either a first value (e.g. true, representing 1) or a second value (e.g. false, representing 0). If the flag bit of the current node is detected to be the first value (e.g. true, representing 1), the current node is determined to be the root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; if the flag bit of the current node is detected to be the second value (e.g. false, representing 0), the current node is determined not to be the root node, and traversal continues with the next node.
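The lookup of step S31 could look roughly like the sketch below, which starts only from a node whose flag marks it as the root. As assumptions not stated in the description, head is taken to lead to the next character position, tail to an alternative character at the same position, and a hypothetical END marker closes each stored URL; the function name contains is also invented for illustration.

```python
END = "\0"  # hypothetical end-of-URL marker, not part of the description


class GLNode:
    def __init__(self, data="", is_root=False):
        self.head = None   # assumed: first node of the next character position
        self.tail = None   # assumed: next alternative at the same position
        self.data = data
        self.is_root = is_root


def contains(start, url):
    """Return True if url is stored in the list that begins at start."""
    if not start.is_root:
        # Flag has the second value: this is not a valid starting node.
        raise ValueError("search must start from the root node")
    node = start
    for ch in url + END:
        child = node.head
        while child is not None and child.data != ch:
            child = child.tail
        if child is None:
            return False  # character missing, so the URL was never stored
        node = child
    return True


# Hand-built list holding the single one-character string "a".
root = GLNode(is_root=True)
a = GLNode("a")
root.head = a
a.head = GLNode(END)

found = contains(root, "a")
missing = contains(root, "ab")
```

Under these assumptions the number of steps depends on the length of the URL and on the number of alternatives at each position, rather than on the total number of URLs stored so far.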
It should be noted that a conventional generalized list contains nodes of different weights (i.e. nodes of different status), such as ordinary nodes and element nodes, so the type of a node has to be determined whenever the generalized list is dynamically extended, pruned or modified, which costs additional time. In the generalized list as improved in this embodiment, all nodes have equal status, so no such extra check is needed and the list can be extended dynamically and efficiently to meet the needs of deduplication. More specifically, by adding an isRoot flag bit to each node, this embodiment adds only a small amount of memory per node; this is a space-for-time trade-off that greatly improves time efficiency.
For example, in this embodiment, the node data structure of the improved generalized list is set up as follows.
In this embodiment, if the current node of the improved generalized list is the root node (isRoot=true), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a first constructor (this.data=''); when a node object (e.g. a GLNode object) is newly created for the current node without passing a specified parameter, the first constructor sets the data field attribute (i.e. the data attribute) of the newly created node object to the designated character '' by default. If the current node of the improved generalized list is a non-root node (isRoot=false), the data structure of the current node includes, but is not limited to, a pointer initialization operation (head=tail=null) and a second constructor (this.data=data); the second constructor assigns the specified data value (e.g. the data parameter) passed when the node object (e.g. a GLNode object) is newly created for the current node to the data field of the newly created node object.
Step S32: if the pending web page address is not found in the improved generalized list, insert the pending web page address into the improved generalized list and store the pending web page address in the to-be-crawled queue, where it waits for the web crawler to fetch the web page content.
Preferably, in this embodiment, inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address (e.g. the URL string) into individual characters, with each node in the improved generalized list storing one character.
Step S33: if the pending web page address is found in the improved generalized list, do not store the pending web page address in the to-be-crawled queue, so that the web page content is not fetched repeatedly.
Through steps S31-S33 above, the web page address deduplication method proposed by the present invention performs URL deduplication using an improved, memory-based generalized list, which is significantly more time-efficient than conventional URL deduplication schemes. Furthermore, the approach is highly feasible in terms of space efficiency and can keep deduplication running efficiently over time without an obvious bottleneck.
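Steps S31-S33 taken together might be sketched as follows. This is a hedged illustration rather than the patented implementation: the search and the insertion are merged into a single pass here for brevity, whereas the description presents them as separate steps. The END marker, the meaning given to head and tail, and the names insert_if_absent and deduplicate are assumptions, and a collections.deque stands in for the to-be-crawled queue.

```python
from collections import deque

END = "\0"  # hypothetical end-of-URL marker, not part of the description


class GLNode:
    def __init__(self, data="", is_root=False):
        self.head = None   # assumed: first node of the next character position
        self.tail = None   # assumed: next alternative at the same position
        self.data = data
        self.is_root = is_root


def insert_if_absent(root, url):
    """Walk url through the list, adding missing nodes; True if url was new."""
    node, added = root, False
    for ch in url + END:
        child, prev = node.head, None
        while child is not None and child.data != ch:
            prev, child = child, child.tail
        if child is None:
            child = GLNode(ch)
            if prev is None:
                node.head = child
            else:
                prev.tail = child
            added = True
        node = child
    return added


def deduplicate(urls):
    """Return the to-be-crawled queue holding each distinct URL once."""
    root = GLNode(is_root=True)
    queue = deque()
    for url in urls:                      # S31: read one URL at a time
        if insert_if_absent(root, url):   # S32: not found, so insert ...
            queue.append(url)             # ... and enqueue for crawling
        # S33: found, so the URL is not enqueued again
    return queue


pending = ["http://a.com/", "http://b.com/", "http://a.com/", "http://a.com"]
to_crawl = deduplicate(pending)
```

In this sketch the second occurrence of "http://a.com/" is dropped, while "http://a.com" without the trailing slash is kept as a distinct address because its end marker sits at a different position.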
Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as a ROM/RAM, magnetic disk or optical disc) storing the web page address deduplication system 20, the web page address deduplication system 20 being executable by at least one processor 22 so that the at least one processor 22 performs the steps of the web page address deduplication method described above.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and can of course also be implemented in hardware, although in many cases the former is the preferred implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device or the like) to perform the methods described in the embodiments of the present invention.
The preferred embodiments of the present invention have been described above with reference to the drawings, without thereby limiting the scope of the rights of the present invention. The numbering of the embodiments above is for description only and does not indicate the relative merits of the embodiments. In addition, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one given here.
Those skilled in the art can implement the present invention in many variant ways without departing from its scope and essence; for example, a feature of one embodiment may be used in another embodiment to obtain a further embodiment. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (10)
1. An electronic device, characterized in that the electronic device comprises a memory, a processor, and a web page address deduplication system that is stored in the memory and executable on the processor, the web page address deduplication system implementing the following steps when executed by the processor:
reading pending web page addresses one at a time and searching for each pending web page address in an improved generalized list, wherein all nodes in the improved generalized list have the same weight, and each node in the improved generalized list includes a flag bit indicating whether the current node is the root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing it in a to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, not storing the pending web page address in the to-be-crawled queue.
2. The electronic device according to claim 1, characterized in that the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be a root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be a root node, and traversal continues to the next node.
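The flag-bit check of claim 2 might look like the sketch below. The concrete values `ROOT_FLAG = 1` and `NON_ROOT_FLAG = 0`, the `next` pointer, and the function name `find_start_node` are assumptions; the claim only requires two distinguishable values.

```python
ROOT_FLAG = 1      # assumed "first value": the node is the root
NON_ROOT_FLAG = 0  # assumed "second value": the node is not the root


class Node:
    def __init__(self, data, flag=NON_ROOT_FLAG):
        self.data = data
        self.flag = flag
        self.next = None  # next node in traversal order


def find_start_node(node):
    """Walk the chain until a node flagged as root is found.

    Returns that node, which serves as the starting point for insertion,
    or None if no node in the chain carries the root flag.
    """
    while node is not None:
        if node.flag == ROOT_FLAG:
            return node
        node = node.next  # not a root: continue with the next node
    return None
```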
3. The electronic device according to claim 2, characterized in that, if the current node of the improved generalized list is a root node, the data structure of the current node includes a pointer initialization operation and a first constructor, wherein, when a new node object is created for the current node without passing a specified parameter, the first constructor sets the data field attribute of the new node object to a designated character by default.
4. The electronic device according to claim 2, characterized in that, if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor, wherein the second constructor assigns the specified data value passed when the new node object of the current node is created to the data field of the new node object.
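The two constructors described in claims 3 and 4 can be illustrated as below. Python has no constructor overloading, so the sketch uses two factory methods instead; the class name `GLNode`, the default character `"#"`, and the `child`/`sibling` pointer names are assumptions made for the example.

```python
class GLNode:
    DEFAULT_CHAR = "#"  # assumed designated character for a root's data field

    def __init__(self, data, is_root):
        self.data = data
        self.is_root = is_root
        # Pointer initialization: no sublist and no sibling yet.
        self.child = None
        self.sibling = None

    @classmethod
    def new_root(cls):
        """Counterpart of the first constructor: no argument, default data."""
        return cls(cls.DEFAULT_CHAR, True)

    @classmethod
    def new_node(cls, value):
        """Counterpart of the second constructor: store the passed value."""
        return cls(value, False)
```

Keeping both paths behind one `__init__` means the pointer initialization is written once and applies to root and non-root nodes alike.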
5. The electronic device according to claim 2, characterized in that inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into single letters, with each node in the improved generalized list storing one letter.
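The character-by-character insertion of claim 5 can be sketched with a child/sibling node layout. The layout, the names `Node` and `insert_address`, and the returned count of created nodes are assumptions for illustration; marking where a stored address ends is omitted here.

```python
class Node:
    def __init__(self, data="#", is_root=False):
        self.data = data
        self.is_root = is_root
        self.child = None    # first node one level down
        self.sibling = None  # next alternative at the same level


def insert_address(root, url):
    """Insert url one character per node; return how many nodes were created."""
    created = 0
    parent = root
    for ch in list(url):  # split the string into single characters
        node = parent.child
        prev = None
        # Scan the siblings at this level for a node holding ch.
        while node is not None and node.data != ch:
            prev, node = node, node.sibling
        if node is None:  # character not present at this level: add it
            node = Node(ch)
            created += 1
            if prev is None:
                parent.child = node
            else:
                prev.sibling = node
        parent = node
    return created
```

Inserting `"a.com/x"` into an empty structure creates seven nodes; inserting `"a.com/y"` afterwards creates only one, because the first six characters are shared.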
6. A web page address deduplication method, applied to an electronic device, characterized in that the method comprises:
reading pending web page addresses one at a time and searching for the pending web page address in an improved generalized list, wherein every node in the improved generalized list has the same weight, and each node in the improved generalized list includes a flag bit for determining whether the current node is a root node;
if the pending web page address is not found in the improved generalized list, inserting the pending web page address into the improved generalized list and storing the pending web page address in a to-be-crawled queue; and
if the pending web page address is found in the improved generalized list, not storing the pending web page address in the to-be-crawled queue.
7. The web page address deduplication method according to claim 6, characterized in that the flag bit takes a first value or a second value;
if the flag bit of the current node is detected to be the first value, the current node is determined to be a root node, traversal starts from the current node, and the current node is taken as the starting node of the dynamic insertion operation; and
if the flag bit of the current node is detected to be the second value, the current node is determined not to be a root node, and traversal continues to the next node.
8. The web page address deduplication method according to claim 7, characterized in that, if the current node of the improved generalized list is a root node, the data structure of the current node includes a pointer initialization operation and a first constructor, wherein, when a new node object is created for the current node without passing a specified parameter, the first constructor sets the data field attribute of the new node object to a designated character by default; and
if the current node of the improved generalized list is a non-root node, the data structure of the current node includes a pointer initialization operation and a second constructor, wherein the second constructor assigns the specified data value passed when the new node object of the current node is created to the data field of the new node object.
9. The web page address deduplication method according to claim 7, characterized in that inserting the pending web page address into the improved generalized list comprises: splitting the character string corresponding to the pending web page address into single letters, with each node in the improved generalized list storing one letter.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a web page address deduplication system, and the web page address deduplication system is executable by at least one processor to cause the at least one processor to perform the steps of the web page address deduplication method according to any one of claims 6-9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710954304.5A CN107844527A (en) | 2017-10-13 | 2017-10-13 | Web page address De-weight method, electronic equipment and computer-readable recording medium |
PCT/CN2018/076170 WO2019071896A1 (en) | 2017-10-13 | 2018-02-10 | Website duplicate removing method, electronic device and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710954304.5A CN107844527A (en) | 2017-10-13 | 2017-10-13 | Web page address De-weight method, electronic equipment and computer-readable recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107844527A true CN107844527A (en) | 2018-03-27 |
Family ID=61661333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710954304.5A Pending CN107844527A (en) | 2017-10-13 | 2017-10-13 | Web page address De-weight method, electronic equipment and computer-readable recording medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107844527A (en) |
WO (1) | WO2019071896A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101778041A (en) * | 2009-12-31 | 2010-07-14 | 福建星网锐捷网络有限公司 | Method, device and network equipment for path selection |
CN103631839A (en) * | 2013-06-27 | 2014-03-12 | 西南科技大学 | Page region weight model implementation method |
US9058392B1 (en) * | 2012-03-22 | 2015-06-16 | Google Inc. | Client state result de-duping |
CN104809182A (en) * | 2015-04-17 | 2015-07-29 | 东南大学 | Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter) |
CN104809190A (en) * | 2015-04-21 | 2015-07-29 | 浙江大学 | Database access method of tree-like structure data |
CN107038179A (en) * | 2016-08-23 | 2017-08-11 | 平安科技(深圳)有限公司 | Item of information storage method and system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101620608A (en) * | 2008-07-04 | 2010-01-06 | 全国组织机构代码管理中心 | Information collection method and system |
CN101458740A (en) * | 2008-12-25 | 2009-06-17 | 东华大学 | DNA computer generalized list data structure design method based on three-arm DNA molecule |
US8463797B2 (en) * | 2010-07-20 | 2013-06-11 | Barracuda Networks Inc. | Method for measuring similarity of diverse binary objects comprising bit patterns |
CN102184227B (en) * | 2011-05-10 | 2013-05-08 | 北京邮电大学 | General crawler engine system used for WEB service and working method thereof |
2017-10-13: CN CN201710954304.5A, patent CN107844527A (en), active, Pending
2018-02-10: WO PCT/CN2018/076170, patent WO2019071896A1 (en), active, Application Filing
Non-Patent Citations (1)
Title |
---|
吴小惠: "《分布式网络爬虫URL去重策略的改进》", 《分布式网络爬虫URL去重策略的改进》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108985759A (en) * | 2018-06-15 | 2018-12-11 | 杭州复杂美科技有限公司 | A kind of address generating method and system, equipment and storage medium encrypting currency |
CN109657118A (en) * | 2018-11-21 | 2019-04-19 | 安徽云融信息技术有限公司 | A kind of the URL De-weight method and its system of distributed network crawler |
CN110134768A (en) * | 2019-05-13 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Processing method, device, equipment and the storage medium of text |
CN110134768B (en) * | 2019-05-13 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2019071896A1 (en) | 2019-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104679778A (en) | Search result generating method and device | |
CN107844634A (en) | Polynary universal model platform modeling method, electronic equipment and computer-readable recording medium | |
CN101739292B (en) | Based on isomeric group operation self-adapting dispatching method and the system of application characteristic | |
CN102648468B (en) | Table search device, table search method, and table search system | |
CN107844527A (en) | Web page address De-weight method, electronic equipment and computer-readable recording medium | |
CN107688789A (en) | Document charts abstracting method, electronic equipment and computer-readable recording medium | |
CN108491420A (en) | Configuration method, application server and the computer readable storage medium of web page crawl | |
CN107688651A (en) | The emotion of news direction determination process, electronic equipment and computer-readable recording medium | |
CN107798106A (en) | A kind of URL De-weight methods in distributed reptile system | |
CN110674427B (en) | Method, device, equipment and storage medium for responding to webpage access request | |
CN103561083A (en) | Data processing method for Internet of things | |
CN104424316A (en) | Data storage method, data searching method, related device and system | |
CN108334549A (en) | A kind of device data storage method, extracting method, storage platform and extraction platform | |
CN116489178B (en) | Method and device for distributed storage of communication information | |
CN107256130B (en) | Data store optimization method and system based on Cuckoo Hash calculation | |
CN108921193A (en) | Picture input method, server and computer storage medium | |
CN115941708B (en) | Cloud big data storage management method and device, electronic equipment and storage medium | |
CN116366603A (en) | Method and device for determining active IPv6 address | |
CN109918277A (en) | Electronic device, the evaluation method of system log cluster analysis result and storage medium | |
CN107688564A (en) | Subject of news Corporate Identity method, electronic equipment and computer-readable recording medium | |
CN107679908A (en) | Sales force's topic nonproductive poll method, electronic installation and storage medium | |
CN104965909B (en) | A kind of request processing method of dynamic web content | |
CN106713440A (en) | Data transmission method and device | |
CN106487771A (en) | The acquisition methods of intrusion behavior and device | |
CN106776654A (en) | A kind of data search method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180327 |