CN108228623A - Data processing method and client device - Google Patents
- Publication number
- CN108228623A (application CN201611159537.8A)
- Authority
- CN
- China
- Prior art keywords
- url
- client device
- template information
- regular expression
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Embodiments of the present invention provide a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression. The data processing method includes: a client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information; the client device converts the template information into a regular expression according to a preset rule; the client device obtains, from the URLs, a first target URL that matches the regular expression; and the client device adds the first target URL to a to-be-crawled queue.
Description
Technical field
The present invention relates to the field of client devices, and in particular to a data processing method and a client device.
Background
A web crawler is a program or script that automatically captures web information according to certain rules.
When a web crawler is used, the client device obtains the target domain name to be crawled, obtains all URLs under the target domain name, and adds all of them to the crawl queue for crawling.
In practice, however, the user may not need to crawl all URLs under the target domain name and may only want to crawl some of them, for example the URLs under certain subdirectories or subdomains. If the web crawler still crawls all URLs in this case, crawling efficiency is reduced.
Summary of the invention
Embodiments of the present invention provide a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression.
In view of this, an embodiment of the present invention provides a data processing method, including:
A client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
The client device converts the template information into a regular expression according to a preset rule;
The client device obtains, from the URLs, a first target URL that matches the regular expression;
The client device adds the first target URL to a to-be-crawled queue.
In some possible implementations, the template information includes a wildcard, and the client device converting the template information into a regular expression according to the preset rule includes:
The client device obtains the target domain name corresponding to the template information;
The client device determines the text characters of the template information;
The client device converts the wildcard in the template information into regular-expression characters;
The client device determines the regular expression according to the target domain name, the text characters, and the regular-expression characters.
In some possible implementations, the client device obtaining, from the URLs, the first target URL that matches the regular expression includes:
The client device obtains a candidate URL from the URLs;
The client device matches the full text of the candidate URL against the regular expression to obtain a target candidate URL;
If the target candidate URL has the same length as the candidate URL, the client device determines that the candidate URL is a first target URL.
In some possible implementations, after the client device obtains, from the URLs, the first target URL that matches the regular expression, the method further includes:
The client device obtains a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
The client device adds the second target URL to the to-be-crawled queue.
An embodiment of the present invention further provides a client device, including:
A first acquisition unit, configured to obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
A conversion unit, configured to convert the template information into a regular expression according to a preset rule;
A second acquisition unit, configured to obtain, from the URLs, a first target URL that matches the regular expression;
A first adding unit, configured to add the first target URL to a to-be-crawled queue.
In some possible implementations, the template information includes a wildcard, and the conversion unit includes:
A first acquisition module, configured to obtain the target domain name corresponding to the template information;
A first determining module, configured to determine the text characters of the template information;
A conversion module, configured to convert the wildcard in the template information into regular-expression characters;
A second determining module, configured to determine the regular expression according to the target domain name, the text characters, and the regular-expression characters.
In some possible implementations, the second acquisition unit includes:
A second acquisition module, configured to obtain a candidate URL from the URLs;
A matching module, configured to match the full text of the candidate URL against the regular expression to obtain a target candidate URL;
A third determining module, configured to determine that the candidate URL is a first target URL if the target candidate URL has the same length as the candidate URL.
In some possible implementations, the client device further includes:
A third acquisition unit, configured to obtain a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
A second adding unit, configured to add the second target URL to the to-be-crawled queue.
An embodiment of the present invention further provides a client device, including:
An input device, an output device, a processor, and a memory;
By calling operation instructions stored in the memory, the processor is configured to perform the following steps:
Obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
Convert the template information into a regular expression according to a preset rule;
Obtain, from the URLs, a first target URL that matches the regular expression;
Add the first target URL to a to-be-crawled queue.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
The client device obtains template information input by a user, where the template information describes a matching rule for URLs and the URLs are the URLs under the target domain name corresponding to the template information. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to a to-be-crawled queue. In the embodiments of the present invention, the user can therefore filter the URLs under the target domain name simply by inputting simple template information.
Description of the drawings
Fig. 1 is a flowchart of one embodiment of the method of the embodiments of the present invention;
Fig. 2 is a flowchart of another embodiment of the method of the embodiments of the present invention;
Fig. 3 is a structural diagram of one embodiment of the client device of the embodiments of the present invention;
Fig. 4 is a structural diagram of another embodiment of the client device of the embodiments of the present invention;
Fig. 5 is a structural diagram of another embodiment of the client device of the embodiments of the present invention.
Detailed description
Embodiments of the present invention provide a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the accompanying drawings of this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the term "include" and any variants thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
When a web crawler is used, the client device obtains the target domain name to be crawled, obtains all URLs under the target domain name, and adds all of them to the crawl queue for crawling. In practice, however, the user may not need to crawl all URLs and may only want to crawl the URLs under certain subdirectories or subdomains of the target domain name. Embodiments of the present invention provide a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression.
Referring to Fig. 1, one embodiment of the data processing method of the embodiments of the present invention includes:
101. The client device obtains template information input by a user.
The template information describes a matching rule for URLs. The URLs are the URLs under the target domain name corresponding to the template information and may be all or only some of the URLs under the target domain name; this is not limited here.
102. The client device converts the template information into a regular expression according to a preset rule.
Many programming languages support string operations based on regular expressions. For a user, however, regular expressions are relatively obscure and hard to read, so the user can instead input simpler template information, which the client device then converts into a regular expression according to the preset rule.
103. The client device obtains, from the URLs, a first target URL that matches the regular expression.
The client device can obtain all URLs under the target domain name according to the target domain name corresponding to the template information, match the URLs under the target domain name against the regular expression, and take the URLs that pass the match as first target URLs.
104. The client device adds the first target URL to a to-be-crawled queue.
The first target URL can be a single URL or multiple URLs. If no first target URL is successfully matched, the client device notifies the user that matching failed.
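Steps 103 and 104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the embodiment: the function name enqueue_matches, the use of Python's re.fullmatch for matching, collections.deque as the to-be-crawled queue, and the wording of the failure message are all hypothetical choices.

```python
import re
from collections import deque


def enqueue_matches(urls, regex, queue):
    """Add every URL fully matched by regex to queue; return the number added."""
    compiled = re.compile(regex)
    matched = [url for url in urls if compiled.fullmatch(url)]
    queue.extend(matched)
    if not matched:
        # Step 104: notify the user that nothing matched (message text is illustrative).
        print("Matching failed: no URL matches the template information.")
    return len(matched)


crawl_queue = deque()
added = enqueue_matches(
    ["http://www.gridsum.com/news", "http://www.gridsum.com/news/2016.html"],
    r"http://www\.gridsum\.com/[^/]*",
    crawl_queue,
)
```

Returning the count lets the caller decide how to surface a failed match; printing is used here only to keep the sketch short.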
In this embodiment, the client device obtains template information input by a user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to the to-be-crawled queue. By inputting simple template information, the user can filter the URLs under the target domain name and obtain the URLs to be crawled.
Referring to Fig. 2, another embodiment of the data processing method of the embodiments of the present invention includes:
201. The client device obtains template information input by a user.
The template information describes the matching rule for URLs under the target domain name and may follow wildcard usage rules: a single * represents all characters except /, and ** represents all characters including /. For example, the template information www.gridsum.com/* matches all URLs under the first-level subdirectories of www.gridsum.com, and www.gridsum.com/** matches all URLs under all subdirectories of www.gridsum.com. Depending on the client device's settings, both templates may also match www.gridsum.com itself; this is not limited here.
It should be noted that if the template information input by the user contains no wildcard, it can be defined as matching only the URL corresponding to the template information itself. For example, if the user inputs www.gridsum.com, only www.gridsum.com is matched. Alternatively, it can be defined as matching all URLs under www.gridsum.com. The choice depends on the client device's settings and is not limited here.
It should also be noted that the template information input by the user may or may not include the target domain name. If it does not, the target domain name can be set in advance. For example, if the user sets the target domain name to www.gridsum.com in advance and then inputs the directory information /products/* as the template information, the template information matches all URLs in the next-level subdirectory of www.gridsum.com/products.
It should also be noted that the user can input multiple pieces of template information, indicating that multiple matching rules need to be satisfied; this is not limited here.
It can be understood that, in this embodiment, the meaning of the template information need not be defined by the wildcard usage rules; it can instead be defined by a custom rule, by other rules, or by a combination of rules, which is not limited here.
202. The client device obtains the target domain name corresponding to the template information.
If the template information includes the target domain name, the client device obtains the target domain name from the template information; if it does not, the client device obtains the preset target domain name.
It should be noted that the client device can determine whether the target domain name is included by inspecting the symbols in the template information. For example, if the template information input by the user begins with /, the input is directory information and does not include the target domain name. In practice, other methods can also be used to make this determination; this is not limited here.
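The check in step 202 can be sketched as below. The function name split_template and its return shape are assumptions made for illustration; the only behaviour taken from the text is that a template beginning with / carries no domain name, in which case the preset target domain name is used, and that otherwise the characters before the first / are the domain name.

```python
from typing import Tuple


def split_template(template: str, preset_domain: str) -> Tuple[str, str]:
    """Return (target_domain, path_part) for a piece of template information."""
    if template.startswith("/"):
        # Directory information only: fall back to the preset target domain name.
        return preset_domain, template
    # Otherwise the characters before the first "/" are taken as the domain name.
    domain, slash, rest = template.partition("/")
    return domain, slash + rest


with_domain = split_template("www.gridsum.com/*", "ignored.example")
directory_only = split_template("/products/*", "www.gridsum.com")
```

Here ignored.example is a placeholder showing that the preset value is not consulted when the template already names a domain.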
203. The client device determines the text characters of the template information.
In the template information input by the user, text characters other than the wildcards may conflict with regular-expression characters, so these text characters must be distinguished from regular-expression characters. For example, the dot and the question mark are reserved characters in regular expressions; a regular-expression escape character can be added before each dot and question mark to indicate that it is a text character.
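The effect of escaping in step 203 can be seen in a short sketch. It uses Python's standard re.escape, which is one way to add the escape characters; the helper name escape_text and the sample string www.gridsum.com/search?q=1 are hypothetical and are not taken from the embodiment.

```python
import re


def escape_text(text):
    """Escape regex metacharacters so that text is matched literally."""
    return re.escape(text)


literal = "www.gridsum.com/search?q=1"
# Unescaped, "." matches any character and "?" makes the preceding "h" optional.
loose = re.compile(literal)
# Escaped, every character stands only for itself.
strict = re.compile(escape_text(literal))

loose_hits_wrong_text = loose.fullmatch("wwwXgridsumYcom/searcq=1") is not None
strict_hits_literal = strict.fullmatch(literal) is not None
```

Without escaping, the pattern accepts unrelated text and, because of the question mark, does not even match its own literal string.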
204. The client device converts the wildcards in the template information into regular-expression characters.
The same character can have different meanings under the wildcard rules and in regular expressions, so it must be converted. For example, the wildcard ** can be converted into .* in a regular expression.
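A possible conversion combining steps 203 and 204 is sketched below. The mapping of ** to .* comes from the text above, and the mapping of * to [^/]* comes from the application scenario later in this description; the function name template_to_regex and the split-based approach are assumptions, and other conversions are equally possible.

```python
import re

# Try "**" before "*" so a double asterisk is not read as two single ones.
_WILDCARD = re.compile(r"(\*\*|\*)")
_REGEX_FOR = {"**": ".*", "*": "[^/]*"}


def template_to_regex(template):
    """Convert wildcard template information into a regular-expression string."""
    pieces = []
    # re.split with a capturing group keeps the wildcards in the result list.
    for part in _WILDCARD.split(template):
        if part in _REGEX_FOR:
            pieces.append(_REGEX_FOR[part])   # wildcard -> regex characters
        else:
            pieces.append(re.escape(part))    # text characters -> escaped literal
    return "".join(pieces)


one_level = template_to_regex("www.gridsum.com/*")
any_depth = template_to_regex("www.gridsum.com/**")
```

With these two patterns, www.gridsum.com/news is accepted by both, while www.gridsum.com/news/2016.html is accepted only by the ** form.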
It can be understood that if the template information contains no wildcard, it is handled according to the client device's rule for template information without wildcards. For example, the client device can specify that template information without wildcards matches only the URL corresponding to the template information itself.
It should be noted that steps 202, 203, and 204 have no fixed execution order; another order can be used depending on the client device's settings.
205. The client device determines the regular expression according to the target domain name, the text characters, and the regular-expression characters.
As needed, the client device can also add a corresponding prefix, such as http://, before the regular expression. In addition, if the user has input multiple pieces of template information, that is, has indicated multiple matching rules, the resulting regular expressions can be joined with the regular-expression character |, which is equivalent to a logical OR.
206. The client device obtains a candidate URL from the URLs.
The client device can obtain the URLs under the target domain name from local data, request them from a server, or capture them from the network with a web crawler; this is not limited here. To obtain candidate URLs, the client device can take the URLs under the target domain name one at a time, each single URL serving as a candidate URL.
207. The client device matches the full text of the candidate URL against the regular expression to obtain a target candidate URL.
The regular expression includes at least one matching rule. The client device obtains, from the full text of the candidate URL, the characters that satisfy all matching rules of the regular expression, and generates the target candidate URL from them.
208. If the target candidate URL has the same length as the candidate URL, the client device determines that the candidate URL is a first target URL.
If the target candidate URL generated by the regular-expression match has the same length as the candidate URL before matching, the match is successful, and the client device determines that the candidate URL is a first target URL.
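Steps 207 and 208 can be sketched as below. The function name is_first_target is hypothetical, and using re.match, which anchors only at the start of the string, to produce the target candidate URL is an assumption about how the matched text is obtained.

```python
import re


def is_first_target(candidate_url, regex):
    """Return True if the text matched by regex is as long as the whole candidate URL."""
    match = re.match(regex, candidate_url)  # anchored at the start only
    if match is None:
        return False
    target_candidate = match.group(0)       # the matched portion of the URL
    # Step 208: equal lengths mean the whole candidate URL was consumed.
    return len(target_candidate) == len(candidate_url)


regex = r"http://www\.gridsum\.com/[^/]*"
whole = is_first_target("http://www.gridsum.com/news", regex)
partial = is_first_target("http://www.gridsum.com/news/2016.html", regex)
```

Python's re.fullmatch is a related built-in that requires the entire string to match; the length comparison is kept here because it is the check the text describes.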
209. The client device adds the first target URL to the to-be-crawled queue.
Step 209 is similar to step 104 of Fig. 1 and is not described again.
It can be understood that the first target URLs can be URLs that the user wants to crawl or URLs that the user does not want to crawl. In the latter case, after the client device obtains from the URLs the first target URLs that match the regular expression, the method further includes: the client device obtains second target URLs, which are the URLs other than the first target URLs, and adds the second target URLs to the to-be-crawled queue. This amounts to a negation.
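The negation can be sketched as a simple complement. The function name second_targets and the use of re.fullmatch to decide which URLs are first target URLs are assumptions made for this illustration.

```python
import re


def second_targets(urls, regex):
    """Return the URLs that are not first target URLs, preserving input order."""
    compiled = re.compile(regex)
    first = {url for url in urls if compiled.fullmatch(url)}
    return [url for url in urls if url not in first]


all_urls = [
    "http://www.gridsum.com/news",
    "http://www.gridsum.com/news/2016.html",
]
excluded_view = second_targets(all_urls, r"http://www\.gridsum\.com/[^/]*")
```

In this mode the template describes what to skip, and everything else is queued for crawling.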
In this embodiment, the client device obtains template information input by a user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule and obtains from the URLs a first target URL that matches the regular expression. By inputting simple template information, the user can filter the URLs under the target domain name.
Second, this embodiment details how the client device converts the template information into a regular expression according to the preset rule and how it obtains the first target URL, which makes the embodiment more specific.
Third, in this embodiment the second target URL can also be added to the to-be-crawled queue, which adds a further implementation of the embodiments of the present invention.
For ease of understanding, this embodiment is described below with reference to a specific application scenario.
User A inputs the template information www.gridsum.com/* on the client device. According to the wildcard usage rules and the client device's settings, www.gridsum.com/* indicates that the user wants to match all URLs under the first-level subdirectories of www.gridsum.com. After obtaining the template information, the client device converts it into a regular expression. The client device first obtains the target domain name corresponding to the template information: because the template information does not begin with the symbol /, the client device determines that the template information includes the target domain name and takes the characters before the symbol / as the target domain name, that is, www.gridsum.com. The dot has a special meaning in regular expressions, so an escape character must be added before each dot to indicate that it is a text character. The wildcard * is converted into the regular-expression characters [^/]*, which gives the regular expression www\.gridsum\.com/[^/]* for the template information. The client device then adds the preset prefix http://, and the final regular expression is http://www\.gridsum\.com/[^/]*.
The client device obtains all URLs under the target domain name, as shown in Table 1, and matches each URL in Table 1 against the regular expression in turn. The result of matching http://www.gridsum.com/news against the regular expression is still the URL itself, so the length is the same before and after matching, which indicates a successful match. The result of matching http://www.gridsum.com/news/2016.html is http://www.gridsum.com/news; the lengths before and after matching differ, which indicates an unsuccessful match. The remaining URLs are matched in the same way and are not described again. Finally, the client device adds the successfully matched URLs to the to-be-crawled queue. With the template information input, user A has filtered the URLs under the target domain name.
All URLs under www.gridsum.com |
http://www.gridsum.com/news |
http://www.gridsum.com/news/2016.html |
... |
Table 1
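The scenario can be replayed with the final regular expression and the two URLs listed in Table 1. This sketch is illustrative only: the variable names, the use of re.match to obtain the matched text, and collections.deque as the to-be-crawled queue are assumptions.

```python
import re
from collections import deque

FINAL_REGEX = r"http://www\.gridsum\.com/[^/]*"
TABLE_1 = [
    "http://www.gridsum.com/news",
    "http://www.gridsum.com/news/2016.html",
]

crawl_queue = deque()
results = {}
for url in TABLE_1:
    match = re.match(FINAL_REGEX, url)
    matched_text = match.group(0) if match else ""
    results[url] = matched_text
    # Queue the URL only when the matched text is as long as the URL itself.
    if match and len(matched_text) == len(url):
        crawl_queue.append(url)
```

The results dictionary records the matched text for each URL, which makes the length comparison from the scenario easy to inspect.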
The method embodiments and the application scenario of the present invention have been described above. The embodiments of the present invention are described in detail below from the perspective of the device.
Referring to Fig. 3, one embodiment of the client device of the embodiments of the present invention includes:
A first acquisition unit 301, configured to obtain template information input by a user; it can specifically be used to perform step 101 of Fig. 1, which is not described again.
A conversion unit 302, configured to convert the template information into a regular expression according to a preset rule; it can specifically be used to perform step 102 of Fig. 1, which is not described again.
A second acquisition unit 303, configured to obtain, from the URLs, a first target URL that matches the regular expression; it can specifically be used to perform step 103 of Fig. 1, which is not described again.
A first adding unit 304, configured to add the first target URL to the to-be-crawled queue; it can specifically be used to perform step 104 of Fig. 1, which is not described again.
In this embodiment, the first acquisition unit 301 of the client device obtains template information input by a user, the conversion unit 302 converts the template information into a regular expression according to a preset rule, the second acquisition unit 303 obtains from the URLs a first target URL that matches the regular expression, and the first adding unit 304 adds the first target URL to the to-be-crawled queue. By inputting simple template information, the user can filter the URLs under the target domain name.
Referring to Fig. 4, another embodiment of the client device of the embodiments of the present invention includes:
A first acquisition unit 401, configured to obtain template information input by a user; it can specifically be used to perform step 201 of Fig. 2, which is not described again.
A conversion unit 402, configured to convert the template information into a regular expression according to a preset rule. Specifically, the conversion unit 402 includes:
A first acquisition module 4021, configured to obtain the target domain name corresponding to the template information; it can specifically be used to perform step 202 of Fig. 2, which is not described again;
A first determining module 4022, configured to determine the text characters of the template information; it can specifically be used to perform step 203 of Fig. 2, which is not described again;
A conversion module 4023, configured to convert the wildcards in the template information into regular-expression characters; it can specifically be used to perform step 204 of Fig. 2, which is not described again;
A second determining module 4024, configured to determine the regular expression according to the target domain name, the text characters, and the regular-expression characters; it can specifically be used to perform step 205 of Fig. 2, which is not described again.
A second acquisition unit 403, configured to obtain, from the URLs, a first target URL that matches the regular expression. Specifically, the second acquisition unit 403 includes:
A second acquisition module 4031, configured to obtain a candidate URL from the URLs; it can specifically be used to perform step 206 of Fig. 2, which is not described again;
A matching module 4032, configured to match the full text of the candidate URL against the regular expression to obtain a target candidate URL; it can specifically be used to perform step 207 of Fig. 2, which is not described again;
A third determining module 4033, configured to determine that the candidate URL is a first target URL if the target candidate URL has the same length as the candidate URL; it can specifically be used to perform step 208 of Fig. 2, which is not described again.
A first adding unit 404, configured to add the first target URL to the to-be-crawled queue; it can specifically be used to perform step 209 of Fig. 2, which is not described again.
It should be noted that, in other implementations of this embodiment, the user may want to obtain the URLs other than the first target URLs, in which case the client device can include:
A third acquisition unit, configured to obtain a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
A second adding unit, configured to add the second target URL to the to-be-crawled queue.
In this embodiment, the first acquisition unit 401 of the client device obtains template information input by a user, the conversion unit 402 converts the template information into a regular expression according to a preset rule, and the second acquisition unit 403 obtains from the URLs a first target URL that matches the regular expression. By inputting simple template information, the user can filter the URLs under the target domain name.
Second, this embodiment details how the client device converts the template information into a regular expression through the first acquisition module 4021, the first determining module 4022, the conversion module 4023, and the second determining module 4024, and how it obtains the first target URL that matches the regular expression through the second acquisition module 4031, the matching module 4032, and the third determining module 4033, which further refines the implementation of the embodiments of the present invention.
Third, in this embodiment the client device can add the second target URL to the to-be-crawled queue through the third acquisition unit and the second adding unit, which enriches the implementations of the embodiments of the present invention.
The client device in the embodiments of the present invention has been described above from the perspective of modular functional entities. It is described below from the perspective of hardware processing.
Referring to Fig. 5, another embodiment of the client device in the embodiments of the present invention includes:
An input device 501, an output device 502, a processor 503, and a memory 504 (the client device may have one or more processors 503; Fig. 5 takes one processor 503 as an example). In some embodiments of the present invention, the input device 501, the output device 502, the processor 503, and the memory 504 can be connected by a bus or in other ways; Fig. 5 takes connection by a bus as an example.
By calling the operation instructions stored in the memory 504, the processor 503 is configured to perform the following steps:
Obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
Convert the template information into a regular expression according to a preset rule;
Obtain, from the URLs, a first target URL that matches the regular expression;
Add the first target URL to the to-be-crawled queue.
Specifically, the client device in this embodiment can be used to perform the operations performed by the client device in Fig. 1 to Fig. 4, which are not described again.
In this embodiment, the client device obtains template information input by a user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to the to-be-crawled queue. By inputting simple template information, the user can filter the URLs under the target domain name.
It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description,
The specific work process of device and unit can refer to the corresponding process in preceding method embodiment, and details are not described herein.
In several embodiments provided herein, it should be understood that disclosed system, device and method can be with
It realizes by another way.For example, the apparatus embodiments described above are merely exemplary, for example, the unit
It divides, only a kind of division of logic function can have other dividing mode, such as multiple units or component in actual implementation
It may be combined or can be integrated into another system or some features can be ignored or does not perform.Another point, it is shown or
The mutual coupling, direct-coupling or communication connection discussed can be the indirect coupling by some interfaces, device or unit
It closes or communicates to connect, can be electrical, machinery or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A data processing method, characterized by comprising:
A client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are URLs under a target domain name corresponding to the template information;
The client device converts the template information into a regular expression according to a preset rule;
The client device obtains, from the URLs, a first target URL that matches the regular expression;
The client device adds the first target URL to a to-be-crawled queue.
2. The data processing method according to claim 1, characterized in that the template information includes a wildcard, and the client device converting the template information into a regular expression according to a preset rule comprises:
The client device obtains the target domain name corresponding to the template information;
The client device determines the text characters of the template information;
The client device converts the wildcard in the template information into regular expression characters;
The client device determines the regular expression according to the target domain name, the text characters and the regular expression characters.
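The four steps of claim 2 can be sketched as below. The claim does not disclose the concrete preset rule or how the pieces are assembled, so the mapping `WILDCARD_TO_REGEX`, the function name `build_regex`, the separately supplied domain name, and the `https?://` prefix are all assumptions made for illustration.

```python
import re

# Hypothetical preset rule: each wildcard maps to regular expression characters.
WILDCARD_TO_REGEX = {"*": ".*"}


def build_regex(target_domain: str, template: str) -> str:
    """Assemble a regular expression from a domain name and a wildcard template."""
    pieces = []
    for ch in template:
        if ch in WILDCARD_TO_REGEX:
            # Wildcard: replace it with its regular expression characters.
            pieces.append(WILDCARD_TO_REGEX[ch])
        else:
            # Text character: escape it so it is matched literally.
            pieces.append(re.escape(ch))
    # Assumed layout: scheme, then the escaped domain name, then the converted template.
    return r"https?://" + re.escape(target_domain) + "".join(pieces)


regex = build_regex("example.com", "/news/*.html")
```

Escaping the text characters matters because a literal `.` in a domain name or file extension would otherwise match any character.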
3. The data processing method according to claim 1, characterized in that the client device obtaining, from the URLs, the first target URL that matches the regular expression comprises:
The client device obtains a candidate URL from the URLs;
The client device matches the full text of the candidate URL against the regular expression to obtain a target candidate URL;
If the target candidate URL and the candidate URL have the same length, the client device determines that the candidate URL is the first target URL.
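The length comparison in claim 3 can be read as a way to reject partial matches. A minimal sketch under that reading follows; the function name `is_first_target` and the example regular expression are assumptions, and the matched text returned by the regex engine stands in for the "target candidate URL".

```python
import re


def is_first_target(candidate: str, regex: str) -> bool:
    """Accept the candidate only if the matched text spans the whole candidate."""
    match = re.match(regex, candidate)  # anchored at the start, may stop early
    if match is None:
        return False
    target_candidate = match.group(0)   # the portion of the URL that matched
    # Equal lengths mean nothing was left over after the matched portion.
    return len(target_candidate) == len(candidate)


# Hypothetical regular expression of the kind a template conversion might yield.
regex = r"http://example\.com/news/.*\.html"
```

With this check, `http://example.com/news/1.html?page=2` is rejected even though its prefix matches. Python's `re.fullmatch` expresses a similar whole-string requirement more directly.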
4. The data processing method according to any one of claims 1 to 3, characterized in that, after the client device obtains, from the URLs, the first target URL that matches the regular expression, the method further comprises:
The client device obtains a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
The client device adds the second target URL to the to-be-crawled queue.
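Claim 4 amounts to splitting the URLs into those that match and those that do not, and queuing both groups. The sketch below illustrates this; the function name `enqueue_all` is hypothetical, and because the claim does not say how the two groups are ordered in the queue, placing the matched URLs first is an assumption.

```python
import re
from collections import deque


def enqueue_all(regex, urls):
    """Split urls into first and second target URLs and queue both groups."""
    pattern = re.compile(regex)
    first_targets = [u for u in urls if pattern.fullmatch(u)]
    # Second target URLs: every URL other than the first target URLs.
    second_targets = [u for u in urls if not pattern.fullmatch(u)]
    queue = deque(first_targets)    # assumed order: matched URLs first
    queue.extend(second_targets)
    return first_targets, second_targets, queue


first, second, queue = enqueue_all(
    r"http://example\.com/news/.*\.html",
    ["http://example.com/about.html", "http://example.com/news/1.html"],
)
```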
5. A client device, characterized by comprising:
A first acquisition unit, configured to obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are URLs under a target domain name corresponding to the template information;
A conversion unit, configured to convert the template information into a regular expression according to a preset rule;
A second acquisition unit, configured to obtain, from the URLs, a first target URL that matches the regular expression;
A first adding unit, configured to add the first target URL to a to-be-crawled queue.
6. The client device according to claim 5, characterized in that the template information includes a wildcard, and the conversion unit comprises:
A first acquisition module, configured to obtain the target domain name corresponding to the template information;
A first determining module, configured to determine the text characters of the template information;
A conversion module, configured to convert the wildcard in the template information into regular expression characters;
A second determining module, configured to determine the regular expression according to the target domain name, the text characters and the regular expression characters.
7. The client device according to claim 5, characterized in that the second acquisition unit comprises:
A second acquisition module, configured to obtain a candidate URL from the URLs;
A matching module, configured to match the full text of the candidate URL against the regular expression to obtain a target candidate URL;
A third determining module, configured to determine that the candidate URL is the first target URL if the target candidate URL and the candidate URL have the same length.
8. The client device according to any one of claims 5 to 7, characterized in that the client device further comprises:
A third acquisition unit, configured to obtain a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
A second adding unit, configured to add the second target URL to the to-be-crawled queue.
9. A client device, characterized by comprising:
An input device, an output device, a processor and a memory;
By calling operation instructions stored in the memory, the processor is configured to perform the following steps:
Obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are URLs under a target domain name corresponding to the template information;
Convert the template information into a regular expression according to a preset rule;
Obtain, from the URLs, a first target URL that matches the regular expression;
Add the first target URL to a to-be-crawled queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611159537.8A CN108228623B (en) | 2016-12-14 | 2016-12-14 | Data processing method and client device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611159537.8A CN108228623B (en) | 2016-12-14 | 2016-12-14 | Data processing method and client device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108228623A true CN108228623A (en) | 2018-06-29 |
CN108228623B CN108228623B (en) | 2021-12-24 |
Family
ID=62650489
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201611159537.8A | Active | CN108228623B (en) | 2016-12-14 | 2016-12-14 | Data processing method and client device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228623B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008390A (en) * | 2019-02-27 | 2019-07-12 | 深圳壹账通智能科技有限公司 | Appraisal procedure, device, computer equipment and the storage medium of application program |
CN110851746A (en) * | 2018-07-27 | 2020-02-28 | 北京国双科技有限公司 | Crawler seed generation method and device |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101452463A (en) * | 2007-12-05 | 2009-06-10 | 浙江大学 | Method and apparatus for directionally grabbing page resource |
CN101727447A (en) * | 2008-10-10 | 2010-06-09 | 浙江搜富网络技术有限公司 | Generation method and device of regular expression based on URL |
US20110078558A1 (en) * | 2009-09-30 | 2011-03-31 | International Business Machines Corporation | Method and system for identifying advertisement in web page |
CN103514189A (en) * | 2012-06-25 | 2014-01-15 | 上海博腾信息科技有限公司 | Implementing method for web crawler based on search engines |
CN102999549A (en) * | 2012-09-25 | 2013-03-27 | 金博 | Method for realizing web crawler tasks |
CN103226568A (en) * | 2013-03-12 | 2013-07-31 | 北京百度网讯科技有限公司 | Method and equipment for crawling page |
CN103793462A (en) * | 2013-12-02 | 2014-05-14 | 北京奇虎科技有限公司 | URL (uniform resource locator) purifying method and device |
WO2015081789A1 (en) * | 2013-12-02 | 2015-06-11 | 北京奇虎科技有限公司 | Url purification method and apparatus |
US20160306893A1 (en) * | 2013-12-02 | 2016-10-20 | Beijing Qihoo Technology Company Limited | Url purification method and url purification apparatus |
CN106161352A (en) * | 2015-03-31 | 2016-11-23 | 阿里巴巴集团控股有限公司 | A kind of matching process and client, server and matching unit |
CN105955984A (en) * | 2016-04-19 | 2016-09-21 | 中国银联股份有限公司 | Network data searching method based on crawler mode |
CN106021608A (en) * | 2016-06-22 | 2016-10-12 | 广东亿迅科技有限公司 | Distributed crawler system and implementing method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN108228623B (en) | 2021-12-24 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |