CN108228623A - Data processing method and client device

Info

Publication number
CN108228623A
Authority
CN
China
Prior art keywords
url
client device
template information
regular expression
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611159537.8A
Other languages
Chinese (zh)
Other versions
CN108228623B (en
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the present invention provides a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression. The data processing method includes: a client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information; the client device converts the template information into a regular expression according to a preset rule; the client device obtains, from the URLs, a first target URL that matches the regular expression; and the client device adds the first target URL to a crawl queue.

Description

Data processing method and client device
Technical field
The present invention relates to the client field, and in particular to a data processing method and a client device.
Background
A web crawler is a program or script that automatically fetches web information according to certain rules.
When a web crawler is used, the client device obtains the target domain name to be crawled, obtains all URLs under the target domain name, and adds all of them to the crawl queue for crawling.
In practice, however, the user may not need to crawl all URLs under the target domain name, but only some of them, such as the URLs under certain subdirectories or subdomains. If the web crawler still crawls all URLs in this case, crawling efficiency is reduced.
Summary of the invention
An embodiment of the present invention provides a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression.
In view of this, an embodiment of the present invention provides a data processing method, including:
A client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
The client device converts the template information into a regular expression according to a preset rule;
The client device obtains, from the URLs, a first target URL that matches the regular expression;
The client device adds the first target URL to a crawl queue.
In some possible implementations, the template information includes wildcards, and the client device converting the template information into a regular expression according to the preset rule includes:
The client device obtains the target domain name corresponding to the template information;
The client device determines the text characters of the template information;
The client device converts the wildcards in the template information into regular-expression characters;
The client device determines the regular expression according to the target domain name, the text characters, and the regular-expression characters.
In some possible implementations, the client device obtaining, from the URLs, the first target URL that matches the regular expression includes:
The client device obtains a candidate URL from the URLs;
The client device matches the full text of the candidate URL against the regular expression to obtain a target candidate URL;
If the target candidate URL has the same length as the candidate URL, the client device determines that the candidate URL is a first target URL.
In some possible implementations, after the client device obtains, from the URLs, the first target URL that matches the regular expression, the method further includes:
The client device obtains a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
The client device adds the second target URL to the crawl queue.
An embodiment of the present invention further provides a client device, including:
A first acquisition unit, configured to obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
A conversion unit, configured to convert the template information into a regular expression according to a preset rule;
A second acquisition unit, configured to obtain, from the URLs, a first target URL that matches the regular expression;
A first adding unit, configured to add the first target URL to a crawl queue.
In some possible implementations, the template information includes wildcards, and the conversion unit includes:
A first acquisition module, configured to obtain the target domain name corresponding to the template information;
A first determining module, configured to determine the text characters of the template information;
A conversion module, configured to convert the wildcards in the template information into regular-expression characters;
A second determining module, configured to determine the regular expression according to the target domain name, the text characters, and the regular-expression characters.
In some possible implementations, the second acquisition unit includes:
A second acquisition module, configured to obtain a candidate URL from the URLs;
A matching module, configured to match the full text of the candidate URL against the regular expression to obtain a target candidate URL;
A third determining module, configured to determine that the candidate URL is a first target URL if the target candidate URL has the same length as the candidate URL.
In some possible implementations, the client device further includes:
A third acquisition unit, configured to obtain a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
A second adding unit, configured to add the second target URL to the crawl queue.
An embodiment of the present invention further provides a client device, including:
An input device, an output device, a processor, and a memory;
By calling operation instructions stored in the memory, the processor is configured to perform the following steps:
Obtain template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
Convert the template information into a regular expression according to a preset rule;
Obtain, from the URLs, a first target URL that matches the regular expression;
Add the first target URL to a crawl queue.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
The client device obtains template information input by a user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to a crawl queue. In the embodiments of the present invention, the user can therefore filter the URLs under the target domain name simply by inputting template information.
Brief description of the drawings
Fig. 1 is a flowchart of one embodiment of the method of the embodiments of the present invention;
Fig. 2 is a flowchart of another embodiment of the method of the embodiments of the present invention;
Fig. 3 is a schematic structural diagram of one embodiment of the client device of the embodiments of the present invention;
Fig. 4 is a schematic structural diagram of another embodiment of the client device of the embodiments of the present invention;
Fig. 5 is a schematic structural diagram of yet another embodiment of the client device of the embodiments of the present invention.
Detailed description
An embodiment of the present invention provides a data processing method and a client device that allow a user to filter the URLs under a target domain name with a simple expression.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the term "include" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
When a web crawler is used, the client device obtains the target domain name to be crawled, obtains all URLs under the target domain name, and adds all of them to the crawl queue for crawling. In practice, however, the user may not need to crawl all URLs, but only the URLs under certain subdirectories or subdomains of the target domain name. The embodiments of the present invention therefore provide a data processing method and a client device that allow a user to filter the URLs under the target domain name with a simple expression.
Referring to Fig. 1, one embodiment of the data processing method of the embodiments of the present invention includes:
101. The client device obtains template information input by the user.
The template information describes a matching rule for URLs. The URLs are the URLs under the target domain name corresponding to the template information, and may be all or only some of the URLs under the target domain name; this is not limited here.
102. The client device converts the template information into a regular expression according to a preset rule.
Many programming languages support string operations based on regular expressions (also called regexes), but for users regular expressions are relatively obscure and hard to read. The user can therefore input simpler template information, which the client device then converts into a regular expression according to the preset rule.
103. The client device obtains, from the URLs, a first target URL that matches the regular expression.
According to the target domain name corresponding to the template information, the client device can obtain all URLs under the target domain name, match them against the regular expression, and take the URLs that match as first target URLs.
104. The client device adds the first target URL to the crawl queue.
There may be one first target URL or several. If no URL matches, the client device notifies the user that matching failed.
In this embodiment, the client device obtains template information input by the user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to the crawl queue. By inputting simple template information, the user can filter the URLs under the target domain name and obtain the URLs to be crawled.
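Steps 101 to 104 can be sketched roughly as follows. This is only an illustration in Python, not the patent's implementation: the function name `enqueue_matching_urls`, the `to_regex` parameter, and the use of a `deque` as the crawl queue are all assumptions, and the converter passed in the usage lines is a placeholder that treats the template as literal text.

```python
import re
from collections import deque
from typing import Callable, Iterable

def enqueue_matching_urls(
    template: str,
    urls: Iterable[str],
    to_regex: Callable[[str], str],
    crawl_queue: deque,
) -> int:
    """Hypothetical helper covering steps 102-104 for one template."""
    pattern = re.compile(to_regex(template))             # step 102: template -> regex
    matched = [u for u in urls if pattern.fullmatch(u)]  # step 103: keep matching URLs
    crawl_queue.extend(matched)                          # step 104: enqueue them
    if not matched:
        # The description says the user is notified when nothing matches.
        print("No URL matched the template.")
    return len(matched)

# Placeholder converter for this sketch only: re.escape treats the template
# as literal text, so only the identical URL matches.
queue = deque()
count = enqueue_matching_urls(
    "http://www.gridsum.com/news",
    ["http://www.gridsum.com/news", "http://www.gridsum.com/news/2016.html"],
    re.escape,
    queue,
)
```

Passing the converter as a parameter keeps the filtering and queueing logic independent of whichever preset conversion rule the client device uses.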
Referring to Fig. 2, another embodiment of the data processing method of the embodiments of the present invention includes:
201. The client device obtains template information input by the user.
The template information describes a matching rule for the URLs under the target domain name and can follow wildcard usage rules: a single * represents any characters except /, and ** represents any characters including /. For example, the template information www.gridsum.com/* matches all URLs in the first-level subdirectories of www.gridsum.com, and www.gridsum.com/** matches all URLs in all subdirectories of www.gridsum.com. Depending on the client device's settings, both templates may also match www.gridsum.com itself; this is not limited here.
It should be noted that if the template information contains no wildcard, it can be defined to match only the URL it spells out; for example, the template information www.gridsum.com then matches only www.gridsum.com. It can instead be defined to match all URLs under www.gridsum.com. The choice depends on the client device's settings and is not limited here.
It should also be noted that the template information may or may not include the target domain name. If it does not, the target domain name can be set in advance. For example, if the user presets the target domain name www.gridsum.com and inputs the directory information /products/* as the template information, the template information matches all URLs in the next-level subdirectory of www.gridsum.com/products.
It should also be noted that the user can input multiple pieces of template information, indicating that multiple matching rules are to be applied; this is not limited here.
It can be understood that, in this embodiment, the meaning of the template information need not be defined by wildcard usage rules; it can also be defined by custom rules, other rules, or a combination of rules, which is not limited here.
202. The client device obtains the target domain name corresponding to the template information.
If the template information includes the target domain name, the client device extracts it from the template information; otherwise, the client device uses the preset target domain name.
It should be noted that the client device can determine whether the template information includes the target domain name by inspecting its symbols. For example, if the template information input by the user begins with /, the input is directory information and does not include the target domain name. In practice, other methods can also be used; this is not limited here.
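The leading-slash check in step 202 might look like the following Python sketch. The function name `split_template` and its `preset_domain` parameter are hypothetical, and treating everything before the first / as the domain follows the application scenario described later rather than a rule the text states in general.

```python
from typing import Optional, Tuple

def split_template(template: str, preset_domain: Optional[str] = None) -> Tuple[str, str]:
    """Return (target_domain, path_template) for a user-entered template."""
    if template.startswith("/"):
        # Directory-only template: fall back to the preset target domain.
        if preset_domain is None:
            raise ValueError("template has no domain and no preset domain is set")
        return preset_domain, template
    # Otherwise the text before the first "/" is taken as the target domain.
    domain, slash, rest = template.partition("/")
    return domain, slash + rest

with_domain = split_template("www.gridsum.com/*")
directory_only = split_template("/products/*", preset_domain="www.gridsum.com")
```

Here `with_domain` is `("www.gridsum.com", "/*")` and `directory_only` is `("www.gridsum.com", "/products/*")`.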
203. The client device determines the text characters of the template information.
Apart from wildcards, the other text characters in the template information may conflict with regular-expression characters, so they need to be distinguished from them. For example, the dot and the question mark are reserved characters in regular expressions; adding the regular-expression escape character before a dot or question mark marks it as a literal text character.
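The effect of escaping can be illustrated with Python's standard `re.escape`, which is one possible way to perform step 203 and is not named in the source. The sample strings `literal` and `look_alike` are made up for the demonstration.

```python
import re

literal = "www.gridsum.com/a?b"               # text containing "." and "?"
unescaped = re.compile(literal)               # "." and "?" keep their regex meaning
escaped = re.compile(re.escape(literal))      # every character is taken literally

# Unescaped, "." matches any character and "?" makes the preceding "a" optional,
# so an unrelated string is accepted while the literal text itself is rejected.
look_alike = "wwwXgridsumYcom/b"
accepts_look_alike = unescaped.fullmatch(look_alike) is not None
rejects_itself = unescaped.fullmatch(literal) is None

# Escaped, only the exact text matches.
escaped_is_exact = (
    escaped.fullmatch(literal) is not None
    and escaped.fullmatch(look_alike) is None
)
```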
204. The client device converts the wildcards in the template information into regular-expression characters.
The same character has different meanings under the wildcard rules and in regular expressions, so it must be converted. For example, the wildcard ** can be converted into .* in a regular expression.
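A minimal sketch of the wildcard conversion, combined with the escaping of step 203, is given below in Python. The mapping of ** to .* comes from the text above, and the mapping of * to [^/]* comes from the application scenario later in the description; the function name `template_to_regex` and the tokenising approach are assumptions.

```python
import re

# Wildcard -> regular-expression fragment.
WILDCARD_TO_REGEX = {"**": ".*", "*": "[^/]*"}

def template_to_regex(template: str) -> str:
    """Convert wildcards and escape every other piece of text."""
    # Split on wildcards but keep them; "**" is tried before "*" so a double
    # asterisk is never read as two single ones.
    pieces = re.split(r"(\*\*|\*)", template)
    return "".join(WILDCARD_TO_REGEX.get(p, re.escape(p)) for p in pieces)

single = template_to_regex("www.gridsum.com/*")
double = template_to_regex("www.gridsum.com/**")

one_level_ok = re.fullmatch(single, "www.gridsum.com/news") is not None
one_level_deep = re.fullmatch(single, "www.gridsum.com/news/2016.html") is not None
all_levels_deep = re.fullmatch(double, "www.gridsum.com/news/2016.html") is not None
```

With these rules `one_level_ok` and `all_levels_deep` are true, while `one_level_deep` is false because a single * does not cross a /.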
It can be understood that if the template information contains no wildcard, it is handled according to the rule the client device has set for such template information; for example, the client device can specify that template information without wildcards matches only the URL it spells out.
It should be noted that steps 202, 203, and 204 have no fixed execution order; another order can be used depending on the client device's settings.
205. The client device determines the regular expression according to the target domain name, the text characters, and the regular-expression characters.
As needed, the client device can also add a prefix, such as http://, before the regular expression. In addition, if the user has input multiple pieces of template information, that is, multiple matching rules, the resulting regular expressions can be joined with the regular-expression character |, which is equivalent to a logical OR.
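The prefixing and joining described in step 205 could be sketched as follows in Python. The function name `combine_patterns` and the two sample patterns are hypothetical; wrapping each alternative in a non-capturing group is a defensive choice of this sketch, not something the source specifies.

```python
import re
from typing import Iterable

def combine_patterns(patterns: Iterable[str], prefix: str = "http://") -> str:
    """Prefix each regular expression and join the results with "|" (logical OR)."""
    escaped_prefix = re.escape(prefix)
    # (?:...) keeps each alternative self-contained even if it contains "|".
    return "|".join(f"(?:{escaped_prefix}{p})" for p in patterns)

combined = re.compile(combine_patterns([
    r"www\.gridsum\.com/news/[^/]*",
    r"www\.gridsum\.com/products/.*",
]))

news_hit = combined.fullmatch("http://www.gridsum.com/news/2016.html") is not None
products_hit = combined.fullmatch("http://www.gridsum.com/products/a/b") is not None
other_miss = combined.fullmatch("http://www.gridsum.com/about") is None
```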
206. The client device obtains a candidate URL from the URLs.
The client device can obtain the URLs under the target domain name from local data, request them from a server, or fetch them from the network with a web crawler; this is not limited here. To obtain candidate URLs, the client device can take the URLs under the target domain name one at a time, each as a candidate URL.
207. The client device matches the full text of the candidate URL against the regular expression to obtain a target candidate URL.
The regular expression contains at least one matching rule. From the full text of the candidate URL, the client device extracts the characters that satisfy all matching rules of the regular expression and generates the target candidate URL from them.
208. If the target candidate URL has the same length as the candidate URL, the client device determines that the candidate URL is a first target URL.
If the target candidate URL produced by matching has the same length as the candidate URL before matching, the match succeeded, and the client device determines that the candidate URL is a first target URL.
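The length comparison of steps 207 and 208 can be sketched in Python as below. The function name `is_first_target` is hypothetical, and using `re.match` to produce the target candidate URL is an assumption about how the matched text is obtained.

```python
import re

def is_first_target(candidate_url: str, regex: str) -> bool:
    """Steps 207-208: match from the start, then compare lengths."""
    m = re.match(regex, candidate_url)
    if m is None:
        return False
    target_candidate = m.group(0)               # the text the pattern consumed
    return len(target_candidate) == len(candidate_url)

regex = r"http://www\.gridsum\.com/[^/]*"
news = is_first_target("http://www.gridsum.com/news", regex)
article = is_first_target("http://www.gridsum.com/news/2016.html", regex)
```

Here `news` is true and `article` is false. In Python, `re.fullmatch` expresses the same intent directly and is more robust when the pattern contains alternatives, because `re.match` may stop at a shorter alternative (for example, `re.match("a|ab", "ab")` consumes only `a`).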
209. The client device adds the first target URL to the crawl queue.
Step 209 is similar to step 104 of Fig. 1 and is not repeated here.
It can be understood that the first target URL can be a URL the user wants to crawl or one the user does not want to crawl. In the latter case, after obtaining the first target URL that matches the regular expression, the client device also obtains a second target URL, which is a URL among the URLs other than the first target URL, and adds the second target URL to the crawl queue. This amounts to negating the match.
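The negation can be sketched as a simple partition of the URL list. This Python example is illustrative only; the function name `partition_urls` and the sample URL list are assumptions.

```python
import re
from typing import Iterable, List, Tuple

def partition_urls(urls: Iterable[str], regex: str) -> Tuple[List[str], List[str]]:
    """Split URLs into (first target URLs, second target URLs)."""
    pattern = re.compile(regex)
    first: List[str] = []    # URLs that match the regular expression
    second: List[str] = []   # all remaining URLs (the negated selection)
    for url in urls:
        (first if pattern.fullmatch(url) else second).append(url)
    return first, second

first_targets, second_targets = partition_urls(
    ["http://www.gridsum.com/news", "http://www.gridsum.com/news/2016.html"],
    r"http://www\.gridsum\.com/[^/]*",
)
# When the template describes URLs the user wants to skip, second_targets is
# the list that would be added to the crawl queue.
```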
In this embodiment, the client device obtains template information input by the user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule and obtains from the URLs a first target URL that matches the regular expression. By inputting simple template information, the user can filter the URLs under the target domain name.
Second, this embodiment details how the client device converts the template information into a regular expression according to the preset rule and how it obtains the first target URL, which makes the steps of this embodiment more specific.
Third, in this embodiment the second target URL can also be added to the crawl queue, which adds a further implementation of the embodiments of the present invention.
For ease of understanding, this embodiment is described below with reference to a specific application scenario.
User A inputs the template information www.gridsum.com/* on the client device. According to the wildcard usage rules and the client device's settings, www.gridsum.com/* means that the user wants to match all URLs in the first-level subdirectories of www.gridsum.com. After obtaining the template information, the client device converts it into a regular expression. The client device first obtains the target domain name corresponding to the template information: because the template information does not begin with the symbol /, the client device determines that it includes the target domain name and takes the characters before the first / as the target domain name, that is, www.gridsum.com. The dot has a special meaning in regular expressions, so an escape character must be added before each dot to mark it as a literal text character. The wildcard * is converted into the regular-expression fragment [^/]*, which gives the regular expression www\.gridsum\.com/[^/]* for the template information. The client device then adds the preset prefix http:// and obtains the final regular expression http://www\.gridsum\.com/[^/]*. The client device obtains all URLs under the target domain name, as shown in Table 1, and matches each URL in Table 1 against the regular expression in turn. For http://www.gridsum.com/news, the result of matching is still the URL itself, so the length is unchanged before and after matching and the match succeeds. For http://www.gridsum.com/news/2016.html, the result of matching is http://www.gridsum.com/news, so the lengths before and after matching differ and the match fails. The remaining URLs are matched in the same way and are not described again. Finally, the client device adds the successfully matched URLs to the crawl queue. Through the template information it inputs, user A thus filters the URLs under the target domain name.
All URLs under www.gridsum.com
http://www.gridsum.com/news
http://www.gridsum.com/news/2016.html
……
Table 1
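The matching results described for Table 1 can be reproduced with the short Python sketch below. It uses the final regular expression from the scenario with the escape characters written out, and only the two URLs that Table 1 lists explicitly; the variable names are assumptions.

```python
import re

# Final regular expression from the scenario for the template www.gridsum.com/*
pattern = re.compile(r"http://www\.gridsum\.com/[^/]*")

table_1 = [
    "http://www.gridsum.com/news",
    "http://www.gridsum.com/news/2016.html",
]

crawl_queue = []
for url in table_1:
    m = pattern.match(url)
    matched_text = m.group(0) if m else ""    # text consumed by the pattern
    if len(matched_text) == len(url):         # unchanged length means success
        crawl_queue.append(url)
    print(f"{url} -> {matched_text}")
```

Only http://www.gridsum.com/news keeps its full length after matching, so it is the only URL added to `crawl_queue`.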
The above describes the method embodiments and the application scenario of the embodiments of the present invention. The embodiments are described in detail below from the perspective of the device.
Referring to Fig. 3, one embodiment of the client device of the embodiments of the present invention includes:
A first acquisition unit 301, configured to obtain template information input by the user; it may specifically be used to perform step 101 of Fig. 1, and details are not repeated here.
A conversion unit 302, configured to convert the template information into a regular expression according to a preset rule; it may specifically be used to perform step 102 of Fig. 1, and details are not repeated here.
A second acquisition unit 303, configured to obtain, from the URLs, a first target URL that matches the regular expression; it may specifically be used to perform step 103 of Fig. 1, and details are not repeated here.
A first adding unit 304, configured to add the first target URL to the crawl queue; it may specifically be used to perform step 104 of Fig. 1, and details are not repeated here.
In this embodiment, the first acquisition unit 301 of the client device obtains template information input by the user, the conversion unit 302 converts the template information into a regular expression according to a preset rule, the second acquisition unit 303 obtains from the URLs a first target URL that matches the regular expression, and the first adding unit 304 adds the first target URL to the crawl queue. By inputting simple template information, the user can filter the URLs under the target domain name.
Referring to Fig. 4, another embodiment of the client device of the embodiments of the present invention includes:
A first acquisition unit 401, configured to obtain template information input by the user; it may specifically be used to perform step 201 of Fig. 2, and details are not repeated here.
A conversion unit 402, configured to convert the template information into a regular expression according to a preset rule. Specifically, the conversion unit 402 includes:
A first acquisition module 4021, configured to obtain the target domain name corresponding to the template information; it may specifically be used to perform step 202 of Fig. 2, and details are not repeated here;
A first determining module 4022, configured to determine the text characters of the template information; it may specifically be used to perform step 203 of Fig. 2, and details are not repeated here;
A conversion module 4023, configured to convert the wildcards in the template information into regular-expression characters; it may specifically be used to perform step 204 of Fig. 2, and details are not repeated here;
A second determining module 4024, configured to determine the regular expression according to the target domain name, the text characters, and the regular-expression characters; it may specifically be used to perform step 205 of Fig. 2, and details are not repeated here.
A second acquisition unit 403, configured to obtain, from the URLs, a first target URL that matches the regular expression. Specifically, the second acquisition unit 403 includes:
A second acquisition module 4031, configured to obtain a candidate URL from the URLs; it may specifically be used to perform step 206 of Fig. 2, and details are not repeated here;
A matching module 4032, configured to match the full text of the candidate URL against the regular expression to obtain a target candidate URL; it may specifically be used to perform step 207 of Fig. 2, and details are not repeated here;
A third determining module 4033, configured to determine that the candidate URL is a first target URL if the target candidate URL has the same length as the candidate URL; it may specifically be used to perform step 208 of Fig. 2, and details are not repeated here.
A first adding unit 404, configured to add the first target URL to the crawl queue; it may specifically be used to perform step 209 of Fig. 2, and details are not repeated here.
It should be noted that, in other implementations of this embodiment, the user may want the URLs other than the first target URL. In that case the client device can further include:
A third acquisition unit, configured to obtain a second target URL, where the second target URL is a URL among the URLs other than the first target URL;
A second adding unit, configured to add the second target URL to the crawl queue.
In this embodiment, the first acquisition unit 401 of the client device obtains template information input by the user, the conversion unit 402 converts the template information into a regular expression according to a preset rule, and the second acquisition unit 403 obtains from the URLs a first target URL that matches the regular expression. By inputting simple template information, the user can filter the URLs under the target domain name.
Second, this embodiment details how the client device converts the template information into a regular expression through the first acquisition module 4021, the first determining module 4022, the conversion module 4023, and the second determining module 4024, and how it obtains the first target URL that matches the regular expression through the second acquisition module 4031, the matching module 4032, and the third determining module 4033, which further completes the implementations of the embodiments of the present invention.
Third, in this embodiment the client device can add the second target URL to the crawl queue through the third acquisition unit and the second adding unit, which enriches the implementations of the embodiments of the present invention.
The client device in the embodiments of the present invention has been described above from the perspective of modular functional entities. It is described below from the perspective of hardware processing.
Referring to Fig. 5, another embodiment of the client device in the embodiments of the present invention includes:
An input device 501, an output device 502, a processor 503, and a memory 504 (the client device may have one or more processors 503; Fig. 5 takes one processor 503 as an example). In some embodiments of the present invention, the input device 501, the output device 502, the processor 503, and the memory 504 can be connected by a bus or in other ways; Fig. 5 takes a bus connection as an example.
By calling operation instructions stored in the memory 504, the processor 503 is configured to perform the following steps:
Obtain template information input by the user, where the template information describes a matching rule for URLs, and the URLs are the URLs under the target domain name corresponding to the template information;
Convert the template information into a regular expression according to a preset rule;
Obtain, from the URLs, a first target URL that matches the regular expression;
Add the first target URL to the crawl queue.
Specifically, the client device in this embodiment can be used to perform the operations performed by the client device in Fig. 1 to Fig. 4; details are not repeated here.
In this embodiment, the client device obtains template information input by the user, where the template information describes a matching rule for URLs. The client device converts the template information into a regular expression according to a preset rule, obtains from the URLs a first target URL that matches the regular expression, and adds the first target URL to the crawl queue. By inputting simple template information, the user can filter the URLs under the target domain name.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A data processing method, characterized by comprising:
a client device obtaining template information input by a user, the template information being used to describe a matching rule for urls, the urls being the urls under the target domain name corresponding to the template information;
the client device converting the template information into a regular expression according to a preset rule;
the client device obtaining, from the urls, a first target url that matches the regular expression;
the client device adding the first target url to a queue to be crawled.
2. The data processing method according to claim 1, characterized in that the template information includes a wildcard, and the client device converting the template information into a regular expression according to a preset rule comprises:
the client device obtaining the target domain name corresponding to the template information;
the client device determining the text characters of the template information;
the client device converting the wildcard in the template information into regular-expression characters;
the client device determining the regular expression according to the target domain name, the text characters, and the regular-expression characters.
3. The data processing method according to claim 1, characterized in that the client device obtaining, from the urls, a first target url that matches the regular expression comprises:
the client device obtaining a candidate url from the urls;
the client device matching the full text of the candidate url against the regular expression to obtain a target candidate url;
if the target candidate url has the same length as the candidate url, the client device determining that the candidate url is the first target url.
4. The data processing method according to any one of claims 1 to 3, characterized in that, after the client device obtains, from the urls, the first target url that matches the regular expression, the method further comprises:
the client device obtaining a second target url, the second target url being a url among the urls other than the first target url;
the client device adding the second target url to the queue to be crawled.
5. A client device, characterized by comprising:
a first acquisition unit, configured to obtain template information input by a user, the template information being used to describe a matching rule for urls, the urls being the urls under the target domain name corresponding to the template information;
a conversion unit, configured to convert the template information into a regular expression according to a preset rule;
a second acquisition unit, configured to obtain, from the urls, a first target url that matches the regular expression;
a first adding unit, configured to add the first target url to a queue to be crawled.
6. The client device according to claim 5, characterized in that the template information includes a wildcard, and the conversion unit comprises:
a first acquisition module, configured to obtain the target domain name corresponding to the template information;
a first determining module, configured to determine the text characters of the template information;
a conversion module, configured to convert the wildcard in the template information into regular-expression characters;
a second determining module, configured to determine the regular expression according to the target domain name, the text characters, and the regular-expression characters.
7. The client device according to claim 5, characterized in that the second acquisition unit comprises:
a second acquisition module, configured to obtain a candidate url from the urls;
a matching module, configured to match the full text of the candidate url against the regular expression to obtain a target candidate url;
a third determining module, configured to determine that the candidate url is the first target url if the target candidate url has the same length as the candidate url.
8. The data processing method according to any one of claims 5 to 7, characterized in that the client device further comprises:
a third acquisition unit, configured to obtain a second target url, the second target url being a url among the urls other than the first target url;
a second adding unit, configured to add the second target url to the queue to be crawled.
9. A client device, characterized by comprising:
an input device, an output device, a processor, and a memory;
wherein, by calling the operation instructions stored in the memory, the processor is configured to perform the following steps:
obtaining template information input by a user, the template information being used to describe a matching rule for urls, the urls being the urls under the target domain name corresponding to the template information;
converting the template information into a regular expression according to a preset rule;
obtaining, from the urls, a first target url that matches the regular expression;
adding the first target url to a queue to be crawled.
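
The length comparison recited in claims 3 and 7, and the split into first and second target urls recited in claims 4 and 8, might look like the following Python sketch. It is illustrative only. It assumes that the "target candidate url" is the text matched from the start of the candidate url; the helper names `is_first_target` and `split_targets`, the pattern, and the sample urls are hypothetical.

```python
import re


def is_first_target(candidate: str, pattern: re.Pattern) -> bool:
    """Return True when the matched text is as long as the candidate url."""
    match = pattern.match(candidate)  # match() anchors at the start only
    if match is None:
        return False
    target_candidate = match.group(0)
    # Equal lengths mean the expression consumed the entire candidate url.
    return len(target_candidate) == len(candidate)


def split_targets(candidates, pattern):
    """Partition candidate urls into first and second target urls."""
    first, second = [], []
    for url in candidates:
        (first if is_first_target(url, pattern) else second).append(url)
    return first, second


# Hypothetical regular expression and candidate urls.
pattern = re.compile(r"http://example\.com/item/\d+")
candidates = [
    "http://example.com/item/42",
    "http://example.com/item/42/reviews",
    "http://example.com/help",
]
first, second = split_targets(candidates, pattern)
print(first)   # ['http://example.com/item/42']
print(second)  # ['http://example.com/item/42/reviews', 'http://example.com/help']
```

For expressions that contain alternations, `match` may stop at a shorter alternative even though a longer one would cover the whole url; in Python, `Pattern.fullmatch` avoids that case and may be the more robust choice.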
CN201611159537.8A 2016-12-14 2016-12-14 Data processing method and client device Active CN108228623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611159537.8A CN108228623B (en) 2016-12-14 2016-12-14 Data processing method and client device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611159537.8A CN108228623B (en) 2016-12-14 2016-12-14 Data processing method and client device

Publications (2)

Publication Number Publication Date
CN108228623A true CN108228623A (en) 2018-06-29
CN108228623B CN108228623B (en) 2021-12-24

Family

ID=62650489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611159537.8A Active CN108228623B (en) 2016-12-14 2016-12-14 Data processing method and client device

Country Status (1)

Country Link
CN (1) CN108228623B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008390A (en) * 2019-02-27 2019-07-12 深圳壹账通智能科技有限公司 Appraisal procedure, device, computer equipment and the storage medium of application program
CN110851746A (en) * 2018-07-27 2020-02-28 北京国双科技有限公司 Crawler seed generation method and device




Also Published As

Publication number Publication date
CN108228623B (en) 2021-12-24

CN106933840A (en) Forum's catalogue page content crawling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant