CN109582844A - Method, apparatus and system for identifying a crawler - Google Patents

Info

Publication number
CN109582844A
Authority
CN
China
Prior art keywords
word
sample
word frequency
user agent
crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811321280.0A
Other languages
Chinese (zh)
Inventor
张璐
刁士涵
武金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201811321280.0A priority Critical patent/CN109582844A/en
Publication of CN109582844A publication Critical patent/CN109582844A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The application provides a method, apparatus and system for identifying a crawler. The method includes: if an access request by which a user accesses the current page is detected, obtaining the user agent field from the access request; determining the word frequency distribution feature of the user agent field; and inputting the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler. Because the application does not need to count per-IP access traffic or frequency, it avoids missing crawlers that draw on a large pool of IP addresses, and it avoids mistakenly blocking normal users behind a shared public IP.

Description

Method, apparatus and system for identifying a crawler
Technical field
This application relates to the field of Internet technology, and in particular to a method, apparatus and system for identifying a crawler.
Background
A web crawler (crawler for short) is a program or script that follows the link addresses in web pages to find pages and automatically fetches their content according to certain rules. Current crawler technology can use configured rules to grab important information from a page's source code, which leaks site information and weakens website security.
One existing scheme for identifying crawlers accumulates the access traffic (or access frequency) of each IP (Internet Protocol) address. When the accumulated traffic exceeds a preset threshold, the user corresponding to that IP is regarded as a crawler, added to a blacklist and intercepted. However, when a crawler has a large pool of IP addresses, the traffic of each single IP tends to stay below the preset threshold, so the crawler is missed; the scheme also tends to mistakenly block normal users behind a shared public IP.
Summary of the invention
In view of this, the application provides a method, apparatus and system for identifying a crawler, to solve the above problems of existing anti-crawler schemes.
Specifically, the application is implemented through the following technical solutions:
According to a first aspect of the application, a method for identifying a crawler is provided, comprising:
if an access request by which a user accesses the current page is detected, obtaining the user agent field from the access request;
determining the word frequency distribution feature of the user agent field;
inputting the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
In one embodiment, determining the word frequency distribution feature of the user agent field comprises:
performing word segmentation on the user agent field to obtain at least one target word;
determining the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word.
In one embodiment, determining the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word comprises:
determining the word frequency of each target word in the at least one target word based on a pre-constructed correspondence;
counting the number of word frequencies of the at least one target word that fall into each of multiple preset word frequency intervals;
determining the word frequency distribution feature corresponding to the user agent field according to the vector corresponding to the numbers.
In one embodiment, the crawler identification model is trained according to the following steps:
obtaining multiple sample access requests, and obtaining sample user agent fields from the multiple sample access requests;
determining the sample word frequency distribution feature of each sample user agent field;
labeling the sample word frequency distribution features, and training the crawler identification model with the labeled sample word frequency distribution features as the training set.
In one embodiment, determining the sample word frequency distribution feature of the sample user agent field comprises:
performing word segmentation on the sample user agent field to obtain at least one sample target word;
determining the word frequency of each sample target word in the at least one sample target word based on a pre-constructed correspondence;
counting the number of word frequencies of the at least one sample target word that fall into each of multiple preset word frequency intervals;
determining the sample word frequency distribution feature corresponding to the sample user agent field according to the vector corresponding to the numbers.
In one embodiment, obtaining multiple sample access requests comprises:
obtaining positive sample access requests and negative sample access requests, where the positive sample access requests include access requests generated when a crawler accesses the current page, and the negative sample access requests include access requests generated when a normal user accesses the current page.
In one embodiment, the method further comprises:
constructing the correspondence between target words and word frequencies according to the negative sample access requests.
According to a second aspect of the application, an apparatus for identifying a crawler is provided, comprising:
an agent field obtaining module, configured to obtain the user agent field from the access request when an access request by which a user accesses the current page is detected;
a distribution feature determining module, configured to determine the word frequency distribution feature of the user agent field;
a recognition result obtaining module, configured to input the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
According to a third aspect of the application, a device for identifying a crawler is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements any of the above methods for identifying a crawler when executing the program.
According to a fourth aspect of the application, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute any of the above methods for identifying a crawler.
As can be seen from the above technical solutions, when an access request by which a user accesses the current page is detected, the application obtains the user agent field from the access request, determines the word frequency distribution feature of the user agent field, and finally inputs the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler. Because per-IP access traffic or frequency does not need to be counted, the application avoids missing crawlers that have a large pool of IP addresses and avoids mistakenly blocking normal users behind a shared public IP. Furthermore, because the application does not rely on a simple parameter strategy that directly checks for abnormal parameter values, it avoids missing crawlers that bypass such a strategy. In addition, because the user agent field does not need to be obtained from the front end of an application, complicated front-end deployment is avoided, which improves the applicability of the scheme to apps.
Brief description of the drawings
Fig. 1 is a flowchart of a method for identifying a crawler according to a first exemplary embodiment of the application;
Fig. 2 is a flowchart of how to determine the word frequency distribution feature of the user agent field according to an exemplary embodiment of the application;
Fig. 3 is a flowchart of how to determine the word frequency distribution feature of the user agent field according to another exemplary embodiment of the application;
Fig. 4 is a flowchart of a method for identifying a crawler according to a second exemplary embodiment of the application;
Fig. 5 is a flowchart of how to determine the sample word frequency distribution feature of a sample user agent field according to an exemplary embodiment of the application;
Fig. 6 is a structural diagram of an apparatus for identifying a crawler according to an exemplary embodiment of the application;
Fig. 7 is a structural diagram of an apparatus for identifying a crawler according to another exemplary embodiment of the application;
Fig. 8 is a structural diagram of a device for identifying a crawler according to an exemplary embodiment of the application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the application as detailed in the appended claims.
The terms used in the application are for the purpose of describing particular embodiments only and are not intended to limit the application. The singular forms "a", "said" and "the" used in the application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third and so on may be used in the application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
One existing scheme for identifying crawlers accumulates the access traffic (or access frequency) of each IP (Internet Protocol) address. When the accumulated traffic exceeds a preset threshold, the user corresponding to that IP is regarded as a crawler, added to a blacklist and intercepted.
However, when a crawler has a large pool of IP addresses, the traffic of each single IP tends to stay below the preset threshold, so the crawler is missed. Moreover, if a crawler and normal users share one IP (that is, a public IP) and that IP is intercepted when its access traffic becomes too high, the normal users behind the public IP are mistakenly blocked.
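The two failure modes of the threshold scheme can be illustrated with a small sketch. The threshold value, the IP addresses and the traffic volumes below are invented for illustration only and do not come from the application.

```python
from collections import Counter

THRESHOLD = 100  # hypothetical per-IP request limit


def flag_ips(request_ips, threshold=THRESHOLD):
    """Return the set of IPs whose accumulated request count exceeds the threshold."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > threshold}


# A crawler rotating through 50 IPs sends 40 requests from each one:
# 2000 requests in total, yet every single IP stays below the threshold.
crawler_traffic = [f"10.0.0.{i}" for i in range(50) for _ in range(40)]

# Sixty normal users behind one public IP send 2 requests each:
# only 120 requests, but the shared IP exceeds the threshold.
shared_ip_traffic = ["203.0.113.7"] * 120

flagged = flag_ips(crawler_traffic + shared_ip_traffic)
```

In this made-up scenario `flagged` contains only the shared public IP, so the crawler is missed and the normal users are the ones intercepted.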
In view of this, the application provides a method, apparatus and system for identifying a crawler, to solve the above problems of existing anti-crawler schemes.
Fig. 1 is a flowchart of a method for identifying a crawler according to a first exemplary embodiment of the application. This embodiment can be used on a server side (for example, a single server or a server cluster composed of multiple servers) or on a terminal device (for example, a desktop computer, laptop, tablet computer or mobile phone).
As shown in Fig. 1, the method includes steps S101-S103:
In step S101: if an access request by which a user accesses the current page is detected, the user agent field is obtained from the access request.
In one embodiment, when a user accesses the current page through a smart terminal such as a mobile phone or tablet computer, an access request for accessing the current page is generated.
In one embodiment, after the access request is detected, the user agent field can be obtained from it.
In one embodiment, the user agent (User Agent, UA for short) field is a header string carried in the access request. The server side uses it to identify one or more of the user's operating system and version, CPU type, browser and version, browser rendering engine, browser language and browser plug-ins.
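As a minimal sketch of step S101, assuming the access request's headers are already available as a plain dictionary (the function name `get_user_agent` and the sample headers are hypothetical), the user agent field can be read with a case-insensitive lookup:

```python
def get_user_agent(headers):
    """Return the User-Agent header value, or "" if the header is absent.

    HTTP header names are case-insensitive, so the name is compared in lower case.
    """
    for name, value in headers.items():
        if name.lower() == "user-agent":
            return value
    return ""


# Illustrative request headers; the values are examples only.
example_headers = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}
ua = get_user_agent(example_headers)
```

Because the header is read from the request itself, nothing has to be deployed in the front end of the application.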
It is worth noting that the way of listening for access requests to the current page can be chosen by the developer according to actual business needs; this embodiment does not limit it.
In step S102: the word frequency distribution feature of the user agent field is determined.
In one embodiment, after the user agent field is obtained from the access request, feature extraction can be performed on it.
In one embodiment, the extracted feature should highlight the characteristics of the user agent field and reflect how it differs from, and relates to, the user agent fields in other users' access requests, so as to make the feature more discriminative.
In one embodiment, the word frequency distribution feature of the user agent field can be extracted. The word frequency distribution feature may include the word frequency features of the words contained in the user agent field.
In one embodiment, the way of determining the word frequency distribution feature of the user agent field is described in the embodiment shown in Fig. 2 below and is not detailed here.
In step S103: the word frequency distribution feature is input into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
In one embodiment, the crawler identification model can be trained in advance on the user agent fields in sample access requests. The model determines, from the word frequency distribution feature of the user agent field in an access request, whether the user accessing the current page is a crawler.
In one embodiment, the specific training method of the crawler identification model is described in the embodiment shown in Fig. 4 below and is not detailed here.
In one embodiment, after the word frequency distribution feature of the user agent field is determined according to the word frequency of the at least one target word, the feature can be input into the pre-trained crawler identification model to obtain the recognition result indicating whether the user is a crawler.
In one embodiment, in addition to extracting the word frequency distribution feature of the user agent field in the access request, parameter statistical features of the user, such as parameter-dimension features and IP-dimension statistical features, can be obtained in response to the detected access request. The obtained parameter statistical features and the user agent feature can then be input together into the pre-trained crawler identification model to obtain the recognition result. It is worth noting that the extraction of the parameter statistical features can follow the explanations in the prior art; this embodiment does not limit it.
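One simple way to input several feature groups "together" is to concatenate them into a single vector. The sketch below assumes this; the application does not specify how the features are combined, and the parameter and IP statistics shown are invented placeholders.

```python
def combine_features(ua_feature, param_stats, ip_stats):
    """Concatenate the UA word frequency feature with other statistical features.

    Concatenation is an assumption made for illustration; the source only says
    the features are input into the model together.
    """
    return list(ua_feature) + list(param_stats) + list(ip_stats)


ua_feature = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]   # per-interval counts for a UA field
param_stats = [3, 0]    # hypothetical: number of parameters, number of empty parameters
ip_stats = [12]         # hypothetical: requests from this IP in the last minute

model_input = combine_features(ua_feature, param_stats, ip_stats)
```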
In one embodiment, after the recognition result indicating whether the user is a crawler is obtained, the developer can take corresponding measures against the user according to actual business needs. For example, after the user accessing the current page is determined to be a crawler, the user's access can be forbidden, or a verification code (CAPTCHA) can be sent to the user. Conversely, when the user accessing the current page is determined to be a normal user, that is, not a crawler, the user's access to the current page can be allowed.
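The handling described above can be sketched as a small dispatch function. The function name, the policy names and the returned action strings are hypothetical; which measure is taken is left to the developer.

```python
def handle_request(is_crawler, policy="captcha"):
    """Map the recognition result to an action.

    policy selects what happens to a recognized crawler:
    "block" forbids access, anything else sends a verification code.
    """
    if not is_crawler:
        return "allow"      # normal user: allow access to the current page
    if policy == "block":
        return "block"      # forbid the crawler's access
    return "captcha"        # challenge the user with a verification code


action_for_normal_user = handle_request(False)
action_for_crawler = handle_request(True, policy="block")
```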
As can be seen from the above description, when an access request by which a user accesses the current page is detected, this embodiment obtains the user agent field from the access request, determines the word frequency distribution feature of the user agent field, and finally inputs the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler. Because per-IP access traffic or frequency does not need to be counted, this embodiment avoids missing crawlers that have a large pool of IP addresses and avoids mistakenly blocking normal users behind a shared public IP. Furthermore, because it does not rely on a simple parameter strategy that directly checks for abnormal parameter values, it avoids missing crawlers that bypass such a strategy. In addition, because the user agent field does not need to be obtained from the front end of an application, complicated front-end deployment is avoided, which improves the applicability of the scheme to apps.
Fig. 2 is a flowchart of how to determine the word frequency distribution feature of the user agent field according to an exemplary embodiment of the application. Building on the above embodiment, this embodiment illustrates how the word frequency distribution feature of the user agent field is determined. As shown in Fig. 2, determining the word frequency distribution feature of the user agent field in step S102 may include the following steps S201-S202:
In step S201, word segmentation is performed on the user agent field to obtain at least one target word.
In one embodiment, after the user agent field is obtained from the access request, word segmentation can be performed on it to obtain at least one target word.
In one embodiment, the user agent field may contain multiple words, so after the user agent field is obtained, it can be segmented according to preset word segmentation rules.
In one embodiment, the word segmentation rules can be set by the developer according to business experience or actual business needs, for example based on the part of speech or semantics of each word in the field; this embodiment does not limit them.
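Since the segmentation rules are left to the developer, the following is only one possible rule, chosen for illustration: split the user agent string on whitespace and on the separators `(`, `)`, `;` and `,`. The function name `segment_user_agent` is hypothetical.

```python
import re

# Hypothetical segmentation rule: runs of whitespace, parentheses,
# semicolons and commas separate target words.
_SEPARATORS = re.compile(r"[\s();,]+")


def segment_user_agent(ua):
    """Split a user agent string into target words, dropping empty pieces."""
    return [token for token in _SEPARATORS.split(ua) if token]


tokens = segment_user_agent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
```

A rule based on part of speech or semantics, as the text suggests, would replace the regular expression while keeping the same input and output shape.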
In step S202, the word frequency distribution feature of the user agent field is determined according to the word frequency of the at least one target word.
In one embodiment, after word segmentation of the user agent field yields the target words, the word frequency of each target word can be determined, that is, the frequency with which the target word occurs in the access requests of normal users.
In one embodiment, after the word frequency of each target word is determined, the word frequency distribution feature over the target words of the user agent field can be determined.
In one embodiment, the way of determining the word frequency distribution feature of the user agent field is described in the embodiment shown in Fig. 3 below and is not detailed here.
As can be seen from the above description, this embodiment performs word segmentation on the user agent field to obtain at least one target word and determines the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word. The word frequency distribution feature can thus be determined accurately, which provides a sound basis for subsequently obtaining, from that feature, the recognition result indicating whether the user is a crawler.
Fig. 3 is a flowchart of how to determine the word frequency distribution feature of the user agent field according to another exemplary embodiment of the application. Building on the above embodiment, this embodiment illustrates how the word frequency distribution feature of the user agent field is determined. As shown in Fig. 3, determining the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word in step S202 may include the following steps S301-S303:
In step S301, the word frequency of each target word in the at least one target word is determined based on a pre-constructed correspondence.
In one embodiment, after word segmentation of the user agent field yields the at least one target word, the word frequency of each target word can be determined from the pre-constructed correspondence.
In one embodiment, the correspondence can be a correspondence between target words and word frequencies built in advance from sample data, such as a mapping table or a set of word-to-frequency pairs; this embodiment does not limit it.
In one embodiment, the way of constructing the correspondence is described in the embodiment of Fig. 6 below and is not detailed here.
For example, a pre-constructed correspondence between multiple target words and word frequencies is shown in Table 1 below:
Table 1
As shown in Table 1, the word frequency of each target word is a number between 0 and 1; the word frequencies in this embodiment are normalized. For example, if there are n user agent (UA) fields that together contain m target words, the word frequency of each target word is the number of times the target word occurs in the n user agent fields divided by n.
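The normalization described above can be sketched as follows. The sketch assumes that a word is counted at most once per user agent field, which keeps every word frequency between 0 and 1; the text does not state this explicitly. The function name and the sample token lists are hypothetical.

```python
from collections import Counter


def build_word_frequency_table(normal_token_lists):
    """Build the word-to-frequency correspondence from normal users' UA fields.

    normal_token_lists holds one list of target words per user agent field.
    Assumption: each word counts once per field, so frequency = fields containing
    the word / n, which always lies in [0, 1].
    """
    n = len(normal_token_lists)
    if n == 0:
        return {}
    field_counts = Counter()
    for tokens in normal_token_lists:
        field_counts.update(set(tokens))
    return {word: count / n for word, count in field_counts.items()}


# Four made-up segmented UA fields from normal users (n = 4).
samples = [
    ["Mozilla/5.0", "Windows", "x64"],
    ["Mozilla/5.0", "Linux", "x64"],
    ["Mozilla/5.0", "Windows", "Win64"],
    ["Mozilla/5.0", "Macintosh"],
]
table = build_word_frequency_table(samples)
```

Here "Mozilla/5.0" appears in all four fields and gets frequency 1.0, while "Linux" appears in one field and gets 0.25.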
In one embodiment, if the target words obtained from user agent field A are "operating system 1", "cpu type 3" and "browser language 5", the corresponding word frequencies "0.10", "0.55" and "0.98" can be determined from the correspondence shown in Table 1.
In another embodiment, if a target word of user agent field B (for example, "terminal type B") is not in the correspondence, its word frequency can be determined by another preset method, for example by setting the word frequency of the target word to "0".
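The lookup with a fallback for unseen words can be sketched with a dictionary. The table entries repeat the example values from the text; the function name `lookup_frequencies` is hypothetical.

```python
def lookup_frequencies(target_words, table, default=0.0):
    """Return the word frequency of each target word, using default for unseen words."""
    return [table.get(word, default) for word in target_words]


# Correspondence with the example values used in the text.
table = {
    "operating system 1": 0.10,
    "cpu type 3": 0.55,
    "browser language 5": 0.98,
}

freqs_a = lookup_frequencies(
    ["operating system 1", "cpu type 3", "browser language 5"], table
)
freqs_b = lookup_frequencies(["terminal type B"], table)
```

Setting unseen words to 0 places them in the lowest interval, which matches the intuition that words never seen in normal traffic are the most suspicious.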
In step S302, the number of word frequencies of the at least one target word that fall into each of multiple preset word frequency intervals is counted.
In one embodiment, multiple word frequency intervals can be preset, for example {[0, 0.01), [0.01, 0.10), [0.10, 0.20), ..., [0.90, 1.00)}.
In one embodiment, after the word frequency of each target word in the at least one target word is determined, the number of word frequencies falling into each preset word frequency interval can be counted.
Taking user agent field A again as an example, the word frequencies "0.10", "0.55" and "0.98" of its target words "operating system 1", "cpu type 3" and "browser language 5" fall into the 3rd interval (that is, [0.10, 0.20)), the 7th interval (that is, [0.50, 0.60)) and the 11th interval (that is, [0.90, 1.00)) respectively. It follows that the numbers of word frequencies of the target words of user agent field A that fall into the 11 preset word frequency intervals are {0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1}.
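The counting step for this worked example can be sketched with a binary search over the interval boundaries. The boundaries are the 11 example intervals from the text; clamping a word frequency of exactly 1.0 into the last interval is an assumption, since the text leaves that case open.

```python
from bisect import bisect_right

# Boundaries of the 11 example intervals: [0, 0.01), [0.01, 0.10), [0.10, 0.20), ..., [0.90, 1.00)
EDGES = [0.0, 0.01, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]


def count_per_interval(frequencies, edges=EDGES):
    """Count how many word frequencies fall into each half-open interval."""
    counts = [0] * (len(edges) - 1)
    for f in frequencies:
        index = bisect_right(edges, f) - 1
        # Assumption: values outside the covered range (such as exactly 1.0)
        # are clamped into the nearest interval.
        index = min(max(index, 0), len(counts) - 1)
        counts[index] += 1
    return counts


counts = count_per_interval([0.10, 0.55, 0.98])
```

For the three example word frequencies this yields the same counts as the text: one hit each in the 3rd, 7th and 11th intervals.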
It is worth noting that the word frequency values above are only illustrative and do not limit the scope of protection of the application.
In step S303, the word frequency distribution feature corresponding to the user agent field is determined according to the vector corresponding to the numbers.
In one embodiment, after the number of word frequencies of the at least one target word falling into each of the multiple preset word frequency intervals is counted, the vector corresponding to these numbers can be determined.
Taking user agent field A again as an example, the numbers of its target-word frequencies falling into each preset interval can be concatenated in order; that is, from the numbers {0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1}, the corresponding vector is determined to be "00100010001".
In one embodiment, after the vector corresponding to the numbers of target-word frequencies falling into the preset word frequency intervals is determined, the vector can be taken as the word frequency distribution feature corresponding to user agent field A.
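The concatenation in the example can be sketched as below. The function name is hypothetical. Keeping a numeric list next to the string form is an added suggestion rather than something the text prescribes: once any count reaches 10, the concatenated string can no longer be split back into per-interval counts unambiguously.

```python
def counts_to_feature_string(counts):
    """Concatenate the per-interval counts in order, as in the example for field A."""
    return "".join(str(c) for c in counts)


counts_a = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
feature_string = counts_to_feature_string(counts_a)

# Numeric form of the same vector, convenient as direct model input.
feature_vector = [float(c) for c in counts_a]
```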
It is worth noting that the specific boundaries of the preset word frequency intervals can be set by the developer according to actual business needs; this embodiment does not limit them.
As can be seen from the above description, this embodiment determines the word frequency of each target word in the at least one target word based on a pre-constructed correspondence, counts the number of those word frequencies falling into each of multiple preset word frequency intervals, and then determines the word frequency distribution feature corresponding to the user agent field from the vector corresponding to the numbers. The word frequency distribution feature can thus be determined accurately from the word frequency of each target word, which provides a sound basis for subsequently obtaining, from that feature, the recognition result indicating whether the user is a crawler.
Fig. 4 is a flowchart of a method for identifying a crawler according to a second exemplary embodiment of the application. This embodiment can be used on a server side (for example, a single server or a server cluster composed of multiple servers) or on a terminal device (for example, a desktop computer, laptop, tablet computer or mobile phone).
As shown in Fig. 4, the method includes steps S401-S406:
In step S401, multiple sample access requests are obtained, and sample user agent fields are obtained from the multiple sample access requests.
In one embodiment, in order to train the crawler identification model, multiple sample access requests can be obtained from the historical access data of the current web page.
In one embodiment, the obtained sample access requests may include positive sample access requests and negative sample access requests. The positive sample access requests may include access requests generated when a confirmed crawler accesses the current page; conversely, the negative sample access requests may include access requests generated when a normal user (a user confirmed not to be a crawler) accesses the current page.
In one embodiment, the developer can also construct, according to business needs or business experience, some positive sample access requests that contain random abnormal parameters, to avoid the model overfitting that results from training only on crawler data. Generating positive sample access requests with random abnormal parameters improves the feature coverage of the model's samples and reduces the model's dependence on crawler samples.
In one embodiment, after the multiple sample access requests are obtained, a sample user agent field can be obtained from each of them. The sample user agent field is the user agent (UA) field in the sample access request; for a detailed explanation, see the embodiment shown in Fig. 1, which is not repeated here.
In step S402, the sample word frequency distribution feature of the sample user agent field is determined.
In one embodiment, after the sample user agent field is obtained from the sample access request, feature extraction can be performed on it.
In one embodiment, the extracted feature should highlight the characteristics of the sample user agent field and reflect how it differs from, and relates to, the sample user agent fields in other sample users' access requests, so as to make the feature more discriminative.
In one embodiment, the word frequency distribution feature of the sample user agent field can be extracted. The word frequency distribution feature may include the word frequency features of the words contained in the sample user agent field.
In one embodiment, the way of determining the word frequency distribution feature of the sample user agent field is described in the embodiment shown in Fig. 5 below and is not detailed here.
In step S403, the sample word frequency distribution feature is labeled, and the labeled sample word frequency distribution features are used as a training set to train the crawler identification model.
In one embodiment, after the sample word frequency distribution feature of the sample user agent field is determined, the feature may be labeled according to the category of the sample access request to which the sample user agent field belongs.
For example, if the sample word frequency distribution feature obtained for sample user agent field S is "01001001100" and the category of the sample access request to which S belongs is "positive sample", the feature may be labeled as a positive sample feature. Similarly, if the category of the sample access request to which S belongs is "negative sample", the feature may be labeled as a negative sample feature.
In one embodiment, after the sample word frequency distribution features are labeled, the labeled features may be used as a training set to train the crawler identification model.
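The labeling step can be sketched as below. The function name `build_training_set` and the numeric label encoding (1 for positive, that is crawler, samples and 0 for negative samples) are assumptions made for illustration; the source only states that each feature is labeled with the category of its request. The second feature vector is invented example data.

```python
def build_training_set(sample_features):
    """Turn (feature_vector, category) pairs into feature rows X and labels y.

    Assumed encoding: "positive" (crawler) -> 1, "negative" (normal user) -> 0.
    """
    label_of = {"positive": 1, "negative": 0}
    X, y = [], []
    for vector, category in sample_features:
        X.append(list(vector))
        y.append(label_of[category])
    return X, y


# The first vector spells out the feature "01001001100" from the example above.
X, y = build_training_set([
    ([0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0], "positive"),
    ([3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4], "negative"),
])
```

The resulting `X` and `y` are in the row-per-sample shape that common classifier implementations, including random forest implementations, accept as training input.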
It is worth noting that the crawler identification model may be chosen by the developer according to actual business requirements, for example a random forest classification model; this embodiment places no limitation on this.
In step S404, if an access request by which a user accesses the current page is detected, the user agent field is obtained from the access request.
In step S405, the word frequency distribution feature of the user agent field is determined.
In step S406, the word frequency distribution feature is input into the pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
For explanations of steps S404-S406, see the above embodiments; they are not repeated here.
As can be seen from the above description, this embodiment obtains multiple sample access requests, obtains sample user agent fields from them, determines the sample word frequency distribution features of those fields, labels the features, and uses the labeled features as a training set to train the crawler identification model. This lays the foundation for subsequently identifying, with the trained model, whether a user accessing the current page is a crawler.
Fig. 5 is a flowchart, according to an exemplary embodiment of the application, of how the sample word frequency distribution feature of the sample user agent field is determined. Building on the above embodiments, this embodiment illustrates how that feature is determined. As shown in Fig. 5, determining the sample word frequency distribution feature of the sample user agent field in step S402 may include the following steps S501-S504:
In step S501, word segmentation is performed on the sample user agent field to obtain at least one sample target word.
In one embodiment, after the sample user agent field is obtained from the sample access request, word segmentation may be performed on it to obtain at least one sample target word.
In one embodiment, the sample user agent field may contain multiple words, so once the sample user agent field is obtained, it may be segmented according to preset segmentation rules.
In one embodiment, the segmentation rules may be set by the developer according to business experience or actual business requirements, for example based on the part of speech or semantics of each word in the field; this embodiment places no limitation on this.
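A minimal segmentation sketch follows. The source leaves the segmentation rules to the developer, so the rule used here (splitting on whitespace, parentheses, semicolons and commas) is purely an assumption, as is the function name `segment_user_agent`.

```python
import re

# Assumed separators: whitespace, parentheses, semicolons and commas.
_SEPARATORS = re.compile(r"[\s();,]+")


def segment_user_agent(user_agent):
    """Split a User-Agent string into target words, dropping empty pieces."""
    return [token for token in _SEPARATORS.split(user_agent) if token]


words = segment_user_agent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
```

Keeping product tokens such as `Mozilla/5.0` whole, rather than splitting on the slash, is a design choice of this sketch; a different rule would simply yield a different vocabulary.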
In step S502, the word frequency of each sample target word in the at least one sample target word is determined based on a pre-constructed correspondence.
In one embodiment, the correspondence between target words and word frequencies may be constructed in advance from the obtained negative sample access requests.
For example, after multiple negative sample access requests are obtained, a sample user agent field may be obtained from each negative sample access request, and word segmentation may be performed on each sample user agent field to obtain a word set "TE". On this basis, the frequency of occurrence (that is, the word frequency) of each target word in the set "TE" is counted, and the correspondence between target words and word frequencies is constructed from the statistics, for example as a mapping table or a correspondence set; this embodiment places no limitation on this.
In one embodiment, the frequency $TF_{term_i}$ with which the i-th target word $term_i$ occurs in the set "TE" may be defined as:
$$TF_{term_i} = \frac{\text{number of occurrences of } term_i}{N} \qquad (1)$$
where N is the number of access requests in the negative sample data set, i = 1, 2, ..., M, and M is the number of target words corresponding to the access requests in the negative sample data set.
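Formula (1) can be sketched as a mapping from target word to word frequency, as below. Two assumptions are made that the source does not state: each word is counted at most once per request, so that every frequency stays within [0, 1], and whitespace splitting stands in for the segmentation rules. The function name and the example User-Agent strings are hypothetical.

```python
from collections import Counter


def build_word_frequency_map(negative_user_agents, segment=str.split):
    """Map each target word to (requests containing the word) / N.

    Assumption: a word is counted once per request, so frequencies lie in [0, 1].
    """
    n_requests = len(negative_user_agents)
    if n_requests == 0:
        return {}
    counts = Counter()
    for user_agent in negative_user_agents:
        counts.update(set(segment(user_agent)))
    return {word: count / n_requests for word, count in counts.items()}


frequency_map = build_word_frequency_map([
    "Mozilla/5.0 Chrome/70.0",
    "Mozilla/5.0 Safari/537.36",
    "Mozilla/5.0 Chrome/70.0",
    "Mozilla/5.0 Firefox/60.0",
])
```

With four negative sample requests (N = 4), a word present in all of them gets frequency 1.0 and a word present in one of them gets 0.25.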
On this basis, after word segmentation is performed on the sample user agent field and the sample target words are obtained, the word frequency of each sample target word may be determined from the above correspondence.
In one embodiment, the way of determining the word frequency of each sample target word based on the pre-constructed correspondence may also follow the way the word frequency of a target word is determined in the embodiment shown in Fig. 3, which is not repeated here.
In step S503, the number of word frequencies of the at least one sample target word that fall into each of multiple preset word frequency intervals is counted.
In one embodiment, multiple word frequency intervals may be preset, for example {[0, 0.01), [0.01, 0.10), [0.10, 0.20), ..., [0.90, 1.00)}.
In one embodiment, after the word frequency of each target word in the at least one target word is determined, the number of target-word frequencies falling into each preset word frequency interval may be counted.
In one embodiment, the way of counting how many word frequencies of the sample target words fall into each preset word frequency interval may also follow the way this is counted for target words in the embodiment shown in Fig. 3, which is not repeated here.
It is worth noting that the word frequency values given above are for illustration only and do not limit the scope of protection of the application.
In step S504, the sample word frequency distribution feature corresponding to the sample user agent field is determined from the vector corresponding to the counts.
In one embodiment, after the number of word frequencies of the at least one sample target word falling into each preset word frequency interval is counted, the vector corresponding to those counts may be determined.
In one embodiment, the counts of target-word frequencies falling into each preset word frequency interval may be concatenated in order to generate the corresponding vector. The vector may then be taken as the word frequency distribution feature corresponding to the user agent field.
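Steps S502 to S504 can be tied together in a short sketch: look up each word's frequency, count how many frequencies fall into each preset interval, and keep the counts in interval order as the feature vector. The interval edges follow the example intervals {[0, 0.01), [0.01, 0.10), ..., [0.90, 1.00)}. Three things are assumptions of the sketch rather than statements of the source: words absent from the mapping are given frequency 0, a frequency of exactly 1.00 is counted in the last interval, and whitespace splitting stands in for the segmentation rules. The mapping values are invented example data.

```python
from bisect import bisect_right

# Left edges of the 11 example intervals: 0, 0.01, 0.10, 0.20, ..., 0.90.
EDGES = [0.0, 0.01] + [i / 10 for i in range(1, 10)]


def word_frequency_distribution_feature(user_agent, frequency_map, edges=EDGES):
    """Return the per-interval counts of word frequencies, in interval order."""
    counts = [0] * len(edges)
    for word in user_agent.split():
        frequency = frequency_map.get(word, 0.0)  # assumption: unseen word -> 0
        # bisect_right finds the interval whose left edge is <= frequency;
        # anything at or above 0.90 (including 1.00) lands in the last one.
        counts[bisect_right(edges, frequency) - 1] += 1
    return counts


example_map = {"Mozilla/5.0": 0.95, "Chrome/70.0": 0.5}
feature = word_frequency_distribution_feature(
    "Mozilla/5.0 Chrome/70.0 evilbot/1.0", example_map
)
```

Because the vector length depends only on the number of intervals, User-Agent strings of any length map to features of the same dimension, which is what lets a fixed-input classifier consume them.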
As can be seen from the above description, this embodiment performs word segmentation on the sample user agent field to obtain at least one sample target word, determines the word frequency of each sample target word based on the pre-constructed correspondence, counts how many of those word frequencies fall into each of multiple preset word frequency intervals, and determines the word frequency distribution feature corresponding to the sample user agent field from the vector corresponding to the counts. The word frequency distribution feature of the sample user agent field can thus be determined accurately from the word frequency of each sample target word, which lays the foundation for subsequently training the crawler identification model on word frequency distribution features and for identifying, with the trained model, whether a user accessing the current page is a crawler.
Corresponding to the foregoing method embodiments, the application also provides embodiments of a corresponding device.
Fig. 6 is a structural diagram of a device for identifying crawlers according to an exemplary embodiment of the application. As shown in Fig. 6, the device may include an agent field obtaining module 110, a distribution feature determining module 120 and a recognition result obtaining module 130, where:
the agent field obtaining module 110 is configured to obtain the user agent field from an access request when the access request, by which a user accesses the current page, is detected;
the distribution feature determining module 120 is configured to determine the word frequency distribution feature of the user agent field;
the recognition result obtaining module 130 is configured to input the word frequency distribution feature into the pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
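The three modules can be pictured as one small class, sketched below. The class name, the method names and the header dictionary are hypothetical, and the feature extractor and model passed in at the end are trivial placeholders used only to make the sketch runnable; they are not the trained model or the word frequency distribution feature described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CrawlerIdentifier:
    # Stands in for the distribution feature determining module.
    extract_feature: Callable[[str], List[int]]
    # Stands in for the pre-trained crawler identification model.
    model: Callable[[List[int]], bool]

    def get_user_agent(self, headers: Dict[str, str]) -> str:
        """Agent field obtaining module: read the User-Agent header."""
        for name, value in headers.items():
            if name.lower() == "user-agent":  # header names are case-insensitive
                return value
        return ""

    def identify(self, headers: Dict[str, str]) -> bool:
        """Recognition result obtaining module: True means 'crawler'."""
        return self.model(self.extract_feature(self.get_user_agent(headers)))


# Placeholder components, for demonstration only.
identifier = CrawlerIdentifier(
    extract_feature=lambda user_agent: [len(user_agent.split())],
    model=lambda feature: feature[0] < 2,
)
is_crawler = identifier.identify({"User-Agent": "python-requests/2.20.0"})
```

Passing the feature extractor and the model in as callables keeps the request-handling code independent of how the model was trained, which mirrors the separation between modules 110, 120 and 130.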
As can be seen from the above description, in this embodiment, when an access request by which a user accesses the current page is detected, the user agent field is obtained from the access request, the word frequency distribution feature of the user agent field is determined, and the feature is input into the pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler. Because IP access traffic or access frequency does not need to be counted, crawlers are not missed merely because they have a large pool of IP addresses, and normal users sharing a public IP are not blocked by mistake. Further, because the scheme does not rely on a simple parameter strategy that directly checks for abnormal parameter values, it avoids missing crawlers that bypass such a strategy. In addition, because the user agent field does not have to be obtained from the front end of an application, complex front-end deployment is avoided, which improves the applicability of the scheme to apps.
Fig. 7 is a structural diagram of a device for identifying crawlers according to another exemplary embodiment of the application. The agent field obtaining module 210, distribution feature determining module 220 and recognition result obtaining module 230 have the same functions as the agent field obtaining module 110, distribution feature determining module 120 and recognition result obtaining module 130 in the embodiment shown in Fig. 6, and are not described again here.
As shown in Fig. 7, the distribution feature determining module 220 may include:
a word obtaining unit 221, configured to perform word segmentation on the user agent field to obtain at least one target word;
a feature determining unit 222, configured to determine the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word.
In one embodiment, the feature determining unit 222 may further be configured to:
determine the word frequency of each target word in the at least one target word based on a pre-constructed correspondence;
count the number of word frequencies of the at least one target word that fall into each of multiple preset word frequency intervals;
determine the word frequency distribution feature corresponding to the user agent field from the vector corresponding to the counts.
In one embodiment, the device may further include a model training module 240.
The model training module 240 may include:
a sample data obtaining unit 241, configured to obtain multiple sample access requests and to obtain sample user agent fields from the multiple sample access requests;
a sample feature determining unit 242, configured to determine the sample word frequency distribution feature of the sample user agent field;
an identification model training unit 243, configured to label the sample word frequency distribution feature and to use the labeled sample word frequency distribution features as a training set to train the crawler identification model.
In one embodiment, the sample feature determining unit 242 may further be configured to:
perform word segmentation on the sample user agent field to obtain at least one sample target word;
determine the sample word frequency distribution feature of the sample user agent field according to the word frequency of the at least one sample target word.
In one embodiment, the sample feature determining unit 242 may further be configured to:
determine the word frequency of each sample target word in the at least one sample target word based on a pre-constructed correspondence;
count the number of word frequencies of the at least one sample target word that fall into each of multiple preset word frequency intervals;
determine the sample word frequency distribution feature corresponding to the sample user agent field from the vector corresponding to the counts.
In one embodiment, the sample data obtaining unit 241 may further be configured to:
obtain positive sample access requests and negative sample access requests, where the positive sample access requests include access requests generated when a crawler accesses the current page, and the negative sample access requests include access requests generated when a normal user accesses the current page.
In one embodiment, the model training module 240 may further include:
a correspondence constructing unit 244, configured to construct the correspondence between target words and word frequencies according to the negative sample access requests.
In one embodiment, the sample data obtaining unit 241 may further be configured to:
construct positive sample access requests containing random anomaly parameters.
It is worth noting that all of the above optional technical solutions may be combined in any way to form optional embodiments of the disclosure, which are not described one by one here.
The embodiments of the device for identifying crawlers of the application may be applied to network equipment. The device embodiments may be implemented in software, in hardware, or in a combination of both. Taking software implementation as an example, the device in the logical sense is formed when the processor of the equipment on which it resides reads the corresponding computer program instructions from non-volatile memory into memory and runs them, where the computer program is used to execute the method for identifying crawlers provided by the embodiments shown in Figs. 1 to 5. At the hardware level, Fig. 8 shows the hardware structure of the equipment for identifying crawlers. In addition to the processor, network interface, memory and non-volatile memory shown in Fig. 8, the equipment may generally include other hardware, such as a forwarding chip responsible for processing messages. In terms of hardware structure, the equipment may also be a distributed device and may include multiple interface cards, so that message processing can be extended at the hardware level. In another aspect, the application also provides a computer-readable storage medium storing a computer program, the computer program being used to execute the method for identifying crawlers provided by the embodiments shown in Figs. 1 to 5.
Since the device embodiments correspond essentially to the method embodiments, the relevant parts of the description of the method embodiments apply to them. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the application. Those of ordinary skill in the art can understand and implement this without creative effort.
Other embodiments of the application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The application is intended to cover any variations, uses or adaptations of the application that follow its general principles and include common knowledge or conventional technical means in the art that are not disclosed in the application. The specification and examples are to be regarded as illustrative only, and the true scope and spirit of the application are indicated by the following claims.
It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity or device that includes the element.
The foregoing describes only preferred embodiments of the application and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims (10)

1. A method for identifying a crawler, characterized by comprising:
if an access request by which a user accesses the current page is detected, obtaining a user agent field from the access request;
determining a word frequency distribution feature of the user agent field;
inputting the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
2. The method according to claim 1, characterized in that determining the word frequency distribution feature of the user agent field comprises:
performing word segmentation on the user agent field to obtain at least one target word;
determining the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word.
3. The method according to claim 2, characterized in that determining the word frequency distribution feature of the user agent field according to the word frequency of the at least one target word comprises:
determining the word frequency of each target word in the at least one target word based on a pre-constructed correspondence;
counting the number of word frequencies of the at least one target word that fall into each of multiple preset word frequency intervals;
determining the word frequency distribution feature corresponding to the user agent field from the vector corresponding to the counts.
4. The method according to claim 1, characterized in that the crawler identification model is trained according to the following steps:
obtaining multiple sample access requests, and obtaining sample user agent fields from the multiple sample access requests;
determining a sample word frequency distribution feature of the sample user agent field;
labeling the sample word frequency distribution feature, and using the labeled sample word frequency distribution features as a training set to train the crawler identification model.
5. The method according to claim 4, characterized in that determining the sample word frequency distribution feature of the sample user agent field comprises:
performing word segmentation on the sample user agent field to obtain at least one sample target word;
determining the word frequency of each sample target word in the at least one sample target word based on a pre-constructed correspondence;
counting the number of word frequencies of the at least one sample target word that fall into each of multiple preset word frequency intervals;
determining the sample word frequency distribution feature corresponding to the sample user agent field from the vector corresponding to the counts.
6. The method according to claim 5, characterized in that obtaining multiple sample access requests comprises:
obtaining positive sample access requests and negative sample access requests, where the positive sample access requests include access requests generated when a crawler accesses the current page, and the negative sample access requests include access requests generated when a normal user accesses the current page.
7. The method according to claim 6, characterized in that the method further comprises:
constructing the correspondence between target words and word frequencies according to the negative sample access requests.
8. A device for identifying a crawler, characterized by comprising:
an agent field obtaining module, configured to obtain a user agent field from an access request when the access request, by which a user accesses the current page, is detected;
a distribution feature determining module, configured to determine a word frequency distribution feature of the user agent field;
a recognition result obtaining module, configured to input the word frequency distribution feature into a pre-trained crawler identification model to obtain a recognition result indicating whether the user is a crawler.
9. Equipment for identifying a crawler, characterized by comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the method for identifying a crawler according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program, the computer program being used to execute the method for identifying a crawler according to any one of claims 1-7.
CN201811321280.0A 2018-11-07 2018-11-07 A kind of method, apparatus and system identifying crawler Pending CN109582844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811321280.0A CN109582844A (en) 2018-11-07 2018-11-07 A kind of method, apparatus and system identifying crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811321280.0A CN109582844A (en) 2018-11-07 2018-11-07 A kind of method, apparatus and system identifying crawler

Publications (1)

Publication Number Publication Date
CN109582844A true CN109582844A (en) 2019-04-05

Family

ID=65921792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811321280.0A Pending CN109582844A (en) 2018-11-07 2018-11-07 A kind of method, apparatus and system identifying crawler

Country Status (1)

Country Link
CN (1) CN109582844A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020087307A1 (en) * 2000-12-29 2002-07-04 Lee Victor Wai Leung Computer-implemented progressive noise scanning method and system
CN103631830A (en) * 2012-08-29 2014-03-12 华为技术有限公司 Method and device for detecting web spiders
CN105072089A (en) * 2015-07-10 2015-11-18 中国科学院信息工程研究所 WEB malicious scanning behavior abnormity detection method and system
CN106294368A (en) * 2015-05-15 2017-01-04 阿里巴巴集团控股有限公司 Web spider identification method and device
US20180041530A1 (en) * 2015-04-30 2018-02-08 Iyuntian Co., Ltd. Method and system for detecting malicious web addresses
CN107766928A (en) * 2017-10-25 2018-03-06 福建富士通信息软件有限公司 A kind of terminal identification method based on artificial nerve network model and UA information
CN108429721A (en) * 2017-02-15 2018-08-21 腾讯科技(深圳)有限公司 A kind of recognition methods of web crawlers and device
CN108763274A (en) * 2018-04-09 2018-11-06 北京三快在线科技有限公司 Recognition methods, device, electronic equipment and the storage medium of access request

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110012023B (en) * 2019-04-15 2020-06-09 重庆天蓬网络有限公司 Poison-throwing type anti-climbing method, system, terminal and medium
CN110175278A (en) * 2019-05-24 2019-08-27 新华三信息安全技术有限公司 The detection method and device of web crawlers
CN110781366A (en) * 2019-09-09 2020-02-11 深圳壹账通智能科技有限公司 Webpage data processing method and device, computer equipment and storage medium
CN111143654A (en) * 2019-12-25 2020-05-12 支付宝(杭州)信息技术有限公司 Crawler identification method and device for assisting in identifying crawler, and electronic equipment
CN111143654B (en) * 2019-12-25 2023-06-16 支付宝(杭州)信息技术有限公司 Crawler identification method and device for assisting in identifying crawler and electronic equipment
CN111428108A (en) * 2020-03-25 2020-07-17 山东浪潮通软信息科技有限公司 Anti-crawler method, device and medium based on deep learning
CN112383513A (en) * 2020-10-27 2021-02-19 杭州数梦工场科技有限公司 Crawler behavior detection method and device based on proxy IP address pool and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190405)