CN112989158A - Method, device and storage medium for identifying webpage crawler behavior - Google Patents

Publication number
CN112989158A
Authority
CN
China
Prior art keywords
behavior
behavior data
user
page
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911290447.6A
Other languages
Chinese (zh)
Inventor
曾庆维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
SF Tech Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd
Priority to CN201911290447.6A
Publication of CN112989158A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the application provides a method, a device and a storage medium for identifying webpage crawler behaviors, wherein the method comprises the following steps: acquiring user behavior data of multiple dimensions; converting the user behavior data into training characteristics of a preset type according to a preset rule; converting the training features into a sample set; inputting the sample set into a recognition model to train the recognition model by using the sample set; acquiring an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time; comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the identification model; and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior. The scheme can improve comprehensiveness and accuracy of crawler behavior identification.

Description

Method, device and storage medium for identifying webpage crawler behavior
Technical Field
The embodiment of the application relates to the technical field of internet, in particular to a method, a device and a storage medium for identifying webpage crawler behaviors.
Background
Normal users and crawlers both access web pages, but their access characteristics differ. A normal user typically reads a web page slowly, scrolling gradually with the mouse wheel. A crawler, by contrast, obtains all the information on a web page immediately, transmits the data back to its server, and crawls the next web page at once, repeating these steps. The habits of normal users and of crawlers when visiting web pages are therefore different.
In researching and practicing the prior art, the inventor of the embodiment of the application found that existing approaches judge crawler behavior with low accuracy and can mistake the web browsing behavior of a normal user for crawler behavior.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying webpage crawler behaviors and a storage medium, and can improve comprehensiveness and accuracy of identifying the crawler behaviors.
In a first aspect, an embodiment of the present application provides a method for identifying web crawler behaviors, where the method includes:
acquiring user behavior data of multiple dimensions;
converting the user behavior data into training features of a preset type according to a preset rule;
converting the training features into a sample set;
inputting the sample set into a recognition model to train the recognition model by using the sample set;
acquiring an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time;
comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the identification model;
and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
In one possible design, the user behavior data includes at least one of the following:
web page residence time, the scroll count of the page scroll wheel, the scrolling time interval of the page scroll wheel, the number of click events on the page, the time interval between adjacent click events, whether the page has an external link, whether the next page is a jump page of the current page, or whether a page video is clicked and played.
In one possible design, the converting the user behavior data into a preset type of training feature according to a preset rule includes:
determining the behavior types of various behavior data in the user behavior data;
acquiring a judgment condition matched with the behavior type;
according to the behavior types, respectively adopting the matched judging conditions to judge the behavior data matched with the behavior types to obtain the judging results of all the behavior data;
vectorizing each item of behavior data according to the judgment result of each item of behavior data to obtain a feature vector;
and taking the feature vector as the training feature.
In one possible design, the converting the training features into a sample set includes:
according to the preset rule, setting labels for the user behavior data of each dimension respectively;
associating the training features with their corresponding labels;
and generating the sample set according to the associated training features and labels.
In a possible design, the vectorizing the behavior data according to the determination result of the behavior data to obtain a feature vector includes:
if the judgment result is positive, setting a first mark for the behavior data corresponding to the positive judgment result;
if the judgment result is negative, setting a second mark for the behavior data corresponding to the negative judgment result;
and forming the feature vector by using the marks corresponding to the various pieces of behavior data.
In one possible design, after determining that the access behavior of the user is crawler behavior, the method further includes:
generating a verification code, wherein the verification code is used for verifying the webpage access behavior of the user;
and sending the verification code to a terminal where the user is located.
In one possible design, after determining that the access behavior of the user is crawler behavior, the method further includes:
acquiring a network address of a terminal where the user is located;
and blocking the network address.
In a second aspect, an embodiment of the present application provides an apparatus for identifying web crawler behaviors, which has the function of implementing the method for identifying web crawler behaviors provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, and the modules may be software and/or hardware.
In one possible design, the means for identifying web crawler behavior includes:
a transceiver module, used for acquiring user behavior data of multiple dimensions;
a processing module, used for converting the user behavior data acquired by the transceiver module into training features of a preset type according to a preset rule; converting the training features into a sample set; and inputting the sample set into a recognition model to train the recognition model by using the sample set;
the transceiver module is further used for acquiring an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time;
the processing module is further used for comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the recognition model; and if at least one item of behavior data in the access record is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
In one possible design, the user behavior data includes at least one of the following:
web page residence time, the scroll count of the page scroll wheel, the scrolling time interval of the page scroll wheel, the number of click events on the page, the time interval between adjacent click events, whether the page has an external link, whether the next page is a jump page of the current page, or whether a page video is clicked and played.
In one possible design, the processing module is specifically configured to:
determining the behavior types of various behavior data in the user behavior data;
acquiring a judgment condition matched with the behavior type;
according to the behavior types, respectively adopting the matched judging conditions to judge the behavior data matched with the behavior types to obtain the judging results of all the behavior data;
vectorizing each item of behavior data according to the judgment result of each item of behavior data to obtain a feature vector;
and taking the feature vector as the training feature.
In one possible design, the processing module is specifically configured to:
according to the preset rule, setting labels for the user behavior data of each dimension respectively;
associating the training features with their corresponding labels;
and generating the sample set according to the associated training features and labels.
In one possible design, the processing module is specifically configured to:
if the judgment result is positive, setting a first mark for the behavior data corresponding to the positive judgment result;
if the judgment result is negative, setting a second mark for the behavior data corresponding to the negative judgment result;
and forming the feature vector by using the marks corresponding to the various pieces of behavior data.
In one possible design, after the processing module determines that the access behavior of the user is crawler behavior, the processing module is further configured to:
generating a verification code, wherein the verification code is used for verifying the webpage access behavior of the user;
and sending the verification code, through the transceiver module, to the terminal where the user is located.
In one possible design, after the processing module determines that the access behavior of the user is crawler behavior, the processing module is further configured to:
acquiring a network address of a terminal where the user is located;
and blocking the network address.
In yet another aspect, an apparatus for identifying web crawler behavior is provided, which includes at least one connected processor, a memory and a transceiver, where the memory is used for storing a computer program, and the processor is used for calling the computer program in the memory to execute the method of the first aspect.
Yet another aspect of the embodiments of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method of the first aspect.
Compared with the prior art, in the scheme provided by the embodiment of the application, on one hand, the recognition model is trained on multiple dimensions, so that its recognition accuracy is higher, whether an access request is a crawler behavior can be evaluated comprehensively, and misjudgment is reduced. On the other hand, the input of the recognition model is training features of multiple dimensions, so that crawler behaviors under a variety of conditions are covered, features that could indicate whether an access request is a crawler behavior are not missed, and the recognition range is wider.
Drawings
FIG. 1 is a flowchart illustrating a method for identifying web crawler behavior according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a recognition model in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for identifying web crawler behavior according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer device for executing the method for identifying web crawler behavior in the embodiment of the present application.
Detailed Description
The terms "first," "second," and the like in the description, the claims, and the drawings of the embodiments of the application are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances, so that the embodiments described herein may be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover non-exclusive inclusions: a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The division of modules presented in the present application is merely a logical division and may be implemented differently in a practical application; for example, multiple modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communicative connections shown or discussed may be made through interfaces, and indirect couplings or communicative connections between modules may be electrical or take other similar forms; the embodiments of the present application are not limited in this respect. Moreover, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of the present application.
The embodiment of the application provides a method, a device and a storage medium for identifying webpage crawler behaviors, which can be used on a server side in a crawler system, wherein the server side can be used for identifying whether the access behaviors of the webpage are crawler behaviors.
The embodiment of the application mainly provides the following technical scheme:
the input of the recognition model is improved, namely the recognition model is trained by adopting user behavior data of multiple dimensions, so that the recognition model can recognize whether the webpage access behavior is a crawler behavior from the multiple dimensions, and can comprehensively evaluate whether the webpage access behavior is the crawler behavior, the accuracy is better, and the recognition range is wide.
Referring to fig. 1, a method for identifying web crawler behavior provided by an embodiment of the present application is described below, where an execution subject of the method may be a server or a device for identifying web crawler behavior. The device for identifying the web crawler behavior may be a server or an application installed on the server, and the embodiment of the present application is not limited thereto. The embodiment of the application comprises the following steps:
101. Acquiring user behavior data of multiple dimensions.
The user behavior data can be historical data of a user's visits to a website or an operation log stored in the server back end. The embodiment of the application does not limit the acquisition channel of the user behavior data or whether the data is preprocessed.
In some embodiments, the user behavior data comprises at least one of the following:
web page residence time, the scroll count of the page scroll wheel, the scrolling time interval of the page scroll wheel, the number of click events on the page, the time interval between adjacent click events, whether the page has external links, and whether the next page is a jump page of the current page.
The residence time of the web page refers to the time the user stays on the web page, and may also be referred to as the web page dwell time. A normal user typically stays on a web page for a while, so a very short stay may indicate a crawler. A threshold can therefore be defined for the residence time: if the time difference between entering and leaving the web page is greater than the threshold, the value of the residence time feature is 1; otherwise it is 0. For example, the threshold of the web page residence time is set to 1 min.
The scroll count of the page scroll wheel comprises the number of downward scrolls and the number of upward scrolls. Number of downward scrolls: after opening a web page, a normal user scrolls up and down with the wheel to view the content, whereas a typical crawler may not use the scroll wheel at all (though not always, since some crawlers simulate such operations). The scroll count therefore readily indicates whether crawler behavior is present, and a threshold can be defined: the value of this feature is 1 if the number of downward scrolls is greater than the threshold, and 0 otherwise. Number of upward scrolls: even a crawler that simulates scrolling is unlikely to scroll up, so a threshold can likewise be defined: the value of this feature is 1 if the number of upward scrolls is greater than the threshold, and 0 otherwise.
The scrolling time interval of the page scroll wheel comprises the time interval between upward scrolls and the time interval between downward scrolls. It may also be called the scroll frequency of the page scroll wheel, which includes the frequency of upward or downward scrolling. In general, a normal user reads part of a page and then scrolls up or down, and reading takes a certain amount of time, whereas a crawler generally does not scroll a page at random. The scrolling time interval of the page scroll wheel can thus be used to determine crawler behavior.
The number of click events on a page refers to, for example, the number of mouse clicks on the page or of user gesture inputs. When it is the number of mouse clicks, a threshold can be defined, since many normal users habitually double-click the page with the left mouse button while browsing: if the number of left-button double-clicks on the page is greater than the threshold, the value of the click-count feature is 1; otherwise it is 0.
The time interval between adjacent click events refers to, for example, the interval between two or more successive mouse clicks. When it is the interval between two mouse clicks, a threshold can be defined, since a crawler typically does not simulate mouse clicks (unless a step requires it, e.g., web page 2 can only be entered through a button on web page 1): if the average click interval is greater than the threshold, the value of this feature is 1; otherwise it is 0.
Whether the page has external links refers to, for example, whether an off-site link is clicked. A page may contain links to other websites (external links), and a crawler generally does not visit external links, confining itself to web pages under the same domain name. Therefore, if an off-site link is clicked, the value of the external-link feature is 1; otherwise it is 0.
Whether the next page is a jump page of the current page: a normal user usually reaches the next page by jumping from the current page, whereas a crawler may not jump via the current page; a breadth-first crawler, for example, basically does not jump from the current page. If the next page is a jump page of the current page, the value of this feature is 1; otherwise it is 0.
Whether the page video is clicked and played refers to, for example, whether an in-site video is clicked. A normal user may click a video on the website, whereas a crawler basically does not, or cannot, click it. Therefore, if the video is clicked, the value of this feature is 1; otherwise it is 0.
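The raw quantities behind these features (scroll counts in each direction, click count, average click interval) can be derived from a per-page event log. The following is a minimal sketch under stated assumptions: the `Event` record, the event names `scroll_down`, `scroll_up` and `click`, and the function `summarize_events` are hypothetical and do not come from the application, which does not specify a log format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Event:
    kind: str         # "scroll_down", "scroll_up" or "click" (hypothetical names)
    timestamp: float  # seconds since the page was opened


def summarize_events(events):
    """Reduce one page's raw event log to the raw metrics described in the text."""
    scroll_down = sum(1 for e in events if e.kind == "scroll_down")
    scroll_up = sum(1 for e in events if e.kind == "scroll_up")
    click_times = sorted(e.timestamp for e in events if e.kind == "click")
    # Intervals between adjacent click events; empty when fewer than two clicks.
    intervals = [b - a for a, b in zip(click_times, click_times[1:])]
    return {
        "scroll_down_count": scroll_down,
        "scroll_up_count": scroll_up,
        "click_count": len(click_times),
        "avg_click_interval": mean(intervals) if intervals else 0.0,
    }


events = [
    Event("scroll_down", 3.0),
    Event("scroll_down", 9.5),
    Event("click", 12.0),
    Event("scroll_up", 20.0),
    Event("click", 15.0),
]
summary = summarize_events(events)
print(summary)
```

Each metric would then be compared with its own threshold to produce the 0/1 feature value.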
102. Converting the user behavior data into training features of a preset type according to a preset rule.
The preset rule comprises: dividing the features of different dimensions into different preset types, and defining a quantization mode for the features of each preset type. For example, the user behavior data includes: user A accessed web page 1 from 2019-12-10 8:00:05 to 2019-12-10 8:02:21, and accessed web page 2 from 2019-12-10 8:04:45 to 2019-12-10 8:12:34. Since such data cannot be used directly to train the recognition model, the user behavior data is quantized into a residence time of 2 minutes 16 seconds for user A on web page 1 and 7 minutes 49 seconds on web page 2. Other types of data are handled similarly and are not described in detail.
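The quantization in this example is a plain timestamp subtraction. A minimal sketch, reading the timestamps as 2019-12-10 8:00:05 and so on; the function name `dwell_seconds` and the format string are assumptions made for illustration:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed log timestamp format


def dwell_seconds(enter, leave):
    """Residence time on a page, in whole seconds."""
    delta = datetime.strptime(leave, FMT) - datetime.strptime(enter, FMT)
    return int(delta.total_seconds())


page1 = dwell_seconds("2019-12-10 08:00:05", "2019-12-10 08:02:21")
page2 = dwell_seconds("2019-12-10 08:04:45", "2019-12-10 08:12:34")
print(divmod(page1, 60))  # (2, 16): 2 minutes 16 seconds
print(divmod(page2, 60))  # (7, 49): 7 minutes 49 seconds
```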
In some embodiments, the converting the user behavior data into the preset type of training features according to the preset rule includes:
a. Determining the behavior type of each item of behavior data in the user behavior data.
b. Acquiring a judgment condition matched with the behavior type.
c. According to the behavior types, judging the behavior data matched with each behavior type using the matched judgment condition, to obtain a judgment result for each item of behavior data.
d. Vectorizing the behavior data according to the judgment results to obtain the feature vector.
In some embodiments, the vectorizing the behavior data according to the determination result of the behavior data to obtain the feature vector includes:
if the judgment result is positive, setting a first mark for the behavior data corresponding to the positive judgment result;
if the judgment result is negative, setting a second mark for the behavior data corresponding to the negative judgment result;
and forming the feature vector by using the marks corresponding to the various pieces of behavior data.
The first mark and the second mark may be numbers, symbols or characters, or a combination of at least one of the numbers, symbols or characters, and the embodiment of the present application does not limit the expression of the first mark and the second mark, as long as the determination results of each item of behavior data can be distinguished and the formation of subsequent feature vectors is not affected.
For example, suppose the web page residence time is 2 min, the scroll count of the page scroll wheel is 4, and the next page is a jump page of the current page. The page residence time and the scroll count then yield positive judgment results, so first marks may be set for these items of behavior data, e.g., all set to 1.
For another example, if the web page residence time is 0.01 second, the scroll count of the page scroll wheel is 2, the next page is not a jump page of the current page, and the page video is not clicked and played, these are all negative judgment results. Second marks may therefore be set for these items of behavior data, e.g., all set to 0.
e. Taking the feature vector as the training feature.
For example, user behavior data includes:
the web page residence time is 2 min, the scroll count of the page scroll wheel is 4, the scrolling time interval of the page scroll wheel is 4 seconds, the number of click events on the page is 2, the time interval between adjacent click events is 1 second, the page has an external link, the next page is a jump page of the current page, and the page video is clicked and played.
Thus, the training features may be represented as:
[1,1,1,1,1,1,1,1]
For another example, the web page residence time is 0.01 second, the scroll count of the page scroll wheel is 2, the scrolling time interval of the page scroll wheel is 4 seconds, the number of click events on the page is 0, the time interval between adjacent click events is 0 seconds, the page has an external link, the next page is not a jump page of the current page, and the page video is not clicked and played.
Thus, the training features may be represented as:
[0,0,1,0,0,1,0,0]
where 1 indicates that the user satisfies the judgment condition for the corresponding training feature, and 0 indicates that the user lacks that feature or does not satisfy its judgment condition. The feature vector may be represented as an ordered sequence or an array, which is not limited in the embodiments of the present application.
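Steps a to e can be sketched as a per-feature threshold check followed by assembling the marks into a vector. Only the 1 min residence threshold comes from the text; the other thresholds, the field names and the function `vectorize` are assumptions chosen so that the two example records above reproduce the two example vectors.

```python
# Judgment conditions per behavior type: the feature is 1 when value > threshold.
THRESHOLDS = {
    "dwell_seconds": 60,       # 1 min, the example threshold given in the text
    "scroll_count": 3,         # assumed
    "scroll_interval_s": 1.0,  # assumed
    "click_count": 1,          # assumed
    "click_interval_s": 0.5,   # assumed
}
# Yes/no behavior types need no threshold.
BOOLEAN_FEATURES = ("external_link", "next_is_jump_page", "video_played")


def vectorize(record):
    """Apply each judgment condition and collect the first/second marks (1/0)."""
    vector = [1 if record[name] > limit else 0 for name, limit in THRESHOLDS.items()]
    vector += [1 if record[name] else 0 for name in BOOLEAN_FEATURES]
    return vector


first_example = {
    "dwell_seconds": 120, "scroll_count": 4, "scroll_interval_s": 4,
    "click_count": 2, "click_interval_s": 1,
    "external_link": True, "next_is_jump_page": True, "video_played": True,
}
second_example = {
    "dwell_seconds": 0.01, "scroll_count": 2, "scroll_interval_s": 4,
    "click_count": 0, "click_interval_s": 0,
    "external_link": True, "next_is_jump_page": False, "video_played": False,
}
print(vectorize(first_example))   # [1, 1, 1, 1, 1, 1, 1, 1]
print(vectorize(second_example))  # [0, 0, 1, 0, 0, 1, 0, 0]
```

Keeping the conditions in a table rather than in code makes it straightforward to add a dimension or retune a threshold without touching the vectorization logic.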
103. Converting the training features into a sample set.
Wherein, the sample set refers to a feature set used for training the recognition model.
In some embodiments, the converting the training features into a sample set includes:
according to the preset rule, setting labels for the user behavior data of each dimension respectively;
associating the training features with their corresponding labels;
and generating the sample set according to the associated training features and labels.
The labels distinguish normal user access records from crawler access records. Crawler records can be labeled manually by website administrators or obtained through anti-crawler measures, for example by detecting the network address, the User Agent (UA) and the like. The embodiment of the application does not limit the acquisition mode or the acquisition channel of the crawler access records.
For example, after setting a corresponding label for the user behavior feature, combining the label with the training feature to form a sample set:
[1,1,1,1,1,1,1,1] [0]
[0,0,1,0,0,1,0,0] [1]
wherein [1,1,1,1,1,1,1,1] and [0,0,1,0,0,1,0,0] are training features, and [0] and [1] are labels. [0] is the label for training feature [1,1,1,1,1,1,1,1], and [1] is the label for training feature [0,0,1,0,0,1,0,0].
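Pairing each training feature with its label can be sketched as follows. The function name `build_sample_set` is hypothetical, and reading label 0 as normal access and label 1 as crawler access is inferred from the example rather than stated explicitly in the text.

```python
def build_sample_set(feature_vectors, labels):
    """Associate each training feature with its label to form the sample set."""
    if len(feature_vectors) != len(labels):
        raise ValueError("each training feature needs exactly one label")
    return [(list(features), label) for features, label in zip(feature_vectors, labels)]


sample_set = build_sample_set(
    [[1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 1, 0, 0, 1, 0, 0]],
    [0, 1],  # assumed meaning: 0 = normal user access, 1 = crawler access
)
# Split into the feature matrix and label list that most training APIs expect.
X = [features for features, _ in sample_set]
y = [label for _, label in sample_set]
print(X, y)
```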
104. Inputting the sample set into a recognition model to train the recognition model by using the sample set.
The identification model can be used for identifying whether the webpage access behavior is a crawler behavior or a normal user access behavior.
In some embodiments, the recognition model may be obtained by training with xgboost (a tree model), a random forest, a support vector machine (SVM), a decision tree, or the like, so that the obtained recognition model has higher accuracy. When the recognition model is trained, its hyper-parameters can be adjusted continuously to fit the actual service, gradually arriving at the recognition model best suited to that service.
The training process of the recognition model is described as follows:
the recognition model comprises a plurality of hyper-parameters, and can be obtained by training through a plurality of machine learning methods, for example, the recognition model can be obtained by training through xgboost and svm.
Among the hyper-parameters of the recognition model, the learning rate and the tree depth have a large influence on recognition accuracy, so the embodiment of the application mainly takes the adjustment of these two as an example. For example, when the recognition model is obtained through xgboost training, the learning rate and the tree depth of xgboost are adjusted repeatedly until the recognition effect is best; for example, the final adjusted learning rate is 0.15 and the tree depth is 4. The learning rate of the recognition model can therefore be given a value range of [0.1,0.2], with a default of, for example, 0.15 for xgboost. A step size is also set for the value range: if the step size is 0.05, the learning rate can be any one of [0.1,0.15,0.2]. As another example, the tree depth defaults to 3 with a range of [2,6]; if the step size is 1, this parameter can be any one of [2,3,4,5,6]. The values of the other parameters of the recognition model follow by analogy. Specifically, tuning the hyper-parameters means combining the selectable values of the different hyper-parameters, trying all combinations, and then selecting the best one.
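Trying every combination of candidate values is an exhaustive grid search. The sketch below builds the two candidate lists from the ranges and step sizes given in the text and iterates over all pairs. The helper names `value_grid` and `grid_search` are hypothetical, and `toy_score` is only a stand-in so the example runs on its own; in practice the scoring function would train the model with the given learning rate and tree depth and return its accuracy on labeled data.

```python
from itertools import product


def value_grid(low, high, step):
    """Candidate values from low to high inclusive, spaced by step."""
    count = round((high - low) / step)
    # Rounding removes floating-point noise such as 0.15000000000000002.
    return [round(low + i * step, 10) for i in range(count + 1)]


learning_rates = value_grid(0.1, 0.2, 0.05)  # [0.1, 0.15, 0.2]
tree_depths = value_grid(2, 6, 1)            # [2, 3, 4, 5, 6]


def grid_search(evaluate):
    """Score every (learning rate, depth) pair and keep the best one."""
    best = None
    for lr, depth in product(learning_rates, tree_depths):
        score = evaluate(lr, depth)
        if best is None or score > best[0]:
            best = (score, lr, depth)
    return best


def toy_score(lr, depth):
    # Placeholder objective, NOT a real accuracy measurement: it simply
    # peaks at the final values mentioned in the text (0.15 and 4).
    return -abs(lr - 0.15) - abs(depth - 4)


best_score, best_lr, best_depth = grid_search(toy_score)
print(len(learning_rates) * len(tree_depths), best_lr, best_depth)
```

With three learning rates and five depths the search evaluates 15 combinations; the cost grows multiplicatively as further hyper-parameters are added.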
For example, when the training features include the dwell time and the scroll time, a recognition model is built from these existing user features, so that the trained recognition model can recognize whether a web page access behavior is a crawler behavior according to the two types of features. The quality of the recognition model can be evaluated against the existing labels: the higher the accuracy of the predicted labels, the better the recognition effect of the recognition model.
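Evaluating the model against existing labels amounts to measuring how often the predicted label equals the known one. A minimal sketch; the function name `accuracy` and the sample label lists are illustrative only and are not data from the application.

```python
def accuracy(predicted, actual):
    """Fraction of predicted labels that match the existing labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


existing_labels = [0, 1, 1, 0]   # illustrative: 0 = normal access, 1 = crawler
predicted_labels = [0, 1, 0, 0]  # illustrative model output
print(accuracy(predicted_labels, existing_labels))  # 0.75
```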
In some embodiments, in order to improve the recognition effect of the recognition model, the recognition model may adopt the model structure shown in fig. 2. The recognition model can adopt a 3-layer random forest network, and each layer can process each training feature separately. For example, the three layers can each be trained on one or more training features, so that a better recognition effect can be achieved. The outputs of the layers are then summarized and weighted, and the weighted result (hn) is used to judge whether the training features input into the recognition model correspond to the normal access behavior of a user or to crawler behavior. As shown in FIG. 3, the accesses input into the recognition model are recorded as Fn, Fn-1, Fn-2, …, Fn-m+1; the weighted result is hn; hn is passed through 2 fully connected layers (dense layers); and the recognition result Yn is output.
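The summarize-and-weight step can be illustrated with a short sketch. It assumes each layer emits a crawler score between 0 and 1 and that the weights and the 0.5 threshold are chosen by the implementer; the function names are hypothetical, and the two fully connected layers mentioned above are deliberately replaced here by a plain threshold.

```python
def weighted_result(layer_scores, weights):
    """Combine per-layer crawler scores into one weighted score (h_n)."""
    if len(layer_scores) != len(weights):
        raise ValueError("one weight is required per layer")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(layer_scores, weights)) / total


def classify(layer_scores, weights, threshold=0.5):
    """Map the weighted score to a result; the threshold is an assumption."""
    h_n = weighted_result(layer_scores, weights)
    return "crawler" if h_n >= threshold else "normal"


# Three layers, with the last layer trusted twice as much as the others.
print(weighted_result([1.0, 0.0, 1.0], [1, 1, 2]))
print(classify([0.2, 0.1, 0.3], [1, 1, 1]))
```

Normalizing by the sum of the weights keeps the combined score in the same 0 to 1 range as the per-layer scores, which makes a single threshold meaningful.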
105. An access record is obtained.
Wherein the access record is at least one item of behavior data of the user accessing the webpage in historical time. For example, the access record may include at least one of:
webpage dwell time, the number of page scroll-wheel events, the time interval between page scroll-wheel events, the number of click events on the page, the time interval between adjacent click events, whether an external link exists in the page, whether the next page is a jump page of the current page, or whether a page video is clicked and played.
106. Comparing each item of behavior data in the access records with preset behavior characteristics respectively based on the recognition model, and determining that the access behavior of the user is a crawler behavior if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic.
For example, while a user (possibly a crawler) is accessing a web page, a behavior log of the web page access over a period of time is acquired, and the recognition model judges whether the behavior recorded in that log is a crawler behavior. If it is, the request for accessing the web page is disconnected directly to stop the current access operation; alternatively, a verification request is sent to the user on the web page, asking the user to enter a verification code on the web page.
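The matching rule of step 106 (at least one item of behavior data matching at least one preset behavior characteristic) can be sketched as follows. The field names and thresholds are hypothetical, and fixed predicates stand in for the comparison that the trained recognition model would perform.

```python
# Hypothetical preset behavior characteristics: each predicate returns True
# when the value looks like crawler behavior. Thresholds are illustrative.
PRESET_FEATURES = {
    "dwell_time_s": lambda v: v < 1.0,   # page left almost immediately
    "scroll_count": lambda v: v == 0,    # page never scrolled
    "click_count": lambda v: v == 0,     # page never clicked
}


def matched_features(access_record, preset_features=PRESET_FEATURES):
    """Return the names of the behavior items that match a preset characteristic."""
    return [
        name
        for name, is_suspicious in preset_features.items()
        if name in access_record and is_suspicious(access_record[name])
    ]


def is_crawler(access_record, preset_features=PRESET_FEATURES):
    """Crawler behavior is determined when at least one item matches."""
    return len(matched_features(access_record, preset_features)) > 0


record = {"dwell_time_s": 0.3, "scroll_count": 0, "click_count": 5}
print(matched_features(record), is_crawler(record))
```

Returning the list of matched items, rather than only a boolean, makes it easier to log why a particular access was flagged.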
In some embodiments, each item of behavior data in the access record may be judged based on the structure of the recognition model shown in FIG. 2. The outputs of the layers are finally summarized and weighted to obtain a weighted result, and whether the features input into the recognition model correspond to a user access or a crawler access is judged according to the weighted result.
In some embodiments, after determining that the access behavior of the user is a crawler behavior, the recognition model outputs a recognition result indicating that the current access behavior of the user is a crawler behavior. An operation to perform on the user currently accessing the webpage is then decided based on the recognition result of the recognition model. Specifically, the following two approaches are included:
(1) Using verification codes to prevent further network access behavior
In some embodiments, a verification code may be generated, where the verification code is used to verify the webpage access behavior of the user, and the verification code is sent to a terminal where the user is located.
The verification code can be generated regularly or randomly, and the generation mode of the verification code is not limited in the embodiment of the application.
Optionally, a timer may also be set. If the verification code input by the user on the verification code interface is not received within a preset time period, a verification timeout is returned directly to the terminal where the user is located. Alternatively, if the verification code input by the user on the verification code interface is incorrect, a verification failure is returned directly to the terminal where the user is located.
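A minimal sketch of the verification-code flow with a timeout is given below. The code length, the alphabet, the 60-second period, and the result strings are all assumptions made for illustration; the embodiment does not limit how the verification code is generated.

```python
import secrets
import string
import time


def generate_code(length=6):
    """Generate a random verification code (length and alphabet are assumptions)."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def verify(expected, submitted, issued_at, now, timeout_s=60.0):
    """Return 'timeout', 'failed', or 'ok' for a submitted verification code."""
    # Nothing received, or received after the preset period: verification timeout.
    if submitted is None or now - issued_at > timeout_s:
        return "timeout"
    # Constant-time comparison of the submitted and expected codes.
    if not secrets.compare_digest(submitted.encode(), expected.encode()):
        return "failed"
    return "ok"


code = generate_code()
issued_at = time.monotonic()
print(verify(code, code, issued_at, issued_at + 5.0))
```

Passing `issued_at` and `now` explicitly, instead of reading the clock inside `verify`, keeps the timeout logic easy to test.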
(2) Preventing further network access behavior by blocking network addresses
In some embodiments, the network address of the terminal where the user is located may be obtained, and the network address may be blocked. For example, when the network address is an Internet Protocol (IP) address, the IP address itself may be blocked, or the IP segment to which the IP address belongs may be blocked. The embodiment of the application does not limit the blocking manner or the content of the network address. The account that the user is logged in to, or the terminal identifier of the terminal where the user is located, may also be banned, which is not limited in the embodiments of the present application.
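Blocking a single IP address or a whole IP segment can be sketched with Python's standard `ipaddress` module. The class name `AddressBlocklist` and its methods are hypothetical, and the addresses used are from ranges reserved for documentation.

```python
import ipaddress


class AddressBlocklist:
    """In-memory blocklist holding single addresses and whole IP segments."""

    def __init__(self):
        self._networks = []

    def block_address(self, address):
        """Block exactly one IP address by storing it as a single-host network."""
        ip = ipaddress.ip_address(address)
        self._networks.append(ipaddress.ip_network(f"{ip}/{ip.max_prefixlen}"))

    def block_segment(self, cidr):
        """Block an IP segment given in CIDR notation, e.g. '198.51.100.0/24'."""
        self._networks.append(ipaddress.ip_network(cidr, strict=False))

    def is_blocked(self, address):
        """Check whether an address falls inside any blocked network."""
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self._networks)


blocklist = AddressBlocklist()
blocklist.block_address("203.0.113.7")
blocklist.block_segment("198.51.100.0/24")
print(blocklist.is_blocked("198.51.100.42"))
```

Storing a single address as a one-host network lets both cases share the same membership check. A production system would more likely push such rules to a firewall or gateway than keep them in process memory.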
In the embodiment of the application, the training features used for training the model are considered comprehensively: they analyze the suspicious characteristics related to crawler behavior from all dimensions (namely webpage dwell time, number of downward scrolls, number of upward scrolls, number of mouse clicks on the page, interval between two mouse clicks, whether off-site links are clicked, whether the next page is a page jump, whether in-site videos are clicked, and other user behavior data). On one hand, the model is trained from multiple dimensions, so its recognition accuracy is higher; whether an access request is a crawler behavior can therefore be evaluated comprehensively, and misjudgment is reduced. On the other hand, the input of the model is suspicious characteristics of multiple dimensions, so crawler behaviors under multiple conditions can be covered, other characteristics that can be used to judge whether an access request is a crawler behavior are not missed, and the recognition range is wider.
Any technical feature mentioned in the embodiments corresponding to fig. 1 and fig. 2 also applies to the embodiments corresponding to fig. 3 and fig. 4 of the present application, and similar details are not repeated below.
The above describes a method for identifying web crawler behaviors in the embodiment of the present application, and the following describes an apparatus for identifying web crawler behaviors in the embodiment of the present application.
Fig. 3 is a schematic structural diagram of an apparatus 30 for identifying web crawler behavior, which can be applied to identify crawler behavior. The apparatus 30 for identifying web crawler behavior in the embodiment of the present application can implement the steps of the method for identifying web crawler behavior executed in the embodiment corresponding to fig. 1. The functions implemented by the apparatus 30 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The apparatus 30 may include a processing module and a transceiver module; for the functional implementation of the processing module and the transceiver module, reference may be made to the operations executed in the embodiment corresponding to fig. 1, which are not described herein again. For example, the processing module may be used to control the acquiring, transmitting, and receiving operations of the transceiver module.
In some embodiments, the transceiver module may be configured to obtain user behavior data in multiple dimensions;
the processing module can be used for converting the user behavior data acquired by the transceiver module into a preset type of training feature according to a preset rule; converting the training features into a sample set; inputting the sample set into a recognition model to train the recognition model by using the sample set;
the transceiver module is further used for obtaining an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time;
the processing module is further used for comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the recognition model; and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
In the embodiment of the application, on one hand, the recognition model is trained from multiple dimensions, so that the recognition accuracy of the recognition model is higher, whether an access request is a crawler behavior can be comprehensively evaluated, and misjudgment is reduced. On the other hand, the input of the recognition model is training features of multiple dimensions obtained from user behavior data of multiple dimensions, so that crawler behaviors under multiple conditions can be covered, other features that can be used for judging whether the access request is a crawler behavior are not missed, and the recognition range is wider.
In some embodiments, the user behavior data comprises at least one of the following:
webpage dwell time, the number of page scroll-wheel events, the time interval between page scroll-wheel events, the number of click events on the page, the time interval between adjacent click events, whether an external link exists in the page, whether the next page is a jump page of the current page, or whether a page video is clicked and played.
In some embodiments, the processing module is specifically configured to:
determining the behavior types of various behavior data in the user behavior data;
acquiring a judgment condition matched with the behavior type;
according to the behavior types, respectively adopting the matched judging conditions to judge the behavior data matched with the behavior types to obtain the judging results of all the behavior data;
vectorizing each item of behavior data according to the judgment result of each item of behavior data to obtain a feature vector;
and taking the feature vector as the training feature.
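The conversion steps listed above can be sketched as follows: each behavior type is paired with its own judgment condition, every item of behavior data is judged with the matching condition, and the results are assembled into a feature vector. The behavior-type names, the thresholds, and the choice of 1 and 0 as marks for positive and negative results are assumptions made for illustration.

```python
# Hypothetical judgment conditions, one per behavior type.
JUDGMENT_CONDITIONS = {
    "dwell_time_s": lambda v: v >= 2.0,   # stayed on the page long enough
    "scroll_count": lambda v: v > 0,      # scrolled at least once
    "click_count": lambda v: v > 0,       # clicked at least once
    "video_played": lambda v: bool(v),    # played a page video
}

# Assumed marks: 1 for a positive judgment result, 0 for a negative one.
FIRST_MARK, SECOND_MARK = 1, 0


def to_feature_vector(behavior_data, conditions=JUDGMENT_CONDITIONS):
    """Judge each behavior item with its matching condition and vectorize the results."""
    vector = []
    for behavior_type, condition in conditions.items():
        result = condition(behavior_data[behavior_type])
        vector.append(FIRST_MARK if result else SECOND_MARK)
    return vector


visit = {"dwell_time_s": 12.5, "scroll_count": 3, "click_count": 0, "video_played": False}
print(to_feature_vector(visit))
```

Iterating over the conditions, rather than over the input data, fixes the order and length of the vector so that every sample has the same layout.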
In some embodiments, the processing module is specifically configured to:
according to the preset rule, setting labels for the user behavior data of each dimension respectively;
associating the training features with the corresponding labels;
and generating the sample set according to the associated training features and labels.
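Generating the sample set from associated training features and labels can be sketched as below. The `Sample` type and the label encoding (1 for crawler, 0 for normal user) are assumptions; the text only requires that each training feature be associated with its label.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """One training sample: a feature vector and its associated label."""
    features: Tuple[int, ...]
    label: int  # assumed encoding: 1 = crawler, 0 = normal user


def build_sample_set(feature_vectors: Sequence[Sequence[int]],
                     labels: Sequence[int]) -> List[Sample]:
    """Associate each feature vector with its label and collect the samples."""
    if len(feature_vectors) != len(labels):
        raise ValueError("each feature vector needs exactly one label")
    return [Sample(tuple(f), int(l)) for f, l in zip(feature_vectors, labels)]


samples = build_sample_set([[1, 1, 0, 0], [0, 0, 0, 0]], [0, 1])
print(samples[1])
```

Checking the lengths up front avoids silently dropping unlabeled feature vectors, which `zip` alone would do.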
In some embodiments, the processing module is specifically configured to:
if the judgment result is positive, setting a first mark for the behavior data corresponding to the positive judgment result;
if the judgment result is negative, setting a second mark for the behavior data corresponding to the negative judgment result;
and forming the feature vector by using the marks corresponding to the various pieces of behavior data.
In some embodiments, after the processing module determines that the access behavior of the user is crawler behavior, the processing module is further configured to:
generating a verification code, wherein the verification code is used for verifying the webpage access behavior of the user;
and instructing the transceiver module to send the verification code to a terminal where the user is located.
In some embodiments, after the processing module determines that the access behavior of the user is crawler behavior, the processing module is further configured to:
acquiring a network address of a terminal where the user is located;
and blocking the network address.
The apparatus for identifying web crawler behavior in the embodiment of the present application is described above from the perspective of modular functional entities, and is described below from the perspective of hardware processing.
It should be noted that, in the embodiment shown in fig. 3 of this application, the entity device corresponding to the transceiver module may be a transceiver, and the entity device corresponding to the processing module may be a processor. The apparatus 30 shown in fig. 3 may have the structure shown in fig. 4. When the apparatus 30 has the structure shown in fig. 4, the processor and the transceiver in fig. 4 implement functions the same as or similar to those of the processing module and the transceiver module provided in the foregoing apparatus embodiment, and the memory in fig. 4 stores the computer program that the processor calls when executing the method for identifying web crawler behavior.
For example, a processor calls a computer program in memory to perform the following:
obtaining, by the transceiver, user behavior data for a plurality of dimensions;
converting the user behavior data acquired by the transceiver module into preset type training characteristics according to preset rules; converting the training features into a sample set; inputting the sample set into a recognition model to train the recognition model by using the sample set;
acquiring an access record through the transceiver, wherein the access record is at least one item of behavior data of a user accessing a webpage in historical time;
comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the identification model; and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program is loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The technical solutions provided by the embodiments of the present application are introduced in detail, and the principles and implementations of the embodiments of the present application are explained by applying specific examples in the embodiments of the present application, and the descriptions of the embodiments are only used to help understanding the method and core ideas of the embodiments of the present application; meanwhile, for a person skilled in the art, according to the idea of the embodiment of the present application, there may be a change in the specific implementation and application scope, and in summary, the content of the present specification should not be construed as a limitation to the embodiment of the present application.

Claims (10)

1. A method of identifying web crawler behavior, the method comprising:
acquiring user behavior data of multiple dimensions;
converting the user behavior data into training characteristics of a preset type according to a preset rule;
converting the training features into a sample set;
inputting the sample set into a recognition model to train the recognition model by using the sample set;
acquiring an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time;
comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the identification model;
and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
2. The method of claim 1, wherein the user behavior data comprises at least one of the following:
webpage dwell time, the number of page scroll-wheel events, the time interval between page scroll-wheel events, the number of click events on the page, the time interval between adjacent click events, whether an external link exists in the page, whether the next page is a jump page of the current page, or whether a page video is clicked and played.
3. The method of claim 2, wherein the converting the user behavior data into a preset type of training features according to a preset rule comprises:
determining the behavior types of various behavior data in the user behavior data;
acquiring a judgment condition matched with the behavior type;
according to the behavior types, respectively adopting the matched judging conditions to judge the behavior data matched with the behavior types to obtain the judging results of all the behavior data;
vectorizing each item of behavior data according to the judgment result of each item of behavior data to obtain a feature vector;
and taking the feature vector as the training feature.
4. The method of claim 3, wherein converting the training features into a sample set comprises:
according to the preset rule, setting labels for the user behavior data of each dimension respectively;
associating the training features with the corresponding labels;
and generating the sample set according to the associated training features and labels.
5. The method according to claim 3 or 4, wherein the vectorizing the behavior data according to the determination result of the behavior data to obtain the feature vector comprises:
if the judgment result is positive, setting a first mark for the behavior data corresponding to the positive judgment result;
if the judgment result is negative, setting a second mark for the behavior data corresponding to the negative judgment result;
and forming the feature vector by using the marks corresponding to the various pieces of behavior data.
6. The method of claim 5, wherein after determining that the user's access behavior is crawler behavior, the method further comprises:
generating a verification code, wherein the verification code is used for verifying the webpage access behavior of the user;
and sending the verification code to a terminal where the user is located.
7. The method of claim 5, wherein after determining that the user's access behavior is crawler behavior, the method further comprises:
acquiring a network address of a terminal where the user is located;
and blocking the network address.
8. An apparatus for identifying web crawler behavior, the apparatus comprising:
the transceiver module is used for acquiring user behavior data of multiple dimensions;
the processing module is used for converting the user behavior data acquired by the transceiver module into preset type training characteristics according to preset rules; converting the training features into a sample set; inputting the sample set into a recognition model to train the recognition model by using the sample set;
the transceiver module is further used for obtaining an access record, wherein the access record is at least one item of behavior data of a user accessing a webpage within historical time;
the processing module is further used for comparing each item of behavior data in the access record with a preset behavior characteristic respectively based on the recognition model; and if at least one item of behavior data in the access records is matched with at least one preset behavior characteristic, determining that the access behavior of the user is a crawler behavior.
9. A computer device, characterized in that the computer device comprises:
at least one processor, memory, and transceiver;
wherein the memory is for storing a computer program and the processor is for calling the computer program stored in the memory to perform the method of any one of claims 1-7.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
CN201911290447.6A 2019-12-16 2019-12-16 Method, device and storage medium for identifying webpage crawler behavior Pending CN112989158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911290447.6A CN112989158A (en) 2019-12-16 2019-12-16 Method, device and storage medium for identifying webpage crawler behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911290447.6A CN112989158A (en) 2019-12-16 2019-12-16 Method, device and storage medium for identifying webpage crawler behavior

Publications (1)

Publication Number Publication Date
CN112989158A true CN112989158A (en) 2021-06-18

Family

ID=76343001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911290447.6A Pending CN112989158A (en) 2019-12-16 2019-12-16 Method, device and storage medium for identifying webpage crawler behavior

Country Status (1)

Country Link
CN (1) CN112989158A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN114417812A (en) * 2022-03-15 2022-04-29 太平金融科技服务(上海)有限公司深圳分公司 Text checking method, device, equipment and storage medium
CN114710318A (en) * 2022-03-03 2022-07-05 戎行技术有限公司 Method, device, equipment and medium for limiting high-frequency access of crawler

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014105919A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Identifying web pages in malware distribution networks
CN109145185A (en) * 2018-02-02 2019-01-04 北京数安鑫云信息技术有限公司 It identifies web crawlers and extracts the method and device of web crawlers feature
CN109189660A (en) * 2018-09-30 2019-01-11 北京诸葛找房信息技术有限公司 A kind of crawler recognition methods based on user's mouse interbehavior
CN109862018A (en) * 2019-02-21 2019-06-07 中国工商银行股份有限公司 Anti- crawler method and system based on user access activity

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN113821705B (en) * 2021-08-30 2024-02-20 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN114710318A (en) * 2022-03-03 2022-07-05 戎行技术有限公司 Method, device, equipment and medium for limiting high-frequency access of crawler
CN114710318B (en) * 2022-03-03 2024-03-22 戎行技术有限公司 Method, device, equipment and medium for limiting high-frequency access of crawler
CN114417812A (en) * 2022-03-15 2022-04-29 太平金融科技服务(上海)有限公司深圳分公司 Text checking method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Viswanath et al. Towards detecting anomalous user behavior in online social networks
CN110442712B (en) Risk determination method, risk determination device, server and text examination system
CN111401416B (en) Abnormal website identification method and device and abnormal countermeasure identification method
KR101743269B1 (en) Method and apparatus of fraud detection by analysis of PC information and modeling of behavior pattern
US10229160B2 (en) Search results based on a search history
CN112989158A (en) Method, device and storage medium for identifying webpage crawler behavior
US20070198603A1 (en) Using exceptional changes in webgraph snapshots over time for internet entity marking
CN108763274B (en) Access request identification method and device, electronic equipment and storage medium
US11681765B2 (en) System and method for integrating content into webpages
CN104836781A (en) Method distinguishing identities of access users, and device
CN107852412A (en) For phishing and the system and method for brand protection
CN106708841B (en) The polymerization and device of website visitation path
CN108334758A (en) A kind of detection method, device and the equipment of user's ultra vires act
CN110516173B (en) Illegal network station identification method, illegal network station identification device, illegal network station identification equipment and illegal network station identification medium
CN106033579A (en) Data processing method and apparatus thereof
CN107888602A (en) A kind of method and device for detecting abnormal user
CN103713894A (en) Method and equipment for determining access demand information of user
CN102185830B (en) A kind of method and system of security filtration of network television browser
CN107784551A (en) Stock public sentiment data processing method, device, computer equipment and storage medium
CN112488716A (en) Abnormal event detection system
KR20070094264A (en) Method for targeting web advertisement clickers based on click pattern by using a collaborative filtering system with neural networks and system thereof
CN110324352A (en) Identify the method and device of batch registration account group
CN107294905A (en) A kind of method and device for recognizing user
CN103618761B (en) Method and browser for processing cookie information
CN109074401B (en) Extraction of primary content of a linked list

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination