CN113821754A

CN113821754A - Sensitive data interface crawler identification method and device

Info

Publication number: CN113821754A
Application number: CN202111100833.1A
Authority: CN
Inventors: 葛胜利; 魏国富; 夏玉明
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2021-12-21

Abstract

The invention discloses a sensitive data interface crawler identification method and device. The method comprises the following steps: acquiring a web access log of a website; identifying crawlers according to the web access log; judging the type of each crawler; according to the crawler type, initiating requests to the website using the crawler's parameters, obtaining the content of the request responses, collecting the response content by request url, and storing the text parts of the content returned by the website in groups by collection domain name; extracting feature data from the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name; and identifying whether the text keyword results contain sensitive information, and outputting whether sensitive data is involved and the type of sensitive data involved. The advantages of the invention are that crawler motivation is effectively identified, crawler behavior involving sensitive information is identified, and network information security is safeguarded.

Description

Sensitive data interface crawler identification method and device

Technical Field

The invention relates to the field of crawler identification, in particular to a crawler identification method and device for a sensitive data interface.

Background

In the prior art, data on a network can be acquired by means such as web crawlers, a web crawler being a program or script that automatically fetches website information according to certain rules. The prior art mostly aims at intercepting crawlers, but crawlers can bypass interception by changing their programs or simulating the behavior of real users, which is a particular concern when the interfaces of a website expose valuable sensitive information.

In the prior art, crawler identification methods fall roughly into two types. The first type is the expert rule engine scheme: data is collected from business logs, events on single or multiple attributes are configured and counted, and events exceeding a threshold are intercepted by a threshold rule; alternatively, interception is performed through blacklists built from attributes such as the IP address and the User-Agent. As technology has advanced, the black-market industry uses emulators and dedicated software to probe and bypass risk-control rule engines, so the information security of a website is difficult to ensure continuously, especially when the interfaces of the website expose valuable sensitive information.

The second type is the anomaly detection scheme based on user behavior sequences: a user access behavior path is constructed, the probability of the behavior path is calculated with a probability model or a similar technique, and an unsupervised learning method outputs the access paths of abnormal users and the associated users. However, this scheme produces a large number of false alarms, the workload of manual secondary analysis grows and becomes more tedious, and the information security of website interfaces carrying sensitive information remains difficult to maintain.

Chinese granted patent publication No. CN108712426B discloses a crawler identification method and system based on user behavior tracking points, wherein the method includes: S1, the client receives the access request initiated by the user and asynchronously sends the access request to the back-end service system; S2, after receiving the access request, the back-end service system synchronizes the access log of the user, wherein the access log comprises the access behavior data of the user; S3, the back-end service system aggregates the access behavior data through the rule engine; S4, the back-end service system judges from the aggregated access behavior data whether the user is a crawler, and if so, aggregates from the access log the crawler feature data used to identify the user as a crawler and asynchronously pushes the crawler feature data to a crawler list in the client through a message queue; and S5, the client responds to the access request according to the crawler list. That method synchronizes the access logs and identifies crawlers after aggregating the access behavior data in the logs, thereby improving both the crawler identification rate and the identification accuracy. However, not all crawlers need to be intercepted, and that scheme only identifies crawlers; it cannot identify crawlers that access sensitive data.

In summary, the prior art is mostly directed at intercepting crawlers and cannot identify crawlers that access sensitive data, so network information security is difficult to guarantee.

Disclosure of Invention

The technical problem to be solved by the invention is that the prior art lacks a method for crawler identification of an interface with sensitive information.

The invention solves the technical problems through the following technical means: a sensitive data interface crawler identification method, the method comprising the steps of:

step one: acquiring a web access log of a website;

step two: identifying the crawler according to the web access log;

step three: judging the type of the crawler;

step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;

step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;

step six: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result.

The method initiates requests to a website using the crawler's parameters according to the crawler type, obtains the content of the request responses, collects the response content by request url, stores the text parts of the content returned by the website in groups by collection domain name, uses a sensitive data discovery technology to identify whether the text keyword results contain sensitive information, and outputs whether sensitive data is involved and the type of sensitive data, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.

Further, the web access log includes time of request, IP address, user identity information, sessionid, requestbody, responsbody, method, status, and the user identity information includes account, cookie, uuid.

Further, in the second step, an anomaly detection method or a rule engine method based on the user behavior sequence is adopted to identify the crawler.

Further, the crawler types in step three include: switching pages by modifying parameters in the url, and switching pages on the same url by modifying the parameters of the POST request content.

Further, the fourth step includes:

step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, and thus simulating a crawler Request;

step 402: performing page analysis on a website accessed by a crawler, acquiring information returned by a website page, and acquiring content of a request response;

step 403: collecting the content of the request responses by request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.
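As a minimal sketch of the request simulation in step 401, the replayed request can be built with the Python standard library. The function names `build_simulated_request` and `fetch_text` are hypothetical, and the choice of `urllib` is an assumption; the source only states that the request carries additional headers information and reuses the crawler's parameters.

```python
import urllib.request
from typing import Dict, Optional


def build_simulated_request(url: str,
                            headers: Dict[str, str],
                            post_body: Optional[bytes] = None) -> urllib.request.Request:
    """Build a request that replays a crawler's url, headers and optional POST body."""
    # A body means the crawler switches pages through POST parameters.
    method = "POST" if post_body is not None else "GET"
    return urllib.request.Request(url, data=post_body, headers=headers, method=method)


def fetch_text(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body decoded as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Keeping request construction separate from sending lets the replayed parameters be inspected or logged before any traffic reaches the website.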

Further, the fifth step includes:

by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words in the stored texts whose word frequency exceeds a threshold value as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ denotes the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ denotes the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency with which the word $t_i$ occurs in the corpus.

Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.

Further, the sensitive data interface crawler identification method further comprises the seventh step of:

step seven: counting, for each crawler of a sensitive data interface identified in step six, the number of requests to the collection url, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistics.

The invention also provides a sensitive data interface crawler recognition device, which comprises:

the log acquisition module is used for acquiring a web access log of a website;

the crawler identification module is used for identifying the crawler according to the web access log;

the judging module is used for judging the type of the crawler;

the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;

the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;

and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.

Furthermore, the crawler identification module identifies the crawler by adopting an anomaly detection method or a rule engine method based on a user behavior sequence.

Furthermore, the crawler types in the judging module include: switching pages by modifying parameters in the url, and switching pages on the same url by modifying the parameters of the POST request content.

Still further, the crawler request simulation module comprises:

the Request simulation unit is used for initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request contains additional headers information, so that crawler Request simulation is performed;

the request response unit is used for carrying out page analysis on the website accessed by the crawler, acquiring information returned by the website page and obtaining the content of the request response;

the grouping storage unit is used for collecting the content of the request responses by request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and for storing the text parts returned by the website in groups by collection domain name.

Further, the feature extraction module is further configured to:

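calculate the word frequency by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

extract words in the stored texts whose word frequency exceeds a threshold value as feature data, and extract the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ denotes the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ denotes the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency with which the word $t_i$ occurs in the corpus.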

Furthermore, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for each crawler of a sensitive data interface identified by the sensitive judgment module, the number of requests to the collection url, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and for outputting the crawler risk level and the attack type according to the statistics.

The invention has the following advantages: the method initiates requests to a website using the crawler's parameters according to the crawler type, obtains the content of the request responses, collects the response content by request url, stores the text parts of the content returned by the website in groups by collection domain name, uses a sensitive data discovery technology to identify whether the text keyword results contain sensitive information, and outputs whether sensitive data is involved and the type of sensitive data, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.

Drawings

Fig. 1 is a flowchart of a sensitive data interface crawler identification method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

A sensitive data interface crawler identification method, the method comprising the steps of:

S1: acquiring a web access log of a website; the web access log comprises the time of the request, the IP address, the user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises an account number, a cookie and a uuid.
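As an illustration of the log fields listed in S1, the sketch below models one log entry as a Python dataclass. It assumes each entry arrives as one JSON object per line, which the source does not specify; the names `AccessLogRecord` and `parse_log_line` and the JSON keys `time` and `ip` are hypothetical, while the remaining field names follow the text above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessLogRecord:
    """One web access log entry with the fields listed in S1."""
    time: str                  # time of the request
    ip: str                    # client IP address
    account: Optional[str]     # user identity information: account number
    cookie: Optional[str]      # user identity information: cookie
    uuid: Optional[str]        # user identity information: uuid
    sessionid: Optional[str]
    requestbody: str
    responsbody: str           # spelled as in the source text
    method: str
    status: int


def parse_log_line(line: str) -> AccessLogRecord:
    """Parse one JSON-encoded log line (assumed format) into a record."""
    raw = json.loads(line)
    return AccessLogRecord(
        time=raw["time"],
        ip=raw["ip"],
        account=raw.get("account"),
        cookie=raw.get("cookie"),
        uuid=raw.get("uuid"),
        sessionid=raw.get("sessionid"),
        requestbody=raw.get("requestbody", ""),
        responsbody=raw.get("responsbody", ""),
        method=raw["method"],
        status=int(raw["status"]),
    )
```

Identity fields are optional in this sketch because anonymous requests may carry no account, cookie or uuid.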

S2: identifying crawlers according to the web access log. In this embodiment, crawler identification uses the prior art; this identification process does not involve identifying sensitive information, and any mature technique capable of identifying crawlers may be used. Specifically, an anomaly detection method based on user behavior sequences or a rule engine method is used to identify crawlers, for example the scheme disclosed in the patent document cited in the Background.
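A threshold rule of the kind described in the Background (accumulate events per attribute and flag whatever exceeds a threshold) could look like the sketch below. It is not the specific rule engine of any cited scheme; the function name `flag_crawler_ips`, the fixed-window bucketing and the default thresholds are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def flag_crawler_ips(events: Iterable[Tuple[str, int]],
                     window_seconds: int = 60,
                     max_requests: int = 100) -> Set[str]:
    """Flag IPs whose request count in any fixed time window exceeds a threshold.

    events: (ip, unix_timestamp) pairs taken from the web access log.
    """
    counts: Counter = Counter()
    for ip, ts in events:
        # Bucket each request into a fixed window of `window_seconds`.
        counts[(ip, ts // window_seconds)] += 1
    return {ip for (ip, _window), n in counts.items() if n > max_requests}
```

The same counting pattern applies to other attributes from the log, such as sessionid or account, by changing the key of the counter.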

S3: judging the type of the crawler. The crawler types include switching pages by modifying parameters in the url, and switching pages on the same url by modifying the parameters of the POST request content. For example, take the address http://www.xxx.com.cn/service/api/getMorereInfo.action?project_ID=ab922d56d=b7fb6e72ddcdb4&startID=4c8dsd4147148. Different pages are reached by changing the values of the url parameters project_ID and startID, thereby changing the address accessed; by repeatedly trying new values for these parameters, the crawler obtains the information on different pages, which may contain sensitive information. As another example, for the address http://www.xxx.com.cn/login/, the crawler modifies the POST parameter {'account_name': '123456789'} in the request and collects the results returned for different account_name values. For instance, a user's account number is a mobile phone number, but the login password may differ between software products; by modifying the POST parameter (in this specific example, the password) and trying repeatedly, the crawler obtains the user's information on different software.
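The two crawler types in S3 can be told apart from the requests a crawler has already made. The sketch below is one possible heuristic, not the patent's stated procedure: the labels `URL_PARAM`, `POST_PARAM` and `UNKNOWN`, the function name `classify_crawler` and the decision rule (several distinct POST bodies on one url versus several distinct query strings) are assumptions.

```python
from typing import Iterable, Tuple
from urllib.parse import urlsplit

URL_PARAM = "url_param"    # pages switched by changing parameters in the url
POST_PARAM = "post_param"  # same url, pages switched by changing the POST content
UNKNOWN = "unknown"


def classify_crawler(requests: Iterable[Tuple[str, str, str]]) -> str:
    """Classify a crawler from its (method, url, body) requests."""
    queries, post_bodies, post_urls = set(), set(), set()
    for method, url, body in requests:
        parts = urlsplit(url)
        if parts.query:
            queries.add(parts.query)
        if method.upper() == "POST":
            post_bodies.add(body)
            # Compare POST targets without their query string or fragment.
            post_urls.add(parts._replace(query="", fragment="").geturl())
    if len(post_urls) == 1 and len(post_bodies) > 1:
        return POST_PARAM
    if len(queries) > 1:
        return URL_PARAM
    return UNKNOWN
```

For the two addresses used as examples above, requests that vary `startID` would be labelled `url_param`, and repeated POSTs to `/login/` with different bodies would be labelled `post_param`.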

S4: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name; the specific process is as follows:

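step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, thus simulating a crawler Request;

step 402: performing page analysis on the website accessed by the crawler and acquiring the information returned by the web page, whose types include HTML (hypertext markup language), JSON strings, binary data (such as pictures and videos) and the like, so as to obtain the content of the request response;

step 403: collecting the content of the request responses by request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.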
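The grouping rule for the collection domain name in S4 can be sketched as follows: for crawlers that switch pages through url parameters the non-parameter part of the address is kept, and for crawlers that switch pages through POST content the domain name is used directly. The function names `collection_key` and `group_texts` and the type labels `"url_param"` and `"post_param"` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def collection_key(url: str, crawler_type: str) -> str:
    """Return the collection domain name used to group responses."""
    parts = urlsplit(url)
    if crawler_type == "url_param":
        # Keep everything except the parameters (query string and fragment).
        return parts._replace(query="", fragment="").geturl()
    # POST-parameter crawlers: use the domain name of the address directly.
    return parts.netloc


def group_texts(responses: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group (url, crawler_type, text) triples by collection domain name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, crawler_type, text in responses:
        groups[collection_key(url, crawler_type)].append(text)
    return dict(groups)
```

Grouping this way puts every page reached through one parameterised interface into a single bucket, so later keyword extraction runs per interface rather than per individual request.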

S5: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;

The weight values produced by traditional TF-IDF are generally very small, close to 0, and its accuracy is limited. In essence, IDF is a weighting that tries to suppress noise: it simply assumes that words with a low document frequency are more important and words with a high document frequency are less useful. This does not hold for most text. The simple structure of IDF cannot make the extracted keywords reflect the importance of words and the distribution of feature words well, so IDF cannot properly adjust the weights. The drawback is especially severe in a corpus of similar texts, where the keywords shared by similar texts are often masked. For example, if the corpus D contains many education articles and the text j is itself an education article, the IDF values of education-related words are small, so the recall of keyword extraction is low and the extracted keywords are inaccurate. On this basis, the invention provides a weighting algorithm based on inverse word frequency, namely

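by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

the word frequency is calculated, words in the stored texts whose word frequency exceeds a threshold value are extracted as feature data, and the important link addresses and text keyword results of the texts under each domain name are extracted according to the word frequency; wherein $n_{i,j}$ denotes the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ denotes the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency with which the word $t_i$ occurs in the corpus.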

This weighting method reduces the influence of same-type texts in the corpus on word weights and expresses more accurately the importance of a word in the document under examination. The result of the formula also resolves the problem of excessively small final weights; in practical application, six significant digits are retained, which makes the result more precise.
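The inverse-word-frequency weighting can be sketched in Python under the assumption that the weight of word $t_i$ in text $j$ is the term frequency $n_{i,j} / \sum_k n_{k,j}$ multiplied by the logarithm of $\sum_k nt_k / nt_i$, which is how the symbol definitions in this section read. The natural logarithm, the pre-tokenised input and the function names `tf_iwf` and `keywords` are assumptions; results are kept to six significant digits as the text describes.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_iwf(corpus: List[List[str]], j: int) -> Dict[str, float]:
    """Weight of every word in text j of a tokenised corpus."""
    doc_counts = Counter(corpus[j])                             # n_{i,j}
    doc_total = sum(doc_counts.values())                        # sum_k n_{k,j}
    corpus_counts = Counter(w for doc in corpus for w in doc)   # nt_i
    corpus_total = sum(corpus_counts.values())                  # sum_k nt_k
    weights = {}
    for word, n in doc_counts.items():
        value = (n / doc_total) * math.log(corpus_total / corpus_counts[word])
        weights[word] = float(f"{value:.6g}")  # keep six significant digits
    return weights


def keywords(corpus: List[List[str]], j: int, threshold: float) -> List[str]:
    """Words of text j whose weight exceeds the threshold, sorted by name."""
    return sorted(w for w, v in tf_iwf(corpus, j).items() if v > threshold)
```

Because the logarithm is taken over word counts rather than document counts, a word shared by many similar texts is penalised only in proportion to its share of all words in the corpus.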

S6: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result. The sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
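The source does not say which sensitive data discovery technology is used in S6. As one simple possibility, pattern matching can detect the more regular types (mobile phone number, identification number, license plate number); names and addresses generally need dictionaries or models and are left out here. The patterns and the names `PATTERNS` and `find_sensitive` below are illustrative assumptions, not validated detectors.

```python
import re
from typing import Dict, List

# Illustrative patterns only; production detectors would add validation
# such as checksum verification for identification numbers.
PATTERNS: Dict[str, re.Pattern] = {
    # 11 digits starting with 1 and a second digit 3-9, not inside a longer number
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # 18 characters: 17 digits followed by a digit or X
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # one Chinese character, one letter, then 5 or 6 letters or digits
    "license_plate": re.compile(r"[\u4e00-\u9fff][A-Z][A-Z0-9]{5,6}"),
}


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return each sensitive data type found in the text with its matches."""
    hits: Dict[str, List[str]] = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

An empty result means no pattern matched; the dictionary keys double as the sensitive data types that the step outputs.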

As a further improvement of the present invention, the sensitive data interface crawler identification method further includes S7:

counting, for each crawler of a sensitive data interface identified in S6, the number of requests to the collection url, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistics. The crawler risk level can be determined from one or more indicators chosen according to actual needs. For example, three indicators may be selected: the number of requests to the collection url, the access rate and the number of 200 responses; a crawler whose request count exceeds a first preset value, whose access rate exceeds a second preset value and whose number of 200 responses exceeds a third preset value is classified as high risk.
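The three-indicator example in S7 can be written as a small rule. The preset values below are placeholders chosen only for illustration, since the source leaves them to actual needs; the names `CrawlerStats` and `risk_level` and the label `"other"` for every non-high case are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class CrawlerStats:
    request_count: int      # number of requests to the collection url
    access_rate: float      # requests per second (unit is an assumption)
    status_200_count: int   # number of responses with status 200


def risk_level(stats: CrawlerStats,
               first_preset: int = 1000,
               second_preset: float = 5.0,
               third_preset: int = 500) -> str:
    """Return "high" only when all three indicators exceed their preset values."""
    is_high = (stats.request_count > first_preset
               and stats.access_rate > second_preset
               and stats.status_200_count > third_preset)
    return "high" if is_high else "other"
```

Requiring all three conditions mirrors the example in the text; a deployment could instead score each indicator separately if finer risk levels are needed.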

Through the above technical scheme, the invention initiates requests to a website using the crawler's parameters according to the crawler type, obtains the content of the request responses, collects the response content by request url, stores the text parts of the content returned by the website in groups by collection domain name, uses a sensitive data discovery technology to identify whether the text keyword results contain sensitive information, and outputs whether sensitive data is involved and the type of sensitive data, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.

Example 2

Based on embodiment 1, embodiment 2 of the present invention further provides a sensitive data interface crawler recognition apparatus, where the apparatus includes:

the log acquisition module is used for acquiring a web access log of a website;

the judging module is used for judging the type of the crawler;

Specifically, the web access log includes a requested time, an IP address, user identity information, sessionid, requestbody, responsbody, method, and status, and the user identity information includes an account, a cookie, and a uuid.

Specifically, the crawler identification module identifies the crawler by using an anomaly detection method or a rule engine method based on a user behavior sequence.

Specifically, the crawler types in the judging module include: switching pages by modifying parameters in the url, and switching pages on the same url by modifying the parameters of the POST request content.

More specifically, the crawler request simulation module includes:

Specifically, the feature extraction module is further configured to:

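calculate the word frequency by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

extract words in the stored texts whose word frequency exceeds a threshold value as feature data, and extract the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ denotes the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ denotes the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency with which the word $t_i$ occurs in the corpus.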

Specifically, the sensitive information includes a mobile phone number, a name, an address, a license plate number and an identification number.

Specifically, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for each crawler of a sensitive data interface identified by the sensitive judgment module, the number of requests to the collection url, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and for outputting the crawler risk level and the attack type according to the statistics.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

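1. A sensitive data interface crawler identification method, characterized by comprising the following steps:

step one: acquiring a web access log of a website;

step two: identifying the crawler according to the web access log;

step three: judging the type of the crawler;

step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;

step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;

step six: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result.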

2. The sensitive data interface crawler identification method of claim 1, wherein the web access log comprises requested time, IP address, user identity information, sessionid, requestbody, responsbody, method, status, and the user identity information comprises account number, cookie, and uuid.

3. The method for identifying the sensitive data interface crawler according to claim 1, wherein in the second step, the crawler is identified by adopting an anomaly detection method or a rule engine method based on a user behavior sequence.

4. The method according to claim 1, wherein the crawler types in step three include: switching pages by modifying parameters in the url, and switching pages on the same url by modifying the parameters of the POST request content.

5. The sensitive data interface crawler identification method according to claim 4, wherein said step four comprises:

6. The method for identifying the sensitive data interface crawler according to claim 1, wherein the step five comprises:

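by the formula

$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$

calculating the word frequency, extracting words in the stored texts whose word frequency exceeds a threshold value as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ denotes the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ denotes the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ denotes the sum of the frequencies of all words in the corpus, and $nt_i$ denotes the total frequency with which the word $t_i$ occurs in the corpus.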

7. The sensitive data interface crawler identification method of claim 1, wherein the sensitive information comprises a cell phone number, a name, an address, a license plate number, an identification number.

8. The sensitive data interface crawler identification method according to claim 1, further comprising the seventh step of:

9. A sensitive data interface crawler recognition apparatus, said apparatus comprising:

the log acquisition module is used for acquiring a web access log of a website;

the judging module is used for judging the type of the crawler;

10. The sensitive data interface crawler identifying apparatus of claim 9, wherein the web access log comprises requested time, IP address, user identity information, sessionid, requestbody, responsbody, method, status, and the user identity information comprises account number, cookie, uuid.