CN113821754A - Sensitive data interface crawler identification method and device - Google Patents


Info

Publication number
CN113821754A
Authority
CN
China
Prior art keywords
crawler
request
website
sensitive data
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111100833.1A
Other languages
Chinese (zh)
Inventor
葛胜利
魏国富
夏玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202111100833.1A
Publication of CN113821754A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/552 Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a sensitive data interface crawler identification method and device. The method comprises the following steps: acquiring a web access log of a website; identifying crawlers according to the web access log; determining the type of each crawler; according to the crawler type, initiating requests to the website using the crawler's parameters, acquiring the content of the request responses, collecting the response content by request url, and storing the text parts of the content returned by the website in groups by collection domain name; extracting feature data from the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name; and identifying whether the text keyword results contain sensitive information, and outputting whether sensitive data is involved and the type of the sensitive data. The invention has the following advantages: crawler motivation is effectively identified, crawler behavior involving sensitive information is identified, and network information security is ensured.

Description

Sensitive data interface crawler identification method and device
Technical Field
The invention relates to the field of crawler identification, in particular to a crawler identification method and device for a sensitive data interface.
Background
In the prior art, data on a network can be acquired by means such as a web crawler, which is a program or script that automatically fetches website information according to certain rules. The prior art mostly aims at intercepting crawlers, but crawlers can bypass interception by changing their programs or simulating the behavior of real users, which is a particular concern when valuable sensitive information is exposed through a website's interfaces.
In the prior art, crawler identification methods fall roughly into two types. The first is an expert rule engine scheme: data is collected from business logs, events on single or multiple attributes are configured and counted, and events exceeding a threshold are intercepted by a threshold rule; alternatively, interception is performed through a blacklist built from attributes such as the IP and the User-Agent. As technology has improved, the black-market industry uses emulators and special software to probe and bypass risk-control rule engines, so that the information security of a website is difficult to ensure continuously, especially when valuable sensitive information exists in the website's interfaces.
The second is an anomaly-detection crawler identification scheme based on user behavior sequences: a user access behavior path is constructed, technical schemes such as a probability model are used to calculate the probability of the behavior path, and an unsupervised learning method is used to output the access paths of abnormal users and the related users. However, this scheme produces a large number of false alarms, which increases and complicates the workload of manual secondary analysis, and it is difficult to maintain the information security of website interfaces that carry sensitive information.
Chinese granted patent publication No. CN108712426B discloses a crawler identification method and system based on user behavior tracking points, wherein the method includes: S1, the client receives the access request initiated by the user and asynchronously sends the access request to the back-end service system; S2, after receiving the access request, the back-end service system synchronizes the access log of the user, wherein the access log comprises the access behavior data of the user; S3, the back-end service system aggregates the access behavior data through the rule engine; S4, the back-end service system judges whether the user is a crawler according to the aggregated access behavior data, and if so, aggregates from the access log the crawler feature data used for identifying the user as a crawler, and then asynchronously pushes the crawler feature data to a crawler list in the client through a message queue; and S5, the client responds to the access request according to the crawler list. According to this crawler identification method, access logs are synchronized and crawlers are identified after the access behavior data in the logs are aggregated, which improves the crawler identification rate and accuracy. However, not all crawlers need to be intercepted, and the scheme only identifies crawlers; it cannot identify crawlers that access sensitive data.
In summary, the prior art mostly aims at intercepting crawlers and cannot identify crawlers that access sensitive data, so network information security is difficult to guarantee.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior art lacks a method for crawler identification of an interface with sensitive information.
The invention solves the technical problems through the following technical means: a sensitive data interface crawler identification method, the method comprising the steps of:
step one: acquiring a web access log of a website;
step two: identifying the crawler according to the web access log;
step three: judging the type of the crawler;
step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;
step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
step six: identifying whether sensitive information exists in the text keyword results by using a sensitive data discovery technology, and outputting the corresponding result.
According to the crawler type, the method initiates requests to the website using the crawler's parameters, acquires the content of the request responses, collects the response content by request url, stores the text parts of the content returned by the website in groups by collection domain name, identifies whether sensitive information exists in the text keyword results by using a sensitive data discovery technology, and outputs whether sensitive data is involved and the type of the sensitive data, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and ensuring network information security.
Further, the web access log includes time of request, IP address, user identity information, sessionid, requestbody, responsbody, method, status, and the user identity information includes account, cookie, uuid.
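As an illustration only, the log fields listed above could be held in a small record type. The class name `AccessLogEntry`, the helper `parse_entry`, the added `url` field and the input dictionary layout are assumptions made for this sketch, not part of the described method.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AccessLogEntry:
    """One web access log record (hypothetical layout)."""
    time: datetime
    ip: str
    method: str
    status: int
    url: str  # assumed field: later steps group responses by request url
    sessionid: Optional[str] = None
    requestbody: Optional[str] = None
    responsbody: Optional[str] = None
    # user identity information
    account: Optional[str] = None
    cookie: Optional[str] = None
    uuid: Optional[str] = None


def parse_entry(raw: dict) -> AccessLogEntry:
    """Build an entry from a dict; missing optional fields become None."""
    return AccessLogEntry(
        time=datetime.fromisoformat(raw["time"]),
        ip=raw["ip"],
        method=raw["method"].upper(),
        status=int(raw["status"]),
        url=raw["url"],
        sessionid=raw.get("sessionid"),
        requestbody=raw.get("requestbody"),
        responsbody=raw.get("responsbody"),
        account=raw.get("account"),
        cookie=raw.get("cookie"),
        uuid=raw.get("uuid"),
    )
```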
Further, in the second step, an anomaly detection method or a rule engine method based on the user behavior sequence is adopted to identify the crawler.
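A minimal sketch of the rule engine variant is given below, assuming a single threshold rule on the request count per IP in fixed time windows. The function name `flag_crawler_ips`, the window length and the threshold are hypothetical; the method itself leaves the concrete rules open.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def flag_crawler_ips(requests: Iterable[Tuple[str, int]],
                     window_seconds: int = 60,
                     max_requests: int = 100) -> Set[str]:
    """Flag IPs that exceed max_requests within any fixed time window.

    requests: (ip, unix_timestamp) pairs taken from the web access log.
    """
    # Count requests per (ip, window index).
    counts = Counter((ip, ts // window_seconds) for ip, ts in requests)
    return {ip for (ip, _window), n in counts.items() if n > max_requests}
```

A real deployment would likely combine several such rules (per account, per session, per User-Agent) or use the behavior-sequence anomaly detection instead.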
Further, the crawler types in step three include a type that switches pages by modifying parameters in the url and a type that switches pages on the same url by modifying the parameters of the POST request content.
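The two crawler types can be told apart from the requests of one identified crawler. The sketch below is one possible heuristic, not the patented procedure: the labels `url_param`, `post_param` and `unknown` and the function `classify_crawler` are names invented for illustration.

```python
from typing import Iterable, Tuple
from urllib.parse import parse_qsl, urlsplit

URL_PARAM = "url_param"    # pages switched via url parameters
POST_PARAM = "post_param"  # pages switched via POST parameters on one url
UNKNOWN = "unknown"


def classify_crawler(requests: Iterable[Tuple[str, str, str]]) -> str:
    """requests: (method, url, body) triples of one identified crawler."""
    queries = set()
    post_bodies = {}
    for method, url, body in requests:
        parts = urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        if parts.query:
            queries.add((base, tuple(sorted(parse_qsl(parts.query)))))
        if method.upper() == "POST":
            post_bodies.setdefault(base, set()).add(body)
    # Same url with several distinct POST bodies: POST-parameter switching.
    if any(len(bodies) > 1 for bodies in post_bodies.values()):
        return POST_PARAM
    # Same path with several distinct query strings: url-parameter switching.
    bases = [base for base, _query in queries]
    if any(bases.count(b) > 1 for b in set(bases)):
        return URL_PARAM
    return UNKNOWN
```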
Further, the fourth step includes:
step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, and thus simulating a crawler Request;
step 402: performing page analysis on a website accessed by a crawler, acquiring information returned by a website page, and acquiring content of a request response;
step 403: collecting the content of the request responses according to the request url: if the crawler switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the collection domain name; if the crawler switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.
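Steps 401 and 403 can be sketched as follows, under stated assumptions: the request is only prepared with the standard library `urllib.request.Request` and not sent, the crawler type labels `url_param` and `post_param` are hypothetical, and the helper names `build_request`, `collection_key` and `group_texts` are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit
from urllib.request import Request


def build_request(url: str, headers: Dict[str, str],
                  body: Optional[bytes] = None) -> Request:
    """Step 401 (sketch): prepare a request carrying the crawler's headers."""
    method = "POST" if body is not None else "GET"
    return Request(url, data=body, headers=headers, method=method)


def collection_key(url: str, crawler_type: str) -> str:
    """Step 403 (sketch): derive the collection domain name for a url."""
    parts = urlsplit(url)
    if crawler_type == "url_param":
        # Keep the non-parameter part of the crawler address.
        return f"{parts.scheme}://{parts.netloc}{parts.path}"
    # POST-parameter crawlers: use the domain name directly.
    return parts.netloc


def group_texts(responses: Iterable[Tuple[str, str]],
                crawler_type: str) -> Dict[str, List[str]]:
    """Group (url, text) response pairs by collection domain name."""
    groups = defaultdict(list)
    for url, text in responses:
        groups[collection_key(url, crawler_type)].append(text)
    return dict(groups)
```

Sending the prepared request (for example with `urllib.request.urlopen`) and extracting the text part of the response are left out of the sketch.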
Further, the fifth step includes:
by the formula
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$
calculating the weighted word frequency $w_{i,j}$, extracting words whose weighted word frequency exceeds a threshold in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the weighted word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ is the sum of the frequencies of all words in the corpus, and $nt_{i}$ is the total frequency of the word $t_i$ in the corpus.
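The weighting can be sketched in code under one assumption: that the formula has the usual inverse-word-frequency form, that is, the term frequency in text j multiplied by the logarithm of the total corpus word count divided by the corpus count of the word, which matches the four symbol definitions. The function names `tf_iwf` and `keywords` and the tokenized-list input format are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_iwf(doc: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight each word of doc: (n_ij / sum_k n_kj) * log(sum_k nt_k / nt_i)."""
    corpus_counts = Counter(w for text in corpus for w in text)
    total = sum(corpus_counts.values())   # sum of all word frequencies in the corpus
    doc_counts = Counter(doc)
    doc_len = len(doc)                    # sum of all word frequencies in text j
    weights = {}
    for word, n_ij in doc_counts.items():
        nt_i = corpus_counts[word]
        if nt_i == 0:
            continue  # assumption: words absent from the corpus are skipped
        weights[word] = (n_ij / doc_len) * math.log(total / nt_i)
    return weights


def keywords(doc: List[str], corpus: List[List[str]], threshold: float) -> List[str]:
    """Return the words whose weight exceeds the threshold, sorted by name."""
    return sorted(w for w, v in tf_iwf(doc, corpus).items() if v > threshold)
```

The threshold value is application specific; the source does not state one.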
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
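The source does not specify the sensitive data discovery technology. As one hedged possibility, pattern matching can cover the numeric categories; the two regular expressions below are simplified assumptions (11-digit mainland China mobile numbers and 18-character resident identification numbers), and names, addresses and license plates would need dictionaries or other techniques not shown here.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns; real detectors would validate further.
PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return the sensitive data types found in text with their matches."""
    hits = {}
    for kind, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[kind] = found
    return hits
```

An empty result means no sensitive data of the covered types was found; a non-empty result gives both the yes/no answer and the sensitive data type.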
Further, the sensitive data interface crawler identification method further comprises step seven:
counting, for each crawler identified in step six as accessing a sensitive data interface, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses with status 200, the number of access Referers, the access Method types and the sensitive data types per url, and outputting a crawler risk level and an attack type according to the statistics.
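The statistics of step seven amount to simple aggregations over the requests of one crawler. The sketch below assumes a hypothetical record type `Hit` and a function `crawler_stats`; the definition of the access rate as requests per second over the observed time span is also an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One request of a crawler (hypothetical record layout)."""
    ip: str
    url: str
    user_agent: str
    referer: str
    method: str
    status: int
    timestamp: float


def crawler_stats(hits: Iterable[Hit]) -> Dict[str, object]:
    """Aggregate the per-crawler indexes listed in step seven."""
    hits = list(hits)
    if not hits:
        return {}
    span = max(h.timestamp for h in hits) - min(h.timestamp for h in hits)
    urls_per_ip = {}
    for h in hits:
        urls_per_ip.setdefault(h.ip, set()).add(h.url)
    return {
        "request_count": len(hits),
        # Requests per second; falls back to the raw count for a zero span.
        "access_rate": len(hits) / span if span > 0 else float(len(hits)),
        "ip_count": len(urls_per_ip),
        "urls_per_ip": {ip: len(urls) for ip, urls in urls_per_ip.items()},
        "user_agent_count": len({h.user_agent for h in hits}),
        "status_200_count": sum(1 for h in hits if h.status == 200),
        "referer_count": len({h.referer for h in hits}),
        "methods": Counter(h.method for h in hits),
    }
```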
The invention also provides a sensitive data interface crawler recognition device, which comprises:
the log acquisition module is used for acquiring a web access log of a website;
the crawler identification module is used for identifying the crawler according to the web access log;
the judging module is used for judging the type of the crawler;
the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;
the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;
and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.
Further, the web access log includes time of request, IP address, user identity information, sessionid, requestbody, responsbody, method, status, and the user identity information includes account, cookie, uuid.
Furthermore, the crawler identification module identifies the crawler by adopting an anomaly detection method or a rule engine method based on a user behavior sequence.
Furthermore, the crawler types in the judging module include a type that switches pages by modifying parameters in the url and a type that switches pages on the same url by modifying the parameters of the POST request content.
Still further, the crawler request simulation module comprises:
the Request simulation unit is used for initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request contains additional headers information, so that crawler Request simulation is performed;
the request response unit is used for carrying out page analysis on the website accessed by the crawler, acquiring information returned by the website page and obtaining the content of the request response;
the grouping storage unit is used for collecting the content of the request responses according to the request url: if the crawler switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the collection domain name; if the crawler switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.
Further, the feature extraction module is further configured to:
by the formula
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$
calculating the weighted word frequency $w_{i,j}$, extracting words whose weighted word frequency exceeds a threshold in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the weighted word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ is the sum of the frequencies of all words in the corpus, and $nt_{i}$ is the total frequency of the word $t_i$ in the corpus.
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
Furthermore, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for each crawler identified by the sensitive judgment module as accessing a sensitive data interface, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses with status 200, the number of access Referers, the access Method types and the sensitive data types per url, and outputting the crawler risk level and the attack type according to the statistics.
The invention has the following advantages: according to the crawler type, the method initiates requests to the website using the crawler's parameters, acquires the content of the request responses, collects the response content by request url, stores the text parts of the content returned by the website in groups by collection domain name, identifies whether sensitive information exists in the text keyword results by using a sensitive data discovery technology, and outputs whether sensitive data is involved and the type of the sensitive data, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and ensuring network information security.
Drawings
Fig. 1 is a flowchart of a sensitive data interface crawler identification method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A sensitive data interface crawler identification method, the method comprising the steps of:
S1: acquiring a web access log of a website; the web access log comprises the time of the request, the IP address, the user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises an account number, a cookie and a uuid.
S2: identifying the crawler according to the web access log; in this embodiment, the prior art is used for crawler identification. This identification process does not involve the identification of sensitive information, and any mature technology capable of crawler identification may be used; specifically, an anomaly detection method based on user behavior sequences or a rule engine method is used to identify the crawler, for example the scheme disclosed in the patent document cited in the Background.
S3: determining the type of the crawler; the crawler types include a type that switches pages by modifying parameters in the url and a type that switches pages on the same url by modifying the parameters of the POST request content. For example, for the address http://www.xxx.com.cn/service/api/getMorereInfo.action?project_ID=ab922d56d=b7fb6e72ddcdb4&startID=4c8dsd4147148, different pages are accessed by changing the values of the url parameters project_ID and startID, so the crawler can obtain the information of different pages, which may contain sensitive information, by repeatedly trying different values of these parameters. For the address http://www.xxx.com.cn/login/, the crawler modifies the POST parameter {'account_name': '123456789'} in the request and collects the results returned for different account_name values. For example, a user's account number may be a mobile phone number while the login passwords for different software differ; by modifying the POST parameter (in this specific example, the post parameter is a password) and trying repeatedly, the crawler obtains the user's information on different software.
S4: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name; the specific process is as follows:
step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, and thus simulating a crawler Request;
step 402: performing page analysis on the website pages accessed by the crawler and acquiring the information returned by the pages, whose types include HTML, JSON strings, binary data (such as pictures and videos) and the like, so as to obtain the content of the request response;
step 403: collecting the content of the request responses according to the request url: if the crawler switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the collection domain name; if the crawler switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.
S5: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
The weights computed by traditional TF-IDF are generally very small, close to 0, and the accuracy is not very high. In essence, IDF is a weight that attempts to suppress noise: it simply assumes that words with a low text frequency are more important and words with a high text frequency are less useful. This does not hold for most text. Because of its simple structure, IDF cannot make the extracted keywords adequately reflect the importance of words and the distribution of feature words, so it cannot adjust the weights well. The drawback is especially pronounced in a corpus of similar texts, where the keywords of texts of the same kind are often masked. For example, if the corpus D contains many education articles and the text j is itself an education article, the IDF values of education-related words are small, so the recall of text keyword extraction is low and the extraction result is inaccurate. On this basis, the invention provides a weighting algorithm calculated by inverse word frequency, namely
By the formula
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} nt_{k}}{nt_{i}}$$
calculating the weighted word frequency $w_{i,j}$, extracting words whose weighted word frequency exceeds a threshold in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the weighted word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in the text $j$, $\sum_{k} nt_{k}$ is the sum of the frequencies of all words in the corpus, and $nt_{i}$ is the total frequency of the word $t_i$ in the corpus.
This weighting method reduces the influence of texts of the same kind in the corpus on the word weights and expresses more accurately the importance of a word in the document under examination. The result of the formula also avoids the problem of the final weight being too small; in practical application, 6 significant digits are retained so that the calculation result is more accurate.
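The education example can be checked numerically. The sketch below compares the classic IDF with an inverse word frequency factor, assuming the latter is the logarithm of the total corpus word count divided by the corpus count of the word; the three-text toy corpus and the function names `idf` and `iwf` are made up for illustration.

```python
import math
from collections import Counter
from typing import List


def idf(word: str, corpus: List[List[str]]) -> float:
    """Classic IDF: log(number of texts / number of texts containing word)."""
    df = sum(1 for text in corpus if word in text)
    return math.log(len(corpus) / df)


def iwf(word: str, corpus: List[List[str]]) -> float:
    """Inverse word frequency: log(total word count / count of this word)."""
    counts = Counter(w for text in corpus for w in text)
    return math.log(sum(counts.values()) / counts[word])


# A corpus made up entirely of education articles.
corpus = [
    ["education", "school", "teacher"],
    ["education", "exam", "student"],
    ["education", "policy", "reform"],
]
print(idf("education", corpus))  # 0.0: the word occurs in every text
print(iwf("education", corpus))  # log(3), about 1.0986
```

Because "education" occurs in every text, its IDF is zero and the word can never be selected as a keyword, whereas the inverse word frequency factor stays positive. Both functions assume the word occurs at least once in the corpus.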
S6: identifying whether sensitive information exists in the text keyword results by using a sensitive data discovery technology, and outputting the corresponding result. The sensitive information includes mobile phone numbers, names, addresses, license plate numbers and identification numbers.
As a further improvement of the present invention, the sensitive data interface crawler identification method further includes S7:
counting, for each crawler identified in S6 as accessing a sensitive data interface, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of responses with status 200, the number of access Referers, the access Method types and the sensitive data types per url, and outputting a crawler risk level and an attack type according to the statistics. The crawler risk level can be determined from one or more indexes selected according to actual needs; for example, the three indexes of the number of url collection requests, the access rate and the number of 200 responses can be selected, and a crawler whose number of url collection requests exceeds a first preset value, whose access rate exceeds a second preset value and whose number of 200 responses exceeds a third preset value is classified as high risk.
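The three-index example in S7 maps onto a small rule. In this sketch the dictionary keys, the function name `risk_level` and the label `other` for crawlers that do not meet all three conditions are assumptions; the source only defines the high-risk case and leaves the preset values to the implementer.

```python
from typing import Mapping


def risk_level(stats: Mapping[str, float],
               first_preset: float,
               second_preset: float,
               third_preset: float) -> str:
    """Classify a crawler as high risk when all three indexes exceed their presets."""
    exceeded = (
        stats["request_count"] > first_preset        # number of url collection requests
        and stats["access_rate"] > second_preset     # access rate
        and stats["status_200_count"] > third_preset # number of 200 responses
    )
    return "high" if exceeded else "other"
```

For instance, with presets of 1000 requests, 5 requests per second and 800 successful responses, a crawler with 1500 requests at 8 per second and 1200 responses with status 200 would be classified as high risk.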
Through the above technical scheme, according to the crawler type, requests are initiated to the website using the crawler's parameters, the content of the request responses is acquired and collected by request url, the text parts of the content returned by the website are stored in groups by collection domain name, the sensitive data discovery technology is used to identify whether sensitive information exists in the text keyword results, and whether sensitive data is involved and the type of the sensitive data are output, thereby effectively identifying crawler motivation, identifying crawler behavior involving sensitive information, and ensuring network information security.
Example 2
Based on embodiment 1, embodiment 2 of the present invention further provides a sensitive data interface crawler recognition apparatus, where the apparatus includes:
the log acquisition module is used for acquiring a web access log of a website;
the crawler identification module is used for identifying the crawler according to the web access log;
the judging module is used for judging the type of the crawler;
the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;
the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;
and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.
Specifically, the web access log includes a requested time, an IP address, user identity information, sessionid, requestbody, responsbody, method, and status, and the user identity information includes an account, a cookie, and a uuid.
Specifically, the crawler identification module identifies the crawler by using an anomaly detection method or a rule engine method based on a user behavior sequence.
Specifically, the crawler types in the judging module include a type that switches pages by modifying parameters in the url and a type that switches pages on the same url by modifying the parameters of the POST request content.
More specifically, the crawler request simulation module includes:
the Request simulation unit is used for initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request contains additional headers information, so that crawler Request simulation is performed;
the request response unit is used for carrying out page analysis on the website accessed by the crawler, acquiring information returned by the website page and obtaining the content of the request response;
the grouping storage unit is used for collecting the content of the request responses according to the request url: if the crawler switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained and used as the collection domain name; if the crawler switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups by collection domain name.
Specifically, the feature extraction module is further configured to:
by the formula
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{l} nt_{l}}{nt_{i}}$$
calculating the word frequency, extracting the words whose word frequency exceeds a threshold in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in the text $j$, $\sum_{l} nt_{l}$ is the sum of the frequencies of all words in the corpus, and $nt_{i}$ is the total frequency with which the word $t_i$ occurs in the corpus.
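The word-frequency weighting can be sketched in Python. The sketch follows one reading of the quantities defined above: the relative frequency of a word within a text, multiplied by the logarithm of the ratio between the total frequency of all words in the corpus and the corpus frequency of that word. The function names `word_weights` and `keywords`, the use of the natural logarithm, and the pre-tokenized input are assumptions for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def word_weights(texts: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: weight} mapping per tokenized text."""
    counts = [Counter(tokens) for tokens in texts]
    corpus: Counter = Counter()
    for c in counts:
        corpus.update(c)
    corpus_total = sum(corpus.values())  # frequency of all words in the corpus

    weights = []
    for c in counts:
        text_total = sum(c.values())  # frequency of all words in this text
        weights.append({
            word: (n / text_total) * math.log(corpus_total / corpus[word])
            for word, n in c.items()
        })
    return weights


def keywords(texts: List[List[str]], threshold: float) -> List[List[str]]:
    """Keep, per text, the words whose weight exceeds the threshold."""
    return [
        sorted(word for word, score in ws.items() if score > threshold)
        for ws in word_weights(texts)
    ]
```

Words that are frequent in one text but rare across the corpus receive the highest weights, which is why they are useful as feature data for each grouped domain name.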
Specifically, the sensitive information includes a mobile phone number, a name, an address, a license plate number and an identification number.
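As an illustration of sensitive data discovery for two of these categories, the Python sketch below uses regular expressions. The patterns are assumptions, not part of the disclosure: the mobile phone pattern assumes 11-digit mainland China numbers, and the identification number pattern assumes the 18-digit resident identity number with its standard check character. Names, addresses and license plate numbers would need dictionaries or further patterns and are not covered.

```python
import re
from typing import Dict, List

# Assumed formats; see the lead-in paragraph.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
ID_RE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def valid_id_number(candidate: str) -> bool:
    """Verify the check character of an 18-character identification number."""
    total = sum(int(d) * w for d, w in zip(candidate[:17], ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == candidate[17].upper()


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return the sensitive values found in the text, keyed by category."""
    found: Dict[str, List[str]] = {}
    phones = PHONE_RE.findall(text)
    ids = [m for m in ID_RE.findall(text) if valid_id_number(m)]
    if phones:
        found["mobile_phone_number"] = phones
    if ids:
        found["identification_number"] = ids
    return found
```

Validating the check character reduces false positives from arbitrary 18-digit strings such as order numbers.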
Specifically, the sensitive data interface crawler identification device further comprises a statistics module, configured to count, for each crawler with a sensitive data interface identified by the sensitive judgment module, the number of requests per collected url, the access rate, the number of requesting IP addresses, the number of urls accessed per IP, the number of request User-Agents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and to output the crawler risk level and the attack type according to the statistical result.
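A minimal Python sketch of such statistics is given below. The record layout (dicts with `ip`, `time` in seconds, `user_agent`, `status`, `referer` and `method` keys), the function names, and the scoring thresholds in `risk_level` are all hypothetical; the disclosure does not state how the risk level is derived from the counts.

```python
from typing import Dict, List, Set


def summarize(records: List[dict]) -> Dict[str, object]:
    """Aggregate the log records of one url group (assumed non-empty)."""
    times = [r["time"] for r in records]
    duration = max(times) - min(times)
    return {
        "request_count": len(records),
        # Requests per second; spans shorter than one second count as one second.
        "access_rate": len(records) / max(duration, 1.0),
        "ip_count": len({r["ip"] for r in records}),
        "user_agent_count": len({r["user_agent"] for r in records}),
        "status_200_count": sum(1 for r in records if r["status"] == 200),
        "referer_count": len({r["referer"] for r in records}),
        "methods": sorted({r["method"] for r in records}),
    }


def risk_level(stats: Dict[str, object], sensitive_types: Set[str]) -> str:
    """Map the statistics to a risk level using hypothetical thresholds."""
    score = 0
    if stats["request_count"] >= 1000:
        score += 1
    if stats["access_rate"] >= 10:
        score += 1
    if len(sensitive_types) >= 2:
        score += 1
    return ("low", "medium", "high", "high")[score]
```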
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A sensitive data interface crawler identification method is characterized by comprising the following steps:
step one: acquiring a web access log of a website;
step two: identifying the crawler according to the web access log;
step three: judging the type of the crawler;
step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;
step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
step six: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result.
2. The sensitive data interface crawler identification method of claim 1, wherein the web access log comprises the request time, IP address, user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises an account number, a cookie and a uuid.
3. The sensitive data interface crawler identification method of claim 1, wherein in step two the crawler is identified by using an anomaly detection method or a rule engine method based on a user behavior sequence.
4. The sensitive data interface crawler identification method of claim 1, wherein the crawler types in step three comprise crawlers that switch pages by modifying parameters in the url and crawlers that switch pages by requesting the same url with different parameters in the POST content.
5. The sensitive data interface crawler identification method according to claim 4, wherein said step four comprises:
step 401: initiating a Request to the website by using the parameters of the crawler according to the crawler type, wherein the Request carries additional headers information, thereby simulating the crawler Request;
step 402: parsing the website pages accessed by the crawler and acquiring the information returned by the pages to obtain the content of the request response;
step 403: collecting the content of the request response according to the request url, wherein if the crawler switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name, and if the crawler switches pages by modifying the parameters requested in the POST content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups according to the collection domain name.
6. The sensitive data interface crawler identification method of claim 1, wherein step five comprises:
by the formula
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{l} nt_{l}}{nt_{i}}$$
calculating the word frequency, extracting the words whose word frequency exceeds a threshold in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of the frequencies of all words in the text $j$, $\sum_{l} nt_{l}$ is the sum of the frequencies of all words in the corpus, and $nt_{i}$ is the total frequency with which the word $t_i$ occurs in the corpus.
7. The sensitive data interface crawler identification method of claim 1, wherein the sensitive information comprises a cell phone number, a name, an address, a license plate number, an identification number.
8. The sensitive data interface crawler identification method according to claim 1, further comprising step seven:
counting, for the crawler with a sensitive data interface identified in step six, the number of requests per collected url, the access rate, the number of requesting IP addresses, the number of urls accessed per IP, the number of request useragents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the counting result.
9. A sensitive data interface crawler identification device, comprising:
a log acquisition module, configured to acquire a web access log of a website;
a crawler identification module, configured to identify the crawler according to the web access log;
a judging module, configured to judge the type of the crawler;
a crawler request simulation module, configured to initiate a request to the website by using the parameters of the crawler according to the crawler type, acquire the content of the request response, collect the content of the request response according to the request url, and store the text parts of the content returned by the website in groups according to the collection domain name;
a feature extraction module, configured to extract feature data from the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
and a sensitive judgment module, configured to identify whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and to output a corresponding result.
10. The sensitive data interface crawler identification device of claim 9, wherein the web access log comprises the request time, IP address, user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises an account number, a cookie and a uuid.
CN202111100833.1A 2021-09-18 2021-09-18 Sensitive data interface crawler identification method and device Pending CN113821754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111100833.1A CN113821754A (en) 2021-09-18 2021-09-18 Sensitive data interface crawler identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111100833.1A CN113821754A (en) 2021-09-18 2021-09-18 Sensitive data interface crawler identification method and device

Publications (1)

Publication Number Publication Date
CN113821754A true CN113821754A (en) 2021-12-21

Family

ID=78922493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111100833.1A Pending CN113821754A (en) 2021-09-18 2021-09-18 Sensitive data interface crawler identification method and device

Country Status (1)

Country Link
CN (1) CN113821754A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150542A (en) * 2023-04-21 2023-05-23 河北网新数字技术股份有限公司 Dynamic page generation method and device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250513A (en) * 2016-08-02 2016-12-21 西南石油大学 Personalized event classification method and system based on event modeling
CN106411578A (en) * 2016-09-12 2017-02-15 国网山东省电力公司电力科学研究院 Website monitoring system and method applicable to power industry
CN106776768A (en) * 2016-11-23 2017-05-31 福建六壬网安股份有限公司 URL crawling method and system for a distributed crawler engine
CN108256104A (en) * 2018-02-05 2018-07-06 恒安嘉新(北京)科技股份公司 Comprehensive Internet website classification method based on multidimensional features
CN108712426A (en) * 2018-05-21 2018-10-26 携程旅游网络技术(上海)有限公司 Crawler identification method and system based on user behavior tracking points
CN109308330A (en) * 2018-07-24 2019-02-05 国家计算机网络与信息安全管理中心 Internet-based method for extracting, analyzing and classifying leaked enterprise information
CN110351248A (en) * 2019-06-14 2019-10-18 北京纵横无双科技有限公司 Security protection method and device based on intelligent analysis and intelligent rate limiting
CN112287198A (en) * 2020-10-28 2021-01-29 上海云信留客信息科技有限公司 Spam short message detection method based on crawler technology

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
吕宝路 等: "面向敏感信息检测的Web综合漏洞扫描器实现", 电脑知识与技术, vol. 16, no. 23, pages 30 - 32 *
李昌兵 等: "融合卡方统计和TF-IWF算法的特征提取和短文本分类方法", 《重庆理工大学学报(自然科学)》, vol. 35, pages 135 - 140 *
王小林 等: "改进的TF-IDF关键词提取方法", 《计算机科学与应用》, vol. 3, no. 1, 28 February 2013 (2013-02-28), pages 64 - 68 *
赵国生 等: "《python网络爬虫技术与实战》", 31 January 2021, 机械工业出版社, pages: 100 - 105 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination