CN113821754A - Sensitive data interface crawler identification method and device - Google Patents
Sensitive data interface crawler identification method and device
- Publication number
- CN113821754A
- Authority
- CN
- China
- Prior art keywords
- crawler
- request
- website
- sensitive data
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/60—Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a sensitive data interface crawler identification method and device. The method comprises the following steps: acquiring a web access log of a website; identifying the crawler according to the web access log; judging the type of the crawler; initiating requests to the website using the crawler's parameters according to the crawler type, acquiring the content of the request responses, collecting the response content according to the request url, and storing the text parts of the content returned by the website in groups according to the collection domain name; extracting feature data from the stored texts, so that important link addresses and text keyword results are extracted from the texts under each domain name; and identifying whether the text keyword results contain sensitive information, and outputting whether sensitive data is involved and the type of the sensitive data. The invention has the following advantages: the crawler's motivation is effectively identified, crawler behavior involving sensitive information is identified, and network information security is safeguarded.
Description
Technical Field
The invention relates to the field of crawler identification, in particular to a crawler identification method and device for a sensitive data interface.
Background
In the prior art, data on a network can be acquired by means such as web crawlers, a web crawler being a program or script that automatically fetches website information according to certain rules. The prior art mostly aims at intercepting crawlers, but crawlers can bypass interception by changing their programs or simulating the behavior of real users, which is a particular problem when the interfaces of a website expose valuable sensitive information.
In the prior art, crawler identification methods fall roughly into two types. The first is the expert rule engine scheme: data is collected from business logs, events on a single attribute or on multiple attributes are configured and counted cumulatively, and events exceeding a threshold are intercepted by a threshold rule; alternatively, interception is performed through a blacklist built from attributes such as the IP and the User-Agent. As technology has advanced, the black-market industry uses simulators and special software to probe and bypass risk-control rule engines, so the information security of a website is difficult to ensure continuously, especially when the interfaces of the website expose valuable sensitive information.
The second type is the anomaly detection scheme based on user behavior sequences: a user access behavior path is constructed, the probability of the behavior path is calculated using a probability model or similar techniques, and an unsupervised learning method outputs the abnormal users and the access paths of the related users. However, this scheme produces a large number of false alarms, which makes manual secondary analysis more laborious and complicated, and it is difficult to maintain the information security of website interfaces that carry sensitive information.
Chinese granted patent publication No. CN108712426B discloses a crawler identification method and system based on user behavior tracking points (buried points), wherein the method includes: S1, the client receives the access request initiated by the user and asynchronously sends the access request to the back-end service system; S2, after receiving the access request, the back-end service system synchronizes the access log of the user, wherein the access log comprises the access behavior data of the user; S3, the back-end service system aggregates the access behavior data through the rule engine; S4, the back-end service system judges whether the user is a crawler according to the aggregated access behavior data, and if so, aggregates from the access log the crawler feature data used to identify the user as a crawler, and then asynchronously pushes the crawler feature data to a crawler list in the client through a message queue; and S5, the client responds to the access request according to the crawler list. In that method, the access logs are synchronized and the crawler is identified after the access behavior data in the logs is aggregated, which improves both the identification rate and the identification accuracy. However, not all crawlers need to be intercepted, and that scheme only identifies crawlers; it cannot identify crawlers that access sensitive data.
In summary, the prior art is mostly directed at intercepting crawlers and cannot identify crawlers that access sensitive data, so network information security is difficult to guarantee.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior art lacks a method for crawler identification of an interface with sensitive information.
The invention solves the technical problems through the following technical means: a sensitive data interface crawler identification method, the method comprising the steps of:
step one: acquiring a web access log of a website;
step two: identifying the crawler according to the web access log;
step three: judging the type of the crawler;
step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;
step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
step six: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result.
The method initiates requests to the website using the crawler's parameters according to the crawler type, acquires the content of the request responses, collects the response content according to the request url, stores the text parts of the content returned by the website in groups according to the collection domain name, identifies whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputs whether sensitive data is involved and the type of the sensitive data, thereby effectively identifying the crawler's motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.
Further, the web access log includes the request time, IP address, user identity information, sessionid, requestbody, responsebody, method and status, and the user identity information includes an account, a cookie and a uuid.
Further, in step two, an anomaly detection method based on the user behavior sequence or a rule engine method is adopted to identify the crawler.
Further, the crawler types in step three include a type that switches pages by modifying parameters in the url, and a type that switches pages on the same url by modifying the parameters of the POST request content.
Further, the fourth step includes:
step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, and thus simulating a crawler Request;
step 402: performing page analysis on a website accessed by a crawler, acquiring information returned by a website page, and acquiring content of a request response;
step 403: collecting the content of the request responses according to the request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups according to the collection domain name.
Further, the fifth step includes:
by the formula
$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{x} nt_{x}}{nt_{i}}$$
calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored texts as feature data, and extracting important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ represents the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ represents the sum of all the word frequencies in the text $j$, $\sum_{x} nt_{x}$ represents the sum of the frequencies of all words in the corpus, and $nt_{i}$ represents the total frequency with which the word $t_i$ occurs in the corpus.
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
Further, the sensitive data interface crawler identification method further comprises step seven:
counting, for the crawler of the sensitive data interface identified in step six, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed by each IP, the number of request User-Agents, the number of 200 responses returned, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistical result.
The invention also provides a sensitive data interface crawler recognition device, which comprises:
the log acquisition module is used for acquiring a web access log of a website;
the crawler identification module is used for identifying the crawler according to the web access log;
the judging module is used for judging the type of the crawler;
the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;
the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;
and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.
Further, the web access log includes the request time, IP address, user identity information, sessionid, requestbody, responsebody, method and status, and the user identity information includes an account, a cookie and a uuid.
Furthermore, the crawler identification module identifies the crawler by adopting an anomaly detection method based on the user behavior sequence or a rule engine method.
Furthermore, the crawler types in the judging module include a type that switches pages by modifying parameters in the url, and a type that switches pages on the same url by modifying the parameters of the POST request content.
Still further, the crawler request simulation module comprises:
the Request simulation unit is used for initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request contains additional headers information, so that crawler Request simulation is performed;
the request response unit is used for carrying out page analysis on the website accessed by the crawler, acquiring information returned by the website page and obtaining the content of the request response;
the grouping storage unit is used for collecting the content of the request responses according to the request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and for storing the text parts returned by the website in groups according to the collection domain name.
Further, the feature extraction module is further configured to:
by the formula
$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{x} nt_{x}}{nt_{i}}$$
calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored texts as feature data, and extracting important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ represents the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ represents the sum of all the word frequencies in the text $j$, $\sum_{x} nt_{x}$ represents the sum of the frequencies of all words in the corpus, and $nt_{i}$ represents the total frequency with which the word $t_i$ occurs in the corpus.
Further, the sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
Furthermore, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for the crawler of the sensitive data interface identified by the sensitive judgment module, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed by each IP, the number of request User-Agents, the number of 200 responses returned, the number of access Referers, the access Method types and the url sensitive data types, and for outputting the crawler risk level and the attack type according to the statistical result.
The invention has the following advantages: the method initiates requests to the website using the crawler's parameters according to the crawler type, acquires the content of the request responses, collects the response content according to the request url, stores the text parts of the content returned by the website in groups according to the collection domain name, identifies whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputs whether sensitive data is involved and the type of the sensitive data, thereby effectively identifying the crawler's motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.
Drawings
Fig. 1 is a flowchart of a sensitive data interface crawler identification method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A sensitive data interface crawler identification method, the method comprising the steps of:
S1: acquiring a web access log of a website; the web access log comprises the request time, IP address, user identity information, sessionid, requestbody, responsebody, method and status, and the user identity information comprises an account, a cookie and a uuid.
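As an illustration of the log fields listed in S1, one record could be modelled as follows. This is a minimal sketch: the JSON-lines input format, the `url` field (not in the list above, but needed by the later steps that collect by request url), and the names `AccessLogRecord` and `parse_log_line` are assumptions made for the example and are not part of the described method.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessLogRecord:
    """One web access log entry carrying the fields named in S1."""
    time: str
    ip: str
    sessionid: str
    method: str
    status: int
    url: str                        # assumed field, see the lead-in
    requestbody: str = ""
    responsebody: str = ""
    account: Optional[str] = None   # user identity information
    cookie: Optional[str] = None
    uuid: Optional[str] = None


def parse_log_line(line: str) -> AccessLogRecord:
    """Parse one JSON-encoded log line (hypothetical format) into a record."""
    raw = json.loads(line)
    return AccessLogRecord(
        time=raw["time"],
        ip=raw["ip"],
        sessionid=raw.get("sessionid", ""),
        method=raw["method"].upper(),
        status=int(raw["status"]),
        url=raw["url"],
        requestbody=raw.get("requestbody", ""),
        responsebody=raw.get("responsebody", ""),
        account=raw.get("account"),
        cookie=raw.get("cookie"),
        uuid=raw.get("uuid"),
    )
```

Keeping the identity fields optional reflects that a crawler request may arrive without an account or cookie.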
S2: identifying the crawler according to the web access log; in this embodiment, existing techniques are used for crawler identification. This identification step does not involve identifying sensitive information, and any mature technique capable of identifying crawlers may be used; specifically, an anomaly detection method based on the user behavior sequence or a rule engine method is used, for example the scheme disclosed in the patent document cited in the background.
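The rule engine option mentioned in S2 can be sketched as a simple threshold rule of the kind described in the background: count requests per IP in fixed time windows and flag the IPs that exceed a limit. The function name `flag_crawler_ips`, the window length and the threshold are hypothetical values chosen for illustration; a real deployment would use whatever mature identification scheme is available.

```python
from collections import Counter


def flag_crawler_ips(events, window_seconds=60, threshold=100):
    """Return the set of IPs whose request count in any window exceeds threshold.

    events: iterable of (ip, unix_timestamp_seconds) pairs taken from the log.
    """
    # Bucket each request by (ip, index of the fixed time window it falls in).
    counts = Counter((ip, int(ts) // window_seconds) for ip, ts in events)
    return {ip for (ip, _window), n in counts.items() if n > threshold}
```

For example, an IP that issues 50 requests within one minute is flagged when `threshold=10`, while an IP with two requests in the same minute is not.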
S3: judging the type of the crawler; the crawler types include a type that switches pages by modifying parameters in the url, and a type that switches pages on the same url by modifying the parameters of the POST request content. For example, consider the address http://www.xxx.com.cn/service/api/getMorereInfo.action?project_ID=ab922d56d=b7fb6e72ddcdb4&startID=4c8dsd4147148. Different pages are reached by changing the values of the url parameters project_ID and startID, so information from different pages can be obtained by continually trying new values of these parameters, and that information may contain sensitive information. For an address such as http://www.xxx.com.cn/login/, the results returned for different account_name values are collected by modifying the POST parameter {'account_name': '123456789'} in the request. For example, a user's account is a mobile phone number, but the login password may differ between pieces of software; by modifying the POST parameter (in this specific example, the password) and trying repeatedly, the user's information on different software is obtained.
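A minimal sketch of the type judgment in S3 is given below. It assumes the requests of one identified crawler are available as (method, url, body) tuples; the labels `"url_param"` and `"post_param"` and the function name `classify_crawler` are hypothetical names introduced for the example.

```python
from urllib.parse import urlsplit


def classify_crawler(requests):
    """Classify one crawler from its logged (method, url, body) requests."""
    queries_by_path = {}   # non-parameter part of url -> set of query strings
    bodies_by_url = {}     # full url -> set of POST bodies
    for method, url, body in requests:
        parts = urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        queries_by_path.setdefault(base, set()).add(parts.query)
        if method.upper() == "POST":
            bodies_by_url.setdefault(url, set()).add(body)
    # Same url, varying POST content: page switching through POST parameters.
    if any(len(bodies) > 1 for bodies in bodies_by_url.values()):
        return "post_param"
    # Same path, varying query string: page switching through url parameters.
    if any(len(queries) > 1 for queries in queries_by_path.values()):
        return "url_param"
    return "unknown"
```

The POST check runs first because a POST crawler normally keeps the url fixed, so its query strings alone would not reveal the switching behavior.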
S4: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name; the specific process is as follows:
step 401: initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request comprises additional headers information, and thus simulating a crawler Request;
step 402: performing page analysis on the website accessed by the crawler and acquiring the information returned by the web page, whose types include HTML, JSON strings, binary data (such as pictures and videos) and the like, so as to obtain the content of the request response;
step 403: collecting the content of the request responses according to the request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and storing the text parts returned by the website in groups according to the collection domain name.
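Steps 401 and 403 can be sketched with the Python standard library as follows. The example only constructs the request object and does not send it; the type labels `"url_param"` and `"post_param"` and the function names are assumptions for illustration, and the base url passed to `build_replay_request` is assumed to carry no query string of its own.

```python
from collections import defaultdict
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def build_replay_request(url, crawler_type, params, headers):
    """Step 401: build a request that reuses the crawler's parameters and headers."""
    if crawler_type == "url_param":
        return Request(f"{url}?{urlencode(params)}", headers=headers, method="GET")
    body = urlencode(params).encode("utf-8")
    return Request(url, data=body, headers=headers, method="POST")


def collection_key(url, crawler_type):
    """Step 403: derive the collection domain name for a request url."""
    parts = urlsplit(url)
    if crawler_type == "url_param":
        # Keep the non-parameter part of the address.
        return f"{parts.scheme}://{parts.netloc}{parts.path}"
    # POST-parameter crawlers are grouped by the domain name itself.
    return parts.netloc


def group_texts(responses, crawler_type):
    """Group (url, text) response pairs by their collection domain name."""
    groups = defaultdict(list)
    for url, text in responses:
        groups[collection_key(url, crawler_type)].append(text)
    return dict(groups)
```

Sending the request (for instance with `urllib.request.urlopen`) and extracting the text part of an HTML or JSON response are left out, since they depend on the target website.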
S5: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
The weight values produced by traditional TF-IDF are generally very small, close to 0, and its accuracy is limited. In essence, IDF is a weighting that tries to suppress noise: it simply assumes that words with a low text frequency are more important and words with a high text frequency are less useful. This is not entirely correct for most text. Because of its simple structure, IDF does not let the extracted keywords adequately reflect the importance of the words or the distribution of the feature words, so it does not adjust the weights well. The drawback is especially severe in a corpus of similar texts, where the keywords of similar texts are often masked. For example, if the corpus D contains many education articles and the text j is itself an education article, the IDF values of education-related words will be small, the recall of text keyword extraction will be low, and the keyword extraction result will be inaccurate. On this basis, the invention provides a weighting algorithm based on inverse word frequency, namely
By the formula
$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{x} nt_{x}}{nt_{i}}$$
calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored texts as feature data, and extracting important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ represents the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ represents the sum of all the word frequencies in the text $j$, $\sum_{x} nt_{x}$ represents the sum of the frequencies of all words in the corpus, and $nt_{i}$ represents the total frequency with which the word $t_i$ occurs in the corpus.
This weighting method reduces the influence of same-type texts in the corpus on the word weight and expresses more accurately the importance of a word in the document under examination. The result of the formula also resolves the problem of the final weight being too small; in practical applications, six significant digits are retained, which makes the calculation result more accurate.
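The weighting in S5 can be sketched as below. The sketch assumes the weight takes the usual inverse-word-frequency (TF-IWF) form built from the four quantities defined above: the frequency of a word in text j divided by the total word count of text j, multiplied by the logarithm of the total word count of the corpus divided by the corpus frequency of the word. The natural logarithm, the pre-tokenized input and the function names `tf_iwf` and `keywords` are assumptions for the example.

```python
import math
from collections import Counter


def tf_iwf(corpus):
    """Compute a TF-IWF weight for every word of every text.

    corpus: dict mapping a text id to its list of word tokens.
    Returns: dict mapping text id -> {word: weight}.
    """
    corpus_counts = Counter()               # nt_i for every word
    for tokens in corpus.values():
        corpus_counts.update(tokens)
    corpus_total = sum(corpus_counts.values())   # sum of nt_x over the corpus

    weights = {}
    for text_id, tokens in corpus.items():
        counts = Counter(tokens)            # n_{i,j}
        text_total = sum(counts.values())   # sum of n_{k,j} over text j
        weights[text_id] = {
            word: (n / text_total) * math.log(corpus_total / corpus_counts[word])
            for word, n in counts.items()
        }
    return weights


def keywords(weights, threshold):
    """Keep, per text, the words whose weight exceeds the threshold."""
    return {
        text_id: sorted(w for w, value in per_text.items() if value > threshold)
        for text_id, per_text in weights.items()
    }
```

Word segmentation of the stored texts is out of scope here; any tokenizer suited to the language of the responses could feed `tf_iwf`.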
S6: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result. The sensitive information comprises a mobile phone number, a name, an address, a license plate number and an identification number.
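The text does not specify which sensitive data discovery technology is used in S6. As one hedged possibility, pattern matching with regular expressions covers the formats that have a fixed shape. The patterns below (mainland China mobile numbers, 18-character identification numbers and ordinary license plates) are simplified assumptions; names and addresses have no fixed format and would need a dictionary or a named-entity model, so they are omitted.

```python
import re

# Simplified, assumed patterns; real formats have more variants.
SENSITIVE_PATTERNS = {
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "license_plate": re.compile(r"[\u4e00-\u9fa5][A-Z][A-Z0-9]{5}(?![A-Z0-9])"),
}


def find_sensitive(text):
    """Return {sensitive data type: [matches]} for the types found in text."""
    found = {}
    for kind, pattern in SENSITIVE_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found
```

An empty result means no sensitive data of these types was detected; a non-empty result gives both the yes/no answer and the sensitive data types that S6 outputs.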
As a further improvement of the present invention, the sensitive data interface crawler identification method further includes S7:
counting, for the crawler of the sensitive data interface identified in S6, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed by each IP, the number of request User-Agents, the number of 200 responses returned, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistical result. The crawler risk level can be judged on one or more indexes selected according to actual needs; for example, the three indexes of url collection request number, access rate and number of 200 responses may be selected, and a crawler whose url collection request number exceeds a first preset value, whose access rate exceeds a second preset value and whose number of 200 responses exceeds a third preset value is classified as high risk.
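The three-index example in S7 can be sketched as follows. Only the "high" rule (all three preset values exceeded) comes from the text; the "medium" and "low" levels, the default preset values and the names `CrawlerStats` and `risk_level` are assumptions added to make the example complete.

```python
from dataclasses import dataclass


@dataclass
class CrawlerStats:
    """Statistics of one crawler for the three indexes used in the example."""
    url_request_count: int    # number of url collection requests
    access_rate: float        # requests per second (assumed unit)
    status_200_count: int     # number of 200 responses returned


def risk_level(stats, first_preset=1000, second_preset=10.0, third_preset=500):
    """Map the statistics to a risk level using three preset values."""
    exceeded = [
        stats.url_request_count > first_preset,
        stats.access_rate > second_preset,
        stats.status_200_count > third_preset,
    ]
    if all(exceeded):
        return "high"      # rule stated in the text
    if any(exceeded):
        return "medium"    # assumed intermediate level
    return "low"           # assumed default level
```

Other indexes from the list, such as the number of request IP addresses or User-Agents, could be added to `CrawlerStats` in the same way when they are selected.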
Through the above technical scheme, the method initiates requests to the website using the crawler's parameters according to the crawler type, acquires the content of the request responses, collects the response content according to the request url, stores the text parts of the content returned by the website in groups according to the collection domain name, identifies whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputs whether sensitive data is involved and the type of the sensitive data, thereby effectively identifying the crawler's motivation, identifying crawler behavior involving sensitive information, and safeguarding network information security.
Example 2
Based on embodiment 1, embodiment 2 of the present invention further provides a sensitive data interface crawler recognition apparatus, where the apparatus includes:
the log acquisition module is used for acquiring a web access log of a website;
the crawler identification module is used for identifying the crawler according to the web access log;
the judging module is used for judging the type of the crawler;
the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;
the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;
and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.
Specifically, the web access log includes the request time, IP address, user identity information, sessionid, requestbody, responsebody, method and status, and the user identity information includes an account, a cookie and a uuid.
Specifically, the crawler identification module identifies the crawler by using an anomaly detection method based on the user behavior sequence or a rule engine method.
Specifically, the crawler types in the judging module include a type that switches pages by modifying parameters in the url, and a type that switches pages on the same url by modifying the parameters of the POST request content.
More specifically, the crawler request simulation module includes:
the Request simulation unit is used for initiating a Request to a website by using parameters of the crawler according to different crawler types, wherein the Request contains additional headers information, so that crawler Request simulation is performed;
the request response unit is used for carrying out page analysis on the website accessed by the crawler, acquiring information returned by the website page and obtaining the content of the request response;
the grouping storage unit is used for collecting the content of the request responses according to the request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by modifying the parameters of the POST request content, the domain name of the crawler address is used directly as the collection domain name; and for storing the text parts returned by the website in groups according to the collection domain name.
Specifically, the feature extraction module is further configured to:
by the formula
$$\mathrm{TF\text{-}IWF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{x} nt_{x}}{nt_{i}}$$
calculating the word frequency, extracting words whose word frequency exceeds a threshold value in the stored texts as feature data, and extracting important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ represents the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ represents the sum of all the word frequencies in the text $j$, $\sum_{x} nt_{x}$ represents the sum of the frequencies of all words in the corpus, and $nt_{i}$ represents the total frequency with which the word $t_i$ occurs in the corpus.
Specifically, the sensitive information includes a mobile phone number, a name, an address, a license plate number and an identification number.
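As a rough illustration of how a sensitive data discovery step might flag two of these types, the sketch below matches extracted keywords against regular expressions. The patterns are assumptions (an 11-digit mainland China mobile number and an 18-character identification number) and are not taken from the patent; names, addresses and license plate numbers would need additional rules or dictionaries and are not covered here.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    # 11 digits starting with 1 and a second digit from 3 to 9.
    "mobile_phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # 17 digits followed by a digit or the check character X.
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_sensitive(keywords):
    """Return {sensitive_type: [matched strings]} for a list of keywords."""
    hits = {}
    for word in keywords:
        for label, pattern in PATTERNS.items():
            found = pattern.findall(word)
            if found:
                hits.setdefault(label, []).extend(found)
    return hits
```

The digit lookarounds keep the phone pattern from matching a fragment inside a longer run of digits such as an identification number.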
Specifically, the sensitive data interface crawler identification device further comprises a statistics module, which is used for counting, for each crawler that the sensitive judgment module has identified as accessing a sensitive data interface, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request users, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and for outputting the crawler risk level and the attack type according to the statistical result.
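Several of the statistics listed above could be gathered as in the following sketch. The record layout (dicts with `ip`, `url`, `user_agent`, `status`, `referer`, `method` and a `timestamp` in seconds) and the function name are assumptions made for illustration. The text does not say how the statistics map to a risk level or attack type, so that step is left out.

```python
from collections import defaultdict


def crawler_stats(records):
    """Aggregate per-crawler statistics from a list of log records."""
    urls_per_ip = defaultdict(set)
    for r in records:
        urls_per_ip[r["ip"]].add(r["url"])

    times = [r["timestamp"] for r in records]
    span = max(times) - min(times) if times else 0

    return {
        "request_count": len(records),
        # Requests per second over the observed window; None if undefined.
        "access_rate": len(records) / span if span > 0 else None,
        "ip_count": len(urls_per_ip),
        "urls_per_ip": {ip: len(urls) for ip, urls in urls_per_ip.items()},
        "user_agent_count": len({r["user_agent"] for r in records}),
        "status_200_count": sum(1 for r in records if r["status"] == 200),
        # Distinct non-empty Referer values.
        "referer_count": len({r["referer"] for r in records if r["referer"]}),
        "methods": sorted({r["method"] for r in records}),
    }
```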
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A sensitive data interface crawler identification method is characterized by comprising the following steps:
step one: acquiring a web access log of a website;
step two: identifying the crawler according to the web access log;
step three: judging the type of the crawler;
step four: initiating a request to a website by using parameters of crawlers according to different types of crawlers, acquiring content of the request response, collecting the content of the request response according to a request url, and storing text parts of the content returned by the website in groups according to a collection domain name;
step five: extracting feature data of the stored texts, wherein important link addresses and text keyword results are extracted from the texts under each domain name;
step six: identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology, and outputting a corresponding result.
2. The sensitive data interface crawler identification method of claim 1, wherein the web access log comprises request time, IP address, user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises account number, cookie and uuid.
3. The method for identifying the sensitive data interface crawler according to claim 1, wherein in the second step, the crawler is identified by adopting an anomaly detection method or a rule engine method based on a user behavior sequence.
4. The method according to claim 1, wherein the crawler types in step three include crawlers that switch pages by modifying parameters in the url, and crawlers that request the same url and switch pages by sending different parameters in the POST content.
5. The sensitive data interface crawler identification method according to claim 4, wherein said step four comprises:
step 401: initiating a request to the website with the crawler's parameters according to the crawler type, wherein the request carries additional headers information, so as to simulate the crawler request;
step 402: parsing the page of the website accessed by the crawler and acquiring the information returned by the page, so as to obtain the content of the request response;
step 403: collecting the content of the request response according to the request url: if the crawler address switches pages by modifying parameters in the url, the non-parameter part of the crawler address is retained as the collection domain name; if the crawler address switches pages by sending different parameters in the POST content, the domain name of the crawler address is used directly as the collection domain name; the text parts returned by the website are then stored in groups according to the collection domain name.
6. The method for identifying the sensitive data interface crawler according to claim 1, wherein the step five comprises:
by the formula

$$w_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log\frac{\sum_{k} n_{t_k}}{n_{t_i}}$$

calculating the word frequency $w_{i,j}$, extracting words whose word frequency exceeds a threshold value in the stored texts as feature data, and extracting the important link addresses and text keyword results of the texts under each domain name according to the word frequency; wherein $n_{i,j}$ is the number of times the word $t_i$ occurs in the text $j$, $\sum_{k} n_{k,j}$ is the sum of all the word frequencies in the text $j$, $\sum_{k} n_{t_k}$ is the sum of the frequencies of all words in the corpus, and $n_{t_i}$ is the total frequency with which the word $t_i$ occurs in the corpus.
7. The sensitive data interface crawler identification method of claim 1, wherein the sensitive information comprises a cell phone number, a name, an address, a license plate number, an identification number.
8. The sensitive data interface crawler identification method according to claim 1, further comprising step seven:
counting, for the crawler with the sensitive data interface identified in step six, the number of url collection requests, the access rate, the number of request IP addresses, the number of urls accessed per IP, the number of request useragents, the number of 200 responses, the number of access Referers, the access Method types and the url sensitive data types, and outputting a crawler risk level and an attack type according to the statistical result.
9. A sensitive data interface crawler recognition apparatus, said apparatus comprising:
the log acquisition module is used for acquiring a web access log of a website;
the crawler identification module is used for identifying the crawler according to the web access log;
the judging module is used for judging the type of the crawler;
the crawler request simulation module is used for initiating a request to a website by using parameters of crawlers according to different crawler types, acquiring the content of the request response, collecting the content of the request response according to the request url, and storing the text part of the content returned by the website in groups according to the collection domain name;
the feature extraction module is used for extracting feature data of the stored texts, and important link addresses and text keyword results are extracted from the texts under each domain name correspondingly;
and the sensitive judgment module is used for identifying whether sensitive information exists in the text keyword result by using a sensitive data discovery technology and outputting a corresponding result.
10. The sensitive data interface crawler identifying apparatus of claim 9, wherein the web access log comprises request time, IP address, user identity information, sessionid, requestbody, responsbody, method and status, and the user identity information comprises account number, cookie and uuid.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111100833.1A CN113821754A (en) | 2021-09-18 | 2021-09-18 | Sensitive data interface crawler identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113821754A true CN113821754A (en) | 2021-12-21 |
Family
ID=78922493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111100833.1A Pending CN113821754A (en) | 2021-09-18 | 2021-09-18 | Sensitive data interface crawler identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113821754A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116150542A (en) * | 2023-04-21 | 2023-05-23 | 河北网新数字技术股份有限公司 | Dynamic page generation method and device and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250513A (en) * | 2016-08-02 | 2016-12-21 | 西南石油大学 | A kind of event personalization sorting technique based on event modeling and system |
CN106411578A (en) * | 2016-09-12 | 2017-02-15 | 国网山东省电力公司电力科学研究院 | Website monitoring system and method applicable to power industry |
CN106776768A (en) * | 2016-11-23 | 2017-05-31 | 福建六壬网安股份有限公司 | A kind of URL grasping means of distributed reptile engine and system |
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
CN108712426A (en) * | 2018-05-21 | 2018-10-26 | 携程旅游网络技术(上海)有限公司 | Reptile recognition methods and system a little are buried based on user behavior |
CN109308330A (en) * | 2018-07-24 | 2019-02-05 | 国家计算机网络与信息安全管理中心 | The method of enterprise's leakage information extraction, analysis and classification Internet-based |
CN110351248A (en) * | 2019-06-14 | 2019-10-18 | 北京纵横无双科技有限公司 | A kind of safety protecting method and device based on intellectual analysis and intelligent current limliting |
CN112287198A (en) * | 2020-10-28 | 2021-01-29 | 上海云信留客信息科技有限公司 | Spam short message detection method based on crawler technology |
Non-Patent Citations (5)
Title |
---|
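吕宝路 et al.: "Implementation of a comprehensive Web vulnerability scanner for sensitive information detection", Computer Knowledge and Technology (电脑知识与技术), vol. 16, no. 23, pages 30 - 32 * |
李昌兵 et al.: "Feature extraction and short text classification method combining chi-square statistics and the TF-IWF algorithm", Journal of Chongqing University of Technology (Natural Science) (《重庆理工大学学报(自然科学)》), vol. 35, pages 135 - 140 * |
王小林 et al.: "Improved TF-IDF keyword extraction method", Computer Science and Application (《计算机科学与应用》), vol. 3, no. 1, 28 February 2013 (2013-02-28), pages 64 - 68 * |
王小林 et al.: "Improved TF-IDF keyword extraction method", Computer Science and Application (《计算机科学与应用》), vol. 3, no. 1, pages 64 - 68 * |
赵国生 et al.: "Python Web Crawler Technology and Practice" (《python网络爬虫技术与实战》), 31 January 2021, China Machine Press (机械工业出版社), pages: 100 - 105 * |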
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241461B (en) | User portrait construction method and device | |
CN102946319B (en) | Networks congestion control information analysis system and analytical method thereof | |
CN109960729A (en) | The detection method and system of HTTP malicious traffic stream | |
CN106095979B (en) | URL merging processing method and device | |
CN109274632B (en) | Website identification method and device | |
CN109905288B (en) | Application service classification method and device | |
CN105224691B (en) | A kind of information processing method and device | |
CN108156131A (en) | Webshell detection methods, electronic equipment and computer storage media | |
CN112491784A (en) | Request processing method and device of Web site and computer readable storage medium | |
Balla et al. | Real-time web crawler detection | |
CN108337269A (en) | A kind of WebShell detection methods | |
CN113098887A (en) | Phishing website detection method based on website joint characteristics | |
CN108319672A (en) | Mobile terminal malicious information filtering method and system based on cloud computing | |
CN108600270A (en) | A kind of abnormal user detection method and system based on network log | |
CN112131507A (en) | Website content processing method, device, server and computer-readable storage medium | |
CN114244564A (en) | Attack defense method, device, equipment and readable storage medium | |
CN103020208B (en) | A kind of searching method and device being adapted with mobile terminal | |
CN113821754A (en) | Sensitive data interface crawler identification method and device | |
CN117254983A (en) | Method, device, equipment and storage medium for detecting fraud-related websites | |
CN115801455B (en) | Method and device for detecting counterfeit website based on website fingerprint | |
CN107734534A (en) | A kind of network load appraisal procedure and device | |
CN110263283A (en) | Website detection method and device | |
CN112199573B (en) | Illegal transaction active detection method and system | |
CN111611508B (en) | Identification method and device for actual website access of user | |
CN114048311A (en) | Phishing early warning method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||