WO2020000748A1 - File detection method and apparatus - Google Patents

File detection method and apparatus

Publication number
WO2020000748A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
page content
file
page
urls
Prior art date
Application number
PCT/CN2018/108711
Other languages
French (fr)
Chinese (zh)
Inventor
熊庆昌
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020000748A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a file detection method and device.
  • Sensitive file detection refers to guessing whether there are specific sensitive files (such as test files, configuration files, management background files, backup files, etc.) in the web server or website through brute force methods.
  • a sensitive file refers to a file that contains some sensitive information and cannot be directly accessed by any user.
  • Such detection determines whether sensitive files exist on the web server or website by checking whether the status code returned for an http request is 200 (the server has successfully processed the request, indicating that the web page can be accessed normally). If the status code returned by the http request is 200, the page file can be accessed normally. If the status code returned by the http request is not 200, but 404 (the server cannot find the requested web page) or 403 (the server rejects the request), the page file cannot be accessed. However, because some websites customize the 404 page returned for an http request, a request for a non-existent page (such as a sensitive file) still returns 200. Judging whether a particular page exists by the returned status code alone therefore produces a large number of false positives.
  • http: Hypertext Transfer Protocol
  • the embodiments of the present application provide a file detection method and device, which can improve the accuracy of file detection and reduce false positives.
  • an embodiment of the present application provides a file detection method, which includes:
  • N second URLs are determined according to the first URL, and any second URL in the N second URLs is a URL obtained by transforming based on the first URL, where N is an integer greater than or equal to 1;
  • M is an integer greater than or equal to 1, and M is less than or equal to N.
  • an embodiment of the present application provides a file detection device, where the device includes:
  • An obtaining module configured to obtain the first page content returned for the first request when the status code returned by the server for the first request is the target status code, where the first request includes a first uniform resource locator URL;
  • a first determining module configured to determine N second URLs according to the first URL obtained by the obtaining module, and any second URL in the N second URLs is a URL obtained by transforming based on the first URL, the N is an integer greater than or equal to 1;
  • a transceiver module configured to send N second requests to the server based on the N second URLs determined by the first determining module, and receive N second page contents returned by the server for the N second requests, where one second request includes one second URL, and one second request corresponds to one second page content;
  • a second determining module configured to determine that, when M second page contents among the N second page contents match the first page content, a file corresponding to the first URL is a target file, where M is an integer greater than or equal to 1, and M is less than or equal to N.
  • an embodiment of the present application provides a terminal, including a processor, an input device, an output device, and a memory.
  • the processor, the input device, the output device, and the memory are connected to each other.
  • the memory is used to store a computer program that supports the terminal in executing the above method, the computer program comprising program instructions, and the processor is configured to call the program instructions to execute the file detection method of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer storage medium stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the file detection method of the first aspect.
  • the embodiment of the present application determines whether the file corresponding to the URL is the target file by comparing whether the content of the pages returned by multiple different requests are consistent, which can improve the accuracy of sensitive file detection, thereby reducing false positives of sensitive files.
  • FIG. 1 is a schematic flowchart of a file detection method according to an embodiment of the present application
  • FIG. 2 is another schematic flowchart of a file detection method according to an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a file detection device according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a terminal provided by an embodiment of the present application.
  • the embodiments of the present application may be applied in a penetration test process of a website or a web server.
  • Penetration testing is a method to evaluate the security performance of computer network systems by using simulated hacker attacks.
  • During a penetration test, comprehensive information collection is required for the website or web server, and the website or web server is scanned for sensitive files such as configuration files and log files. This is because, in a website or web server, sensitive files are files that contain some sensitive information and cannot be directly accessed by any user.
  • hackers may use the information carried in these sensitive files to attack the website or web server, which may paralyze the website or web server, and may even cause property loss to the users of the website or web server.
  • the embodiment of the present application compares the page content returned for the first request containing the first URL with the page content returned for the second request containing the second URL (a URL that is not in the website or web server, that is, a non-existent URL), and uses whether the two match to determine whether the sensitive file in the website or web server can be accessed by any user. If they match, it indicates that the sensitive file in the website or web server can be accessed by any user, and the file corresponding to the first URL can be determined to be a sensitive file. This improves the accuracy of sensitive file detection and reduces false positives. After the sensitive file is found, the permissions of the file corresponding to the first URL can be modified, or the file can be deleted, in time, thereby improving the security of the website or web server.
  • the first request and the second request in the embodiments of the present application may refer to the same type of request.
  • the first request is a first http request
  • the second request is a second http request.
  • This embodiment of the present application uses the first http request and the second http request as examples.
  • FIG. 1 is a schematic flowchart of a file detection method according to an embodiment of the present application. As shown in FIG. 1, the file detection method may include steps:
  • When the status code returned by the server for the first request is the target status code, the terminal obtains the first page content returned for the first request.
  • the terminal may use a sensitive file scanning tool such as BBScan, Dirsearch, Opendoor, etc. to scan the URL in the server.
  • the scanning process may be to use a sensitive file scanning tool to send an HTTP request containing the URL to the server.
  • After receiving the http request, the server can send the terminal the http status code corresponding to the URL in the http request, such as status code 200 (the server has successfully processed the request, indicating that the web page can be accessed normally), status code 404 (the server cannot find the requested web page), status code 403 (the server rejected the request), and so on.
  • When the terminal receives the target status code 200 returned by the server for the first http request, the terminal first determines that the first URL carried by the first http request can be accessed by any user, and the terminal can receive the first page content returned by the server for the first http request.
  • the first http request includes the first URL.
  • the terminal determines N second URLs according to the first URL.
  • the structure of the URL is usually "protocol://server name (or IP address)/path/file name".
  • the terminal may perform at least one different character transformation on the file name portion of the first URL and/or the file suffix name of the first URL according to a preset rule (such as adding characters to the file name portion of the first URL).
  • Any one of the N different second URLs may be a URL that is not in the foregoing server, that is, a URL that does not exist, and N is an integer greater than or equal to 1.
  • One character conversion gets a second URL, and one modification gets a second URL.
  • the type of the character may be one or more of numbers, letters (including uppercase and lowercase), and special characters, which are not limited in the embodiments of the present application.
  • the terminal in the embodiment of the present application may automatically transform the first URL according to a preset rule, thereby obtaining a plurality of different second URLs (in fact, each second URL is a non-existent URL). This reduces manual processing steps and improves the processing efficiency of file detection.
  • For example, the terminal may add a single character "m" to the head of the file name of the first URL to get a second URL, http://xxx.pingan.com/conf/mconfig.inc. The terminal can also modify the file suffix ".inc" of the first URL to a non-existent suffix format ".inu" to get another second URL, http://xxx.pingan.com/conf/mconfig.inu.
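  • The rule-driven transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the names PRESET_RULES and second_urls are hypothetical, the first URL is assumed to be http://xxx.pingan.com/conf/config.inc, and each rule is applied to the first URL independently.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Hypothetical preset rules (not from the source): each rule maps the
# (stem, suffix) of the file name to a new (stem, suffix).
PRESET_RULES = [
    lambda stem, suffix: ("m" + stem, suffix),  # add "m" at the head of the file name
    lambda stem, suffix: (stem, ".inu"),        # replace the suffix with a non-existent one
]


def second_urls(first_url, rules=PRESET_RULES):
    """Apply each rule once to the first URL; one rule yields one second URL."""
    parts = urlsplit(first_url)
    directory, file_name = posixpath.split(parts.path)
    stem, suffix = posixpath.splitext(file_name)
    urls = []
    for rule in rules:
        new_stem, new_suffix = rule(stem, suffix)
        new_path = posixpath.join(directory, new_stem + new_suffix)
        urls.append(urlunsplit(parts._replace(path=new_path)))
    return urls


candidates = second_urls("http://xxx.pingan.com/conf/config.inc")
```

  • Keeping the rules as data makes N simply the number of rules, so further transformations can be added without changing the function.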
  • the terminal sends N second requests to the server based on the N second URLs, and receives N second page content returned by the server for the N second requests.
  • the terminal may send N second http requests to the server based on the N second URLs determined above.
  • For each of the N second http requests, the server returns a second page content corresponding to that second http request, and the terminal can receive the N second page contents returned by the server for the N second http requests.
  • a second http request includes a second URL determined as described above, and a second http request corresponds to a second page content.
  • For example, assume N = 3.
  • the three second URLs determined above may be expressed as URL-a, URL-b, and URL-c, respectively, and the terminal may separately send to the server a second http request a1 containing URL-a, a second http request b1 containing URL-b, and a second http request c1 containing URL-c.
  • the server receives the second http request a1, the second http request b1, and the second http request c1. It returns the second page content A1 corresponding to a1 for the second http request a1, the second page content B1 corresponding to b1 for the second http request b1, and the second page content C1 corresponding to c1 for the second http request c1.
  • the terminal receives the three second page contents A1, B1, and C1 returned by the server.
  • the terminal may send a second http request including a second URL to the server, and receive the second page content returned by the server for that second http request.
  • the terminal may also first determine N second URLs according to the first URL, then send N second http requests to the server together, and receive the second page content returned by the server for each second http request, obtaining a total of N second page contents.
  • If M second page contents among the N second page contents match the first page content, the terminal determines that the file corresponding to the first URL is the target file.
  • the terminal can detect whether each of the N second page contents received is the same as the first page content obtained above. If M second page contents among the N second page contents are the same as the first page content, the terminal may determine that the file corresponding to the first URL is a target file (a sensitive file). If all N second page contents differ from the first page content, the terminal may determine that the file corresponding to the first URL is not a sensitive file. In the embodiment of the present application, by comparing whether the first page content and the second page content match, the accuracy of file detection can be improved and false positives can be reduced.
  • If the file corresponding to the first URL is a sensitive file, the first URL cannot be accessed by the user (that is, it is a page link that does not exist), so the status code corresponding to the first URL should be one indicating that the first URL cannot be accessed, such as the 404 status code or the 403 status code.
  • However, the status code returned for the first http request containing the first URL is 200.
  • After the first URL is transformed multiple times (by adding a few characters at the head of the file name, at the end of the file name, and/or at the end of the file suffix name of the first URL), a plurality of non-existent second URLs (links to pages that do not exist in the website or server) are obtained. Because each second URL obtained by transforming the first URL is a non-existent page link, and the first URL is also a non-existent page link, at least one of the second page contents corresponding to the second URLs must be the same as the first page content corresponding to the first URL.
  • When the first URL is a normal page link, the status code corresponding to the first URL is 200 regardless of whether the website or server customizes the 404 page.
  • Because each second URL is a non-existent page link and the first URL is a normal page link, the second page contents corresponding to the second URLs all differ from the first page content corresponding to the first URL.
  • the M may be an integer greater than or equal to 1, and the M may be less than or equal to the N.
  • the file corresponding to http://xxx.pingan.com/conf/config.inc is a sensitive file. In the embodiment of the present application, once M second page contents identical to the first page content have been found among the N second page contents, the search stops, which can improve the processing efficiency of file detection.
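  • The exact-match comparison with early stopping can be sketched as below. The function name matches_first_page is hypothetical, and page contents are assumed to be plain strings that have already been received.

```python
def matches_first_page(first_page, second_pages, m=1):
    """Return True as soon as m second pages are identical to first_page."""
    if not 1 <= m <= len(second_pages):
        raise ValueError("m must satisfy 1 <= m <= N")
    found = 0
    for page in second_pages:
        if page == first_page:
            found += 1
            if found == m:
                return True  # stop searching once M matches have been found
    return False
```

  • Returning inside the loop is what avoids comparing the remaining second page contents after the M-th match.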
  • the terminal may extract a first feature from the first page content. The first feature may be a key character string in the first page content, such as the title in the first page content or the first 50 characters of the first page content.
  • the terminal may also extract a second feature from each of the N second page contents, to obtain a total of N second features.
  • the second feature may be a key character string in the second page content, and the second feature and the first feature are the same kind of feature. That is, if the first feature is the title in the first page content, the second feature is the title in the second page content; if the first feature is the first 50 characters of the first page content, the second feature is the first 50 characters of the second page content.
  • the terminal may detect whether the first feature is the same as each of the N second features. If M second features among the N second features are the same as the first feature, the terminal may determine that the file corresponding to the first URL is a sensitive file. If none of the N second features is the same as the first feature, the terminal may determine that the file corresponding to the first URL is not a sensitive file.
  • the M may be an integer greater than or equal to 1, and the M may be less than or equal to the N.
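  • A minimal sketch of the feature comparison follows, assuming the page contents are HTML strings. The function names are hypothetical, and the regular expression is only a simple stand-in for real title extraction.

```python
import re


def title_feature(page):
    """Return the text of the first <title> element, or "" if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def prefix_feature(page, length=50):
    """Return the first `length` characters of the page content."""
    return page[:length]


def features_match(first_page, second_pages, extract, m=1):
    """Extract the same kind of feature from every page, then count equal ones."""
    first_feature = extract(first_page)
    second_features = [extract(page) for page in second_pages]
    return sum(1 for f in second_features if f == first_feature) >= m
```

  • Passing the extractor as a parameter keeps the first feature and the second features of the same kind. Note that in this sketch two pages without any title both yield an empty feature and would therefore compare as equal.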
  • In the embodiment of the present application, the first page content returned for the first request is obtained, where the first request includes the first URL. N second URLs are then determined according to the first URL, N second requests are sent to the server based on the N second URLs, and N second page contents returned by the server for the N second requests are received. If M second page contents among the N second page contents match the first page content, the file corresponding to the first URL is determined to be the target file, which can improve the accuracy of sensitive file detection, thereby reducing false positives of sensitive files.
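  • The overall flow summarized above can be sketched with the network access injected as a function, so the logic can be exercised without a real server. Every name here (detect_target_file, fetch, make_second_urls, matches) is hypothetical; fetch is assumed to return a (status code, page content) pair, and the return value None for a non-target status code is a choice made for this sketch only.

```python
def detect_target_file(first_url, fetch, make_second_urls, matches,
                       m=1, target_status=200):
    """Follow the described steps; return None if the status is not the target."""
    status, first_page = fetch(first_url)
    if status != target_status:
        return None  # the method only proceeds for the target status code
    second_urls = make_second_urls(first_url)                 # N second URLs
    second_pages = [fetch(url)[1] for url in second_urls]     # N second page contents
    hits = sum(1 for page in second_pages if matches(page, first_page))
    return hits >= m  # True: the file for first_url is the target file


# Stand-in for a site that answers 200 with a custom page for unknown URLs.
def fake_fetch(url):
    known = {"http://example.test/index.html": "home page"}
    return 200, known.get(url, "custom not-found page")


def append_digits(url):
    return [url + "1", url + "2"]


def same_content(a, b):
    return a == b


result = detect_target_file("http://example.test/conf/config.inc",
                            fake_fetch, append_digits, same_content)
```

  • The comparison function is a parameter so that exact matching, feature matching, or a similarity threshold can be plugged in without changing the flow.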
  • the file detection method may include steps:
  • For step S201 in the embodiment of the present application, reference may be made to the implementation manner provided by step S101 in the embodiment shown in FIG. 1, and details are not described herein again.
  • the terminal adds characters N times at any N different positions in the file name part of the first URL to obtain N second URLs.
  • the terminal may perform N character additions at any N different positions in the file name part of the first URL to obtain N second URLs.
  • adding one or more characters at any position in the file name part of the first URL is a character transformation of the file name part of the first URL.
  • the N different positions may be the file name header, file name tail, and file suffix name tail of the first URL, respectively.
  • One or more characters can be added in one position, and the characters added in each position can be different or the same.
  • the type of the character may be one or more of numbers, letters (including uppercase and lowercase), and special characters, which are not limited in the embodiments of the present application.
  • the embodiment of the present application obtains different second URLs by adding characters at different positions in the file name portion of the first URL (in fact, each second URL is a non-existent URL).
  • The operation is simple and the processing is convenient, and it can improve the accuracy of the constructed URLs, thereby improving the accuracy of detection in subsequent judgments.
  • For example, the terminal can add a character "x" to the head of the file name of the first URL to get a second URL, http://xxx.pingan.com/conf/xconfig.inc, and add a character "y" to the end of the file name of the first URL to get another second URL, http://xxx.pingan.com/conf/configy.inc. It can also add a character "1" to the end of the file suffix of the first URL, so that the resulting file suffix ".inc1" does not exist, which gives another second URL, http://xxx.pingan.com/conf/config.inc1, also a non-existent URL. Therefore, adding characters to the tail of the file suffix of the first URL is also a way for the terminal to modify the file suffix.
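  • The three additions in this example can be reproduced with the sketch below. The function name add_chars and the position labels are hypothetical, and the first URL is assumed to be http://xxx.pingan.com/conf/config.inc.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def add_chars(url, position, chars):
    """Add chars at one position of the file name part of url."""
    parts = urlsplit(url)
    directory, file_name = posixpath.split(parts.path)
    stem, suffix = posixpath.splitext(file_name)
    if position == "name_head":
        file_name = chars + stem + suffix
    elif position == "name_tail":
        file_name = stem + chars + suffix
    elif position == "suffix_tail":
        file_name = stem + suffix + chars
    else:
        raise ValueError("unknown position: " + position)
    new_path = posixpath.join(directory, file_name)
    return urlunsplit(parts._replace(path=new_path))


first_url = "http://xxx.pingan.com/conf/config.inc"
# One addition at each of N = 3 different positions gives 3 second URLs.
second_urls = [
    add_chars(first_url, "name_head", "x"),
    add_chars(first_url, "name_tail", "y"),
    add_chars(first_url, "suffix_tail", "1"),
]
```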
  • S203 The terminal sends N second requests to the server based on the N second URLs, and receives N second page content returned by the server for the N second requests.
  • For step S203 in the embodiment of the present application, reference may be made to the implementation manner provided by step S103 in the embodiment shown in FIG. 1, and details are not described herein again.
  • the terminal obtains the similarity value between each of the N second page contents and the first page content, to obtain N similarity values.
  • the terminal may use a page similarity algorithm, such as the locality-sensitive hash algorithm simhash or the minimum hash algorithm minhash, to calculate the similarity value between each of the N second page contents and the first page content, to obtain the N similarity values. The terminal then compares each of the N similarity values with a preset similarity threshold. If M similarity values among the N similarity values are greater than the preset similarity threshold, the terminal may determine that the file corresponding to the first URL is a sensitive file. If all N similarity values are less than or equal to the preset similarity threshold, the terminal may determine that the file corresponding to the first URL is not a sensitive file.
  • the M may be an integer greater than or equal to 1, and the M may be less than or equal to the N.
  • By calculating the similarity value between the first page content and the second page content, and comparing that similarity value with a preset similarity threshold, the embodiment of the present application can exclude the influence of small differences between the first page content and the second page content on the detection result, thereby further improving the accuracy of file detection.
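  • The source names simhash but does not spell out how a similarity value is derived from it. The sketch below is one common construction, given here as an assumption: tokens are words, each token is hashed with MD5 truncated to 64 bits, and the similarity value is taken as 1 minus the Hamming distance divided by the number of bits. The function names are hypothetical.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a simhash fingerprint over the word tokens of text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(text_a, text_b, bits=64):
    """Map the Hamming distance between two fingerprints to a value in [0, 1]."""
    distance = bin(simhash(text_a, bits) ^ simhash(text_b, bits)).count("1")
    return 1 - distance / bits
```

  • Because pages that differ only in a few tokens tend to produce fingerprints that differ in few bits, a threshold on this value tolerates small differences such as an echoed request path in a custom not-found page.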
  • For example, the three second page contents can be represented by A1, B1, and C1, respectively, and the first page content can be represented by F1.
  • the preset similarity threshold is 90%.
  • the terminal uses simhash to calculate the similarity value between A1 and F1, the similarity value between B1 and F1, and the similarity value between C1 and F1.
  • the three similarity values are as follows: the similarity value between A1 and F1 is 67%, the similarity value between B1 and F1 is 80%, and the similarity value between C1 and F1 is 94%.
  • the terminal compares the three similarity values of 67%, 80%, and 94% with the preset similarity threshold of 90%. Because the similarity value between C1 and F1, 94%, is greater than the similarity threshold of 90%, the terminal determines that the file corresponding to the first URL is a sensitive file.
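  • The threshold decision in this example can be checked with a few lines. The function name exceeds_threshold is hypothetical, and M = 1 is assumed, as in the example.

```python
def exceeds_threshold(similarities, threshold, m=1):
    """True if at least m similarity values are strictly greater than threshold."""
    return sum(1 for value in similarities if value > threshold) >= m


# Similarity values of A1, B1, and C1 against F1, taken from the example.
values = {"A1": 0.67, "B1": 0.80, "C1": 0.94}
above = [name for name, value in values.items() if value > 0.90]
is_sensitive = exceeds_threshold(values.values(), 0.90, m=1)
```

  • The comparison is strict, matching the description that values less than or equal to the threshold do not count.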
  • the terminal may output alarm prompt information, where the alarm prompt information includes the first URL, and the alarm prompt information is used to prompt the target user that the file corresponding to the first URL is a sensitive file, so that the target user can process the file corresponding to the first URL. This prevents malicious access by hackers and the disclosure of sensitive file information on the server, which could otherwise cause problems such as property loss to the users of the server.
  • the programmer may delete the file corresponding to the first URL, or set the access permission of the first URL to the highest level, that is, no one is allowed to access it.
  • In the embodiment of the present application, the first page content returned for the first request is obtained, where the first request includes the first URL. The file name portion of the first URL is then subjected to N different character transformations to obtain N second URLs. Based on the N second URLs, N second requests are sent to the server, and the N second page contents returned by the server for the N second requests are received. When M of the N similarity values between the second page contents and the first page content are greater than the similarity threshold, the file corresponding to the first URL is determined to be the target file. This can not only reduce false positives of sensitive files, but also exclude the effect of small differences between page contents on the detection result, further improving the accuracy of file detection.
  • Referring to FIG. 3, it is a schematic block diagram of a file detection device provided by an embodiment of the present application.
  • the file detection device in the embodiment of the present application includes:
  • the obtaining module 10 is configured to obtain the first page content returned for the first request when the status code returned by the server for the first request is the target status code.
  • the first request includes a first uniform resource locator URL.
  • the first determining module 20 is configured to determine N second URLs according to the first URL obtained by the obtaining module 10. Wherein, any second URL among the N second URLs is a URL obtained by transforming based on the first URL, and N is an integer greater than or equal to 1.
  • the transceiver module 30 is configured to send N second requests to the server based on the N second URLs determined by the first determining module 20, and receive N second page content returned by the server for the N second requests.
  • a second request includes a second URL, and a second request corresponds to a second page content.
  • the second determining module 40 is configured to determine that, when M second page contents in the N second page contents match the first page contents, a file corresponding to the first URL is a target file.
  • M is an integer greater than or equal to 1, and M is less than or equal to N.
  • the first determining module 20 is configured to perform N different character transformations on the file name portion of the first URL obtained by the obtaining module 10 to obtain N second URLs. Among them, one character conversion obtains a second URL.
  • the first determining module 20 is configured to perform N character additions at any N different positions in the file name portion of the first URL obtained by the obtaining module 10 to obtain N second URLs, where adding one or more characters at any position in the file name portion of the first URL is one character transformation of the file name portion of the first URL.
  • the first determining module 20 is configured to perform N different modifications to the file suffix of the first URL obtained by the obtaining module 10 to obtain N second URLs. Among them, one modification obtains a second URL.
  • the second determining module 40 includes a first obtaining unit 401, a detecting unit 402, and a first determining unit 403.
  • the first obtaining unit 401 is configured to obtain a first feature in the first page content, and to obtain N second features of the N second page contents; the detecting unit 402 is configured to detect whether the first feature obtained by the first obtaining unit 401 matches each of the N second features.
  • the first determining unit 403 is configured to determine that the file corresponding to the first URL is a target file when M second features among the N second features match the first feature.
  • One of the contents of the second page corresponds to one of the second features.
  • the second determining module 40 includes a second obtaining unit 404 and a second determining unit 405.
  • the second obtaining unit 404 is configured to obtain a similarity value between each second page content and the first page content in the N second page contents to obtain N similarity values;
  • the second determining unit 405 is used to determine that a file corresponding to the first URL is a target file when M similarity values among the N similarity values are greater than a similarity threshold.
  • the second obtaining unit 404 is specifically configured to use a page similarity algorithm to calculate a similarity value between each of the N second page contents and the first page content, to obtain N similarity values.
  • the transceiver module 30 is further configured to output alarm prompt information after it is determined that a file corresponding to the first URL is a target file, where the alarm prompt information includes the first URL, and the alarm prompt information is used to prompt that the file corresponding to the first URL is a sensitive file.
  • the above-mentioned file detection device may, through the foregoing modules, carry out the implementation manners provided by each step in FIG. 1 or FIG. 2 to implement the functions of the foregoing embodiments.
  • For details, refer to the corresponding description of each step in the method embodiment shown in FIG. 1 or FIG. 2, which is not repeated here.
  • when the status code returned by the server for the first request is the target status code, the file detection device may obtain the first page content returned for the first request, where the first request includes the first URL. It then determines N second URLs according to the first URL, sends N second requests to the server based on the N second URLs, and receives the N second page contents returned by the server for the N second requests. If M second page contents among the N second page contents match the first page content, the file corresponding to the first URL is determined to be the target file, which can improve the accuracy of sensitive file detection and reduce false positives for sensitive files.
  • the terminal in the embodiment of the present application may include: one or more processors 801; one or more input devices 802, one or more output devices 803, and a memory 804.
  • the processor 801, the input device 802, the output device 803, and the memory 804 are connected through a bus 805.
  • the memory 804 is configured to store a computer program.
  • the computer program includes program instructions.
  • the processor 801 is configured to execute the program instructions stored in the memory 804.
  • the input device 802 is configured to receive a status code returned by the server for the first request, and the processor 801 is configured to call the program instruction for execution:
  • N second URLs are determined according to the first URL, and any second URL in the N second URLs is a URL obtained by transforming based on the first URL, and N is an integer greater than or equal to 1.
  • the output device 803 is configured to send N second requests to the server based on the N second URLs, and the input device 802 is configured to receive N second page content returned by the server for the N second requests.
  • One of the second requests includes a second URL, and one second request corresponds to a second page content.
  • the processor 801 is further configured to call the program instructions to execute: when M second page contents among the N second page contents match the first page content, determining that a file corresponding to the first URL is a target file.
  • M is an integer greater than or equal to 1, and M is less than or equal to N.
  • the processor 801 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the input device 802 may include a receiver, a receiving program interface, and the like, and the output device 803 may include a transmitter, a sending program interface, and the like.
  • the memory 804 may include a read-only memory and a random access memory, and provide instructions and data to the processor 801. A part of the memory 804 may further include a non-volatile random access memory. For example, the memory 804 may also store information of a device type.
  • the processor 801, the input device 802, and the output device 803 described in the embodiments of the present application may perform the implementations described in the file detection method provided in the embodiments of the present application, and may also perform the implementations of the file detection device described in the embodiments of the present application; details are not repeated here.
  • An embodiment of the present application further provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program.
  • the computer program includes program instructions. When the program instructions are executed by a processor, the file detection method shown in FIG. 1 or FIG. 2 is implemented. For details, refer to the description of the embodiment shown in FIG. 1 or FIG. 2; they are not repeated here.
  • the computer-readable storage medium may be the file detection device or an internal storage unit of the terminal according to any of the foregoing embodiments, such as a hard disk or a memory of the terminal.
  • the computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal.
  • the computer-readable storage medium may include both an internal storage unit of the terminal and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the terminal.
  • the computer-readable storage medium may also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Disclosed are a file detection method and apparatus. The method comprises: when a received status code returned by a server for a first request is a target status code, acquiring first page content returned for the first request, wherein the first page content comprises a first URL; determining N second URLs according to the first URL; sending N second requests to the server based on the N second URLs and receiving N pieces of second page content returned by the server for the N second requests; and if M pieces of second page content among the N pieces of second page content match the first page content, determining that a file corresponding to the first URL is a target file. Applying the embodiments of the present application can increase the accuracy of file detection and thereby reduce false positives.

Description

File detection method and device
This application claims priority to Chinese patent application No. 2018107047079, entitled "File detection method and device", filed with the Chinese Patent Office on June 30, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of Internet technologies, and in particular, to a file detection method and device.
Background
Sensitive file detection means guessing, by brute-force attempts, whether specific sensitive files (such as test files, configuration files, management back-end files, and backup files) exist on a web server or website. Generally, on a website or web server, a sensitive file is a file that contains sensitive information and must not be directly accessible to arbitrary users.
At present, whether a sensitive file exists on a web server or website is usually determined by checking whether the status code returned for a specific Hypertext Transfer Protocol (http) request is 200 (the server has successfully processed the request, meaning the web page can be accessed normally). If the status code returned for the http request is 200, the page file can be accessed normally. If the status code returned is not 200 but 404 (the server cannot find the requested web page) or 403 (the server refuses the request), the page file cannot be accessed. However, because some websites customize the 404 page returned for http requests, a status code of 200 is still returned when a user visits a non-existent page (such as a sensitive file), so judging whether a particular page exists from the returned status code alone produces a large number of false positives.
Summary of the invention
The embodiments of the present application provide a file detection method and device, which can improve the accuracy of file detection and reduce false positives.
In a first aspect, an embodiment of the present application provides a file detection method, which includes:
when the status code returned by a server for a first request is a target status code, acquiring first page content returned for the first request, where the first request includes a first uniform resource locator (URL);
determining N second URLs according to the first URL, where any second URL among the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1;
sending N second requests to the server based on the N second URLs, and receiving N second page contents returned by the server for the N second requests, where each second request includes one second URL and corresponds to one second page content;
if M second page contents among the N second page contents match the first page content, determining that the file corresponding to the first URL is a target file, where M is an integer greater than or equal to 1 and M is less than or equal to N.
In a second aspect, an embodiment of the present application provides a file detection device, which includes:
an obtaining module, configured to acquire, when the status code returned by a server for a first request is a target status code, first page content returned for the first request, where the first request includes a first uniform resource locator (URL);
a first determining module, configured to determine N second URLs according to the first URL obtained by the obtaining module, where any second URL among the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1;
a transceiver module, configured to send N second requests to the server based on the N second URLs determined by the first determining module, and to receive N second page contents returned by the server for the N second requests, where each second request includes one second URL and corresponds to one second page content;
a second determining module, configured to determine, when M second page contents among the N second page contents match the first page content, that the file corresponding to the first URL is a target file, where M is an integer greater than or equal to 1 and M is less than or equal to N.
In a third aspect, an embodiment of the present application provides a terminal, including a processor, an input device, an output device, and a memory that are connected to each other. The memory is used to store a computer program that supports the terminal in performing the above method; the computer program includes program instructions, and the processor is configured to call the program instructions to perform the file detection method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions which, when executed by a processor, cause the processor to perform the file detection method of the first aspect.
The embodiments of the present application judge whether the file corresponding to a URL is a target file by comparing whether the page contents returned for several different requests are consistent, which can improve the accuracy of sensitive file detection and thereby reduce false positives for sensitive files.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a file detection method according to an embodiment of the present application;
FIG. 2 is another schematic flowchart of a file detection method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a file detection device according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a terminal according to an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
The embodiments of the present application may be applied in penetration testing of a website or web server. Penetration testing evaluates the security of a computer network system by simulating hacker attacks. A penetration test of a website or web server usually requires comprehensive information gathering, including scanning the website or web server for sensitive files such as configuration files and log files. This is because, on a website or web server, a sensitive file is a file that contains sensitive information and must not be directly accessible to arbitrary users. If the sensitive files on the website or web server can be accessed by arbitrary users, hackers may use the information in those files to attack the website or web server, which may paralyze it or even cause property losses for its users.
The embodiments of the present application compare the page content returned for a first request containing a first URL with the page content returned for a second request containing a second URL (a URL that does not exist on the website or web server) to judge whether a sensitive file on the website or web server can be accessed by arbitrary users. If the two match, the sensitive file can be accessed by arbitrary users, and the file corresponding to the first URL can be determined to be a sensitive file, which improves the accuracy of sensitive file detection and reduces false positives. After a sensitive file is found, the permissions of the file corresponding to the first URL can be modified, or the file deleted, in time, which improves the security of the website or web server.
The file detection method and device provided in the embodiments of the present application are described below with reference to FIG. 1 to FIG. 4.
The first request and the second request in the embodiments of the present application may be requests of the same type; for example, if the first request is a first http request, the second request is a second http request. The embodiments of the present application use a first http request and a second http request as examples.
FIG. 1 is a schematic flowchart of a file detection method according to an embodiment of the present application. As shown in FIG. 1, the file detection method may include the following steps:
S101. When the status code returned by the server for the first request is the target status code, the terminal acquires the first page content returned for the first request.
In some feasible implementations, the terminal may use a sensitive file scanning tool such as BBScan, Dirsearch, or Opendoor to scan URLs on the server. Scanning may consist of using the tool to send the server an http request containing a URL; after receiving the http request, the server may send the terminal the http status code corresponding to that URL, such as status code 200 (the server has successfully processed the request, meaning the web page can be accessed normally), status code 404 (the server cannot find the requested web page), or status code 403 (the server refuses the request). When the status code returned by the server for the first http request is the target status code 200, the terminal does not yet conclude that the first URL carried in the first http request can be accessed by arbitrary users; instead, it may receive the first page content returned by the server for the first http request. The first http request includes the first URL.
S102. The terminal determines N second URLs according to the first URL.
In some feasible implementations, the structure of a URL is usually "protocol://server name (IP address)/path/file name". According to a preset rule (such as adding characters to the file name part of the first URL), the terminal may apply at least one distinct character transformation to the file name part of the first URL, and/or at least one distinct modification to the file suffix of the first URL, to obtain a total of N different second URLs. Any of the N different second URLs may be a URL that does not exist on the server, and N is an integer greater than or equal to 1. Each character transformation yields one second URL, and each suffix modification yields one second URL. The added characters may be one or more of digits, letters (upper or lower case), and special characters, which is not limited in the embodiments of the present application. The terminal may transform the first URL automatically according to the preset rule to obtain multiple different second URLs (which are in fact non-existent URLs), which reduces manual processing and improves the efficiency of file detection.
For example, suppose the first URL is http://xxx.pingan.com/conf/config.inc and the status code corresponding to the first URL is 200. The terminal may add the single character "m" at the head of the file name of the first URL to obtain one second URL, http://xxx.pingan.com/conf/mconfig.inc. The terminal may also change the file suffix ".inc" of the first URL to a non-existent suffix ".inu" to obtain another second URL, http://xxx.pingan.com/conf/mconfig.inu.
S103. The terminal sends N second requests to the server based on the N second URLs, and receives N second page contents returned by the server for the N second requests.
In some feasible implementations, the terminal may send N second http requests to the server based on the N second URLs determined above. After receiving the second http requests, the server returns, for each second http request, the second page content corresponding to that request, and the terminal may receive the N second page contents returned by the server for the N second http requests. Each second http request includes one of the second URLs determined above and corresponds to one second page content. For example, assuming N = 3, the three second URLs may be denoted URL-a, URL-b, and URL-c. The terminal may send the server a second http request a1 containing URL-a, a second http request b1 containing URL-b, and a second http request c1 containing URL-c. The server receives the second http requests a1, b1, and c1, and returns second page content A1 for a1, second page content B1 for b1, and second page content C1 for c1. The terminal receives the three second page contents A1, B1, and C1 returned by the server.
In some feasible implementations, each time the terminal determines one second URL from the first URL, it may send the server a second http request containing that second URL, receive the second page content returned by the server for that request, and temporarily store the second page content corresponding to that second URL; repeating this N times yields the N second page contents corresponding to the N second URLs. Alternatively, the terminal may first determine all N second URLs from the first URL, then send the N second http requests to the server together and receive the second page content returned for each of them, obtaining N second page contents in total.
S104. If M second page contents among the N second page contents match the first page content, the terminal determines that the file corresponding to the first URL is the target file.
In some feasible implementations, the terminal may check whether each of the N received second page contents is the same as the first page content acquired above. If M of the N second page contents are the same as the first page content, the terminal may determine that the file corresponding to the first URL is a target file (a sensitive file). If none of the N second page contents is the same as the first page content, the terminal may determine that the file corresponding to the first URL is not a sensitive file. By comparing whether the first page content and the second page contents match, the embodiments of the present application can improve the accuracy of file detection and reduce false positives. The reasoning is as follows. If the file corresponding to the first URL is a sensitive file, the first URL should not be accessible to users (that is, it behaves as a non-existent page link), so the status code for the first URL should be one that indicates the URL cannot be accessed, such as 404 or 403. But because some websites or servers customize the 404 page, the first http request containing the first URL returns status code 200. The first URL is then transformed several times (by adding characters at the head of the file name, the tail of the file name, and/or the tail of the file suffix) to obtain several non-existent second URLs (page links that do not exist on the website or server). Because every second URL obtained by transforming the first URL is a non-existent page link, and the first URL is also a non-existent page link, at least one of the second page contents corresponding to the second URLs must be the same as the first page content corresponding to the first URL. Conversely, if the first URL is itself a page link that arbitrary users may access, its status code is 200 whether or not the website or server customizes the 404 page; every second URL obtained by transforming it is a non-existent page link while the first URL is a normal page link, so none of the second page contents is the same as the first page content. M may be an integer greater than or equal to 1, and M may be less than or equal to N.
For example, with N = 3 and M = 1, the terminal may check whether each of the three second page contents A1, B1, and C1 is the same as the first page content (which in this example is identical to B1). Once the terminal finds that the second page content B1 is the same as the first page content, it need not check whether the remaining second page content C1 is also the same, and can directly determine that the file corresponding to the first URL, http://xxx.pingan.com/conf/config.inc, is a sensitive file. By stopping the search as soon as M second page contents identical to the first page content have been found among the N second page contents, the embodiments of the present application can improve the processing efficiency of file detection.
In some feasible implementations, the terminal may extract a first feature from the first page content. The first feature may be key characters in the first page content, such as its title or its first 50 characters. The terminal may also extract one second feature from each of the N second page contents, obtaining N second features in total. The second feature may be key characters in the second page content, and the second feature and the first feature are of the same kind: if the first feature is the title of the first page content, the second feature is the title of the second page content; if the first feature is the first 50 characters of the first page content, the second feature is the first 50 characters of the second page content. The terminal may check whether the first feature is the same as each of the N second features. If M of the N second features are the same as the first feature, the terminal may determine that the file corresponding to the first URL is a sensitive file. If none of the N second features is the same as the first feature, the terminal may determine that the file corresponding to the first URL is not a sensitive file. M may be an integer greater than or equal to 1, and M may be less than or equal to N. Because the embodiments of the present application judge whether the file corresponding to the first URL is a sensitive file by checking whether any of the N second features is the same as the first feature, the amount of computation is small and the processing efficiency is high.
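The feature comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names `extract_feature` and `features_match`, the regular expression used to find an HTML `<title>`, and the choice to treat a missing title as "no match" are all hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumption: page content is HTML and the "title" feature is the <title> text.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_feature(content: str, kind: str = "title") -> Optional[str]:
    """Extract one feature; 'title' or 'prefix' (first 50 characters)."""
    if kind == "title":
        match = _TITLE_RE.search(content)
        return match.group(1).strip() if match else None
    if kind == "prefix":
        return content[:50]
    raise ValueError(f"unknown feature kind: {kind!r}")


def features_match(first_page: str, second_pages: Iterable[str],
                   kind: str = "title", m: int = 1) -> bool:
    """True if at least m second features equal the first feature.

    The same kind of feature is extracted from every page, as the text requires.
    """
    first = extract_feature(first_page, kind)
    if first is None:  # hypothetical choice: no title means nothing to compare
        return False
    hits = sum(1 for page in second_pages if extract_feature(page, kind) == first)
    return hits >= m
```

Comparing titles rather than whole pages tolerates custom not-found pages that echo the requested path in the body, which exact full-page comparison would treat as different.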
In the embodiments of the present application, when the status code returned by the server for the first request is the target status code, the first page content returned for the first request is acquired, where the first page content includes the first URL; N second URLs are then determined according to the first URL; N second requests are sent to the server based on the N second URLs, and N second page contents returned by the server for the N second requests are received; and if M second page contents among the N second page contents match the first page content, the file corresponding to the first URL is determined to be the target file. This can improve the accuracy of sensitive file detection and thereby reduce false positives for sensitive files.
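Steps S101 to S104 can be put together in a short sketch. It assumes a caller-supplied `fetch` function that returns a `(status_code, page_content)` pair for a URL; that function, the name `detect_target_file`, and the `example.test` URLs in the usage note are hypothetical and stand in for whatever http client and site the terminal actually uses.

```python
from typing import Callable, Iterable, Tuple

# Assumption: fetch(url) performs the http request and returns (status, body).
Fetch = Callable[[str], Tuple[int, str]]


def detect_target_file(first_url: str, second_urls: Iterable[str], fetch: Fetch,
                       target_status: int = 200, m: int = 1) -> bool:
    """Return True if the file behind first_url is judged to be a target file."""
    # S101: only continue when the first request returns the target status code.
    status, first_page = fetch(first_url)
    if status != target_status:
        return False  # the method is not triggered for other status codes

    # S103/S104: request each second URL and compare page contents,
    # stopping as soon as m matching second pages have been found.
    matches = 0
    for url in second_urls:
        _, second_page = fetch(url)
        if second_page == first_page:
            matches += 1
            if matches >= m:
                return True
    return False
```

For instance, with a fake `fetch` that answers every unknown URL with status 200 and the same custom not-found page, `detect_target_file("http://example.test/conf/config.inc", ["http://example.test/conf/xconfig.inc"], fetch)` returns `True`, whereas a first URL that serves distinct real content returns `False`.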
FIG. 2 is another schematic flowchart of a file detection method according to an embodiment of the present application. As shown in FIG. 2, the file detection method may include the following steps:
S201. When the status code returned by the server for the first request is the target status code, the terminal acquires the first page content returned for the first request.
For the implementation of step S201 in this embodiment, refer to the implementation of step S101 in the embodiment shown in FIG. 1; details are not repeated here.
S202,终端在第一URL的文件名部分中的任意N个不同位置进行N次字符增加,以得到N个第二URL。S202. The terminal adds characters N times at any N different positions in the file name part of the first URL to obtain N second URLs.
在一些可行的实施方式中,终端可以在上述第一URL的文件名部分中的任意N个不同位置进行N次字符增加,以得到N个第二URL。其中,在该第一URL的文件名部分中任一位置增加一个或者多个字符为该第一URL的文件名部分的一次字符变换。N个不同的位置可以分别为上述第一URL的文件名头部、文件名尾部、文件后缀名尾部。在一个位置可以增加一个或多个字符,每个位置中增加的字符可以不相同,也可以相同。字符的类型 可以为数字、字母(包括大小写)、以及特殊字符中的一种或多种,本申请实施例不做限定。本申请实施例通过在第一URL文件名部分中的不同位置增加字符来得到不同的第二URL(实际上第二URL是不存在的URL),操作简单,处理方便,并且可以提高制造不存在URL的准确性,从而在后续判断中提高检测的准确性。In some feasible implementation manners, the terminal may perform N character additions at any N different positions in the file name part of the first URL to obtain N second URLs. Wherein, adding one or more characters at any position in the file name part of the first URL is a character transformation of the file name part of the first URL. The N different positions may be the file name header, file name tail, and file suffix name tail of the first URL, respectively. One or more characters can be added in one position, and the characters added in each position can be different or the same. The type of the character may be one or more of numbers, letters (including uppercase and lowercase), and special characters, which are not limited in the embodiments of the present application. The embodiment of the present application obtains different second URLs by adding characters in different positions in the first URL file name portion (actually, the second URL is a non-existent URL). The operation is simple, the processing is convenient, and manufacturing can be improved. The accuracy of the URL, thereby improving the accuracy of detection in subsequent judgments.
Assume that the first URL is http://xxx.pingan.com/conf/config.inc and that the status code corresponding to the first URL is 200. The terminal may add a character "x" at the head of the file name of the first URL to obtain one second URL, http://xxx.pingan.com/conf/xconfig.inc, and add a character "y" at the tail of the file name to obtain another second URL, http://xxx.pingan.com/conf/configy.inc. The terminal may also add a character "1" at the tail of the file suffix, which changes the file suffix of the first URL to ".inc1", a suffix that does not exist; the resulting second URL, http://xxx.pingan.com/conf/config.inc1, is therefore also a non-existent URL. Adding characters at the tail of the file suffix of the first URL is thus also a way of modifying the file suffix.
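The three transformations in this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `make_second_urls` and its parameters are hypothetical and do not appear in the application, and the sketch covers only the three positions named above (head of the file name, tail of the file name, tail of the file suffix).

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def make_second_urls(first_url, head="x", tail="y", suffix_tail="1"):
    """Derive three second URLs from first_url by adding characters.

    Hypothetical helper: the added characters default to the ones used
    in the example ("x", "y", "1"), but any characters could be used.
    """
    parts = urlsplit(first_url)
    directory, filename = posixpath.split(parts.path)
    stem, suffix = posixpath.splitext(filename)  # "config", ".inc"

    new_names = [
        head + stem + suffix,         # character added at the file name head
        stem + tail + suffix,         # character added at the file name tail
        stem + suffix + suffix_tail,  # character added at the suffix tail
    ]
    return [
        urlunsplit(parts._replace(path=posixpath.join(directory, name)))
        for name in new_names
    ]


if __name__ == "__main__":
    for url in make_second_urls("http://xxx.pingan.com/conf/config.inc"):
        print(url)
```

Each returned URL differs from the first URL only in its file name part, so the directory, host, and scheme of the original request are preserved.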
S203. The terminal sends N second requests to the server based on the N second URLs, and receives N second page contents returned by the server for the N second requests.
For the implementation of step S203 in this embodiment, refer to the implementation of step S103 in the embodiment shown in FIG. 1; details are not repeated here.
S204. The terminal obtains the similarity value between each of the N second page contents and the first page content, to obtain N similarity values.
S205. If M of the N similarity values are greater than the similarity threshold, the terminal determines that the file corresponding to the first URL is the target file.
In some feasible implementations, the terminal may use a page similarity algorithm, such as the locality-sensitive hash algorithm simhash or the min-hash algorithm minhash, to calculate the similarity value between each of the N second page contents and the first page content, obtaining N similarity values. The terminal then compares each of the N similarity values with a preset similarity threshold. If M of the N similarity values are greater than the preset similarity threshold, the terminal may determine that the file corresponding to the first URL is a sensitive file. If all N similarity values are less than or equal to the preset similarity threshold, the terminal may determine that the file corresponding to the first URL is not a sensitive file. M may be an integer greater than or equal to 1 and less than or equal to N. By calculating the similarity value between the first page content and each second page content and comparing it with the preset similarity threshold, this embodiment can exclude the influence of minor differences between the first page content and the second page content on the detection result, which further improves the accuracy of file detection.
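As a rough illustration of how a simhash-based page similarity value might be computed, the following Python sketch builds a 64-bit fingerprint from word tokens and converts the Hamming distance between two fingerprints into a value between 0 and 1. The application names simhash but does not specify tokenization, hash function, fingerprint width, or how a distance becomes a similarity value; all of those choices here (word tokens, MD5, 64 bits, `1 - distance / 64`) are assumptions, and the function names are hypothetical.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed width; the application does not specify one


def simhash(text):
    """Return a 64-bit simhash fingerprint of text (word-level tokens)."""
    weights = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(FINGERPRINT_BITS) if weights[bit] > 0)


def page_similarity(first_page, second_page):
    """Similarity in [0, 1]: 1 minus the normalized Hamming distance."""
    distance = bin(simhash(first_page) ^ simhash(second_page)).count("1")
    return 1.0 - distance / FINGERPRINT_BITS
```

Identical page contents yield a similarity of 1.0, and pages that differ in only a few tokens tend to yield fingerprints that differ in only a few bits.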
For example, let N = 3 and M = 1. The three second page contents are denoted A1, B1, and C1, the first page content is denoted F1, and the preset similarity threshold is 90%. The terminal uses simhash to calculate the similarity between A1 and F1, between B1 and F1, and between C1 and F1, obtaining 67%, 80%, and 94% respectively. The terminal compares these three similarity values with the preset threshold of 90%. Because the similarity between C1 and F1 (94%) is greater than the threshold (90%), the terminal determines that the file corresponding to the first URL is a sensitive file.
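The decision rule of step S205, applied to the numbers in this example, can be written as a short Python sketch. The function name `is_target_file` and its parameter names are hypothetical; the rule itself (at least M of the N similarity values strictly greater than the threshold) follows the text above.

```python
def is_target_file(similarity_values, threshold=0.90, m=1):
    """Return True if at least m similarity values exceed the threshold."""
    hits = sum(1 for value in similarity_values if value > threshold)
    return hits >= m


if __name__ == "__main__":
    # Similarities of A1, B1, C1 to F1 from the example above.
    print(is_target_file([0.67, 0.80, 0.94], threshold=0.90, m=1))
```

With the example values, only the 94% similarity exceeds the 90% threshold, which is enough when M = 1. A value exactly equal to the threshold does not count, since the text requires "greater than".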
In some feasible implementations, after determining that the file corresponding to the first URL is a sensitive file, the terminal may output alarm prompt information. The alarm prompt information includes the first URL and is used to notify a target user that the file corresponding to the first URL is a sensitive file, so that the target user can handle that file. This helps prevent malicious access by hackers and attacks on the server caused by leakage of sensitive file information, which could in turn cause property loss for users of the server. For example, after the terminal outputs the alarm prompt information, a programmer may delete the file corresponding to the first URL, or set the permission for accessing the first URL to the highest level, that is, allow no one to access it.
In this embodiment, when the status code returned by the server for the first request is the target status code, the terminal obtains the first page content returned for the first request, where the first request includes the first URL. The terminal then performs N different character transformations on the file name part of the first URL to obtain N second URLs, sends N second requests to the server based on the N second URLs, and receives the N second page contents returned by the server for the N second requests. It obtains the similarity value between each of the N second page contents and the first page content, obtaining N similarity values; if M of the N similarity values are greater than the similarity threshold, it determines that the file corresponding to the first URL is the target file. This not only reduces false positives for sensitive files, but also excludes the influence of minor differences between page contents on the detection result, further improving the accuracy of file detection.
FIG. 3 is a schematic block diagram of a file detection apparatus provided by an embodiment of this application. The file detection apparatus of this embodiment includes:
An obtaining module 10, configured to obtain, when the status code returned by the server for a first request is the target status code, the first page content returned for the first request. The first request includes a first uniform resource locator (URL).
A first determining module 20, configured to determine N second URLs according to the first URL obtained by the obtaining module 10. Any one of the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1.
A transceiver module 30, configured to send N second requests to the server based on the N second URLs determined by the first determining module 20, and to receive the N second page contents returned by the server for the N second requests. Each second request includes one second URL and corresponds to one second page content.
A second determining module 40, configured to determine that the file corresponding to the first URL is the target file when M of the N second page contents match the first page content. M is an integer greater than or equal to 1 and less than or equal to N.
In some feasible implementations, the first determining module 20 is configured to perform N different character transformations on the file name part of the first URL obtained by the obtaining module 10 to obtain N second URLs. Each character transformation yields one second URL.
In some feasible implementations, the first determining module 20 is configured to add characters N times at any N different positions in the file name part of the first URL obtained by the obtaining module 10 to obtain N second URLs. Adding one or more characters at any one position in the file name part of the first URL counts as one character transformation of that file name part.
In some feasible implementations, the first determining module 20 is configured to make N different modifications to the file suffix of the first URL obtained by the obtaining module 10 to obtain N second URLs. Each modification yields one second URL.
In some feasible implementations, the second determining module 40 includes a first obtaining unit 401, a detecting unit 402, and a first determining unit 403. The first obtaining unit 401 is configured to obtain a first feature from the first page content and to obtain N second features from the N second page contents, where each second page content corresponds to one second feature. The detecting unit 402 is configured to detect whether the first feature obtained by the first obtaining unit 401 matches each of the N second features. The first determining unit 403 is configured to determine that the file corresponding to the first URL is the target file when M of the N second features match the first feature.
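The division of work among units 401, 402, and 403 can be illustrated with a small Python sketch. This passage does not say what the "feature" of a page content is, so the sketch assumes, purely for illustration, that the feature is the text of the page's `<title>` element and that two features match when they are equal. The function names are hypothetical.

```python
import re


def extract_feature(page_content):
    """Role of obtaining unit 401 (assumed feature: the <title> text)."""
    match = re.search(r"<title>(.*?)</title>", page_content,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def is_target_by_feature(first_page, second_pages, m=1):
    """Roles of detecting unit 402 and determining unit 403."""
    first_feature = extract_feature(first_page)
    if first_feature is None:
        return False  # no feature to compare against
    # Unit 402: check each second feature against the first feature.
    matches = sum(
        1 for page in second_pages if extract_feature(page) == first_feature
    )
    # Unit 403: the target-file decision requires at least m matches.
    return matches >= m
```

Any other page feature, such as a characteristic string in the body, could be substituted by replacing `extract_feature`.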
In some feasible implementations, the second determining module 40 includes a second obtaining unit 404 and a second determining unit 405. The second obtaining unit 404 is configured to obtain the similarity value between each of the N second page contents and the first page content, obtaining N similarity values. The second determining unit 405 is configured to determine that the file corresponding to the first URL is the target file when M of the N similarity values are greater than the similarity threshold.
In some feasible implementations, the second obtaining unit 404 is specifically configured to use a page similarity algorithm to calculate the similarity value between each of the N second page contents and the first page content, obtaining N similarity values.
In some feasible implementations, the transceiver module 30 is further configured to output alarm prompt information after the file corresponding to the first URL is determined to be the target file. The alarm prompt information includes the first URL and is used to indicate that the file corresponding to the first URL is a sensitive file.
In a specific implementation, the file detection apparatus may, through the modules described above, perform the steps of the implementations provided in FIG. 1 or FIG. 2 and realize the functions of the foregoing embodiments. For details, refer to the corresponding descriptions of the steps in the method embodiments shown in FIG. 1 or FIG. 2; they are not repeated here.
In this embodiment, when the status code returned by the server for the first request is the target status code, the file detection apparatus obtains the first page content returned for the first request, where the first request includes the first URL. It then determines N second URLs according to the first URL, sends N second requests to the server based on the N second URLs, and receives the N second page contents returned by the server for the N second requests. If M of the N second page contents match the first page content, it determines that the file corresponding to the first URL is the target file. This improves the accuracy of sensitive file detection and thereby reduces false positives for sensitive files.
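The flow through modules 10, 20, 30, and 40 can be summarized in a Python sketch. To keep it self-contained, the network access, the URL derivation, and the page matching are passed in as functions; the names `detect_target_file`, `fetch`, `derive_second_urls`, and `pages_match` are hypothetical and are not part of the application. The default target status code of 200 is taken from the example in this description.

```python
from typing import Callable, List, Tuple


def detect_target_file(
    first_url: str,
    fetch: Callable[[str], Tuple[int, str]],         # URL -> (status, page)
    derive_second_urls: Callable[[str], List[str]],  # first URL -> N second URLs
    pages_match: Callable[[str, str], bool],         # (first page, second page)
    m: int = 1,
    target_status: int = 200,
) -> bool:
    # Obtaining module 10: keep the first page only for the target status.
    status, first_page = fetch(first_url)
    if status != target_status:
        return False

    # First determining module 20: derive the N second URLs.
    second_urls = derive_second_urls(first_url)

    # Transceiver module 30: send N second requests, collect N page contents.
    second_pages = [fetch(url)[1] for url in second_urls]

    # Second determining module 40: count matches and apply the M rule.
    matches = sum(1 for page in second_pages if pages_match(first_page, page))
    return matches >= m
```

In practice `fetch` would wrap an HTTP client, and `pages_match` could be either a feature comparison or a similarity-threshold comparison, corresponding to the two variants of the second determining module described above.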
FIG. 4 is a schematic block diagram of a terminal provided by an embodiment of this application. As shown in FIG. 4, the terminal in this embodiment may include one or more processors 801, one or more input devices 802, one or more output devices 803, and a memory 804. The processor 801, the input device 802, the output device 803, and the memory 804 are connected through a bus 805. The memory 804 is configured to store a computer program that includes program instructions, and the processor 801 is configured to execute the program instructions stored in the memory 804. The input device 802 is configured to receive the status code returned by the server for the first request, and the processor 801 is configured to call the program instructions to perform:
when the status code returned by the server for the first request is the target status code, obtaining the first page content returned for the first request, where the first request includes a first uniform resource locator (URL);
determining N second URLs according to the first URL, where any one of the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1.
The output device 803 is configured to send N second requests to the server based on the N second URLs, and the input device 802 is configured to receive the N second page contents returned by the server for the N second requests. Each second request includes one second URL and corresponds to one second page content.
The processor 801 is further configured to call the program instructions to determine, when M of the N second page contents match the first page content, that the file corresponding to the first URL is the target file. M is an integer greater than or equal to 1 and less than or equal to N.
It should be understood that, in the embodiments of this application, the processor 801 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The input device 802 may include a receiver, a receiving program interface, and the like, and the output device 803 may include a transmitter, a sending program interface, and the like.
The memory 804 may include a read-only memory and a random access memory, and provides instructions and data to the processor 801. A part of the memory 804 may also include a non-volatile random access memory. For example, the memory 804 may also store device type information.
In a specific implementation, the processor 801, the input device 802, and the output device 803 described in the embodiments of this application may perform the implementations described for the file detection method provided in the embodiments of this application, and may also perform the implementations of the file detection apparatus described in the embodiments of this application; details are not repeated here.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions; when executed by a processor, the program instructions implement the file detection method shown in FIG. 1 or FIG. 2. For details, refer to the description of the embodiments shown in FIG. 1 or FIG. 2; they are not repeated here.
The computer-readable storage medium may be an internal storage unit of the file detection apparatus or terminal described in any of the foregoing embodiments, such as a hard disk or memory of the terminal. The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the terminal. Further, the computer-readable storage medium may include both an internal storage unit of the terminal and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the terminal. It may also be used to temporarily store data that has been output or is to be output.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited to them. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

  1. A file detection method, comprising:
    when the status code returned by a server for a first request is a target status code, obtaining first page content returned for the first request, the first request including a first uniform resource locator (URL);
    determining N second URLs according to the first URL, wherein any one of the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1;
    sending N second requests to the server based on the N second URLs, and receiving N second page contents returned by the server for the N second requests, wherein each second request includes one second URL and corresponds to one second page content;
    if M of the N second page contents match the first page content, determining that the file corresponding to the first URL is a target file, wherein M is an integer greater than or equal to 1 and less than or equal to N.
  2. The method according to claim 1, wherein determining N second URLs according to the first URL comprises:
    performing N different character transformations on the file name part of the first URL to obtain N second URLs;
    wherein each character transformation yields one second URL.
  3. The method according to claim 2, wherein performing N different character transformations on the file name part of the first URL comprises:
    adding characters N times at any N different positions in the file name part of the first URL to obtain N second URLs;
    wherein adding one or more characters at any one position in the file name part of the first URL is one character transformation of the file name part of the first URL.
  4. The method according to any one of claims 1-3, wherein, if M of the N second page contents match the first page content, determining that the file corresponding to the first URL is a target file comprises:
    obtaining a first feature from the first page content, and obtaining N second features from the N second page contents, each second page content corresponding to one second feature;
    detecting whether the first feature matches each of the N second features;
    if M of the N second features match the first feature, determining that the file corresponding to the first URL is a target file.
  5. The method according to any one of claims 1-3, wherein, if M of the N second page contents match the first page content, determining that the file corresponding to the first URL is a target file comprises:
    obtaining the similarity value between each of the N second page contents and the first page content to obtain N similarity values;
    if M of the N similarity values are greater than a similarity threshold, determining that the file corresponding to the first URL is a target file.
  6. The method according to claim 5, wherein obtaining the similarity value between each of the N second page contents and the first page content to obtain N similarity values comprises:
    using a page similarity algorithm to calculate the similarity value between each of the N second page contents and the first page content to obtain N similarity values.
  7. The method according to any one of claims 1-6, wherein, after determining that the file corresponding to the first URL is a target file, the method further comprises:
    outputting alarm prompt information, the alarm prompt information including the first URL and being used to indicate that the file corresponding to the first URL is a sensitive file.
  8. A file detection apparatus, comprising:
    an obtaining module, configured to obtain, when the status code returned by a server for a first request is a target status code, first page content returned for the first request, the first request including a first uniform resource locator (URL);
    a first determining module, configured to determine N second URLs according to the first URL obtained by the obtaining module, wherein any one of the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1;
    a transceiver module, configured to send N second requests to the server based on the N second URLs determined by the first determining module, and to receive N second page contents returned by the server for the N second requests, wherein each second request includes one second URL and corresponds to one second page content;
    a second determining module, configured to determine that the file corresponding to the first URL is a target file when M of the N second page contents match the first page content, wherein M is an integer greater than or equal to 1 and less than or equal to N.
  9. The apparatus according to claim 8, wherein the first determining module is configured to:
    perform N different character transformations on the file name part of the first URL obtained by the obtaining module to obtain N second URLs;
    wherein each character transformation yields one second URL.
  10. The apparatus according to claim 8, wherein the first determining module is configured to:
    add characters N times at any N different positions in the file name part of the first URL obtained by the obtaining module to obtain N second URLs;
    wherein adding one or more characters at any one position in the file name part of the first URL is one character transformation of the file name part of the first URL.
  11. The apparatus according to any one of claims 8-10, wherein the second determining module comprises:
    a first obtaining unit, configured to obtain a first feature from the first page content and to obtain N second features from the N second page contents, each second page content corresponding to one second feature;
    a detecting unit, configured to detect whether the first feature obtained by the first obtaining unit matches each of the N second features;
    a first determining unit, configured to determine that the file corresponding to the first URL is a target file when M of the N second features match the first feature.
  12. The apparatus according to any one of claims 8-10, wherein the second determining module comprises:
    a second obtaining unit, configured to obtain the similarity value between each of the N second page contents and the first page content to obtain N similarity values;
    a second determining unit, configured to determine that the file corresponding to the first URL is a target file when M of the N similarity values are greater than a similarity threshold.
  13. The apparatus according to claim 12, wherein the second obtaining unit is specifically configured to:
    use a page similarity algorithm to calculate the similarity value between each of the N second page contents and the first page content to obtain N similarity values.
  14. The apparatus according to any one of claims 8-13, wherein the transceiver module is further configured to output alarm prompt information after the file corresponding to the first URL is determined to be a target file, the alarm prompt information including the first URL and being used to indicate that the file corresponding to the first URL is a sensitive file.
  15. A terminal, comprising a processor, an input device, an output device, and a memory that are connected to one another, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform:
    when the status code returned by a server for a first request is a target status code, obtaining first page content returned for the first request, the first request including a first uniform resource locator (URL);
    determining N second URLs according to the first URL, wherein each of the N second URLs is a URL obtained by transforming the first URL, and N is an integer greater than or equal to 1;
    wherein the output device is configured to send N second requests to the server based on the N second URLs, and the input device is configured to receive N second page contents returned by the server for the N second requests, each second request including one second URL and corresponding to one second page content; and
    the processor is further configured to invoke the program instructions to determine that the file corresponding to the first URL is a target file when M of the N second page contents match the first page content, wherein M is an integer greater than or equal to 1 and less than or equal to N.
  16. The terminal according to claim 15, wherein the processor is specifically configured to:
    perform N different character transformations on the file name portion of the first URL to obtain the N second URLs;
    wherein each character transformation yields one second URL.
  17. The terminal according to claim 16, wherein the processor is specifically configured to:
    perform N character additions at any N different positions in the file name portion of the first URL to obtain the N second URLs;
    wherein adding one or more characters at any one position in the file name portion of the first URL constitutes one character transformation of the file name portion of the first URL.
  18. The terminal according to any one of claims 15 to 17, wherein the processor is specifically configured to:
    obtain a first feature from the first page content, and obtain N second features from the N second page contents, each second page content corresponding to one second feature;
    detect whether the first feature matches each of the N second features; and
    if M of the N second features match the first feature, determine that the file corresponding to the first URL is the target file.
  19. The terminal according to any one of claims 15 to 17, wherein the processor is specifically configured to:
    obtain a similarity value between each of the N second page contents and the first page content, yielding N similarity values; and
    if M of the N similarity values are greater than a similarity threshold, determine that the file corresponding to the first URL is the target file.
  20. A computer-readable storage medium storing a computer program, wherein the computer program comprises program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 7.
PCT/CN2018/108711 2018-06-30 2018-09-29 File detection method and apparatus WO2020000748A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810704707.9A CN108984673B (en) 2018-06-30 2018-06-30 File detection method and device
CN201810704707.9 2018-06-30

Publications (1)

Publication Number Publication Date
WO2020000748A1 (en) 2020-01-02

Family

ID=64539194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108711 WO2020000748A1 (en) 2018-06-30 2018-09-29 File detection method and apparatus

Country Status (2)

Country Link
CN (1) CN108984673B (en)
WO (1) WO2020000748A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101242279A (en) * 2008-03-07 2008-08-13 北京邮电大学 Automatic penetration testing system and method for WEB system
US20120124372A1 (en) * 2010-10-13 2012-05-17 Akamai Technologies, Inc. Protecting Websites and Website Users By Obscuring URLs
CN103685189A (en) * 2012-09-17 2014-03-26 百度在线网络技术(北京)有限公司 Website security evaluation method and system
CN103685290A (en) * 2013-12-19 2014-03-26 南京理工大学连云港研究院 Vulnerability scanning system based on GHDB

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000165969A (en) * 1998-11-05 2000-06-16 Linak As Remote control system, adjustable household appliance and drive actuator
CN101741643B (en) * 2009-12-24 2011-09-28 北京云快线软件服务有限公司 Content delivery network node detecting method and system
CN103731493B (en) * 2013-12-31 2017-10-24 优视科技有限公司 Page transmission method, apparatus and system
CN107508903B (en) * 2017-09-07 2020-06-16 维沃移动通信有限公司 Webpage content access method and terminal equipment

Also Published As

Publication number Publication date
CN108984673A (en) 2018-12-11
CN108984673B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109768992B (en) Webpage malicious scanning processing method and device, terminal device and readable storage medium
KR101724307B1 (en) Method and system for detecting a malicious code
CN110888838B (en) Request processing method, device, equipment and storage medium based on object storage
US9614866B2 (en) System, method and computer program product for sending information extracted from a potentially unwanted data sample to generate a signature
US8448260B1 (en) Electronic clipboard protection
US8307276B2 (en) Distributed content verification and indexing
US10972507B2 (en) Content policy based notification of application users about malicious browser plugins
WO2017000439A1 (en) Detection method, system and device for malicious behaviour, and computer storage medium
WO2013143403A1 (en) Method and system for accessing website
US10122722B2 (en) Resource classification using resource requests
CN107347076B (en) SSRF vulnerability detection method and device
US20180234234A1 (en) System for describing and tracking the creation and evolution of digital files
WO2020000749A1 (en) Method and apparatus for detecting unauthorized vulnerabilities
US20150222649A1 (en) Method and apparatus for processing a webpage
CN106789849A (en) CC attack recognitions method, node and system
US11550920B2 (en) Determination apparatus, determination method, and determination program
US20150019632A1 (en) Server-based system, method, and computer program product for scanning data on a client using only a subset of the data
WO2020073493A1 (en) Sql injection vulnerability detection method, apparatus and device, and readable storage medium
CN112152993A (en) Method and device for detecting webpage hijacking, computer equipment and storage medium
WO2020019514A1 (en) Injection vulnerability detection method and apparatus
CN109145220B (en) Data processing method and device and electronic equipment
WO2020019515A1 (en) Injection vulnerability detection method and device
US8438637B1 (en) System, method, and computer program product for performing an analysis on a plurality of portions of potentially unwanted data each requested from a different device
WO2020000748A1 (en) File detection method and apparatus
CN113225348A (en) Request anti-replay verification method and device

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 18924386
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: PCT application non-entry in European phase
    Ref document number: 18924386
    Country of ref document: EP
    Kind code of ref document: A1