CN110460592B - URL analysis method, device, equipment and medium - Google Patents

Publication number
CN110460592B
Authority
CN
China
Prior art keywords
behavior
url
library
keyword
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910687531.5A
Other languages
Chinese (zh)
Other versions
CN110460592A (en
Inventor
李中帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGTONG TIANXIA NETWORK TECHNOLOGY Co.,Ltd.
Original Assignee
Guangtong Tianxia Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangtong Tianxia Network Technology Co ltd
Priority to CN201910687531.5A
Publication of CN110460592A
Application granted
Publication of CN110460592B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H04L63/0236 Filtering by address, protocol, port number or service, e.g. IP-address or URL
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a URL analysis method, which relates to the technical field of network security and addresses the inaccuracy of URL behavior analysis in the prior art. The method comprises the following steps: receiving URL data; filtering out URL data that poses a threat to obtain safe URL data; matching the safe URL data against known URLs in a behavior library; when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library; when the safe URL data fails to match, performing keyword analysis on the URL that failed to match and storing the analysis result in the behavior record library as the behavior record; and updating the behavior library according to the analysis result. The invention also discloses a URL analysis device, electronic equipment and a computer storage medium. By filtering out threat URLs, performing behavior analysis through keyword comparison, and updating the behavior library in real time, the invention improves the accuracy of behavior analysis.

Description

URL analysis method, device, equipment and medium
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a URL analysis method, apparatus, device, and medium.
Background
A traditional gateway provides internet access but cannot inspect the websites a user visits, so its level of security is low. As the volume of access grows, users often encounter unsafe websites, such as phishing websites and websites carrying Trojan horses or viruses, and are likely to be attacked or infected while visiting them. Intelligent gateways with a firewall function have therefore been proposed; they do not reduce the operating efficiency of the user's smart devices, and they protect all devices connected to the same gateway at once. While providing internet access, the intelligent gateway can record the user's footprints on the internet, of which the URL is one. The intelligent gateway can inspect the URL data of the websites the user visits and then either record potentially threatening websites to warn the user, or block access to websites carrying Trojans, which largely protects the user from threats and attacks. In addition, because the intelligent gateway analyzes the URL data, the user can review their online behavior through the intelligent gateway.
However, in current intelligent gateway URL analysis, threat detection and internet behavior analysis are implemented separately, or only one of them is implemented. When internet behavior analysis is performed, threatening and high-risk URL data, such as phishing websites and web pages carrying Trojans, is analyzed in the same way as safe websites, even though many of these websites were intercepted and never actually browsed by the user; the content of these dangerous URLs may also be used to build the user behavior database. Both effects make the results of internet behavior analysis inaccurate. In addition, the behavior library must be updated manually, which is inefficient, and URLs that are not in the behavior library are difficult to judge accurately.
Disclosure of Invention
In order to overcome the defects of the prior art, the first object of the present invention is to provide a URL analysis method that obtains an accurate URL behavior analysis result by performing threat analysis first and behavior analysis second, with keyword analysis for URLs that are not found in the behavior library.
The first object of the invention is achieved by the following technical scheme:
a URL analysis method comprising the steps of:
receiving URL data, and storing the URL data in a URL database;
matching the URL data against a threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely the unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
and updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
Further, updating a behavior library according to the target keyword, the unknown URL and the behavior category corresponding to the unknown URL in the analysis result, comprising the following steps:
adding the target keywords and the frequency numbers thereof into a behavior library;
and according to the updated keywords, recalculating the weight of each keyword, and updating the behavior library according to the weight.
Further, the unknown URLs also include URLs crawled at random by a crawler.
Further, the behavior library comprises a URL library and a keyword library; the URL library comprises known URLs and the behavior categories corresponding to the known URLs; the keyword library holds the keywords corresponding to the behavior categories and is divided by weight into a judgment keyword library and other keyword libraries; the other keyword libraries are divided, from high weight to low, into an undetermined keyword library and a non-judgment keyword library; the judgment keyword library stores one or more groups of judgment keywords and their frequencies for each behavior category; and the keyword analysis obtains the analysis result by matching against the judgment keyword library.
Further, the keyword analysis obtains the analysis result by matching against the judgment keyword library, which comprises the following steps:
randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequencies, recording those frequencies as first frequencies, and constructing a first array from the first frequencies;
counting the target keywords in the unknown URL whose weight is higher than the preset weight, together with their frequencies, recording those frequencies as second frequencies, and constructing a second array from the second frequencies;
comparing the similarity of the first array and the second array to obtain the similarity value of the unknown URL and the behavior category;
and calculating the similarity of the unknown URL and all behavior categories to obtain the analysis result, wherein the analysis result is the behavior category with the maximum similarity to the unknown URL data.
Further, counting the target keywords whose weight is higher than the preset weight in the unknown URL, and their frequencies, comprises the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
Further, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being the newly added threat URLs.
The second object of the invention is achieved by the following technical scheme:
a URL analysis device, comprising:
the acquisition module is used for receiving URL data and storing the URL data in a URL database;
the filtering module is used for matching the URL data against a threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in a threat record library;
the analysis module is used for matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely the unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
and the updating module is used for updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
The third object of the present invention is to provide an electronic device for carrying out the first object, comprising a processor, a storage medium, and a computer program stored in the storage medium, wherein the computer program, when executed by the processor, implements the above URL analysis method.
The fourth object of the present invention is to provide a computer-readable storage medium for carrying out the first object, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above URL analysis method.
Compared with the prior art, the invention has the beneficial effects that:
The method filters out threat URLs and performs behavior analysis only on the remaining safe URLs, which avoids the inaccurate results caused by analyzing threat URLs together with safe ones. Because matches against the threat library and the behavior library are recorded, threat records and behavior records can be queried directly. For URLs that are not in the behavior library, the corresponding behavior categories are obtained through keyword analysis, and the behavior library is updated in real time according to the analysis results, which further improves the accuracy of the analysis and removes the need to update the behavior library manually.
Drawings
FIG. 1 is a flowchart of a URL analysis method according to the first embodiment;
FIG. 2 is a flowchart of a keyword analysis method according to the third embodiment;
FIG. 3 is a block diagram showing the structure of a URL analysis apparatus according to the fifth embodiment;
FIG. 4 is a block diagram of the electronic apparatus of the sixth embodiment.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings, in which the description of the invention is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form other embodiments not shown in the following description.
Example one
The first embodiment provides a URL analysis method that first records threat URLs and then analyzes the behavior categories of the safe URLs, so as to obtain more accurate threat records and behavior records. Through keyword comparison, behavior categories can be obtained for all URLs, and the behavior library is updated from the comparison results, which replaces the manual updating of the URL behavior library.
Referring to fig. 1, a URL analyzing method includes the following steps:
S110, receiving URL data and storing the URL data in a URL database;
The received URL data is usually collected from a gateway box and mainly comprises the full path of the URL, the number of times the URL was accessed, the access time and the like. The data is pushed in real time to a designated topic, and the topic data is received in real time using Structured Streaming or a similar stream processing engine to obtain information such as the URLs accessed by the user.
S120, matching the URL data against a threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in a threat record library;
At query time, the threat history can be read directly from the threat record library.
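Step S120 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the threat library is modeled as a set of URL strings matched exactly, the threat record library as a plain list, and the function name `filter_threats` and the record fields `url` and `access_time` are hypothetical.

```python
from datetime import datetime, timezone


def filter_threats(url_records, threat_library, threat_record_library):
    """Return the safe records; log every record whose URL is a known threat.

    url_records: list of dicts with hypothetical keys "url" and "access_time".
    threat_library: set of threat URLs (exact-match lookup is an assumption).
    threat_record_library: list that receives one threat record per hit.
    """
    safe_records = []
    for record in url_records:
        if record["url"] in threat_library:
            # Threat URLs are recorded and kept out of behavior analysis.
            threat_record_library.append({
                "url": record["url"],
                "access_time": record["access_time"],
                "logged_at": datetime.now(timezone.utc).isoformat(),
            })
        else:
            safe_records.append(record)
    return safe_records
```

Only the returned safe records would be passed on to the behavior matching of S130, while the threat record library can be queried on its own.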
S130, matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely the unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
The unknown URLs also include URLs crawled at random by a crawler; adding randomly crawled URLs improves the efficiency with which the behavior library is updated.
When matching, the safe URL is compared with the URLs in the behavior library. For example, if the behavior category corresponding to the URL address "www.taobao.com" in the behavior library is "shopping", then every safe URL containing "www.taobao.com" is assigned the behavior category "shopping". Before matching, the safe URL usually needs to be normalized, for example by removing the URL protocol header.
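The normalization and matching just described can be illustrated as follows. This is a sketch under assumptions: the URL library is modeled as a dict from known URL to behavior category, matching is a substring test as in the "www.taobao.com" example, and the helper names `normalize_url` and `match_behavior` are hypothetical. Dropping the query string and lower-casing the host are additional normalization choices that the text does not prescribe.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Remove the protocol header and lower-case the host; keep the path."""
    # Prefix "//" when there is no scheme so urlsplit still finds the host.
    parts = urlsplit(url if "://" in url else "//" + url)
    return parts.netloc.lower() + parts.path


def match_behavior(url, url_library):
    """Return the category of the first known URL contained in the safe URL."""
    normalized = normalize_url(url)
    for known_url, category in url_library.items():
        if known_url in normalized:
            return category
    return None  # unknown URL: falls through to keyword analysis
```

A return value of `None` corresponds to the "matching fails" branch, where the URL is treated as an unknown URL.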
During keyword analysis, keywords are compared against the web page of the unknown URL. For example, if one group of keywords for the behavior category "shopping" in the behavior library is "Shunfeng" and "discount", and the target keywords are "Shunfeng", "discount" and "apple", the behavior category of the unknown URL can be determined to be "shopping". The target keywords are extracted according to set conditions: usually the keywords whose weight is higher than a preset weight are selected as target keywords; alternatively, the keywords are sorted by weight and a preset number of the highest-weighted ones are selected.
During query, the behavior history records can be directly queried through the behavior record library.
S140, updating the behavior library according to the target keywords, the unknown URL and the corresponding behavior category in the analysis result.
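One possible shape for the update in S140 is sketched below. The layout of the behavior library (a dict with a `urls` mapping and a per-category `keywords` counter) and the function name `update_behavior_library` are assumptions made for illustration; the recalculation of keyword weights is deliberately left out of this sketch.

```python
from collections import Counter


def update_behavior_library(behavior_library, unknown_url, category, target_keywords):
    """Record a newly classified URL and merge its keyword frequencies.

    behavior_library: {"urls": {url: category}, "keywords": {category: Counter}}
    target_keywords: dict mapping each target keyword to its frequency.
    """
    # The unknown URL becomes a known URL with the category from the analysis.
    behavior_library["urls"][unknown_url] = category
    # Add the target keywords and their frequencies to that category.
    counts = behavior_library["keywords"].setdefault(category, Counter())
    counts.update(target_keywords)
    return behavior_library
```

After this step, a later request for the same URL would match directly in the URL library instead of going through keyword analysis again.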
Example two
The second embodiment is an improvement on the first embodiment and mainly explains the behavior library and the calculation of weights.
To facilitate keyword analysis and behavior category matching, the behavior library comprises a URL library and a keyword library. The URL library comprises known URLs and their corresponding behavior categories, and the keyword library holds the keywords corresponding to each behavior category. The keyword library is divided by weight into a judgment keyword library and other keyword libraries; the other keyword libraries are further divided, from high weight to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores one or more groups of judgment keywords and their frequencies for each behavior category, and keyword analysis obtains its result by matching against the judgment keyword library.
When the behavior library is updated, the weights are recalculated with the newly added target keywords included. Keywords in the judgment keyword library whose weight falls below the preset weight are moved to the undetermined keyword library; similarly, keywords in the undetermined keyword library whose weight decreases are moved to the non-judgment keyword library, and keywords whose weight rises above the preset weight are moved to the judgment keyword library.
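The movement of keywords between the three libraries can be pictured as a re-bucketing by weight after every recalculation. The sketch below assumes two thresholds, `judgment_threshold` and `pending_threshold`; the text only mentions a preset weight, so the second threshold, the function name `assign_tiers` and the tier labels are all hypothetical.

```python
def assign_tiers(keyword_weights, judgment_threshold, pending_threshold):
    """Place each keyword in one of three libraries according to its weight."""
    tiers = {"judgment": {}, "undetermined": {}, "non_judgment": {}}
    for keyword, weight in keyword_weights.items():
        if weight > judgment_threshold:
            tiers["judgment"][keyword] = weight       # used for classification
        elif weight > pending_threshold:
            tiers["undetermined"][keyword] = weight   # may be promoted later
        else:
            tiers["non_judgment"][keyword] = weight   # ignored in classification
    return tiers
```

Running the function again on recalculated weights promotes or demotes keywords automatically, which is the behavior the paragraph above describes.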
Specifically, the weight calculation may use a TF-IDF algorithm, or may use other algorithms that can obtain the weight of the keyword.
Taking the TF-IDF algorithm as an example, TF is the frequency with which a word or phrase appears in a document; here it is the frequency with which a keyword appears in a web page, for example the frequency of "discount" in a web page of the behavior category "shopping". It is given by the formula:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where i is the index of the word in the keyword library, j is the number of the web page corresponding to the keyword, n_{i,j} is the number of times word i appears in web page j, and the denominator is the total number of keywords in web page j. For example, if "discount" appears 5 times in the shopping web page numbered "1" and that web page has 100 keywords in total, the TF value of "discount" is 5/100 = 0.05. The judgment keyword library stores the keyword and its frequency for the web page numbered "1", together with the TF value of the keyword.
IDF is the inverse document frequency; here it represents the importance of a keyword to the judgment of a behavior category, and is given by the formula:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}$$
where |D| is the number of all web pages in a behavior category and |{ j : t_i ∈ d_j }| is the number of web pages containing the keyword t_i. For example, if the behavior library holds 100 "shopping" web pages in total and 10 of them contain the keyword "discount", the IDF value is log(100/10) = 1, taking the logarithm to base 10.
The TF-IDF value is the product of TF and IDF. For example, if the TF and IDF of the keyword "discount" are 0.05 and 1 respectively, the TF-IDF value is 0.05 × 1 = 0.05.
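The worked numbers above can be reproduced with a short Python sketch. The function names are hypothetical, and the base-10 logarithm is an assumption chosen because it yields the IDF value of 1 given in the example.

```python
import math


def term_frequency(keyword_count, total_keywords):
    """TF: occurrences of the keyword divided by all keywords in the page."""
    return keyword_count / total_keywords


def inverse_document_frequency(total_pages, pages_with_keyword):
    """IDF: base-10 log of pages in the category over pages with the keyword."""
    return math.log10(total_pages / pages_with_keyword)


# Keyword appears 5 times among 100 keywords in page "1".
tf = term_frequency(5, 100)
# 10 of the 100 pages in the category contain the keyword.
idf = inverse_document_frequency(100, 10)
tfidf = tf * idf
```

With these inputs `tf` is 0.05, `idf` is 1.0 and `tfidf` is 0.05, matching the example in the text.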
A weight threshold is preset according to actual conditions, and keywords whose weight is higher than the threshold are used as judgment keywords. Judging by these keywords prevents low-weight words, such as "yes" and "also", from influencing the judgment of behavior categories.
Example three
The third embodiment builds on the first and/or second embodiment and mainly explains the specific process of keyword analysis.
The keyword analysis comprises the following steps:
S210, randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequencies, recording those frequencies as first frequencies, and constructing a first array from the first frequencies;
S220, counting the target keywords in the unknown URL whose weight is higher than the preset weight, together with their frequencies, recording those frequencies as second frequencies, and constructing a second array from the second frequencies;
specifically, counting the target keywords whose weight is higher than the preset weight in the unknown URL, and their frequencies, comprises the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
S230, comparing the similarity of the first array and the second array to obtain the similarity value of the unknown URL and the behavior category;
and calculating the similarity of the unknown URL and all behavior categories to obtain the analysis result, wherein the analysis result is the behavior category with the maximum similarity to the unknown URL data.
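The screening of target keywords in S220 can be sketched as follows. Several simplifications are assumed: crawling is omitted and the page text is passed in directly; a regular expression over Latin letters stands in for a real word segmenter (Chinese pages would need a dedicated tokenizer); and the weight is TF multiplied by an IDF value looked up in a hypothetical `idf_by_keyword` table taken from the behavior library.

```python
import re
from collections import Counter


def select_target_keywords(page_text, idf_by_keyword, preset_weight):
    """Return {keyword: frequency} for keywords weighted above preset_weight."""
    # Segment the page content into lower-case word tokens (assumed tokenizer).
    tokens = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    targets = {}
    for keyword, count in counts.items():
        # Weight = TF * IDF; keywords without an IDF entry get weight 0.
        weight = (count / total) * idf_by_keyword.get(keyword, 0.0)
        if weight > preset_weight:
            targets[keyword] = count
    return targets
```

The returned frequencies are what S220 calls the second frequencies, from which the second array is built.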
Specifically, the similarity may be calculated using cosine similarity or any other method capable of calculating similarity.
Taking cosine similarity as an example, the formula is:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where A and B represent the first array and the second array respectively. The closer the result is to 1, the more similar the two groups of keywords are.
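Steps S210 to S230 can be put together in a small Python sketch. The text does not say how the two arrays are aligned, so this sketch assumes both are built over the sorted union of the judgment keywords and the target keywords, with 0 for a missing keyword; it also iterates over every category rather than picking them at random. The names `cosine_similarity` and `classify` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency arrays."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty array is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(target_counts, judgment_library):
    """Return (category, similarity) for the most similar behavior category."""
    best_category, best_score = None, -1.0
    for category, judgment_counts in judgment_library.items():
        keywords = sorted(set(judgment_counts) | set(target_counts))
        first = [judgment_counts.get(k, 0) for k in keywords]   # first array
        second = [target_counts.get(k, 0) for k in keywords]    # second array
        score = cosine_similarity(first, second)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score
```

For instance, target keywords `{"delivery": 2, "discount": 3, "apple": 1}` compared against a "shopping" group `{"delivery": 4, "discount": 6}` and a "news" group with no shared keywords would be assigned to "shopping", since the "news" similarity is 0.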
Example four
The fourth embodiment builds on the first embodiment and mainly explains the updating of the threat library.
Specifically, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being the newly added threat URLs.
Threat event analysis is carried out on the URLs in the URL database, or on URLs crawled at random by a crawler, in order to update the threat library. Threat detection can be performed by sending the URL data to the security platform; the threat library is updated in real time from the aggregated detection results, and websites posing serious threats, or the IPs corresponding to their URLs, are issued to the firewall system of the intelligent gateway box so that they are blocked.
Through the security platform, URLs can undergo further threat detection, and the threat library is updated with the detection results, so that matching against the threat library is more accurate.
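The update loop of this embodiment can be sketched as below. The security platform is not specified in the text, so it is modeled here as any callable that takes a list of URLs and returns those it judges to be threats; that interface, the function name `update_threat_library`, and the choice to skip URLs already in the threat library are assumptions.

```python
def update_threat_library(url_database, threat_library, security_platform):
    """Push not-yet-flagged URLs to the platform and merge the returned threats.

    url_database: iterable of URL strings.
    threat_library: set of known threat URLs, updated in place.
    security_platform: callable(list_of_urls) -> iterable of threat URLs.
    """
    candidates = [url for url in url_database if url not in threat_library]
    new_threats = set(security_platform(candidates))
    threat_library |= new_threats  # in-place union keeps the caller's set current
    return new_threats
```

The returned set of newly added threat URLs is what could then be issued to the gateway firewall for blocking.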
Example five
The fifth embodiment discloses a device corresponding to the URL analysis method of the above embodiments. The device has a virtual device structure and, as shown in fig. 3, comprises:
the acquisition module 310 is configured to receive URL data and store the URL data in a URL database;
the filtering module 320 is configured to match the URL data against a threat library, filter out the URL data that poses a threat to obtain safe URL data, and store threat records in a threat record library;
the analysis module 330 is configured to match the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely the unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result in the behavior record library as the behavior record;
and the updating module 340 is configured to update the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
Preferably, updating the behavior library according to the target keyword, the unknown URL and the behavior category corresponding to the unknown URL in the analysis result, includes the following steps:
adding the target keywords and the frequency numbers thereof into a behavior library;
and according to the updated keywords, recalculating the weight of each keyword, and updating the behavior library according to the weight.
The unknown URLs also include URLs crawled at random by a crawler.
Preferably, the behavior library includes a URL library and a keyword library; the URL library includes known URLs and the behavior categories corresponding to the known URLs; the keyword library holds the keywords corresponding to the behavior categories and is divided by weight into a judgment keyword library and other keyword libraries; the other keyword libraries are divided, from high weight to low, into an undetermined keyword library and a non-judgment keyword library; the judgment keyword library stores one or more groups of judgment keywords and their frequencies for each behavior category; and the keyword analysis obtains the analysis result by matching against the judgment keyword library.
Preferably, the keyword analysis obtains the analysis result by matching against the judgment keyword library, which includes the following steps:
randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequencies, recording those frequencies as first frequencies, and constructing a first array from the first frequencies;
counting the target keywords in the unknown URL whose weight is higher than the preset weight, together with their frequencies, recording those frequencies as second frequencies, and constructing a second array from the second frequencies;
comparing the similarity of the first array and the second array to obtain the similarity value of the unknown URL and the behavior category;
and calculating the similarity of the unknown URL and all behavior categories to obtain the analysis result, wherein the analysis result is the behavior category with the maximum similarity to the unknown URL data.
Counting the target keywords whose weight is higher than the preset weight in the unknown URL, and their frequencies, includes the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
Preferably, the data in the URL database is pushed to a security platform, and the threat library is updated according to a result returned by the security platform, where the returned result is a newly added threat URL.
Example six
Fig. 4 is a schematic structural diagram of an electronic device according to the sixth embodiment of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the electronic device may be one or more; one processor 410 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430 and the output device 440 may be connected by a bus or by other means; a bus connection is taken as an example in fig. 4.
The memory 420, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the URL analysis method in the embodiments of the present invention (for example, the acquisition module 310, the filtering module 320, the analysis module 330, and the updating module 340 of the URL analysis device). The processor 410 executes the various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 420, thereby implementing the URL analysis method of the first to fourth embodiments.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to an electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input user identity information, preset weights, and the like. The output device 440 may include a display device such as a display screen.
Example seven
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a URL analysis method, including:
receiving URL data, and storing the URL data in a URL database;
matching the URL data with a threat library, filtering out the URL data with threats to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data with known URLs in a behavior library:
when the safe URL data are successfully matched, behavior records are obtained and stored in a behavior record library, and the behavior records are behavior types corresponding to the safe URLs;
when the safe URL data matching fails, extracting target keywords from URLs which fail to match, namely unknown URLs, performing keyword analysis according to the keywords in the behavior library, and storing analysis results serving as behavior records into a behavior record library;
and updating a behavior library according to the target keywords, the unknown URL and the behavior types corresponding to the unknown URLs in the analysis result.
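The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the patented implementation: the `Libraries` container, the `process_url` function, and the `analyze` callback (which stands in for the crawling and keyword analysis of an unknown URL) are hypothetical names, and every library is modelled as an in-memory Python collection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Libraries:
    """Hypothetical in-memory stand-ins for the stores named in the method."""
    url_database: List[str] = field(default_factory=list)
    threat_library: Set[str] = field(default_factory=set)         # URLs known to carry threats
    threat_records: List[str] = field(default_factory=list)
    known_urls: Dict[str, str] = field(default_factory=dict)      # known URL -> behavior category
    keyword_counts: Dict[str, int] = field(default_factory=dict)  # keyword -> frequency number
    behavior_records: List[Tuple[str, str]] = field(default_factory=list)


# Assumed signature: takes an unknown URL, returns (target keywords, category).
KeywordAnalysis = Callable[[str], Tuple[List[str], str]]


def process_url(url: str, libs: Libraries, analyze: KeywordAnalysis) -> Optional[str]:
    """Run one URL through the pipeline; return its category, or None if filtered."""
    libs.url_database.append(url)            # receive and store the URL data

    if url in libs.threat_library:           # threat filtering
        libs.threat_records.append(url)
        return None

    category = libs.known_urls.get(url)      # match against known URLs
    if category is None:                     # match failed: this is an unknown URL
        keywords, category = analyze(url)    # keyword analysis (supplied by the caller)
        libs.known_urls[url] = category      # update the behavior library
        for kw in keywords:
            libs.keyword_counts[kw] = libs.keyword_counts.get(kw, 0) + 1

    libs.behavior_records.append((url, category))
    return category
```

Passing the keyword analysis in as a callback keeps the control flow of the method visible without committing to any particular crawler or classifier.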
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present invention are not limited to the method operations described above, and may also perform related operations in the URL analysis method provided by any embodiment of the present invention.
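The extraction of target keywords from an unknown URL, which the claims describe as segmenting the page content, weighting every keyword with TF-IDF, and keeping those above a preset weight, can be illustrated as follows. This is a hedged sketch: the function names, the regular-expression tokenizer (a stand-in for a real word segmenter), the particular TF-IDF variant, and the use of a list of reference pages as the IDF corpus are all assumptions, and the page text is passed in directly rather than crawled.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric split; a stand-in for a real word segmenter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_weights(page: str, reference_pages: List[str]) -> Dict[str, float]:
    """TF-IDF weight of each term in `page`; IDF is taken over page + references."""
    target = tokenize(page)
    if not target:
        return {}
    doc_sets = [set(target)] + [set(tokenize(p)) for p in reference_pages]
    weights = {}
    for term, count in Counter(target).items():
        tf = count / len(target)                            # term frequency in the page
        df = sum(1 for terms in doc_sets if term in terms)  # >= 1: the page itself
        idf = math.log(len(doc_sets) / df)                  # assumed unsmoothed IDF
        weights[term] = tf * idf
    return weights


def target_keywords(page: str, reference_pages: List[str],
                    preset_weight: float) -> Dict[str, float]:
    """Keep only the keywords whose weight exceeds the preset weight."""
    weights = tfidf_weights(page, reference_pages)
    return {term: w for term, w in weights.items() if w > preset_weight}
```

With this variant, a term that appears in every reference page receives a weight of zero, so it can never become a target keyword regardless of how often it occurs on the page.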
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above embodiment of the URL analysis device, the included units and modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (7)

1. A URL analysis method, comprising the steps of:
receiving URL data, and storing the URL data in a URL database;
matching the URL data with a threat library, filtering out the URL data with threats to obtain safe URL data, and storing threat records in a threat record library;
wherein the threat library comprises URLs known to have threats, and the threat record library comprises the threat records produced when URL data with threats are matched against the threat library;
matching the safe URL data with known URLs in a behavior library:
when the safe URL data are successfully matched, behavior records are obtained and stored in a behavior record library, and the behavior records are behavior types corresponding to the safe URLs;
when the safe URL data matching fails, extracting target keywords from the URLs which fail to be matched, namely unknown URLs, performing keyword analysis according to the keywords in the behavior library, and storing an analysis result as the behavior record in the behavior record library, wherein the behavior library comprises a URL library and a keyword library, the URL library comprises known URLs and the behavior categories corresponding to the known URLs, the keyword library comprises keywords corresponding to the behavior categories, and the keyword library is divided according to weight into a judgment keyword library and other keyword libraries, the other keyword libraries being divided, in order of weight from high to low, into an undetermined keyword library and a non-judged keyword library; one or more groups of judgment keywords and frequency numbers corresponding to behavior categories are stored in the judgment keyword library, and the keyword analysis obtains the analysis result by matching against the judgment keyword library;
wherein the target keywords are obtained as follows: crawling the web page of the unknown URL; segmenting the content of the web page to obtain all keywords in the web page; calculating the weights of all keywords; and screening out the keywords whose weight is higher than a preset weight to obtain the target keywords; wherein the weights of the keywords are calculated using the TF-IDF algorithm;
updating the behavior library according to the target keywords, the unknown URL and the behavior category corresponding to the unknown URL in the analysis result, which comprises the following steps:
adding the target keywords and the frequency numbers thereof into a behavior library;
recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights, wherein updating the behavior library according to the weights comprises the following steps: adding the keywords in the judgment keyword library whose weight is lower than the preset weight into the undetermined keyword library; adding the keywords in the undetermined keyword library whose weight has decreased into the non-judged keyword library; and adding the keywords whose weight is higher than the preset weight into the judgment keyword library.
2. The URL analysis method as claimed in claim 1, wherein the unknown URLs further include URLs crawled at random by a crawler.
3. The URL analysis method as claimed in claim 2, wherein the keyword analysis obtains the analysis result by matching against the judgment keyword library, comprising the steps of:
randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequency numbers, recording these as first frequency numbers, and constructing a first array from the first frequency numbers;
counting the target keywords in the unknown URL whose weights are higher than the preset weight and their frequency numbers, recording these as second frequency numbers, and constructing a second array from the second frequency numbers;
comparing the first array with the second array to obtain the similarity value between the unknown URL and the behavior category;
and calculating the similarity between the unknown URL and every behavior category to obtain the analysis result, wherein the analysis result is the behavior category with the greatest similarity to the unknown URL.
4. The URL analysis method according to claim 1, wherein the data in the URL database is pushed to a security platform, and the threat repository is updated according to a result returned by the security platform, and the returned result is a new threat URL.
5. A URL analysis device, comprising:
the acquisition module is used for receiving URL data and storing the URL data in a URL database;
the filtering module is used for matching the URL data with a threat library, filtering the URL data with threats to obtain safe URL data, and storing threat records into a threat record library, wherein the threat library comprises known URLs with threats, and the threat record library comprises threat records matched by the URL data with threats and the threat library;
an analysis module for matching the secure URL data with known URLs in a behavior library:
when the safe URL data are successfully matched, behavior records are obtained and stored in a behavior record library, and the behavior records are behavior types corresponding to the safe URLs;
when the safe URL data matching fails, extracting target keywords from the URLs which fail to be matched, namely unknown URLs, performing keyword analysis according to the keywords in the behavior library, and storing an analysis result as the behavior record in the behavior record library, wherein the behavior library comprises a URL library and a keyword library, the URL library comprises known URLs and the behavior categories corresponding to the known URLs, the keyword library comprises keywords corresponding to the behavior categories, and the keyword library is divided according to weight into a judgment keyword library and other keyword libraries, the other keyword libraries being divided, in order of weight from high to low, into an undetermined keyword library and a non-judged keyword library; one or more groups of judgment keywords and frequency numbers corresponding to behavior categories are stored in the judgment keyword library, and the keyword analysis obtains the analysis result by matching against the judgment keyword library; wherein the target keywords are obtained as follows: crawling the web page of the unknown URL; segmenting the content of the web page to obtain all keywords in the web page; calculating the weights of all keywords; and screening out the keywords whose weight is higher than a preset weight to obtain the target keywords; wherein the weights of the keywords are calculated using the TF-IDF algorithm;
an updating module, configured to update the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result, wherein the updating comprises the following steps: adding the target keywords and the frequency numbers thereof into the behavior library; recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights, wherein updating the behavior library according to the weights comprises the following steps: adding the keywords in the judgment keyword library whose weight is lower than the preset weight into the undetermined keyword library; adding the keywords in the undetermined keyword library whose weight has decreased into the non-judged keyword library; and adding the keywords whose weight is higher than the preset weight into the judgment keyword library.
6. An electronic device comprising a processor and a storage medium storing a computer program, wherein the computer program, when executed by the processor, implements the URL analysis method of any one of claims 1 to 4.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the URL analysis method according to any one of claims 1 to 4.
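As an illustration of the category judgment recited in claim 3, the sketch below builds the first and second frequency arrays over a shared vocabulary and picks the behavior category with the greatest similarity. The claims do not name a similarity measure; cosine similarity is assumed here, and the function names and the dictionary layout of the judgment keyword library are likewise hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def cosine_similarity(first: Dict[str, int], second: Dict[str, int]) -> float:
    """Cosine similarity of two keyword -> frequency maps (assumed measure)."""
    vocab = sorted(set(first) | set(second))
    a = [first.get(k, 0) for k in vocab]    # first array: judgment keyword frequencies
    b = [second.get(k, 0) for k in vocab]   # second array: target keyword frequencies
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(
    target_freqs: Dict[str, int],
    judgment_library: Dict[str, Dict[str, int]],
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return (most similar category, similarity per category)."""
    scores = {
        category: cosine_similarity(freqs, target_freqs)
        for category, freqs in judgment_library.items()
    }
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


# Illustrative data only; categories and counts are made up for the example.
judgment_library = {
    "shopping": {"cart": 5, "price": 3},
    "news": {"headline": 4, "report": 2},
}
best, scores = classify({"price": 2, "cart": 1, "report": 1}, judgment_library)
```

Aligning both arrays on the union of the two keyword sets means a keyword missing from one side contributes a zero, which lowers the similarity instead of being silently ignored.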
CN201910687531.5A 2019-07-26 2019-07-26 URL analysis method, device, equipment and medium Active CN110460592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910687531.5A CN110460592B (en) 2019-07-26 2019-07-26 URL analysis method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910687531.5A CN110460592B (en) 2019-07-26 2019-07-26 URL analysis method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110460592A CN110460592A (en) 2019-11-15
CN110460592B true CN110460592B (en) 2021-03-26

Family

ID=68483814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910687531.5A Active CN110460592B (en) 2019-07-26 2019-07-26 URL analysis method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110460592B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902703B (en) * 2014-03-31 2016-02-10 郭磊 Based on the content of text sorting technique of mobile Internet access
US10303898B2 (en) * 2015-05-11 2019-05-28 Finjan Mobile, Inc. Detection and blocking of web trackers for mobile browsers
CN106230809B (en) * 2016-07-27 2019-11-19 南京快页数码科技有限公司 A kind of mobile Internet public sentiment monitoring method and system based on URL
CN107590169B (en) * 2017-04-14 2020-03-06 南方科技大学 Operator gateway data preprocessing method and system

Also Published As

Publication number Publication date
CN110460592A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
US10778702B1 (en) Predictive modeling of domain names using web-linking characteristics
CN109274632B (en) Website identification method and device
Marchal et al. PhishStorm: Detecting phishing with streaming analytics
CN103023712B (en) Method and system for monitoring malicious property of webpage
CN107992738B (en) Account login abnormity detection method and device and electronic equipment
US10505986B1 (en) Sensor based rules for responding to malicious activity
CN108023868B (en) Malicious resource address detection method and device
Desai et al. Malicious web content detection using machine leaning
CN106992981B (en) Website backdoor detection method and device and computing equipment
CN110602137A (en) Malicious IP and malicious URL intercepting method, device, equipment and medium
CN111224941B (en) Threat type identification method and device
CN106850647B (en) Malicious domain name detection algorithm based on DNS request period
US20160299971A1 (en) Identifying Search Engine Crawlers
CN112131507A (en) Website content processing method, device, server and computer-readable storage medium
CN107231383B (en) CC attack detection method and device
Soleymani et al. A novel approach for detecting DGA-based botnets in DNS queries using machine learning techniques
CN109756467B (en) Phishing website identification method and device
CN109495471B (en) Method, device and equipment for judging WEB attack result and readable storage medium
CN106850632B (en) Method and device for detecting abnormal combined data
CN110460592B (en) URL analysis method, device, equipment and medium
WO2014005885A1 (en) Method and apparatus for web page content categorization
CN114500122B (en) Specific network behavior analysis method and system based on multi-source data fusion
CN108171053B (en) Rule discovery method and system
CN112087414A (en) Detection method and device for mining trojans
CN107332856B (en) Address information detection method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210302

Address after: Room 402, Jinhua Network Economy Center Building, 398 Silian Road, Wucheng District, Jinhua City, Zhejiang Province

Applicant after: GUANGTONG TIANXIA NETWORK TECHNOLOGY Co.,Ltd.

Address before: 310051 room 2503, area a, building 1, No. 57, jianger Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou Jixun Huitong Technology Co.,Ltd.

GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20230817

Granted publication date: 20210326