Disclosure of Invention
In order to overcome the defects of the prior art, a first object of the present invention is to provide a URL analysis method that obtains an accurate URL behavior analysis result by performing threat analysis first, then behavior analysis, and finally keyword analysis.
The first object of the invention is achieved by the following technical scheme:
a URL analysis method comprising the steps of:
receiving URL data, and storing the URL data in a URL database;
matching the URL data against a threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result as a behavior record in the behavior record library;
and updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
Further, updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result comprises the following steps:
adding the target keywords and their frequency numbers to the behavior library;
and recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights.
Further, the unknown URLs also include URLs randomly crawled by a crawler.
Further, the behavior library comprises a URL library and a keyword library. The URL library comprises known URLs and the behavior categories corresponding to the known URLs, and the keyword library holds the keywords corresponding to the behavior categories. The keyword library is divided by weight into a judgment keyword library and other keyword libraries, and the other keyword libraries are divided, by weight from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores one or more groups of judgment keywords and frequency numbers for each behavior category, and the keyword analysis is matched against the judgment keyword library to obtain the analysis result.
Further, obtaining the analysis result by matching the keyword analysis against the judgment keyword library comprises the following steps:
randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequency numbers, recording these frequency numbers as first frequency numbers, and constructing a first array from the first frequency numbers;
counting the target keywords in the unknown URL whose weights are higher than the preset weight and their frequency numbers, recording these frequency numbers as second frequency numbers, and constructing a second array from the second frequency numbers;
comparing the similarity of the first array and the second array to obtain the similarity value between the unknown URL and the behavior category;
and calculating the similarity between the unknown URL and every behavior category to obtain the analysis result, the analysis result being the behavior category with the greatest similarity to the unknown URL data.
Further, counting the target keywords with weights higher than the preset weight in the unknown URL, and their frequency numbers, comprises the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
Further, the data in the URL database is pushed to a security platform, and the threat library is updated according to the result returned by the security platform, the returned result being the newly added threat URLs.
The second object of the invention is achieved by the following technical scheme:
a URL analysis device, comprising:
the acquisition module is used for receiving URL data and storing the URL data in a URL database;
the filtering module is used for matching the URL data against the threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in the threat record library;
the analysis module is used for matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result as a behavior record in the behavior record library;
and the updating module is used for updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
A third object of the present invention is to provide an electronic device that carries out the first object of the invention, comprising a processor, a storage medium, and a computer program stored in the storage medium, wherein the computer program, when executed by the processor, implements the above URL analysis method.
A fourth object of the present invention is to provide a computer-readable storage medium that carries out the first object of the invention, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the URL analysis method described above.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, threat URLs are filtered out first and behavior analysis is carried out only on the remaining safe URLs, which avoids the inaccurate results that arise when threat URLs are analyzed together with safe URLs during behavior analysis. Because matches against the threat library and the behavior library are recorded, threat records and behavior records can be queried directly. For URLs that are not in the behavior library, the corresponding behavior categories are obtained through keyword analysis, and the behavior library is updated in real time according to the analysis result, so that the analysis accuracy is further improved and the behavior library does not need to be updated manually.
Detailed Description
The present invention will now be described in more detail with reference to the accompanying drawings, in which the description of the invention is given by way of illustration and not of limitation. The various embodiments may be combined with each other to form other embodiments not shown in the following description.
Example one
The first embodiment provides a URL analysis method that first records threat URLs and then analyzes the behavior categories of the safe URLs, so as to obtain more accurate threat records and behavior records. With this method the behavior categories of all URLs can be obtained, and the behavior library is updated from the keyword comparison results, replacing manual updating of the URL behavior library.
Referring to fig. 1, a URL analyzing method includes the following steps:
S110, receiving URL data and storing the URL data in a URL database;
the received URL data is typically collected from a gateway box and mainly comprises the full path of each URL, the number of times it was accessed, the access time, and the like. The data is pushed to a designated topic in real time, and the topic data is consumed in real time using Structured Streaming or a similar stream processing engine to obtain information such as the URLs accessed by users.
S120, matching the URL data against the threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in the threat record library;
at query time, the threat history can be read directly from the threat record library.
S130, matching the safe URL data against known URLs in the behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in the behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result as a behavior record in the behavior record library;
the unknown URLs also include URLs randomly crawled by a crawler; adding randomly crawled URLs improves the efficiency with which the behavior library is updated.
During matching, the safe URL is compared with the URLs in the behavior library. For example, if the behavior category corresponding to the URL address "www.taobao.com" in the behavior library is "shopping", then every safe URL that contains "www.taobao.com" is assigned the behavior category "shopping". Before matching, the safe URL usually needs to be normalized, for example by removing the URL protocol header.
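The normalization and containment match described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `URL_LIBRARY`, `normalize` and `match_behavior` are hypothetical, and lowercasing the host is an added assumption beyond the protocol-header removal mentioned in the text.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical URL library: known URL -> behavior category.
URL_LIBRARY = {"www.taobao.com": "shopping"}


def normalize(url: str) -> str:
    """Remove the protocol header and lowercase the host; keep path and query."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    if not url.startswith("//") and "://" not in url.split("?", 1)[0]:
        url = "//" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    rest = parts.path + ("?" + parts.query if parts.query else "")
    return host + rest


def match_behavior(url: str, url_library: dict[str, str]) -> str | None:
    """Return the category of the first known URL contained in the safe URL."""
    normalized = normalize(url)
    for known, category in url_library.items():
        if known in normalized:
            return category
    return None  # unknown URL: handled by keyword analysis instead
```

For instance, `match_behavior("http://WWW.TAOBAO.COM/cart", URL_LIBRARY)` yields "shopping", while a URL whose host is absent from the library yields `None` and would proceed to keyword analysis.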
During keyword analysis, the keywords of the web page of the unknown URL are compared with those in the behavior library. For example, if one group of keywords for the behavior category "shopping" in the behavior library is "Shunfeng" and "preferential", and the target keywords are "Shunfeng", "preferential" and "apple", the behavior category of the unknown URL can be determined to be "shopping". The target keywords are extracted according to set conditions: generally, keywords whose weight is higher than a preset weight are selected as the target keywords; alternatively, the keywords are sorted by weight and a preset number of the highest-weight keywords are selected.
During query, the behavior history records can be directly queried through the behavior record library.
S140, updating the behavior library according to the target keywords, the unknown URL, and the corresponding behavior category in the analysis result.
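Steps S110 to S140 can be sketched end to end as below. This is a simplified illustration under stated assumptions: the databases are replaced by in-memory containers, matching is plain substring containment, and `Stores`, `analyze` and the injected `keyword_analysis` callable are hypothetical names that do not appear in the source.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stores:
    """In-memory stand-ins for the libraries named in S110 to S140."""
    url_db: list[str] = field(default_factory=list)
    threat_lib: set[str] = field(default_factory=set)
    threat_records: list[str] = field(default_factory=list)
    url_lib: dict[str, str] = field(default_factory=dict)        # known URL -> category
    keyword_lib: dict[str, Counter] = field(default_factory=dict)  # category -> keyword frequencies
    behavior_records: list[tuple[str, str]] = field(default_factory=list)


def analyze(url: str, stores: Stores,
            keyword_analysis: Callable[[str], tuple[str, Counter]]) -> str | None:
    """Run one URL through S110-S140; return its category, or None for a threat."""
    stores.url_db.append(url)                                    # S110: store the URL data
    if any(threat in url for threat in stores.threat_lib):       # S120: threat filtering
        stores.threat_records.append(url)
        return None
    category = next(                                             # S130: behavior matching
        (c for known, c in stores.url_lib.items() if known in url), None)
    if category is None:                                         # unknown URL
        category, target_keywords = keyword_analysis(url)
        stores.url_lib[url] = category                           # S140: update behavior library
        stores.keyword_lib.setdefault(category, Counter()).update(target_keywords)
    stores.behavior_records.append((url, category))
    return category
```

Passing the keyword analysis in as a callable keeps the sketch independent of how keywords are weighted and compared, which the later embodiments describe.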
Example two
The second embodiment is an improvement on the first embodiment and mainly explains the behavior library and the calculation of the weights.
In order to facilitate keyword analysis and behavior category matching, the behavior library comprises a URL library and a keyword library. The URL library comprises known URLs and the behavior categories corresponding to the known URLs, and the keyword library holds the keywords corresponding to the behavior categories. The keyword library is divided by weight into a judgment keyword library and other keyword libraries, and the other keyword libraries are divided, by weight from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores one or more groups of judgment keywords and frequency numbers for each behavior category, and the keyword analysis is matched against the judgment keyword library to obtain the analysis result.
When the behavior library is updated, the weights are recalculated with the newly added target keywords taken into account. Keywords in the judgment keyword library whose weight falls below the preset weight are moved to the undetermined keyword library. The undetermined keyword library is handled similarly: keywords whose weight has decreased are moved to the non-judgment keyword library, and keywords whose weight rises above the preset weight are moved to the judgment keyword library.
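The re-tiering of keywords after a weight recalculation might look like the sketch below. The source names a single preset weight; the second, lower threshold that separates undetermined from non-judgment keywords is an assumption made here for illustration, as are the function names and the default values.

```python
from __future__ import annotations


def assign_tier(weight: float, judgment_min: float, undetermined_min: float) -> str:
    """Place one keyword in a tier by weight, from high to low."""
    if weight > judgment_min:
        return "judgment"
    if weight > undetermined_min:      # assumed second threshold, not in the source
        return "undetermined"
    return "non-judgment"


def retier(weights: dict[str, float],
           judgment_min: float = 0.05,
           undetermined_min: float = 0.01) -> dict[str, set[str]]:
    """Rebuild the three keyword libraries from freshly recalculated weights."""
    tiers: dict[str, set[str]] = {
        "judgment": set(), "undetermined": set(), "non-judgment": set()}
    for keyword, weight in weights.items():
        tiers[assign_tier(weight, judgment_min, undetermined_min)].add(keyword)
    return tiers
```

Rebuilding all three tiers from the recalculated weights covers both directions of movement described above, since a keyword simply lands wherever its new weight puts it.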
Specifically, the weight calculation may use a TF-IDF algorithm, or may use other algorithms that can obtain the weight of the keyword.
Taking the TF-IDF algorithm as an example, TF represents the frequency with which a word or phrase appears in a document; here it refers to the frequency with which a keyword appears in a web page, for example the frequency with which "preferential" appears in a web page of the behavior category "shopping". It is expressed by the formula:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where i is the ith word in the keyword library, j is the number of the web page corresponding to the keyword, $n_{i,j}$ is the number of times keyword i appears in web page j, and the denominator is the total number of keyword occurrences in web page j. For example, if "preferential" appears 5 times in the shopping web page numbered "1" and that web page has 100 keywords in total, the TF value of "preferential" is 5/100 = 0.05. The judgment keyword library stores the keyword and its frequency number for the web page numbered "1", and also stores the TF value of the keyword.
The IDF represents the inverse document frequency, which here reflects the importance of a keyword for determining the behavior category. It is expressed by the formula:
$$IDF_i = \log_{10} \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the number of all web pages in a given behavior category and $|\{j : t_i \in d_j\}|$ is the number of those web pages that contain the keyword $t_i$. For example, if the behavior library holds 100 "shopping" web pages in total and 10 of them contain the keyword "preferential", the IDF value is $\log_{10}(100/10) = 1$.
The TF-IDF value is the product of TF and IDF; for example, if the TF and IDF of the keyword "preferential" are 0.05 and 1, respectively, the TF-IDF value is 0.05 × 1 = 0.05.
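The worked example above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the base-10 logarithm is chosen because it is the base under which the example's IDF of 100/10 equals 1.

```python
import math


def tf(keyword_count: int, total_keywords: int) -> float:
    """Term frequency of one keyword within one web page."""
    return keyword_count / total_keywords


def idf(total_pages: int, pages_with_keyword: int) -> float:
    """Inverse document frequency within one behavior category (base-10 log)."""
    return math.log10(total_pages / pages_with_keyword)


def tf_idf(keyword_count: int, total_keywords: int,
           total_pages: int, pages_with_keyword: int) -> float:
    """Weight of a keyword: the product of TF and IDF."""
    return tf(keyword_count, total_keywords) * idf(total_pages, pages_with_keyword)


# "preferential": 5 occurrences among 100 keywords; 10 of 100 shopping pages contain it.
weight = tf_idf(5, 100, 100, 10)
```

With these inputs `weight` comes out at approximately 0.05, matching the value given in the text.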
A weight threshold is preset according to actual conditions, and keywords whose weight is higher than the threshold are used as judgment keywords. Judging with these keywords prevents low-weight words, such as "yes" and "also", from affecting the determination of the behavior category.
Example three
The third embodiment builds on the first embodiment and/or the second embodiment and mainly explains the specific process of keyword analysis.
The keyword analysis comprises the following steps:
S210, randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequency numbers, recording these frequency numbers as first frequency numbers, and constructing a first array from the first frequency numbers;
S220, counting the target keywords in the unknown URL whose weights are higher than the preset weight and their frequency numbers, recording these frequency numbers as second frequency numbers, and constructing a second array from the second frequency numbers;
specifically, counting the target keywords with weights higher than the preset weight in the unknown URL, and their frequency numbers, includes the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
S230, comparing the similarity of the first array and the second array to obtain the similarity value of the unknown URL and the behavior category;
and calculating the similarity of the unknown URL and all behavior categories to obtain the analysis result, wherein the analysis result is the behavior category with the maximum similarity to the unknown URL data.
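The sub-steps of S220 (segment the page, weight every keyword, keep those above the preset weight) can be sketched as follows. Several parts are assumptions: the page text is passed in already crawled, the regex tokenizer is a stand-in for a real word segmenter, the IDF values are supplied by the caller, and `segment` and `target_keywords` are hypothetical names.

```python
from __future__ import annotations

import re
from collections import Counter


def segment(text: str) -> list[str]:
    """Stand-in word segmenter: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def target_keywords(page_text: str, idf: dict[str, float],
                    preset_weight: float) -> dict[str, int]:
    """Return target keyword -> frequency number for one crawled page."""
    tokens = segment(page_text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {}
    return {
        word: count
        for word, count in counts.items()
        # Weight = TF * IDF; keywords without a known IDF get weight 0 here.
        if (count / total) * idf.get(word, 0.0) > preset_weight
    }
```

The returned frequency numbers are what the text calls the second frequency numbers, from which the second array is built.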
Specifically, the similarity may be calculated by the cosine theorem, or by any other method that can calculate similarity.
Taking the cosine theorem as an example, the similarity satisfies the formula:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
where A and B respectively represent the first array and the second array; the closer the result is to 1, the higher the similarity of the two groups of keywords.
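A minimal sketch of the cosine comparison and the choice of the most similar behavior category follows. The source does not say how the two arrays are aligned; the sketch assumes both are laid out over the union of the judgment keywords and the target keywords, with 0 for a missing keyword. The names `cosine` and `classify` are hypothetical.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length arrays; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(target: dict[str, int],
             judgment_lib: dict[str, dict[str, int]]) -> tuple[str | None, float]:
    """Return the behavior category most similar to the unknown URL, with its score."""
    best_category, best_score = None, -1.0
    for category, judged in judgment_lib.items():
        vocab = sorted(set(judged) | set(target))      # assumed alignment of the arrays
        first = [judged.get(k, 0) for k in vocab]      # first frequency numbers
        second = [target.get(k, 0) for k in vocab]     # second frequency numbers
        score = cosine(first, second)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score
```

With the document's example, target keywords "Shunfeng", "preferential" and "apple" compared against a "shopping" group containing "Shunfeng" and "preferential" score close to 1 for "shopping", while a category with no shared keywords scores 0.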
Example four
The fourth embodiment builds on the first embodiment and mainly explains the updating of the threat library.
Specifically, data in the URL database is pushed to a security platform, the threat library is updated according to a result returned by the security platform, and the returned result is the newly added threat URL.
To update the threat library, threat event analysis is carried out on the URLs in the URL database or on URLs randomly crawled by a crawler. Threat detection can be performed by sending the URL data to the security platform; the threat library is updated in real time according to the summarized detection results, and websites posing serious threats, or the IP addresses corresponding to their URLs, are issued to the firewall system of the intelligent gateway box so that they are blocked.
Through the security platform, the URL can be further subjected to threat detection, and the threat library is updated through the detection result, so that the accuracy rate is higher when the threat library is matched.
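The threat library update can be sketched as below. The source does not describe the security platform's interface, so it is represented by a caller-supplied `detect` callable that receives URLs and returns those it judges to be threats; that callable and the function name `update_threat_library` are purely hypothetical.

```python
from __future__ import annotations

from typing import Callable, Iterable


def update_threat_library(url_db: Iterable[str], threat_lib: set[str],
                          detect: Callable[[list[str]], Iterable[str]]) -> set[str]:
    """Push URLs to the platform and merge the newly added threat URLs in place."""
    # Only URLs not already known as threats need to be checked again.
    candidates = [url for url in url_db if url not in threat_lib]
    newly_added = set(detect(candidates)) - threat_lib
    threat_lib |= newly_added
    return newly_added
```

Returning only the newly added threat URLs mirrors the returned result described in the text, and gives the caller the set it might forward to a gateway firewall for blocking.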
Example five
The fifth embodiment discloses a device corresponding to the URL analysis method of the above embodiments, namely the virtual device structure of those embodiments. As shown in fig. 3, the device includes:
the acquisition module 310 is configured to receive URL data and store the URL data in a URL database;
the filtering module 320 is configured to match the URL data against the threat library, filter out the URL data that poses a threat to obtain safe URL data, and store threat records in the threat record library;
the analysis module 330 is configured to match the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result as a behavior record in the behavior record library;
and the updating module 340 is configured to update the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
Preferably, updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result includes the following steps:
adding the target keywords and their frequency numbers to the behavior library;
and recalculating the weight of each keyword according to the updated keywords, and updating the behavior library according to the weights.
The unknown URLs also include URLs randomly crawled by a crawler.
Preferably, the behavior library includes a URL library and a keyword library. The URL library includes known URLs and the behavior categories corresponding to the known URLs, and the keyword library holds the keywords corresponding to the behavior categories. The keyword library is divided by weight into a judgment keyword library and other keyword libraries, and the other keyword libraries are divided, by weight from high to low, into an undetermined keyword library and a non-judgment keyword library. The judgment keyword library stores one or more groups of judgment keywords and frequency numbers for each behavior category, and the keyword analysis is matched against the judgment keyword library to obtain the analysis result.
Preferably, obtaining the analysis result by matching the keyword analysis against the judgment keyword library includes the following steps:
randomly selecting a behavior category, obtaining the judgment keywords corresponding to the behavior category and their frequency numbers, recording these frequency numbers as first frequency numbers, and constructing a first array from the first frequency numbers;
counting the target keywords in the unknown URL whose weights are higher than the preset weight and their frequency numbers, recording these frequency numbers as second frequency numbers, and constructing a second array from the second frequency numbers;
comparing the similarity of the first array and the second array to obtain the similarity value between the unknown URL and the behavior category;
and calculating the similarity between the unknown URL and every behavior category to obtain the analysis result, the analysis result being the behavior category with the greatest similarity to the unknown URL data.
Counting the target keywords with weights higher than the preset weight in the unknown URL, and their frequency numbers, includes the following steps:
crawling the webpage of the unknown URL;
segmenting the content of the webpage to obtain all keywords in the webpage;
calculating the weights of all keywords;
and screening out the keywords with the weights higher than the preset weight to obtain the target keywords.
Preferably, the data in the URL database is pushed to a security platform, and the threat library is updated according to a result returned by the security platform, where the returned result is a newly added threat URL.
Example six
Fig. 4 is a schematic structural diagram of an electronic device according to the sixth embodiment of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the electronic device may be one or more, and one processor 410 is taken as an example in fig. 4. The processor 410, the memory 420, the input device 430 and the output device 440 in the electronic device may be connected by a bus or by other means, and connection by a bus is taken as an example in fig. 4.
The memory 420 serves as a computer-readable storage medium and may be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the URL analysis method in the embodiments of the present invention (for example, the acquisition module 310, the filtering module 320, the analysis module 330, and the updating module 340 in the URL analysis device). The processor 410 executes the various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 420, that is, it implements the URL analysis method of the first to fourth embodiments.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input user identity information, preset weights, and the like. The output device 440 may include a display device such as a display screen.
Example seven
The seventh embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a URL analysis method comprising:
receiving URL data, and storing the URL data in a URL database;
matching the URL data against a threat library, filtering out the URL data that poses a threat to obtain safe URL data, and storing threat records in a threat record library;
matching the safe URL data against known URLs in a behavior library:
when the safe URL data is matched successfully, obtaining a behavior record and storing it in a behavior record library, the behavior record being the behavior category corresponding to the safe URL;
when the safe URL data fails to match, extracting target keywords from the URL that failed to match, namely an unknown URL, performing keyword analysis against the keywords in the behavior library, and storing the analysis result as a behavior record in the behavior record library;
and updating the behavior library according to the target keywords, the unknown URL, and the behavior category corresponding to the unknown URL in the analysis result.
Of course, the computer-executable instructions of the storage medium provided by the embodiments of the present invention are not limited to the method operations described above, and may also perform related operations in the URL analysis method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a mobile phone, a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the URL analysis device, the included units and modules are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not used to limit the protection scope of the present invention.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.