AU2021106579A4

AU2021106579A4 - An automated system to detect phishing url by using machine learning algorithm

Info

Publication number: AU2021106579A4
Application number: AU2021106579A
Authority: AU
Inventors: Yogesh Haridas Jadhav; Deepa Parasar
Original assignee: Parasar Deepa Dr
Current assignee: Parasar Deepa Dr
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2021-12-02
Anticipated expiration: 2029-08-23

Abstract

A system and a method for determining a malicious uniform resource locator (URL), comprising: an input module (102) for accepting the URL, wherein the accepted URL is stored in a database (104) in one of a plurality of lists, wherein the presence of the accepted URL is checked against the URLs existing in the database (104), and wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; a feature extraction module (106) for extracting features of the accepted URL upon absence of the accepted URL in the database (104); and a classification module (108) for classifying the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector categorizes the classified URL into the first list or the second list.

Description

[FIGURE 3: recoverable labels: Check URL in database; Black and white list; Lexical feature.]

[FIGURE 4: recoverable labels: Start; Enter URL in search bar; Check URL in database (present / absent); Extract features (suspicious characters; no. of dots and slashes; keywords and company check; multiple occurrence of .com, http); Decision whether malicious or not; Result; Update database; End.]

AN AUTOMATED SYSTEM TO DETECT PHISHING URL BY USING MACHINE LEARNING ALGORITHM

FIELD OF INVENTION

The present invention generally relates to machine learning algorithms. More specifically, the present invention relates to detecting malicious Uniform Resource Locators (URLs) by using machine learning algorithms.

BACKGROUND OF THE INVENTION

Over the past few years, the Internet has played an increasingly large part in everyone's personal and professional life. Not every website is easily accessible or profitable, and day by day more malicious or phishing websites appear. Such malicious websites are a threat to consumers in every respect. They may cause economic losses for the consumer, and some may create misperceptions about the ethical administration of a country. Human-readable URLs are used to identify the billions of websites on today's Internet. Adversaries trying to gain unauthorized access to confidential data may present malicious URLs to naive users as legitimate ones. Such malicious URLs serve as a gateway for unwanted activity. They result in unethical activities such as theft of confidential and private information, and could lead to ransomware deployment on users' phones. Many security agencies are wary of malicious URLs because they put the confidential data of government and private organizations at risk. Some encourage their users to publish unauthorized URLs on social networking sites. Many of these URLs are associated with business promotion and self-advertising, but some of them may pose a serious threat to naive users. Naive users who follow malicious URLs face severe security threats from the adversary. To keep users from visiting malicious websites, validation of URLs is essential. One of the basic features such a program should have is to allow the user's requests for harmless URLs and to block malicious URLs. This is done by alerting the user that the website was malicious and that they should take precautions in future. Instead of focusing on the syntactic properties of the URL, a program can take semantic and lexical properties from each URL.

Traditional methods such as blacklisting and heuristic classification identify and block such URLs before they reach the client. One of the basic methods to detect malicious URLs is blacklisting. The blacklist method maintains a database containing the list of all previously known malicious URLs. A server lookup is performed each time a new URL is encountered, and the new URL is checked against the previously known malicious URLs in the blacklist. The blacklist must be updated whenever a new malicious URL is found. With an ever-increasing number of new URLs, the method is repetitive, time consuming, and computationally intensive. Another method is heuristic classification, in which signatures are compared to establish a connection between the new URL and known malicious URLs. While both blacklisting and heuristic classification effectively distinguish malicious and neutral URLs, they cannot cope with emerging attack techniques. These techniques have serious limitations in classifying newly generated URLs, which makes them inefficient. Most web-based companies use large servers that store millions of URLs and refine these URL sets on a regular basis. The main problem with these solutions is the human intervention required to maintain and update the URL list. Therefore, there exists a need for advanced machine learning techniques that Internet users could use as a tool to overcome these limitations by distinguishing between malicious and non-malicious URLs.

The technical advancements disclosed by the present invention overcome the limitations and disadvantages of existing and conventional systems and methods.

SUMMARY OF THE INVENTION

The present invention generally relates to a system and a method for detecting phishing in URLs.

An objective of the invention is to identify malicious URLs based on a machine learning approach. Another objective of the invention is to update the black list and the white list with the URL. Another objective of this invention is to classify the URL as malicious or valid. Another objective of this invention is to train a system to recognize malicious URLs.

According to an aspect of the present invention, a system for determining a malicious uniform resource locator (URL) comprises: an input module for accepting the URL, wherein the accepted URL is stored in a database in one of a plurality of lists, wherein the presence of the accepted URL is checked against the URLs existing in the database, and wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; a feature extraction module associated with the database for extracting features of the accepted URL upon absence of the accepted URL in the database using a plurality of extracting rules; and a classification module associated with the feature extraction module for classifying the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database to categorize the classified URL into the first list or the second list.

According to an aspect of the present invention, a method for determining a malicious uniform resource locator (URL) comprises: accepting the URL using an input module, storing the accepted URL in a database in one of a plurality of lists, and checking the presence of the accepted URL against the URLs existing in the database, wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; extracting features of the accepted URL using a feature extraction module upon absence of the accepted URL in the database by a plurality of extracting rules; and classifying the accepted URL as malicious or non-malicious using a classification module based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database to categorize the classified URL into the first list or the second list.

To further clarify advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which is illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.

BRIEF DESCRIPTION OF FIGURES

These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

Figure 1 illustrates a block diagram of a system for determining malicious uniform resource locator (URL),

Figure 2 illustrates a flow diagram of a method for determining malicious uniform resource locator (URL),

Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database, and

Figure 4 illustrates a flow diagram of a method for Malicious URL Detection when the URL is not existing in the database.

Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.

DETAILED DESCRIPTION

For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.

Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.

Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.

Figure 1 illustrates a block diagram of a system for determining a malicious uniform resource locator (URL), wherein the system comprises: an input module (102), a database (104), a feature extraction module (106), and a classification module (108).

The input module (102) accepts the URL, wherein the accepted URL is stored in a database (104) in one of a plurality of lists, wherein the presence of the accepted URL is checked against the URLs existing in the database (104), and wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs. The URL is categorized into the first list or the second list using a lexical filter, wherein the first list is the white list and the second list is the black list.

The feature extraction module (106) is associated with the database (104) for extracting features of the accepted URL upon absence of the accepted URL in the database (104) using a plurality of extracting rules. The plurality of extracting rules is selected from, but not limited to, suspicious characters, number of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.

The classification module (108) is associated with the feature extraction module (106) for classifying the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database (104) to categorize the classified URL into the first list or the second list. The machine learning technique includes a support vector machine (SVM) having a separating hyperplane, which is used as a discriminative classifier dividing the classified output into the plurality of lists.

In two-dimensional space, the hyperplane is a line that divides the plane into two regions, with each category lying on one side of it. The SVM is a supervised machine learning algorithm that is used for classification or regression problems. It works very well when there is a clear margin of separation.
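The specification does not write the hyperplane out. In standard SVM notation it can be expressed as below, where the symbols are assumptions introduced here for illustration: x is the feature vector extracted from a URL, w is a learned weight vector, and b is a learned bias. Which sign corresponds to which list is likewise an assumption.

```latex
% Separating hyperplane in feature space
\mathbf{w}^{\top}\mathbf{x} + b = 0

% Decision rule: the side of the hyperplane selects the list
\hat{y} = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right),
\qquad \hat{y} \in \{-1, +1\}

% Width of the margin for a hard-margin SVM with canonical scaling
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

A wider margin, that is, a smaller norm of w, corresponds to the "clear margin of separation" mentioned above.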

Figure 2 illustrates a flow diagram of a method for determining a malicious uniform resource locator (URL), wherein the method comprises:

Step (202) discloses accepting the URL using an input module (102), storing the accepted URL in a database (104) in one of a plurality of lists, and checking the presence of the accepted URL against the URLs existing in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs.

Step (204) discloses extracting features of the accepted URL using a feature extraction module (106) upon absence of the accepted URL in the database (104) by a plurality of extracting rules.

Step (206) discloses classifying the accepted URL as malicious or non-malicious using a classification module (108) based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database (104) to categorize the classified URL into the first list or the second list.
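The control flow of steps (202) to (206) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `check_url`, the dictionary used as the database, and the two toy callables standing in for modules (106) and (108) are all hypothetical.

```python
from typing import Callable, Dict, List

MALICIOUS = "malicious"
NON_MALICIOUS = "non-malicious"


def check_url(
    url: str,
    database: Dict[str, str],
    extract_features: Callable[[str], List[float]],
    classify: Callable[[List[float]], str],
) -> str:
    """Return the label for url, consulting the database before classifying."""
    # Step 202: accept the URL and check whether it is already listed.
    if url in database:
        return database[url]
    # Step 204: the URL is absent, so extract its features.
    features = extract_features(url)
    # Step 206: classify, then record the result in the matching list.
    label = classify(features)
    database[url] = label
    return label


# Hypothetical stand-ins for modules 106 and 108, for illustration only.
def toy_features(url: str) -> List[float]:
    return [float(url.count(".")), float(len(url))]


def toy_classifier(features: List[float]) -> str:
    # Placeholder rule, not the trained SVM: many dots means malicious.
    return MALICIOUS if features[0] > 3 else NON_MALICIOUS


db = {"http://example.com/": NON_MALICIOUS}
first = check_url("http://a.b.c.d.example.test/login", db, toy_features, toy_classifier)
second = check_url("http://a.b.c.d.example.test/login", db, toy_features, toy_classifier)
```

The second call returns straight from the database, because the first call stored its result there.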

Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database (104).

A dataset of malicious and non-malicious URLs is collected for training purposes. The dataset used for the proposed system is a URL dataset from a website selected from, but not limited to, Kaggle.

According to an embodiment, the dataset from the Kaggle website has 480,000 samples, 384,000 of which are malicious URLs and the rest normal URLs. However, the number of samples may vary.

The proposed system uses feature extraction and data modelling of the URLs. A decision vector is created from the URLs present in the dataset. The URLs in the dataset are read one by one, and every URL goes through the operations specified in Figure 3, such as extraction of different features like suspicious characters, number of dots and slashes, etc. These extracted features are then used for training the classifiers. When the user enters a URL to check its authenticity, it is first checked in the database (104). If the URL is present in the database (104), it has already been checked and found to be malicious or not. Accordingly, the result is shown to the user and the session ends.

Black and white list: The traditional black and white list of URLs is the first step towards categorization of a new URL. First-level validation is done through the identification of normal and malicious URLs. Normal URLs are added to the white list, and the black list includes malicious URLs. To check any new URL, the black and white lists are traversed to identify the category of the URL, that is, whether the URL is in the black list or in the white list.
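A list lookup of this kind might look like the sketch below. The `UrlLists` class and the `normalize` helper are hypothetical names. Normalizing the URL before lookup is an assumption added here so that trivially different spellings match, and this simple version ignores ports and query strings.

```python
from typing import Optional, Set
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lower-case the scheme and host so trivially different spellings match."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return f"{parts.scheme.lower()}://{host}{path}"


class UrlLists:
    """Holds the white list (first list) and the black list (second list)."""

    def __init__(self) -> None:
        self.white: Set[str] = set()
        self.black: Set[str] = set()

    def add(self, url: str, malicious: bool) -> None:
        target = self.black if malicious else self.white
        target.add(normalize(url))

    def lookup(self, url: str) -> Optional[str]:
        """Return the stored category, or None when the URL is in neither list."""
        key = normalize(url)
        if key in self.black:
            return "malicious"
        if key in self.white:
            return "non-malicious"
        return None


lists = UrlLists()
lists.add("http://Example.com", malicious=False)
lists.add("http://bad.example.test/login", malicious=True)
```

Using sets keeps each lookup at roughly constant cost, which sidesteps the repeated traversal described for traditional blacklists.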

Lexical analysis: The second step for identification of the URL category is the lexical filter. Inside the domain name, it checks for keywords such as 'com', 'www', etc. This checking is based upon training of the system. A few of the rules checked for the domain name are:

• If there are more than 4 numbers

• Presence of special characters such as (#, $, @, -, -, -)

• Top 5 URL address (com, en, net, org, cc)

• Repetition of the "." symbol in the domain name

• Total count of characters in the web address

Flask Framework: Flask is a popular web framework for Python, which means it is a third-party Python library used for web application development.

The lexical rules are applied as follows. The system checks whether the domain name contains more than 4 numbers and whether special characters are present. Also, if a URL contains any of some famous domains, it is likely to be a benign URL. The number of dots in the domain name is also very important in the classification of a URL. The total length of a URL or domain name is also used for the decision, as phishing URLs tend to have long domain names. The specific keyword for each of these features is pushed into the training vectorizer for training the model.
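The rules listed above can be turned into a feature dictionary along the following lines. This is a sketch under assumptions: the function name `lexical_features` and the feature keys are hypothetical, "more than 4 numbers" is read as more than four digits in the host name, and the top-level-domain set is copied from the rule list as printed.

```python
import re
from typing import Dict
from urllib.parse import urlsplit

# Subset of the special characters named in the rules.
SPECIAL_CHARS = set("#$@-")
# Copied from the rule list as printed in the specification.
LISTED_TLDS = {"com", "en", "net", "org", "cc"}


def lexical_features(url: str) -> Dict[str, int]:
    """Extract simple lexical features from a URL string."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    digit_count = sum(ch.isdigit() for ch in host)
    return {
        "more_than_4_digits": int(digit_count > 4),
        "special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "dots_in_domain": host.count("."),
        "slashes": url.count("/"),
        "listed_tld": int(tld in LISTED_TLDS),
        "repeated_com": int(len(re.findall(r"\.com", url)) > 1),
        "url_length": len(url),
    }


features = lexical_features("http://paypal.com.account-verify12345.test/login")
```

For this sample URL the host contains five digits, three dots, and one hyphen, and its top-level domain is not in the listed set.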

The dataset is divided in the ratio of 75:25 into training and testing samples. Three different machine learning algorithms have been implemented to classify the URL as malicious or normal. Lexical analysis has been performed on the URL to extract the lexical features. The main task of classifying URLs is done through the support vector machine because of its higher accuracy of 85.35%.
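As a worked example of the 75:25 division, the sketch below applies it to the 480,000-sample embodiment described earlier and shows one way to perform a shuffled split. The helper name `split_dataset`, the fixed seed, and the placeholder sample strings are assumptions. The specification does not state how the samples are shuffled or whether the split is stratified.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    samples: Sequence[str], train_fraction: float = 0.75, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the samples and split them into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Arithmetic for the embodiment with 480,000 samples.
total = 480_000
malicious = 384_000
normal = total - malicious       # normal URLs in the dataset
train_size = int(total * 0.75)   # samples used for training
test_size = total - train_size   # samples held out for testing

# Small demonstration with placeholder sample names.
train, test = split_dataset([f"url{i}" for i in range(100)])
```

With these figures the dataset holds 96,000 normal URLs, and the split yields 360,000 training samples and 120,000 testing samples.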

The Support Vector Machine (SVM) is formally defined by a separating hyperplane, which is used as a discriminative classifier. In two-dimensional space, a hyperplane is a line that divides the plane into two regions, with each category lying on one side of it. SVM is a supervised machine learning algorithm that is used for classification or regression problems. It works very well when there is a clear margin of separation.
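To make the idea of a separating hyperplane concrete, the sketch below trains a linear SVM on a toy dataset by sub-gradient descent on the regularised hinge loss. The specification does not say how its SVM is trained, so this hand-written trainer, the function names, the hyperparameters, and the toy feature vectors (assumed to be already centred) are all illustrative assumptions. A real system would more likely rely on an established library.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_linear_svm(
    xs: Sequence[Vector],
    ys: Sequence[int],
    epochs: int = 200,
    lr: float = 0.01,
    reg: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation step: shrink w slightly on every sample.
            w = [wi - lr * reg * wi for wi in w]
            # Hinge step: only samples inside the margin move the hyperplane.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, centred feature vectors; +1 is taken to mean malicious, -1 normal.
X = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

On this linearly separable toy set, the learned hyperplane places every training point on the side matching its label.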

For identification of malicious URLs, a traditional list of URLs stored as black and white lists and machine learning algorithms have been used. Hackers circumvent anti-spam filtering strategies by placing malicious URLs in the message content. Hence, the URL analyzer method detects the malicious URL with the aid of a reduced phishing feature set. Malicious URL detection plays a critical role in many cyber security applications, and machine learning approaches are clearly a promising direction.

The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.

Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims

WE CLAIM:

1. A system for determining malicious uniform resource locator (URL), wherein the system comprises of:

an input module (102) for accepting the URL, wherein accepted URL is stored in a database (104) in either of a plurality of lists, wherein the presence of the accepted URL is checked with the URL existing in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs;

a feature extraction module (106) associated with the database (104) for extracting features of the accepted URL upon absence of the accepted URL in the database (104) using a plurality of extracting rules; and

a classification module (108) associated with the feature extraction module (106) for classifying type of the accepted URL as malicious and non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URL of the database (104) to categorize the classified URL in to the first list and the second list.

2. The system as claimed in claim 1, wherein the URL is categorized in to the first list and the second list using a lexical filter, wherein the first list is white list and the second list is black list.

3. The system as claimed in claim 1, wherein the plurality of extracting rules is selected from but not limited to suspicious characters, no. of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.

4. The system as claimed in claim 1, wherein the machine learning technique includes support vector machine having a separate hyperplane which is used as a discriminatory classifier dividing classified output in to the plurality of lists.

5. A method for determining malicious uniform resource locator (URL), wherein the method comprises of:

accepting the URL using an input module (102), wherein storing accepted URL in a database (104) in either of a plurality of lists, wherein checking the presence of the accepted URL with existing URL in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs;

extracting features of the accepted URL using a feature extraction module (106) upon absence of the accepted URL in the database (104) by a plurality of extracting rules; and

classifying type of the accepted URL as malicious and non-malicious using a classification module (108) based on the extracted features using a machine learning technique, wherein creating a decision vector during classification of the URL of the database (104) to categorize the classified URL in to the first list and the second list.

FIGURE 1

FIGURE 2

FIGURE 3

FIGURE 4