AU2021106579A4 - An automated system to detect phishing url by using machine learning algorithm - Google Patents

An automated system to detect phishing url by using machine learning algorithm

Info

Publication number
AU2021106579A4
Authority
AU
Australia
Prior art keywords
url
list
malicious
database
accepted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2021106579A
Inventor
Yogesh Haridas Jadhav
Deepa Parasar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Parasar Deepa Dr
Original Assignee
Parasar Deepa Dr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-08-23
Filing date
2021-08-23
Publication date
2021-12-02

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

A system and a method for determining a malicious uniform resource locator (URL), comprising: an input module (102) for accepting the URL, wherein the accepted URL is stored in a database (104) in one of a plurality of lists, wherein the presence of the accepted URL is checked against the URLs existing in the database (104), and wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; a feature extraction module (106) for extracting features of the accepted URL upon absence of the accepted URL in the database (104); and a classification module (108) for classifying the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector categorizes the classified URL into the first list or the second list.

Description

[Figure 3: system architecture. Recoverable labels: check URL in database; black and white list; lexical feature.]
[Figure 4: flow chart. Recoverable labels: start; enter URL into search bar; check URL in database (present or absent); extract features (suspicious characters, number of dots and slashes, keywords and company check, multiple occurrences); decision whether malicious or not; result; update database; end.]
AN AUTOMATED SYSTEM TO DETECT PHISHING URL BY USING MACHINE LEARNING ALGORITHM
FIELD OF INVENTION
The present invention generally relates to machine learning algorithms. More specifically, the present invention relates to detecting malicious Uniform Resource Locators (URLs) by using machine learning algorithms.
BACKGROUND OF THE INVENTION
Over the past few years, the Internet has played an increasingly large part in everyone's personal and professional life. Not every website is easily accessible or profitable, and day by day more malicious or phishing websites appear. Such malicious websites are a threat to consumers in every respect. They may cause economic losses for the consumer, and some may create misperceptions about the ethical administration of the country. Human-comprehensible URLs are used to identify the billions of websites running on today's Internet. Adversaries trying to gain unauthorized access to confidential data may present malicious URLs to naive users as legitimate URLs. These malicious URLs serve as a gateway for unwanted activity. They result in unethical activities such as theft of confidential and private information, and could lead to ransomware deployment on users' phones. Many security agencies are wary of malicious URLs because they place the confidential data of government and private organizations at risk. Some encourage users to publish unauthorized URLs on social networking sites. Many of these URLs are associated with business promotion and self-advertising, but some of them may pose a threat to naive users. Naive users who follow malicious URLs are exposed to severe security threats from the adversary. To ensure that users do not visit malicious websites, validation of URLs is essential. One of the basic features such a program should have is to allow the user's requests for harmless URLs and to block malicious URLs. This is done by alerting the user that the website is malicious and that they should take precautions in future. Instead of focusing only on the syntactic properties of the URL, a program can take semantic and lexical properties from each URL.
Traditional methods such as blacklisting and heuristic classification identify and block such URLs before they reach the client. One of the basic methods of detecting malicious URLs is blacklisting. The blacklist method typically maintains a database containing the list of all previously known malicious URLs. A server search is performed each time a new URL is encountered, and the new URL is checked against the previously known malicious URLs in the black list. The black list must be updated whenever a new malicious URL is found. With an ever-increasing number of new URLs, the method is repetitive, time consuming, and computationally intensive. Another method is heuristic classification, in which signatures are compared and checked to establish a connection between the new URL and known malicious URLs. While both blacklisting and heuristic classification effectively distinguish malicious and neutral URLs, they cannot cope with emerging attack techniques. Each of these techniques has serious limitations in classifying newly generated URLs, for which it is inefficient. Most web-based companies use large servers that store millions of URLs and refine these URL sets on a regular basis. The main problem with these solutions is the human intervention required to maintain and update the URL list. Therefore, there exists a need for advanced machine learning techniques that Internet users could use as a tool to overcome these limitations by distinguishing between malicious and non-malicious URLs.
The technical advancements disclosed by the present invention overcome the limitations and disadvantages of existing and conventional systems and methods.
SUMMARY OF THE INVENTION
The present invention generally relates to a system and a method for detecting phishing in URLs.
An objective of the invention is to identify malicious URLs using a machine learning approach. Another objective of the invention is to update the black list and the white list with the URL. Another objective of this invention is to classify the URL as malicious or valid. Another objective of this invention is to train a system for determining perceptions for finding malicious URLs.
According to an aspect of the present invention, a system for determining a malicious uniform resource locator (URL) comprises: an input module for accepting the URL, wherein the accepted URL is stored in a database in one of a plurality of lists, wherein the presence of the accepted URL is checked against the URLs existing in the database, and wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; a feature extraction module associated with the database for extracting features of the accepted URL, upon absence of the accepted URL in the database, using a plurality of extracting rules; and a classification module associated with the feature extraction module for classifying the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database to categorize the classified URL into the first list or the second list.
According to an aspect of the present invention, a method for determining a malicious uniform resource locator (URL) comprises: accepting the URL using an input module, storing the accepted URL in a database in one of a plurality of lists, and checking the presence of the accepted URL against the existing URLs in the database, wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; extracting features of the accepted URL using a feature extraction module, upon absence of the accepted URL in the database, by a plurality of extracting rules; and classifying the accepted URL as malicious or non-malicious using a classification module based on the extracted features using a machine learning technique, and creating a decision vector during classification of the URLs of the database to categorize the classified URL into the first list or the second list.
To further clarify advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
BRIEF DESCRIPTION OF FIGURES
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a block diagram of a system for determining malicious uniform resource locator (URL),
Figure 2 illustrates a flow diagram of a method for determining malicious uniform resource locator (URL),
Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database, and
Figure 4 illustrates a flow diagram of a method for Malicious URL Detection when the URL is not existing in the database.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.
DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.
Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates a block diagram of a system for determining a malicious uniform resource locator (URL), wherein the system comprises: an input module (102), a database (104), a feature extraction module (106), and a classification module (108).
The input module (102) accepts the URL. The accepted URL is stored in the database (104) in one of a plurality of lists, and the presence of the accepted URL is checked against the URLs existing in the database (104). The plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs. The URL is categorized into the first list or the second list using a lexical filter, wherein the first list is a white list and the second list is a black list.
The feature extraction module (106) is associated with the database (104) and extracts features of the accepted URL, upon absence of the accepted URL in the database (104), using a plurality of extracting rules. The plurality of extracting rules is selected from, but not limited to, suspicious characters, number of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.
The classification module (108) is associated with the feature extraction module (106) and classifies the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URLs of the database (104) to categorize the classified URL into the first list or the second list. The machine learning technique includes a support vector machine (SVM) having a separating hyperplane, which is used as a discriminatory classifier dividing the classified output into the plurality of lists.
In two-dimensional space, the hyperplane is a line that divides the plane into two sections, with each category lying on one side of it. The SVM is a supervised machine learning algorithm that is used for classification or regression challenges. It works very well when there is a clear margin of separation.
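As a worked illustration of the separating hyperplane described above, a linear SVM can be written in the standard textbook form below. The symbols are notation introduced here and do not appear in the specification: x is the feature vector extracted from a URL, w and b are the learned weights and bias, and the convention that y = +1 denotes a malicious URL and y = -1 a non-malicious one is an assumption.

```latex
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b, \qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr), \qquad
y_i\,\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 \ \text{for every training sample } i, \qquad
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

In two dimensions the set of points with f(x) = 0 is the dividing line, and training looks for the w and b that make the margin as wide as possible.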
Figure 2 illustrates a flow diagram of a method for determining a malicious uniform resource locator (URL), wherein the method comprises:
Step (202) comprises accepting the URL using an input module (102), storing the accepted URL in a database (104) in one of a plurality of lists, and checking the presence of the accepted URL against the existing URLs in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs.
Step (204) comprises extracting features of the accepted URL using a feature extraction module (106), upon absence of the accepted URL in the database (104), by a plurality of extracting rules.
Step (206) comprises classifying the accepted URL as malicious or non-malicious using a classification module (108) based on the extracted features using a machine learning technique, and creating a decision vector during classification of the URLs of the database (104) to categorize the classified URL into the first list or the second list.
Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database (104).
A data set of malicious and non-malicious URLs is collected for training purposes. The dataset used for the proposed system is a URL dataset from a website selected from, but not limited to, Kaggle.
According to an embodiment, the Kaggle website has 480000 samples in the dataset, 384000 of which are malicious URLs and others are normal URLs. However, the samples may vary.
The proposed system uses feature extraction and data modelling of the URLs. A decision vector is created from the URLs present in the dataset. The URLs in the data set are read one by one, and every URL goes through operations such as extraction of different features like suspicious characters, number of dots and slashes, etc. These extracted features are then used for training the classifiers. When the user enters a URL to check its authenticity, it is first checked in the database (104). If the URL is present in the database (104), it has already been checked and found to be malicious or not. Accordingly, the result is shown to the user and the session ends.
Figure 4 illustrates a flow diagram of a method for Malicious URL Detection when the URL is not existing in the database.
A data set of malicious and non-malicious URLs is collected for training purposes. A decision vector is created from the URLs present in the dataset. The URLs in the data set are read one by one, and every URL goes through the operations specified in Figure 3, such as extraction of different features like suspicious characters, number of dots and slashes, etc. These extracted features are then used for training the classifiers. When the user enters a URL to check its authenticity, it is first checked in the database (104). If the URL is present in the database (104), it has already been checked and found to be malicious or not.
Black and white list: The traditional black and white list of URLs is the first step towards categorizing a new URL. First-level validation is done through the identification of normal and malicious URLs. Normal URLs are placed in the white list, and the black list contains malicious URLs. To check any new URL, the black and white lists are traversed to identify the category of the URL, that is, to determine whether the URL is in the black list or in the white list.
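The list-first flow described above can be sketched as follows. This is a minimal illustration, not the implementation disclosed in the specification: the class `UrlDatabase`, the function `check_url`, and the in-memory sets standing in for the database (104) are all hypothetical names introduced here, and the classifier is passed in as an arbitrary callable.

```python
from typing import Callable, Optional, Set


class UrlDatabase:
    """Hypothetical in-memory stand-in for the database (104)."""

    def __init__(self) -> None:
        self.white_list: Set[str] = set()  # first list: non-malicious URLs
        self.black_list: Set[str] = set()  # second list: malicious URLs

    def lookup(self, url: str) -> Optional[str]:
        """Return the stored category, or None if the URL is in neither list."""
        if url in self.black_list:
            return "malicious"
        if url in self.white_list:
            return "non-malicious"
        return None

    def update(self, url: str, is_malicious: bool) -> None:
        """Store a newly classified URL in the matching list."""
        target = self.black_list if is_malicious else self.white_list
        target.add(url)


def check_url(url: str, db: UrlDatabase, classify: Callable[[str], bool]) -> str:
    """Consult the lists first; classify and update only when the URL is absent."""
    known = db.lookup(url)
    if known is not None:
        return known  # already checked earlier, so no classification is needed
    is_malicious = classify(url)
    db.update(url, is_malicious)
    return "malicious" if is_malicious else "non-malicious"
```

Passing the classifier as a parameter keeps the lookup logic independent of whichever machine learning model is eventually used.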
Lexical analysis: The second step in identifying the URL category is the lexical filter. Inside the invalid domain name, it checks for keywords such as 'com', 'www', etc. This checking is based on the training of the system. A few of the rules checked for the domain name are:
• If there are more than 4 numbers
• Presence of special characters such as (#, $, @, -, -, -)
• Top 5 URL address (com, en, net, org, cc)
• Repetition of the "." symbol in the domain name
• Total count of characters in the web address
Flask Framework: Flask is a popular web framework for Python, which means it is a third-party Python library used for web application development.
The domain name is checked for more than 4 numbers and for the presence of special characters. If the URL contains any of several famous domains, it is likely to be a benign URL. The number of dots in the domain name is also very important in classifying the URL. The total length of the URL or domain name is also used in the decision, as phishing URLs tend to have long domain names. A specific keyword is pushed into the training vectorizer for each of these features for training the model.
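The lexical rules above could be turned into a feature dictionary along the following lines. This is a sketch under stated assumptions: "more than 4 numbers" is read as more than four digit characters in the host name, the special-character set is limited to the characters that survive in the rule list, and `FAMOUS_DOMAINS` holds hypothetical example names. The function name `extract_features` is also introduced here.

```python
from urllib.parse import urlparse

SPECIAL_CHARACTERS = set("#$@-")  # assumed subset of the rule list
COMMON_SUFFIXES = ("com", "en", "net", "org", "cc")  # as listed in the text
FAMOUS_DOMAINS = ("google", "paypal")  # hypothetical examples


def extract_features(url: str) -> dict:
    """Compute simple lexical features of a URL."""
    # urlparse only recognises the host when a scheme or a leading '//' is present.
    has_netloc = "://" in url or url.startswith("//")
    parsed = urlparse(url if has_netloc else "//" + url)
    domain = parsed.hostname or ""  # lower-cased host name without the port
    digit_count = sum(ch.isdigit() for ch in domain)
    suffix = domain.rsplit(".", 1)[-1] if "." in domain else ""
    return {
        "more_than_4_digits": digit_count > 4,
        "special_char_count": sum(ch in SPECIAL_CHARACTERS for ch in url),
        "dot_count": domain.count("."),
        "slash_count": url.count("/"),
        "common_suffix": suffix in COMMON_SUFFIXES,
        "famous_domain": any(name in domain for name in FAMOUS_DOMAINS),
        "url_length": len(url),
    }
```

Each value is a plain number or boolean, so the dictionary can be converted into a numeric vector before it is passed to a classifier.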
The dataset is divided in the ratio of 75:25 into training samples and testing samples. Three different machine learning algorithms have been implemented to classify the URL as malicious or normal. Lexical analysis has been performed on the URL to extract the lexical features. The main task of classifying URLs is done through the support vector machine because of its higher accuracy of 85.35%.
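A 75:25 split and an accuracy measurement of the kind mentioned above might look like the sketch below. The function names `train_test_split_75_25` and `accuracy` and the fixed shuffle seed are assumptions made for illustration; the specification does not say how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split_75_25(samples: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the samples and split it 75:25 into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = (len(shuffled) * 3) // 4
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted: Sequence, actual: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Applied to the 480000-sample dataset mentioned in the embodiment, such a split would give 360000 training samples and 120000 testing samples.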
The Support Vector Machine (SVM) is formally defined by a separating hyperplane, which is used as a discriminatory classifier. In two-dimensional space, a hyperplane is a line that divides the plane into two sections, with each category lying on one side of it. SVM is a supervised machine learning algorithm that is used for classification or regression challenges. It works very well when there is a clear margin of separation.
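To make the idea of a separating hyperplane concrete, the following sketch trains a linear SVM by sub-gradient descent on the regularized hinge loss. It is a simplified stand-in written in plain Python. The specification does not state which SVM variant, kernel, or training procedure was used, and the function names, hyperparameters, and the label convention (+1 malicious, -1 non-malicious) are assumptions.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],  # assumed convention: +1 malicious, -1 non-malicious
    epochs: int = 200,
    learning_rate: float = 0.01,
    regularization: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit weights and bias by sub-gradient descent on the hinge loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            # The L2 penalty shrinks the weights on every step.
            weights = [w - learning_rate * regularization * w for w in weights]
            if y * score < 1:  # sample is misclassified or inside the margin
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1
```

In practice the input vectors would be numeric encodings of the extracted lexical features, ideally scaled to comparable ranges, since hinge-loss training is sensitive to feature scale.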
For identification of malicious URLs, a traditional list of URLs stored as a black and white list and machine learning algorithms have been used. Hackers circumvent anti-spam filtering strategies by placing malicious URLs in the message content. Hence the method of the URL analyzer detects the malicious URL with the aid of a reduced phishing feature set. Malicious URL detection plays a critical role in many cyber security applications, and approaches to machine learning are clearly a promising direction.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.

Claims (5)

WE CLAIM:
1. A system for determining malicious uniform resource locator (URL), wherein the system comprises of:
an input module (102) for accepting the URL, wherein accepted URL is stored in a database (104) in either of a plurality of lists, wherein the presence of the accepted URL is checked with the URL existing in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs;
a feature extraction module (106) associated with the database (104) for extracting features of the accepted URL upon absence of the accepted URL in the database (104) using a plurality of extracting rules; and
a classification module (108) associated with the feature extraction module (106) for classifying type of the accepted URL as malicious and non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URL of the database (104) to categorize the classified URL in to the first list and the second list.
2. The system as claimed in claim 1, wherein the URL is categorized in to the first list and the second list using a lexical filter, wherein the first list is white list and the second list is black list.
3. The system as claimed in claim 1, wherein the plurality of extracting rules is selected from but not limited to suspicious characters, no. of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.
4. The system as claimed in claim 1, wherein the machine learning technique includes support vector machine having a separate hyperplane which is used as a discriminatory classifier dividing classified output in to the plurality of lists.
5. A method for determining malicious uniform resource locator (URL), wherein the method comprises of:
accepting the URL using an input module (102), wherein storing accepted URL in a database (104) in either of a plurality of lists, wherein checking the presence of the accepted URL with existing URL in the database (104), wherein the plurality of lists includes a first list of non-malicious URLS and a second list of malicious URLs;
extracting features of the accepted URL using a feature extraction module (106) upon absence of the accepted URL in the database (104) by a plurality of extracting rules; and
classifying type of the accepted URL as malicious and non-malicious using a classification module (108) based on the extracted features using a machine learning technique, wherein creating a decision vector during classification of the URL of the database (104) to categorize the classified URL in to the first list and the second list.
AU2021106579A 2021-08-23 2021-08-23 An automated system to detect phishing url by using machine learning algorithm Ceased AU2021106579A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021106579A AU2021106579A4 (en) 2021-08-23 2021-08-23 An automated system to detect phishing url by using machine learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021106579A AU2021106579A4 (en) 2021-08-23 2021-08-23 An automated system to detect phishing url by using machine learning algorithm

Publications (1)

Publication Number Publication Date
AU2021106579A4 (en) 2021-12-02

Family

ID=78716527

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021106579A Ceased AU2021106579A4 (en) 2021-08-23 2021-08-23 An automated system to detect phishing url by using machine learning algorithm

Country Status (1)

Country Link
AU (1) AU2021106579A4 (en)

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry