AU2021106579A4 - An automated system to detect phishing url by using machine learning algorithm - Google Patents
- Publication number
- AU2021106579A4
- Authority
- AU
- Australia
- Prior art keywords
- url
- list
- malicious
- database
- accepted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
- H04L63/1483—Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
A system and a method for determining malicious uniform resource
locator (URL), comprises of: an input module (102) for accepting the
URL, wherein the accepted URL is stored in a database (104) in either of a
plurality of lists, wherein the presence of the accepted URL is checked
with the URLs existing in the database (104), wherein the plurality of
lists includes a first list of non-malicious URLs and a second list of
malicious URLs; a feature extraction module (106) for extracting
features of the accepted URL upon absence of the accepted URL in the
database (104); and a classification module (108) for classifying the type
of the accepted URL as malicious or non-malicious based on the
extracted features using a machine learning technique, wherein a
decision vector categorizes the classified URL into the first list and the
second list.
[FIGURE 3: check URL in database; black and white list; lexical feature]
[FIGURE 4: START; enter URL to search bar; check URL in database (present / absent); extract features and classify (suspicious characters, no. of dots and slashes, keywords and company check, multiple occurrence of .com, http); decision whether malicious or not; RESULT; UPDATE DATABASE; End]
Description
The present invention generally relates to machine learning algorithms. More specifically, the present invention relates to detecting malicious Uniform Resource Locators (URLs) by using machine learning algorithms.
Over the past few years, the Internet has played an increasingly large part in everyone's personal and professional life. Not every website is easily accessible or profitable, and more malicious or phishing websites appear day by day. Such malicious websites are a threat to all facets of the consumer. They may result in economic losses for the consumer, and some may create misperceptions about the ethical administration of the country. Human-comprehensible URLs are used to identify the billions of websites running on today's Internet. Adversaries trying to gain unauthorized access to confidential data may present malicious URLs to naive users as legitimate URLs. These malicious URLs serve as a gateway for unwanted activity. They result in unethical activities such as theft of confidential and private information, and could lead to ransomware deployment on user phones. Many security agencies are wary of malicious URLs because they place the confidential data of government and private organizations at risk. Some encourage their users to use social networking sites to publish unauthorized URLs. Many of these URLs are associated with business promotion and self-advertising, but some of them may pose a threat to naive users. Naive users who follow malicious URLs face extreme security threats from the adversary. To ensure that users do not visit malicious websites, validation of URLs is necessary. One of the basic features that a program should have is to allow the user's requests for harmless URLs and to block malicious URLs. This is done by alerting the user that the website is malicious and that they should take precautions in future. Instead of focusing on the syntactic properties of the URL, a program can take semantic and lexical properties from each URL.
Traditional methods such as blacklisting and heuristic classification identify and block such URLs before they reach the client. One of the basic methods for detecting malicious URLs is blacklisting. The blacklist method typically maintains a database containing the list of all previously known malicious URLs. A server search is performed each time a new URL is identified in the process. Here, the new URL is checked against the previously known malicious URLs in the blacklist. The blacklist must be updated whenever a new malicious URL is found in the process. With ever-increasing numbers of new URLs, the method is repetitive, time consuming, and computationally intensive. Another method is heuristic classification, where signatures are compared and checked to establish the connection between the new URL and known malicious URLs. While both blacklisting and heuristic classification effectively distinguish malicious and neutral URLs, they cannot cope with emerging attack techniques. These techniques have serious limitations in classifying newly generated URLs, for which they are inefficient. Most web-based companies use large servers that store as many as millions of URLs and refine these URL sets on a regular basis. The main problem with these solutions is the human intervention required to maintain and update the URL list. Therefore, there exists a need for advanced machine learning techniques that Internet users could use as a tool to overcome these limitations by distinguishing between malicious and non-malicious URLs using machine learning algorithms.
The technical advancements disclosed by the present invention overcome the limitations and disadvantages of existing and conventional systems and methods.
The present invention generally relates to a system and a method for detecting phishing in URLs.
An objective of the invention is to identify malicious URLs based on a machine learning approach. Another objective of the invention is to update the URL into the black list and the white list. Another objective of this invention is to classify the URL as malicious or valid. Another objective of this invention is to train a system for determining perceptions for finding malicious URLs.
According to an aspect of the present invention, a system for determining malicious uniform resource locator (URL), wherein the system comprises of: an input module for accepting the URL, wherein the accepted URL is stored in a database in either of a plurality of lists, wherein the presence of the accepted URL is checked with the URLs existing in the database, wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; a feature extraction module associated with the database for extracting features of the accepted URL upon absence of the accepted URL in the database using a plurality of extracting rules; and a classification module associated with the feature extraction module for classifying the type of the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URL of the database to categorize the classified URL into the first list and the second list.
According to an aspect of the present invention, a method for determining malicious uniform resource locator (URL), wherein the method comprises of: accepting the URL using an input module, wherein storing the accepted URL in a database in either of a plurality of lists, wherein checking the presence of the accepted URL with the existing URLs in the database, wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs; extracting features of the accepted URL using a feature extraction module upon absence of the accepted URL in the database by a plurality of extracting rules; and classifying the type of the accepted URL as malicious or non-malicious using a classification module based on the extracted features using a machine learning technique, wherein creating a decision vector during classification of the URL of the database to categorize the classified URL into the first list and the second list.
To further clarify advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which is illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail with the accompanying drawings.
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a block diagram of a system for determining malicious uniform resource locator (URL),
Figure 2 illustrates a flow diagram of a method for determining malicious uniform resource locator (URL),
Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database, and
Figure 4 illustrates a flow diagram of a method for malicious URL detection when the URL is not existing in the database.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present invention. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components preceded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.
Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates a block diagram of a system for determining malicious uniform resource locator (URL), wherein the system comprises of: an input module (102), a database (104), a feature extraction module (106), and a classification module (108).
The input module (102) accepts the URL, wherein the accepted URL is stored in a database (104) in either of a plurality of lists, wherein the presence of the accepted URL is checked with the URLs existing in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs. The URL is categorized into the first list and the second list using a lexical filter, wherein the first list is a white list and the second list is a black list.
The feature extraction module (106) is associated with the database (104) for extracting features of the accepted URL upon absence of the accepted URL in the database (104) using a plurality of extracting rules. The plurality of extracting rules is selected from, but not limited to, suspicious characters, number of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.
The classification module (108) is associated with the feature extraction module (106) for classifying the type of the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URL of the database (104) to categorize the classified URL into the first list and the second list. The machine learning technique includes a support vector machine (SVM) having a separating hyperplane which is used as a discriminative classifier dividing the classified output into the plurality of lists.
In two-dimensional space, the hyperplane is a line that divides the plane into two sections, with each category lying on either side. The SVM is a supervised machine learning algorithm that is used for classification or regression challenges. It works very well when there is a clear margin of separation.
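The separating hyperplane described above can be written out using the standard textbook formulation of a linear SVM. The symbols below (weight vector w, bias b, feature vector x, labels y_i in {-1, +1}) are assumptions introduced for illustration and do not appear in the specification, which does not state the kernel or parameters used.

```latex
% Decision function: the hyperplane is the set of points where f(x) = 0.
\[
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,
\qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
\]
% Hard-margin training problem: maximise the margin 2 / ||w||.
\[
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{subject to}\quad
y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1,\quad i = 1,\dots,n
\]
```

Under this convention, a URL whose feature vector gives a positive value of f would fall in one list and a negative value in the other; which sign maps to the black list is a labelling choice.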
Figure 2 illustrates a flow diagram of a method for determining malicious uniform resource locator (URL), wherein the method comprises of:
Step (202) discloses accepting the URL using an input module (102), wherein storing the accepted URL in a database (104) in either of a plurality of lists, wherein checking the presence of the accepted URL with the existing URLs in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs.
Step (204) discloses extracting features of the accepted URL using a feature extraction module (106) upon absence of the accepted URL in the database (104) by a plurality of extracting rules.
Step (206) discloses classifying the type of the accepted URL as malicious or non-malicious using a classification module (108) based on the extracted features using a machine learning technique, wherein creating a decision vector during classification of the URL of the database (104) to categorize the classified URL into the first list and the second list.
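Steps (202) to (206) can be sketched as a single lookup-then-classify function. This is a minimal illustration, not the claimed implementation: the function name `check_url`, the in-memory sets standing in for the database (104), and the toy feature extractor and classifier are all hypothetical.

```python
from typing import Callable, Dict, Set

Features = Dict[str, float]


def check_url(
    url: str,
    whitelist: Set[str],
    blacklist: Set[str],
    extract_features: Callable[[str], Features],
    classify: Callable[[Features], bool],
) -> str:
    """Return "malicious" or "non-malicious" for a URL, following steps 202-206."""
    # Step 202: look the accepted URL up in the stored lists first.
    if url in blacklist:
        return "malicious"
    if url in whitelist:
        return "non-malicious"
    # Step 204: the URL is absent from the database, so extract its features.
    features = extract_features(url)
    # Step 206: classify, then file the URL under the matching list.
    if classify(features):
        blacklist.add(url)
        return "malicious"
    whitelist.add(url)
    return "non-malicious"


# Hypothetical stand-ins so the sketch runs; a trained model would replace them.
def toy_features(url: str) -> Features:
    return {"dots": float(url.count("."))}


def toy_classifier(features: Features) -> bool:
    return features["dots"] > 3


whitelist = {"https://www.example.org/"}
blacklist = {"http://bad.example/login"}
print(check_url("http://a.b.c.d.e.example/x", whitelist, blacklist,
                toy_features, toy_classifier))
```

Passing the extractor and classifier as callables keeps the lookup logic independent of whichever model is eventually trained.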
Figure 3 illustrates a schematic diagram of the overall architecture of the system when a URL is existing in the database (104).
The data set of malicious and non-malicious URLs is collected for training purposes. The dataset used for the proposed system is a URL dataset from a website selected from, but not limited to, Kaggle.
According to an embodiment, the dataset from the Kaggle website has 480,000 samples, 384,000 of which are malicious URLs and the others are normal URLs. However, the number of samples may vary.
The proposed system uses feature extraction and data modelling of the URLs. A decision vector is created from the URLs present in the dataset. The URLs in the data set are read one by one, and every URL goes through operations such as extraction of different features like suspicious characters, number of dots and slashes, etc. These extracted features are then used for training the classifiers. When the user enters a URL to check its authenticity, it is first checked in the database (104). If the URL is present in the database (104), it has already been checked and found to be malicious or not. Accordingly, the result is shown to the user and the session ends.
Figure 4 illustrates a flow diagram of a method for malicious URL detection when the URL is not existing in the database.
The data set of malicious and non-malicious URLs is collected for training purposes. A decision vector is created from the URLs present in the dataset (104). The URLs in the data set are read one by one, and every URL goes through the operations specified in Figure 3, such as extraction of different features like suspicious characters, number of dots and slashes, etc. These extracted features are then used for training the classifiers. When the user enters a URL to check its authenticity, it is first checked in the database (104). If the URL is present in the database (104), it has already been checked and found to be malicious or not.
Black and white list: The traditional black and white list of URLs is the first step towards categorization of a new URL. First-level validation is done through the identification of normal and malicious URLs. The standard URLs are added to the white list, and the black list directory includes malicious URLs. To check any new URL, the black and white lists are traversed to identify the category of the URL. This determines whether the URL is in the black list or in the white list.
Lexical analysis: The second step for identification of the URL category is the lexical filter. Inside the invalid domain name, it checks for keywords like 'com', 'www', etc. This checking is based upon training of the system. A few rules to be checked for the domain name are as follows:
• If there are more than 4 numbers
• Presence of special characters such as (#, $, @, -, -, -)
• Top 5 URL address (com, en, net, org, cc)
• Repetition of the "." symbol in the domain name
• Total count of characters in the web address
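The rules listed above could be turned into a feature dictionary along the following lines. This is a sketch under stated assumptions: the function name `lexical_features` and the feature keys are hypothetical, the character and suffix sets are copied from the bullets as printed, and the special-character check is applied to the whole URL string rather than only the domain name.

```python
from urllib.parse import urlsplit

# Sets copied from the rule list as printed; both are assumptions about intent.
SPECIAL_CHARS = set("#$@-")
TOP_SUFFIXES = {"com", "en", "net", "org", "cc"}


def lexical_features(url: str) -> dict:
    """Extract the five lexical rules for one URL."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    digit_count = sum(ch.isdigit() for ch in host)
    suffix = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "more_than_4_digits": digit_count > 4,       # more than 4 numbers
        "has_special_char": any(ch in SPECIAL_CHARS for ch in url),
        "top_suffix": suffix in TOP_SUFFIXES,         # one of the top 5 suffixes
        "dot_count": host.count("."),                 # repetition of "."
        "url_length": len(url),                       # total character count
    }


print(lexical_features("http://paypal.com@192.168.10.15/login"))
```

The example URL uses an "@" to hide the real host, so the digit, special-character, and dot rules all fire on it at once.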
Flask Framework: Flask is a popular web framework for Python, which means it is a third-party Python library used for web application development.
The features are as follows: whether the domain name contains more than 4 numbers, and the presence of special characters. Also, if the URL contains any of some famous domains, then it is likely to be a benign URL. The number of dots in the domain name is also very important in classification of the URL. Also, the total length of the URL or domain name is used for the decision, as phishing URLs have long domain names. The specific keyword is pushed into the training vectorizer for each of these features for training the model.
The dataset is divided in the ratio of 75:25 into training samples and testing samples. Three different machine learning algorithms have been implemented to classify the URL as malicious or normal. Lexical analysis has been performed on the URL to extract the lexical features. The main task of classifying URLs is done through the support vector machine due to its increased accuracy of 85.35%.
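The 75:25 division can be illustrated with a small standard-library sketch. The function name `split_75_25`, the shuffling step, and the fixed seed are assumptions; the specification does not say whether the split was shuffled or stratified.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_75_25(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the samples and split it 75:25 into train and test."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 3) // 4
    return shuffled[:cut], shuffled[cut:]


train, test = split_75_25(range(1000))
print(len(train), len(test))
```

Applied to the 480,000 samples mentioned in the embodiment, this ratio would give 360,000 training samples and 120,000 testing samples.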
The Support Vector Machine (SVM) is formally defined by a separating hyperplane which is used as a discriminative classifier. In two-dimensional space, a hyperplane is a line that divides the plane into two sections, with each category lying on either side. SVM is a supervised machine learning algorithm that is used for classification or regression challenges. It works very well when there is a clear margin of separation.
For identification of malicious URLs, a traditional list of URLs stored as black and white lists and machine learning algorithms have been used. Hackers circumvent anti-spam filtering strategies by placing malicious URLs in the message content. Hence the method of the URL analyzer detects the malicious URL with the aid of a reduced phishing feature set. Malicious URL detection plays a critical role in many cyber security applications, and machine learning approaches are clearly a promising direction.
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
Claims (5)
1. A system for determining malicious uniform resource locator (URL), wherein the system comprises of:
an input module (102) for accepting the URL, wherein the accepted URL is stored in a database (104) in either of a plurality of lists, wherein the presence of the accepted URL is checked with the URLs existing in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs;
a feature extraction module (106) associated with the database (104) for extracting features of the accepted URL upon absence of the accepted URL in the database (104) using a plurality of extracting rules; and
a classification module (108) associated with the feature extraction module (106) for classifying the type of the accepted URL as malicious or non-malicious based on the extracted features using a machine learning technique, wherein a decision vector is created during classification of the URL of the database (104) to categorize the classified URL into the first list and the second list.
2. The system as claimed in claim 1, wherein the URL is categorized into the first list and the second list using a lexical filter, wherein the first list is a white list and the second list is a black list.
3. The system as claimed in claim 1, wherein the plurality of extracting rules is selected from, but not limited to, suspicious characters, number of dots and slashes, multiple occurrences, more than 4 numbers, presence of a famous domain, etc.
4. The system as claimed in claim 1, wherein the machine learning technique includes a support vector machine having a separating hyperplane which is used as a discriminative classifier dividing the classified output into the plurality of lists.
5. A method for determining malicious uniform resource locator (URL), wherein the method comprises of:
accepting the URL using an input module (102), wherein storing the accepted URL in a database (104) in either of a plurality of lists, wherein checking the presence of the accepted URL with the existing URLs in the database (104), wherein the plurality of lists includes a first list of non-malicious URLs and a second list of malicious URLs;
extracting features of the accepted URL using a feature extraction module (106) upon absence of the accepted URL in the database (104) by a plurality of extracting rules; and
classifying the type of the accepted URL as malicious or non-malicious using a classification module (108) based on the extracted features using a machine learning technique, wherein creating a decision vector during classification of the URL of the database (104) to categorize the classified URL into the first list and the second list.
FIGURE 1
FIGURE 2
FIGURE 3
FIGURE 4
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021106579A AU2021106579A4 (en) | 2021-08-23 | 2021-08-23 | An automated system to detect phishing url by using machine learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021106579A AU2021106579A4 (en) | 2021-08-23 | 2021-08-23 | An automated system to detect phishing url by using machine learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021106579A4 true AU2021106579A4 (en) | 2021-12-02 |
Family
ID=78716527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021106579A Ceased AU2021106579A4 (en) | 2021-08-23 | 2021-08-23 | An automated system to detect phishing url by using machine learning algorithm |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021106579A4 (en) |
-
2021
- 2021-08-23 AU AU2021106579A patent/AU2021106579A4/en not_active Ceased
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |