
MALICIOUS URL DETECTION USING MACHINE LEARNING

CSC579 Project (Marvi Jokhio)


Problem Statement:

In the digital world, a single URL can cause considerable damage. Malicious links are embedded in emails to deliver attacks and threats, and they are widely used for scams, fraud, and phishing: clicking one can silently download malware or a trojan that takes control of your device, or lead you to a fake website that persuades you to hand over sensitive and confidential information. Today, malicious URLs spread easily through social engineering and social networks.

Motivation:

To avoid these harms, there should be a mechanism that automatically detects malicious URLs before the user visits or clicks them, and that quickly warns the user, on the user's own device, about the possible threat.

Related Work:

The rise of AI and machine learning has introduced new ways to improve cybersecurity, as it has in other areas of study and research. Many AI-based tools have been built to solve cybersecurity problems where conventional solutions or human effort are insufficient. The majority of cyber-attacks rely on online social engineering that spreads malicious links, because this is easier than breaking through the security layers of a network or application. Many AI solutions have been proposed to classify URLs into various classes according to the type of attack.

Since our technique is based on lexical features of the URL, we put more emphasis on surveying related work on URL-based approaches. URL-based methods use only the structure of the URL for detection, without any external information such as WHOIS records, blacklists, or content analysis. The authors in [1] classified web pages by extracting features from the URL alone, not from the content of the website. The URLs were divided into tokens from which classification features were extracted, and the authors claim that this approach was faster and improved classification results, provided that high-quality token selection and feature extraction are used. The authors of [2, 3] analyzed the distinguishing differences between features extracted from normal and malicious URLs in order to construct classifiers. The authors in [4] show that classification efficiency can be improved using lexical feature extraction alone, but their work tries to divide normal URLs into further categories such as news, business, and sports, instead of treating them as a single category to be separated from malicious ones.
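To make the token-based idea concrete, here is a minimal sketch of splitting a URL into tokens using only the Python standard library. It is not the method of [1]; the function name `tokenize_url`, the grouping by host, path, and query, and the choice to split on every non-alphanumeric character are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens, grouped by component.

    Hypothetical helper for illustration; the splitting rule is an assumption.
    """
    # urlsplit only recognizes the host when a scheme is present,
    # so prepend one for bare inputs such as "example.com/page".
    parts = urlsplit(url if "://" in url else "http://" + url)

    def split(text):
        # Break on any run of non-alphanumeric characters and drop empties.
        return [tok for tok in re.split(r"[^A-Za-z0-9]+", text.lower()) if tok]

    return {
        "host": split(parts.hostname or ""),
        "path": split(parts.path),
        "query": split(parts.query),
    }


if __name__ == "__main__":
    print(tokenize_url("http://www.example.com/sports/news-2020.html?id=7"))
```

Keeping the tokens grouped by component lets later feature extraction treat a word in the host name differently from the same word in the path.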

Proposed Solution and its Motivation:

We will use lexical feature engineering similar to [1] and [4] to classify 5 different classes of malicious URLs with the best possible accuracy. Detection of malicious URLs should predict correctly while remaining lightweight and efficient.

Our approach is simple and efficient, which makes it well suited to client-side implementation: analyzing the content of a website before extracting features takes time, and that delay increases the chance of harm. Analyzing only the URL itself with machine learning is better suited to detecting maliciousness and warning the user or entity about a possible threat at first glance. We will try to find the most important features in the given data set to achieve the most accurate results.
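As an illustration of what URL-only analysis can look like, the following sketch computes a handful of lexical features from the URL string with no network access. The feature set and the names `lexical_features` and `shannon_entropy` are assumptions chosen for this example, not the final features of the project; the actual selection would depend on the data set.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def lexical_features(url):
    """Return a dict of simple lexical features (hypothetical feature set)."""
    # Prepend a scheme when missing so that urlsplit can find the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Count of a few characters often inspected in URL-based detection.
        "num_special": len(re.findall(r"[@\-_?=&%]", url)),
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        # True when the host looks like a dotted-quad IPv4 address.
        "host_is_ipv4": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "host_entropy": shannon_entropy(host),
    }


if __name__ == "__main__":
    for name, value in lexical_features(
            "http://192.168.0.1/login/verify.php?id=1").items():
        print(name, value)
```

Every feature here is computed from the string alone, which is what keeps the check cheap enough to run on the client before the link is opened. The resulting dictionaries could then be fed to any standard classifier.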

References:

[1] Min-Yen Kan and Hoang Oanh Nguyen Thi. 2005. Fast webpage classification using URL features. In Proceedings of the 14th ACM international conference on Information and knowledge management (CIKM ’05). Association for Computing Machinery, New York, NY, USA, 325–326.

[2] D. Kevin McGrath and Minaxi Gupta. 2008. Behind phishing: an examination of phisher modi operandi. In Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET’08). USENIX Association, USA, Article 4, 1–8.

[3] Sandeep Yadav, Ashwath Kumar Krishna Reddy, A.L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement (IMC ’10). Association for Computing Machinery, New York, NY, USA, 48–61.

[4] A. Le, A. Markopoulou, and M. Faloutsos. 2011. PhishDef: URL names say it all. In Proceedings of the 30th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies. IEEE, Shanghai, China, 191–195.

Figure a: Machine Learning Methodology