2386 words
12 minutes
DDoS Detection with ML - Part I

Artificial Intelligence is transforming the cybersecurity landscape by offering advanced techniques to identify and mitigate emerging threats. As cyberattacks grow more sophisticated, AI is increasingly relied upon to strengthen security frameworks. A promising area where AI shines is in detecting Distributed Denial of Service (DDoS) attacks. These attacks flood networks with malicious traffic, disrupting services and causing serious harm to businesses and infrastructure. Traditional DDoS detection methods often struggle with the scale and complexity of modern attacks, but AI presents a powerful solution.

Using machine learning algorithms and pattern recognition, AI can analyze massive amounts of network traffic data in real-time, spotting anomalies that might signal a DDoS attack. Its ability to continuously learn from new data means AI can evolve in response to the shifting tactics of cybercriminals. Furthermore, AI-driven systems can respond faster than manual methods, providing early warnings and automating defensive measures.

In this project, I’m diving into the process of machine learning. Part I Agenda

  1. Project Goals
  2. What does the process look like?
  3. Dataset
  4. Environment
  5. Data Exploration
  6. Coclusion

Project Goals#

As the title suggests, the main goal of this project is to detect DDoS attacks using AI techniques. But, as a complete newbie to the topic, I know this journey will be about much more than that. My goal is to understand the process and apply theoretical knowledge in a hands-on project. Sounds simple, but I’m ready for things to get complex… What Does the Process Look Like?

Process of what, exactly?

From here on, I’ll use the term Machine Learning (ML) since my project requires training an algorithm on data to build a model that can make predictive decisions—in this case, answering the simple question, “Is it DDoS?”

So, what does the process of Machine Learning look like? I love the diagram prepared by the authors of the excellent book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili.

I’m not going to describe every single step. It’s not my intention to overwhelm you with a huge amount of information. Instead, I’d rather explain it alongside the implementation of each step. Hope you’re okay with that (if not, I don’t care—it’s my blog!).

For now, this picture should be sufficient to give you a broad overview of the process.

Dataset#

I’m pretty sure that if you asked anyone, “What’s the most important aspect of Machine Learning?” the answer would always be the same: D A T A. Don’t have enough data? Fail. Your data lacks crucial information? Fail. Data isn’t clean enough? Faiiiiiil.

The entire process is based on data and the associated processes: cleaning, feature extraction, standardization, labeling (for supervised learning), splitting, training, testing, evaluating, and so on. It all starts with gathering data. Luckily, I’ve been equipped with the CICDDoS2019 dataset.

CICDDoS2019 contains benign data and the most up-to-date common DDoS attacks, resembling real-world data (PCAPs). It also includes results from network traffic analysis using CICFlowMeter-V3, which labels flows based on timestamps, source and destination IPs, source and destination ports, protocols, and attacks (in CSV format). Generating realistic background traffic was our top priority when building this dataset. We used our proposed B-Profile system (Sharafaldin et al., 2016) to profile the abstract behavior of human interactions and generate naturalistic benign background traffic in the proposed testbed (see Figure 2). For this dataset, we constructed the abstract behavior of 25 users based on HTTP, HTTPS, FTP, SSH, and email protocols.

Those packages are heavy—I’m talking about gigabytes of pure data in CSV format!

“In this dataset, we have different modern reflective DDoS attacks such as PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN, NTP, DNS and SNMP. Attacks were subsequently executed during this period. (…) we executed 12 DDoS attacks includes NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN and TFTP on the training day and 7 attacks including PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN in the testing day. The traffic volume for WebDDoS was so low and PortScan just has been executed in the testing day and will be unknown for evaluating the proposed model.”

For my project purpose I will focus on just one chunk of dataset - DrDoS_DNS. I’m not going to let this project overwhelm me, thus, I’ve decided to pick just one and learn the process. Let’s set up the environment and finally get started.

Environment#

Jupyter is the most frequently chosen environment. As we all stand on the shoulders of giants I’m not gonna question this choice (for now). Another advantage is that to set it all up, we just need few commands, and to be more precise - two of them:

https://jupyter.org/install

I’m aware that this project may need some Python libraries so to avoid any incompatibility I’ll just use virtual envirnonment. venv module is lightweight and can help me isolate my project so it sounds like a perfect choice.

python3 -m venv [name of the virtual envirnment]
source [name]/bin/activate

After activating the virtual environment I’m just ready to run my notebook.

As you can see I named my virtual environment as venv. Not so creative but works anyway. This project folder is the place where I store my chunky csv file.

  • training.ipynb - text-based file used by Jupyter Notebook - a web-based interactive computing programme that helps users analyse and manipulate data using the Python programming language.

  • DrDos_DNS.csv - file with logs/packets

  • requirements.txt - file with all the libraries that might be useful, in Python project they can be installed collectively using pip install -r requirements.txt For now it doesn’t matter what’s in there.

That’s all regarding the environment. From now on I will focus on jupyter file (training.ipynb) where I will be learning the process.

Data exploration#

We can finally take a look at our csv file. Jupyter is a tool that supports interactive data science and scientific computing across all programming languages. Meaning you don’t need to write the whole code, compile it and then run. We are working with blocks of code, just take a look:

This is our first block, keep in mind that each line could be divided into separate block. I like to think about it in frames of one action=one block. Just like the proper function should be written. So, that block is the “defining block”. We import libraries and defining which data we are gonna work on.

These lines in Python are doing the following:

  1. import pandas as pd
    This imports the pandas library and assigns it the shorthand alias pd. pandas is a library commonly used for data analysis and manipulation in Python.

  2. import numpy as np
    This imports the numpy library, also with a shorthand alias (np). numpy is used for working with arrays and provides mathematical functions needed for scientific computations.

  3. pd.set_option('display.max_rows', None)
    This line configures pandas to display all rows when showing data in the console (by default, pandas limits the number of rows displayed). Setting this to None removes that limit.

  4. pd.set_option('display.max_columns', None)
    Similar to the line above, this configures pandas to display all columns when displaying a DataFrame. Without this, pandas might cut off columns when displaying wide tables.

  5. df = pd.read_csv('DrDoS_DNS.csv', low_memory=False)
    This reads a CSV file named 'DrDoS_DNS.csv' into a DataFrame called df. The low_memory=False option loads the entire file at once, avoiding data type guessing that might happen with larger files. Setting low_memory=False can prevent certain types of warnings, especially with large datasets.

  6. df.columns = df.columns.str.strip()
    This line removes any leading or trailing whitespace from the column names in df. This is particularly useful if the column names accidentally have extra spaces that could interfere with accessing them correctly in the code.

With that block we are ready to start messing around with the data.

Features#

In machine learning, features are the individual measurable properties or characteristics of the data used to make predictions or classifications. Each feature represents a specific aspect of the data, and collectively, features provide the model with the information it needs to learn patterns and make accurate predictions.

Let’s find out what features do we got by using below command:

The line print(df.columns[1:-1]) prints a subset of the column names in a DataFrame, specifically all columns except the first and last. What’s wrong with first and the last? Let me show you:

That command prints first 5 rows of the first and last columns. Label column is just label, not a feature, and the first one is like an index so it doesn’t provide any valuable info.

We’ve printed out our features but how many do we got? I ain’t gonna count them but I know the command that can do it for me:

It calculates the number of columns in the DataFrame df minus 2 (first and last), and then prints the result. So, we got 86 features which is quite decent.

Observations#

But how many packets do we got?

It prints the number of rows in the DataFrame df. Oh man, 5,074,413 rows. Over 5 million packets. That’s a lot.

Labels#

Distribiution (or ratio) of labels might be the most crucial aspect of this dataset. Balance is the key. But first, let’s find out what names of labels do we have?

Fine, will work. It could be translated as “DDoS traffic” and “No DDoS traffic”. Okay, it’s time for distribuition:

Oh man, that’s imbalanced as hell. I will need to figure out what to do with that later.

Vector Example#

Using random column I wanted to see how does the column look like. print(df.iloc[131]) should do the work:

Unnamed: 0 928
Flow ID 172.16.0.5-192.168.50.1-634-51124-17
Source IP 172.16.0.5
Source Port 634
Destination IP 192.168.50.1
Destination Port 51124
Protocol 17
Timestamp 2018-12-01 10:51:43.900176
Flow Duration 47202
Total Fwd Packets 200
Total Backward Packets 0
Total Length of Fwd Packets 88000.0
Total Length of Bwd Packets 0.0
Fwd Packet Length Max 440.0
Fwd Packet Length Min 440.0
Fwd Packet Length Mean 440.0
Fwd Packet Length Std 0.0
Bwd Packet Length Max 0.0
Bwd Packet Length Min 0.0
Bwd Packet Length Mean 0.0
Bwd Packet Length Std 0.0
Flow Bytes/s 1864327.782721
Flow Packets/s 4237.108597
Flow IAT Mean 237.19598
Flow IAT Std 337.199588
Flow IAT Max 2016.0
Flow IAT Min 1.0
Fwd IAT Total 47202.0
Fwd IAT Mean 237.19598
Fwd IAT Std 337.199588
Fwd IAT Max 2016.0
Fwd IAT Min 1.0
Bwd IAT Total 0.0
Bwd IAT Mean 0.0
Bwd IAT Std 0.0
Bwd IAT Max 0.0
Bwd IAT Min 0.0
Fwd PSH Flags 0
Bwd PSH Flags 0
Fwd URG Flags 0
Bwd URG Flags 0
Fwd Header Length 6400
Bwd Header Length 0
Fwd Packets/s 4237.108597
Bwd Packets/s 0.0
Min Packet Length 440.0
Max Packet Length 440.0
Packet Length Mean 440.0
Packet Length Std 0.0
Packet Length Variance 0.0
FIN Flag Count 0
SYN Flag Count 0
RST Flag Count 0
PSH Flag Count 0
ACK Flag Count 0
URG Flag Count 0
CWE Flag Count 0
ECE Flag Count 0
Down/Up Ratio 0.0
Average Packet Size 442.2
Avg Fwd Segment Size 440.0
Avg Bwd Segment Size 0.0
Fwd Header Length.1 6400
Fwd Avg Bytes/Bulk 0
Fwd Avg Packets/Bulk 0
Fwd Avg Bulk Rate 0
Bwd Avg Bytes/Bulk 0
Bwd Avg Packets/Bulk 0
Bwd Avg Bulk Rate 0
Subflow Fwd Packets 200
Subflow Fwd Bytes 88000
Subflow Bwd Packets 0
Subflow Bwd Bytes 0
Init_Win_bytes_forward -1
Init_Win_bytes_backward -1
act_data_pkt_fwd 199
min_seg_size_forward 32
Active Mean 0.0
Active Std 0.0
Active Max 0.0
Active Min 0.0
Idle Mean 0.0
Idle Std 0.0
Idle Max 0.0
Idle Min 0.0
SimillarHTTP 0
Inbound 1
Label DrDoS_DNS
Name: 131, dtype: object

Many 0s… it doesn’t seem promising. But we will worry about that later.

Types of data#

To check out the types of data I used df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5074413 entries, 0 to 5074412
Data columns (total 88 columns):
# Column Dtype
--- ------ -----
0 Unnamed: 0 int64
1 Flow ID object
2 Source IP object
3 Source Port int64
4 Destination IP object
5 Destination Port int64
6 Protocol int64
7 Timestamp object
8 Flow Duration int64
9 Total Fwd Packets int64
10 Total Backward Packets int64
11 Total Length of Fwd Packets float64
12 Total Length of Bwd Packets float64
13 Fwd Packet Length Max float64
14 Fwd Packet Length Min float64
15 Fwd Packet Length Mean float64
16 Fwd Packet Length Std float64
17 Bwd Packet Length Max float64
18 Bwd Packet Length Min float64
19 Bwd Packet Length Mean float64
20 Bwd Packet Length Std float64
21 Flow Bytes/s float64
22 Flow Packets/s float64
23 Flow IAT Mean float64
24 Flow IAT Std float64
25 Flow IAT Max float64
26 Flow IAT Min float64
27 Fwd IAT Total float64
28 Fwd IAT Mean float64
29 Fwd IAT Std float64
30 Fwd IAT Max float64
31 Fwd IAT Min float64
32 Bwd IAT Total float64
33 Bwd IAT Mean float64
34 Bwd IAT Std float64
35 Bwd IAT Max float64
36 Bwd IAT Min float64
37 Fwd PSH Flags int64
38 Bwd PSH Flags int64
39 Fwd URG Flags int64
40 Bwd URG Flags int64
41 Fwd Header Length int64
42 Bwd Header Length int64
43 Fwd Packets/s float64
44 Bwd Packets/s float64
45 Min Packet Length float64
46 Max Packet Length float64
47 Packet Length Mean float64
48 Packet Length Std float64
49 Packet Length Variance float64
50 FIN Flag Count int64
51 SYN Flag Count int64
52 RST Flag Count int64
53 PSH Flag Count int64
54 ACK Flag Count int64
55 URG Flag Count int64
56 CWE Flag Count int64
57 ECE Flag Count int64
58 Down/Up Ratio float64
59 Average Packet Size float64
60 Avg Fwd Segment Size float64
61 Avg Bwd Segment Size float64
62 Fwd Header Length.1 int64
63 Fwd Avg Bytes/Bulk int64
64 Fwd Avg Packets/Bulk int64
65 Fwd Avg Bulk Rate int64
66 Bwd Avg Bytes/Bulk int64
67 Bwd Avg Packets/Bulk int64
68 Bwd Avg Bulk Rate int64
69 Subflow Fwd Packets int64
70 Subflow Fwd Bytes int64
71 Subflow Bwd Packets int64
72 Subflow Bwd Bytes int64
73 Init_Win_bytes_forward int64
74 Init_Win_bytes_backward int64
75 act_data_pkt_fwd int64
76 min_seg_size_forward int64
77 Active Mean float64
78 Active Std float64
79 Active Max float64
80 Active Min float64
81 Idle Mean float64
82 Idle Std float64
83 Idle Max float64
84 Idle Min float64
85 SimillarHTTP object
86 Inbound int64
87 Label object
dtypes: float64(45), int64(37), object(6)
memory usage: 3.3+ GB

Conclusion#

I can’t complain about the lack of variety of data but I can see that there’s much to do with cleaning stuff. I will devote next chapter to data cleaning. I will focus on handling missing data, removing vectors and then making hard decsion on choosing the right features. But that’s just tomorrow me problem. Hope you enjoyed the first episode. Check out for next

~Type

DDoS Detection with ML - Part I
https://typetherapy.blog/posts/typeproject/detecting_ddos/
Author
Type Therapy Blog
Published at
2024-11-03
License
CC BY-NC-SA 4.0