Artificial Intelligence is transforming the cybersecurity landscape by offering advanced techniques to identify and mitigate emerging threats. As cyberattacks grow more sophisticated, AI is increasingly relied upon to strengthen security frameworks. A promising area where AI shines is in detecting Distributed Denial of Service (DDoS) attacks. These attacks flood networks with malicious traffic, disrupting services and causing serious harm to businesses and infrastructure. Traditional DDoS detection methods often struggle with the scale and complexity of modern attacks, but AI presents a powerful solution.

Using machine learning algorithms and pattern recognition, AI can analyze massive amounts of network traffic data in real-time, spotting anomalies that might signal a DDoS attack. Its ability to continuously learn from new data means AI can evolve in response to the shifting tactics of cybercriminals. Furthermore, AI-driven systems can respond faster than manual methods, providing early warnings and automating defensive measures.

In this project, I’m diving into the process of machine learning.

Part I Agenda

Project Goals
What does the process look like?
Dataset
Environment
Data Exploration
Conclusion

Project Goals

As the title suggests, the main goal of this project is to detect DDoS attacks using AI techniques. But, as a complete newbie to the topic, I know this journey will be about much more than that. My goal is to understand the process and apply theoretical knowledge in a hands-on project. Sounds simple, but I’m ready for things to get complex…

What Does the Process Look Like?

Process of what, exactly?

From here on, I’ll use the term Machine Learning (ML) since my project requires training an algorithm on data to build a model that can make predictive decisions—in this case, answering the simple question, “Is it DDoS?”

So, what does the process of Machine Learning look like? I love the diagram prepared by the authors of the excellent book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, and Vahid Mirjalili.

I’m not going to describe every single step. It’s not my intention to overwhelm you with a huge amount of information. Instead, I’d rather explain it alongside the implementation of each step. Hope you’re okay with that (if not, I don’t care—it’s my blog!).

For now, this picture should be sufficient to give you a broad overview of the process.

Dataset

I’m pretty sure that if you asked anyone, “What’s the most important aspect of Machine Learning?” the answer would always be the same: D A T A. Don’t have enough data? Fail. Your data lacks crucial information? Fail. Data isn’t clean enough? Faiiiiiil.

The entire process is based on data and the associated processes: cleaning, feature extraction, standardization, labeling (for supervised learning), splitting, training, testing, evaluating, and so on. It all starts with gathering data. Luckily, I’ve been equipped with the CICDDoS2019 dataset.
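To make two of those steps less abstract, here is a minimal sketch of splitting and standardization using only NumPy. The data is randomly generated, and the feature values, the 80/20 split, and the label encoding are all my own assumptions for illustration, not anything taken from CICDDoS2019.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy data: 100 flows, 3 numeric features, binary label.
X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))
y = rng.integers(0, 2, size=100)  # 1 = attack, 0 = benign (made-up labels)

# Splitting: shuffle the row indices, keep 80% for training, 20% for testing.
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Standardization: compute mean and std on the training set only,
# then apply the same transformation to both sets.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_std.shape, X_test_std.shape)
```

The statistics come from the training rows only so that no information from the test set leaks into the preprocessing.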

The dataset’s authors describe it like this: “CICDDoS2019 contains benign data and the most up-to-date common DDoS attacks, resembling real-world data (PCAPs). It also includes results from network traffic analysis using CICFlowMeter-V3, which labels flows based on timestamps, source and destination IPs, source and destination ports, protocols, and attacks (in CSV format). Generating realistic background traffic was our top priority when building this dataset. We used our proposed B-Profile system (Sharafaldin et al., 2016) to profile the abstract behavior of human interactions and generate naturalistic benign background traffic in the proposed testbed (see Figure 2). For this dataset, we constructed the abstract behavior of 25 users based on HTTP, HTTPS, FTP, SSH, and email protocols.”

Those packages are heavy—I’m talking about gigabytes of pure data in CSV format!

“In this dataset, we have different modern reflective DDoS attacks such as PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN, NTP, DNS and SNMP. Attacks were subsequently executed during this period. (…) we executed 12 DDoS attacks includes NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN and TFTP on the training day and 7 attacks including PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN in the testing day. The traffic volume for WebDDoS was so low and PortScan just has been executed in the testing day and will be unknown for evaluating the proposed model.”

For this project I will focus on just one chunk of the dataset: DrDoS_DNS. I’m not going to let this project overwhelm me, so I’ve decided to pick just one and learn the process. Let’s set up the environment and finally get started.

Environment

Jupyter is the most frequently chosen environment. As we all stand on the shoulders of giants, I’m not gonna question this choice (for now). Another advantage is that setting it all up takes just a few commands, or to be more precise, two of them:

https://jupyter.org/install

I’m aware that this project may need some Python libraries, so to avoid any incompatibility I’ll use a virtual environment. The venv module is lightweight and can help me isolate my project, so it sounds like a perfect choice.

python3 -m venv [name of the virtual environment]
source [name]/bin/activate

After activating the virtual environment, I’m ready to run my notebook.

As you can see, I named my virtual environment venv. Not so creative, but it works. The project folder is also where I store my chunky CSV file.

training.ipynb - a text-based file used by Jupyter Notebook, a web-based interactive computing application that helps users analyse and manipulate data using the Python programming language.
DrDos_DNS.csv - the file with the labeled network flow records.
requirements.txt - a file listing all the libraries that might be useful; in a Python project they can be installed in one go using pip install -r requirements.txt. For now it doesn’t matter what’s in there.

That’s all regarding the environment. From now on I will focus on the Jupyter file (training.ipynb), where I will be learning the process.

Data Exploration

We can finally take a look at our CSV file. Jupyter is a tool that supports interactive data science and scientific computing across many programming languages. That means you don’t need to write the whole program and then run it in one go. We work with blocks (cells) of code; just take a look:

This is our first block. Keep in mind that each line could be split into a separate block. I like to think about it in terms of one action = one block, just like a well-written function. So, this block is the “defining block”: we import libraries and define which data we are gonna work on.

These lines in Python are doing the following:

import pandas as pd
This imports the pandas library and assigns it the shorthand alias pd. pandas is a library commonly used for data analysis and manipulation in Python.
import numpy as np
This imports the numpy library, also with a shorthand alias (np). numpy is used for working with arrays and provides mathematical functions needed for scientific computations.
pd.set_option('display.max_rows', None)
This line configures pandas to display all rows when showing data in the console (by default, pandas limits the number of rows displayed). Setting this to None removes that limit.
pd.set_option('display.max_columns', None)
Similar to the line above, this configures pandas to display all columns when displaying a DataFrame. Without this, pandas might cut off columns when displaying wide tables.
df = pd.read_csv('DrDoS_DNS.csv', low_memory=False)
This reads the CSV file named 'DrDoS_DNS.csv' into a DataFrame called df. With low_memory=False, pandas processes the file in one pass instead of in internal chunks, so each column’s type is inferred from all of its values at once. This avoids the mixed-type warnings (DtypeWarning) that chunked parsing can raise on large files, at the cost of higher memory use while parsing.
df.columns = df.columns.str.strip()
This line removes any leading or trailing whitespace from the column names in df. This is particularly useful if the column names accidentally have extra spaces that could interfere with accessing them correctly in the code.
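Put together, the lines above could look like the cell below. This is only a sketch: instead of the real DrDoS_DNS.csv it reads a tiny in-memory CSV that I made up, with leading spaces in the header names to show what the strip() call fixes. The BENIGN label value is an assumption used as a placeholder.

```python
import io

import pandas as pd

# Made-up stand-in for DrDoS_DNS.csv; note the leading spaces in the header.
csv_text = (
    "Unnamed: 0, Source Port, Protocol, Label\n"
    "0,634,17,DrDoS_DNS\n"
    "1,53,17,BENIGN\n"
)

# Show every row and column when displaying a DataFrame.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# With the real file this would be pd.read_csv('DrDoS_DNS.csv', low_memory=False).
df = pd.read_csv(io.StringIO(csv_text), low_memory=False)
raw_columns = list(df.columns)
print(raw_columns)        # names still carry the leading spaces

# Remove leading/trailing whitespace from the column names.
df.columns = df.columns.str.strip()
print(list(df.columns))
```

After the cleanup, df['Label'] works; before it, the column would only be reachable as df[' Label'].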

With that block we are ready to start messing around with the data.

Features

In machine learning, features are the individual measurable properties or characteristics of the data used to make predictions or classifications. Each feature represents a specific aspect of the data, and collectively, features provide the model with the information it needs to learn patterns and make accurate predictions.

Let’s find out what features we’ve got using the command below:

The line print(df.columns[1:-1]) prints a subset of the column names in a DataFrame, specifically all columns except the first and last. What’s wrong with the first and the last? Let me show you:

That command prints the first 5 rows of the first and last columns. The Label column is just the label, not a feature, and the first one is like an index, so it doesn’t provide any valuable info.
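One way to do both things (list the feature names and peek at the first and last columns) is sketched below. The toy DataFrame and its values are made up; the exact command used in the notebook may have been different.

```python
import pandas as pd

# Made-up miniature of the dataset: index-like first column, label last.
toy = pd.DataFrame({
    "Unnamed: 0": [928, 929, 930],
    "Source Port": [634, 634, 900],
    "Protocol": [17, 17, 17],
    "Label": ["DrDoS_DNS", "DrDoS_DNS", "DrDoS_DNS"],
})

# Everything except the first and last column is treated as a feature.
feature_names = toy.columns[1:-1]
print(list(feature_names))

# Select the first and last columns by position and show up to 5 rows.
first_and_last = toy.iloc[:, [0, -1]].head()
print(first_and_last)
```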

We’ve printed out our features, but how many do we have? I ain’t gonna count them, but I know the command that can do it for me:

It calculates the number of columns in the DataFrame df minus 2 (the first and the last), and then prints the result. So, we’ve got 86 features, which is quite decent.
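A command along these lines would do the counting. The sketch uses an empty toy DataFrame with made-up column names, so it prints 2 rather than 86.

```python
import pandas as pd

# Toy frame with 4 columns: index-like first column, two features, label.
toy = pd.DataFrame(columns=["Unnamed: 0", "Flow Duration", "Total Fwd Packets", "Label"])

# Number of columns minus the first (index-like) and last (label) ones.
n_features = len(toy.columns) - 2
print(n_features)
```

On the real df, the same expression is where 88 columns minus 2 gives 86.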

Observations

But how many rows do we have?

It prints the number of rows in the DataFrame df. Oh man, 5,074,413 rows. Since each row is one flow produced by CICFlowMeter rather than a single packet, that’s over 5 million flow records. That’s a lot.
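Counting rows is a one-liner with len(df) or df.shape[0]. As a side note, the sketch below also shows how the count could be obtained without holding the whole file in memory, by reading it in chunks. The in-memory CSV and the chunk size of 4 are made up for the example.

```python
import io

import pandas as pd

# Made-up CSV with 10 data rows standing in for the real file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)) + "\n"

df = pd.read_csv(io.StringIO(csv_text))
n_rows = len(df)              # same value as df.shape[0]
print(n_rows)

# Alternative: count rows chunk by chunk, never loading everything at once.
total = sum(len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4))
print(total)
```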

Labels

The distribution (or ratio) of labels might be the most crucial aspect of this dataset. Balance is the key. But first, let’s find out what label names we have:

Fine, that will work. They could be translated as “DDoS traffic” and “No DDoS traffic”. Okay, it’s time for the distribution:

Oh man, that’s imbalanced as hell. I will need to figure out what to do with that later.
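Both checks can be done with unique() and value_counts() on the Label column. The sketch below uses a made-up Series; the 98-to-2 split and the BENIGN label name are assumptions for illustration and are not the real numbers from the dataset.

```python
import pandas as pd

# Made-up label column with a deliberately skewed distribution.
labels = pd.Series(["DrDoS_DNS"] * 98 + ["BENIGN"] * 2, name="Label")

# Which label names exist?
print(labels.unique())

# Absolute counts and relative shares per label.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)
```

Looking at the shares rather than the raw counts makes the imbalance easier to compare across chunks of different sizes.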

Vector Example

I picked an arbitrary row to see what a single observation (vector) looks like. print(df.iloc[131]) should do the job:

Unnamed: 0                                                      928
Flow ID                        172.16.0.5-192.168.50.1-634-51124-17
Source IP                                                172.16.0.5
Source Port                                                     634
Destination IP                                         192.168.50.1
Destination Port                                              51124
Protocol                                                         17
Timestamp                                2018-12-01 10:51:43.900176
Flow Duration                                                 47202
Total Fwd Packets                                               200
Total Backward Packets                                            0
Total Length of Fwd Packets                                 88000.0
Total Length of Bwd Packets                                     0.0
Fwd Packet Length Max                                         440.0
Fwd Packet Length Min                                         440.0
Fwd Packet Length Mean                                        440.0
Fwd Packet Length Std                                           0.0
Bwd Packet Length Max                                           0.0
Bwd Packet Length Min                                           0.0
Bwd Packet Length Mean                                          0.0
Bwd Packet Length Std                                           0.0
Flow Bytes/s                                         1864327.782721
Flow Packets/s                                          4237.108597
Flow IAT Mean                                             237.19598
Flow IAT Std                                             337.199588
Flow IAT Max                                                 2016.0
Flow IAT Min                                                    1.0
Fwd IAT Total                                               47202.0
Fwd IAT Mean                                              237.19598
Fwd IAT Std                                              337.199588
Fwd IAT Max                                                  2016.0
Fwd IAT Min                                                     1.0
Bwd IAT Total                                                   0.0
Bwd IAT Mean                                                    0.0
Bwd IAT Std                                                     0.0
Bwd IAT Max                                                     0.0
Bwd IAT Min                                                     0.0
Fwd PSH Flags                                                     0
Bwd PSH Flags                                                     0
Fwd URG Flags                                                     0
Bwd URG Flags                                                     0
Fwd Header Length                                              6400
Bwd Header Length                                                 0
Fwd Packets/s                                           4237.108597
Bwd Packets/s                                                   0.0
Min Packet Length                                             440.0
Max Packet Length                                             440.0
Packet Length Mean                                            440.0
Packet Length Std                                               0.0
Packet Length Variance                                          0.0
FIN Flag Count                                                    0
SYN Flag Count                                                    0
RST Flag Count                                                    0
PSH Flag Count                                                    0
ACK Flag Count                                                    0
URG Flag Count                                                    0
CWE Flag Count                                                    0
ECE Flag Count                                                    0
Down/Up Ratio                                                   0.0
Average Packet Size                                           442.2
Avg Fwd Segment Size                                          440.0
Avg Bwd Segment Size                                            0.0
Fwd Header Length.1                                            6400
Fwd Avg Bytes/Bulk                                                0
Fwd Avg Packets/Bulk                                              0
Fwd Avg Bulk Rate                                                 0
Bwd Avg Bytes/Bulk                                                0
Bwd Avg Packets/Bulk                                              0
Bwd Avg Bulk Rate                                                 0
Subflow Fwd Packets                                             200
Subflow Fwd Bytes                                             88000
Subflow Bwd Packets                                               0
Subflow Bwd Bytes                                                 0
Init_Win_bytes_forward                                           -1
Init_Win_bytes_backward                                          -1
act_data_pkt_fwd                                                199
min_seg_size_forward                                             32
Active Mean                                                     0.0
Active Std                                                      0.0
Active Max                                                      0.0
Active Min                                                      0.0
Idle Mean                                                       0.0
Idle Std                                                        0.0
Idle Max                                                        0.0
Idle Min                                                        0.0
SimillarHTTP                                                      0
Inbound                                                           1
Label                                                     DrDoS_DNS
Name: 131, dtype: object

Many 0s… it doesn’t seem promising. But we will worry about that later.
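One row full of zeros doesn’t prove much on its own, so a possible next check is how common zeros are per column across the whole dataset. The sketch below is one way to do that on a made-up toy DataFrame; the column names are borrowed from the output above, the values are invented.

```python
import pandas as pd

# Made-up sample: two all-zero columns and one column with real variation.
toy = pd.DataFrame({
    "Total Backward Packets": [0, 0, 0],
    "Bwd IAT Mean": [0.0, 0.0, 0.0],
    "Flow Duration": [47202, 1, 350],
})

numeric = toy.select_dtypes(include="number")

# Fraction of zero values in each numeric column.
zero_share = (numeric == 0).mean()
print(zero_share)

# Columns holding a single distinct value carry no information for a model.
constant_cols = [col for col in numeric.columns if numeric[col].nunique() <= 1]
print(constant_cols)
```

Columns that turn out to be constant are natural candidates for removal during cleaning.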

Types of data

To check out the types of data I used df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5074413 entries, 0 to 5074412
Data columns (total 88 columns):
 #   Column                       Dtype
---  ------                       -----
 0   Unnamed: 0                   int64
 1   Flow ID                      object
 2   Source IP                    object
 3   Source Port                  int64
 4   Destination IP               object
 5   Destination Port             int64
 6   Protocol                     int64
 7   Timestamp                    object
 8   Flow Duration                int64
 9   Total Fwd Packets            int64
 10  Total Backward Packets       int64
 11  Total Length of Fwd Packets  float64
 12  Total Length of Bwd Packets  float64
 13  Fwd Packet Length Max        float64
 14  Fwd Packet Length Min        float64
 15  Fwd Packet Length Mean       float64
 16  Fwd Packet Length Std        float64
 17  Bwd Packet Length Max        float64
 18  Bwd Packet Length Min        float64
 19  Bwd Packet Length Mean       float64
 20  Bwd Packet Length Std        float64
 21  Flow Bytes/s                 float64
 22  Flow Packets/s               float64
 23  Flow IAT Mean                float64
 24  Flow IAT Std                 float64
 25  Flow IAT Max                 float64
 26  Flow IAT Min                 float64
 27  Fwd IAT Total                float64
 28  Fwd IAT Mean                 float64
 29  Fwd IAT Std                  float64
 30  Fwd IAT Max                  float64
 31  Fwd IAT Min                  float64
 32  Bwd IAT Total                float64
 33  Bwd IAT Mean                 float64
 34  Bwd IAT Std                  float64
 35  Bwd IAT Max                  float64
 36  Bwd IAT Min                  float64
 37  Fwd PSH Flags                int64
 38  Bwd PSH Flags                int64
 39  Fwd URG Flags                int64
 40  Bwd URG Flags                int64
 41  Fwd Header Length            int64
 42  Bwd Header Length            int64
 43  Fwd Packets/s                float64
 44  Bwd Packets/s                float64
 45  Min Packet Length            float64
 46  Max Packet Length            float64
 47  Packet Length Mean           float64
 48  Packet Length Std            float64
 49  Packet Length Variance       float64
 50  FIN Flag Count               int64
 51  SYN Flag Count               int64
 52  RST Flag Count               int64
 53  PSH Flag Count               int64
 54  ACK Flag Count               int64
 55  URG Flag Count               int64
 56  CWE Flag Count               int64
 57  ECE Flag Count               int64
 58  Down/Up Ratio                float64
 59  Average Packet Size          float64
 60  Avg Fwd Segment Size         float64
 61  Avg Bwd Segment Size         float64
 62  Fwd Header Length.1          int64
 63  Fwd Avg Bytes/Bulk           int64
 64  Fwd Avg Packets/Bulk         int64
 65  Fwd Avg Bulk Rate            int64
 66  Bwd Avg Bytes/Bulk           int64
 67  Bwd Avg Packets/Bulk         int64
 68  Bwd Avg Bulk Rate            int64
 69  Subflow Fwd Packets          int64
 70  Subflow Fwd Bytes            int64
 71  Subflow Bwd Packets          int64
 72  Subflow Bwd Bytes            int64
 73  Init_Win_bytes_forward       int64
 74  Init_Win_bytes_backward      int64
 75  act_data_pkt_fwd             int64
 76  min_seg_size_forward         int64
 77  Active Mean                  float64
 78  Active Std                   float64
 79  Active Max                   float64
 80  Active Min                   float64
 81  Idle Mean                    float64
 82  Idle Std                     float64
 83  Idle Max                     float64
 84  Idle Min                     float64
 85  SimillarHTTP                 object
 86  Inbound                      int64
 87  Label                        object
dtypes: float64(45), int64(37), object(6)
memory usage: 3.3+ GB
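The six object columns in that output are the ones that will need special handling, since most algorithms expect numbers. A quick way to list the non-numeric columns is sketched below on a made-up toy DataFrame. I use exclude="number" rather than asking for object directly, because how string columns are typed can differ between pandas versions.

```python
import pandas as pd

# Made-up sample mixing text and numeric columns.
toy = pd.DataFrame({
    "Source IP": ["172.16.0.5", "172.16.0.5"],
    "Source Port": [634, 53],
    "Flow Bytes/s": [1864327.78, 12.5],
    "SimillarHTTP": ["0", "0"],
})

# How many columns of each dtype are there?
print(toy.dtypes.value_counts())

# Everything that is not numeric (these show up as object in df.info() above).
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(non_numeric_cols)
```

Note that SimillarHTTP holds digits stored as text in this toy example, which is the kind of column worth inspecting before deciding whether to convert or drop it.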

Conclusion

I can’t complain about a lack of variety in the data, but I can see there’s a lot of cleaning to do. I will devote the next chapter to data cleaning: handling missing data, removing vectors, and then making the hard decision of choosing the right features. But that’s tomorrow-me’s problem. Hope you enjoyed the first episode. Check back for the next one!

~Type