K-Means Clustering and its Real Use-Case in the Security Domain

Deepak Sharma
6 min read · Aug 21, 2021

Clustering :

Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects.

Each object is described by a set of characteristics called features. The first step in dividing objects into clusters is to define the distance between objects. Choosing an adequate distance measure is crucial for the success of the clustering process.
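To make the idea of a distance measure concrete, here is a minimal sketch using Euclidean distance, which is the measure K-means normally relies on. The host data, the feature names, and the choice of min-max scaling are assumptions made only for illustration. The sketch shows why scaling matters: on raw values, the feature with the largest numeric range dominates the distance.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical hosts described by (bytes sent, failed logins).
hosts = [[1000.0, 0.0], [1200.0, 9.0], [5000.0, 1.0]]

# On raw values the byte counts dominate: host 1 looks far closer
# to host 0 than host 2 does, despite its nine failed logins.
raw_01 = euclidean(hosts[0], hosts[1])
raw_02 = euclidean(hosts[0], hosts[2])

# After scaling, both features contribute on a comparable footing.
scaled = min_max_scale(hosts)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Whether min-max scaling, standardization, or no scaling at all is appropriate depends on the data, so treat this as one option rather than a rule.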

K-Means Clustering :

There are many clustering algorithms, and each has its advantages and disadvantages. A popular algorithm for clustering is K-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers serve as “representatives” of the objects associated with the cluster.

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.

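The term ‘K’ is a number: you need to tell the algorithm how many clusters to create. For example, K = 2 refers to two clusters. A common heuristic for finding a good value of K for a given dataset is the elbow method: run K-means for several values of K and pick the point after which the within-cluster sum of squared distances stops dropping sharply.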

K-means’ key features are also its drawbacks :

  • The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
  • The iterative nature of k-means might lead to a suboptimal result due to convergence to a local minimum, so the outcome depends on the initial centroids.
  • The clusters are assumed to be spherical.

The outputs of executing a k-means on a dataset are :

  • k centroids: centroids for each of the k clusters identified from the dataset.
  • labels for the complete dataset: each data point is assigned to exactly one of the clusters.

Work Flow Of K-Means Algorithm :

  1. Collecting dataset.
  2. Identifying the number of clusters (k).
  3. Initializing the k centroids (k-means) for the data.
  4. Computing the distance from each data point to every centroid and assigning the point to the cluster whose centroid is closest.
  5. Recomputing the centroid of each cluster as the mean of the points assigned to it.
  6. Steps 4 and 5 are repeated until there is no change in cluster centroids.
  7. If the formed clusters do not look reasonable, repeat steps 2–6 with a different number of clusters.
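The workflow above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the toy points, the function names `kmeans` and `wcss`, the fixed random seed, and the choice to initialize centroids from randomly sampled data points are all assumptions. The final lines compare the within-cluster sum of squares for a few values of k, which is one way to judge whether a different number of clusters is worth trying.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 3: use k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 4: assign every point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Steps 1 and 2: a toy dataset with two obvious groups, and k = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)

# Step 7: compare several values of k before settling on one.
inertia = {k: wcss(points, *kmeans(points, k)) for k in (1, 2, 3)}
print(inertia)
```

Library implementations usually add smarter initialization and several restarts to reduce the risk of ending in a poor local minimum; this sketch leaves both out to stay close to the listed steps.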

Applications Of K-Means :

The k-means algorithm is very popular and is used in a variety of applications such as market segmentation and document clustering. The goal of a cluster analysis is usually one of the following:

  1. Get a meaningful intuition of the structure of the data we’re dealing with.
  2. Cluster-then-predict, where different models are built for different subgroups because we believe behavior varies widely between them. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of having a heart attack.

Use-Cases in the Security Domain :

k-means can typically be applied to data that has a small number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to form groups of similar things from a randomly distributed collection of things; k-means is well suited to such scenarios.

Clustering Analysis for Malware Behavior Detection in Cyber Crime

Cyber-attacks have become one of the biggest threats to computer and network systems around the world. It is important to have an intrusion detection system (IDS) that can detect and analyze data with high accuracy (i.e., true positives and true negatives) and low false detection (i.e., false positives and false negatives) in the minimum detection time. Data mining, and clustering methods such as K-Means in particular, is a notable field that can be explored to address this problem. The need for continuous IDS improvement in terms of malware analysis accuracy, detection time, and a suitable detection approach is the motivation for this research.
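The accuracy and false-detection terms above can be computed from four counts. The following sketch uses made-up labels (1 = malicious, 0 = benign) purely for illustration; the function name `detection_metrics` is hypothetical.

```python
def detection_metrics(actual, predicted):
    """Accuracy and error rates for binary detection (1 = malicious, 0 = benign)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # attacks caught
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # benign passed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed attacks
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }

# Made-up ground truth and detector output for eight events.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

metrics = detection_metrics(actual, predicted)
print(metrics)
```

In this toy run the detector is right on six of eight events, raises one false alarm among five benign events, and misses one of three attacks.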

Malware Detection :

When malware enters a computer, it typically creates and modifies files and Windows registry entries, and it also affects inter-process communication and basic network interaction. Intrusion attacks breach an organization's network security policy and continuously try to undermine the core principles of cybersecurity: Confidentiality, Integrity, and Availability, known as the CIA triad.

Therefore, cybersecurity researchers have proposed a detection-based approach to malware intrusion: a framework that monitors the behavior of system activity, analyzes that behavior, and notifies users if there is a sign of intrusion.

Analysis of Intrusion Detection System :

Clustering divides the data into groups according to the attributes of the data. Network intrusion detection is the process of monitoring the events occurring in a computing system or network and analyzing them for signs of intrusions, defined as attempts to compromise confidentiality, integrity, or availability.

Intrusion attacks can be divided into four categories: probe (e.g. IP sweep, vulnerability scanning), denial of service (DoS) (e.g. mail bomb, UDP storm), user-to-root (U2R) (e.g. buffer overflow attacks, rootkits), and remote-to-local (R2L) (e.g. password guessing, worm attack).

Clustering is the method of grouping objects into meaningful subclasses so that the members from the same cluster are quite similar, and the members from different clusters are quite different from each other. Therefore clustering methods can be useful for classifying log data and detecting intrusion.
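One common way to turn clusters into an intrusion signal is to flag records that lie far from every cluster center. The sketch below assumes the centroids were already learned from normal traffic; the centroid values, the two features, and the distance threshold are all hypothetical, and this is only one possible approach, not a method prescribed by the studies discussed here.

```python
import math

def nearest_centroid(record, centroids):
    """Return (index, distance) of the centroid closest to the record."""
    distances = [math.dist(record, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def flag_anomalies(records, centroids, threshold):
    """Indices of records farther than `threshold` from every centroid."""
    flagged = []
    for i, record in enumerate(records):
        _, distance = nearest_centroid(record, centroids)
        if distance > threshold:
            flagged.append(i)
    return flagged

# Hypothetical centroids from normal traffic, with features
# (connections per minute, distinct ports contacted).
centroids = [(5.0, 2.0), (40.0, 3.0)]

# New log records; the last one contacts an unusual number of ports,
# as a port scan (a probe attack) might.
records = [(6.0, 2.0), (38.0, 4.0), (45.0, 120.0)]

suspicious = flag_anomalies(records, centroids, threshold=10.0)
print(suspicious)
```

The threshold is the sensitive part of this design: too low and benign traffic is flagged, too high and real probes slip through, so in practice it would be tuned against labeled examples.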

Cyber Profiling using Log Analysis and K-Means Clustering :

The activities of Internet users are increasing from year to year and have affected the behavior of the users themselves. Assessment of user behavior is often based only on interaction across the Internet, without knowledge of any other activities. Activity logs can be used as another way to study user behavior. Internet activity logs are a form of big data, so data mining with the K-Means technique can be used to analyze user behavior.

In general, cyber profiling analysis is the exploration of data to determine what users do while they access the Internet. One method that can support the profiling process is K-Means clustering. With this algorithm, the data can be grouped by the number of websites visited. This grouping aims to show which websites users access most frequently.

Identify Outlier Access :

The average user has more than 100 entitlements, which can be very difficult to manage manually. Using clustering with a K-Means model, we can detect access outliers by comparing each user's entitlements against those of a dynamic peer group of similar users.
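Once users have been placed into peer groups, spotting access outliers can be as simple as asking which entitlements are rare within a user's own group. In this sketch the peer-group labels are assumed to come from an earlier K-Means run; the user names, entitlement names, function name, and the 50% cutoff are all hypothetical.

```python
from collections import Counter

def access_outliers(entitlements, peer_group, min_share=0.5):
    """Map each user to the entitlements held by fewer than
    `min_share` of the users in the same peer group."""
    members = {}
    for user, group in peer_group.items():
        members.setdefault(group, []).append(user)

    outliers = {}
    for users in members.values():
        # How many users in this peer group hold each entitlement.
        counts = Counter(e for u in users for e in entitlements[u])
        for u in users:
            rare = {e for e in entitlements[u]
                    if counts[e] / len(users) < min_share}
            if rare:
                outliers[u] = rare
    return outliers

# Hypothetical entitlements per user.
entitlements = {
    "alice": {"email", "crm"},
    "bob": {"email", "crm"},
    "carol": {"email", "crm", "prod-db-admin"},
    "dave": {"email", "prod-db-admin"},
    "erin": {"email", "prod-db-admin"},
}
# Hypothetical cluster labels, assumed to come from K-Means.
peer_group = {"alice": 0, "bob": 0, "carol": 0, "dave": 1, "erin": 1}

outliers = access_outliers(entitlements, peer_group)
print(outliers)
```

Here the database admin right is normal for dave and erin's group but unusual in carol's group, so only carol's access is flagged for review.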

Automatic Clustering of IT Alerts :

Large enterprise infrastructure technology components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the data can provide insight into categories of alerts and mean time to repair, and can help with failure prediction.

Rideshare data analysis

The publicly available Uber ride information dataset provides a large amount of valuable data about traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.

Conclusion

The goal of K-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters are roughly spherical, but it suffers as the geometric shapes of clusters deviate from spherical. Moreover, it does not learn the number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it helps to know the assumptions behind algorithms and methods so that you have a good idea of the strengths and weaknesses of each one. This will help you decide when to use each method and under what circumstances.

So, that is all of my knowledge about K-Means Clustering… I hope you get something from it.

Thank you
