Can your technology become smarter and start learning?
Clustering is a method of grouping objects so that those in one group are more similar to each other than to those in other groups. It helps reveal meaningful and useful structure in data. Machine learning algorithms fall into two broad categories: supervised and unsupervised learning. In supervised learning, the program is fed labelled data: you train the algorithm on data for which the correct answers are provided, then apply the learned rules to new data. In unsupervised learning, you do not provide labels; it is up to the program to discover structure in the data. This is what makes clustering and finding hidden patterns possible.
K-means clustering is an unsupervised machine learning algorithm used to identify clusters of data objects in a dataset. It is a distance-based, partitional clustering algorithm, and it is iterative in nature. K-means is one of the oldest clustering methods and is often treated as a standard baseline because of its simplicity and performance. It aims to partition ‘n’ observations into ‘k’ clusters in which each observation belongs to the cluster with the nearest mean (the centroid), which serves as a prototype of the cluster. In other words, the algorithm assigns each data point to one of the K groups based on its features. The approach K-means follows is closely related to expectation maximisation: it alternates between assigning points to clusters and re-estimating the cluster means.
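For readers who prefer a formula, the quantity K-means is usually described as minimising is the within-cluster sum of squared distances. The symbols below are introduced here for illustration and do not come from the original text: C_j is the set of points in cluster j, \mu_j is its mean, and x_i is a single observation.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to the mean of the cluster they belong to.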
The steps involved in the K-means algorithm are:
1. Scale all features of the dataset to a common magnitude
2. Select an appropriate value for K, the number of clusters (and therefore centroids)
3. Select an initial centroid for each cluster
4. Assign each data point to its closest centroid
5. Recompute each centroid as the mean of the points now assigned to its cluster
6. Repeat steps 4 and 5 until the cluster assignments no longer change
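The six steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than a production implementation. It assumes that the features are standardised, that distances are Euclidean and that starting centroids are random data points. The function name k_means and its parameters are made up for this example.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Plain K-means following the six steps listed above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    # Step 1: scale every feature to zero mean and unit variance.
    std = data.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    scaled = (data - data.mean(axis=0)) / std

    # Steps 2-3: K is chosen by the caller; use K distinct data points
    # as the initial centroids (an assumption, other seedings exist).
    centroids = scaled[rng.choice(len(scaled), size=k, replace=False)]

    labels = np.full(len(scaled), -1)
    for _ in range(max_iter):
        # Step 4: assign each point to its closest centroid.
        distances = np.linalg.norm(
            scaled[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = distances.argmin(axis=1)

        # Step 6: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 5: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = scaled[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    # Centroids are returned in the scaled feature space.
    return labels, centroids
```

Calling k_means(points, k=2) returns one cluster label per row of points, together with the centroids expressed in the scaled feature space.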
Advantages of K-means
1. It is easy to implement
2. With a large number of numeric variables, K-means is usually computationally faster than hierarchical clustering, provided K is small
3. When centroids are recomputed, an instance can move to another cluster
4. It tends to produce tight, compact clusters
5. It is very effective when clusters have spherical shapes
Limitations of K-means
1. It is difficult to predict the right number of clusters
2. It can produce incorrect results by converging to a local minimum, leaving cluster centres in poor positions
3. Initial seeds have a strong impact on final results
4. The order of the data can also have an impact on the final results
5. Rescaling the dataset (for example, by normalisation or standardisation) can substantially change the results
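To see how much the initial seeds matter, here is a small hedged sketch in Python with NumPy. It runs the assign-and-update loop from two hand-picked sets of starting centroids on the same four points and compares the resulting within-cluster sum of squares. The function name lloyd, the toy data and the starting positions are all chosen for this example.

```python
import numpy as np

def lloyd(points, centroids, max_iter=100):
    """Run K-means from explicitly given starting centroids.

    Returns the final centroids and the within-cluster sum of squares.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.array(centroids, dtype=float)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        updated = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):
                updated[j] = points[labels == j].mean(axis=0)
        if np.allclose(updated, centroids):
            break  # centroids stopped moving
        centroids = updated
    wcss = float(((points - centroids[labels]) ** 2).sum())
    return centroids, wcss

# Four corners of a wide rectangle: the natural clusters are the
# left pair and the right pair.
corners = [[0, 0], [0, 1], [4, 0], [4, 1]]

_, wcss_good = lloyd(corners, [[0, 0.5], [4, 0.5]])  # seeds left and right
_, wcss_bad = lloyd(corners, [[2, 0], [2, 1]])       # seeds bottom and top
print(wcss_good, wcss_bad)  # prints: 1.0 16.0
```

Both runs converge immediately, but the second one stays stuck grouping the bottom pair and the top pair, which is sixteen times worse by this measure. This is why practical implementations typically restart from several different seeds and keep the best result.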
K-means is one of the most popular clustering algorithms and is often the first one applied when solving clustering tasks. At a time when large volumes of data are being clustered, it becomes necessary to adopt a method that can accomplish the task with ease and accuracy. K-means has been successfully applied in academia, diagnostics, search engines and wireless networks, for tasks such as market segmentation, document clustering and image segmentation.
“The power to judge the unseen from the seen, to trace the implications of things, to judge the whole piece by a pattern, this cluster of gifts may almost be said to constitute experience.” – Henry James
Authors: Benila Jacob, Mark John and Sunil Kumar